Running a regular expression against an input string produces a collection (MatchCollection) of Match objects (see Figure 1). Each Match object contains a collection (GroupCollection) of Group objects. Each Group object within the GroupCollection represents either the entire match or a sub-match that was defined via parentheses.
Figure 1: .NET Regular Expression Class Hierarchy
Although you probably won’t encounter too many situations where you work directly with captures (Capture and CaptureCollection), to present a complete picture of regular expressions, this article covers what they are and when they do come into play.
Working with Captures
One reason that you’ll rarely deal with captures is that there’s usually only one capture per group. Therefore, you’ll generally extract your sub-match information from the Group object. This is also why you’ll generally see the terms group and capture used somewhat interchangeably. However, a regular expression does yield groups with multiple captures in a few situations, and those situations are the focus of this article.
Take a simple expression used to extract a time-formatted value:
(?<time>(\d|\:)+)
You have a named group called time that matches a sequence of one or more digits and colons. Inside it, an unnamed group matches a single digit or colon at a time. Imagine using that expression to parse the following value:
Tom Archer posted at 12:34:56 today
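Looking at the expression/input pair, you can see that the parser will yield three groups: the entire match at index 0, the unnamed inner group (\d|\:) at index 1, and the named group time at index 2 (.NET numbers unnamed groups before named ones). However, what is not so obvious is that the second group, the one at index 1, will contain eight captures. The following function proves this: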
using namespace System::Text;
using namespace System::Text::RegularExpressions;
...
void DisplayGroups(String* input, String* pattern, bool displayCats = false)
{
  try
  {
    StringBuilder* results = new StringBuilder();
    Regex* rex = new Regex(pattern);

    // for all the matches
    for (Match* match = rex->Match(input);
         match->Success;
         match = match->NextMatch())
    {
      results->AppendFormat(S"Match {0} at {1}\r\n",
                            match->Value,
                            __box(match->Index));

      // for all of THIS match's groups
      GroupCollection* groups = match->Groups;
      for (int i = 0; i < groups->Count; i++)
      {
        results->AppendFormat(S"\tGroup: Value '{0}' at Index {1}\r\n",
                              groups->Item[i]->Value,
                              __box(groups->Item[i]->Index));

        if (displayCats)
        {
          // for all of THIS group's captures
          CaptureCollection* captures = groups->Item[i]->Captures;
          for (int j = 0; j < captures->Count; j++)
          {
            results->AppendFormat(S"\t\tCapture: Value '{0}' at Index {1}\r\n",
                                  captures->Item[j]->Value,
                                  __box(captures->Item[j]->Index));
          }
        }
      }
    }
    MessageBox::Show(results->ToString());
  }
  catch(Exception* pe)
  {
    MessageBox::Show(pe->Message);
  }
}
Calling the function in the following manner will result in what you see in Figure 2:
DisplayGroups(S"Tom Archer posted at 12:34:56 today", S"(?<time>(\\d|\\:)+)", true);
Figure 2: Example of Groups vs. Captures
So, the obvious question is: Why so many captures? The second group (the inner set of parentheses in the test expression) states that a sub-match can be made on any single digit or colon:
(\d|\:)
Because the + quantifier repeats that group once per character, and the input string contains eight such characters (six digits and two colons in 12:34:56), you get eight captures! Therefore, while I frequently find it more convenient to use the term group to characterize the result of a sub-match, sub-matches technically result in captures.
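If you want to see the same effect outside of .NET, the following is a minimal sketch in standard C++. It rests on a few assumptions: std::regex keeps only the last capture of a repeated group and has no CaptureCollection, so the helper CollectInnerCaptures (a hypothetical name, not a .NET or standard library API) emulates one by rescanning the overall match one repetition at a time. The ECMAScript grammar used by std::regex also has no named groups, so the outer group that plays the role of time is unnamed here.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: emulates the .NET capture list for the inner group
// (\d|:) and returns (value, index-in-input) pairs, one per repetition.
std::vector<std::pair<std::string, std::size_t>>
CollectInnerCaptures(const std::string& input)
{
    std::vector<std::pair<std::string, std::size_t>> captures;

    // Unnamed equivalent of (?<time>(\d|\:)+); std::regex has no named groups.
    const std::regex outer("((\\d|:)+)");
    // What a single repetition of the inner group can match.
    const std::regex inner("\\d|:");

    std::smatch match;
    if (!std::regex_search(input, match, outer))
        return captures;  // no time-like value found

    // Three groups: the entire match, the outer group, and the inner group.
    assert(match.size() == 3);

    const std::string whole = match[1].str();
    const std::size_t base = static_cast<std::size_t>(match.position(1));

    // Each hit corresponds to one repetition of the inner group.
    for (std::sregex_iterator it(whole.begin(), whole.end(), inner), end;
         it != end; ++it)
    {
        captures.emplace_back(
            it->str(), base + static_cast<std::size_t>(it->position()));
    }
    return captures;
}
```

Calling CollectInnerCaptures with "Tom Archer posted at 12:34:56 today" returns eight entries, starting with '1' at index 21 and ending with '6' at index 28, which corresponds to the eight captures the inner group produces in .NET.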
Having said that, because groups do typically contain a single capture, you can use the two terms interchangeably unless you have a specific reason to differentiate between them—as in the example in this article.
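For contrast, here is a minimal sketch of the more common single-capture case, again in standard C++ rather than .NET. The pattern and the helper name ExtractTimeParts are assumptions made for illustration. Because none of the three groups is itself repeated by a quantifier, each group participates in the match exactly once, so the group's value and its only capture are the same thing.

```cpp
#include <array>
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper: returns {hours, minutes, seconds} as strings,
// or three empty strings if the input contains no time-like value.
std::array<std::string, 3> ExtractTimeParts(const std::string& input)
{
    // The + applies to \d inside each group, not to the group itself,
    // so every group is matched once and holds a single capture.
    const std::regex pattern("(\\d+):(\\d+):(\\d+)");

    std::array<std::string, 3> parts;  // default-constructed: all empty
    std::smatch match;
    if (std::regex_search(input, match, pattern))
    {
        // match[0] is the entire match; groups start at index 1.
        for (std::size_t i = 0; i < parts.size(); ++i)
            parts[i] = match[i + 1].str();
    }
    return parts;
}
```

With the sample input from this article, the three parts come back as "12", "34", and "56", and there is no need to look past the group to its captures.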