[regex_match: check if text matches with a given pattern]
[regex_search/sregex_iterator: Find matching expressions within a longer text]
[sregex_token_iterator and submatches: Find hyperlinks within a longer text]
[References]
regex_match: check if text matches with a given pattern
#include <cassert>
#include <regex>
const auto sampleString = "Some text message, counter=1356, var=4711";
std::regex pattern{"Some text message, counter=[0-9]+, var=4711"}; // [0-9]+ accepts any counter value
const bool match = std::regex_match(sampleString, pattern);
assert(match);
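std::regex_match succeeds only if the whole text matches the pattern, while std::regex_search also accepts a match somewhere inside the text. The following minimal sketch contrasts the two; the function names MatchesWhole and MatchesSomewhere are made up for this illustration.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Returns true only if the WHOLE text matches the pattern.
bool MatchesWhole(const std::string& text, const std::string& pattern)
{
    return std::regex_match(text, std::regex{pattern});
}

// Returns true if the pattern matches ANY part of the text.
bool MatchesSomewhere(const std::string& text, const std::string& pattern)
{
    return std::regex_search(text, std::regex{pattern});
}
```

With the pattern "counter=[0-9]+", MatchesWhole accepts "counter=1356" but rejects "Some text, counter=1356", whereas MatchesSomewhere accepts both.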
Rules for defining matches
- . = any char except a newline
- [a-z] = any lowercase letter
- [A-Z] = any uppercase letter
- [a-zA-Z] = any lowercase or uppercase letter
- [0-9a-fA-F] = a hexadecimal digit with uppercase or lowercase letters
- ? = preceding element is optional, occurs 0 or 1 times
- * = preceding element occurs 0 or more times
- + = preceding element occurs 1 or more times
- {3} = preceding element occurs exactly 3 times
- {3,} = preceding element occurs at least 3 times
- {3,5} = preceding element occurs at least 3 and at most 5 times
- (this|that) = word is one of the alternatives “this” or “that”
- ^ = beginning of the text (inside [] it means “not”, e.g. [^abc] = any char other than a, b, c)
- $ = end of the text
- escape special chars “.?+*[](){}^$|\” with a backslash if they are part of the regular text (“-” only inside []),
e.g. "sample text \\[see somewhere\\]" or R"(sample text \[see somewhere\])"
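The rules above can be combined as in the following sketch. It is only an illustration: the function names IsHexAddress and IsBracketedSum and the chosen patterns are assumptions, not part of the original samples. Raw string literals keep the escaped characters readable.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Accepts "0x" followed by 4 to 8 hexadecimal digits.
bool IsHexAddress(const std::string& text)
{
    static const std::regex pattern{R"(0x[0-9a-fA-F]{4,8})"};
    return std::regex_match(text, pattern);
}

// Accepts texts like "[x+y=3]".
// '[', ']' and '+' are escaped because they are meant literally;
// the raw string literal avoids doubling each backslash.
bool IsBracketedSum(const std::string& text)
{
    static const std::regex pattern{R"(\[[a-z]\+[a-z]=[0-9]+\])"};
    return std::regex_match(text, pattern);
}
```

Without the raw string literal, the second pattern would have to be written as "\\[[a-z]\\+[a-z]=[0-9]+\\]".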
Searching for typical patterns
Sample strings:
1: Some text message, counter=4711, this=0122df3455, var=4711
2: Some text message, counter=2, this=0, var=4711
3: Some text message, counter=234, this=023DF, var=4712
4: Some contents
5: XSome contents
6: XXSome contents
7: USome contents
8: UUUSome contents
9: UUUUUSome contents
10: Some prefix to ignore Some contents
11: MyName is (followed by a trailing space)
12: MyName is Fred
13: MyName is Jane
14: MyName is Peter
15: MyName is Stephan
16: MyName is Maximilian
17: Some - any embedded not interesting text - contents
18: Anything else
19: Anything else .
20: Expected result: [x+y=3]
Each pattern is applied to all given strings using the following helper function:
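// sampleStrings is a container holding the 20 sample strings listed above
void GetMatches(const std::string& in_pattern)
{
    const std::regex pattern{in_pattern};
    std::string result{ in_pattern + " => "};
    int i = 0; // 1-based number of the current sample string
    for (const auto& str : sampleStrings)
    {
        ++i;
        if (std::regex_match(str, pattern)) // the whole string has to match
        {
            result += " " + std::to_string(i);
        }
    }
    result += "\n";
    std::cout << result;
}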
The following list shows typical patterns and, after "=>", the numbers of the sample strings that std::regex_match accepts:
Ignore irrelevant counter values or pointer addresses
Some text message, counter=[0-9]+, this=[0-9a-fA-F]+, var=4711 => 1 2
Some text message, counter=[0-9]+, this=[0-9a-fA-F]+, var=4712 => 3
Expect exact prefix, optional prefix, prefix of min/maxLength, any prefix of any length > 0
XSome contents => 5
X?Some contents => 4 5
U{1,3}Some contents => 7 8
.+Some contents => 5 6 7 8 9 10
Expect any name, name with min length, alternative names
MyName is .* => 11 12 13 14 15 16
MyName is .{7,} => 15 16
MyName is (Fred|Maximilian) => 12 16
Tolerate embedded text of any length, with min length 1
Some .*contents => 4 10 17
Some .+contents => 10 17
Expect any text containing some of the given alternatives
.*(text|is).* => 1 2 3 11 12 13 14 15 16 17
Find text starting with 'Some'
Some.* => 1 2 3 4 10 17
Find text ending with a word of 4 letters (strictly: a space followed by exactly 4 arbitrary chars)
.* .{4} => 12 13 18
Find special chars '[', ']', '+'
Expected result: \[x\+y=3\] => 20
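To check such results without reading console output, the helper can also return the matching numbers. This is a minimal sketch; the function name MatchingNumbers and passing the strings as a parameter are assumptions made for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Returns the 1-based numbers of all strings that match the pattern completely.
std::vector<int> MatchingNumbers(const std::vector<std::string>& strings,
                                 const std::string& in_pattern)
{
    const std::regex pattern{in_pattern};
    std::vector<int> numbers;
    for (std::size_t i = 0; i < strings.size(); ++i)
    {
        if (std::regex_match(strings[i], pattern))
        {
            numbers.push_back(static_cast<int>(i) + 1);
        }
    }
    return numbers;
}
```

For the three strings "Some contents", "XSome contents" and "XXSome contents", the pattern "X?Some contents" yields the numbers 1 and 2, and ".+Some contents" yields 2 and 3.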
regex_search/sregex_iterator: Find matching expressions within a longer text
Sample text:
const auto longText = R"(
Lorem ipsum dolor sit amet, consetetur sadipscing elitr,
sed diam nonumy eirmod tempor invidunt ut labore et dolore jack-london@america.com magna aliquyam erat,
sed diam _mail@some_where.de voluptua.At vero eos et accusam et justo duo dolores et ea rebum.
Stet clita kasd illegal@bla nk.com gubergren, no sea Peter.Nobody@Nowhere.com takimata sanctus est.
consetetur sadipscing elitr, at joe001@web.de vero eos et accusam.)";
The following helper function stores all matching expressions found within the longer text:
#include <format>   // std::format requires C++20
#include <iostream>
#include <regex>
#include <string>
#include <vector>

void FindPatternWithinText(const std::string& in_pattern)
{
    std::cout << "Search pattern: " << in_pattern << std::endl;
    std::vector<std::string> foundMatches;
    std::string textToAnalyze = longText;
    const std::regex pattern{in_pattern};
    std::smatch matches;
    while (std::regex_search(textToAnalyze, matches, pattern)) {
        foundMatches.emplace_back(matches[0]);  // matches[0] = complete first match within text
        textToAnalyze = matches.suffix().str(); // rest of text after found match
    }
    std::cout << "Num entries found: " << foundMatches.size() << std::endl;
    int id = 0;
    for (const auto& entry : foundMatches)
    {
        std::cout << std::format("{:>2}: {}", ++id, entry) << std::endl;
    }
}
Alternative helper function using sregex_iterator:
void FindPatternWithinTextUsingSRegexIterator(const std::string& in_pattern)
{
    const std::string textToAnalyze = longText;
    const std::regex pattern{in_pattern};
    auto search_begin = std::sregex_iterator(textToAnalyze.begin(), textToAnalyze.end(), pattern);
    auto search_end = std::sregex_iterator(); // default-constructed iterator marks the end
    std::cout << "Found " << std::distance(search_begin, search_end) << " matches:\n";
    for (std::sregex_iterator i = search_begin; i != search_end; ++i)
    {
        std::smatch match = *i;
        std::string match_str = match.str();
        std::cout << match_str << '\n';
    }
}
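The regex_search loop can also advance an iterator instead of copying the rest of the text after every match. The following is a minimal sketch of that variant; the function name FindAll is made up here, and empty matches are skipped to avoid an endless loop. std::sregex_iterator handles such corner cases itself and is probably the more robust choice.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Collects all non-empty matches of the pattern within the text.
std::vector<std::string> FindAll(const std::string& text, const std::regex& pattern)
{
    std::vector<std::string> found;
    std::smatch match;
    auto searchStart = text.cbegin();
    while (std::regex_search(searchStart, text.cend(), match, pattern))
    {
        if (match.length() == 0)
        {
            // Empty match: step one char ahead to avoid an endless loop
            if (searchStart == text.cend()) break;
            ++searchStart;
            continue;
        }
        found.push_back(match.str());
        searchStart = match.suffix().first; // continue right after this match
    }
    return found;
}
```

Applied to "Lorem ipsum dolor nonumy rebum." with the pattern "[a-zA-Z]*um[a-zA-Z]*", the function returns ipsum, nonumy and rebum.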
Code snippet to search both for email addresses and words containing “um”:
void TestRegex_SearchMatchesWithinText()
{
    std::cout << longText << std::endl;
    // Simple pattern for a typical email address
    // Limitation: also accepts invalid addresses such as "..@__.de"
    const auto emailPatternStr{ R"(([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5}))" };
    // Pattern for a regular word containing "um"
    const auto wordPatternStr{ "[a-zA-Z]*um[a-zA-Z]*" };
    FindPatternWithinText(emailPatternStr);
    FindPatternWithinText(wordPatternStr);
    FindPatternWithinTextUsingSRegexIterator(emailPatternStr);
    FindPatternWithinTextUsingSRegexIterator(wordPatternStr);
}
Output:
Search pattern: ([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})
Num entries found: 4
1: jack-london@america.com
2: _mail@some_where.de
3: Peter.Nobody@Nowhere.com
4: joe001@web.de
Search pattern: [a-zA-Z]*um[a-zA-Z]*
Num entries found: 3
1: ipsum
2: nonumy
3: rebum
Found 4 matches:
jack-london@america.com
_mail@some_where.de
Peter.Nobody@Nowhere.com
joe001@web.de
Found 3 matches:
ipsum
nonumy
rebum
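The email pattern contains three groups in parentheses. After a successful search they are available as submatches 1 to 3 of the std::smatch, while index 0 holds the complete match. A minimal sketch, assuming C++17 for std::optional; the struct EmailParts and the function SplitEmail are hypothetical names used only for this illustration.

```cpp
#include <cassert>
#include <optional>
#include <regex>
#include <string>

struct EmailParts
{
    std::string user;     // submatch 1: part before '@'
    std::string domain;   // submatch 2: part between '@' and the last '.'
    std::string topLevel; // submatch 3: part after the last '.'
};

// Returns the parts of the first email address found in the text, if any.
std::optional<EmailParts> SplitEmail(const std::string& text)
{
    static const std::regex pattern{
        R"(([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5}))"};
    std::smatch match;
    if (!std::regex_search(text, match, pattern))
    {
        return std::nullopt;
    }
    return EmailParts{match[1].str(), match[2].str(), match[3].str()};
}
```

For "jack-london@america.com" the parts are "jack-london", "america" and "com"; for "illegal@bla nk.com" no address is found.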
sregex_token_iterator and submatches: Find hyperlinks within a longer text
Helper function that outputs both the displayed description and the HTTP address of every hyperlink found. The template parameter “It” corresponds to std::sregex_token_iterator, which iterates from submatch to submatch.
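// Trim() is a helper that is not shown here; it is assumed to strip surrounding whitespace
template<typename It>
void WriteAllLinks(It it)
{
    for (It end_it{}; it != end_it;)
    {
        const std::string link{*it++}; // submatch 1: link target
        if (it == end_it) break;
        const std::string desc{*it++}; // submatch 2: displayed description
        // description left-aligned and padded with '.' to 24 chars, followed by the link
        std::cout << std::format("{:.<24} {}\n", Trim(desc), Trim(link));
    }
}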
Function that searches for all hyperlinks within a text file:
const std::string FILE_TO_CHECK = R"(C:\UserData\Gerald\Temp\SomeSrcDir\index.php)";
// Regular expression: submatch 1 = http link (all chars between the two quotes),
// submatch 2 = description (everything between > and </a>); ^ inside [] = NOT
const std::regex httpLinkRegEx {"a href=\"([^\"]*)\"[^<]*>([^<]*)</a>"};
void TestRegExTokenIteratorToFindHttpLinks()
{
    // ReadFileContentsToString() is a helper that is not shown here
    const auto text = ReadFileContentsToString(FILE_TO_CHECK);
    std::sregex_token_iterator it{text.begin(), text.end(),
                                  httpLinkRegEx, { 1, 2 }}; // {1, 2} selects submatches 1 and 2 of every match
    WriteAllLinks(it);
}
Output:
DeepL................... javascript:OpenNewWindow('https://www.deepl.com/de/translator')
Google.................. javascript:OpenNewWindow('http://translate.google.com/translate_t')
Leo..................... javascript:OpenNewWindow('http://dict.leo.org/ende?lang=de&lp=ende')
MVV..................... javascript:OpenNewWindow('http://efa.mvv-muenchen.de/index.html#trip@enquiry')
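The same technique can be tried without a file and without the helpers Trim and ReadFileContentsToString. The following minimal sketch works on a string and returns the pairs instead of printing them; the function name ExtractLinks is an assumption for this illustration. The raw string literal uses the custom delimiter rx because the pattern itself contains the sequence )".

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Returns (description, link target) pairs for all hyperlinks found in the text.
std::vector<std::pair<std::string, std::string>> ExtractLinks(const std::string& html)
{
    // Submatch 1 = link target, submatch 2 = displayed description
    static const std::regex linkRegEx{R"rx(a href="([^"]*)"[^<]*>([^<]*)</a>)rx"};
    std::vector<std::pair<std::string, std::string>> links;
    // The token iterator delivers submatch 1, then submatch 2, for every match
    std::sregex_token_iterator it{html.begin(), html.end(), linkRegEx, {1, 2}};
    const std::sregex_token_iterator end;
    while (it != end)
    {
        const std::string target = *it++;
        if (it == end) break;
        const std::string description = *it++;
        links.emplace_back(description, target);
    }
    return links;
}
```

For a text containing the link "https://www.deepl.com" with the description "DeepL", the function returns one pair with the description first and the link target second.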