std::regex – Technical Notes

[regex_match: check if text matches with a given pattern]
[regex_search/sregex_iterator: Find matching expressions within a longer text]
[sregex_token_iterator and submatches: Find hyperlinks within a longer text]
[References]

regex_match: check if text matches with a given pattern

#include <cassert>
#include <regex>

const auto sampleString = "Some text message, counter=1356, var=4711";
const std::regex pattern{"Some text message, counter=[0-9]+, var=4711"}; // accept any counter value
const bool match = std::regex_match(sampleString, pattern); // true only if the whole string matches
assert(match);
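
std::regex_match succeeds only if the pattern covers the entire text, which is why the pattern above spells out the complete message. The following sketch contrasts this with std::regex_search, which accepts a match anywhere inside the text. The helper names are made up for this illustration.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: true only if the whole text matches the pattern.
bool MatchesWhole(const std::string& text, const std::string& patternStr)
{
    const std::regex pattern{patternStr};
    return std::regex_match(text, pattern);
}

// Hypothetical helper: true if the pattern matches anywhere inside the text.
bool MatchesSomewhere(const std::string& text, const std::string& patternStr)
{
    const std::regex pattern{patternStr};
    return std::regex_search(text, pattern);
}

void DemoMatchVersusSearch()
{
    const std::string text = "Some text message, counter=1356";
    assert(!MatchesWhole(text, "counter=[0-9]+"));    // pattern covers only part of the text
    assert(MatchesSomewhere(text, "counter=[0-9]+")); // a partial match is sufficient
    assert(MatchesWhole(text, ".*counter=[0-9]+"));   // ".*" absorbs the leading text
}
```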

Rules for defining matches

. = any character except a newline
[a-z] = any lowercase letter
[A-Z] = any uppercase letter
[a-zA-Z] = any lowercase or uppercase letter
[0-9a-fA-F] = a hexadecimal digit with uppercase or lowercase letters
? = preceding element is optional, occurs 0 or 1 times
* = preceding element occurs 0 or more times
+ = preceding element occurs 1 or more times
{3} = preceding element occurs exactly 3 times
{3,} = preceding element occurs at least 3 times
{3,5} = preceding element occurs at least 3 and at most 5 times
(this|that) = one of the alternatives “this” or “that”
^ = beginning of the text (inside [] it means “not”, e.g. [^abc] = any character other than a, b, c)
$ = end of the text
Escape the special characters “.?+*[](){}^$|\” with a backslash if they are part of regular text (“-” needs escaping only inside []),
e.g. "sample text \\[see somewhere\\]" or R"(sample text \[see somewhere\])"
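
A few of the rules above, checked with std::regex_match. This is only a sketch: FullMatch is a helper invented for the example, and the test words are arbitrary. The last three assertions show that a doubled backslash in an ordinary string literal and a single backslash in a raw string literal describe the same pattern, and that unescaped brackets are read as a character class.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: true if the whole text matches the pattern.
bool FullMatch(const std::string& text, const std::string& patternStr)
{
    return std::regex_match(text, std::regex{patternStr});
}

void DemoRules()
{
    assert(FullMatch("color", "colou?r"));                // '?': the 'u' is optional
    assert(FullMatch("colour", "colou?r"));
    assert(FullMatch("023DF", "[0-9a-fA-F]{3,5}"));       // 5 hexadecimal digits
    assert(!FullMatch("0122df3455", "[0-9a-fA-F]{3,5}")); // 10 digits exceed the maximum of 5
    assert(FullMatch("that", "(this|that)"));             // one of two alternatives
    assert(FullMatch("xyz", "[^abc]+"));                  // only characters other than a, b, c

    const std::string text = "sample text [see somewhere]";
    assert(FullMatch(text, "sample text \\[see somewhere\\]"));   // ordinary literal: backslash doubled
    assert(FullMatch(text, R"(sample text \[see somewhere\])")); // raw literal: backslash written once
    assert(!FullMatch(text, "sample text [see somewhere]"));      // unescaped [] is a character class
}
```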

Searching for typical patterns

Sample strings:

 1: Some text message, counter=4711, this=0122df3455, var=4711
 2: Some text message, counter=2, this=0, var=4711
 3: Some text message, counter=234, this=023DF, var=4712
 4: Some contents
 5: XSome contents
 6: XXSome contents
 7: USome contents
 8: UUUSome contents
 9: UUUUUSome contents
10: Some prefix to ignore Some contents
11: MyName is
12: MyName is Fred
13: MyName is Jane
14: MyName is Peter
15: MyName is Stephan
16: MyName is Maximilian
17: Some - any embedded not interesting text - contents
18: Anything else
19: Anything else .
20: Expected result: [x+y=3]

Each pattern is applied to all given strings using the following helper function:

// sampleStrings: container holding the sample strings listed above
void GetMatches(const std::string& in_pattern)
{
    const std::regex pattern{in_pattern};
    std::string result{in_pattern + " => "};
    int i = 0;
    for (const auto& str : sampleStrings)
    {
        ++i;
        if (std::regex_match(str, pattern))
        {
            result += " " + std::to_string(i); // record the 1-based number of the matching string
        }
    }
    result += "\n";
    std::cout << result;
}
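
GetMatches prints its result and depends on a sampleStrings container that is not shown here. For testing it can be handy to get the numbers back as a value instead. The following is a possible variant under that assumption; the name MatchingIndices and its signature are not part of the original code.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Returns the 1-based numbers of all strings that are fully matched by the pattern.
std::vector<std::size_t> MatchingIndices(const std::vector<std::string>& strings,
                                         const std::string& patternStr)
{
    const std::regex pattern{patternStr};
    std::vector<std::size_t> result;
    for (std::size_t i = 0; i < strings.size(); ++i)
    {
        if (std::regex_match(strings[i], pattern))
        {
            result.push_back(i + 1);
        }
    }
    return result;
}

void DemoMatchingIndices()
{
    // Small excerpt of the sample strings, renumbered from 1
    const std::vector<std::string> strings{
        "Some contents", "XSome contents", "XXSome contents", "USome contents"};
    assert((MatchingIndices(strings, "X?Some contents") == std::vector<std::size_t>{1, 2}));
    assert((MatchingIndices(strings, ".+Some contents") == std::vector<std::size_t>{2, 3, 4}));
}
```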

The following list shows typical patterns evaluated with std::regex_match; the matching strings are identified by their numbers:

Ignore irrelevant counter values or pointer addresses
Some text message, counter=[0-9]+, this=[0-9a-fA-F]+, var=4711 =>  1 2
Some text message, counter=[0-9]+, this=[0-9a-fA-F]+, var=4712 =>  3

Expect an exact prefix, an optional prefix, a prefix with min/max length, or any prefix of length > 0
XSome contents =>  5
X?Some contents =>  4 5
U{1,3}Some contents =>  7 8
.+Some contents =>  5 6 7 8 9 10

Expect any name (possibly empty; string 11 matches only if it ends with a space after “is”), a name with min length, or alternative names
MyName is .* =>  11 12 13 14 15 16
MyName is .{7,} =>  15 16
MyName is (Fred|Maximilian) =>  12 16

Tolerate embedded text of any length (including none), or of length at least 1
Some .*contents =>  4 10 17
Some .+contents =>  10 17

Expect any text containing at least one of the given alternatives
.*(text|is).* =>  1 2 3 11 12 13 14 15 16 17

Find text starting with 'Some'
Some.* =>  1 2 3 4 10 17

Find text ending with a space followed by exactly 4 characters (e.g. a word of 4 letters)
.* .{4} =>  12 13 18

Find special chars '[', ']', '+'
Expected result: \[x\+y=3\] =>  20
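
If the literal text comes from a variable, escaping by hand is not possible. One way to handle this is a small helper that puts a backslash in front of every special character. EscapeForRegex is a hypothetical name, and the character set it escapes is an assumption based on the list of special characters given above.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: prefixes every regex special character with a backslash
// so that the text can be embedded literally in a pattern.
std::string EscapeForRegex(const std::string& text)
{
    static const std::regex specialChars{R"([.?+\-*\[\](){}^$|\\])"};
    return std::regex_replace(text, specialChars, R"(\$&)"); // "$&" inserts the matched character
}

void DemoEscapeForRegex()
{
    const std::string literal = "[x+y=3]";
    assert(EscapeForRegex(literal) == R"(\[x\+y=3\])");

    const std::regex pattern{"Expected result: " + EscapeForRegex(literal)};
    assert(std::regex_match("Expected result: [x+y=3]", pattern));
}
```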

regex_search/sregex_iterator: Find matching expressions within a longer text

Sample text:

const auto longText = R"( 
  Lorem ipsum dolor sit amet, consetetur sadipscing elitr,
  sed diam nonumy eirmod tempor invidunt ut labore et dolore jack-london@america.com magna aliquyam erat,
  sed diam _mail@some_where.de voluptua.At vero eos et accusam et justo duo dolores et ea rebum.
  Stet clita kasd illegal@bla nk.com gubergren, no sea Peter.Nobody@Nowhere.com takimata sanctus est.
  consetetur sadipscing elitr, at joe001@web.de vero eos et accusam.)";

The following helper function stores all matching expressions found within the longer text:

void FindPatternWithinText(const std::string& in_pattern)
{
	std::cout << "Search pattern: " << in_pattern << std::endl;

	std::vector<std::string> foundMatches;
	std::string textToAnalyze = longText;
	std::regex pattern{in_pattern};
	std::smatch matches;

	while (std::regex_search(textToAnalyze, matches, pattern)) {
		foundMatches.emplace_back(matches[0]); // first match within text
		textToAnalyze = matches.suffix().str(); // rest of text after found match
	}

	std::cout << "Num entries found: " << foundMatches.size() << std::endl;
	int id = 0;
	for (const auto& entry : foundMatches)
	{
	    std::cout << std::format("{:>2}: {}", ++id, entry) << std::endl;
	}
}
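
The loop above copies the remaining text after every match. A possible variant, sketched here with the assumed name CollectMatches, keeps the text untouched and moves only a start iterator. It returns the matches instead of printing them and stops if the pattern produces an empty match, which would otherwise repeat forever.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Collects all non-overlapping matches of the pattern without copying the text.
std::vector<std::string> CollectMatches(const std::string& text, const std::string& patternStr)
{
    const std::regex pattern{patternStr};
    std::vector<std::string> found;
    std::smatch matches;
    auto searchStart = text.cbegin();

    while (std::regex_search(searchStart, text.cend(), matches, pattern))
    {
        if (matches[0].length() == 0)
        {
            break; // an empty match would never advance searchStart
        }
        found.push_back(matches[0].str());
        searchStart = matches[0].second; // continue right after the current match
    }
    return found;
}

void DemoCollectMatches()
{
    const auto words = CollectMatches("Lorem ipsum dolor, nonumy et rebum.", "[a-zA-Z]*um[a-zA-Z]*");
    assert((words == std::vector<std::string>{"ipsum", "nonumy", "rebum"}));
}
```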

Alternative helper function using sregex_iterator:

void FindPatternWithinTextUsingSRegexIterator(const std::string& in_pattern)
{
	std::string textToAnalyze = longText;
	std::regex pattern{in_pattern};
	auto search_begin = std::sregex_iterator(textToAnalyze.begin(), textToAnalyze.end(), pattern);
	auto search_end = std::sregex_iterator();

	std::cout << "Found " << std::distance(search_begin, search_end) << " matches:\n";

	for (std::sregex_iterator i = search_begin; i != search_end; ++i)
	{
		std::smatch match = *i;
		std::string match_str = match.str();
		std::cout << match_str << '\n';
	}
}

Code snippet to search both for email addresses and words containing “um”:

void TestRegex_SearchMatchesWithinText()
{
	std::cout << longText << std::endl;

	// Simple pattern for a typical email address
	// Limitation: it also accepts invalid addresses such as "..@__.de"
	const auto emailPatternStr{ R"(([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5}))" };

	// Pattern for a regular word containing "um"
	const auto wordPatternStr{ "[a-zA-Z]*um[a-zA-Z]*" };

	FindPatternWithinText(emailPatternStr);
	FindPatternWithinText(wordPatternStr);

	FindPatternWithinTextUsingSRegexIterator(emailPatternStr);
	FindPatternWithinTextUsingSRegexIterator(wordPatternStr);
}

Output:

Search pattern: ([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})
Num entries found: 4
 1: jack-london@america.com
 2: _mail@some_where.de
 3: Peter.Nobody@Nowhere.com
 4: joe001@web.de

Search pattern: [a-zA-Z]*um[a-zA-Z]*
Num entries found: 3
 1: ipsum
 2: nonumy
 3: rebum

Found 4 matches:
jack-london@america.com
_mail@some_where.de
Peter.Nobody@Nowhere.com
joe001@web.de

Found 3 matches:
ipsum
nonumy
rebum
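
The email pattern contains three groups in parentheses, but the helpers above only use the complete match. The following sketch reads the groups as submatches 1 to 3 of each match. The EmailParts struct and the function SplitEmails are assumptions made for this example; the pattern is the one used above.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Hypothetical result type: one entry per parenthesized group of the email pattern.
struct EmailParts
{
    std::string user;     // submatch 1: text before '@'
    std::string domain;   // submatch 2: text between '@' and the last '.'
    std::string topLevel; // submatch 3: letters after the last '.'
};

std::vector<EmailParts> SplitEmails(const std::string& text)
{
    static const std::regex emailPattern{
        R"(([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5}))"};

    std::vector<EmailParts> result;
    for (std::sregex_iterator it{text.begin(), text.end(), emailPattern}, end; it != end; ++it)
    {
        const std::smatch& m = *it; // m[0] is the whole address, m[1]..m[3] are the groups
        result.push_back({m[1].str(), m[2].str(), m[3].str()});
    }
    return result;
}

void DemoSplitEmails()
{
    const std::string text = "write to jack-london@america.com or Peter.Nobody@Nowhere.com today";
    const auto parts = SplitEmails(text);
    assert(parts.size() == 2);
    assert(parts[0].user == "jack-london" && parts[0].domain == "america" && parts[0].topLevel == "com");
    assert(parts[1].user == "Peter.Nobody" && parts[1].domain == "Nowhere");
}
```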

sregex_token_iterator and submatches: Find hyperlinks within a longer text

Helper function that prints both the displayed description and the link address of every hyperlink found. The template parameter “It” corresponds to std::sregex_token_iterator, which iterates from submatch to submatch.

template<typename It>
void WriteAllLinks(It it)
{
    for (It end_it{}; it != end_it;)
    {
        const std::string link{*it++};
        if (it == end_it) break;
        const std::string desc{*it++};
        std::cout << std::format("{:.<24} {}\n", Trim(desc), Trim(link));
    }
}

Function that searches a text file for all hyperlinks:

const std::string FILE_TO_CHECK = R"(C:\UserData\Gerald\Temp\SomeSrcDir\index.php)";

// Regular expression: submatch 1 = link address (all characters between the two quotes),
// submatch 2 = description (everything between > and </a>); ^ inside [] means NOT
const std::regex httpLinkRegEx {"a href=\"([^\"]*)\"[^<]*>([^<]*)</a>"};

void TestRegExTokenIteratorToFindHttpLinks()
{
	const auto text = ReadFileContentsToString(FILE_TO_CHECK);

	std::sregex_token_iterator it{text.begin(), text.end(),
		httpLinkRegEx, { 1,2 }}; // { 1,2 }: yield submatch 1, then submatch 2, for every match

	WriteAllLinks(it);
}

Output:

DeepL................... javascript:OpenNewWindow('https://www.deepl.com/de/translator')
Google.................. javascript:OpenNewWindow('http://translate.google.com/translate_t')
Leo..................... javascript:OpenNewWindow('http://dict.leo.org/ende?lang=de&amp;lp=ende')
MVV..................... javascript:OpenNewWindow('http://efa.mvv-muenchen.de/index.html#trip@enquiry')
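
The code above depends on the helpers Trim and ReadFileContentsToString and on a local file, none of which are shown here. The following sketch uses the same regular expression on a string in memory and returns the pairs instead of printing them. TrimSketch is only an assumed stand-in for Trim, ExtractLinks is a made-up name, and the example.org links are invented test data.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Assumed stand-in for the Trim helper: strips leading and trailing whitespace.
std::string TrimSketch(const std::string& s)
{
    const auto first = s.find_first_not_of(" \t\r\n");
    if (first == std::string::npos) return {};
    const auto last = s.find_last_not_of(" \t\r\n");
    return s.substr(first, last - first + 1);
}

// Returns (description, link) pairs for all hyperlinks found in the given text.
std::vector<std::pair<std::string, std::string>> ExtractLinks(const std::string& html)
{
    static const std::regex linkPattern{"a href=\"([^\"]*)\"[^<]*>([^<]*)</a>"};
    std::vector<std::pair<std::string, std::string>> links;

    // The token iterator alternates between submatch 1 (link) and submatch 2 (description).
    std::sregex_token_iterator it(html.begin(), html.end(), linkPattern, {1, 2});
    const std::sregex_token_iterator end;
    while (it != end)
    {
        const std::string link = *it++;
        if (it == end) break;
        const std::string desc = *it++;
        links.emplace_back(TrimSketch(desc), TrimSketch(link));
    }
    return links;
}

void DemoExtractLinks()
{
    const std::string html =
        "<p><a href=\"https://example.org/a\">First</a> and "
        "<a href=\"https://example.org/b\" target=\"_blank\"> Second </a></p>";
    const auto links = ExtractLinks(html);
    assert(links.size() == 2);
    assert(links[0].first == "First" && links[0].second == "https://example.org/a");
    assert(links[1].first == "Second" && links[1].second == "https://example.org/b");
}
```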