2007.12.13 10:37 AM

Multi-line Regex String Literals in C#

It dawned on me after writing about the use of @-quoted string literal syntax for multi-line SQL string literals in C# that it's also very handy for complex regular expressions, like this:

private static Regex scrubPattern = new Regex(@"
  (?<quote>&quot;|\u201C|\u201D|&[lr]dquo;|&\#(?:8220|8221|34);)|
  (?<squote>&[lr]squo;|&\#(?:8216|8217|39);)|
  (?<win1252squote>\x92)|
  (?<amp>\&amp;)|
  (?<whitespace>\x09|\x0D|\x0A)|
  (?<ignoredEntities>&\#?[\w]{2,7};)|
  (?<html><[^>]+>)|
  (?<badxml>[\x01-\x08]|\x10|[\x0B-\x0C]|[\x0E-\x1F]|[\x80-\x9F])|
  (?<unisymbols>[\uE0AC-\uE0D5]|[\uF041-\uF07A]|[\uF0A0-\uF0FE])
", RegexOptions.IgnorePatternWhitespace | RegexOptions.IgnoreCase);

Or this, from Wes Haggard's very interesting post, "Matching Balanced Constructs with .NET Regular Expressions":

Regex re = new Regex(string.Format(@"^
  {0}                       # Match first opening delimiter
  (?<inner>
    (?>
        {0} (?<LEVEL>)      # On opening delimiter push level
      | 
        {1} (?<-LEVEL>)     # On closing delimiter pop level
      |
        (?! {0} | {1} ) .   # Match any char unless the opening   
    )+                      # or closing delimiters are in the lookahead string
    (?(LEVEL)(?!))          # If level exists then fail
  )
  {1}                       # Match last closing delimiter
  $", "<quote>", "</quote>"), 
  RegexOptions.IgnorePatternWhitespace | RegexOptions.IgnoreCase);

// Both comparisons evaluate to true:
bool flat = re.Match("<quote>inner text</quote>").Groups["inner"].Value == "inner text";
bool nested = re.Match("<quote>a<quote>b</quote>c</quote>").Groups["inner"].Value == "a<quote>b</quote>c";

This example from Mr. Haggard's post illustrates how to "retrieve the text between a set of tags when there is the possibility of the nesting." Good stuff.
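
The same pattern could be wrapped in a small helper so that it works with other delimiter pairs. What follows is only a sketch, not something from Mr. Haggard's post: the class name BalancedMatcher and the method name Build are made up for illustration, and the calls to Regex.Escape are my own addition, on the assumption that delimiters such as parentheses (or anything containing whitespace or #) should be treated as literal text rather than as regex syntax.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper (not from the original post): builds the balanced-construct
// pattern for an arbitrary pair of opening and closing delimiters.
public static class BalancedMatcher
{
    public static Regex Build(string open, string close)
    {
        // Assumption: delimiters are literal text. Regex.Escape also escapes
        // whitespace and '#', which keeps them intact in x-mode.
        string o = Regex.Escape(open);
        string c = Regex.Escape(close);

        return new Regex(string.Format(@"^
          {0}                       # first opening delimiter
          (?<inner>
            (?>
                {0} (?<LEVEL>)      # opening delimiter: push a level
              |
                {1} (?<-LEVEL>)     # closing delimiter: pop a level
              |
                (?! {0} | {1} ) .   # any char that does not start a delimiter
            )+
            (?(LEVEL)(?!))          # fail if a level is still open
          )
          {1}                       # last closing delimiter
          $", o, c),
          RegexOptions.IgnorePatternWhitespace | RegexOptions.IgnoreCase);
    }
}
```

With that in place, BalancedMatcher.Build("(", ")").Match("(a(b)c)").Groups["inner"].Value should come back as "a(b)c", while an unbalanced input such as "(a(b)" should not match at all.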

Note that when a regular expression is initialized from a string written in this multi-line style, the RegexOptions.IgnorePatternWhitespace flag is required, because the resulting string includes the line breaks and the indentation spaces that precede each line. And, as the first example above illustrates, that same flag makes it necessary to escape any # character in the pattern (as \#) so that it isn't interpreted as the start of an x-mode comment.
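
To make the escaping rule concrete, here is a minimal sketch of what happens with and without the backslash. The class and field names are invented for this illustration. The last two fields go slightly beyond the # case and show the related whitespace behavior: in x-mode an unescaped space is dropped from the pattern, while an escaped one is kept.

```csharp
using System.Text.RegularExpressions;

// Hypothetical demo class illustrating x-mode escaping rules.
public static class XModeDemo
{
    private const RegexOptions X = RegexOptions.IgnorePatternWhitespace;

    // Unescaped '#': everything from '#' to the end of the line is a comment,
    // so the effective pattern is just "&".
    public static readonly Regex UnescapedHash = new Regex(@"
      &#39;
    ", X);

    // Escaped '#': matches the literal entity text "&#39;".
    public static readonly Regex EscapedHash = new Regex(@"
      &\#39;      # numeric entity for an apostrophe
    ", X);

    // Unescaped space is dropped, so this matches "innertext", not "inner text".
    public static readonly Regex UnescapedSpace = new Regex(@"inner text", X);

    // Escaped space survives x-mode and matches a literal space.
    public static readonly Regex EscapedSpace = new Regex(@"inner\ text", X);
}
```

The unescaped version fails quietly rather than throwing: it is still a valid pattern, just not the intended one, which is why a forgotten backslash in a long x-mode expression can be hard to spot.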

