An Idea can change your life.....

Wednesday, August 06, 2008

Regular Expressions in Dotnet

Quantifiers provide a simple way to specify within a pattern how many times a particular character or set of characters is allowed to repeat itself. There are three non-explicit quantifiers:

  1. *, which describes "0 or more occurrences",
  2. +, which describes "1 or more occurrences", and
  3. ?, which describes "0 or 1 occurrence".

Quantifiers always refer to the pattern immediately preceding (to the left of) the quantifier, which is normally a single character unless parentheses are used to create a pattern group. Below are some sample patterns and inputs they would match.

Pattern    Inputs (Matches)
fo*        foo, foe, food, fooot, "forget it", funny, puffy
fo+        foo, foe, food, foot, "forget it"
fo?        foo, foe, food, foot, "forget it", funny, puffy
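As a rough illustration of the three quantifiers, the sketch below runs the patterns from the table through Regex.IsMatch and Regex.Match. The input words are sample values chosen for this example, and the variable names are placeholders.

```csharp
using System;
using System.Text.RegularExpressions;

// Regex.IsMatch reports whether the pattern occurs anywhere in the input.
string[] inputs = { "foo", "food", "funny", "puffy", "bar" };

foreach (string input in inputs)
{
    bool star = Regex.IsMatch(input, "fo*");   // f followed by zero or more o
    bool plus = Regex.IsMatch(input, "fo+");   // f followed by one or more o
    bool opt  = Regex.IsMatch(input, "fo?");   // f followed by at most one o
    Console.WriteLine($"{input,-6} fo*: {star,-5} fo+: {plus,-5} fo?: {opt}");
}

// Even when all three patterns succeed, the text they consume can differ.
string starText = Regex.Match("fooot", "fo*").Value;  // "fooo"
string plusText = Regex.Match("fooot", "fo+").Value;  // "fooo"
string optText  = Regex.Match("fooot", "fo?").Value;  // "fo"
Console.WriteLine($"{starText} {plusText} {optText}");
```

Note that "funny" and "puffy" only succeed for fo* and fo? because both patterns are satisfied by a lone f.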






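In addition to specifying that a given pattern may occur exactly 0 or 1 time, the ? character has a second role. Placed directly after another quantifier (as in *?, +?, or {2,3}?), it makes that quantifier lazy, which forces the pattern or subpattern to match the minimal number of characters when it might otherwise match several in an input string.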
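The difference between the default (greedy) behavior and the minimal (lazy) behavior is easiest to see side by side. This is a small sketch; the HTML-like input string is made up for the example.

```csharp
using System;
using System.Text.RegularExpressions;

string input = "<b>bold</b> and <i>italic</i>";

// Greedy: .* consumes as much as possible, so the match runs to the last '>'.
string greedy = Regex.Match(input, "<.*>").Value;

// Lazy: .*? consumes as little as possible, so the match stops at the first '>'.
string lazy = Regex.Match(input, "<.*?>").Value;

Console.WriteLine(greedy); // <b>bold</b> and <i>italic</i>
Console.WriteLine(lazy);   // <b>
```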

Explicit quantifiers are positioned following the pattern they apply to, just like regular quantifiers. Explicit quantifiers use curly braces {} and number values for lower and upper occurrence limits within the braces. For example, x{5} would match exactly five x characters (xxxxx). When only one number is specified, it is the exact count unless it is followed by a comma, as in x{5,}, where it becomes the lower bound and matches any number of x characters greater than 4. Two numbers, as in x{2,3}, give the lower and upper bounds. Below are some sample patterns and inputs they would match.

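Pattern      Inputs (Matches)
ab{2}c       abbc, aaabbccc
ab{0,2}c     ac, abc, abbc, aabbcc
ab{2,3}c     abbc, abbbc, aabbcc, aabbbcc

Note that .NET requires the lower bound to be written out: a pattern such as ab{,2}c is read as the literal text {,2}, not as a quantifier.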
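Here is a quick sketch of explicit quantifiers in action. The patterns are anchored with ^ and $ (covered later in this post) so that the whole input has to fit the quantifier, which makes the bounds easier to see. The variable names are just placeholders.

```csharp
using System;
using System.Text.RegularExpressions;

// ^ and $ anchor each pattern so the whole input has to satisfy the quantifier.
bool exactlyFive = Regex.IsMatch("xxxxx", "^x{5}$");       // exactly five x
bool tooFew      = Regex.IsMatch("xxxx", "^x{5,}$");       // four is below the lower bound
bool fiveOrMore  = Regex.IsMatch("xxxxxxx", "^x{5,}$");    // seven satisfies "five or more"
bool inRange     = Regex.IsMatch("abbbc", "^ab{2,3}c$");   // three b's fall within 2..3
bool aboveRange  = Regex.IsMatch("abbbbc", "^ab{2,3}c$");  // four b's exceed the upper bound
bool zeroAllowed = Regex.IsMatch("ac", "^ab{0,2}c$");      // a lower bound of 0 makes b optional

Console.WriteLine($"{exactlyFive} {tooFew} {fiveOrMore} {inRange} {aboveRange} {zeroAllowed}");
// True False True True False True
```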

Metacharacters

The constructs within regular expressions that have special meaning are referred to as metacharacters. You've already learned about several metacharacters, such as the *, ?, +, and { } characters. Several other characters have special meaning within the language of regular expressions. These include the following: $ ^ . [ ( | ) ] and \.

. matches any single character (by default, any character except a newline).

^ designates the beginning of a string (or line, with the Multiline option).

$ designates the end of a string (or line, with the Multiline option).

\ "escapes" characters from their special meaning.

| (pipe) is used for alternation, essentially to specify 'this OR that' within a pattern.

( ) group patterns.

Some examples of metacharacter usage are listed below.

Pattern       Inputs (Matches)
.             a, b, c, 1, 2, 3
.*            Abc, 123, any string, even no characters would match
^c:\\         c:\windows, c:\\\\\, c:\foo.txt, c:\ followed by anything else
abc$          abc, 123abc, any string ending with abc
(abc){2,3}    abcabc, abcabcabc
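The following sketch exercises the anchors, grouping, alternation, and the dot against a few made-up inputs. It assumes C# verbatim strings (@"...") so that the backslashes reach the regex engine unchanged.

```csharp
using System;
using System.Text.RegularExpressions;

// Verbatim strings (@"...") keep backslashes as written, so the regex sees ^c:\\ exactly.
bool atStart    = Regex.IsMatch(@"c:\windows", @"^c:\\");       // ^ anchors to the start
bool notAtStart = Regex.IsMatch(@"open c:\windows", @"^c:\\");  // same text, but not at the start
bool atEnd      = Regex.IsMatch("123abc", "abc$");              // $ anchors to the end
bool notAtEnd   = Regex.IsMatch("abc123", "abc$");              // abc is present, but not at the end
bool grouped    = Regex.IsMatch("abcabc", "^(abc){2,3}$");      // the group repeats as a unit
bool onlyOnce   = Regex.IsMatch("abc", "^(abc){2,3}$");         // one repetition is too few
bool either     = Regex.IsMatch("dog", "^(cat|dog)$");          // | picks one alternative
bool anyChar    = Regex.IsMatch("b7t", "^b.t$");                // . stands for any one character

Console.WriteLine($"{atStart} {notAtStart} {atEnd} {notAtEnd} {grouped} {onlyOnce} {either} {anyChar}");
// True False True False True False True True
```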

In order to include a literal version of a metacharacter in a regular expression, it must be "escaped" with a backslash.

For instance if you wanted to match strings that begin with "c:\" you might use this: ^c:\\
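When the literal text comes from somewhere else (user input, a file path), escaping it by hand is error-prone. One option, sketched here, is the Regex.Escape method, which adds the backslashes for you. The sample strings are invented for the example.

```csharp
using System;
using System.Text.RegularExpressions;

// Regex.Escape puts a backslash in front of every metacharacter in the input.
string prefix = Regex.Escape(@"c:\");    // becomes c:\\
string sum    = Regex.Escape("1+1=2");   // becomes 1\+1=2

bool pathMatches = Regex.IsMatch(@"c:\foo.txt", "^" + prefix);

// Unescaped, the + acts as a quantifier, so the pattern no longer matches the literal text.
bool rawMatches     = Regex.IsMatch("1+1=2", "1+1=2");
bool escapedMatches = Regex.IsMatch("1+1=2", sum);

Console.WriteLine($"{prefix}  {sum}  {pathMatches}  {rawMatches}  {escapedMatches}");
// c:\\  1\+1=2  True  False  True
```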

With alternation, a pattern like a|b would match anything with an 'a' or a 'b' in it, which makes it very similar to the character class [ab].

Character classes are a mini-language within regular expressions, defined by the enclosing square brackets [ ]. The simplest character class is simply a list of characters within these brackets, such as [aeiou].

To specify any numeric digit, the character class [0123456789] could be used. However, since this would quickly get cumbersome, ranges of characters can be defined within the brackets by using the hyphen character, -.

For example: [a-z], [A-Z], [0-9]

If you need a literal hyphen to be included in your character class, specify it as the first character. For example, [-.? ]

You can also match any character except a member of a character class by negating the class using the caret ^ as the first character in the character class. Thus, to match any non-vowel character, you could use a character class of [^aAeEiIoOuU].

Pattern        Inputs (Matches)
^b[aeiou]t$    bat, bet, bit, bot, but
^[0-9]{5}$     11111, 12345, 99999
^[^-][0-9]$    10, 42, a7: any single character other than a hyphen, followed by one digit (will not match -0, -1, -2, etc.)
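To tie the character class rules together, here is a short sketch with made-up inputs. It also shows that matching is case-sensitive unless RegexOptions.IgnoreCase is passed, which is why "Bat" needs the option.

```csharp
using System;
using System.Text.RegularExpressions;

bool lower      = Regex.IsMatch("bit", "^b[aeiou]t$");
bool upper      = Regex.IsMatch("Bat", "^b[aeiou]t$");   // false: case-sensitive by default
bool ignoreCase = Regex.IsMatch("Bat", "^b[aeiou]t$", RegexOptions.IgnoreCase);
bool fiveDigits = Regex.IsMatch("12345", "^[0-9]{5}$");  // a range plus an explicit quantifier
bool hyphen     = Regex.IsMatch("-", "^[-.? ]$");        // a leading hyphen is a literal

// A negated class matches everything except the listed characters,
// so replacing those matches with "" leaves only the vowels.
string vowelsOnly = Regex.Replace("Regular Expressions", "[^aAeEiIoOuU]", "");

Console.WriteLine($"{lower} {upper} {ignoreCase} {fiveDigits} {hyphen} {vowelsOnly}");
// True False True True True euaEeio
```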

Escape Sequence

Description

\a

Matches a bell (alarm); \u0007

\b

Matches a word boundary except in a character class, where it matches a backspace character, \u0008

\t

Matches a tab; \u0009

\r

Matches a carriage return; \u000D

\v

Matches a vertical tab; \u000B

\f

Matches a form feed; \u000C

\n

Matches a new line; \u000A

\e

Matches an escape; \u001B

\040

Matches an ASCII character specified as a three-digit octal number. \040 represents a space (decimal 32).

\x20

Matches an ASCII character using 2-digit hexadecimal. In this case, \x20 represents a space.

\cC

Matches an ASCII control character, in this case ctrl-C.

\u0020

Matches a Unicode character using exactly four hexadecimal digits. In this case \u0020 is a space.

\*

Any escaped character that is not recognized as a predefined escape or character class is simply treated as that character. Thus \* is the same as \x2A (a literal *, not the * metacharacter).

\p{name}

Matches any character in the named character class 'name'. Supported names are Unicode groups and block ranges. For example Ll, Nd, Z, IsGreek, IsBoxDrawing, and Sc (currency).

\P{name}

Matches text not included in the named character class 'name'.

\w

Matches any word character. For non-Unicode and ECMAScript implementations, this is the same as [a-zA-Z_0-9]. In Unicode categories, this is the same as [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\p{Pc}].

\W

The negation of \w, this equals the ECMAScript compliant set [^a-zA-Z_0-9] or the Unicode character categories [^\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\p{Pc}].

\s

Matches any white-space character. Equivalent to the Unicode character classes [\f\n\r\t\v\x85\p{Z}]. If ECMAScript-compliant behavior is specified with the ECMAScript option, \s is equivalent to [ \f\n\r\t\v] (note leading space).

\S

Matches any non-white-space character. Equivalent to the Unicode character categories [^\f\n\r\t\v\x85\p{Z}]. If ECMAScript-compliant behavior is specified with the ECMAScript option, \S is equivalent to [^ \f\n\r\t\v] (note space after ^).

\d

Matches any decimal digit. Equivalent to [\p{Nd}] for Unicode and [0-9] for non-Unicode, ECMAScript behavior.

\D

Matches any nondigit character. Equivalent to [\P{Nd}] for Unicode and [^0-9] for non-Unicode, ECMAScript behavior.
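A few of these escapes are sketched below. The sample strings are invented, and \u0663 (an Arabic-Indic digit) is used only to illustrate the Unicode versus ECMAScript difference described for \d.

```csharp
using System;
using System.Text.RegularExpressions;

// \b matches at word boundaries, so the "cat" inside "concat" is skipped.
int wholeWords = Regex.Matches("cat concat cat", @"\bcat\b").Count;

// \d covers all Unicode decimal digits unless the ECMAScript option is set.
bool unicodeDigit = Regex.IsMatch("\u0663", @"^\d$");
bool ecmaDigit    = Regex.IsMatch("\u0663", @"^\d$", RegexOptions.ECMAScript);

// \P{Lu} is everything that is NOT an uppercase letter; removing it keeps the capitals.
string capitals = Regex.Replace("Hello World", @"\P{Lu}", "");

// \s+ treats any run of white space (spaces, tabs) as one separator.
string[] words = Regex.Split("one  two\tthree", @"\s+");

// \x20 and \u0020 are two ways to write the same space character.
bool sameSpace = Regex.IsMatch(" ", @"^\x20$") && Regex.IsMatch(" ", @"^\u0020$");

Console.WriteLine($"{wholeWords} {unicodeDigit} {ecmaDigit} {capitals} {words.Length} {sameSpace}");
// 2 True False HW 3 True
```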

Sample Expressions

Most people learn best by example, so here are a few sample expressions.

Pattern

Description

^\d{5}$

5 numeric digits, such as a US ZIP code.

^(\d{5}|\d{5}-\d{4})$

5 numeric digits, or 5 digits-dash-4 digits. This matches a US ZIP or US ZIP+4 format. The alternation sits inside a single group so that ^ and $ anchor both alternatives.

^(\d{5})(-\d{4})?$

Same as previous, but more efficient. Uses ? to make the -4 digits portion of the pattern optional, rather than requiring two separate patterns to be compared individually (via alternation).

^[+-]?\d+(\.\d+)?$

Matches any decimal number with an optional sign, such as 42, -3.14, or +0.5.

^[+-]?\d*\.?\d*$

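Same as above, but the digits on either side of the decimal point are optional, so it also matches the empty string, a lone sign, or a lone decimal point.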

^(20|21|22|23|[01]\d)[0-5]\d$

Matches any 24-hour time value written as four digits with no separator (HHmm), such as 0000, 0930, or 2359.

/\*.*\*/

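Matches a C-style comment /* … */ on a single line, including the delimiters. Because .* is greedy, a line holding two comments is matched from the first /* to the last */.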
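The sample expressions can be tried out with a few lines of C#. This is only a sketch: the test values are invented, and the lazy variant of the comment pattern (.*? in place of .*) is an addition for comparison, not one of the patterns listed above.

```csharp
using System;
using System.Text.RegularExpressions;

Regex zip    = new Regex(@"^(\d{5})(-\d{4})?$");
Regex number = new Regex(@"^[+-]?\d+(\.\d+)?$");
Regex time   = new Regex(@"^(20|21|22|23|[01]\d)[0-5]\d$");

bool zip5      = zip.IsMatch("12345");        // plain ZIP
bool zip9      = zip.IsMatch("12345-6789");   // ZIP+4
bool zipBad    = zip.IsMatch("1234-5678");    // only four leading digits
bool negative  = number.IsMatch("-3.14");
bool dangling  = number.IsMatch("3.");        // a decimal point needs digits after it
bool lateNight = time.IsMatch("2359");
bool badHour   = time.IsMatch("2400");        // hours stop at 23

// Greedy .* runs to the last */ on the line; lazy .*? stops at the first one.
string source        = "/* first */ int x; /* second */";
string greedyComment = Regex.Match(source, @"/\*.*\*/").Value;
string lazyComment   = Regex.Match(source, @"/\*.*?\*/").Value;

Console.WriteLine($"{zip5} {zip9} {zipBad} {negative} {dangling} {lateNight} {badHour}");
Console.WriteLine(greedyComment);
Console.WriteLine(lazyComment);
```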

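In .NET, a pattern is applied through the Regex class in the System.Text.RegularExpressions namespace:

using System.Text.RegularExpressions;

// Build the pattern once; "testing" contains no metacharacters, so it is matched literally.
Regex test = new Regex("testing");

// Match returns the first occurrence of the pattern in the input.
Match m = test.Match("here is a string for testing");

if (m.Success) {
    // m.Value holds the matched text and m.Index its position in the input.
    // do whatever you want
}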
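As one possible way to fill in the "do whatever you want" part, the sketch below reads the matched text and position from the Match object and then walks every occurrence with Regex.Matches. The input string is a variation on the one above, extended so that the pattern occurs twice.

```csharp
using System;
using System.Text.RegularExpressions;

Regex test = new Regex("testing");
string input = "testing, here is a string for testing";

// Match returns only the first occurrence.
Match first = test.Match(input);
if (first.Success)
{
    Console.WriteLine($"'{first.Value}' found at index {first.Index}");
}

// Matches returns every non-overlapping occurrence, in order.
MatchCollection all = test.Matches(input);
foreach (Match match in all)
{
    Console.WriteLine($"match at {match.Index}, length {match.Length}");
}
```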
