Regular expressions work wonderfully when you want to extract matches of a pattern within a string. You can also use a regular expression to replace a pattern with some other text, or to split a string using the matching patterns as delimiters.
Is it confusing? Are you wondering what a pattern means?
OK, a pattern is just a set of characters and meta-characters that describes all the text you are interested in. The plain characters match themselves (a literal pattern), and the meta-characters give your pattern the power to match more than one string with a single pattern. The concept of a meta-character is similar to (but much more powerful than) the wildcard characters you must have used with the dir (DOS) or ls (UNIX) command to list more than one file. For example, to match all files having the extension “exe” you write “*.exe”, and to match all files whose names contain ‘a’ and ‘b’ separated by any one character you write “*a?b*”.
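As a rough illustration of the three uses mentioned above (extract, replace, split) and of the wildcard comparison, here is a minimal sketch using Python's standard re module. The sample text, the file names and the variable names are made up for this example, and the regex equivalents of the wildcards are only approximate.

```python
import re

text = "red, green;blue"

# Extract: every run of lowercase letters.
words = re.findall(r"[a-z]+", text)

# Replace: swap each separator (comma or semicolon plus optional spaces) for " | ".
replaced = re.sub(r"[,;]\s*", " | ", text)

# Split: use the same separator pattern as the delimiter.
parts = re.split(r"[,;]\s*", text)

# Wildcard comparison (hypothetical file names).
files = ["setup.exe", "notes.txt", "data_a1b.csv", "axb.exe"]

# "*.exe" corresponds roughly to: anything, then a literal dot, then "exe".
exe_files = [f for f in files if re.fullmatch(r".*\.exe", f)]

# "*a?b*" corresponds roughly to: anything, 'a', any one character, 'b', anything.
a_any_b = [f for f in files if re.fullmatch(r".*a.b.*", f)]

print(words)      # ['red', 'green', 'blue']
print(replaced)   # red | green | blue
print(parts)      # ['red', 'green', 'blue']
print(exe_files)  # ['setup.exe', 'axb.exe']
print(a_any_b)    # ['data_a1b.csv', 'axb.exe']
```

Note that the wildcard “*” becomes “.*” and the wildcard “?” becomes “.” in the regular expression, while the literal dot has to be escaped as “\.”.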
List of Meta-characters used in patterns:
\ general escape character with several uses
^ assert start of subject (or line, in multi-line mode)
$ assert end of subject (or line, in multi-line mode)
. match any character except newline (by default)
[ start of “character class” definition
] end of “character class” definition
| start of alternative branch
( start sub-pattern
) end sub-pattern
? extends the meaning of (, also 0 or 1 quantifier, also quantifier minimizer
* 0 or more quantifier
+ 1 or more quantifier
{ start min/max quantifier
} end min/max quantifier
Note: The part of a pattern enclosed in square brackets is called a character class.
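To see a few of these meta-characters in action, here is a small sketch in Python (the re module uses the same symbols for everything shown here). The sample strings are invented for illustration only.

```python
import re

# ^ and $ anchor the pattern to the start and end of the subject.
anchored = bool(re.search(r"^cat$", "cat"))         # True
not_anchored = bool(re.search(r"^cat$", "concat"))  # False

# . matches any character except a newline (by default).
dot = re.findall(r"c.t", "cat cot c\nt cut")        # ['cat', 'cot', 'cut']

# | separates alternatives.
alternatives = re.findall(r"cat|dog", "cat, dog, bird")  # ['cat', 'dog']

# ? makes the preceding element optional (0 or 1).
optional = re.findall(r"colou?r", "color colour")   # ['color', 'colour']

# * is "0 or more", + is "1 or more".
star = re.findall(r"ab*", "a ab abbb")              # ['a', 'ab', 'abbb']
plus = re.findall(r"ab+", "a ab abbb")              # ['ab', 'abbb']

# {min,max} gives explicit bounds.
bounded = re.findall(r"\d{2,3}", "1 22 333 4444")   # ['22', '333', '444']

# ( ) groups a sub-pattern so a quantifier applies to all of it.
grouped = re.search(r"(ab)+", "ababab").group(0)    # 'ababab'

# ? after a quantifier makes it match as little as possible.
greedy = re.search(r"<.+>", "<a><b>").group(0)      # '<a><b>'
minimal = re.search(r"<.+?>", "<a><b>").group(0)    # '<a>'

# \ escapes a meta-character so that it is matched literally.
literal_dots = re.findall(r"\.", "a.b.c")           # ['.', '.']
```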
List of meta-characters in a "character class":
\ general escape character
^ negate the class, but only if the first character
- indicates character range
] terminates the character class
Please note that if you want to match any of these meta-characters literally, you must escape it with a backslash to suppress its special meaning.
Backslash is further used to specify generic character types and non-printing characters:
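The character class rules and the escaping rule can be tried out with a short Python sketch like the one below. The sample strings are assumptions made for this example. re.escape is a standard-library helper that escapes every meta-character in a string for you.

```python
import re

# A plain class matches any one of the listed characters.
vowels = re.findall(r"[aeiou]", "regex")             # ['e', 'e']

# ^ as the first character negates the class.
non_digits = re.findall(r"[^0-9]+", "ab12cd")        # ['ab', 'cd']

# ^ anywhere else in the class is an ordinary character.
caret_literal = re.findall(r"[a^]", "a^b")           # ['a', '^']

# - between two characters denotes a range.
a_to_c = re.findall(r"[a-c]", "abcdef")              # ['a', 'b', 'c']

# ] and - can be matched literally inside a class by escaping them.
escaped_in_class = re.findall(r"[\]\-]", "a]b-c")    # [']', '-']

# Outside a class, escape a meta-character to match it literally.
prices = re.findall(r"\$\d+", "cost: $15 or $200")   # ['$15', '$200']

# re.escape builds a pattern that matches the given text literally.
literal_pattern = re.escape("1+1=2")
matches_itself = re.fullmatch(literal_pattern, "1+1=2") is not None  # True
```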
\d any decimal digit
\D any character not covered by \d
\s any whitespace character
\S any character not covered by \s
\w any "word" character
\W any character not covered by \w
\r carriage return
\n new line
\t tab character
Backslash is also used to specify simple assertions:
\b word boundary
\B not a word boundary
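The backslash sequences listed above can be explored with a small Python sketch such as this one. The sample strings are invented for illustration.

```python
import re

# \d matches a digit, \D anything that is not a digit.
digits = re.findall(r"\d+", "Room 42, floor 7")         # ['42', '7']
non_digits = re.findall(r"\D+", "a1b22c")               # ['a', 'b', 'c']

# \s matches whitespace (space, tab, newline, ...).
fields = re.split(r"\s+", "one  two\tthree\nfour")      # ['one', 'two', 'three', 'four']

# \w matches letters, digits and underscore; \W everything else.
words = re.findall(r"\w+", "item_1, item-2")            # ['item_1', 'item', '2']
separators = re.findall(r"\W+", "a, b!c")               # [', ', '!']

# \b asserts a word boundary without consuming any character.
whole_words = re.findall(r"\bcat\b", "cat concat cat's category")  # ['cat', 'cat']

# \B asserts that the position is NOT a word boundary.
inner = re.search(r"\Bcat\b", "cat concat").start()     # 7, the 'cat' inside 'concat'
```

Unlike \s, the assertions \b and \B match a position rather than a character, so nothing extra is included in the match.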
Now that we have learned enough concepts about regular expressions, let us start writing our own patterns.
Suppose my string is:
This is a test string in which we will find the matching patterns. This string also contains few numeric text like ITM_2345, and ITM_4321 which represents some useful code of very interesting items. These item codes are also just for testing and all such codes start with ITM_ and then followed by some numeric digits.
The simplest example is finding exact literal matches.
Example 1:
Objective: Find all occurrences of ‘test’ in the given string.
Pattern: test
The matches found are the word ‘test’ in “a test string” and the ‘test’ at the start of ‘testing’.
Now we apply some meta-characters to extract only ‘test’ and not ‘testing’.
Example 2:
Objective: Find only ‘test’, not ‘testing’, in the given string.
Pattern: \stest\s
The only match found is ‘ test ’ in “a test string”; the match includes the whitespace on both sides. The ‘test’ in ‘testing’ is skipped because it is followed by ‘i’ rather than whitespace.
Going a step further, we will now write a pattern to extract all item codes from this string.
Example 3:
Objective: Find all item codes in the given string.
Pattern: \bITM_\d*\b
The word boundary \b is used here instead of \s because ITM_2345 is followed by a comma rather than whitespace, so \sITM_\d*\s would miss it.
The matches found are ‘ITM_2345’, ‘ITM_4321’ and the bare ‘ITM_’ in “start with ITM_”; the last one matches because \d* also accepts zero digits.
Improving the pattern further so that only valid item codes are extracted:
Example 4:
Objective: Find all valid item codes in the given string.
Pattern: \bITM_\d+\b
The matches found are ‘ITM_2345’ and ‘ITM_4321’. The bare ‘ITM_’ is no longer matched because \d+ requires at least one digit, and the \b word boundaries let ITM_2345 match even though a comma follows it.
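The four examples can be checked with a short Python sketch. It runs the whitespace-delimited patterns (\s...\s) next to word-boundary variants (\b...\b) so the difference is visible; the variable names are chosen for this example only.

```python
import re

text = (
    "This is a test string in which we will find the matching patterns. "
    "This string also contains few numeric text like ITM_2345, and ITM_4321 "
    "which represents some useful code of very interesting items. "
    "These item codes are also just for testing and all such codes start "
    "with ITM_ and then followed by some numeric digits."
)

# Example 1: the literal pattern also hits the 'test' inside 'testing'.
ex1 = re.findall(r"test", text)                # ['test', 'test']

# Example 2: whitespace on both sides excludes 'testing',
# but the spaces become part of the match.
ex2 = re.findall(r"\stest\s", text)            # [' test ']

# Examples 3 and 4 delimited by whitespace: 'ITM_2345,' is missed
# because a comma, not whitespace, follows the digits.
ex3_space = re.findall(r"\sITM_\d*\s", text)   # [' ITM_4321 ', ' ITM_ ']
ex4_space = re.findall(r"\sITM_\d+\s", text)   # [' ITM_4321 ']

# The same examples delimited by word boundaries.
ex3_word = re.findall(r"\bITM_\d*\b", text)    # ['ITM_2345', 'ITM_4321', 'ITM_']
ex4_word = re.findall(r"\bITM_\d+\b", text)    # ['ITM_2345', 'ITM_4321']

for name, result in [("ex1", ex1), ("ex2", ex2),
                     ("ex3_space", ex3_space), ("ex4_space", ex4_space),
                     ("ex3_word", ex3_word), ("ex4_word", ex4_word)]:
    print(name, result)
```

Because \b consumes no characters, the word-boundary matches contain only the item code itself, without the surrounding spaces.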
OK. We now understand enough about regular expressions to tackle real-world situations. Let us write sample patterns to match some useful information, such as a ZIP code or a phone number.
1 | US ZIP Code | ^\d{5}(-\d{4})?$
2 | US Phone Number | \(?\d{3}\)?[-\s.]?\d{3}[-.]\d{4}
3 | HTML Tag | </?\w+\s*[^>]*>
4 | Email Address | [A-Za-z0-9._%-]+@[A-Za-z0-9._%-]+\.[A-Za-z]{2,4}
Note: The patterns provided here are samples only and will not correctly match every possible input.
I will try to explain these sample patterns:
US ZIP Code:
^ Beginning of line.
\d Any numeric character.
{5} 5 occurrences. Because the quantifier contains a single number, it matches exactly that many occurrences of the preceding element.
( Start sub-pattern.
- Match a ‘-’ character.
\d Any numeric character.
{4} 4 occurrences.
) End sub-pattern
? 0 or 1 occurrences of sub-pattern
$ End of line
Example strings:
12345
12345-1234
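A minimal sketch of how the ZIP code pattern might be used for validation in Python. The helper name is_zip and the rejected sample inputs are assumptions made for this example.

```python
import re

# The ZIP code pattern from the table above.
ZIP_RE = re.compile(r"^\d{5}(-\d{4})?$")

def is_zip(candidate: str) -> bool:
    """Return True if the whole string looks like a US ZIP or ZIP+4 code."""
    return ZIP_RE.search(candidate) is not None

print(is_zip("12345"))        # True
print(is_zip("12345-1234"))   # True
print(is_zip("1234"))         # False: only 4 digits
print(is_zip("123456"))       # False: 6 digits, and $ must follow the 5th
print(is_zip("12345-123"))    # False: the optional part needs exactly 4 digits
print(is_zip("12345-"))       # False: a hyphen alone does not satisfy the group
```

One detail specific to Python: $ also matches just before a trailing newline, so "12345\n" is accepted by this pattern. If that matters, re.fullmatch with the pattern \d{5}(-\d{4})? is a stricter alternative.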
US Phone Number:
\(? Match ‘(‘ 0 or 1 time (making it optional)
\d{3} 3 Numeric Characters
\)? Match ‘)‘ 0 or 1 time (making it optional)
[-\s.]? Match ‘-‘ or space or ‘.’ 0 or 1 time (making it optional), character class is used to provide a set of possible characters.
\d{3} 3 Numeric Characters
[-.] Match 1 occurrence of ‘-‘ or ‘.’
\d{4} 4 Numeric Characters
Example strings:
(123)123-1234
123-456-7890
123.456.7890
123 456-7890
123456-7890
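The phone number pattern can be tried against the example strings with a sketch like the following. The extra inputs (the rejected number, the unbalanced parenthesis and the sentence used for extraction) are invented to show where this sample pattern is strict and where it is lenient.

```python
import re

# The phone number pattern from the table above.
PHONE_RE = re.compile(r"\(?\d{3}\)?[-\s.]?\d{3}[-.]\d{4}")

examples = [
    "(123)123-1234",
    "123-456-7890",
    "123.456.7890",
    "123 456-7890",
    "123456-7890",
]

# Every example string from the article is matched in full.
all_match = all(PHONE_RE.fullmatch(s) for s in examples)     # True

# The second separator is mandatory and must be '-' or '.'.
space_only = PHONE_RE.fullmatch("123 456 7890")              # None

# The two parentheses are optional independently of each other,
# so an unbalanced one is accepted as well.
unbalanced = PHONE_RE.fullmatch("(123-456-7890") is not None  # True

# The pattern is not anchored, so it can also pull numbers out of text.
found = PHONE_RE.findall("Call 123-456-7890 or (123)123-1234.")
print(found)  # ['123-456-7890', '(123)123-1234']
```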
HTML Tag:
< Match 1 occurrence of ‘<’
/? Match 0 or 1 occurrence of ‘/’
\w+ Word characters 1 or more occurrences
\s* Space characters 0 or more occurrences
[^>]* Matches any character other than ‘>’ 0 or more occurrences, ^ negates the character class.
> Match 1 occurrence of ‘>’
Example strings:
<B>
</B>
<img src="abc.jpg">
<input type='text' value='Test Value' >
</script>
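Here is a small sketch of the HTML tag pattern at work in Python. The HTML snippets are made up for this example, and the capturing group used to pull out the tag names is an addition for illustration, not part of the pattern in the table.

```python
import re

# The HTML tag pattern explained above.
TAG_RE = re.compile(r"</?\w+\s*[^>]*>")

html = '<p>Hello <b>world</b></p><img src="abc.jpg">'

tags = TAG_RE.findall(html)
print(tags)  # ['<p>', '<b>', '</b>', '</p>', '<img src="abc.jpg">']

# Illustrative addition: a sub-pattern (\w+) captures just the tag name.
names = [re.match(r"</?(\w+)", tag).group(1) for tag in tags]
print(names)  # ['p', 'b', 'b', 'p', 'img']

# Limitation: [^>]* stops at the first '>', even inside an attribute value.
cut_short = TAG_RE.findall('<a title="x > y">')
print(cut_short)  # ['<a title="x >']
```

The last case shows why a pattern like this is fine for quick text processing but is not a substitute for a real HTML parser.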
Email Address:
[A-Za-z0-9._%-]+ Match any character from this character class, 1 or more occurrences. ‘-’ specifies a range: ‘A-Z’ means any character from A to Z, inclusive. The final ‘-’, placed last in the class, is a literal hyphen. There is no need to escape ‘.’ because inside a character class it has no meta-character meaning.
@ Match the ‘@’ character.
[A-Za-z0-9._%-]+ Match any character from this character class, 1 or more occurrences.
\. Match ‘.’; the backslash suppresses its meta-character meaning.
[A-Za-z]{2,4} Any letter, upper case or lower case, occurring 2 to 4 times.
Example strings:
abc@abc.com
abc@yahoo.co.in
abc.def_007@test.ca
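The email pattern can be checked in the same way. In this sketch the rejected addresses and the sentence used for extraction are assumptions chosen to show the limits of the sample pattern.

```python
import re

# The email pattern from the table above.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%-]+@[A-Za-z0-9._%-]+\.[A-Za-z]{2,4}")

examples = ["abc@abc.com", "abc@yahoo.co.in", "abc.def_007@test.ca"]

# Every example string from the article is matched in full.
all_match = all(EMAIL_RE.fullmatch(s) for s in examples)   # True

# No dot after the '@' part, so there is nothing for \. to match.
no_dot = EMAIL_RE.fullmatch("abc@localhost")               # None

# {2,4} rejects endings longer than four letters.
long_ending = EMAIL_RE.fullmatch("user@example.museum")    # None

# '+' is not in the first character class.
plus_sign = EMAIL_RE.fullmatch("user+tag@example.com")     # None

# Extraction from running text; the sentence-ending '.' is left out.
found = EMAIL_RE.findall("Contact abc@abc.com or abc.def_007@test.ca.")
print(found)  # ['abc@abc.com', 'abc.def_007@test.ca']
```

The rejected addresses are all plausible real-world forms, which is exactly why this pattern should be treated as a sample rather than a complete validator.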
For more on Regular Expressions >> Regular Expression in C#