Say you want to find every instance of a specific type of tag in an HTML or XML document. The following code is one way to do that. It is, of course, not the most efficient way, but it's straightforward and it works. I like that. The first version of this code used the string.IndexOf(string, int) method to find image tags by searching for the static string “<img”, but that ignores aberrations such as “< img”, so you could miss some tags if the file you’re parsing is not completely well formed. The new code catches these inconsistencies, at the expense of having to run every tag in the file through some parsing code.
//requires: using System; using System.Collections.Generic; using System.Diagnostics;
private List<string> FindAllTags(string tag, string htmlSource)
{
    int totalTags = 0;
    int index = 0;
    List<string> matches = new List<string>();

    //find all matching tags in the document
    while (index > -1 && index < htmlSource.Length)
    {
        //find an opening chevron
        index = htmlSource.IndexOf('<', index);
        if (index > -1)
        {
            //find the closing chevron; if the last tag is never closed,
            //stop here (otherwise index never advances and the loop never ends)
            int close = htmlSource.IndexOf('>', index);
            if (close == -1)
                break;

            //get a copy of the tag (including chevrons)
            int length = close - index + 1;
            string match = htmlSource.Substring(index, length);
            index += length;

            //if the tag matches the one we're looking for...
            //(this is a prefix match, so "a" also matches "abbr" or "area")
            if (match.TrimStart('<', ' ').StartsWith(tag, StringComparison.Ordinal))
                matches.Add(match);

            //count every tag we parsed, whether it matched or not
            totalTags++;
        }
    }

    Debug.WriteLine("Parsed " + totalTags + " tags");
    Debug.WriteLine("Found " + matches.Count + " \"" + tag + "\" tags");
    return matches;
}
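To see the difference between the two approaches, here is a small self-contained sketch you can paste into a console project. The TagDemo class, the CountWithIndexOf method and the sample markup are my own stand-ins for illustration (the original first version isn't shown here, so this is only a guess at its shape), and FindAllTags is a condensed copy of the listing above with a guard for a tag that is never closed.

```csharp
using System;
using System.Collections.Generic;

public static class TagDemo
{
    // Hypothetical stand-in for the first version:
    // count literal occurrences of the static string "<img".
    public static int CountWithIndexOf(string htmlSource)
    {
        int count = 0;
        int index = htmlSource.IndexOf("<img", StringComparison.Ordinal);
        while (index > -1)
        {
            count++;
            // continue searching just past the match we found
            index = htmlSource.IndexOf("<img", index + 4, StringComparison.Ordinal);
        }
        return count;
    }

    // Condensed copy of the tag-by-tag approach (no Debug output).
    public static List<string> FindAllTags(string tag, string htmlSource)
    {
        var matches = new List<string>();
        int index = 0;
        while (index < htmlSource.Length)
        {
            index = htmlSource.IndexOf('<', index);
            if (index == -1) break;

            int close = htmlSource.IndexOf('>', index);
            if (close == -1) break; // unterminated tag: stop instead of looping forever

            string match = htmlSource.Substring(index, close - index + 1);
            index = close + 1;

            if (match.TrimStart('<', ' ').StartsWith(tag, StringComparison.Ordinal))
                matches.Add(match);
        }
        return matches;
    }

    public static void Main()
    {
        // the second image tag has a stray space after the chevron
        string html = "<p><img src=\"a.png\">< img src=\"b.png\"></p>";

        Console.WriteLine(CountWithIndexOf(html));          // 1
        Console.WriteLine(FindAllTags("img", html).Count);  // 2
    }
}
```

The static search only sees the well-formed tag, while the tag-by-tag version picks up both, which is exactly the trade-off described at the top: more tags found, but every tag in the file gets parsed.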
Now you can use this code in conjunction with the code presented earlier (for getting the text of a web page) to do all sorts of interesting things. Have fun!