c pound

I reject your reality and substitute my own!

Finding all the tags of a specified type in an HTML document

Say you want to find every instance of a specific type of tag in an HTML or XML document. The following code is one way to do that. It is, of course, not the most efficient way, but it's straightforward and it works. I like that. The first version of this code used the string.IndexOf(string, int) method to find image tags by searching for the static string “<img”, but that ignores aberrations such as “<  img”, so it could miss tags if the file you’re parsing is not completely well formed. The new code can find these inconsistencies, but at the expense of running every tag in the file through some parsing code.

using System;
using System.Collections.Generic;
using System.Diagnostics;

private List<string> FindAllTags(string tag, string htmlSource)
{
    int totalTags = 0;
    int index = 0;
    List<string> matches = new List<string>();

    //find all matching tags in the document
    while(index < htmlSource.Length)
    {
        //find an opening chevron
        index = htmlSource.IndexOf('<', index);
        if(index < 0)
            break;

        //find the closing chevron; stop if the last tag is never closed
        //(without this check an unclosed "<" would loop forever)
        int end = htmlSource.IndexOf('>', index);
        if(end < 0)
            break;

        //get a copy of the tag (including chevrons)
        string match = htmlSource.Substring(index, end - index + 1);
        index = end + 1;

        //if the tag matches the one we're looking for...
        //(ignore case, and require the name to end there so "a" does not match "abbr")
        string name = match.TrimStart('<', ' ');
        if(name.StartsWith(tag, StringComparison.OrdinalIgnoreCase)
            && name.Length > tag.Length
            && !char.IsLetterOrDigit(name[tag.Length]))
            matches.Add(match);

        totalTags++;
    }

    Debug.WriteLine("Parsed " + totalTags + " tags");
    Debug.WriteLine("Found " + matches.Count + " \"" + tag + "\" tags");
    return matches;
}
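
For comparison, here is a rough sketch of the simpler IndexOf-based search described above. This is not the original first version of the code, just a guess at its shape; the class name NaiveTagSearch and the method FindTagStarts are made up for illustration. It shows how a literal search for “<img” skips the oddly spaced “<  img” tag.

```csharp
using System;
using System.Collections.Generic;

public static class NaiveTagSearch
{
    // Hypothetical helper (an assumption, not the post's original code):
    // returns the start index of every literal occurrence of "<" + tag.
    public static List<int> FindTagStarts(string tag, string htmlSource)
    {
        List<int> starts = new List<int>();
        string needle = "<" + tag;

        int index = htmlSource.IndexOf(needle, 0, StringComparison.Ordinal);
        while (index > -1)
        {
            starts.Add(index);
            // resume the search just past the hit we recorded
            index = htmlSource.IndexOf(needle, index + needle.Length, StringComparison.Ordinal);
        }
        return starts;
    }

    public static void Main()
    {
        // two image tags, the second with stray spaces after the chevron
        string html = "<p><img src=\"a.png\"><  img src=\"b.png\"></p>";
        List<int> starts = FindTagStarts("img", html);
        Console.WriteLine(starts.Count); // prints 1: the "<  img" tag is missed
    }
}
```

The literal search is faster because it never looks at tags it doesn't care about, which is exactly the trade-off against the tag-by-tag version.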

Now you can use this code in conjunction with the code presented earlier (for getting the text of a web page) to do all sorts of interesting things. Have fun!

Posted on Monday, October 24, 2005 5:32 AM
