Sometimes you need to find out whether the URLs on a page exist or not. The following code reads the HTML of the page, extracts all the URLs, and then checks whether each one exists.

Take a look at the following code:

    // Requires: using System.Collections; using System.IO; using System.Net;
    // using System.Text.RegularExpressions;
    protected void Button1_Click(object sender, EventArgs e)
    {
        WebRequest req = WebRequest.Create(
            "http://localhost:1852/LookIntoDoPostBack/UrlList.aspx");
        ArrayList badUrls = new ArrayList();
        string html;

        // Download the HTML of the page; dispose the response and reader when done
        using (HttpWebResponse res = (HttpWebResponse) req.GetResponse())
        using (StreamReader reader = new StreamReader(res.GetResponseStream()))
        {
            html = reader.ReadToEnd();
        }

        // Get the links (the pattern is one string, split here only for readability)
        string pattern =
            @"((http|ftp|https):\/\/w{3}[\d]*.|(http|ftp|https)" +
            @":\/\/|w{3}[\d]*.)([\w\d\._\-#\(\)\[\]\\,;:]+@[\w\d\._\-#\(\)\[\]\\" +
            @",;:])?([a-z0-9]+.)*[a-z\-0-9]+.([a-z]{2,3})?[a-z]{2,6}(:[0-9]+)?(\/" +
            @"[\/a-z0-9\._\-,]+)*[a-z0-9\-_\.\s\%]+(\?[a-z0-9=%&\.\-,#]+)?";

        Regex r = new Regex(pattern);
        MatchCollection mC = r.Matches(html);

        // Iterate through the collection and find if the Url Exists or not
        foreach (Match m in mC)
        {
            if (!DoesUrlExists(m.Value))
            {
                // Add to the broken urls
                badUrls.Add(m.Value);
            }
        }

        // Display the bad urls in the GridView control
        gvBadUrls.DataSource = badUrls;
        gvBadUrls.DataBind();
    }

    
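    private bool DoesUrlExists(string url)
    {
        bool urlExists = false;

        try
        {
            WebRequest req = WebRequest.Create(url);

            // WebResponse (not HttpWebResponse) so ftp:// links do not fail the cast
            using (WebResponse response = req.GetResponse())
            {
                urlExists = true;
            }
        }
        catch (WebException)
        {
            // The request failed (for example 404 or host not found)
        }
        catch (UriFormatException)
        {
            // The match is not a valid absolute URI (for example, it has no scheme),
            // so it is treated as a bad url
        }

        return urlExists;
    }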

When I find a bad URL I simply put it in an ArrayList. Later I display the bad URLs in the GridView control. The code will not report a bad URL if your ISP redirects you to a custom page instead of returning a Page Not Found error, because the request then succeeds and no WebException is thrown. Also, checking each URL is very time consuming, so if you use this code I suggest running the process on a different thread.
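
As a rough illustration of moving the check off the calling thread, here is a minimal sketch. It assumes .NET 4.5 or later (for Task.Run and Parallel.ForEach), which is newer than the code above. The UrlChecker class and its FindBadUrls and FindBadUrlsAsync methods are hypothetical names made up for this example, and the URL test is passed in as a delegate so that a method such as DoesUrlExists can be plugged in.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical helper class, not part of the original code.
public static class UrlChecker
{
    // Checks every url with the supplied predicate on thread-pool threads
    // and returns the ones for which the predicate returned false.
    public static List<string> FindBadUrls(
        IEnumerable<string> urls, Func<string, bool> urlExists)
    {
        // Thread-safe collection, because several threads add to it at once
        ConcurrentBag<string> badUrls = new ConcurrentBag<string>();

        // Each url is checked independently, so the slow requests can overlap
        Parallel.ForEach(urls, url =>
        {
            if (!urlExists(url))
            {
                badUrls.Add(url);
            }
        });

        // ConcurrentBag does not preserve order, so sort for a stable display
        return badUrls.OrderBy(u => u, StringComparer.Ordinal).ToList();
    }

    // Runs the whole check on a thread-pool thread instead of the caller's thread
    public static Task<List<string>> FindBadUrlsAsync(
        IEnumerable<string> urls, Func<string, bool> urlExists)
    {
        return Task.Run(() => FindBadUrls(urls, urlExists));
    }
}
```

With this in place, the loop in Button1_Click could in principle be replaced by a call such as UrlChecker.FindBadUrls(urls, DoesUrlExists), where urls is the list of matched values. How the result gets back to the GridView depends on how the page is set up, so treat this as a starting point rather than a drop-in replacement.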
