How to scrape a website without being stumped by 999 errors
Web site scraping is a great tool, and it is so easy in .NET. But there are some snags, especially if you are trying to build a robust system that can run unattended. One of these problems is web errors like '999 web site temporarily unavailable.' .NET treats these errors as fatal exceptions, and boom, you are out of the program. Now you can catch the exception and then call the method again, but that recursive approach can eat up memory very quickly. There is, however, another way:

using System;
using System.IO;
using System.Net;

public static string ScrapeURL(string URL) {
    HttpWebRequest req = (HttpWebRequest)WebRequest.Create(URL);
    HttpStatusCode code;
    HttpWebResponse resp = null;

    try {
        resp = (HttpWebResponse)req.GetResponse();
        code = resp.StatusCode;  // kept so you can branch on the status later if you want
    } catch (WebException) {
        // Swallow the web error (e.g. '999 web site temporarily unavailable')
        // and hand back null so the caller can simply try again.
        return null;
    }

    string strResult;

    // The using block closes the StreamReader and the underlying
    // response stream when the read is finished.
    using (StreamReader sr =
        new StreamReader(resp.GetResponseStream())) {
        strResult = sr.ReadToEnd();
    }

    resp.Close();
    return strResult;
}

Then all you have to do is call this procedure in a do..while loop that keeps going while the result is still null, and away you go (a minimal sketch follows below). Notice that you avoid stacking up recursive calls, because all we are doing here is ignoring the WebExceptions and letting the caller try again.
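For example, a minimal retry loop might look something like this (the pageUrl value and the five-second pause between attempts are just illustrative choices, not part of the original routine):

string pageUrl = "http://example.com/page.html";  // placeholder URL
string result;

do {
    result = ScrapeURL(pageUrl);

    if (result == null) {
        // The request failed (a 999 error, for instance); pause briefly
        // so the loop does not hammer the remote server.
        System.Threading.Thread.Sleep(5000);
    }
} while (result == null);

// result now holds the page HTML.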

You can also get more specific and use the code = resp.StatusCode value in a switch statement to do different things based on the specific HTTP error (i.e. you may want to react to a 404 error differently than you do to a 999 error).
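Here is one way that could look, as a sketch rather than the post's own code. With HttpWebRequest, an error status such as 404 or 999 actually surfaces as a WebException, and the failed response (with its StatusCode) rides along on err.Response, so a hypothetical ShouldRetry helper could switch on that value:

// Illustrative helper (ShouldRetry is not part of the original post):
// decide, based on the status behind a WebException, whether the
// caller should try the URL again or give up on it.
private static bool ShouldRetry(WebException err) {
    HttpWebResponse errResp = err.Response as HttpWebResponse;

    if (errResp == null) {
        return true;   // no response at all (timeout, DNS failure) - worth retrying
    }

    switch ((int)errResp.StatusCode) {
        case 404:
            return false;  // the page is genuinely gone - give up on this URL
        case 999:
            return true;   // temporary 'web site unavailable' - try again later
        default:
            return false;
    }
}

Inside ScrapeURL's catch block you could then call ShouldRetry(err) and decide whether to return null for another attempt or flag the URL as permanently dead.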


Comments on this post: How to scrape a website without being stumped by 999 errors

# re: How to scrape a website without being stumped by 999 errors
Simple and Clean. It worked great for what I needed.
Left by Regman on Jan 23, 2006 8:45 AM
