.NET Nomad

What I've learned along the way

Download Solution - OfflineHtml.zip

So, one of the cool controls available to us in WinForms is System.Windows.Forms.WebBrowser.

The WebBrowser control is essentially a managed wrapper around some COM interfaces that bind to Internet Explorer, and it provides us with several interesting capabilities.  First of all, one can use WebBrowser to easily display a web page in a WinForms application.  All you have to do is set the WebBrowser.Url property and the control takes care of getting the assets from across the wire and rendering them on the screen.

WebBrowser also exposes some interesting events that allow a programmer to react when a document is loaded, navigation is performed, etc.  There are probably a ton of places, including MSDN, where you can get that kind of information, so I won't go over it here.  Instead, I am going to show a problem whose solution isn't immediately obvious, but for which I believe I found a clean one.

 

The Task

What I want to do is load an HTML page that is on my local computer without causing any network traffic, e.g. it won't load images on the page.  This is similar to, say, loading a web archive in Internet Explorer.  For our purposes let's use the Google home page as an example.

 

The First Attempt

I immediately set upon this task thinking it would be pretty easy.  From what I had gathered on MSDN, after loading a page the WebBrowser control's Document property is populated with an HtmlDocument object.  Similar to System.Xml.XmlDocument, HtmlDocument is a tree-like representation of the web page's HTML DOM, and it exposes some handy properties for manipulating the HTML elements rendered by the WebBrowser control.  For example, the following code sets the "src" attribute of every img tag in the HtmlDocument to the empty string:

public HtmlDocument StripImageLoading(HtmlDocument document)
{

    foreach (HtmlElement image in document.Images)
       image.SetAttribute("src", string.Empty);
            
    return document;

}

Iterating over the various HtmlElementCollection objects exposed through HtmlDocument's properties allows one to alter, and even add, HTML elements. 
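
To make that a bit more concrete, here is a minimal sketch of two more filters in the same shape as StripImageLoading.  The HtmlFilters class and both method names are made up for illustration and are not part of the downloadable solution.  The first blanks the "src" attribute of every script tag, the second adds a new paragraph to the end of the body.  Whether blanking an attribute is enough to stop a particular resource from being fetched is something you should verify for your own pages.

```csharp
using System.Windows.Forms;

// Hypothetical helper class, for illustration only
public static class HtmlFilters
{
    // Blanks the "src" attribute of every <script> element in the document
    public static HtmlDocument StripScriptSources(HtmlDocument document)
    {
        if (document == null)
            return null;

        foreach (HtmlElement script in document.GetElementsByTagName("script"))
            script.SetAttribute("src", string.Empty);

        return document;
    }

    // Appends a new <p> element to the end of the page body
    public static HtmlDocument AppendOfflineNotice(HtmlDocument document)
    {
        // nothing to append to if there is no document or no body yet
        if (document == null || document.Body == null)
            return document;

        HtmlElement notice = document.CreateElement("p");
        notice.InnerText = "Offline copy";
        document.Body.AppendChild(notice);

        return document;
    }
}
```

Both methods take an HtmlDocument and hand the same object back, so they can be used anywhere StripImageLoading can.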

This is great, but how do we actually get the WebBrowser control to load an HtmlDocument for us?  There are three primary methods, each of which I'll demonstrate with a code snippet.

 

Setting WebBrowser.Url:

public void LoadPage()
{

    WebBrowser browser = new WebBrowser();
    browser.Url = new Uri("http://www.google.com");

}

The "primary" way to load a page is to set WebBrowser.Url to a valid Uri object.  When this is done the WebBrowser will get all required data for the page via HTTP and render the results into our HtmlDocument (accessible via the WebBrowser.Document property).

 

Setting WebBrowser.DocumentText:

public void LoadPage()
{

    WebBrowser browser = new WebBrowser();
    browser.DocumentText = @"<html><body><img src=""http://www.domain.com/someimage.gif"" /></body></html>";
}

This is the first method that would enable us to achieve our offline viewing goal.  We simply set WebBrowser.DocumentText to a string of HTML and the control uses that to render the page.  The issue with this method is that any HREF or SRC attributes will be resolved by the WebBrowser control.  In other words, in the above example the image file referenced in our <img> tag will actually be downloaded and rendered into the page on the screen.  This is clearly not what we want.

 

Setting WebBrowser.DocumentStream:

public void LoadPage()
{

    WebBrowser browser = new WebBrowser();
    FileStream source = new FileStream(@"C:\page.html", FileMode.Open, FileAccess.Read);

    browser.DocumentStream = source;

}

This method allows us to access our page as a Stream.  The WebBrowser control will load the data from the Stream and again, render it into an HtmlDocument object.  Like the DocumentText property, however, it will resolve any HREF or SRC attributes and get the resources from the web.

 

The Hurdle

What we need to do at this point should be clear: we need to somehow modify the HtmlDocument before the WebBrowser control renders it on the screen.  I figured there would be an event exposed for this seemingly obvious desire.  I looked into the following events, hoping for a quick solution:

WebBrowser.DocumentCompleted - This event is fired AFTER the page is fully rendered, so it unfortunately doesn't help us.  We can still modify the HtmlDocument at this point, but since any referenced resources have already been downloaded, it is of little value in our situation.

WebBrowser.ProgressChanged - This event is fired as the page and its resources are being gathered.  It is fired asynchronously, so be very careful when using it.  That being said, I figured initially that I could wait for progress to be 100% and then I'd modify the document.  Unfortunately, this too did not work.

WebBrowser.FileDownload - Aside from DocumentCompleted, this seemed the most promising.  After all, perhaps I can check to see if the file being downloaded is an image, and if so, simply cancel the download.  No, that won't work because the FileDownload event simply takes an "EventArgs" parameter and therefore gives us no meaningful state on which to operate.
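
If you want to see for yourself what each of these events hands you, a rough sketch like the following will do.  The WebBrowserEventLogger class and its Attach method are hypothetical names of my own; the events and event argument types are the ones from System.Windows.Forms.  It simply writes whatever state each handler receives to the debug output.

```csharp
using System;
using System.Diagnostics;
using System.Windows.Forms;

// Hypothetical helper, for illustration only
public static class WebBrowserEventLogger
{
    public static void Attach(WebBrowser browser)
    {
        if (browser == null)
            throw new ArgumentNullException("browser");

        // raised once the document has finished loading; we get the Url
        browser.DocumentCompleted += delegate(object sender, WebBrowserDocumentCompletedEventArgs e)
        {
            Debug.WriteLine("DocumentCompleted: " + e.Url);
        };

        // raised while the page loads; we get progress counters only
        browser.ProgressChanged += delegate(object sender, WebBrowserProgressChangedEventArgs e)
        {
            Debug.WriteLine("ProgressChanged: " + e.CurrentProgress + " of " + e.MaximumProgress);
        };

        // raised with a plain EventArgs, so there is nothing to inspect
        browser.FileDownload += delegate(object sender, EventArgs e)
        {
            Debug.WriteLine("FileDownload: no details available");
        };
    }
}
```

Calling WebBrowserEventLogger.Attach(mainBrowser) before setting the Url should make it plain that none of the three handlers receives the document in a state we can change before it is rendered.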

 

So, at this point we have no way of using events to accomplish our task.  We have to find another way.  As most developers do, I scanned the net to find out if this problem had already been cracked.  I didn't find an exact solution, but I did find something that at least helped spark my imagination.  I point you now to the blog of Jim Holmes.  I kind of know Jim a little from when I lived in Ohio and went to a few Dayton .NET Users Group meetings (of which Jim was/is the President).  Now, Jim is a very smart guy (in fact he has a great O'Reilly book out right now) so I'm not sure what happened, but in his article I think he makes a few mistakes about how the WebBrowser control works and I will point those out when we come to them.  Like I said though, his article at least sparked something in my mind: How do I get an empty HtmlDocument without going through the WebBrowser control?

 

The Solution

What we want is to load an HTML page from the local system without causing any actual network traffic.  To make our example simpler, let's just say we don't want images to load at all.  My solution, for C# 3.0 on .NET 3.5 at least, is to introduce an Extension Method for the WebBrowser control that allows us to arbitrarily "filter" the HtmlDocument prior to loading it.  The entire solution is available for download at the beginning of this article, so I've chunked it up a bit for display purposes:

public static class WebBrowserExtensions
{

    /// <summary>
    /// Load an HTML document from a Stream and pass the text through a filter before the page is
    /// rendered in the WebBrowser control.
    /// </summary>
    /// <param name="browser">control that renders the filtered HTML</param>
    /// <param name="source">Stream containing the content to filter and render</param>
    /// <param name="filter">Delegate used to filter the source Stream</param>
    public static void ProcessRequest(this WebBrowser browser, Stream source, Func<HtmlDocument, HtmlDocument> filter)

 

As we know, Extension Methods must be defined as static methods in static classes.  You can see the prototype for the ProcessRequest extension above.  Besides the WebBrowser being extended, it takes two parameters: a Stream object that contains the "source" of the page, and a delegate that takes an HtmlDocument and returns a modified HtmlDocument.

    using (WebBrowser tempBrowser = new WebBrowser())
    {

        //all data from the source as a string
        string sourceText = string.Empty;

        try
        {

            //read all the data from the source Stream
            using (StreamReader sourceReader = new StreamReader(source))
            {

                sourceText = sourceReader.ReadToEnd();

            }

        }
        catch (IOException ex)
        {

            throw new Exception("Could not read data from source stream", ex);

        }

It is important to note that the WebBrowser control is an absolute resource hog, so please use a using statement or other disposal pattern to properly clean it up.  Also, we could have performed all of the operations in this method using the WebBrowser control we were given, but the drawback is that the control would fire any registered event handlers.  We want our manipulation of the HtmlDocument to be as seamless as possible, and thus we operate on a temporary WebBrowser control.

The above chunk of code also performs the mundane task of reading the entire Stream into a string and propagating any exceptions up the stack.

        //process any text we read from the source Stream
        if (!string.IsNullOrEmpty(sourceText))
        {

            HtmlDocument tempDocument = null;
            HtmlElement htmlRoot = null;            
            
            //navigate to "about:blank" to initialize an empty document
            tempBrowser.Navigate("about:blank");

Now, the above code contains something that Jim tells us to do, which is to navigate our browser to "about:blank".  As Jim states, correctly, this causes the HtmlDocument object to be created and initially empty.  Exactly what we want, in fact.  However, Jim also seems to imply that this step is always necessary prior to setting either the WebBrowser.DocumentText or WebBrowser.DocumentStream properties.  As the MSDN documentation for DocumentText points out, WebBrowser will automatically navigate to "about:blank" each and every time either of these properties is set.

The reason that we are doing this is that we don't WANT to set DocumentText.  Remember, that will cause all of our resources to be loaded!  All we are trying to do is get an empty HtmlDocument object!

            //load the sourceText into the document.
            tempBrowser.Document.Write(sourceText);

Now that we have navigated to "about:blank", we can use the WebBrowser.Document property to access an empty HtmlDocument.  Further, we can use the HtmlDocument.Write method to populate the document with our HTML.  This is looking pretty nice so far!

            //now filter the document if a filter was specified
            if(filter != null)
                tempDocument = filter(tempBrowser.Document);

            //if the filter did not return a document, or no filter was specified, use the original document
            if (tempDocument == null)
                tempDocument = tempBrowser.Document;

The code from here on out is pretty standard.  We are applying any filter we've been given and keeping track of our temporary HtmlDocument object as it is being modified.

 

            //find the root HTML element, there can be only one!
            var htmlElements = tempDocument.GetElementsByTagName("html");

            if (htmlElements != null && htmlElements.Count > 0)
                htmlRoot = htmlElements[0];

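            //now, extract the text and set it on the actual browser
            //(skip this if no <html> element was found, to avoid a NullReferenceException)
            if (htmlRoot != null)
                browser.DocumentText = htmlRoot.OuterHtml;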

To wrap this method up, we get the root <html> tag and then set the WebBrowser.DocumentText property of the WebBrowser control we were given to the <html> tag's OuterHtml (i.e. everything in the document including HTML tags and content).

By setting the DocumentText property, we are forcing the WebBrowser control to load our modified document.  We have accomplished our goal.  We can now modify the HtmlDocument BEFORE it gets rendered.

The Final Bits

For the sake of completeness, let's use the StripImageLoading method we created earlier to modify a "local" page:

public partial class MainForm : Form
{
    public MainForm()
    {

        InitializeComponent();

        //get google's home page
        FileStream source = new FileStream(@"C:\Development\VS2008\OfflineHtml\google.html", FileMode.Open, FileAccess.Read);

        //process the request
        mainBrowser.ProcessRequest(source, StripImageLoading);

    }

    public HtmlDocument StripImageLoading(HtmlDocument document)
    {

        foreach (HtmlElement image in document.Images)
            image.SetAttribute("src", string.Empty);
        
        return document;

    }

}

The above class opens a saved HTML file that contains the source HTML of the Google home page.  It then uses our ProcessRequest extension method to filter the HtmlDocument using the StripImageLoading method as its delegate.  The result when you run the code should be a missing image on the page.  If you want to, go download a network analyzer like Wireshark to confirm that no HTTP requests are being made as a result of rendering the page.

 

Summary

The WebBrowser control is pretty cool.  It has a lot of useful features out of the box and is quite extensible.  In this article you've seen its basic usage and a slightly more advanced scenario for which new .NET 3.5 capabilities provide an extremely clean solution.  In fact, it is probably the first time I've really gotten an "Oh yeah! This feels right" when using extension methods outside of LINQ.  Of course, pretty much the same code will compile and work in a .NET 2.0 environment.  You'll just have to remove the "this" modifier in front of the first parameter and change any calling code to match, i.e. turn ProcessRequest into a vanilla static method.  Since Func<T, TResult> also arrived with .NET 3.5, you'd need to declare an equivalent delegate type of your own as well.
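
For anyone stuck on .NET 2.0, here is a minimal sketch of what that vanilla static version might look like.  The HtmlDocumentFilter delegate and the WebBrowserHelper class are hypothetical names of my own, standing in for Func<HtmlDocument, HtmlDocument> and WebBrowserExtensions.  The body follows the same steps as ProcessRequest in this article; treat it as a starting point rather than a drop-in replacement for the downloadable solution.

```csharp
using System;
using System.IO;
using System.Windows.Forms;

// Hypothetical stand-in for Func<HtmlDocument, HtmlDocument>
public delegate HtmlDocument HtmlDocumentFilter(HtmlDocument document);

// Hypothetical plain static class; no extension method syntax required
public static class WebBrowserHelper
{
    public static void ProcessRequest(WebBrowser browser, Stream source, HtmlDocumentFilter filter)
    {
        if (browser == null) throw new ArgumentNullException("browser");
        if (source == null) throw new ArgumentNullException("source");

        //read all the data from the source Stream
        string sourceText;
        using (StreamReader reader = new StreamReader(source))
            sourceText = reader.ReadToEnd();

        if (string.IsNullOrEmpty(sourceText))
            return;

        //work on a temporary control so the caller's event handlers don't fire
        using (WebBrowser tempBrowser = new WebBrowser())
        {
            tempBrowser.Navigate("about:blank");
            tempBrowser.Document.Write(sourceText);

            //apply the filter, falling back to the unfiltered document
            HtmlDocument filtered = filter != null ? filter(tempBrowser.Document) : null;
            if (filtered == null)
                filtered = tempBrowser.Document;

            //hand the filtered markup to the real browser
            HtmlElementCollection roots = filtered.GetElementsByTagName("html");
            if (roots.Count > 0)
                browser.DocumentText = roots[0].OuterHtml;
        }
    }
}
```

You would then call it as WebBrowserHelper.ProcessRequest(mainBrowser, source, StripImageLoading), and the compiler converts the method group to the delegate type for you.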

posted on Tuesday, January 29, 2008 4:57 PM

Feedback

# re: Fun with the WebBrowser Control 2/5/2008 9:48 PM Jim Holmes
You inferred something from my article I didn't imply. :) Indeed, the about:blank page only needs to get loaded ONLY ONCE when you start using the control in order to initialize that empty document. After the initial doc is created you're free and clear to modify its contents -- as you've so clearly documented.

I also make use of the Navigating event to intercept event flow and deal with various circumstances. That, coupled with the DocumentCompleted lets me cover all the convoluted cases I've got for my particular project.

Nice article!

# re: Fun with the WebBrowser Control 3/7/2008 7:58 AM kamran
wonderful
spent whole day while googling but at the end got this page and continued towards solving my problem.
Gr888888888888888888888

# re: Fun with the WebBrowser Control 3/16/2008 11:00 PM dingsea
nice article, thanks. :)

# re: Fun with the WebBrowser Control 12/26/2008 2:29 PM Steve
Excellent! I am using VS2005, and am interested in intercepting the document before it is displayed so I can comment out some javascript before loading the document.

Do you see a way of doing this without 3.0/3.5?

Thanks again.

# re: Fun with the WebBrowser Control 12/29/2008 9:18 AM newmand
@ Steve

There is no reason that this article can't be applied to .NET 2.0. Instead of using an extension method, simply use the WebBrowserExtensions class like a typical static class.

That is, the ProcessRequest static method can be rewritten to eliminate the "this" keyword (which is really all that makes .NET 3.5 think this is an extension method) like so:

public static void ProcessRequest(WebBrowser browser, Stream source, Func<HtmlDocument, HtmlDocument> filter)

Then, when you call it you'd simply say:

WebBrowserExtensions.ProcessRequest(browser, source, filter);

Simple as that!

# re: Fun with the WebBrowser Control 3/10/2009 7:16 PM janet
This is excellent! I am trying to render a PDF file using webbrowser1.DocumentStream = myStream, however, I keep coming up with bit garbage being displayed in my webbrowser. What can be done in this instance? thanks

# re: Fun with the WebBrowser Control 4/13/2009 7:48 AM kami
hi all,


iam doing web testing tool in c#.net 2005, my requirement is use to get a webbrowser Documentl source with controls(eg: TextBox) value passed by the user(eg:Value="abc"). iam getting webbrowser Document source with Controls Vlaue null(eg: value="") . if iam using getAttribute("value") then iam getting the value. i want to get Document source with Controls Vlaue . can anybody help me

# re: Fun with the WebBrowser Control 4/29/2009 5:46 AM VB Maniac/Freak/Addict/Geek/...

hey! i'm just wondering, how did IE, Firefox (and other browsers) can view html codes in notepad? it seems that they can access the "source code" of a html document. how did that happen?

what web browser properties or methods they use to do it? and how can i do that? (^^)

pls help me! you can send your answer @ shinagata@yahoo.com

thankz!




