.NET Nomad

What I've learned along the way

Tuesday, January 29, 2008 #

Download Solution - OfflineHtml.zip

So, one of the cool controls available to us in WinForms is System.Windows.Forms.WebBrowser.

The WebBrowser control is essentially a managed wrapper around some COM interfaces that bind to Internet Explorer, and it provides us with several interesting capabilities.  First of all, one can use WebBrowser to easily display a web page in a WinForms application.  All you have to do is set the WebBrowser.Url property and the control takes care of getting the assets from across the wire and rendering them on the screen.

WebBrowser also exposes some interesting events that allow a programmer to react when a document is loaded, navigation is performed, etc.  There are probably a ton of places, including MSDN, where you can get that kind of information so I won't go over it here.  Instead, I am going to show something that isn't immediately obvious, but that I believe I found a clean solution to.

 

The Task

What I want to do is load an HTML page that is on my local computer without causing any network traffic, e.g. it won't load images on the page.  This is similar to, say, loading a web archive in Internet Explorer.  For our purposes let's use the Google Home Page as an example.

 

The First Attempt

I immediately set upon this task thinking it would be pretty easy.  From what I had gathered on MSDN, after loading a page the WebBrowser control's Document property is populated with an HtmlDocument object.  Similar to System.Xml.XmlDocument, HtmlDocument is a tree-like representation of the web page's HTML DOM and it exposes some handy properties for manipulating the HTML elements rendered by the WebBrowser control.  For example, the following code demonstrates setting all of the "src" attributes of the HtmlDocument's img tags to the empty string:

public HtmlDocument StripImageLoading(HtmlDocument document)
{

    foreach (HtmlElement image in document.Images)
       image.SetAttribute("src", string.Empty);
            
    return document;

}

Iterating over the various HtmlElementCollection objects exposed through HtmlDocument's properties allows one to alter, and even add, HTML elements. 
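As a minimal sketch of both altering and adding elements, the hypothetical helper below (the names HtmlDocumentSamples and DisableLinks and the notice text are my own illustration, not part of the downloadable solution) points every link at "#" and appends a paragraph to the body.  It relies only on HtmlDocument.Links, HtmlElement.SetAttribute, HtmlDocument.CreateElement and HtmlElement.AppendChild:

```csharp
using System.Windows.Forms;

public static class HtmlDocumentSamples
{
    // Hypothetical helper: neutralizes links and appends a notice paragraph.
    public static HtmlDocument DisableLinks(HtmlDocument document)
    {
        // alter existing elements: point every link at "#"
        foreach (HtmlElement link in document.Links)
            link.SetAttribute("href", "#");

        // add a new element; Body can be null if the document has no <body>
        if (document.Body != null)
        {
            HtmlElement notice = document.CreateElement("p");
            notice.InnerText = "Links are disabled in offline mode.";
            document.Body.AppendChild(notice);
        }

        return document;
    }
}
```

Like StripImageLoading, it takes an HtmlDocument and hands the same instance back.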

This is great, but how do we actually get the WebBrowser control to load an HtmlDocument for us?  There are three primary methods, each of which I'll demonstrate with a code snippet.

 

Setting WebBrowser.Url:

public void LoadPage()
{

    WebBrowser browser = new WebBrowser();
    browser.Url = new Uri("http://www.google.com");

}

The "primary" way to load a page is to set WebBrowser.Url to a valid Uri object.  When this is done the WebBrowser will get all required data for the page via HTTP and render the results into our HtmlDocument (accessible via the WebBrowser.Document property).

 

Setting WebBrowser.DocumentText:

public void LoadPage()
{

    WebBrowser browser = new WebBrowser();
    browser.DocumentText = @"<html><img src=""http://www.domain.com/someimage.gif""></html>";
}

This is the first method that would enable us to achieve our offline viewing goal.  We simply set WebBrowser.DocumentText with a string of HTML and the control uses that to render the page.  The issue with this method is that any HREF or SRC attributes will be resolved by the WebBrowser control.  In other words, in the above example the image file referenced in our <img> tag will actually be downloaded and rendered into the page on the screen.  This is clearly not what we want.

 

Setting WebBrowser.DocumentStream:

public void LoadPage()
{

    WebBrowser browser = new WebBrowser();
    FileStream source = new FileStream(@"C:\page.html", FileMode.Open, FileAccess.Read);

    browser.DocumentStream = source;

}

This method allows us to access our page as a Stream.  The WebBrowser control will load the data from the Stream and again, render it into an HtmlDocument object.  Like the DocumentText property, however, it will resolve any HREF or SRC attributes and get the resources from the web.

 

The Hurdle

What we need to do at this point should be clear: we need to somehow modify the HtmlDocument prior to the WebBrowser control rendering it on the screen.  I figured there would be an event exposed for this seemingly obvious desire.  I looked into the following events, hoping for a quick solution:

WebBrowser.DocumentCompleted - This event is fired AFTER the page is fully rendered, so it unfortunately doesn't help us.  We can still modify the HtmlDocument at this point, but since any referenced resources have already been downloaded, it is of little value in our situation.

WebBrowser.ProgressChanged - This event is fired as the page and its resources are being gathered.  It is fired asynchronously, so be very careful when using it.  That being said, I figured initially that I could wait for progress to be 100% and then I'd modify the document.  Unfortunately, this too did not work.

WebBrowser.FileDownload - Aside from DocumentCompleted, this seemed the most promising.  After all, perhaps I can check to see if the file being downloaded is an image, and if so, simply cancel the download.  No, that won't work because the FileDownload event simply takes an "EventArgs" parameter and therefore gives us no meaningful state on which to operate.
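To make the comparison concrete, here is a small sketch that subscribes to the three events and prints what each handler actually receives.  The class and method names (WebBrowserEventSamples, Attach) are made up for illustration; the event and argument types are the ones WinForms defines:

```csharp
using System;
using System.Windows.Forms;

public static class WebBrowserEventSamples
{
    // Hypothetical helper that wires up the three events discussed above.
    public static void Attach(WebBrowser browser)
    {
        // Raised once the document has finished loading; e.Url is its address.
        browser.DocumentCompleted += delegate(object sender, WebBrowserDocumentCompletedEventArgs e)
        {
            Console.WriteLine("Completed: {0}", e.Url);
        };

        // Reports how much has been downloaded so far out of the expected total.
        browser.ProgressChanged += delegate(object sender, WebBrowserProgressChangedEventArgs e)
        {
            Console.WriteLine("Progress: {0} of {1}", e.CurrentProgress, e.MaximumProgress);
        };

        // Plain EventArgs: no URL, no content type, and no Cancel flag to set.
        browser.FileDownload += delegate(object sender, EventArgs e)
        {
            Console.WriteLine("A file download is starting");
        };
    }
}
```

None of the three handlers is given a way to veto an individual resource request, which is the capability we were hoping for.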

 

So, at this point we have no way of using events to accomplish our task.  We have to find another way.  As most developers do, I scanned the net to find out if this problem had already been cracked.  I didn't find an exact solution, but I did find something that helped at least spark my imagination.  I point you now to the blog of Jim Holmes.  I kind of know Jim a little from when I lived in Ohio and went to a few Dayton .NET Users Group meetings (of which Jim was/is the President).  Now, Jim is a very smart guy (in fact he has a great O'Reilly Book out right now) so I'm not sure what happened, but in his article I think he makes a few mistakes about how the WebBrowser control works and I will point those out when we come to them.  Like I said though, his article at least sparked something in my mind: How do I get an empty HtmlDocument without going through the WebBrowser control?

 

The Solution

What we want is to load an HTML page from the local system without causing any actual network traffic.  To keep our example simple let's just say we don't want images to load at all.  My solution, for C# 3.0/.NET 3.5 at least, is to introduce an Extension Method for the WebBrowser control that allows us to arbitrarily "filter" the HtmlDocument prior to loading it.  The entire solution is available for download at the beginning of this article, so I've chunked it up a bit for display purposes:

public static class WebBrowserExtensions
{

    /// <summary>
    /// Load an HTML document from a Stream and pass the text through a filter before the page is
    /// rendered in the WebBrowser control.
    /// </summary>
    /// <param name="browser">control that renders the filtered HTML</param>
    /// <param name="source">Stream containing the content to filter and render</param>
    /// <param name="filter">Delegate used to filter the source Stream</param>
    public static void ProcessRequest(this WebBrowser browser, Stream source, Func<HtmlDocument, HtmlDocument> filter)

 

As we know, Extension Methods must be defined in static classes, as static methods.  You can see the prototype for the ProcessRequest extension above.  Besides the WebBrowser it extends, it takes two parameters: a Stream object that contains the "source" of the page and a delegate that takes an HtmlDocument and returns a modified HtmlDocument.

    using (WebBrowser tempBrowser = new WebBrowser())
    {

        //all data from the source as a string
        string sourceText = string.Empty;

        try
        {

            //read all the data from the source Stream
            using (StreamReader sourceReader = new StreamReader(source))
            {

                sourceText = sourceReader.ReadToEnd();

            }

        }
        catch (IOException ex)
        {

            throw new Exception("Could not read data from source stream", ex);

        }

It is important to note that the WebBrowser control is an absolute resource hog, so please use a using statement or other disposal pattern to properly clean it up.  Also, we could have performed all of the operations in this method using the WebBrowser control we were given, but the drawback to that is the control would fire any registered event handlers.  We want our manipulation of the HtmlDocument to be as seamless as possible, and thus we operate on a temporary WebBrowser control.

The above chunk of code also performs the mundane task of reading the entire Stream into a string and propagating any exceptions up the stack.

        //process any text we read from the source Stream
        if (!string.IsNullOrEmpty(sourceText))
        {

            HtmlDocument tempDocument = null;
            HtmlElement htmlRoot = null;            
            
            //navigate to "about:blank" to initialize an empty document
            tempBrowser.Navigate("about:blank");

Now, the above code contains something that Jim tells us to do, which is navigate our browser to "about:blank".  As Jim states, correctly, this causes the HtmlDocument object to be created and initially empty.  Exactly what we want, in fact.  However, Jim also seems to imply that this step is always necessary prior to setting either the WebBrowser.DocumentText or WebBrowser.DocumentStream properties.  As the MSDN Documentation for DocumentText points out, WebBrowser will automatically navigate to "about:blank" each and every time either of these properties is set.

The reason that we are doing this is that we don't WANT to set DocumentText.  Remember, that will cause all of our resources to be loaded!  All we are trying to do is get an empty HtmlDocument object!

            //load the sourceText into the document.
            tempBrowser.Document.Write(sourceText);

Now that we have navigated to "about:blank", we can use the WebBrowser.Document property to access an empty HtmlDocument.  Further, we can use the HtmlDocument.Write method to populate the document with our HTML.  This is looking pretty nice so far!

            //now filter the document if a filter was specified
            if(filter != null)
                tempDocument = filter(tempBrowser.Document);

            //if the filter did not return a document, or no filter was specified, use the original document
            if (tempDocument == null)
                tempDocument = tempBrowser.Document;

The code from here on out is pretty standard.  We are applying any filter we've been given and keeping track of our temporary HtmlDocument object as it is being modified.

 

            //find the root HTML element, there can be only one!
            var htmlElements = tempDocument.GetElementsByTagName("html");

            if (htmlElements != null && htmlElements.Count > 0)
                htmlRoot = htmlElements[0];

            //now, extract the text and set it on the actual browser
            //(htmlRoot stays null if no <html> element was found)
            if (htmlRoot != null)
                browser.DocumentText = htmlRoot.OuterHtml;

To wrap this method up, we get the root <html> tag and then set the WebBrowser.DocumentText property of the WebBrowser control we were given to the <html> tag's OuterHtml (i.e. everything in the document including HTML tags and content).

By setting the DocumentText property, we are forcing the WebBrowser control to load our modified document.  We have accomplished our goal.  We can now modify the HtmlDocument BEFORE it gets rendered.

The Final Bits

For the sake of completeness, let's use the StripImageLoading method we created earlier to modify a "local" page:

public partial class MainForm : Form
{
    public MainForm()
    {

        InitializeComponent();

        //get google's home page
        FileStream source = new FileStream(@"C:\Development\VS2008\OfflineHtml\google.html", FileMode.Open, FileAccess.Read);

        //process the request
        mainBrowser.ProcessRequest(source, StripImageLoading);

    }

    public HtmlDocument StripImageLoading(HtmlDocument document)
    {

        foreach (HtmlElement image in document.Images)
            image.SetAttribute("src", string.Empty);
        
        return document;

    }

}

The above class opens a saved HTML file that contains the source HTML of the Google home page.  It then uses our ProcessRequest extension method to filter the HtmlDocument using the StripImageLoading method as its delegate.  The result when you run the code should be a missing image on the page.  If you want to, go download a network analyzer like Wireshark to confirm that no HTTP requests are being made as a result of rendering the page.

 

Summary

The WebBrowser control is pretty cool.  It has a lot of useful features out of the box and is quite extensible.  In this article you've seen its basic usage and a slightly more advanced scenario for which new .NET 3.5 capabilities provide an extremely clean solution.  In fact, it is probably the first time I've really gotten an "Oh yeah! This feels right" when using extension methods outside of LINQ.  Of course, pretty much the same code will compile and work in a .NET 2.0 environment.  You'll just have to remove the "this" modifier in front of the first parameter of the extension method (i.e. turn ProcessRequest into a vanilla static method) and change any code that uses it to call the static method directly.  You will also need to declare your own delegate type in place of Func<HtmlDocument, HtmlDocument>, since the Func delegates first shipped with .NET 3.5, and replace "var" with an explicit type if you are compiling with the C# 2.0 compiler.


A couple of readers (at least one of which was thankfully vocal) complained about the blog's style.  I agree it is/was/will be pretty lame.  I am using the templates provided by geekswithblogs and I don't really have the time to create my own yet.  Let's see how things progress with this new updated style.

-Newman


Back Links

LINQ Overview, part zero

LINQ Overview, part one (Extension Methods)

 

NOTE: This article is dedicated to Keith Elder...even if he never sent me a bologna sandwich.

Apparently, two months is my definition of "very soon".  Let's continue.

Since .NET 1.0 we've had the concept of delegates.  They are the constructs that allow us to call methods via a reference, such as:

delegate int AddFunc(int x, int y);
public static class MathOps
{

   public static int Add(int x, int y)
   {

      return x + y;

   }
} 
class Program
{
   static void Main(string[] args)
   {
           
      AddFunc f = new AddFunc(MathOps.Add);

      Console.WriteLine("Delegate: 2 + 2 = {0}", f(2, 2));
      Console.ReadLine();

   }
}

There is nothing new and exciting about delegates, as calling a function via pointer has been around for a very long time.  In fact, delegates are actually somewhat annoying in terms of syntax.  They must be declared as a type of their own, you must wrap the target method in a delegate object, etc.  Why can't we have a simpler syntax? After all, most of the time delegates are used to respond to relatively simple events or act as part of a strategy pattern (e.g. in a sort).

Anonymous Methods

In honor of Bill Gates, .NET 2.0 decided to give us a kinder and gentler delegate syntax.  The main method above could easily be rewritten as:

class Program
{
   static void Main(string[] args)
   {
           
      AddFunc f = delegate(int x, int y) { return x + y; };

      Console.WriteLine("Anonymous Method: 2 + 2 = {0}", f(2, 2));
      Console.ReadLine();

   }
}

As is common on the .NET platform, the delegate keyword was overloaded to give it additional meaning.  Now one could assign to a delegate variable directly, in the current scope.  The new anonymous method syntax was similar to a method declaration.  The differences are pretty obvious, but I'll list the major ones.  Firstly, anonymous methods don't require identifiers, hence the term anonymous methods.  Secondly, anonymous methods do not need to specify a return type.  This is due to some rudimentary type inference built into the compiler.  In essence, if we already know that we are assigning to a delegate of type "AddFunc" whose return type is "int", it should be obvious to the compiler that as long as the return statements in the delegate's body return an "int" then our anonymous delegate matches the signature of "AddFunc".  The counterintuitive aspect of this is that we still have to specify the types of the anonymous method's arguments.  After all, shouldn't the compiler be smart enough to also assume the types of our "x" and "y" based on the delegate type we are assigning to?  It should be, but unfortunately it is not.
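Here is a small sketch of an anonymous method used as the strategy in a sort.  The class and method names (AnonymousMethodSamples, SortByLength) are made up for illustration; Array.Sort and Comparison<T> are the standard framework members:

```csharp
using System;

public static class AnonymousMethodSamples
{
    // Sorts a copy of the input by string length, using an anonymous method
    // as the comparison strategy.
    public static string[] SortByLength(string[] words)
    {
        string[] sorted = (string[])words.Clone();

        // The parameter types must be written out; the int return type is
        // inferred from Comparison<string>.
        Array.Sort(sorted, delegate(string a, string b)
        {
            return a.Length.CompareTo(b.Length);
        });

        return sorted;
    }
}
```

Calling SortByLength(new string[] { "ccc", "a", "bb" }) should give back "a", "bb", "ccc" without a named comparison method ever being declared.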

There is something else I want to say about anonymous methods before moving on. This is something I come across all the time and some developers just don't get: anonymous methods allow for lexical closures. 

 

Lexical Closures

There is a lot of bickering on the net about "does .NET support 'true' closures?"  Well, based on my understanding and in my opinion, they support lexical closures or at least something close enough that for most practical purposes it doesn't matter.  I'll leave the 100% correct definition to the language lawyers and just give a quick example and some reasons why a lot of developers get caught in the lexical closure trap.

delegate int Increment();
static void Main(string[] args)
{
   
   Increment AddOne = AnonInc(0, 1);
   Increment SubOne = AnonInc(10, -1);

   for (int i = 0; i < 10; ++i)
   {

      Console.WriteLine("{0},{1}", AddOne(), SubOne());

   }
   Console.ReadLine();
}
static Increment AnonInc(int start, int by)
{

   return delegate { return start = start + by; };

}

The output of the above code should be:

1,9
2,8
3,7
4,6
5,5
6,4
7,3
8,2
9,1
10,0

First, take a look at our delegate "Increment".  It takes no arguments and returns an "int".  The idea is that delegates will somehow increment "a value" and return the next value in the sequence. 

Next, look at the method "AnonInc".  Does it return a delegate? That's crazy!  Further, it returns a delegate that makes use of something commonly referred to as "up values" or "outer variables" depending on the person/system/said person's mood.  An outer variable is simply a variable that exists in the scope that contains the delegate.  In this case, our delegate's scope is the "AnonInc" method in which the "start" and "by" arguments are implicitly defined local variables. 

Now, based on the definition of the delegate returned by "AnonInc" and the output of the program we can tell something interesting is going on here.  The question you should be asking right now is, "How is it that we are modifying the value of a local variable inside a delegate and it is keeping track of the change?"

If you recall, delegates, and therefore anonymous methods, are represented by objects.  When an anonymous method uses outer variables, the compiler also generates a class at compile time to hold them.  These classes have funny, mangled names and you cannot really do too much with them.  The thing that one needs to know is that any outer variables used by an anonymous delegate become fields of this auto-generated class.  So, in our case if we look at the assembly generated by the above program using a tool like Reflector we should find a class like:

[CompilerGenerated]
private sealed class <>c__DisplayClass7
{
    // Fields
    public int by;
    public int start;

    // Methods
    public int <AnonInc>b__6()
    {
        return (this.start += this.by);
    }
}

As you can see, the above class has two fields with the same names as our outer variables and a method that accesses them.  Looking at the code this way kind of takes the magic out of anonymous methods and we begin to realize that, as I said about extension methods, it is just syntactic sugar.  Handy, but not magical.
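To drive the point home, here is a rough hand-written equivalent of what the compiler does for AnonInc.  The names Counter, IncrementClosure and ManualInc are my own stand-ins for illustration; the real generated names are mangled as shown above:

```csharp
using System;

public delegate int Counter();

// Hand-written stand-in for the compiler-generated closure class.
public sealed class IncrementClosure
{
    public int start;
    public int by;

    public int Next()
    {
        return start += by;
    }
}

public static class ClosureSamples
{
    // Behaves like AnonInc, but builds the closure object explicitly.
    public static Counter ManualInc(int start, int by)
    {
        IncrementClosure closure = new IncrementClosure();
        closure.start = start;
        closure.by = by;

        // The delegate keeps the closure object, and so its fields, alive.
        return new Counter(closure.Next);
    }
}
```

Each call to ManualInc creates its own closure object, which is why two counters created this way never interfere with each other.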

So, what is this trap I was talking about?  Well, it has to do with the garbage collector.  In .NET an object stays in memory for as long as it is reachable, i.e. until nothing that is still alive holds a reference to it; only then can the garbage collector reclaim it.  With lexical closures happening more or less behind the scenes it is very easy to create a memory leak such as the following:

public class ResourceWrapper
{

    public void OpenOnClick(Button btnOpen, string resourcePath)
    {

        SomeResource res = new SomeResource(resourcePath);

        btnOpen.Click += delegate(object sender, EventArgs e) { res.Access(); };

    }
    
}

public class SomeResource
{

    public SomeResource(string path) { }

    public void Access() { }

}

Granted, this example is contrived, but you see similar things all the time.  So, what's going on here? Basically if we look at "OpenOnClick" we can see that an anonymous method is being registered as the handler for a button's Click event.  Further, the anonymous method is using an outer variable "res".  This means that the following class gets generated for us:

[CompilerGenerated]
private sealed class <>c__DisplayClass1
{
    // Fields
    public SomeResource res;

    // Methods
    public void <OpenOnClick>b__0(object sender, EventArgs e)
    {
        this.res.Access();
    }
}

Normally, we'd just assume that since "res" is a local variable in the "OpenOnClick" method, the object it references would become eligible for collection as soon as the method returned.  However, since our anonymous delegate holds a reference to it (through the generated class), and the button's Click event holds a reference to the delegate, the SomeResource object will live at least as long as the button does.  One can easily see how this kind of situation can go bad quickly.  To avoid it, be careful to unregister your anonymous methods when you use them as event handlers, which means keeping the delegate in a variable so that you have something to pass to -= later!
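A minimal sketch of the unregistering pattern follows.  Publisher and Subscription are hypothetical classes standing in for the button and its consumer so that the example does not need a form:

```csharp
using System;

// Hypothetical stand-in for a control that raises a Click-style event.
public class Publisher
{
    public event EventHandler Clicked;

    public void RaiseClicked()
    {
        EventHandler handler = Clicked;
        if (handler != null)
            handler(this, EventArgs.Empty);
    }
}

public class Subscription
{
    private readonly Publisher publisher;
    private readonly EventHandler handler;

    public int Calls;

    public Subscription(Publisher publisher)
    {
        this.publisher = publisher;

        // Keep the delegate in a field: a second, textually identical
        // anonymous method would be a different delegate instance.
        handler = delegate(object sender, EventArgs e) { Calls++; };
        publisher.Clicked += handler;
    }

    public void Unsubscribe()
    {
        // Drops the publisher's reference to the delegate and whatever it captured.
        publisher.Clicked -= handler;
    }
}
```

After Unsubscribe the publisher no longer references the delegate, so the closure and anything it captured are free to be collected once nothing else points at them.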

Alright, so why did I get into all of this anonymous method stuff if the post is supposed to be about Lambda Expressions? Well, because Lambda Expressions in C# are just an evolutionary step beyond anonymous methods.  Let's chip away at some of the sugar...

 

Our first Lambda

It is difficult to describe the syntax of a lambda expression in one go since it is very flexible and depends on multiple factors.  With that in mind let's look at a quick example:

AddFunc f = (x, y) => x + y;

The above snippet declares a new AddFunc delegate and assigns a lambda expression to it.  Everything to the right of the = operator is the lambda definition. 

Some questions:

  1. Where is the return type?
  2. Where is the identifier?
  3. Does (x, y) denote the parameter list?
  4. What does the => do?
  5. Why isn't there a return statement?

Some answers:

  1. Lambda expressions do not need an explicit return type.  Just like with anonymous methods the compiler is smart enough to infer the return type based on the type of delegate it is being assigned to. In this case AddFunc returns an int, and so the lambda implicitly returns an int.  Obviously it is a compiler error if the lambda does not.
  2. Lambda expressions are by definition anonymous.  They do not have identifiers.
  3. Yes.  Further, you should note that lambda parameters do not need to explicitly state their type.  This, like the return type, is inferred by the compiler based on their order compared to the delegate's parameters list. You can, however, state the types explicitly.  (int x, int y) is a valid lambda expression parameter list.
  4. The new => operator separates the parameter list from the expression's body.  Everything after => defines what the lambda expression does.
  5. Lambda expressions don't require an explicit return statement.  When a return isn't provided the return value is assumed to be whatever the lambda expression evaluates to.

So, let's take a look at a few other valid ways to write lambda expressions:

(int x, int y) => { return x + y; };

The above is the most explicit way.  We've specified types for the parameters and a real return statement.  Notice how when we use an actual return statement we have to use the { } braces? This same syntax allows us to create multi-line lambdas and lambdas that declare local variables.

(x, y) => { return x + y; };

This one keeps the return statement and just drops the optional types in the parameter list.

() => x + y;

In the above, we've specified a lambda with an empty parameter list.  In this case we are assuming the existence of x and y as outer variables (yes, lambda expressions support lexical closures just like anonymous methods).
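The snippets above are fragments, so here is a sketch that puts each form into a context where it compiles.  I am using the generic Func delegates from .NET 3.5 as the target types, and the names LambdaForms and Evaluate are invented for the example:

```csharp
using System;

public static class LambdaForms
{
    // Returns the result of calling each lambda form with the same inputs.
    public static int[] Evaluate(int x, int y)
    {
        // explicit parameter types, statement body
        Func<int, int, int> explicitForm = (int a, int b) => { return a + b; };

        // inferred parameter types, statement body
        Func<int, int, int> inferredBlock = (a, b) => { return a + b; };

        // inferred parameter types, expression body
        Func<int, int, int> inferredExpression = (a, b) => a + b;

        // empty parameter list; x and y are captured outer variables
        Func<int> closure = () => x + y;

        return new int[]
        {
            explicitForm(x, y), inferredBlock(x, y), inferredExpression(x, y), closure()
        };
    }
}
```

All four entries of the returned array should be equal, since every form computes the same sum.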

 

A Lambda is what you assign it to

So far we've seen that lambda expressions are compatible with delegates in the sense that you can assign a lambda directly to a delegate, but there are other interesting uses. Take a second and think about writing a program in a text editor.  To the text editor, or for that matter to the compiler, the lines of code you write are just data.  The compiler doesn't execute your program, it simply translates data from one format to another.  It is natural then to ask, "If I can store a program as data, can I load a program as data at run time and then execute it?" With lambda expressions the answer is yes.

If we assign a lambda expression to a delegate it becomes a delegate of that type.

If we assign a lambda expression to an appropriately typed Expression Tree it gets converted at compile time to equivalent Expression objects.

For example:

Expression<Func<int, int, int>> exp = (x, y) => x + y;

This statement simply says, "Convert this lambda expression into an expression tree equivalent to a method that takes two integer parameters and returns the sum as an integer".

There is no resulting compilation of this tree and no execution of code as a result of this statement.  If at runtime we need to execute the function the tree represents, we must say:

Expression<Func<int, int, int>> exp = (x, y) => x + y;
var func = exp.Compile();
Console.WriteLine("{0}", func(1, 1));

Now, it isn't inherently obvious why this is cool so I'll spell it out: If the compiler can represent executable code using Expression objects, so can we.  In fact, we will do exactly that by the end of this series.
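As a small taste, the following sketch builds the same (x, y) => x + y tree by hand with the factory methods in System.Linq.Expressions and compiles it.  The class and method names (ExpressionSamples, BuildAdd) are illustrative only:

```csharp
using System;
using System.Linq.Expressions;

public static class ExpressionSamples
{
    // Builds the tree for (x, y) => x + y node by node and compiles it.
    public static Func<int, int, int> BuildAdd()
    {
        ParameterExpression x = Expression.Parameter(typeof(int), "x");
        ParameterExpression y = Expression.Parameter(typeof(int), "y");

        // the body of the lambda: x + y
        BinaryExpression body = Expression.Add(x, y);

        // wrap the body and its parameters in a typed lambda expression
        Expression<Func<int, int, int>> lambda =
            Expression.Lambda<Func<int, int, int>>(body, x, y);

        // nothing is executable until Compile turns the tree into a delegate
        return lambda.Compile();
    }
}
```

Because the tree is assembled at run time, the operator or the number of parameters could just as well come from user input or a configuration file.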

As funny as it may sound, this is all you really need to know about lambda expressions.   You can use them in place of anonymous delegates (and you should), they forced the .NET team to provide C# with something approaching real type inference, and they allow us to represent code as data in a statically type checked way.

 

LINQ Tie In

Awesome. How are Lambda Expressions useful in LINQ?  Well, by now you've read the basic LINQ syntax somewhere else as I asked so I'll just show a couple of quick examples:

static void UseLINQ()
{

    var names = new List<GenderedName> { 
        new GenderedName { Name="Bob", Gender=Gender.Boy }
        , new GenderedName { Name="Sally", Gender=Gender.Girl }
        , new GenderedName { Name="Jack", Gender=Gender.Boy }
        , new GenderedName { Name="Sarah", Gender=Gender.Girl }
        , new GenderedName { Name="Philbert", Gender=Gender.Boy }            
    };

    var boyNames = names.Where((n) => n.Gender == Gender.Boy).Select((n) => new { n.Name });

    foreach (var name in boyNames)
        Console.WriteLine("{0}", name.Name);

}

The above function queries a list of names for those that are traditionally used for boys.  In order to make use of the actual lambda expression syntax I used the method based approach to querying with LINQ.  In fact, there are two lambdas in our code:

(n) => n.Gender == Gender.Boy

This lambda is for our selection criteria and simply compares the given name, n, to see if it is used for boys. 

(n) => new { n.Name }

In this expression we are returning a new anonymous type that just contains the Name property of the GenderedName that has passed our selection criteria.

We can simplify, or rather pretty up, this method by using the new LINQ keywords like so:

static void UseLINQ()
{

    var names = new List<GenderedName> { 
        new GenderedName { Name="Bob", Gender=Gender.Boy }
        , new GenderedName { Name="Sally", Gender=Gender.Girl }
        , new GenderedName { Name="Jack", Gender=Gender.Boy }
        , new GenderedName { Name="Sarah", Gender=Gender.Girl }
        , new GenderedName { Name="Philbert", Gender=Gender.Boy }            
    };

    var boyNames = from n in names
                   where n.Gender == Gender.Boy
                   select new { n.Name };

    foreach (var name in boyNames)
        Console.WriteLine("{0}", name.Name);

}

It doesn't look like we are using lambda expressions here, but we really are.  The compiler turns our pretty code into the same method calls that we just used, with each where and select clause becoming a lambda expression.  Whether those lambdas end up as delegates or as an Expression Tree for later execution depends on the methods they are passed to.

I just want to be very explicit here and point out something.  When we query an in-memory collection like our List<GenderedName>, LINQ uses our lambda expressions as delegates.  We know this because the Where method accepts arguments of the Func<T> variety.  The Func series of generic types are actually generic delegates.  For example, MSDN has the following definition for Func<T, TResult>:

public delegate TResult Func<T, TResult>(
    T arg
)

Because the query methods merely store these delegates (or expression trees) and do not invoke them until the results are enumerated, LINQ is able to support Lazy Evaluation.
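A quick way to observe this is to count how often the predicate runs.  The sketch below uses made-up names (DeferredSamples, CountPredicateCalls) and an in-memory list, so it only demonstrates the delegate-based flavor of LINQ:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DeferredSamples
{
    // Returns { predicate calls before enumeration, calls after, match count }.
    public static int[] CountPredicateCalls()
    {
        int calls = 0;
        List<int> numbers = new List<int> { 1, 2, 3, 4 };

        // Where only stores the delegate; nothing is evaluated yet.
        IEnumerable<int> evens = numbers.Where(n => { calls++; return n % 2 == 0; });
        int before = calls;

        // Enumerating the query is what finally invokes the lambda.
        List<int> result = evens.ToList();
        int after = calls;

        return new int[] { before, after, result.Count };
    }
}
```

The predicate is not called at all while the query is being defined, and it runs once per element only when ToList walks the sequence.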