Thursday, September 24, 2009

Sorry Johnny - There is NO Garbage Collection in .NET

[Originally Published Apr 2004 - Updated October 2009]

Everyone "understands" that Microsoft's .NET and the CLR are a "garbage collector" based environment; but is it really?

First we must establish what is meant by "garbage" in this context. When an object is created there is (typically) one reference by which it can be accessed (the return value of "new"). While the program executes, other references to the same item may be established, and established references may terminate. When an object can no longer be referenced, it is deemed to be "Garbage". [note: This is a bit of a simplification but will satisfy our needs]

Next we must look at the definition of "collection". Webster's dictionary offers the following:

collection: the act or process of collecting.
collect: to bring together into one body or place.

Now let's look at what happens when a "GC.Collect" occurs... (For simplicity we will look at generation 0, and ignore the impact of "pinned" objects). The object graph is "walked" starting at the rooted references, and any reachable item that is in Generation 0 is marked. When the walk is complete, the live objects are moved to the Gen1 heap, and the Gen0 heap is reset back to the beginning. The result is that the memory occupied by all of the previous Gen0 residents is now available.
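The steps just described can be sketched as a toy model. This is only an illustration under the same simplifications (generation 0 only, no pinning); it is not how the CLR is implemented, and the names Obj, collect_gen0 and demo are hypothetical. The point it tries to show is that the walk only ever touches reachable objects, so the amount of garbage in Gen0 does not change the work done.

```python
# Toy model of a generation-0 collection. NOT the CLR's implementation;
# Obj, collect_gen0 and demo are hypothetical names used for illustration.

class Obj:
    def __init__(self, name, refs=()):
        self.name = name
        self.refs = list(refs)      # outgoing references to other objects


def collect_gen0(roots, gen0, gen1):
    """Promote every Gen0 object reachable from the roots, then reset Gen0.

    Returns the number of objects visited, a stand-in for the work done.
    """
    gen0_ids = {id(o) for o in gen0}
    seen = set()
    stack = list(roots)
    visited = 0
    while stack:
        obj = stack.pop()
        if id(obj) in seen:
            continue
        seen.add(id(obj))
        visited += 1
        if id(obj) in gen0_ids:
            gen1.append(obj)        # "move" the live object to Gen1
        stack.extend(obj.refs)
    gen0.clear()                    # models resetting Gen0; garbage is never visited
    return visited


def demo(garbage_count):
    """Two live objects plus an arbitrary number of unreachable ones."""
    a = Obj("a")
    b = Obj("b", [a])
    gen0 = [a, b] + [Obj(f"junk{i}") for i in range(garbage_count)]
    gen1 = []
    visited = collect_gen0([b], gen0, gen1)
    return visited, sorted(o.name for o in gen1), len(gen0)
```

In this model, demo(1) and demo(10000) return the same result: two objects visited and promoted, and an empty Gen0.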

This reveals the fundamental problem with calling this process "garbage collection". Absolutely NOTHING is done with the garbage. Specifically there are no operations which involve moving the garbage so it is "brought together in one place".

To see what a "real" garbage collection is, consider an analogy. In one's house, there are likely to be multiple wastebaskets; one in the kitchen, one in the bathroom, and others scattered throughout the residence. On trash day (or earlier if the Wife has anything to say about the matter), one goes through the residence and collects all of the garbage from multiple locations, places it in one bag, and brings it outside to the rubbish container. The amount of work is dependent on the number of original locations of garbage, and the amount of garbage in each location. The number of "precious" (non-garbage) items in the house has absolutely no bearing on the process or the effort it will involve.

But when we look at the .NET situation, the exact opposite is true. It is the number of LIVE objects that impacts the performance, as these are what must be scanned and moved. It does not matter if there is a single small "garbage" object on the heap, or if there are tens of thousands (of varying sizes). Once the live (precious) objects have been moved out of harm's way, it is a single, constant-time operation to reset the heap so it is ready to receive new objects.

This shows that .NET implements a Live Object Preservation pattern, and NOT a garbage collection pattern.

While this entire post may seem like a "semantic quibble", it has serious ramifications when dealing with .NET architecture/design and implementation. In other environments there is NO overhead (aside from the actual memory) to keeping references to heap-based objects which will be needed (or even just possibly needed) later. In many cases, the cost of allocating [always higher in a conventional heap than in a CLR heap] and deleting (updating the free list) far outweighs the memory utilization issue, and so references are kept for an extended period of time.

When this approach is taken in a .NET application, these live objects represent a performance hit every time (neglecting some optimizations) that the GC runs, simply because the GC deals with processing live objects. On the other hand, allocating a (non-large) object in .NET is typically a simple pointer increment, and abandoning it (assuming no finalizer) is a zero-time operation.
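To make the "simple pointer increment" concrete, here is a minimal sketch of a bump allocator. It is a model only: the class BumpHeap and its members are hypothetical, addresses are plain integers, and nothing here is taken from the CLR's source.

```python
# Toy bump-pointer allocator. BumpHeap is a hypothetical name; this models
# the idea of nursery allocation, not the CLR's actual allocator.

class BumpHeap:
    def __init__(self, size):
        self.size = size
        self.next_free = 0          # the allocation pointer

    def allocate(self, nbytes):
        """Hand out the next nbytes by advancing the pointer."""
        if self.next_free + nbytes > self.size:
            raise MemoryError("heap full; a collection would run here")
        address = self.next_free
        self.next_free += nbytes    # the whole cost of an allocation
        return address

    def reset(self):
        """After live objects are moved out, the heap restarts at zero."""
        self.next_free = 0
```

Abandoning an object requires no call at all in this model: the space is simply reclaimed the next time reset() runs, however many abandoned objects there are.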

Over the past few years, I have been involved with a number of projects where clients were complaining that ".NET was slow" and could not meet their performance demands. In the vast majority of cases, this was directly traced to the implementation not having proper (for .NET) object lifetime management.

addendum: When one looks at environments such as C/C++, the conventional/standard implementations (pre C++0x) do not include "garbage collection". The heap is (typically, and simplified) implemented as a structure containing the "free blocks" of items that were previously deleted. This means that (a pointer to) memory that is no longer in use [i.e. garbage] IS actually MOVED. Each time there is a call to "delete" or "free(...)" there is a synchronous [i.e. it completes before delete/free() returns] collection of information about the garbage.
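A simplified free-list heap of the kind described above might look like the following sketch. It is an assumption-laden model (first-fit search, no coalescing of adjacent blocks, integer addresses), and FreeListHeap with its malloc and free methods are hypothetical names, not any real C runtime's implementation.

```python
# Toy free-list heap. FreeListHeap is hypothetical; real C/C++ allocators
# are far more elaborate (size classes, coalescing, thread caches).

class FreeListHeap:
    def __init__(self, size):
        self.free_blocks = [(0, size)]   # list of (address, length)
        self.allocated = {}              # address -> length

    def malloc(self, nbytes):
        """First fit: search the free list for a block that is big enough."""
        for i, (addr, length) in enumerate(self.free_blocks):
            if length >= nbytes:
                if length == nbytes:
                    del self.free_blocks[i]
                else:
                    # Carve the request off the front of the block.
                    self.free_blocks[i] = (addr + nbytes, length - nbytes)
                self.allocated[addr] = nbytes
                return addr
        raise MemoryError("no free block large enough")

    def free(self, addr):
        """Synchronously record the garbage block on the free list."""
        nbytes = self.allocated.pop(addr)
        self.free_blocks.insert(0, (addr, nbytes))
```

Here every free() does bookkeeping on the garbage itself, and every malloc() pays for a search, while live blocks are never touched or moved.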

In .NET the large object heap [LOH] is used for items which reach a threshold size [85,000 bytes]. This particular heap IS operated in a manner nearly identical to a C/C++ heap, in that the "live" objects are NOT moved, and it is a set of references to the available memory (garbage) that is manipulated.

 

 

The CLR does not a Virtual Machine Make...

[Originally Written October 2004 - Updated September 2009]

Many people state that Microsoft .Net technology provides a "Virtual Machine" environment via the CLR. However, an examination of various definitions of Virtual Machine shows that this is not the best analogy.

For our first example definition, let us look no further than Microsoft's own site:

Virtual Machine: A software-implemented computer that emulates a complete hardware system in a self-contained, isolated software environment and runs its own operating system.

Clearly this does not apply, so let's break down the parts of "a complete hardware system". The three major categories of devices that make up a system are: Memory (some type of storage), Processing, and Input/Output.

When a program written in any language uses the Microsoft implementation of the CLR, it directly utilizes the actual memory presented by the underlying system. All processing is done using native instruction execution on the underlying processor, and all Input/Output is accomplished via the device drivers provided by the underlying operating system.

So, while it is possible (and there are projects attempting to reach this goal) to implement the CLR as a Virtual Machine, it is clear that the current Microsoft implementation provides none of these features.

The feature that the CLR provides is that there is an intermediate stage where the source code has been reduced from the original form into a well-defined set of intermediate instructions that are independent of any specific target environment. Additionally, a rich library (the BCL) is provided for addressing many common constructs and providing additional abstractions over lower-level functionality.

But remember, this intermediate code NEVER executes. It undergoes a second compilation phase to become pure native instructions.

This process actually has a long history. Long, long ago [1970s and 1980s] it was quite common for high-level language compilers to emit (either by default or as an option) assembly language SOURCE rather than object code. This output could then be copied to various target machines (with differences in capabilities) and run through the assembler (often with differing configurations or external linkages) to produce an executable that was specifically tailored to the targeted machine.

Although the mechanics are different, this is completely analogous to what happens with a CLR-based program.

The result is that while .Net (the CLR) does provide a level of abstraction from the actual executable code, it does not meet the criteria for a "Virtual Machine".