
What Every Developer Must Know About Fast Garbage Collection (+ more)

.NET is, like Java, a very productive environment. The increased convenience comes at the cost of less control over your execution environment. In particular, the burden of manual memory management is much less of a problem in managed environments. Garbage collection frees you from calling delete pData yourself, and while it does not prevent memory leaks entirely, it makes them considerably harder to introduce (statics, timers, forgotten event unregistrations and others still remain). A garbage collector is a very nice thing, but it can get in your way if you want to provide a responsive application. When your customers complain that the UI is frozen or that the UI redraw causes flickering all over it, you need to take a deeper look under the .NET covers.
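One of the most common managed leaks is the forgotten event unregistration: the publisher's invocation list keeps the subscriber alive for as long as the publisher lives. A minimal sketch (the class and event names here are made up for illustration):

using System;

class Publisher
{
    // A static event lives as long as the AppDomain. Every subscriber that
    // forgets to unsubscribe is kept alive by the invocation list forever.
    public static event EventHandler SomethingHappened;

    public static void Raise()
    {
        var handler = SomethingHappened;
        if (handler != null) handler(null, EventArgs.Empty);
    }
}

class Subscriber
{
    byte[] _bigBuffer = new byte[1024 * 1024]; // 1 MB that cannot be collected

    public Subscriber()
    {
        Publisher.SomethingHappened += OnSomethingHappened;
    }

    void OnSomethingHappened(object sender, EventArgs e) { }

    // Without this call the Subscriber (and its 1 MB buffer) leaks:
    public void Unsubscribe()
    {
        Publisher.SomethingHappened -= OnSomethingHappened;
    }
}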

How does Garbage Collection affect application performance?

As the name collection suggests, it needs to do something. You can expect increased CPU usage while a garbage collection is running in the background. .NET 4.5 has made significant progress in performing more work during a garbage collection on dedicated garbage collection threads. CPU usage is normally not a problem, since it does not eat up all of your cores. The prime reason why people complain about garbage collection is that their application is stopped while a full garbage collection is running. This behavior is called a stop-the-world collector: the garbage collector stops all application threads, looks at all objects, finds the garbage and compacts memory to prevent memory fragmentation. For the last step it is very important that the application does not run, since the addresses of objects can change. It would be a disaster if your application accessed a memory location that contains data of completely unrelated objects because the address of your object has changed but the garbage collector forgot to update your pointer while your thread was frozen. This process includes not only addresses on the managed heap but also those in thread stacks and CPU registers. Every object address your application has access to needs to be updated by the GC somehow.

The first question you might ask: is there a way around the GC having to stop my threads to update object addresses? The answer is yes; there are garbage collectors out there that do not stop all application threads, like the Azul C4. It can collect heaps of hundreds of GB in size with maximal delays of less than 10ms. One secret ingredient to make this possible seems to be a kernel driver that locks the portions of memory that are currently being compacted to prevent writes to them. I am not sure how they handle reads, but the docs talk about Loaded Value Barriers to ensure that object references are always valid. This looks like some overhead for every object reference access, which can be substantial. Besides this, it has given up the concept of generations for a fully asynchronous garbage collector.
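You cannot easily observe stop-the-world pauses from within your code, but you can approximate them: a thread that does nothing but check a stopwatch is frozen during a suspension like every other managed thread, so unusually large gaps between iterations hint at GC pauses. A crude sketch (the threshold and allocation sizes are arbitrary; OS scheduling noise also produces gaps, so treat the output as a rough upper bound only):

using System;
using System.Diagnostics;
using System.Threading;

class PauseDetector
{
    static void Main()
    {
        // Allocating thread: produces garbage so the GC has work to do.
        var allocator = new Thread(() =>
        {
            var rnd = new Random();
            while (true)
            {
                var data = new byte[rnd.Next(1, 100 * 1024)];
                GC.KeepAlive(data);
            }
        }) { IsBackground = true };
        allocator.Start();

        // Watchdog loop: when all managed threads are suspended for a GC,
        // this loop is frozen too, so a large gap between two iterations
        // roughly bounds the GC pause time. Note: this spins at 100% CPU.
        var sw = Stopwatch.StartNew();
        long last = 0;
        while (true)
        {
            long now = sw.ElapsedMilliseconds;
            if (now - last > 20) // report gaps above 20 ms
                Console.WriteLine("Pause of ~{0} ms detected", now - last);
            last = now;
        }
    }
}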

Without having measured it, I would conclude from the Azul docs that you pay for low latency with slower object reference access and slower memory reclamation, which increases the working set of your application. Every approach to memory management comes with its own set of drawbacks:

  • Manual Memory Management (e.g. C/C++)
    • High Complexity
    • Low CPU Consumption
    • Low Memory Consumption
    • Low Latency
  • Garbage Collected Environment (e.g. Java, .NET)
    • Lower complexity
    • Increased CPU Consumption
    • Increased Memory Consumption
    • Potentially High Latency

Is this conclusion correct? Only under the assumption that you need to allocate memory. What is the most efficient way to allocate memory? No allocation at all. The secret to real-time performance in any environment is to make the hot path of your application allocation free. When you need hard real time (e.g. avionics), not only allocations are banned but also exceptions, because you cannot predict how long exception handling will take. Very few systems need to cope with hard real time. For 99% of all software you can live with small millisecond pauses, even if it means that your real-time trading system will issue a million dollar order after your competitors. You can do this in C/C++, but there are also ways to become (mostly) allocation free in .NET.
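What "allocation free on the hot path" can look like in practice: hoist buffers out of the frequently called code and reuse them. A minimal sketch (the names are made up; real code would also need to care about thread safety of the shared buffer):

using System;

class HotPath
{
    // Allocating version: one new array per call, i.e. garbage on every call.
    static double AverageAllocating(int count)
    {
        var samples = new double[count]; // allocation on the hot path
        for (int i = 0; i < count; i++) samples[i] = i;
        double sum = 0;
        for (int i = 0; i < count; i++) sum += samples[i];
        return sum / count;
    }

    // Allocation-free version: the buffer is allocated once and reused.
    static double[] _samples = new double[1024];
    static double AverageReusing(int count)
    {
        if (_samples.Length < count) _samples = new double[count]; // grows rarely
        for (int i = 0; i < count; i++) _samples[i] = i;
        double sum = 0;
        for (int i = 0; i < count; i++) sum += _samples[i];
        return sum / count;
    }

    static void Main()
    {
        Console.WriteLine(AverageAllocating(1000));
        Console.WriteLine(AverageReusing(1000)); // same result, no garbage
    }
}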

In .NET and Java object pooling is an easy way to prevent future object allocations (a minimal pool is sketched below). This works to some extent, but you cannot pool data types like strings, which are immutable and can therefore not be changed later. Perhaps some day we will get mutable strings and UTF-8 strings to keep the memory footprint low and the throughput high. Besides becoming allocation free you can also try to allocate fewer objects (profile your app) or use more value types to reduce the memory footprint, which results in smaller heaps that can be collected faster. In .NET there are also different GC flavors available (server and desktop). The default is the desktop GC. Starting with .NET 4.5 both GC flavors are concurrent and employ a background GC thread that scans for garbage while your application threads are still running, which already reduces latency significantly.
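Back to pooling: for illustration, a minimal thread-safe object pool could look like this. This is a hypothetical helper, not a class in the .NET 4.5 BCL; you rent instead of new and return objects when done:

using System.Collections.Concurrent;

class SimplePool<T> where T : class, new()
{
    readonly ConcurrentBag<T> _items = new ConcurrentBag<T>();

    public T Rent()
    {
        T item;
        // Reuse a pooled instance if one is available, otherwise allocate.
        return _items.TryTake(out item) ? item : new T();
    }

    public void Return(T item)
    {
        _items.Add(item); // callers must reset the object's state themselves
    }
}

The pool trades allocations for the obligation to reset object state; a forgotten Return simply degrades into a normal allocation.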

I have seen an improvement in the run time behavior of larger apps of ca. 15% only by switching from .NET 4.0 to .NET 4.5. A good portion of it can be attributed to the improved GC. Let's check how the GC behaves on my Windows 8.1 box with .NET 4.5.1 with a simple sample application. Here is the code:

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

class Program {
  static int N =
      25 * 1000 * 1000;  // Execute allocations in a loop 25 million times
  static int Clear =
      1000 * 1000;  // Clear list after every 1 million allocations to give GC
                    // 25 chances to clear things up

  static void Main(string[] args) {
    // do some warmup
    AllocateRefContainer();
    GC.Collect();
    GC.Collect();

    var sw = Stopwatch.StartNew();
    ExecuteParallel(() => AllocateRefContainer(), "AllocateRefContainer");
    sw.Stop();
    Console.WriteLine("RefContainer Allocation {0:F2}s, {1:N0} Allocs/s",
                      sw.Elapsed.TotalSeconds, N / sw.Elapsed.TotalSeconds);

    GC.Collect();
    GC.Collect();
    sw = Stopwatch.StartNew();
    ExecuteParallel(() => AllocateValueContainer(), "AllocateValueContainer");
    sw.Stop();
    Console.WriteLine("ValueContainer Allocation {0:F2}s, {1:N0} Allocs/s",
                      sw.Elapsed.TotalSeconds, N / sw.Elapsed.TotalSeconds);

    GC.Collect();
    GC.Collect();
    sw = Stopwatch.StartNew();
    ExecuteParallel(() => AllocateValueTypes(), "AllocateValueTypes");
    sw.Stop();
    Console.WriteLine("ValueType Allocation {0:F2}s, {1:N0} Allocs/s",
                      sw.Elapsed.TotalSeconds, N / sw.Elapsed.TotalSeconds);

    GC.Collect();
    GC.Collect();
    sw = Stopwatch.StartNew();
    ExecuteParallel(() => PooledRefContainer(), "PooledRefContainer");
    sw.Stop();
    Console.WriteLine("PooledRefContainer Allocation {0:F2}s, {1:N0} Allocs/s",
                      sw.Elapsed.TotalSeconds, N / sw.Elapsed.TotalSeconds);
    Console.WriteLine(
        "Private Bytes: {0:N0} MB",
        Process.GetCurrentProcess().PrivateMemorySize64 / (1024 * 1024));
    Console.ReadLine();
  }

  class ReferenceContainer  // Class with one object reference
  {
    public ReferenceContainer Obj;
  }

  static ThreadLocal<List<ReferenceContainer>> RContainer =
      new ThreadLocal<List<ReferenceContainer>>(
          () => new List<ReferenceContainer>());
  static void AllocateRefContainer() {
    var container = RContainer.Value;
    for (int i = 0; i < N; i++) {
      container.Add(new ReferenceContainer());
      Calculate();
      if (i % Clear == 0) {
        container.Clear();
      }
    }
  }

  class ValueContainer  // Class with one pointer sized value type
  {
    IntPtr Obj;
  }
  static ThreadLocal<List<ValueContainer>> VContainer =
      new ThreadLocal<List<ValueContainer>>(() => new List<ValueContainer>());
  static void AllocateValueContainer() {
    var container = VContainer.Value;
    for (int i = 0; i < N; i++) {
      container.Add(new ValueContainer());
      Calculate();
      if (i % Clear == 0) {
        container.Clear();
      }
    }
  }

  // The object header overhead is 8 bytes on x86 and 16 bytes on x64.
  // We add the payload to the struct so it stays equal in size to the classes,
  // which gives a true comparison when the actual amount of data that needs
  // to be moved is equal.
  struct ValueType {
    public IntPtr SyncBlock;
    public IntPtr MT;
    public IntPtr Payload;
  }

  static ThreadLocal<List<ValueType>> ValueTypeContainer =
      new ThreadLocal<List<ValueType>>(() => new List<ValueType>());
  static void AllocateValueTypes() {
    var container = ValueTypeContainer.Value;
    for (int i = 0; i < N; i++) {
      container.Add(new ValueType());
      Calculate();
      if (i % Clear == 0) {
        container.Clear();
      }
    }
  }

  // Uses the ReferenceContainer list but allocates the objects only once and
  // does not clear the list; instead it overwrites the contents
  static void PooledRefContainer() {
    bool bAllocate = true;
    var container = RContainer.Value;
    for (int i = 0; i < N; i++) {
      if (bAllocate) {
        container.Add(new ReferenceContainer());
      } else {
        var tmp = container[i % container.Count];  // Grab one item within the
                                                   // list bounds
        tmp.Obj = null;  // overwrite the existing object instead of allocating
      }

      Calculate();
      if (i % Clear == 0) {
        bAllocate = false;
      }
    }
  }

  // Simulate some CPU only calculation
  static void Calculate() {
    long lret = 0;
    for (int i = 0; i < 100; i++) {
      lret++;
    }
  }

  // Execute a scenario on 4 different threads
  // The strange Directory.GetFiles calls are there to mark the start and
  // stop of one scenario when File access ETW tracing is enabled
  static void ExecuteParallel(Action acc, string scenario) {
    Directory.GetFiles(
        "C:\\", String.Format("ETW ExecuteParallel Start {0}", scenario));
    Parallel.Invoke(Enumerable.Repeat(acc, 4).ToArray());
    Directory.GetFiles("C:\\",
                       String.Format("ETW ExecuteParallel Stop {0}", scenario));
  }
}


We basically allocate 100 million objects on 4 threads concurrently (25 million per thread) with various object designs:

  • Class with one object reference
  • Class with one pointer sized value
  • Struct of the same size as the class
  • Pooled class with one object reference

To enable the server GC you need to add this to your app.config:

<?xml version="1.0" encoding="utf-8" ?>
<configuration>
  <runtime>
    <gcServer enabled="true"/>
  </runtime>
</configuration>
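Whether the setting actually took effect can be checked at runtime via GCSettings from the System.Runtime namespace; a small sketch:

using System;
using System.Runtime;

class GcFlavorCheck
{
    static void Main()
    {
        // Reports which GC flavor the CLR actually chose at startup.
        Console.WriteLine("Server GC:    {0}", GCSettings.IsServerGC);
        Console.WriteLine("Latency mode: {0}", GCSettings.LatencyMode);
    }
}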

GC Behaviors (Server, Desktop in x86/x64)

When you measure the timings you get some interesting graphs.

The first obvious difference is that x64 is faster than x86, for value types by up to a factor of 2. In the profiler we can see that the main difference comes from the relocate phase, where 64 bit achieves a much higher throughput. The numbers on the x axis are the number of samples taken by the kernel.

A net win of 30% only by switching from x86 to x64 is nice. But it comes at the cost of a higher memory footprint (factor 2). When you look closer you see that the effect of server GC is much higher than 30%. The same method did run nearly 6 times faster (e.g. check out RefContainer x64 Default GC vs x64 Server GC)! As with every miracle there are also some hidden costs associated with it. In this case it is memory consumption. When server GC is enabled the CLR allocates a managed heap for each logical core. If you have a 40 core machine with many processes running things concurrently, you will quickly run out of memory if you are not careful in your application design.

The main differences between workstation and server GC are:

  • One GC Thread per core
  • One GC Heap per core

The allocations which previously happened on one heap now happen on 4 of them. Each heap is compacted independently by its own GC thread. In theory we should have 4 times fewer GCs, since each heap can become as big as the original one, and since we now have 4 GC threads we should be able to allocate 16 times faster. This is not the reality, since the memory consumption has some limits, which with this workload seems to be a factor of 3. Besides this, the GC must also cross-check the other heaps because objects can have dependencies across heaps. If no GC heap dependencies had to be checked we should be 12 times faster, but we are “only” 6 times faster, which shows that the heap cross-checks are not free but necessary.

The test suite did consume nearly three times more memory while running nearly 6 times faster. Enabling server GC can be a good thing to do if you have plenty of free memory and you do not allocate huge amounts of data on many different threads.
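A cheap way to see how a workload behaves under the two GC flavors is to compare the number of collections per generation before and after it runs, via GC.CollectionCount. A sketch (RunWorkload is a hypothetical placeholder for your own scenario):

using System;

class GcCounter
{
    static void Main()
    {
        int gen0 = GC.CollectionCount(0);
        int gen1 = GC.CollectionCount(1);
        int gen2 = GC.CollectionCount(2);

        RunWorkload(); // hypothetical placeholder for your allocation scenario

        Console.WriteLine("Gen0: {0}, Gen1: {1}, Gen2: {2}",
            GC.CollectionCount(0) - gen0,
            GC.CollectionCount(1) - gen1,
            GC.CollectionCount(2) - gen2);
    }

    static void RunWorkload()
    {
        for (int i = 0; i < 10 * 1000 * 1000; i++)
        {
            var o = new object();
            GC.KeepAlive(o);
        }
    }
}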

When you look at the CPU consumption with default GC enabled in x64 Release

The GC suspension times have dropped dramatically, and at the same time the summed CPU consumption has dropped by 30%. Except for the memory consumption this is a big win in terms of GC induced pause times and overall application performance. But you need to be aware that your now very scalable app will spike to 100% CPU from time to time to perform full GCs, which can become a problem if you have other work waiting to be done which also needs all cores.
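If those full GC spikes hurt at specific points, .NET 4.5 also lets you ask the GC to avoid blocking generation 2 collections for a while via GCSettings.LatencyMode. This is a hint, not a guarantee, and it trades latency for memory growth; a sketch:

using System.Runtime;

class LowLatencyScope
{
    static void TimeCriticalSection()
    {
        GCLatencyMode old = GCSettings.LatencyMode;
        GCSettings.LatencyMode = GCLatencyMode.SustainedLowLatency; // .NET 4.5+
        try
        {
            // time critical work that should not be interrupted by
            // blocking generation 2 collections
        }
        finally
        {
            GCSettings.LatencyMode = old; // always restore the previous mode
        }
    }
}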

A very useful visualization is to select the GC suspension events and then right click in the graph to enable highlighting of the current selection (the green bars). Now you can search in the other graphs for e.g. periods of no CPU activity to find out whether your application was doing nothing because of GC induced latency or for some other reason. This helps a lot to differentiate between GC issues and other problems in your application logic.

I have beefed up my original GC regions file with GC suspension visualization which now looks like this:

GC_SuspendRegions.xml

<?xml version='1.0' encoding='utf-8' standalone='yes'?>
<?Copyright (c) Microsoft Corporation. All rights reserved.?>
<InstrumentationManifest>
  <Instrumentation>
    <Regions>
      <RegionRoot Guid="{d8d639a0-cf4c-45fb-976a-0000DEADBEEF}" Name="GC" FriendlyName="GC Activity">
        <Region Guid="{d8d639a1-cf4c-45fb-976a-100000000101}" Name="GCs" FriendlyName="GCs">
          <Region Guid="{d8d639a1-cf4c-45fb-976a-000000000001}" Name="GCStart" FriendlyName="GC Start">
            <Start>
              <Event Provider="{e13c0d23-ccbc-4e12-931b-d9cc2eee27e4}" Id="1" Version="1"/>
            </Start>
            <Stop>
              <Event Provider="{e13c0d23-ccbc-4e12-931b-d9cc2eee27e4}" Id="2" Version="1"/>
            </Stop>
          </Region>
        </Region>
        <Region Guid="{d8d639a2-cf4c-45fb-976a-000000000003}" Name="GCSuspends" FriendlyName="GC Suspensions">
          <Region Guid="{d8d639a2-cf4c-45fb-976a-000000000002}" Name="GCSuspend" FriendlyName="GC Suspension">
            <Start>
              <Event Provider="{e13c0d23-ccbc-4e12-931b-d9cc2eee27e4}" Id="9" Version="1"/>
            </Start>
            <Stop>
              <Event Provider="{e13c0d23-ccbc-4e12-931b-d9cc2eee27e4}" Id="3" Version="1"/>
            </Stop>
          </Region>
        </Region>
      </RegionRoot>
    </Regions>
  </Instrumentation>
</InstrumentationManifest>

To be able to record the GC events with the easy to use WPRUI tool I have created an extra profile:

GC.wprp

<?xml version="1.0" encoding="utf-8"?>
<WindowsPerformanceRecorder Version="1.0">
  <Profiles>
    <EventCollector Id="EventCollector_MyEventSource" Name="GC">
      <BufferSize Value="2048"/>
      <Buffers Value="1024"/>
    </EventCollector>
    <EventProvider Id=".NETCommonLanguageRuntimeGCTracking" Name="e13c0d23-ccbc-4e12-931b-d9cc2eee27e4">
      <Keywords>
        <!-- GC 0x1 and 0x8000 Exceptions -->
        <Keyword Value="8001"/>
      </Keywords>
    </EventProvider>
    <Profile Id="GC.Verbose.File" Name="GC" Description=".NET GC Tracking" Base="" LoggingMode="File" DetailLevel="Verbose">
      <Collectors Operation="Add">
        <SystemCollectorId Value="SystemCollector_WPRSystemCollectorInFile">
          <SystemProviderId Value="SystemProvider_Base"/>
        </SystemCollectorId>
        <EventCollectorId Value="EventCollector_MyEventSource">
          <EventProviders>
            <EventProviderId Value=".NETCommonLanguageRuntimeGCTracking"/>
            <EventProviderId Value="EventProvider_DotNetProvider"/>
          </EventProviders>
        </EventCollectorId>
      </Collectors>
    </Profile>
    <Profile Id="GC.Verbose.Memory" Name="GC" Description=".NET GC Tracking" Base="GC.Verbose.File" LoggingMode="Memory" DetailLevel="Verbose"/>
  </Profiles>
</WindowsPerformanceRecorder>

When you load this file into WPRUI it enables the necessary GC ETW events, so you can track GC pause times with the region file above.

Armed with these tools and visualization capabilities it should be easy to see whether enabling server GC helps at critical points of your application. You will not measure any difference if your scenario is IO or network bound and very few GCs are happening. But if your scenario is a GC heavy operation, you can check whether these spots can be improved by simply switching a configuration flag in your App.config or whether you need to rewrite your allocation strategy.

Did I forget anything? Yes. How do you navigate in an ETL file when your application has no ETW provider enabled? You only need to know that the kernel already has an ETW provider for all file IO operations. If you want to issue an ETW event at the start/stop of a scenario you are interested in, it is sufficient to do a

Directory.GetFiles("C:\\",
                   String.Format("ETW ExecuteParallel Start {0}", scenario));

to get nice ETW events you can search for when File IO tracing is enabled. Searching for a nonexistent file is quite a fast operation which takes about 20us per call on my machine. The ETW collection overhead was within the usual run-to-run measurement differences and not a visible effect at all. This is of course not always true. If you have a network intensive app running and you use PerfView, you also get network correlation tracing enabled by default, which adds quite some overhead to all socket operations. In the CPU profiling data networking operations then look much more expensive than they really are. You always need to take a sharp look at your data to check whether it is reasonable.
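If you want to verify the cost of such a marker call on your own machine, a quick micro-benchmark is enough (the 20us figure above is from my machine; the search pattern string is arbitrary):

using System;
using System.Diagnostics;
using System.IO;

class MarkerCost
{
    static void Main()
    {
        const int Runs = 10000;
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < Runs; i++)
        {
            // Search for a file that does not exist; this only emits
            // kernel File IO ETW events and returns an empty array.
            Directory.GetFiles("C:\\", "ETW marker test that matches nothing");
        }
        sw.Stop();
        Console.WriteLine("{0:F1} us per marker call",
            sw.Elapsed.TotalMilliseconds * 1000.0 / Runs);
    }
}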

When you are doing sample profiling you should have a look at the article from Vance to make sure you have enough samples in your data, so that you are seeing a real effect and not chasing some ghost. I hope that I have convinced you that the CLR emits much more useful data than you were ever able to collect from simple performance counters, which can give you only very limited insights.

posted on Monday, March 24, 2014 11:24 AM

This article is part of the GWB Archives. Original Author: Alois Kraus
