Alois Kraus

blog

  Home  |   Contact  |   Syndication    |   Login
  104 Posts | 8 Stories | 292 Comments | 162 Trackbacks

News



Article Categories

Archives

Post Categories

Image Galleries

Programming

Last time we did have a look into the issues you get if you create many class instances. The more class instances you have the more complex is your object graph which means more work for the GC. If the GC has more work there is less CPU time for your application left which does make it slower. An easy way to fix this issue is to create a struct which is a value type which is copied by value and does not add complexity to your object graph. Common wisdom suggests that your structs should not get bigger much than the actual reference type size (16 bytes on 32 bit, 24 bytes for 64 bit). This quite big limitation renders structs useless as general purpose data access object containers. Really?

The key insight is that we do not need to store the actual data inside the value type. We must only be able to access it from our struct. This way we can employ the struct as forwarder which does not contain the actual data but uses indexes to retrieve the actual instance data from a DataProvider object.

    public class DataProviderSimple : IList<ForwardingEmptyValue>
    {
        internal List<double> Value1 = new List<double>();
        internal List<int> Value2 = new List<int>();

        #region IList<ForwardingEmptyValue> Members
        public ForwardingEmptyValue this[int index]
        {
            get
            {
                return new ForwardingEmptyValue(this, index); // return forwarding struct to save memory
            }
            set
            {
                throw new NotImplementedException();
            }
        }
    }
    public struct ForwardingEmptyValue
    {
        public ForwardingEmptyValue(DataProviderSimple provider, int row)
        {
            _Provider = provider;
            _Row = row;
        }

        public double V1
        {
            get { return _Provider.Value1[_Row]; }
        }

        public int V2
        {
            get { return _Provider.Value2[_Row]; }
        }

        DataProviderSimple _Provider;
        int _Row;
    }
 

Ok this looks better but we have not gained much since we still have a reference to our provider object. The best solution I came up with so far was to replace the reference to the data provider with a byte which is an index to a static list of data provider objects. We still need to know from which row to read our data from which will result in a 5 byte value type object which can give you an arbitrary amount of data.

Here is my current version of a forwarding struct:

    [StructLayout(LayoutKind.Sequential, Pack=1)]
    public struct ForwardingEmptyValueStruct
    {
        int _Row;
        byte _ProviderIdx;
        

        public ForwardingEmptyValueStruct(byte providerIdx, int row)
        {
            _ProviderIdx = providerIdx;
            _Row = row;
        }

        public double V1
        {
            get { return DataProvider._DataProviders[_ProviderIdx].Value1[_Row];  }
        }

        public int V2
        {
            get { return DataProvider._DataProviders[_ProviderIdx].Value2[_Row];  }
        }
    }

I had to tweak the layout of the struct a bit since .NET has the habit to fill any gaps between natural alignment with zeros which would have resulted in an 8 byte struct instead of 5 bytes. As you can see there are no more object references left which will come in handy if we have millions of instances hanging around. As a general rule are index based data structures more more GC friendly than reference (read pointer) based structures. The Microsoft guys do know this of course which explains why highly efficient data structures like Dictionary<T> stores its values internally in index based structs.

[StructLayout(LayoutKind.Sequential)]
private struct Entry
{
    public int hashCode;
    public int next;
    public TKey key;
    public TValue value;
}

The differences can be substantial if your application is in a high throughput scenarios where you read large amounts of data to calculate some intermediate value (statistics, filter, …) and discard the data immediately after it had been processed. To illustrate this fact I have written a DataProvider which uses a forwarding struct as its DAO object and another one which uses a forwarding class object.

***** UnitTests.DataProviderTests.Read_Via_Class
Did read 10000000 lines in 2,96s Managed Memory: 320.017.224


***** UnitTests.DataProviderTests.Read_Via_Struct
Did read 10000000 lines in 0,84s, Managed Memory: 170.004.248

There is over a factor 3 difference between these two if you are in a high GC scenario. The actual benefit may range from 10% up to factors.

The test I did was

        [Test]
        public void Read_Via_Struct()
        {
            var sw = Stopwatch.StartNew();
            long start = GC.GetTotalMemory(true);
            using (var prov = new DataProvider(NumericData))
            { 
                prov.ReadData();
                Console.WriteLine("Reading did take: {0:F2}", sw.Elapsed.TotalSeconds);
                sw = Stopwatch.StartNew();
                var array = CopyToArray(prov);
                DoGCHeavyThings();
                sw.Stop();
                Console.WriteLine("Did read {0} lines in {1:F2}s, Managed Memory: {2:N0}", prov.Count, sw.Elapsed.TotalSeconds, GC.GetTotalMemory(true)-start);
            }
        }

        T[] CopyToArray<T>(IList<T> data)
        {
            T[] ret = new T[data.Count];
            data.CopyTo(ret, 0);
            return ret;
        }

        private void DoGCHeavyThings()
        {
            List<List<string>> lists = new List<List<string>>();
            for (int i = 0; i < 200 * 1000; i++)
            {
                var lst =  new List<string>(new string [] { new string("Test".ToCharArray()) } );
                lists.Add(lst);
            }
        }

Besides the much better speed we need with the forwarding struct solution needed nearly half of the memory since for values type the 12 (20 on x64) byte object header for class instances does not exist. This may be a major factor if you read millions of rows from a large data file.

 

Below is the source code for a real world DataProvider which reads data from a file which has one double and int per row.

1,1 0
2,1 1
3,1 2
4,1 3
5,1 4

….

I have found it useful to implement IList so I can read the n-th object via the index operator in an intuitive manner. Besides this I can change the underlying data structures much easier if I use the .Count and [] as prime data access method. So far we have only dealt with value types like int or double inside our test files. If we do know more about the actual structure of the data we do read we can save much more memory than already. I have written a simple (okok but it is fast as hell) trace viewer which uses this method and I could beat an already optimized viewer in terms of reading speed by a factor 6 (thanks to the simplified object model and other tricks) and memory consumption by a factor 3. Next time we have a look at the data we read and how to reduce the memory consumption with bit vectors, reference caches and indexed lists.

Below is the implementation of a simple DataProvider which does use the defined ForwardingEmptyValueStruct as data access object. You can store the structs inside an array, sort them and do everything you like like with usual classes. The forwarding struct has reference type semantics which means if you mutate the data through the struct all others will see the change as well since the data is not part of the struct.

 

    /// <summary>
    /// Read a file which has a double and integer value in each line separated by a space.
    /// The read data can be accessed via the index operator. The returned values are forwarding
    /// structs to use as little memory as possible. After you are done you should dispose the dataprovider as usual.
    /// </summary>
    public class DataProvider : IList<ForwardingEmptyValueStruct>, IDisposable
    {
        internal List<double> Value1 = new List<double>();
        internal List<int> Value2 = new List<int>();

        static object _Lock = new object();
        internal static DataProvider[] _DataProviders = new DataProvider[0]; // static list of data providers
        byte _ThisIdx;                                                       // index to _Dataproviders array of current instance
        FileStream _dataSource;                                              // Opened file

        public DataProvider(string file)
        {
            _dataSource = new FileStream(file, FileMode.Open, FileAccess.Read, FileShare.ReadWrite);
            AddThis();
        }

        unsafe public void ReadData()
        {
            string line;
            double v1;
            int v2;
            using (StreamReader reader = new StreamReader(_dataSource))
            {
                while ((line = reader.ReadLine()) != null)
                {
                    int space = 0;
                    fixed (char* pStr = line)
                    {
                        for (int i = 0; i < line.Length; i++)
                        {
                            if (pStr[i] == ' ')
                            {
                                space = i;
                                break;
                            }
                            else if (pStr[i] == ',')
                            {
                                pStr[i] = '.';
                            }
                        }

                        if (space != line.Length)
                        {
                            v1 = double.Parse(new string(pStr,0, space));
                            v2 = IntParseFast(pStr+space, line.Length-(space+1));
                            Value1.Add(v1);
                            Value2.Add(v2);
                        }
                    }

         
                }
            }
            Value1.TrimExcess();
            Value2.TrimExcess();
        }


        unsafe static int IntParseFast(char *pStr, int len)
        {
            int result = 0;
            for (int i = 0; i < len; i++)
            {
                char letter = pStr[i];
                result = 10 * result + (letter - 48);
            }
            return result;
        }

        private void AddThis()
        {
            lock (_Lock)
            {
                var prov = new List<DataProvider>(_DataProviders);

                if (prov.Count != byte.MaxValue)
                {
                    _ThisIdx = (byte)prov.Count;
                    prov.Add(this);
                }
                else // more than 256 files open 
                {
                    int tmpIdx = prov.FindIndex(x => x == null);
                    if (tmpIdx == -1)
                    {
                        throw new NotSupportedException("The dataprovider does support only 256 simultanously open files. If you did not expect this error search for missing dispose calls for unused provider objects.");
                    }
                    _ThisIdx = (byte)tmpIdx;
                    prov[_ThisIdx] = this;

                }
                _DataProviders = prov.ToArray();
            }
        }

        #region IDisposable Members

        public void Dispose()
        {
            lock (_Lock)
            {
                _dataSource.Close();
                _DataProviders[_ThisIdx] = null;
                this.Value1 = null;
                this.Value2 = null;
                if (_DataProviders.All(x => x == null))  // no more files open, reset byte array
                {
                    _DataProviders = new DataProvider[0];
                }
            }
        }

        #endregion



        #region IList<ForwardingEmptyValueStruct> Members

        public int IndexOf(ForwardingEmptyValueStruct item)
        {
            throw new NotImplementedException();
        }

        public void Insert(int index, ForwardingEmptyValueStruct item)
        {
            throw new NotImplementedException();
        }

        public void RemoveAt(int index)
        {
            throw new NotImplementedException();
        }

        public ForwardingEmptyValueStruct this[int index]
        {
            get
            {
                return new ForwardingEmptyValueStruct(_ThisIdx, index); // return forwarding struct to save memory
            }
            set
            {
                throw new NotImplementedException();
            }
        }

        #endregion

        #region ICollection<ForwardingEmptyValueStruct> Members

        public void Add(ForwardingEmptyValueStruct item)
        {
            throw new NotImplementedException();
        }

        public void Clear()
        {
            Value1.Clear();
            Value2.Clear();
        }

        public bool Contains(ForwardingEmptyValueStruct item)
        {
            throw new NotImplementedException();
        }

        public void CopyTo(ForwardingEmptyValueStruct[] array, int arrayIndex)
        {
            for (int i = arrayIndex; i < this.Count; i++)
            {
                array[i] = this[i];
            }
        }

        public int Count
        {
            get { return Value1.Count; }
        }

        public bool IsReadOnly
        {
            get { return true; }
        }

        public bool Remove(ForwardingEmptyValueStruct item)
        {
            throw new NotImplementedException();
        }

        #endregion

        #region IEnumerable<ForwardingEmptyValueStruct> Members

        public IEnumerator<ForwardingEmptyValueStruct> GetEnumerator()
        {
            for (int i = 0; i < this.Count; i++)
            {
                yield return this[i];
            }
        }

        #endregion

        #region IEnumerable Members

        System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator()
        {
            throw new NotImplementedException();
        }

        #endregion
    }
posted on Thursday, August 18, 2011 11:24 AM