Alois Kraus

blog

  Home  |   Contact  |   Syndication    |   Login
  104 Posts | 8 Stories | 292 Comments | 162 Trackbacks

News



Article Categories

Archives

Post Categories

Image Galleries

Programming

In my last post I did show how you can consume much less memory if you read values into a list which have often the same value. The IndexedList class uses internally a reference cache similar to the String.Intern method. But what if we do read value types such as int, enum, int64, … into a list which do repeat themself also very often? Can we store an enum which can have four values in less memory than an integer? Sure we can. Lets have a look at the following data

Row Values
0 1
1 2
2 3
3 1
4 2
….  

 

The enum in our sample data can take only 3 distinct values so we can use bit arrays to represent each value by its own bit array

Row BitArray 1 BitArray 2 BitArray 3
0 1 0 0
1 0 1 0
2 0 0 1
3 1 0 0
4 0 1 0
….      

 

This way we consume for each distinct value only one bit for each row. In the BCL there is a BitArray class implemented but I did find a Word Aligned BitArray implementation which claims to be much faster than the one of the BCL so I did use this one. If we try to compress a list of integers which has more than 32 distinct values in the data we do read then we would consume more memory as the original data. It is therefore a good idea to limit the number of bit arrays in use and switch back to a conventional list if more distinct values are found.

The best thing about this CompressedIndex is that although there may be many different values possible. Only for the actually occurring values we do allocate a bitvector. If you read e.g. a 500 MB data file which do contain the logs for a process with the id 5100 then we will allocate only one bit array for the value 5100. In the end we use only one bit for each row instead of 32 bits if we would have stored the process id as integer.

I do think you cannot get much better than this. But we have still plenty of room to optimize…

During my journey into the BCL I have found BitVector32 which was unkown to me until now. It does wrap an uint value which allows you to store up to 32 boolean flags in one BitVector32 instance. On the other hand it is no wonder why have never found this class. I did not care so deeply about memory until recently. The old C/C++ tricks are still the best and can be used in C# as well where necessary.

Here is the code for the CompressedIndex

    /// <summary>
    /// Create an index which uses for each distinct value a bitarray. More different values are stored than he has bits
    /// we switch to a plain list.
    /// </summary>
    /// <typeparam name="TKey">Storage type. Must be implementing GetHashCode and equals properly!</typeparam>
    class CompressedIndex<TKey> where TKey : struct
    {
        public CompressedIndex()
        {
            Type t = typeof(TKey);
            if (t.IsEnum) // Marshal can only get the size of primitive types. 
            {
                t = Enum.GetUnderlyingType(t);
            }

            _MaxBits = Marshal.SizeOf(t) * 8;
        }

        public void Add(int row, TKey key)
        {
            // we have more distinct values than the value has bits we need to swicht to a plain list
            if (_Cache != null && _Cache.Count > _MaxBits) 
            {
                ConvertCache();
            }

            if (_Cache != null)
            {
                // get bitarray for this value
                WAHBitArray bitArray;
                if (!_Cache.TryGetValue(key, out bitArray))
                {
                    bitArray = new WAHBitArray();
                    _Cache[key] = bitArray;
                }
                bitArray.Set(row, true);  // set at row index for this value the bitarray value to 1
                _MaxIndex++;
            }
            else
            {
                TKey value = key;
                _ListCache.Add(value);  // store for the same value the same reference to the list
            }
        }

        public TKey[] UniqueValues()
        {
            if (_Cache != null)
            {
                return _Cache.Keys.ToArray();
            }
            else
            {
                HashSet<TKey> valueTypeValues = new HashSet<TKey>();
                List<TKey> uniqueValues = new List<TKey>();
                foreach (var val in _ListCache)
                {
                    if( valueTypeValues.Add(val) )
                    {
                        uniqueValues.Add(val);
                    }
                }

                return uniqueValues.ToArray();
            }
        }

        public TKey this[int index]
        {
            get
            {
                if (_Cache != null)
                {
                    foreach (var kvp in _Cache) // get all bitarrays
                    {
                        if (kvp.Value.Get(index) == true) // The bitarray which has 1 marks the actual value
                        {
                            return kvp.Key;
                        }
                    }

                    return default(TKey);
                }
                else
                {
                    return _ListCache[index];
                }
            }
        }

        /// <summary>
        /// We have reached the limit. Switch back to List
        /// </summary>
        void ConvertCache()
        {
            var newCache = new List<TKey>(_MaxIndex + 1);
            for (int i = 0; i < _MaxIndex; i++)
            {
                newCache.Add(this[i]);
            }
            _ListCache = newCache;
            _Cache = null;
            _MaxIndex = 0;
        }

        /// <summary>
        /// Trim internal buffers to the exact count. After this
        /// operation it can be accessed read only by multiple threads.
        /// </summary>
        public void TrimExcess()
        {
            if (_ListCache != null)
            {
                _ListCache.TrimExcess();
            }

            if (_Cache != null)
            {
                foreach(var kvp in _Cache )
                {
                    // get last value which can cause the bitvector to resize itself
                    // to its final size. Only then the bitvector is thread safe for read only 
                    // access.
                    kvp.Value.Get(_MaxIndex); 
                    kvp.Value.TrimExcess();
                }
            }
        }

        Dictionary<TKey, WAHBitArray> _Cache = new Dictionary<TKey, WAHBitArray>();
        int _MaxIndex = 0;

        List<TKey> _ListCache;
        static int _MaxBits;  // threshold value 
    }

posted on Friday, August 19, 2011 2:36 AM