Alois Kraus

blog

  Home  |   Contact  |   Syndication    |   Login
  111 Posts | 8 Stories | 296 Comments | 162 Trackbacks

News



Article Categories

Archives

Post Categories

Image Galleries

Programming

When you want to store millions of strings in a list you will notice in your memory profiler that the biggest memory consumers strings and lists are. If you e.g read many lines from a file with repetitive data inside it you can use a reference cache to consume the actual memory for the same string only once. The CLR offers String.Intern to do exactly that but the string reference cache is cross AppDomain static which is mostly never a good choice. Where could you use this? If you parse log files where many rows of data are often the same like type name, method name, … it can save you a lot of memory.

If you use a List<string> and intern all of your read strings you still store references in your list which will make the object graph more complicated and GC cycles slower. Besides this a reference does cost you 4/8 bytes in 32/64 bit processes. A generic read only index based list to which you simply add your data can make a huge difference. I could read a 200 MB log file in 17s by using a List<string> and it did drop to 10s after switching to IndexedList. Not bad for such a simple thing. But be warned it does only help in your specific case you you now that you are adding redundant data to it. IndexedList will consume twice as much memory compared to List<string> if all strings are different!

As added bonus you just have made searching inside your list a lot faster (at least for strings) by taking advantage of the fact that the same value is the same object. It is therefore enough to check for reference equality. String.Equals does this already so there is no need to use a custom comparer. But for custom objects you should check out if they do implement Equals properly.

As usual her comes the code:

    /// <summary>
    /// Memory conserving list which helps to spare memory if many identical values are added.
    /// Internally it uses integers to lookup its values from a hash table. The base structure is an integer list which 
    /// performs much better if millions of entries with only a few thousand distinct values are added. 
    /// </summary>
    /// <typeparam name="T">List Type</typeparam>
    class IndexedList<T> : IList<T> where T : class
    {
        Dictionary<T, T> _ReferenceCache = new Dictionary<T, T>();            // Use the same reference if the value is the same
        Dictionary<object, int> _Value2Index = new Dictionary<object, int>(); // use reference equality for dictionary
        Dictionary<int, T> _Index2Value = new Dictionary<int, T>();
        List<int> _List = new List<int>();                                    // The list holds at i the index which is the key to _Index2Value 

        public IndexedList()
        {
        }

        #region IList<T> Members

        public int IndexOf(T item)
        {
            throw new NotImplementedException();
        }

        public void Insert(int index, T item)
        {
            throw new NotImplementedException();
        }

        public void RemoveAt(int index)
        {
            throw new NotImplementedException();
        }

        public T this[int index]
        {
            get
            {
                return _Index2Value[_List[index]];
            }
            set
            {
                throw new NotImplementedException();
            }
        }

        #endregion

        #region ICollection<T> Members

        public void Add(T item)
        {
            T cachedReference = null;
            int idx = -1;
            if (!_ReferenceCache.TryGetValue(item, out cachedReference))
            {
                _ReferenceCache[item] = item;
                idx = _ReferenceCache.Count;
                _Index2Value[idx] = item;
                _Value2Index[item] = idx;
            }
            else
            {
                idx = _Value2Index[cachedReference];
            }
            _List.Add(idx);
        }

        public void Clear()
        {
            _Index2Value.Clear();
            _ReferenceCache.Clear();
            _Value2Index.Clear();
        }

        public bool Contains(T item)
        {
            return _ReferenceCache.ContainsKey(item);
        }

        public void CopyTo(T[] array, int arrayIndex)
        {
            throw new NotImplementedException();
        }

        public int Count
        {
            get { return _List.Count;  }
        }

        public bool IsReadOnly
        {
            get { return true; }
        }

        public bool Remove(T item)
        {
            throw new NotImplementedException();
        }

        #endregion

        #region IEnumerable<T> Members

        public IEnumerator<T> GetEnumerator()
        {
            throw new NotImplementedException();
        }

        #endregion

        #region IEnumerable Members

        System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator()
        {
            throw new NotImplementedException();
        }

        #endregion

        internal IEnumerable<T> UniqueValues()
        {
            foreach (var key in _ReferenceCache.Keys)
            {
                yield return key;
            }
        }

        internal void TrimExcess()
        {
            _List.TrimExcess();
        }
    }
posted on Thursday, August 18, 2011 8:02 PM