So, one of our profilers has a problem. Red Gate produces two .NET profilers - ANTS Performance Profiler (APP) and ANTS Memory Profiler (AMP). Both products help .NET developers solve problems they are virtually guaranteed to encounter at some point in their careers - slow code, and high memory usage, respectively.
Everyone understands slow code - the symptoms are very obvious (an operation takes 2 hours when it should take 10 seconds), you know when you've solved it (the same operation now takes 15 seconds), and everyone understands how you can use a profiler like APP to help solve your particular problem. High memory usage is a much more subtle and misunderstood concept.
How can .NET have memory leaks?
The garbage collector, and how the CLR uses and frees memory, is one of the most misunderstood concepts in .NET. There's hundreds of blog posts out there covering various aspects of the GC and .NET memory, some of them helpful, some of them confusing, and some of them are just plain wrong. There's a lot of misconceptions out there. And, if you have got an application that uses far too much memory, it can be hard to wade through all the contradictory information available to even get an idea as to what's going on, let alone trying to solve it.
That's where a memory profiler, like AMP, comes into play. Unfortunately, that's not the end of the issue. .NET memory management is a large, complicated, and misunderstood problem. Even armed with a profiler, you need to understand what .NET is doing with your objects, how it processes them, and how it frees them, to be able to use the profiler effectively to solve your particular problem.
And that's what's wrong with AMP - even with all the thought, designs, UX sessions, and research we've put into AMP itself, some users simply don't have the knowledge required to be able to understand what AMP is telling them about how their application uses memory, and so they have problems understanding & solving their memory problem.
Ricky Leeks
This is where Ricky Leeks comes in. Created by one of the many...colourful...people in Red Gate, he headlines and promotes several tutorials, pages, and articles all with information on how .NET memory management actually works, with the goal to help educate developers on .NET memory management. And educating us all on how far you can push various vegetable-based puns. This, in turn, not only helps them understand and solve any memory issues they may be having, but helps them proactively code against such memory issues in their existing code.
Ricky's latest outing is an interview on .NET Rocks, providing information on the Top 5 .NET Memory Management Gotchas, along with information on a free ebook on .NET Memory Management. Don't worry, there's loads more vegetable-based jokes where those came from...
Cross posted from Simple Talk.
I came across ThreadLocal<T> while I was researching ConcurrentBag. To look at it, it doesn't really make much sense. What's all those extra Cn classes doing in there? Why is there a GenericHolder<T,U,V,W> class? What's going on? However, digging deeper, it's a rather ingenious solution to a tricky problem.
Thread statics
Declaring that a variable is thread static, that is, values assigned and read from the field is specific to the thread doing the reading, is quite easy in .NET:
[ThreadStatic]
private static string s_ThreadStaticField;
ThreadStaticAttribute is not a pseudo-custom attribute; it is compiled as a normal attribute, but the CLR has in-built magic, activated by that attribute, to redirect accesses to the field based on the executing thread's identity.
TheadStaticAttribute provides a simple solution when you want to use a single field as thread-static. What if you want to create an arbitary number of thread static variables at runtime? Thread-static fields can only be declared, and are fixed, at compile time. Prior to .NET 4, you only had one solution - thread local data slots. This is a lesser-known function of Thread that has existed since .NET 1.1:
LocalDataStoreSlot threadSlot = Thread.AllocateNamedDataSlot("slot1");
string value = "foo";
Thread.SetData(threadSlot, value);
string gettedValue = (string)Thread.GetData(threadSlot);
Each instance of LocalStoreDataSlot mediates access to a single slot, and each slot acts like a separate thread-static field.
As you can see, using thread data slots is quite cumbersome. You need to keep track of LocalDataStoreSlot objects, it's not obvious how instances of LocalDataStoreSlot correspond to individual thread-static variables, and it's not type safe. It's also relatively slow and complicated; the internal implementation consists of a whole series of classes hanging off a single thread-static field in Thread itself, using various arrays, lists, and locks for synchronization. ThreadLocal<T> is far simpler and easier to use.
ThreadLocal
ThreadLocal provides an abstraction around thread-static fields that allows it to be used just like any other class; it can be used as a replacement for a thread-static field, it can be used in a List<ThreadLocal<T>>, you can create as many as you need at runtime. So what does it do? It can't just have an instance-specific thread-static field, because thread-static fields have to be declared as static, and so shared between all instances of the declaring type. There's something else going on here.
The values stored in instances of ThreadLocal<T> are stored in instantiations of the GenericHolder<T,U,V,W> class, which contains a single ThreadStatic field (s_value) to store the actual value. This class is then instantiated with various combinations of the Cn types for generic arguments.
In .NET, each separate instantiation of a generic type has its own static state. For example, GenericHolder<int,C0,C1,C2> has a completely separate s_value field to GenericHolder<int,C1,C14,C1>. This feature is (ab)used by ThreadLocal to emulate instance thread-static fields.
Every time an instance of ThreadLocal is constructed, it is assigned a unique number from the static s_currentTypeId field using Interlocked.Increment, in the FindNextTypeIndex method. The hexadecimal representation of that number then defines the specific Cn types that instantiates the GenericHolder class. That instantiation is therefore 'owned' by that instance of ThreadLocal.
This gives each instance of ThreadLocal its own ThreadStatic field through a specific unique instantiation of the GenericHolder class. Although GenericHolder has four type variables, the first one is always instantiated to the type stored in the ThreadLocal<T>. This gives three free type variables, each of which can be instantiated to one of 16 types (C0 to C15). This puts an upper limit of 4096 (163) on the number of ThreadLocal<T> instances that can be created for each value of T. That is, there can be a maximum of 4096 instances of ThreadLocal<string>, and separately a maximum of 4096 instances of ThreadLocal<object>, etc.
However, there is an upper limit of 16384 enforced on the total number of ThreadLocal instances in the AppDomain. This is to stop too much memory being used by thousands of instantiations of GenericHolder<T,U,V,W>, as once a type is loaded into an AppDomain it cannot be unloaded, and will continue to sit there taking up memory until the AppDomain is unloaded. The total number of ThreadLocal instances created is tracked by the ThreadLocalGlobalCounter class.
So what happens when either limit is reached? Firstly, to try and stop this limit being reached, it recycles GenericHolder type indexes of ThreadLocal instances that get disposed using the s_availableIndices concurrent stack. This allows GenericHolder instantiations of disposed ThreadLocal instances to be re-used. But if there aren't any available instantiations, then ThreadLocal falls back on a standard thread local slot using TLSHolder. This makes it very important to dispose of your ThreadLocal instances if you'll be using lots of them, so the type instantiations can be recycled.
The previous way of creating arbitary thread-static variables, thread data slots, was slow, clunky, and hard to use. In comparison, ThreadLocal can be used just like any other type, and each instance appears from the outside to be a non-static thread-static variable. It does this by using the CLR type system to assign each instance of ThreadLocal its own instantiated type containing a thread-static field, and so delegating a lot of the bookkeeping that thread data slots had to do to the CLR type system itself! That's a very clever use of the CLR type system.
Cross posted from Simple Talk.
So, you want to obfuscate your .NET application. My question to you is:
Why?
What are your aims when your obfuscate your application? To protect your IP & algorithms? Prevent crackers from breaking your licensing? Your boss says you need to? To give you a warm fuzzy feeling inside? Obfuscating code correctly can be tricky, it can break your app if applied incorrectly, it can cause problems down the line. Let me be clear - there are some very good reasons why you would want to obfuscate your .NET application. However, you shouldn't be obfuscating for the sake of obfuscating.
Security through Obfuscation?
Once your application has been installed on a user’s computer, you no longer control it. If they do not want to pay for your application, then nothing can stop them from cracking it, even if the time cost to them is much greater than the cost of actually paying for it. Some people will not pay for software, even if it takes them a month to crack a $30 app. And once it is cracked, there is nothing stopping them from putting the result up on the internet.
There should be nothing suprising about this; there is no software protection available for general-purpose computers that cannot be cracked by a sufficiently determined attacker. Only by completely controlling the entire stack – software, hardware, and the internet connection, can you have even a chance to be uncrackable. And even then, someone somewhere will still have a go, and probably succeed. Even high-end cryptoprocessors have known vulnerabilities that can be exploited by someone with a scanning electron microscope and lots of free time.
So, then, why use obfuscation? Well, the primary reason is to protect your IP. What obfuscation is very good at is hiding the overall structure of your program, so that it’s very hard to figure out what exactly the code is doing at any one time, what context it is running in, and how it fits in with the rest of the application; all of which you need to do to understand how the application operates. This is completely different to cracking an application, where you simply have to find a single toggle that determines whether the application is licensed or not, and flip it without the rest of the application noticing.
However, again, there are limitations. An obfuscated application still has to run in the same way, and do the same thing, as the original unobfuscated application. This means that some of the protections applied to the obfuscated assembly have to be undone at runtime, else it would not run on the CLR and do the same thing. And, again, since we don’t control the environment the application is run on, there is nothing stopping a user from undoing those protections manually, and reversing some of the obfuscation.
It’s a perpetual arms race, and it always will be. We have plenty of ideas lined about new protections, and the new protections added in SA 6.6 (method parent obfuscation and a new control flow obfuscation level) are specifically designed to be harder to reverse and reconstruct the original structure.
So then, by all means, obfuscate your application if you want to protect the algorithms and what the application does. That’s what SmartAssembly is designed to do. But make sure you are clear what a .NET obfuscator can and cannot protect you against, and don’t expect your obfuscated application to be uncrackable. Someone, somewhere, will crack your application if they want to and they don’t have anything better to do with their time. The best we can do is dissuade the casual crackers and make it much more difficult for the serious ones.
Cross posted from Simple Talk.
So, day 1 of DevWeek. Lots and lots of Windows 8 and WinRT, as you would expect. The keynote had some actual content in it, fleshed out some of the details of how your apps linked into the Metro infrastructure, and confirmed that there would indeed be an enterprise version of the app store available for Metro apps.)
However, that's, not what I want to focus this post on. What I do want to focus on is this:
Windows 8 does not make .NET developers obsolete.
Phew!
.NET in the New Ecosystem
In all the hype around Windows 8 the past few months, a lot of developers have got the impression that .NET has been sidelined in Windows 8; C++ and COM is back in vogue, and HTML5 + JavaScript is the New Way of writing applications. You know .NET? It's yesterday's tech. Enter the 21st Century and write <div>! However, after speaking to people at the conference, and after a couple of talks by Dave Wheeler on the innards of WinRT and how .NET interacts with it, my views on the coming operating system have changed somewhat.
To summarize what I've picked up, in no particular order (none of this is official, just my sense of what's been said by various people):
- Metro apps do not replace desktop apps. That is, Windows 8 fully supports .NET desktop applications written for every other previous version of Windows, and will continue to do so in the forseeable future. There are some apps that simply do not fit into Metro. They do not fit into the touch-based paradigm, and never will. Traditional desktop support is not going away anytime soon.
- The reason Silverlight has been hidden in all the Metro hype is that Metro is essentially based on Silverlight design principles. Silverlight developers will have a much easier time writing Metro apps than desktop developers, as they would already be used to all the principles of sandboxing and separation introduced with Silverlight. It's desktop developers who are going to have to adapt how they work.
- .NET + XAML is equal to HTML5 + JS in importance. Although the underlying WinRT system is built on C++ & COM, most application development will be done either using .NET or HTML5. Both systems have their own wrapper around the underlying WinRT infrastructure, hiding the implementation details.
- The CLR is unchanged; it's the same .NET 4.5 CLR, running IL in .NET assemblies, as on the traditional desktop. The thing that changes between desktop and Metro is the class libraries, which have more in common with the Silverlight libraries than the desktop libraries. In Metro, although all the types look and behave the same to callers, some of the core BCL types are now wrappers around their WinRT equivalents. These wrappers are then enhanced using standard .NET types and code to produce the Metro .NET class libraries.
- You can't simply port a desktop app into Metro. The underlying file IO, network, timing and database access is either completely different or simply missing. Similarly, although the UI is programmed using XAML, the behaviour of the Metro XAML is different to WPF or Silverlight XAML. Furthermore, the new design principles and touch-based interface for Metro applications demand a completely new UI. You will be able to re-use sections of your app encapsulating pure program logic, but everything else will need to be written from scratch. Microsoft has taken the opportunity to remove a whole raft of types and methods from the Metro framework that are obsolete (non-generic collections) or break the sandbox (synchronous APIs); if you use these, you will have to rewrite to use the alternatives, if they exist at all, to move your apps to Metro.
- If you want to write public WinRT components in .NET, there are some quite strict rules you have to adhere to. But the compilers know about these rules; you can write them in C# or VB, and the compilers will tell you when you do something that isn't allowed and deal with the translation to WinRT metadata rather than .NET assemblies.
- It is possible to write a class library that can be used in Metro and desktop applications. However, you need to be very careful not to use types that are available in one but not the other. One can imagine developers writing their own abstraction around file IO and UIs (MVVM anyone?) that can be implemented differently in Metro and desktop, but look the same within your shared library.
So, if you're a .NET developer, you have a lot less to worry about. .NET is a viable platform on Metro, and traditional desktop apps are not going away. You don't have to learn HTML5 and JavaScript if you don't want to. Hurray!
Cross posted from Simple Talk.
Unlike the other concurrent collections, ConcurrentBag does not really have a non-concurrent analogy. As stated in the MSDN documentation, ConcurrentBag is optimised for the situation where the same thread is both producing and consuming items from the collection. We'll see how this is the case as we take a closer look. Again, I recommend you have ConcurrentBag open in a decompiler for reference.
Thread Statics
ConcurrentBag makes heavy use of thread statics - static variables marked with ThreadStaticAttribute. This is a special attribute that instructs the CLR to scope any values assigned to or read from the variable to the executing thread, not globally within the AppDomain.
This means that if two different threads assign two different values to the same thread static variable, one value will not overwrite the other, and each thread will see the value they assigned to the variable, separately to any other thread. This is a very useful function that allows for ConcurrentBag's concurrency properties.
You can think of a thread static variable:
[ThreadStatic]
private static int m_Value;
as doing the same as:
private static Dictionary<Thread, int> m_Values;
where the executing thread's identity is used to automatically set and retrieve the corresponding value in the dictionary.
In .NET 4, this usage of ThreadStaticAttribute is encapsulated in the ThreadLocal class.
Lists of lists
ConcurrentBag, at its core, operates as a linked list of linked lists:
Each outer list node is an instance of ThreadLocalList, and each inner list node is an instance of Node. Each outer ThreadLocalList is owned by a particular thread, accessible through the thread local m_locals variable:
private ThreadLocal<ThreadLocalList<T>> m_locals
It is important to note that, although the
m_locals variable is thread-local, that only applies to accesses through that variable. The objects referenced by the thread (each instance of the
ThreadLocalList object) are normal heap objects that are not specific to any thread. Thinking back to the Dictionary analogy above, if each value stored in the dictionary could be accessed by other means, then any thread could access the value belonging to other threads using that mechanism. Only reads and writes to the variable defined as thread-local are re-routed by the CLR according to the executing thread's identity.
So, although m_locals is defined as thread-local, the m_headList, m_nextList and m_tailList variables aren't. This means that any thread can access all the thread local lists in the collection by doing a linear search through the outer linked list defined by these variables.
Adding items
So, onto the collection operations. First, adding items. This one's pretty simple. If the current thread doesn't already own an instance of ThreadLocalList, then one is created (or, if there are lists owned by threads that have stopped, it takes control of one of those). Then the item is added to the head of that thread's list.
That's it. Don't worry, it'll get more complicated when we account for the other operations on the list!
Taking & Peeking items
This is where it gets tricky. If the current thread's list has items in it, then it peeks or removes the head item (not the tail item) from the local list and returns that. However, if the local list is empty, it has to go and steal another item from another list, belonging to a different thread. It iterates through all the thread local lists in the collection using the m_headList and m_nextList variables until it finds one that has items in it, and it steals one item from that list. Up to this point, the two threads had been operating completely independently. To steal an item from another thread's list, the stealing thread has to do it in such a way as to not step on the owning thread's toes.
Recall how adding and removing items both operate on the head of the thread's linked list? That gives us an easy way out - a thread trying to steal items from another thread can pop in round the back of another thread's list using the m_tail variable, and steal an item from the back without the owning thread knowing anything about it. The owning thread can carry on completely independently, unaware that one of its items has been nicked.
However, this only works when there are at least 3 items in the list, as that guarantees there will be at least one node between the owning thread performing operations on the list head and the thread stealing items from the tail - there's no chance of the two threads operating on the same node at the same time and causing a race condition.
If there's less than three items in the list, then there does need to be some synchronization between the two threads. In this case, the lock on the ThreadLocalList object is used to mediate access to a thread's list when there's the possibility of contention.
Thread synchronization
In ConcurrentBag, this is done using several mechanisms:
- Operations performed by the owner thread only take out the lock when there are less than three items in the collection. With three or greater items, there won't be any conflict with a stealing thread operating on the tail of the list.
- If a lock isn't taken out, the owning thread sets the list's
m_currentOp variable to a non-zero value for the duration of the operation. This indicates to all other threads that there is a non-locked operation currently occuring on that list.
- The stealing thread always takes out the lock, to prevent two threads trying to steal from the same list at the same time.
- After taking out the lock, the stealing thread spinwaits until
m_currentOp has been set to zero before actually performing the steal. This ensures there won't be a conflict with the owning thread when the number of items in the list is on the 2-3 item borderline. If any add or remove operations are started in the meantime, and the list is below 3 items, those operations try to take out the list's lock and are blocked until the stealing thread has finished.
This allows a thread to steal an item from another thread's list without corrupting it. What about synchronization in the collection as a whole?
Collection synchronization
Any thread that operates on the collection's global structure (accessing anything outside the thread local lists) has to take out the collection's global lock - m_globalListsLock. This single lock is sufficient when adding a new thread local list, as the items inside each thread's list are unaffected. However, what about operations (such as Count or ToArray) that need to access every item in the collection?
In order to ensure a consistent view, all operations on the collection are stopped while the count or ToArray is performed. This is done by freezing the bag at the start, performing the global operation, and unfreezing at the end:
- The global lock is taken out, to prevent structural alterations to the collection.
m_needSync is set to true. This notifies all the threads that they need to take out their list's lock irregardless of what operation they're doing.
- All the list locks are taken out in order. This blocks all locking operations on the lists.
- The freezing thread waits for all current lockless operations to finish by spinwaiting on each
m_currentOp field.
The global operation can then be performed while the bag is frozen, but no other operations can take place at the same time, as all other threads are blocked on a list's lock. Then, once the global operation has finished, the locks are released, m_needSync is unset, and normal concurrent operation resumes.
Concurrent principles
That's the essence of how ConcurrentBag operates. Each thread operates independently on its own local list, except when they have to steal items from another list. When stealing, only the stealing thread is forced to take out the lock; the owning thread only has to when there is the possibility of contention. And a global lock controls accesses to the structure of the collection outside the thread lists. Operations affecting the entire collection take out all locks in the collection to freeze the contents at a single point in time.
So, what principles can we extract here?
- Threads operate independently
Thread-static variables and ThreadLocal makes this easy. Threads operate entirely concurrently on their own structures; only when they need to grab data from another thread is there any thread contention.
- Minimised lock-taking
Even when two threads need to operate on the same data structures (one thread stealing from another), they do so in such a way such that the probability of actually blocking on a lock is minimised; the owning thread always operates on the head of the list, and the stealing thread always operates on the tail.
- Management of lockless operations
Any operations that don't take out a lock still have a 'hook' to force them to lock when necessary. This allows all operations on the collection to be stopped temporarily while a global snapshot is taken. Hopefully, such operations will be short-lived and infrequent.
That's all the concurrent collections covered. I hope you've found it as informative and interesting as I have. Next, I'll be taking a closer look at ThreadLocal, which I came across while analyzing ConcurrentBag. As you'll see, the operation of this class deserves a much closer look.
Cross posted from Simple Talk
For those interested, myself and a few other people from Red Gate will
be going to DevWeek 2012 in London this week. I'll be mostly around the .NET and C# talks, but may drop into the architecture and agile talks as well. I'll be blogging interesting stuff I come across as well. If you want to meet up, do
feel free to contact me via the blog or on twitter at
@simonmcooper.
See you there!
Cross posted from Simple Talk.
It is an oft-repeated maxim that you shouldn't add methods to a publically-released interface in an API. Recently, I was hit hard when this wasn't followed.
As part of the work on ApplicationMetrics, I've been implementing auto-reporting of MVC action methods; whenever an action was called on a controller, ApplicationMetrics would automatically report it without the developer needing to add manual ReportEvent calls. Fortunately, MVC provides easy hook when a controller is created, letting me log when it happens - the IControllerFactory interface.
Now, the dll we provide to instrument an MVC webapp has to be compiled against .NET 3.5 and MVC 1, as the lowest common denominator. This MVC 1 dll will still work when used in an MVC 2, 3 or 4 webapp because all MVC 2+ webapps have a binding redirect redirecting all references to previous versions of System.Web.Mvc to the correct version, and type forwards taking care of any moved types in the new assemblies. Or at least, it should.
IControllerFactory
In MVC 1 and 2, IControllerFactory was defined as follows:
public interface IControllerFactory
{
IController CreateController(RequestContext requestContext, string controllerName);
void ReleaseController(IController controller);
}
So, to implement the logging controller factory, we simply wrap the existing controller factory:
internal sealed class LoggingControllerFactory : IControllerFactory
{
private readonly IControllerFactory m_CurrentController;
public LoggingControllerFactory(IControllerFactory currentController)
{
m_CurrentController = currentController;
}
public IController CreateController(
RequestContext requestContext, string controllerName)
{
// log the controller being used
FeatureSessionData.ReportEvent("Controller used:", controllerName);
return m_CurrentController.CreateController(requestContext, controllerName);
}
public void ReleaseController(IController controller)
{
m_CurrentController.ReleaseController(controller);
}
}
Easy. This works as expected in MVC 1 and 2. However, in MVC 3 this type was throwing a TypeLoadException, saying a method wasn't implemented. It turns out that, in MVC 3, the definition of IControllerFactory was changed to this:
public interface IControllerFactory
{
IController CreateController(RequestContext requestContext, string controllerName);
SessionStateBehavior GetControllerSessionBehavior(
RequestContext requestContext, string controllerName);
void ReleaseController(IController controller);
}
There's a new method in the interface. So when our MVC 1 dll was redirected to reference System.Web.Mvc v3, LoggingControllerFactory tried to implement version 3 of IControllerFactory, was missing the GetControllerSessionBehaviour method, and so couldn't be loaded by the CLR.
Implementing the new method
Fortunately, there was a workaround. Because interface methods are normally implemented implicitly in the CLR, if we simply declare a virtual method matching the signature of the new method in MVC 3, then it will be ignored in MVC 1 and 2 and implement the extra method in MVC 3:
internal sealed class LoggingControllerFactory : IControllerFactory
{
...
public virtual SessionStateBehaviour GetControllerSessionBehaviour(
RequestContext requestContext, string controllerName) {}
...
}
However, this also has problems - the SessionStateBehaviour type only exists in .NET 4, and we're limited to .NET 3.5 by support for MVC 1 and 2.
This means that the only solutions to support all MVC versions are:
- Construct the
LoggingControllerFactory type at runtime using reflection
- Produce entirely separate dlls for MVC 1&2 and MVC 3.
Ugh. And all because of that blasted extra method!
Another solution?
Fortunately, in this case, there is a third option - System.Web.Mvc also provides a DefaultControllerFactory type that can provide the implementation of GetControllerSessionBehaviour for us in MVC 3, while still allowing us to override CreateController and ReleaseController.
However, this does mean that LoggingControllerFactory won't be able to wrap any calls to GetControllerSessionBehaviour. This is an acceptable bug, given the other options, as very few developers will be overriding GetControllerSessionBehaviour in their own custom controller factory.
So, if you're providing an interface as part of an API, then please please please don't add methods to it. Especially if you don't provide a 'default' implementing type. Any code compiled against the previous version that can't be updated will have some very tough decisions to make to support both versions.
Cross-posted from Simple Talk.
Recently, I've been doing some work involving cryptography, and encountered the standard .NET CryptographicException: 'Padding is invalid and cannot be removed.' Searching on StackOverflow produces 57 questions concerning this exception; it's a very common problem encountered. So I decided to have a closer look.
To test this, I created a simple project that decrypts and encrypts a byte array:
// create some random data
byte[] data = new byte[100];
new Random().NextBytes(data);
// use the Rijndael symmetric algorithm
RijndaelManaged rij = new RijndaelManaged();
byte[] encrypted;
// encrypt the data using a CryptoStream
using (var encryptor = rij.CreateEncryptor())
using (MemoryStream encryptedStream = new MemoryStream())
using (CryptoStream crypto = new CryptoStream(
encryptedStream, encryptor, CryptoStreamMode.Write))
{
crypto.Write(data, 0, data.Length);
encrypted = encryptedStream.ToArray();
}
byte[] decrypted;
// and decrypt it again
using (var decryptor = rij.CreateDecryptor())
using (CryptoStream crypto = new CryptoStream(
new MemoryStream(encrypted), decryptor, CryptoStreamMode.Read))
{
byte[] decrypted = new byte[data.Length];
crypto.Read(decrypted, 0, decrypted.Length);
}
Sure enough, I got exactly the same CryptographicException when trying to decrypt the data even in this simple example. Well, I'm obviously missing something, if I can't even get this single method right! What does the exception message actually mean? What am I missing?
Well, after playing around a bit, I discovered the problem was fixed by changing the encryption step to this:
// encrypt the data using a CryptoStream
using (var encryptor = rij.CreateEncryptor())
using (MemoryStream encryptedStream = new MemoryStream())
{
using (CryptoStream crypto = new CryptoStream(
encryptedStream, encryptor, CryptoStreamMode.Write))
{
crypto.Write(data, 0, data.Length);
}
encrypted = encryptedStream.ToArray();
}
Aaaah, so that's what the problem was. The CryptoStream wasn't flushing all it's data to the MemoryStream before it was being read, and closing the stream causes it to flush everything to the backing stream. But why does this cause an error in padding?
Cryptographic padding
All symmetric encryption algorithms (of which Rijndael is one) operates on fixed block sizes. For Rijndael, the default block size is 16 bytes. This means the input needs to be a multiple of 16 bytes long. If it isn't, then the input is padded to 16 bytes using one of the padding modes. This is only done to the final block of data to be encrypted.
CryptoStream has a special method to flush this final block of data - FlushFinalBlock. Calling Stream.Flush() does not flush the final block, as you might expect. Only by closing the stream or explicitly calling FlushFinalBlock is the final block, with any padding, encrypted and written to the backing stream. Without this call, the encrypted data is 16 bytes shorter than it should be.
If this final block wasn't written, then the decryption gets to the final 16 bytes of the encrypted data and tries to decrypt it as the final block with padding. The end bytes don't match the padding scheme it's been told to use, therefore it throws an exception stating what is wrong - what the decryptor expects to be padding actually isn't, and so can't be removed from the stream.
So, as well as closing the stream before reading the result, an alternative fix to my encryption code is the following:
// encrypt the data using a CryptoStream
using (var encryptor = rij.CreateEncryptor())
using (MemoryStream encryptedStream = new MemoryStream())
using (CryptoStream crypto = new CryptoStream(
encryptedStream, encryptor, CryptoStreamMode.Write))
{
crypto.Write(data, 0, data.Length);
// explicitly flush the final block of data
crypto.FlushFinalBlock();
encrypted = encryptedStream.ToArray();
}
Conclusion
So, if your padding is invalid, make sure that you close or call FlushFinalBlock on any CryptoStream performing encryption before you access the encrypted data. Flush isn't enough. Only then will the final block be present in the encrypted data, allowing it to be decrypted successfully.
Cross posted from Simple Talk.
Using locks to implement a thread-safe collection is rather like using a sledgehammer - unsubtle, easy to understand, and tends to make any other tool redundant. Unlike the previous two collections I looked at, ConcurrentStack and ConcurrentQueue, ConcurrentDictionary uses locks quite heavily. However, it is careful to wield locks only where necessary to ensure that concurrency is maximised.
This will, by necessity, be a higher-level look than my other posts in this series, as there is quite a lot of code and logic in ConcurrentDictionary. Therefore, I do recommend that you have ConcurrentDictionary open in a decompiler to have a look at all the details that I skip over.
The problem with locks
There's several things to bear in mind when using locks, as encapsulated by the lock keyword in C# and the System.Threading.Monitor class in .NET (if you're unsure as to what lock does in C#, I briefly covered it in my first post in the series):
- Locks block threads
The most obvious problem is that threads waiting on a lock can't do any work at all. No preparatory work, no 'optimistic' work like in ConcurrentQueue and ConcurrentStack, nothing. It sits there, waiting to be unblocked. This is bad if you're trying to maximise concurrency.
- Locks are slow
Whereas most of the methods on the Interlocked class can be compiled down to a single CPU instruction, ensuring atomicity at the hardware level, taking out a lock requires some heavy lifting by the CLR and the operating system. There's quite a bit of work required to take out a lock, block other threads, and wake them up again. If locks are used heavily, this impacts performance.
- Deadlocks
When using locks there's always the possibility of a deadlock - two threads, each holding a lock, each trying to aquire the other's lock. Fortunately, this can be avoided with careful programming and structured lock-taking, as we'll see.
So, it's important to minimise where locks are used to maximise the concurrency and performance of the collection.
Implementation
As you might expect, ConcurrentDictionary is similar in basic implementation to the non-concurrent Dictionary, which I studied in a previous post. I'll be using some concepts introduced in that post, so I recommend you have a quick read of it.
So, if you were implementing a thread-safe dictionary, what would you do? The naive implementation is to simply have a single lock around all methods accessing the dictionary. This would work, but doesn't allow much concurrency.
Fortunately, the bucketing used by Dictionary allows a simple but effective improvement to this - one lock per bucket. This allows different threads modifying different buckets to do so in parallel. Any thread making changes to the contents of a bucket takes the lock for that bucket, ensuring those changes are thread-safe. The method that maps each bucket to a lock is the GetBucketAndLockNo method:
private void GetBucketAndLockNo(
int hashcode, out int bucketNo, out int lockNo, int bucketCount) {
// the bucket number is the hashcode (without the initial sign bit)
// modulo the number of buckets
bucketNo = (hashcode & 0x7fffffff) % bucketCount;
// and the lock number is the bucket number modulo the number of locks
lockNo = bucketNo % m_locks.Length;
}
However, this does require some changes to how the buckets are implemented. The 'implicit' linked list within a single backing array used by the non-concurrent Dictionary adds a dependency between separate buckets, as every bucket uses the same backing array. Instead, ConcurrentDictionary uses a strict linked list on each bucket:
This ensures that each bucket is entirely separate from all other buckets; adding or removing an item from a bucket is independent to any changes to other buckets.
Modifying the dictionary
All the operations on the dictionary follow the same basic pattern:
void AlterBucket(TKey key, ...) {
int bucketNo, lockNo;
1: GetBucketAndLockNo(
key.GetHashCode(), out bucketNo, out lockNo, m_buckets.Length);
2: lock (m_locks[lockNo]) {
3: Node headNode = m_buckets[bucketNo];
4: Mutate the node linked list as appropriate
}
}
For example, when adding another entry to the dictionary, you would iterate through the linked list to check whether the key exists already, and add the new entry as the head node. When removing items, you would find the entry to remove (if it exists), and remove the node from the linked list. Adding, updating, and removing items all follow this pattern.
Performance issues
There is a problem we have to address at this point. If the number of buckets in the dictionary is fixed in the constructor, then the performance will degrade from O(1) to O(n) when a large number of items are added to the dictionary. As more and more items get added to the linked lists in each bucket, the lookup operations will spend most of their time traversing a linear linked list.
To fix this, the buckets array has to be resized once the number of items in each bucket has gone over a certain limit. (In ConcurrentDictionary this limit is when the size of the largest bucket is greater than the number of buckets for each lock. This check is done at the end of the TryAddInternal method.)
Resizing the bucket array and re-hashing everything affects every bucket in the collection. Therefore, this operation needs to take out every lock in the collection. Taking out mutiple locks at once inevitably summons the spectre of the deadlock; two threads each hold a lock, and each trying to acquire the other lock.
How can we eliminate this? Simple - ensure that threads never try to 'swap' locks in this fashion. When taking out multiple locks, always take them out in the same order, and always take out all the locks you need before starting to release them. In ConcurrentDictionary, this is controlled by the AcquireLocks, AcquireAllLocks and ReleaseLocks methods. Locks are always taken out and released in the order they are in the m_locks array, and locks are all released right at the end of the method in a finally block.
At this point, it's worth pointing out that the locks array is never re-assigned, even when the buckets array is increased in size. The number of locks is fixed in the constructor by the concurrencyLevel parameter. This simplifies programming the locks; you don't have to check if the locks array has changed or been re-assigned before taking out a lock object. And you can be sure that when a thread takes out a lock, another thread isn't going to re-assign the lock array. This would create a new series of lock objects, thus allowing another thread to ignore the existing locks (and any threads controlling them), breaking thread-safety.
Consequences of growing the array
Just because we're using locks doesn't mean that race conditions aren't a problem. We can see this by looking at the GrowTable method. The operation of this method can be boiled down to:
private void GrowTable(Node[] buckets) {
try {
1: Acquire first lock in the locks array
// this causes any other thread trying to take out
// all the locks to block because the first lock in the array
// is always the one taken out first
// check if another thread has already resized the buckets array
// while we were waiting to acquire the first lock
2: if (buckets != m_buckets) return;
3: Calculate the new size of the backing array
4: Node[] array = new array[size];
5: Acquire all the remaining locks
6: Re-hash the contents of the existing buckets into array
7: m_buckets = array;
}
finally {
8: Release all locks
}
}
As you can see, there's already a check for a race condition at step 2, for the case when the GrowTable method is called twice in quick succession on two separate threads. One will successfully resize the buckets array (blocking the second in the meantime), when the second thread is unblocked it'll see that the array has already been resized & exit without doing anything.
There is another case we need to consider; looking back at the AlterBucket method above, consider the following situation:
- Thread 1 calls
AlterBucket; step 1 is executed to get the bucket and lock numbers.
- Thread 2 calls
GrowTable and executes steps 1-5; thread 1 is blocked when it tries to take out the lock in step 2.
- Thread 2 re-hashes everything, re-assigns the buckets array, and releases all the locks (steps 6-8).
- Thread 1 is unblocked and continues executing, but the calculated bucket and lock numbers are no longer valid.
Between calculating the correct bucket and lock number and taking out the lock, another thread has changed where everything is. Not exactly thread-safe. Well, a similar problem was solved in ConcurrentStack and ConcurrentQueue by storing a local copy of the state, doing the necessary calculations, then checking if that state is still valid. We can use a similar idea here:
void AlterBucket(TKey key, ...) {
while (true) {
Node[] buckets = m_buckets;
int bucketNo, lockNo;
GetBucketAndLockNo(
key.GetHashCode(), out bucketNo, out lockNo, buckets.Length);
lock (m_locks[lockNo]) {
// if the state has changed, go back to the start
if (buckets != m_buckets) continue;
Node headNode = m_buckets[bucketNo];
Mutate the node linked list as appropriate
}
break;
}
}
TryGetValue and GetEnumerator
And so, finally, we get onto TryGetValue and GetEnumerator. I've left these to the end because, well, they don't actually use any locks.
How can this be? Whenever you change a bucket, you need to take out the corresponding lock, yes? Indeed you do. However, it is important to note that TryGetValue and GetEnumerator don't actually change anything. Just as immutable objects are, by definition, thread-safe, read-only operations don't need to take out a lock because they don't change anything. All lockless methods can happily iterate through the buckets and linked lists without worrying about locking anything.
However, this does put restrictions on how the other methods operate. Because there could be another thread in the middle of reading the dictionary at any time (even if a lock is taken out), the dictionary has to be in a valid state at all times. Every change to state has to be made visible to other threads in a single atomic operation (all relevant variables are marked volatile to help with this).
This restriction ensures that whatever the reading threads are doing, they never read the dictionary in an invalid state (eg items that should be in the collection temporarily removed from the linked list, or reading a node that has had it's key & value removed before the node itself has been removed from the linked list).
Fortunately, all the operations needed to change the dictionary can be done in that way. Bucket resizes are made visible when the new array is assigned back to the m_buckets variable. Any additions or modifications to a node are done by creating a new node, then splicing it into the existing list using a single variable assignment. Node removals are simply done by re-assigning the node's m_next pointer.
Because the dictionary can be changed by another thread during execution of the lockless methods, the GetEnumerator method is liable to return dirty reads - changes made to the dictionary after GetEnumerator was called, but before the enumeration got to that point in the dictionary. It's worth listing at this point which methods are lockless, and which take out all the locks in the dictionary to ensure they get a consistent view of the dictionary:
Lockless:
TryGetValue
GetEnumerator
- The indexer getter
ContainsKey
Takes out every lock (lockfull?):
Count
IsEmpty
Keys
Values
CopyTo
ToArray
Concurrent principles
That covers the overall implementation of ConcurrentDictionary. I haven't even begun to scratch the surface of this sophisticated collection. That I leave to you. However, we've looked at enough to be able to extract some useful principles for concurrent programming:
Partitioning
When using locks, the work is partitioned into independant chunks, each with its own lock. Each partition can then be modified concurrently to other partitions.
Ordered lock-taking
When a method does need to control the entire collection, locks are taken and released in a fixed order to prevent deadlocks.
Lockless reads
Read operations that don't care about dirty reads don't take out any lock; the rest of the collection is implemented so that any reading thread always has a consistent view of the collection.
That leads us to the final collection in this little series - ConcurrentBag. Lacking a non-concurrent analogy, it is quite different to any other collection in the class libraries. Prepare your thinking hats!
ConcurrentQueue is, like ConcurrentStack, a lockless collection, in that it is implemented without using any locks at all. However, the semantics required for a queue impose a quite different approach; unlike ConcurrentStack, which has a single point of concurrent contention, a queue can be changed at both the head and tail. This means that at least two variables are involved, most likely more. The simple approach of atomically modifying a single variable won't work here.
What does System.Collections.Generic.Queue do?
Well, let's have a look at the non-concurrent queue. This is implemented as a circular buffer. However, this design is very hard to use in a lockless way; when a new item is enqueued and the backing array is full, the array is doubled in size to accommodate the new item.
To resize the array, the entire contents of the queue has to be read sequentially without using locks and copied into a brand new array by a single thread, during which other threads can enqueue and dequeue items many times. This is very tricky to implement without being subject to race conditions and still ensure high performance.
So, instead, ConcurrentQueue forgoes the circular buffer, and uses a linear array instead. More specifically, it needs to be a linear array with a (conceptually) infinite capacity - although there will only be a finite number of items stored in the queue, items can be enqueued and dequeued an unlimited number of times. Such an array can't be created as a single object.
However, you don't need to keep the entire infinite array in memory, you only need the section of the array that is actually storing items. By splitting the array into finite segments you can create and discard segments as items get enqueued and dequeued to the queue. This is what ConcurrentQueue does.
Queue Segments
To demonstrate, I'll be using a queue with a segment size of 4. To start off with, three items are enqueued. These are added to the leftmost available slot of the tail segment:
Three more items are then enqueued. A new segment is created and assigned to the tail:
And another three:
Five items are then dequeued from the head. The head segment is now empty, so it is discarded:
Using this method, the illusion of an infinite linear array can be created using a finite number of segments.
Achieving thread-safety
ConcurrentQueue uses a segment size of 32, and along with the backing array each segment stores the index items are to be added (m_high) and where they are to be removed (m_low):
private class Segment {
volatile T[] m_array;
volatile int m_high;
volatile int m_low;
}
Thread-safety, in this case, depends on the operation of both enqueuing and dequeuing. So we'll look at the operations in isolation, then put them both together.
Enqueuing
To enqueue an item, you need to:
- Increment the
m_high variable. This is now the index the item should be inserted at.
- Assign the item to the specified slot in
m_array.
The key to this is step 1. If step 1 can be done atomically, returning the new incremented value, that gives each thread trying to enqueue an item at the same time separate slots. Step 2 can then be performed concurrently by each thread without having to worry about conflicts.
Interlocked.Increment performs this increment atomically:
public void Enqueue(T item) {
int insertIndex = Interlocked.Increment(ref m_high);
m_array[insertIndex] = item;
}
That's it! Simple, really...
Dequeuing
Dequeuing acts in a similar way to enqueuing:
- Increment the
m_low variable. The previous index is the index of the item to be dequeued.
- Return the item stored at the specified slot in
m_array.
And, similarly, if each thread trying to dequeue an item can be atomically 'assigned' a slot by performing step 1 using
Interlocked.Increment, then step 2 can be performed concurrently by several threads:
public T Dequeue() {
// Increment returns the incremented value of the variable
int index = Interlocked.Increment(ref m_low);
return m_array[index-1];
}
But hang on, if we're dequeuing items, we need to check whether the queue is empty first. So checks need to be added to determine when a queue is empty. This depends on the segment state that represents an empty segment. In the case of ConcurrentQueue, this state is when m_low > m_high.
The exact reasons for this specific state representing an empty segment are a gory implementation detail; if you're wondering the reasons for this, then start by working out exactly what m_high and m_low represent.
public bool TryDequeue(out T item) {
if (m_low > m_high) {
item = default(T);
return false;
}
int index = Interlocked.Increment(ref m_low);
item = array[index-1];
return true;
}
All good? Well, no. We've actually introduced a subtle race condition. If we've got a queue with one item in it, and two threads trying to dequeue that one item, then if both threads perform the if (m_low > m_high) check before m_low has been incremented, one will dequeue the item and one will successfully dequeue on an empty queue.
In ConcurrentStack, a similar race condition was solved by using Interlocked.CompareExchange; taking a snapshot, doing some work, then atomically making that work visible to other threads only if the state is still valid. By judicious use of local variables and loops, we can do the same thing here:
public bool TryDequeue(out T item) {
int index;
do {
index = m_low;
// check if the queue is empty
if (index > m_high) {
item = default(T);
return false;
}
// I think m_low has the same value as index. If it is, replace it with index+1.
// Return to me what m_low actually was.
if (Interlocked.CompareExchange(ref m_low, index+1, index) == index) {
// success - index is now a valid index to dequeue, specific to this thread
item = m_array[index]
return true;
}
// m_low has been changed in the meantime. Go back to the start and try again.
}
while (true);
}
In this case, CompareExchange is acting as a conditional Increment.
Combining the two
This is where it gets complicated. The most obvious issue is another race condition, between enqueuing and dequeuing. To reiterate, the steps performed are:
Enqueuing:
- Increment the
m_high variable. This is now the index the item should be inserted at.
- Assign the item to the specified slot in
m_array.
Dequeuing:
- Check if the queue is empty (
m_low > m_high). If it is, return false.
- Increment
m_low (if m_low hasn't changed) and return the item.
Start with an empty queue. Thread 1 executes step 1 of enqueuing; m_high is incremented. Thread 2 then tries to dequeue an item; step 1 of dequeuing thinks the queue has items because m_low <= m_high. Step 2 of dequeuing then executes, and the item dequeued is the null value, because thread 1 hasn't yet assigned the enqueued item to the backing array.
Hmm, how to solve this? There's more than one variable involved here; the race condition involves the relationship between two variables, m_high and m_low, not a single variable being changed behind the scenes, which was the problem solved previously using CompareExchange and loops.
Well, we can't swap round the order of operations for enqueuing, because step 1 is what atomically assigns a slot when several threads are enqueuing items at once.
What we need is some sort of notification that enqueuing has finished assigned the item to the backing array. Then the dequeuing thread can wait for that notification before dequeuing the item from the queue:
Enqueuing
- Increment the
m_high variable. This is now the index the item should be inserted at.
- Assign the item to the specified slot in
m_array.
- Set state indicating the item at that index has been enqueued.
Dequeuing:
- Check if the queue is empty (
m_high >= m_low). If it is, return false.
- Increment
m_low (if m_low hasn't changed).
- Wait for the state to be set at the index to dequeue.
- Return the item.
This is the role played by the m_state array in ConcurrentQueue.Segment. Each segment has an int[] m_state alongside the T[] m_array; a 1 in a particular slot indicates the item has been set, and a 0 (the default value) indicates it isn't set. This turns the Enqueue method into this:
public void Enqueue(T item) {
int insertIndex = Interlocked.Increment(ref m_high);
m_array[insertIndex] = item;
m_state[insertIndex] = 1;
}
and TryDequeue into this:
public bool TryDequeue(out T item) {
int index;
do {
index = m_low;
if (index > m_high) {
item = default(T);
return false;
}
if (Interlocked.CompareExchange(ref m_low, index+1, index) == index) {
// busywait until m_state[index] is set
while (m_state[index] == 0) {}
item = m_array[index]
return true;
}
}
while (true);
}
By marking both m_array and m_state as volatile, this stops the JIT reordering accesses to those arrays, ensuring that the invariant implicit in this code (m_state[index] == 1 if and only if m_array[index] has an item in it) doesn't get broken by the JIT or processors reordering instructions.
Tweaking the control flow a bit, and sprinkling calls to SpinWait.SpinOnce() around as the abstraction of a busywait, and you've got the code of the enqueue and dequeue methods of ConcurrentQueue.
Final notes
There's a couple of things I need to briefly mention. Firstly, I haven't yet covered growing the queue. Creating new segments when the current segment gets full is largely handled by enqueuing. When an item is added to the last slot in a segment, the Enqueue method creates a new segment and assigns it as the next segment in the queue using the m_next variable. When the dequeue method reaches the end of the current segment, it waits for the m_next variable to be set before looking inside that segment to see if there's an item to dequeue.
Secondly, the eagle-eyed among you might have noticed that dequeued items don't get removed from the backing array; they are still referenced from the array when items are dequeued. I can't see any specific threading reason for this, other than giving one less location for possible threading issues. However, more importantly, this guarantees that the invariant I mentioned above is never be broken, which can be important if you're reasoning about the collection.
Keeping references around longer than they need isn't so big an issue; the items referenced in the array are dereferenced all together when the containing segment is dereferenced. However it is something to bear in mind if you have large objects referenced in a ConcurrentQueue for a long time; they might not be garbage collected when they otherwise could be.
Concurrent principles
So what principles can we extract from this examination of ConcurrentQueue?
Separating threads
Increment and CompareExchange are used to atomically assign each thread performing an action a separate slot. Each thread can then do work on that slot concurrently without conflicting with each other.
Invariants
An invariant is used to eliminate a race condition between enqueuing and dequeuing, in this case, m_state[index] == 1 iff m_array[index] has an item in it. This ensures enqueuing and dequeuing on the same slot works as it should. Note that this invariant is maintained even at the cost of keeping things in memory longer than they should.
volatile used as a memory barrier
Furthermore, volatile is used to add memory barriers ensuring the JIT doesn't reorder m_state and m_array array accesses and inadvertantly break the invariant.
Well, that's the core of the ConcurrentQueue class. I haven't even begun to cover the supporting methods, and glossed over most of the gory details. These I leave for you to explore and figure out. It's time to move on! We finally get some locks involved, and peer under the covers at ConcurrentDictionary.
Cross posted from Simple Talk