Note: Joe's stated goal at the beginning of this talk was to melt our brains in the following hour. From the volume of notes, you can see he worked hard at it.
Concurrency and State (The Pitfall of Shared Memory)
Threads in a process share access to memory. This makes it easy -- even enticing -- to share information. However, it can be had to tell what is shared -- private variables are not shared, but shared (static) variables are explicitly shared.
.NET prevents you from shared ByRefs across thread stacks.
Managing State
1. Isolation/confinement
- memory space is partitioned
- no two threads ever access the same state
- good: no overhead, easy to reason about
- bad: sharing is often needed, leading to message passing
2. Immutability
- data is read only
- good: no overhead
- bad: C# and VB.NET encourage mutability [lineage]
- bad: copying means efficiency can be a challenge
- good: see F# for promising advances
3. Synchronization
- lock access to shared state
- good: flexible, programming technigues remain similar
- bad: perf overhead, deadlocks, races, etc
Combinations of these techniques can be used. Synchronization is most often used.
Sharing Hazards
1. unrepeatable read: multiple threads accessing the same thread could result in reading different values at different times, even within the same context and when that data is not changed within that context
2. dirty read: a variable's value could be read from another thread before it is intended to be used
3. write/write: two (or more) threads trying to write to the same variable could do so simultaneously, resulting in an unexpected state
On Serialization and Linearizability
- ensuring A happens before (->) B
- transactions are useful concepts
- for some method M: ...
- Atomic: M happens as a single unit and appears instantaneous
- Consistent: M never moves the program into an inconsistent state
- Isolated: M's intermediary work is isolated
- Durable: M's effects persist fot the lifetime of the context in which the effects take place
- the effect? Serializability!
- given two properly serialized methods M and N, we say either M -> N or N -> M
- Linearization points
- mostly concerned with lock-free programming
- the place where M's effects take place
- locks are common implementation techniques
Concurrency and Time
- quick example of what happens at the hardware (assembly) level when incrementing a shared variable from 3 different threads with pre-emptive multi-threading, page faults, etc
Concurrency Begets Complexity
- behavior, memory, locks, invariants, deadlock, testing, debugging all prove challenging
Interlocked Operations (Hardware Synchronization)
- Interlocked are lowest-level sync primitive
- compare and swap (CAS)
- XCHG, LOCK CMPXCHG, LOCK CMPXCHG8B
- Atomic sequence of read and write
- Locks, events, etc are all built out of them
- System.Threading.Interlocked static class
- int Add(ref int l, int v);
- CompareExchange (ref int l, int v, int cmp);
- Decrement(ref int l);
- Increment (ref int l);
- Exchange (ref int l, int v);
- Fairly expensive
- cache coherency -- bus traffic
- cost is >100 cycles on non-NUMA, >500 cycles for NUMA (uncontended... [only gets worse from there])
- they will kill scalability
Kernel Synchronization Objects: The Foundation on Top of Which All Else Exists
- kernel objects are basic primitives
- signaled/non-signaled state
- WaitForSingleObject (aka WaitAny), WaitForMulitpleObjects (aka WaitAll)
- *Msg variety: pump messages (STAs, GUIs)
- *Ex varietty: alertable waits (dispatch, Asynchronous Procedure Calls)
- CLR always pumps and does alertable waits
- data synchronization kinds
- mutex
- semaphore
- auto-reset event
- manual-reset event
- exposed through System.Threading.WaitHandle
Mutex and Sempaphore in .NET
- useful for ACLs, Win32 API, inter-AppDomain synchronization
- otherwise too heavyweight for most use
Manual and Auto-Reset Events
- manual: once set, all threads are awoken, reset must happen manually
- auto: one thread is woken, reset happens automatically
- useful in the same spots as mutex and semaphore, but are "sticky"
CLR Monitors (Locking)
- Monitor.Enter/Exit
- more efficient since they're in user mode, not in the kernel
- Lock (C#); SyncLock (VB)
- any .NET object can be used as the target
- recursive and timeout-based acquires
- spins briefly to reduce frequency of context switches
- thin and fat locks
- header word before fields in object layout used for thin lock
- acquire incurs a single interlocked operation
- fat lock on contention: sync-block (+event) reclaimed by GC
- calling (the default implementation of) GetHashCode (including putting somethign into a dictionary) on an object will require escalation to fat lock
Sync-block has a linked list of waiting threads/events; using PulseAll causing escalation as well
Condition Variables
- Monitor.Wait/Pulse/PulseAll
- waiter releases all locks on target
- .NET releases all recursive locks
- Pulse wakes one, PulseAll wakes all
- efficient mechanism used internally to pool events
- inflates target to fat lock to register thread/event pair
- event is one per thread
- number of kernel objects attached to waiting threads is bounded by number of threads, since there is one kernel object per thread
Reader/Writer Locks
- useful if there is much more reading than writing
- allows many simultaneous readers, but a single writer
- ReaderWriterLock
- Upgrades can have a window of opportunity for race conditions
- 3.5 ReaderWriterLockSlim
- Read/Write/UpgradableRead
- however, scalability can still suffer for fine-grained work due to CAS domination
Locks and Fairness
- should locks be fair?
- Thread A arrives, gets the lock
- Thread B arrives, must wait
- Thread C arrives, must wait
- is it guaranteed that B gets the lock when A exits?
- pre-Windows Server 2003 SP1: yes; post: no
- fairness exacerbates convoys
- A leaves lock, pulses B
- time for B to awaken is at least C, where C = cost of context switch ( > 10, 000 cycles) -- could be 2C
- if Thread D arrives in this window, it must wait
- effectively extends lock hold time
- system would scale nicely until the first time a thread needs to wait... then the convoy would cripple the system
Thread Local Storage (TLS)
- ensures a reference is local to one thread
- static (preferred): use ThreadStaticAttribute on the variable; provides best performace
- dynamic: Thread.SetData(k, v), only needed when per-object TLS is needed
- in both cases, initialization is tricky
- class constructors only invoked once per thread static
- all uses need to check for initialization
- within framework, one global hashtable with one global lock... lends itself to problems
Memory Models (Danger!) -- Architecture and Platform Guarantees
- sometimes locks are unnecessary
- native pointer-sized writes are atomic (32-bit/64-bit)
- but compilers (JIT compiler) and processors are free to reorder reads and writes
- CLR memory models
- data dependence
- all writes are store/release (no instructions can be moved to be after)
- all volatile reads are load/acquire (no instructions can be moved to be before)
- adjacent writes can be coalesced
- fence ensures nothing moves direction (lock, interlocked operations, Thread.MemoryBarrier)
- Sophisticated code can exploit this; warning!
- e.g. double checked locking
- hard to test, since real re-ordering differs per platform
- often requires so much cleverness that locks win out
Word Tearing :Accessing Nonatomic Locations w/out Proper Synchronization
- writing primitive types that span more than an architecture-primitive space (e.g. Int64 on a 32-bit machine) can cause strange results
Lock Freedom: Double Edged Sword
- interlocked provides hardware atomicity, so scalable, wait-free algorithms are possible
- very limited in what you can do
Double-Checked Locking
- CLR 2.0 memory model guarantees safety (ECMA 1.1/, C++, Java don't)
- "volatile" keyword necesary to prevent speculative loads (IA64)
Building a Spin-Lock: The Brain-Melting Details
Needs to:
- yield immediately on a 1-CPU machine
- don't use CompareExchange in a loop
- too much cache contention
- back-off to add randomization
- reread and try CAS if seen as 1 [TTAS]
- possibly consider queueing [MCS]
- avoid priority starvation (Sleep(0) issue)
- issue PAUSE instructions on Intel HT machines
- mark BeginCriticalRegion/EndCriticalRegion
- use managed thread ID to avoid OS thread affinity