Concurrent, Multi-Core Programming on Windows and .NET (Part V -- Designs and Algorithms: Joe Duffy)

BlockingQueue: common producer/consumer data structure
- manual reset events: simple to write; poor performance if many more consumers than producers (or slow producer and fast consumers); all threads wake up
- auto reset events: trickier logic; however, sets don't keep track of the number of sets (as semaphores do); oh... and this doesn't work
- monitors: simplest code that can work... and it quickly becomes non-trivial when bounded blocking queues come into play

[I think I've got a problem that this fits as a nice solution for it. Since I won Joe's latest book and it covers blocking queues, I'm sure I will be posting more about this.]

.NET 4.0 will also have a BlockingCollection type.

Coarse vs Fine-Grained Concurrency
- server programs "have it easy"
- have been doing this for years
- steady stream of incoming work, W
- typically, W > #P, thus...
- one W per P isn't such a bad strategy
- multi-core is only a problem once P > W; then the server needs to worry about fine-grained
- clients, not so much
- user-centric and responsiveness-oriented
- "user clicks a button" -> now what?
- faster? sure!
- need to go fine-grained
- client developers don't tend to thing in parallel or concurrent terms

Going fine-grained
Two basic approaches
1. It's the task (i.e. code)
2. It's the data
- partitioning: split data source into n partitions, then process each partition in parallel (database query approach)

Ideally, app developers choose whatever is most appropriate for the task at hand

A Taxonomy of Concurrency
Agents -> task parallelism -> data parallelism

Metrics Worth Considering
Goal: speeding up the application... ts/tp ~ P
ts = the time required to run sequentially
tp = time to run in parallel
P = number or processors used
if ts/tp = P then we're linear; if ts/tp > P then we have super-linear improvement

Design challenge: cheap problem decomposition
Parallelism isn't free: there's overhead associated with synchronization, inter-thread communications, cache effects, threading aren't major factors in sequential code

Algorithms

Parallel for loops
- independent loop iterations may be run in parallel
- only if no loop-carried dependencies
- no shared state mutation
- static decomposition
- loop iterations are assigned to workers a priori
- contiguous chunks, striping, etc
- good: simple, predictable, efficient
- bad: can't tolerate iteration imbalance, blocking
- dynamic decomposition
- loop iterations are assigned on-demad, usually in chunks
- good: tolerates load imbalance, blocking
- bad: more difficult, communications overhead

Parallel For Each loops
- we don't know the size up front, so static decomposition is a bit of a challenge
- works best if the body (work) is much more significant than the cost of getting the iterator

Divide and Conquer - Recursion
- process separate halves of the problem separately, until some threshold is reached

Reductions
- associative and commutative reductions can be done in parallel by breaking apart the input into chunks, computing partial results, then combining

When to "Go Parallel"?
- there is a cost; it's only worthwhile when work per task/element is large and/or the number of tasks/elements is large

Speedup Inhibitors
- synchronous I/O
- synchronization
- load imbalance: exacerbates Amdahl's Law

Other miscellaneous algorithms
- pipelining
- speculation and search
- dataflow

Print | posted @ Sunday, October 26, 2008 10:11 PM

Comments on this entry:

No comments posted yet.

Post A Comment
Title:
Name:
Email:
Website:
Comment:
Verification: