Financial Apps feel the need for speed – this can come via parallelization, and via infrastructure - fast messaging and non-blocking distributed memory management. This blogpost gives an overview + examples of various technologies that can squeeze performance out of your trading apps and clock cycles out of your modeling apps.
Low Latency via Infrastructure
ZeroMQ
· ZeroMQ is a messaging library - ‘messaging middleware’ , ‘TCP on steroids’ , ‘new layer on the networking stack’. not a complete messaging system , but is a simple messaging library to be used programmatically. Gives the flexibility and performance of low level socket interface plus ease of implementation of high level. It is designed for simplicity.
· Performance - ZeroMQ is orders of magnitude faster than most AMQP messaging systems as it doesn’t have the overhead. It leverages efficient transports such as reliable Multicast and makes use of intelligent message batching, minimizing not only protocol overhead but also system calls. You can choose the message encoding format such as BSON or ProtoBuff.
o ZeroMQ sockets can connect to multiple end points and automatically load balance messages over them. It is brokerless and thus has no single point of failure.
· ZeroMQ provides 4 different transports:
- o INPROC an In-Process communication
- o IPC an Inter-Process communication
- o MULTICAST multicast via PGM, possibly encapsulated in UDP
- o TCP a network based transport
· ZeroMQ provides message routing devices that can bind to different ports and forward messages from them according to pre-caned logic. ZeroMQ provides three kinds of devices:
- o QUEUE, a forwarder for the request/response messaging pattern
- o FORWARDER, a forwarder for the publish/subscribe messaging pattern
- o STREAMER, a forwarder for the pipelined messaging pattern
· 0MQ supports the following message patterns:
· Clrzmq provides a C# Binding for the ZeroMQ API
o Get it from Source - http://github.com/zeromq/clrzmq or from Nuget package - http://packages.nuget.org/packages/clrzmq
o server
// ZMQ Context, server socket
using (ZmqContext context = ZmqContext.Create())
using (ZmqSocket server = context.CreateSocket(SocketType.REP))
{
server.Bind("tcp://*:5432");
while (true)
{
// Wait for next request from client
var message = server.Receive(Encoding.Unicode);
// Do Some 'work'
Thread.Sleep(1000);
// Send reply back to client
server.Send("blah", Encoding.Unicode);
}
}
client
using (ZmqContext context = ZmqContext.Create())
using (ZmqSocket client = context.CreateSocket(SocketType.REQ))
{
client.Connect("tcp://localhost:5432");
var request = "blah";
for (int requestNum = 0; requestNum < 10; requestNum++)
{
client.Send(request, Encoding.Unicode);
var reply = client.Receive(Encoding.Unicode);
}
}
Redis
Redis (http://redis.io/) is a high performance NoSQL solution that provides network accessible shared memory – it provides a key-value / data structure store with support for lists & sets, as well as a non-blocking event bus. The performance is principally down to Redis keeping the entire dataset in memory, and only periodically syncing to disk. Redis supports 5 data structures: strings, lists, hashes, sets, & sorted sets.
BookSleeve (http://code.google.com/p/booksleeve/) is a .NET client (available as a Nuget package) for Redis provides pipelined, asynchronous, multiplexed and thread-safe access to redis:
using (var conn = new RedisConnection("localhost"))
{
conn.Open();
conn.Set(0, "foo", "bar");
var value = conn.GetString(0, "foo");
...
string s = await value;
LMAX Disruptor
Disruptor (http://code.google.com/p/disruptor-net/) is ring buffer architecture for efficiently sending messages between threads without relying on shared queues (which CPU memory-access architecture causes contention). It leverages and improves upon some features of SEDA (which was serial) and the Actor model. It can handle 6 million TPS on a single thread. A Business Logic Processor runs in-memory using event sourcing and is surrounded by Disruptors - concurrency components that implements a network of queues that operate without needing locks. It allow consumers to wait on the results of other consumers without an intermediate queue.
LMAX Disruptor functions as a structured, ordered memory barrier set – multiple producer write barriers and consumer read barriers. There is no concept of entry deletion, just append.
readers can read concurrently and independently, and can optionally have dependencies. LMAX Disruptor uses a large pre-allocated ring of entries. Entry objects are pre-allocated adjacently and never get garbage collected.
Pre-allocating entries means adjacent entries are (very likely) located in adjacent memory cells, and because readers read entries sequentially, this is important to utilize CPU caches. And lots of efforts to avoid lock, CAS, even memory barrier (e.g. use a non-volatile sequence variable if there's only one writer). Different annotating readers write to different fields / cache lines, to avoid write contention.
· The .NET consumer interface IBatchHandler<T> specifies an OnAvailable method and an OnEndOfbatch method (reminds me of RX) - The consumer runs on a separate thread receiving entries as they become available.
public interface IBatchHandler<T>
{
void OnAvailable(T sequence, T value);
void OnEndOfBatch();
}
Setup the RingBuffer and barriers.
ringBuffer = new RingBuffer<ValueEntry>(
()=>new MyDataFactory(),
1024, // Size of the RingBuffer
ClaimStrategyFactory.ClaimStrategyOption.SingleThreaded,
WaitStrategyFactory.WaitStrategyOption.Yielding);
handler = new MyHandler(count) as IBatchHandler<long>;
ringBuffer.ConsumeWith(_handler);
producerBarrier = ringBuffer.CreateProducerBarrier();
//run the consumer in a new Thread
ringBuffer.StartConsumers();
Publish messages to the disruptor
long sequence = _producerBarrier.NextEntry(out data);
// … do something to the data …
// append data
producerBarrier.Commit(sequence);
tear down the RingBuffer and stop consumer) threads:
ringBuffer.Halt();
Low Latency via Parallelization
The following parallelization technologies were investigated to determine latency capabilities given an indicator and a data set:
- · C++ AMP
- · C++ PPL Concrt
- · TPL
The indicator under investigation was: standard deviation with a sliding window – a common measure of variance in stochastic calculus , used in Ito’s Lemma from Quant Finance
The data was made up of 50 million iterations - 21 days of x ticks (seconds).
TPL (.NET parallel CPU)
The Task Parallel Library (TPL) is a.NET Framework 4 API that simplifies parallelism and concurrency in applications. The TPL scales the degree of concurrency dynamically to most efficiently use all the processors that are available. In addition, the TPL handles the partitioning of the work, the scheduling of threads on the ThreadPool, cancellation support, state management, and other low-level details. By using TPL, you can maximize the performance of your code while focusing on the work that your program is designed to accomplish. The TPL is the preferred way to write multithreaded and parallel code in .NET.
ConCrt ( C++ parallel CPU)
The Concurrency Runtime programming framework for C++ abstracts the details of high performance parallelism. uses a cooperative task scheduler that implements a work-stealing algorithm to efficiently distribute work among computing resources. The Concurrency Runtime also provides synchronization primitives that use cooperative blocking to synchronize access to resources. By blocking cooperatively, the runtime can use the remaining quantum to perform another task as the first task waits for the resource. This mechanism promotes maximum usage of computing resources.
C++ AMP ( C++ parallel GPGPU)
C++ Accelerated Massive Parallelism (C++ AMP) accelerates execution of C++ code by executing it on data-parallel hardware of the graphics processing unit (GPU) found on a DirectX 11 graphics card. The C++ AMP programming model includes multidimensional arrays, indexing, memory transfer, tiling, and a mathematical function library and allows you to control how data is moved from the CPU to the GPU and back, so that you can improve performance. General-purpose computing on graphics processing units (GPGPU) allows you to perform computation in applications traditionally handled by the CPU, but across many more cores.
You can see my previous post about C++ AMP here: http://geekswithblogs.net/JoshReuben/archive/2011/12/04/c-amp.aspx
Here is the AMP moving average code – it performed best (I wont show the TPL code because it was trivial and I wont show the ConCrt code – as I actually leveraged this API in preloading the array in the AMP code sample):
#include "stdafx.h"
#include <vector>
#include <random>
#include <iostream>
#include <amp.h>
#include <concrt.h>
#include <amp_math.h>
#include<array>
#include "timer.h"
#define SAMPLES 500000000 // for tiles to avoid invalid_compute_domain: 16777216
#define WindowSize 21 // for tiles to avoid invalid_compute_domain: 256
using namespace std;
using namespace concurrency;
using namespace concurrency::fast_math;
int _tmain(int argc, _TCHAR* argv[])
{
auto &samples = *new std::array<float, SAMPLES>();
auto &results = *new std::array<float, SAMPLES + WindowSize>();
// tr1 uniform distrib
std::tr1::uniform_real<float> unif(1.2000F, 1.5555F);
mt19937 eng;
// concrt preload
parallel_for(
0, SAMPLES, 1,
[&] (int i) {samples[i]=unif(eng);});
// GPU IO
array_view<const float,1> av(SAMPLES, samples);
array_view<float,1> rv(SAMPLES + WindowSize, results);
Timer tAll;
tAll.Start();
parallel_for_each(rv.extent, [=] (index<1> idx) restrict (amp)
{
float r = 0;
for (int n = 0; n < WindowSize; ++n)
{
r+= av[idx + n];
}
auto avg = r / WindowSize;
float sum = 0;
for (int n = 0; n < WindowSize; n++)
{
auto dif = av[idx + n] - avg;
sum += dif * dif; // powf(dif,2);
}
rv[idx + WindowSize] = rsqrtf(r) / WindowSize;
});
cout << results[1000] << endl;
tAll.Stop();
std::cout << tAll.Elapsed() << " ms" << endl;
Benchmark Results
tested on DELL laptop i7 64 bit 8 Giga RAM with SSD drive.
C# TPL (Standard deviation):
- Data Size 50M: Window size 21: 650 millisec
- Data Size 200M: Window size 21: 2.56 sec
C++ ConCrt
- Data Size 50M: Window size 21: 486 millisec
- Data Size 200M: Window size 21: 1.8 sec
C++ AMP
- Data Size 50M: Window size 21: 285 millisec
- Data Size 200M: Window size 21: 1.15 sec
Redis
(win 64 bit fork, not including serialization / deserialization time):
o 5M objects:
- § Write: 40 sec
- § Read: 9.1 sec
- § Redis memory 1.7 G
o 10M objects:
- § Write: 209 sec
- § Read: 234 sec
- § Redis memory 2.7 G
Conclusions
As can be seen from the benchmarks, C++ AMP outperformed C++ Concrt, and C++ Concrt outperformed C# TPL. However, It was revealed that C# TPL provided adequate performance for the given indicator and data set. Using C# TPL also avoids the development productivity overhead associated with C++. By architecting correctly, the processing engine C++ / C# can be swapped out to suite your needs.