Monday, March 26, 2012 #

Low-Latency High-Performance Financial App Infrastructures

 

Financial apps feel the need for speed. It can come from parallelization and from infrastructure: fast messaging and non-blocking distributed memory management. This post gives an overview, with examples, of various technologies that can squeeze performance out of your trading apps and clock cycles out of your modeling apps.

Low Latency via Infrastructure

ZeroMQ

· ZeroMQ is a messaging library, variously described as ‘messaging middleware’, ‘TCP on steroids’ and a ‘new layer on the networking stack’. It is not a complete messaging system but a simple messaging library to be used programmatically. It gives the flexibility and performance of a low-level socket interface plus the ease of implementation of a high-level one. It is designed for simplicity.

· Performance - ZeroMQ can be orders of magnitude faster than most AMQP messaging systems because it does not carry their overhead. It leverages efficient transports such as reliable multicast and makes use of intelligent message batching, minimizing not only protocol overhead but also system calls. You can choose the message encoding format, such as BSON or Protocol Buffers.

o ZeroMQ sockets can connect to multiple endpoints and automatically load-balance messages over them. It is brokerless and thus has no single point of failure.

· ZeroMQ provides 4 different transports:

  • o INPROC an In-Process communication
  • o IPC an Inter-Process communication
  • o MULTICAST multicast via PGM, possibly encapsulated in UDP
  • o TCP a network based transport

· ZeroMQ provides message routing devices that can bind to different ports and forward messages from them according to pre-canned logic. ZeroMQ provides three kinds of devices:

  • o QUEUE, a forwarder for the request/response messaging pattern
  • o FORWARDER, a forwarder for the publish/subscribe messaging pattern
  • o STREAMER, a forwarder for the pipelined messaging pattern

· 0MQ supports the following message patterns: request/response, publish/subscribe, pipeline and exclusive pair.

· Clrzmq provides a C# Binding for the ZeroMQ API

o Get it from source - http://github.com/zeromq/clrzmq or from the NuGet package - http://packages.nuget.org/packages/clrzmq

o server

// ZMQ context, server socket
using (ZmqContext context = ZmqContext.Create())
using (ZmqSocket server = context.CreateSocket(SocketType.REP))
{
    server.Bind("tcp://*:5432");

    while (true)
    {
        // Wait for next request from client
        var message = server.Receive(Encoding.Unicode);

        // Do some 'work'
        Thread.Sleep(1000);

        // Send reply back to client
        server.Send("blah", Encoding.Unicode);
    }
}

o client

// ZMQ context, client socket
using (ZmqContext context = ZmqContext.Create())
using (ZmqSocket client = context.CreateSocket(SocketType.REQ))
{
    client.Connect("tcp://localhost:5432");

    var request = "blah";

    for (int requestNum = 0; requestNum < 10; requestNum++)
    {
        // Send a request, then block until the server's reply arrives
        client.Send(request, Encoding.Unicode);
        var reply = client.Receive(Encoding.Unicode);
    }
}

Redis

Redis (http://redis.io/) is a high performance NoSQL solution that provides network accessible shared memory – it provides a key-value / data structure store with support for lists & sets, as well as a non-blocking event bus. The performance is principally down to Redis keeping the entire dataset in memory, and only periodically syncing to disk. Redis supports 5 data structures: strings, lists, hashes, sets, & sorted sets.

BookSleeve (http://code.google.com/p/booksleeve/) is a .NET client for Redis (available as a NuGet package) that provides pipelined, asynchronous, multiplexed and thread-safe access to Redis:

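using (var conn = new RedisConnection("localhost"))
{
    conn.Open();

    // Writes are pipelined: the call returns without waiting for the server
    conn.Set(0, "foo", "bar");

    // Returns a task; the reply arrives asynchronously
    var value = conn.GetString(0, "foo");

    ...
    string s = await value;
}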

LMAX Disruptor

Disruptor (http://code.google.com/p/disruptor-net/) is a ring-buffer architecture for efficiently passing messages between threads without relying on shared queues, which cause contention given how CPUs access memory. It leverages and improves upon some features of SEDA (which was serial) and the Actor model. It can handle 6 million TPS on a single thread. A Business Logic Processor runs in-memory using event sourcing and is surrounded by Disruptors: concurrency components that implement a network of queues operating without locks. It allows consumers to wait on the results of other consumers without an intermediate queue.

The LMAX Disruptor functions as a structured, ordered set of memory barriers: multiple producer write barriers and consumer read barriers. There is no concept of entry deletion, just append.

Readers can read concurrently and independently, and can optionally have dependencies on one another. The LMAX Disruptor uses a large pre-allocated ring of entries. Entry objects are pre-allocated adjacently and never get garbage collected.

Pre-allocating entries means adjacent entries are (very likely) located in adjacent memory cells; because readers read entries sequentially, this matters for CPU cache utilization. A lot of effort also goes into avoiding locks, CAS operations and even memory barriers (e.g. a non-volatile sequence variable is used if there is only one writer). Different annotating readers write to different fields / cache lines to avoid write contention.
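To make the ring idea concrete, here is a minimal C++ sketch of a single-producer / single-consumer ring with pre-allocated entries and two sequence counters kept on separate cache lines. It is not the Disruptor API: the SpscRing class, its method names, the power-of-two capacity and the 64-byte cache line size are all assumptions made for illustration, and the real Disruptor supports multiple dependent consumers and pluggable wait strategies.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical single-producer / single-consumer ring (illustration only).
template <typename T>
class SpscRing {
public:
    // Capacity must be a power of two so a sequence maps to a slot with a mask.
    explicit SpscRing(std::size_t capacity_pow2)
        : entries_(capacity_pow2), mask_(capacity_pow2 - 1) {
        assert(capacity_pow2 != 0 && (capacity_pow2 & mask_) == 0);
    }

    // Producer side: copies value into the next slot; returns false if the ring is full.
    bool try_publish(const T& value) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        if (head - tail_.load(std::memory_order_acquire) == entries_.size()) return false;
        entries_[head & mask_] = value;                    // slot was allocated up front
        head_.store(head + 1, std::memory_order_release);  // publish the new sequence
        return true;
    }

    // Consumer side: reads the oldest unread entry; returns false if the ring is empty.
    bool try_consume(T& out) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        if (tail == head_.load(std::memory_order_acquire)) return false;
        out = entries_[tail & mask_];
        tail_.store(tail + 1, std::memory_order_release);  // free the slot for reuse
        return true;
    }

private:
    std::vector<T> entries_;   // contiguous, pre-allocated entries
    const std::size_t mask_;
    // Producer and consumer sequences live on different (assumed 64-byte)
    // cache lines so the two threads do not contend on the same line.
    alignas(64) std::atomic<std::size_t> head_{0};
    alignas(64) std::atomic<std::size_t> tail_{0};
};
```

No lock is taken on either side: each counter has exactly one writer, so plain atomic loads and stores with acquire / release ordering are enough in this simplified two-thread case.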

· The .NET consumer interface IBatchHandler<T> specifies an OnAvailable method and an OnEndOfBatch method (reminds me of Rx). The consumer runs on a separate thread, receiving entries as they become available.

public interface IBatchHandler<T>
{
    // Called for each entry as it becomes available, with its sequence number
    void OnAvailable(long sequence, T value);

    // Called once the handler has caught up with the producer
    void OnEndOfBatch();
}

 

Set up the RingBuffer and barriers:

ringBuffer = new RingBuffer<ValueEntry>(
    () => new ValueEntry(), // entry factory, used to pre-allocate the entries
    1024, // Size of the RingBuffer
    ClaimStrategyFactory.ClaimStrategyOption.SingleThreaded,
    WaitStrategyFactory.WaitStrategyOption.Yielding);

_handler = new MyHandler(count) as IBatchHandler<ValueEntry>;
ringBuffer.ConsumeWith(_handler);
_producerBarrier = ringBuffer.CreateProducerBarrier();

// Run the consumer in a new thread
ringBuffer.StartConsumers();

 

Publish messages to the disruptor:

// Claim the next entry in the ring
ValueEntry data;
long sequence = _producerBarrier.NextEntry(out data);

// … do something to the data …

// Append the data
_producerBarrier.Commit(sequence);

 

Tear down the RingBuffer and stop the consumer threads:

ringBuffer.Halt();

Low Latency via Parallelization

The following parallelization technologies were investigated to determine latency capabilities given an indicator and a data set:

  • · C++ AMP
  • · C++ PPL Concrt
  • · TPL

The indicator under investigation was standard deviation over a sliding window – a common measure of dispersion (volatility) in quantitative finance, where it appears as the volatility term of the stochastic calculus models to which Ito’s Lemma is applied.

The data was made up of 50 million iterations - 21 days of x ticks (seconds).
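For reference, the indicator can be sketched as a plain single-threaded C++ loop. This is only a baseline under stated assumptions, not the benchmarked code: the function name sliding_stddev is hypothetical, and it computes the population standard deviation (dividing by the window size rather than by window size minus one).

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Population standard deviation of every full window of `window` samples.
// out[i] covers samples[i] .. samples[i + window - 1].
std::vector<double> sliding_stddev(const std::vector<double>& samples, std::size_t window) {
    assert(window > 0);
    std::vector<double> out;
    if (samples.size() < window) return out;  // not enough data for one window
    out.reserve(samples.size() - window + 1);

    for (std::size_t i = 0; i + window <= samples.size(); ++i) {
        // Pass 1: mean of the window
        double sum = 0.0;
        for (std::size_t n = 0; n < window; ++n) sum += samples[i + n];
        const double mean = sum / static_cast<double>(window);

        // Pass 2: mean of the squared deviations (the variance)
        double squares = 0.0;
        for (std::size_t n = 0; n < window; ++n) {
            const double diff = samples[i + n] - mean;
            squares += diff * diff;
        }
        out.push_back(std::sqrt(squares / static_cast<double>(window)));
    }
    return out;
}
```

Each output element depends only on its own window, which is what makes the calculation easy to spread across CPU cores or GPU threads.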

TPL (.NET parallel CPU)

The Task Parallel Library (TPL) is a .NET Framework 4 API that simplifies parallelism and concurrency in applications. The TPL scales the degree of concurrency dynamically to most efficiently use all the processors that are available. In addition, the TPL handles the partitioning of the work, the scheduling of threads on the ThreadPool, cancellation support, state management, and other low-level details. By using TPL, you can maximize the performance of your code while focusing on the work that your program is designed to accomplish. The TPL is the preferred way to write multithreaded and parallel code in .NET.

ConCrt (C++ parallel CPU)

The Concurrency Runtime programming framework for C++ abstracts the details of high performance parallelism. It uses a cooperative task scheduler that implements a work-stealing algorithm to efficiently distribute work among computing resources. The Concurrency Runtime also provides synchronization primitives that use cooperative blocking to synchronize access to resources. By blocking cooperatively, the runtime can use the remaining quantum to perform another task as the first task waits for the resource. This mechanism promotes maximum usage of computing resources.

C++ AMP (C++ parallel GPGPU)

C++ Accelerated Massive Parallelism (C++ AMP) accelerates execution of C++ code by executing it on data-parallel hardware of the graphics processing unit (GPU) found on a DirectX 11 graphics card. The C++ AMP programming model includes multidimensional arrays, indexing, memory transfer, tiling, and a mathematical function library and allows you to control how data is moved from the CPU to the GPU and back, so that you can improve performance. General-purpose computing on graphics processing units (GPGPU) allows you to perform computation in applications traditionally handled by the CPU, but across many more cores.

You can see my previous post about C++ AMP here: http://geekswithblogs.net/JoshReuben/archive/2011/12/04/c-amp.aspx

Here is the AMP sliding-window standard deviation code, which performed best (I won't show the TPL code because it was trivial, and I won't show the ConCrt code separately because I actually leveraged that API to preload the array in the AMP code sample):

#include "stdafx.h"
#include <vector>
#include <random>
#include <iostream>
#include <array>
#include <amp.h>
#include <amp_math.h>
#include <concrt.h>
#include <ppl.h> // parallel_for
#include "timer.h"

#define SAMPLES 500000000 // for tiles to avoid invalid_compute_domain: 16777216
#define WindowSize 21 // for tiles to avoid invalid_compute_domain: 256

using namespace std;
using namespace concurrency;
using namespace concurrency::fast_math;

int _tmain(int argc, _TCHAR* argv[])
{
    // Allocated on the heap: arrays of this size do not fit on the stack
    auto &samples = *new std::array<float, SAMPLES>();
    auto &results = *new std::array<float, SAMPLES + WindowSize>();

    // tr1 uniform distribution
    std::tr1::uniform_real<float> unif(1.2000F, 1.5555F);
    mt19937 eng;

    // ConCrt preload. Note: eng is shared by all worker threads without
    // synchronization, so the generated sequence is not reproducible.
    parallel_for(
        0, SAMPLES, 1,
        [&] (int i) { samples[i] = unif(eng); });

    // GPU IO
    array_view<const float,1> av(SAMPLES, samples);
    array_view<float,1> rv(SAMPLES + WindowSize, results);

    Timer tAll;
    tAll.Start();

    // One GPU thread per full window, so that av[idx + n] stays in bounds
    parallel_for_each(extent<1>(SAMPLES - WindowSize + 1), [=] (index<1> idx) restrict (amp)
    {
        // Mean of the window
        float r = 0;
        for (int n = 0; n < WindowSize; ++n)
        {
            r += av[idx + n];
        }
        auto avg = r / WindowSize;

        // Sum of squared deviations from the mean
        float sum = 0;
        for (int n = 0; n < WindowSize; n++)
        {
            auto dif = av[idx + n] - avg;
            sum += dif * dif; // powf(dif,2);
        }

        // Standard deviation = square root of the variance
        rv[idx + WindowSize] = fast_math::sqrtf(sum / WindowSize);
    });

    // Copy the results back to CPU memory before reading them
    rv.synchronize();
    cout << results[1000] << endl;

    tAll.Stop();
    std::cout << tAll.Elapsed() << " ms" << endl;

    delete &samples;
    delete &results;
    return 0;
}

Benchmark Results

Tested on a Dell laptop: i7, 64-bit, 8 GB RAM, SSD drive.

C# TPL  (Standard deviation):
  • Data Size 50M:  Window size 21: 650 millisec
  • Data Size 200M: Window size 21: 2.56 sec
C++ ConCrt
  • Data Size 50M: Window size 21: 486 millisec
  • Data Size 200M:  Window size 21: 1.8 sec
C++ AMP
  • Data Size 50M: Window size 21: 285 millisec
  • Data Size 200M:  Window size 21: 1.15 sec
Redis

(Windows 64-bit fork, not including serialization / deserialization time):

o 5M objects:

  • § Write: 40 sec
  • § Read: 9.1 sec
  • § Redis memory: 1.7 GB

o 10M objects:

  • § Write: 209 sec
  • § Read: 234 sec
  • § Redis memory: 2.7 GB

Conclusions

As can be seen from the benchmarks, C++ AMP outperformed C++ ConCrt, and C++ ConCrt outperformed C# TPL. However, C# TPL provided adequate performance for the given indicator and data set. Using C# TPL also avoids the development productivity overhead associated with C++. By architecting correctly, the processing engine (C++ / C#) can be swapped out to suit your needs.

Posted On Monday, March 26, 2012 9:05 PM | Feedback (1)

Wednesday, March 14, 2012 #

Direct3D 11 Programming in a Nutshell

Programming Direct3D requires understanding of where different types of resources are bound to the shader pipeline. The shader pipeline consists of configurable fixed function stages (Input Assembler, Tessellator, Stream Output, Rasterizer, Output Merger), and opt-in HLSL programmable shader stages (Vertex Shader, Hull Shader, Domain Shader, Geometry Shader, Pixel Shader, Compute Shader). Passing data into shaders involves creating & binding resources to the pipeline in C++ on CPU, so that HLSL can manipulate them on GPU, in parallel and at multiple stages. 3 main types of inputs can be passed in to HLSL: buffers, shader resource views and sampler state objects. The graphics pipeline programmable shaders are written in HLSL – a C / C++ derived language with a simplified API and semantics for specifying how data is passed between stages.

Shader Pipeline components

The shader pipeline consists of configurable fixed function stages and HLSL programmable shader stages. Different pipeline stages can perform multiple ops at the same time. Depending upon the use case scenario, different shader stages can be leveraged. Data is passed between HLSL programmable shader stages via matching output & input struct fields decorated with HLSL semantics, and by HLSL global system value semantics. Different shader stages are designed for processing different data granularities - Data can be processed at 3 levels: whole primitive, vertex, and pixel. The rendering pipeline programmable shaders provide opt-in functionality, but at the bare minimum, you should implement a Vertex Shader and a Pixel Shader.

Try to avoid amplification – the Pixel Shader works against interpolated fragments, of which there will be many more than there are vertices – so for performance do calculations earlier in the pipeline, where there are fewer parallelized ops. Conversely, the tessellation shaders can increase the number of vertices to be rasterized. In the Rasterizer stage, culling, clipping and the scissor test can eliminate unnecessary fragments before they are passed to the Pixel Shader stage.

Pipeline Stage | type | functionality | HLSL input | HLSL output
Input Assembler | configurable | Read bound input vertex buffers from CPU | - | -
Vertex Shader | programmable | Transform vertices | Vertex | Vertex
Hull Shader | programmable | Determine tessellation LOD & process convex hull patch control points | Primitive | control points
Tessellator | configurable | Determine barycentric coordinates to be sampled from primitive | - | -
Domain Shader | programmable | Vertex generation | control points & barycentric coordinates | Vertex
Geometry Shader | programmable | Modify primitive vertices | Primitive | primitive
Rasterizer | configurable | Interpolate fragments from vertices; determine fragment depth value | - | -
Stream Output | configurable | Buffer resource output to CPU | - | -
Pixel Shader | programmable | Determine pixel color from texture, lights & norm | Pixel | pixel
Output Merger | configurable | Write pixels to bound render target output; depth / stencil visibility tests; blending | - | -

Computation pipeline – A single stage programmable shader for GPGPU. It provides structured threading with group shared memory granularity for intermediary calculation rollup. Compute Shader is called by invoking Dispatch op instead of Draw op.

Input Assembler

· IO: This stage inputs up to 16 vertex buffers & an index buffer, and outputs streams of individual vertices (for input into the Vertex, Hull & Domain shaders) and primitives assembled from vertices (for input into Hull, Domain & Geometry shaders).

· The array of input structures must match the vertex shader's input parameter struct field datatype layout via its vertex shader input semantics (A typical canonical input could match a vertex shader input struct with a float3 POSITION, a float3 NORMAL, and a float2 TEXCOORDS - see below in HLSL section). Note that vertex data can be split between multiple vertex buffers, each with a different struct (e.g. float[3] positions in one, float[3] norms in another).

· Each buffer is configured via a D3D11_INPUT_ELEMENT_DESC - specify SemanticName for binding, Format – int or float datatype, InputSlot: 0-15, AlignedByteOffset – where to start reading data, InputSlotClass & InstanceDataStepRate – for drawing multiple varied instances of a model mesh.

· PRIMITIVE_TOPOLOGY – specifies how the primitives are organized within the vertex buffer. Strips are more compact than lists and allow indexing of shared vertices

  • o Point list – for particles
  • o Line list – for hair
  • o Line strip
  • o Triangle list
  • o Triangle strip – given a mesh, you are most likely to use this.
  • o Control point patch list (max 32 points) – for input to Hull Shader

Vertex Shader (HLSL)

· Invoked for every vertex inputted from the Input Assembler vertex stream output – each vertex is processed in isolation (this won't deform mesh symmetry because standard transforms rotate, scale, translate are affine).

· 3 types of matrices are often combined in the Vertex Shader to perform geometric affine transforms:

  • o World matrix - convert the object space vertices into global vertices in the 3D scene, applying rotate, translate, and scale transforms to them.
  • o View matrix - transform the scene into camera space according to the camera position and orientation.
  • o Projection matrix - project the 3D scene into the 2D viewport clip space.

· Vertex shaders are used for:

  • o geometric affine transforms on vertex positions
  • o vertex skinning (object space bone based transforms for posturing)
  • o per vertex light calculations – vertex reflectivity, for later use in Pixel Shader
  • o control point manipulation

· Input can come from HLSL intrinsics generated from input assembler output stream & bound resources:

  • o SV_InstanceID – for variations
  • o SV_VertexID – each vertex has one!

· Output – dependent on next pipeline stage used:

  • o Some data must be supplied by whichever stage runs last before the Rasterizer stage, for passing on to the Pixel Shader (Hull & Geometry shaders can optionally pass it through): provide SV_Position (vertex clip space position, i.e. post projection) and optionally provide SV_ClipDistance[n] and SV_CullDistance[n] (to Rasterizer stage). If using a Hull Shader, SV_Position instead provides control points for a patch primitive (a convex hull).
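The world, view and projection transforms described in this section can be sketched on the CPU in plain C++. This is an illustration under assumptions, not Direct3D API code: the Vec4 / Mat4 aliases and all function names are hypothetical, the matrices use the row-vector convention (v' = v * M), and the projection is a standard left-handed perspective matrix that maps depth to the 0..1 range.

```cpp
#include <array>
#include <cassert>
#include <cmath>

using Vec4 = std::array<float, 4>;
using Mat4 = std::array<std::array<float, 4>, 4>;  // row-major; vectors are rows

Mat4 identity() {
    Mat4 m{};  // zero-initialized
    for (int i = 0; i < 4; ++i) m[i][i] = 1.0f;
    return m;
}

// World-style transform: translation lives in the last row for row vectors.
Mat4 translation(float x, float y, float z) {
    Mat4 m = identity();
    m[3][0] = x; m[3][1] = y; m[3][2] = z;
    return m;
}

// Left-handed perspective projection mapping view-space z in [zn, zf] to depth in [0, 1].
Mat4 perspective_lh(float fov_y, float aspect, float zn, float zf) {
    const float y_scale = 1.0f / std::tan(fov_y * 0.5f);
    Mat4 m{};
    m[0][0] = y_scale / aspect;
    m[1][1] = y_scale;
    m[2][2] = zf / (zf - zn);
    m[2][3] = 1.0f;                      // copies view-space z into w
    m[3][2] = -zn * zf / (zf - zn);
    return m;
}

Mat4 multiply(const Mat4& a, const Mat4& b) {
    Mat4 r{};
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            for (int k = 0; k < 4; ++k)
                r[i][j] += a[i][k] * b[k][j];
    return r;
}

Vec4 transform(const Vec4& v, const Mat4& m) {
    Vec4 r{};
    for (int j = 0; j < 4; ++j)
        for (int k = 0; k < 4; ++k)
            r[j] += v[k] * m[k][j];
    return r;
}

// What a typical vertex shader does per vertex: object space -> clip space.
Vec4 to_clip_space(const Vec4& object_pos, const Mat4& world, const Mat4& view, const Mat4& proj) {
    return transform(object_pos, multiply(multiply(world, view), proj));
}
```

In practice the three matrices are usually multiplied once per object on the CPU and passed to the shader in a constant buffer, so that each vertex costs a single vector-matrix multiply.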

Hull Shader (HLSL)

· 2 required functions (unlike other programmable shaders which have only 1):

  • o Hull shading function: Invoked once for each control point - add [amplify] / remove[filter] / modify control points – input from the Vertex shader SV_Position combined with Input Assembler stage primitive stream output; output to the Domain Shader. 3 inputs: InputPatch<HSControlPointIn, #>, SV_OutputControlPointID, SV_PrimitiveID
  • o Patch constant function: Invoked for the entire control point patch - Configure Tessellator LOD (level of detail) heuristic – e.g. fewer triangles needed if far from camera, more triangles needed near silhouette edge. Inputs SV_PrimitiveID, outputs SV_TessFactor and SV_InsideTessFactor.

· The Hull shader HLSL program is adorned with several config attributes for the Tessellator & Domain Shader stages:

  • o [domain] – input primitive type, typically "tri" for triangle, can also be isoline or quad
  • o [partitioning] – tessellation algorithm for Tessellator stage, can be: fractional_even, fractional_odd, integer, pow2
  • o [outputtopology] - output primitive type, can be: triangle_cw, triangle_ccw (clockwise / anticlockwise – note that Rasterizer Stage culls triangles facing away from camera) or line
  • o [outputcontrolpoints(n)] – output size (max 32)
  • o [patchconstantfunc] – name of the patch constant function, which supplies per-patch data to the Tessellator and Domain Shader
  • o [maxtessfactor(n)] – driver hint for memory pre-allocation

Tessellator

· Subdivision of input geometry – generates points specifying where vertices are to be created by domain shader. Point count amplification / de-amplification is based upon edge factor to interior factor ratio.

· Receives input from Hull Shader patch constant function output and Hull Shader main function attributes.

· Outputs SV_DomainLocation[n] for input into Domain Shader. Outputs the output topology to the Geometry Shader stage if used, or straight to the Rasterizer stage.

Domain Shader (HLSL)

· Inputs

  • o from Hull Shader stage: entire control point patch and config attributes – remains constant for the invocation sweep
  • o from Tessellator stage: SV_DomainLocation[n] coordinate points – varies for the invocation sweep, so the Domain Shader is invoked once for each point.

· Creates vertices from sampled Tessellator stage coordinate points, positioned according to the surface curves defined by the Hull Shader stage control point patch output and the patch constant function config.

· Outputs SV_Position to Rasterizer stage. Whilst the Vertex Shader could provide this to the Rasterizer stage, optionally using the tessellation stages allows mesh morphing and LOD amplification.
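For the simplest case, a flat triangular patch with three control points, the vertex generation performed per domain location reduces to a barycentric weighted sum. The C++ sketch below shows that arithmetic only; the Float3 struct and the evaluate_tri_patch name are hypothetical, and a real Domain Shader would typically evaluate a curved surface and then apply the view / projection transforms.

```cpp
#include <array>
#include <cassert>
#include <cmath>

struct Float3 { float x, y, z; };

// Position on a flat triangular patch for a barycentric location (u, v, w),
// where u + v + w = 1 and each weight belongs to one control point.
Float3 evaluate_tri_patch(const std::array<Float3, 3>& control_points, const Float3& bary) {
    const Float3& a = control_points[0];
    const Float3& b = control_points[1];
    const Float3& c = control_points[2];
    return Float3{
        bary.x * a.x + bary.y * b.x + bary.z * c.x,
        bary.x * a.y + bary.y * b.y + bary.z * c.y,
        bary.x * a.z + bary.y * b.z + bary.z * c.z
    };
}
```

The control points stay constant across the sweep while only the barycentric location changes, which mirrors the two kinds of Domain Shader input listed above.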

Geometry Shader (HLSL)

· Explicitly modify / add / remove geometry vertices – takes vertices from input stream and invokes Append() to add it to output stream.

· Usages:

  • o partial model discard
  • o instance amplification & variation
  • o shadow volumes - discard non-edge primitives to generate just a silhouette
  • o point sprite particle generation – convert points to quads for texture application

· Inputs up to 6 vertices for adjacent triangles (SV_PrimitiveID) from either the Vertex Shader or Domain Shader stages. If no tessellation is used, it connects adjacent primitives according to Index Buffer stream from Input Assembler. An additional input is SV_GSInstanceID for primitive amplification copies. Note that because vertices can be shared by multiple primitives a vertex may be processed multiple times – performance implications.

· adorned with 2 config attributes:

  • o maxvertexcount – for vertex output, specify driver hint for memory pre-allocation
  • o instance – create up to 32 copies, each of which can be vary-transformed according to SV_GSInstanceID

· It has 3 possible output types: PointStream<T> (point list), LineStream<T> (line strip) or TriangleStream<T> (triangle strip). Up to 4 streams can be output – only one needs to be passed to Rasterizer stage as SV_Position, the optional others can be passed to Stream Output stage to send back to CPU. If multiple output streams are used, they must all be PointStream<T>. Geometry Shader optionally outputs a SV_RenderTargetArrayIndex for determining which Texture2D bound render target a stream sent to Stream Output stage should utilize. For split screen rendering, the Geometry Shader can optionally specify SV_ViewportArrayIndex to Rasterizer to specify a target subregion within its render target Texture2D.
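The point sprite usage listed above amounts to turning one input point into four output vertices arranged as a triangle strip. Here is a small CPU-side C++ sketch of that expansion; the Float3 struct and expand_point_to_quad are hypothetical names, the quad is assumed to lie in the XY plane, and the texture coordinates a real shader would also emit are omitted.

```cpp
#include <array>
#include <cassert>

struct Float3 { float x, y, z; };

// Expand a point into a quad centred on it, in triangle strip order:
// bottom-left, top-left, bottom-right, top-right.
// With y pointing up, the first triangle is wound clockwise.
std::array<Float3, 4> expand_point_to_quad(const Float3& centre, float half_size) {
    return {{
        {centre.x - half_size, centre.y - half_size, centre.z},  // bottom-left
        {centre.x - half_size, centre.y + half_size, centre.z},  // top-left
        {centre.x + half_size, centre.y - half_size, centre.z},  // bottom-right
        {centre.x + half_size, centre.y + half_size, centre.z}   // top-right
    }};
}
```

In HLSL the four vertices would be appended to a TriangleStream<T>; the strip form is why four vertices are enough to describe the two triangles of the quad.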

Stream Output

· Requires that the Geometry Shader be created with stream output via ID3D11Device::CreateGeometryShaderWithStreamOutput and that output buffers are bound via ID3D11DeviceContext::SOSetTargets

· Used to pass geometry data back to the CPU via Unordered Access Views bound to output buffer streams – e.g. for debugging, testing, offline inspection, or post-processing between multi-passing (via DrawAuto). Different streams can receive different output – e.g. a geometry's back & front faces. Up to 4 output buffer resource slots are available.

Rasterizer

· Primitive culling - cull primitives completely outside the normalized clip space cube according to SV_CullDistance received from Vertex Shader stage. ID3D11RasterizerState config values control culling: FrontCounterClockwise & CullMode – determine triangle vertex ordering, identify front & back faces, and specify which non-contributing face to cull. DepthBias, SlopeScaleDepthBias, DepthBiasClamp – apply depth mapping technique to identify scene objects visible to light sources as well as shadowed artifacts for discarding. Culling is important to reduce amplification in interpolation.

· Primitive clipping - clip parts of primitives partially outside the normalized clip space cube according to SV_ClipDistance received from Vertex Shader stage. ID3D11RasterizerState config value DepthClipEnable specifies whether to remove primitives outside the frustum (near & far clip planes).

· Normalize the viewport target - The C++ code must provide at least one D3D11_VIEWPORT defining the render target sub-region rectangle and near & far planes. The viewport is normalized from (-1, -1, 0) to (1, 1, 1) for clip space mapping.

· Multiple render targets - Rasterizer can optionally identify a specific render target viewport and texture slice according to SV_ViewportArrayIndex and SV_RenderTargetArrayIndex – up to 8 render targets are supported.

· Interpolation - Interpolate SV_Position vertex data received from a previous stage into generated fragments (pixels with interpolated attributes) to pass to Pixel Shader stage. ID3D11RasterizerState config values control interpolation: FillMode – specify solid fill (default) or wireframe (fragments are only interpolated for edge polygons); AntialiasedLineEnable varies edge pixel color according to the percent that a pixel is covered by a line. By default interpolation mode is perspective-correct linear that takes depth into account, but other modes can be set: centroid (considers pixel coverage), no-interpolation (pass constants to appear faceted), no-perspective (ignores depth), sample (MSAA).

· Multi-sample anti-aliasing (MSAA) - ID3D11RasterizerState MultisampleEnable reduces edge aliasing in a performant manner, by utilizing multiple sub-samples stored in a depth-stencil buffer instead of increasing the Pixel Shader resolution.

· Scissor test – discard any generated fragments outside the render target's viewport (rectangular windows region). Applied if ID3D11RasterizerState ScissorEnable is set.
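The viewport mapping mentioned above, from the normalized cube to a render target rectangle, can be written out as a few lines of C++. This is a sketch of the usual formula rather than Direct3D code: the Viewport struct is a hypothetical stand-in whose fields mirror the TopLeftX, TopLeftY, Width, Height, MinDepth and MaxDepth members of D3D11_VIEWPORT, and ndc_to_viewport is an illustrative name.

```cpp
#include <cassert>
#include <cmath>

struct Float3 { float x, y, z; };

// Hypothetical stand-in mirroring the members of D3D11_VIEWPORT.
struct Viewport {
    float top_left_x, top_left_y;
    float width, height;
    float min_depth, max_depth;
};

// Map a normalized position, x and y in [-1, 1] and z in [0, 1],
// to render target pixel coordinates and a viewport depth value.
Float3 ndc_to_viewport(const Float3& ndc, const Viewport& vp) {
    return Float3{
        vp.top_left_x + (ndc.x + 1.0f) * 0.5f * vp.width,
        vp.top_left_y + (1.0f - ndc.y) * 0.5f * vp.height,  // y is flipped: +1 is the top row
        vp.min_depth + ndc.z * (vp.max_depth - vp.min_depth)
    };
}
```

Using several viewports with different rectangles is what makes split screen rendering possible: the same normalized coordinates land in different sub-regions of the render target.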

Pixel Shader (HLSL)

· Invoked for each fragment - processes each fragment independently. Receives SV_Position processed by Rasterizer stage to hold the X,Y render target coordinates and normalized depth value.

· SV_Depth can optionally be used to substitute an arbitrary depth value. Conservative depth output can clamp minimum / maximum depth values, and the bill-boarding technique allows a textured quad perpendicular to the camera to have a depth complexity that enables partial occlusion.

· If the Rasterizer stage was configured for MSAA, SV_SampleIndex is passed into Pixel Shader to be invoked for each subsample of each pixel. If the Rasterizer passes specific render target texture slice data via SV_RenderTargetArrayIndex the Pixel Shader can evaluate whether to process the fragment. The Pixel Shader can be executed once per pixel, according to which sample passes the coverage test, or it can run once per sample per pixel (supersampling).

· For each fragment, calculates the pixel out color to pass to the Output Merger stage. The calculation uses basic trigonometry & linear algebra & is based on:

  • o Sampled texture – external resource file loaded & bound via shader resource view.
  • o light - type, color, directional vector - bound via constant buffer.
  • o material reflectivity factor
  • o vertex normal vectors - specified in vertex buffer
  • o color - bound via constant buffer.

· The Pixel Shader stage passes SV_Target[n] (color) and SV_Depth (depth) to the Output Merger stage. For MSAA, SV_Coverage is also passed.

Output Merger

· Inputs SV_Target[n] (color) and SV_Depth (depth) from the Pixel Shader stage

· Blending – post-process combination of texels from multiple 2D render targets: a color value combination function is applied against an input pixel color, e.g. to modify alpha channel transparency. Configured using D3D11_RENDER_TARGET_BLEND_DESC and D3D11_BLEND_DESC.

· Depth test – for each fragment, use Z-buffer algorithm: if the normalized z-coordinate from Rasterizer stage (or clamped value from Pixel Shader stage) is less than the bound depth stencil resource value, then it is in visibility range, else it can be discarded. Ignored if the Pixel Shader stage targeted the use of an ID3D11UnorderedAccessView.

· Stencil test – for each fragment, determine whether the area of the render target is masked and cannot be written to. Can be configured for both front face & back face of a primitive. Configured via D3D11_DEPTH_STENCIL_DESC and D3D11_BLEND_DESC. Set test evaluation & pass / fail action using D3D11_COMPARISON_FUNC and D3D11_STENCIL_OP. Ignored if the Pixel Shader stage targeted the use of an ID3D11UnorderedAccessView.

· Output – the Output Merger stage merges results into bound output render target(s). Up to 8 ID3D11RenderTargetView and 1 ID3D11DepthStencilView can be bound using ID3D11DeviceContext::OMSetRenderTargets. If the Pixel Shader stage targeted the use of an ID3D11UnorderedAccessView, use ID3D11DeviceContext::OMSetRenderTargetsAndUnorderedAccessViews. Multiple render targets allow different versions of the same scene. These must have the same size (height, width, depth, sample count, array size) & type (e.g. Texture2D, Texture2DArray) so that depth stencil can match the render target. You can use multiple render target Texture2Ds (MRT - require only a single parallelized Pixel Shader invoke to split), or texture slices in a single Texture2DArray (individual slice Rasterization allow geometry primitives to be rasterized to different target locations).
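The depth test described in this section can be illustrated with a small CPU-side C++ sketch of the Z-buffer algorithm. It is a model of the idea only: Fragment, DepthTestedTarget and their members are hypothetical names, the comparison is assumed to be "less than", and the depth buffer is assumed to be cleared to 1.0 (the far plane).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Fragment {
    std::size_t x, y;     // render target coordinates
    float depth;          // normalized depth in [0, 1]
    std::uint32_t color;  // packed color
};

// Hypothetical render target with an attached depth buffer.
class DepthTestedTarget {
public:
    DepthTestedTarget(std::size_t width, std::size_t height)
        : width_(width), height_(height),
          color_(width * height, 0u),
          depth_(width * height, 1.0f) {}  // cleared to the far plane

    // Writes the fragment only if it is nearer than what is already stored.
    // Returns true if the fragment passed the depth test.
    bool merge(const Fragment& f) {
        assert(f.x < width_ && f.y < height_);
        const std::size_t i = f.y * width_ + f.x;
        if (!(f.depth < depth_[i])) return false;  // occluded: discard
        depth_[i] = f.depth;
        color_[i] = f.color;
        return true;
    }

    std::uint32_t color_at(std::size_t x, std::size_t y) const {
        return color_[y * width_ + x];
    }

private:
    std::size_t width_, height_;
    std::vector<std::uint32_t> color_;
    std::vector<float> depth_;
};
```

Because the nearest fragment wins regardless of arrival order, opaque geometry can be drawn in any order and still resolve to the correct visible surface.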

Resources

Passing data into shaders involves creating & binding resources to the pipeline in C++ on CPU, so that HLSL can manipulate them on GPU, in parallel and at multiple stages. 3 main types of inputs can be passed in to HLSL: buffers, shader resource views (more memory, but slower access) and sampler state objects.

For the rendering pipeline, bound resources are either RO or WO. For the compute pipeline, bound resources can be RO, WO or RW.

Resource Creation

Resources are created by specifying a specifically typed resource description (via config flags) & a D3D11_SUBRESOURCE_DATA structure describing the type of data loaded.

The following config flag enums are leveraged in the resource description:

· D3D11_USAGE

  • o DEFAULT (GPU RW) – for pipeline outputs: Output Merger stage render target Texture2D and Stream Output stage vertex buffers
  • o IMMUTABLE (GPU RO) – for static buffers created at initialization
  • o DYNAMIC (CPU WO, GPU RO) – used for passing data from C++ to shader programs on each render frame update – e.g. cbuffer scalar values, & matrices for transforms
  • o STAGING (CPU RW, GPU RW) – for DirectCompute GPGPU

· D3D11_CPU_ACCESS_FLAG – read, write or both

· D3D11_BIND_FLAG – specify pipeline location(s) that have access

  • o VERTEX_BUFFER , INDEX_BUFFER – bind geometry to Input Assembler stage – these binding types do not require a resource view.
  • o RENDER_TARGET, DEPTH_STENCIL – bind raster output from Output Merger stage. Note that render target requires a resource view to facilitate binding.
  • o STREAM_OUTPUT – bind geometry output from Stream Output stage
  • o CONSTANT_BUFFER, SHADER_RESOURCE, UNORDERED_ACCESS – bind input to any programmable shader stage. Note that these 3 binding types require resource views to facilitate binding.

· D3D11_RESOURCE_MISC_FLAG - miscellaneous

Resource view types

4 interfaces derive from ID3D11View - these 4 types of resource views specify how the resource will be used:

  • · ID3D11RenderTargetView – bind a Texture2D for output to the double buffer swap chain
  • · ID3D11DepthStencilView – bind output for depth & stencil tests.
  • · ID3D11ShaderResourceView – RO bound for HLSL shaders at any stage. Multiple shader resource views can access the same resource.
  • · ID3D11UnorderedAccessView - RW bound for HLSL shaders at Compute Shader or Pixel Shader stage. Only a single unordered access view can access the same resource.

Resource View descriptions use a structured union to specify the type of resource data structure:

  • · Buffer - 1D linear block of memory
  • · BufferEx – raw buffer freeform structure
  • · Texture1D – a vector of texels – typically used to implement lookup tables of float values
  • · Texture2D – a matrix of texels, typically used for standard image representation: render targets, depth target, normal maps (RGB values map to normal vector XYZ), displacement maps.
  • · Texture2DMS – multi-sampled version
  • · Texture3D – voxels (memory intensive) can be used for isosurface modeling and global illumination (fixed resolution makes it more performant than ray tracing).
  • · TextureCube – for CubeMaps – 6 Textures can be applied together to a model for reflective effects.
  • · Texture1DArray , Texture2DArray , Texture2DMSArray, TextureCubeArray – arrays of textures can be bound in single ops.

Resource data structures are supported by different resource view types (X = supported):

Data structure | RenderTarget View | DepthStencil View | ShaderResource View | UnorderedAccess View
--- | --- | --- | --- | ---
Buffer | X | | X | X
Texture1D | X | X | X | X
Texture1DArray | X | X | X | X
Texture2D | X | X | X | X
Texture2DArray | X | X | X | X
Texture2DMS | X | X | X |
Texture2DMSArray | X | X | X |
Texture3D | X | | X | X
TextureCube | | | X |
TextureCubeArray | | | X |
BufferEx | | | X |

Depending on the resource view type and data structure combination, different size, index & offset properties are specifiable for each resource data structure:

  • · ElementOffset, ElementWidth
  • · MipSlice
  • · FirstArraySlice, ArraySize
  • · FirstWSlice, WSize
  • · MostDetailedMip, MipLevels
  • · FirstElement, NumElements
  • · First2DArrayFace, NumCubes

Buffers

· 1D linear block of memory.

· Can contain scalar, vector or matrix values, structures of these types, or arrays of these structures. If passing custom data structures via vertex buffers or constant buffers, the C++ structure layout and datatypes must match the corresponding HLSL cbuffer. C++ code binds to an HLSL cbuffer by its name, but HLSL does not use this name internally – instead it uses shader reflection intrinsic keywords mapped against the fields of the cbuffer data structure. E.g. angles & transform matrices for vertex shaders.

· Confusingly, while buffers describe transferable blocks of memory, the term is also overloaded to describe the intended usage when passing polygon mesh model structures to the pipeline:

  • o Vertex Buffers - array of customizable vertex structures – a typical vertex structure contains a float[3] position vector, a float[3] normal vector, and a float[2] texture coord vector. A model is typically represented as a triangle polygon mesh – each corner of each triangle is a vertex. Multiple models can be combined into 1 triangle strip vertex buffer to reduce draw calls from the CPU. Typically bound to the Input Assembler stage as input; can also be bound to the Stream Output stage for debugging output.
  • o Index Buffers - Allow reuse of shared vertices, reducing the vertex buffer size & thus the amount of shader processing.
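The saving from an index buffer can be sketched on the CPU side in plain C++. This is only an illustration with no Direct3D calls: the Vertex layout follows the typical structure described above, and the IndexedMesh type and the BuildQuad, NonIndexedBytes and IndexedBytes helpers are hypothetical names, not part of the D3D11 API. A quad drawn as two triangles needs six full vertices without indices, but only four unique vertices plus six small indices with them.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical vertex layout: position, normal and texture coordinate.
struct Vertex {
    float position[3];
    float normal[3];
    float texcoord[2];
};

// A mesh as it would be uploaded: unique vertices plus indices into them.
struct IndexedMesh {
    std::vector<Vertex> vertices;
    std::vector<std::uint16_t> indices;
};

// Build a unit quad in the XY plane from two triangles that share an edge.
IndexedMesh BuildQuad() {
    IndexedMesh mesh;
    mesh.vertices = {
        {{0.f, 0.f, 0.f}, {0.f, 0.f, 1.f}, {0.f, 1.f}},  // 0: bottom-left
        {{0.f, 1.f, 0.f}, {0.f, 0.f, 1.f}, {0.f, 0.f}},  // 1: top-left
        {{1.f, 1.f, 0.f}, {0.f, 0.f, 1.f}, {1.f, 0.f}},  // 2: top-right
        {{1.f, 0.f, 0.f}, {0.f, 0.f, 1.f}, {1.f, 1.f}},  // 3: bottom-right
    };
    // Vertices 0 and 2 are referenced by both triangles.
    mesh.indices = {0, 1, 2, 0, 2, 3};
    assert(mesh.indices.size() % 3 == 0);  // whole triangles only
    return mesh;
}

// Bytes needed when every triangle corner stores a full vertex.
std::size_t NonIndexedBytes(const IndexedMesh& mesh) {
    return mesh.indices.size() * sizeof(Vertex);
}

// Bytes needed for the unique vertices plus the index list.
std::size_t IndexedBytes(const IndexedMesh& mesh) {
    return mesh.vertices.size() * sizeof(Vertex)
         + mesh.indices.size() * sizeof(std::uint16_t);
}
```

The saving grows with mesh size, since interior vertices of a triangle mesh are typically shared by several triangles, and each unique vertex only needs to be shaded once.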

Constant Buffers

· The mechanism for transferring RO data structures from the C++ host app on the CPU to cbuffer structures in programmable shaders on the GPU.

· The value may vary between C++ draw or dispatch calls, but remains constant across all parallel shader invocations for a given call. It is treated as a global constant in GPU memory for that parallel sweep, and is accessible from multiple shader stages as well as from the parallel invocations within each stage.

· HLSL cbuffer structures - constant buffers are globally accessible and immutable across parallel instances of a shader program invocation. The layout must match the constant buffer struct in the CPU host app. Note that correct offsets may require padding for this match – fields can be annotated with packoffset. Bound via the xSSetConstantBuffers methods (e.g. VSSetConstantBuffers). Global variables declared outside of any cbuffer are added to the $Globals constant buffer, and shader entry function uniform params are added to the $Params constant buffer.

· HLSL tbuffer structures - texture buffers have similar syntax to cbuffer structures. However, they are used as mapping targets for large array inputs from bound shader resource views, and use an async memory access mechanism. Bound via the xSSetShaderResources methods (e.g. PSSetShaderResources).
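The padding requirement can be checked at compile time on the C++ side. The sketch below assumes a hypothetical cbuffer named PerFrame with the fields shown in the comment. The struct name, the field names and the AlignTo16 helper are illustrative and not part of any API. It relies on the HLSL packing rule that constants are packed into 16-byte registers and that a field may not straddle a register boundary.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical per-frame constants. The assumed matching HLSL declaration is:
//   cbuffer PerFrame {
//       float4x4 world; float3 lightDir; float angle; float2 uvScale;
//   };
struct alignas(16) PerFrameConstants {
    float world[16];    // float4x4: four 16-byte registers
    float lightDir[3];  // float3: first 12 bytes of the next register
    float angle;        // float: fills the remaining 4 bytes of that register
    float uvScale[2];   // float2: starts a new register
    float padding[2];   // explicit padding up to a 16-byte multiple
};

// The C++ offsets must line up with the HLSL register layout.
static_assert(offsetof(PerFrameConstants, lightDir) == 64, "lightDir offset");
static_assert(offsetof(PerFrameConstants, angle) == 76, "angle offset");
static_assert(offsetof(PerFrameConstants, uvScale) == 80, "uvScale offset");
static_assert(sizeof(PerFrameConstants) % 16 == 0,
              "constant buffer size must be a multiple of 16 bytes");

// Round a byte size up to the next multiple of 16, e.g. for a buffer ByteWidth.
constexpr std::size_t AlignTo16(std::size_t bytes) {
    return (bytes + 15) & ~static_cast<std::size_t>(15);
}
```

If a float3 were followed by another float3 instead of a float, the second one would not fit in the remaining 4 bytes and would start a new register, so the C++ struct would need an explicit padding float between them.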

Structured Buffers

· The mechanism for transferring arrays of data structures from the C++ host app on the CPU to a StructuredBuffer<Tstruct> or RWStructuredBuffer<Tstruct> in programmable shaders on the GPU, and for passing data between pipeline stages. If bound to multiple stages, the buffer must be RO.

· HLSL uses Buffer<T> variables for RO values of a DXGI_FORMAT type, read via Load, and StructuredBuffer<Tstruct> for RO arbitrary structs. HLSL RWBuffer<T> and RWStructuredBuffer<Tstruct> support RW values, which require manual thread sync. Writing is accomplished via the array accessor.

· For RO use in multiple stages, bind via a shader resource view to a StructuredBuffer<Tstruct>. For RW use in a single stage, bind via an unordered access view to a RWStructuredBuffer<Tstruct>.

· Like C++, HLSL supports bracketed array indexing and a dot operator. It also supports a GetDimensions() method.

· An unordered access view bound to an HLSL AppendStructuredBuffer<Tstruct> or ConsumeStructuredBuffer<Tstruct> can also act as a LIFO stack via the HLSL Append() & Consume() ops.

Byte Address Buffers

· Use 4-byte-aligned byte offsets instead of a fixed structure – for GPGPU algorithms, this allows access to custom data structures such as trees and linked lists. Part of each item can specify the (+/-) offset to the next item.

· HLSL ByteAddressBuffer – RO values are read via the Load# methods (Load, Load2, Load3, Load4), which read 1-4 32-bit values starting at a byte offset.

· HLSL RWByteAddressBuffer uses the Store# methods to write values.
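The linked-list idea can be sketched on the CPU in C++. The RawBuffer class below is a hypothetical stand-in that only mimics the Load/Store style of access by byte offset. It is not the HLSL type, and the 8-byte node layout (payload followed by a signed relative offset, with 0 marking the end) is an assumption made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// CPU-side stand-in for a raw buffer addressed by 4-byte-aligned byte offsets.
class RawBuffer {
public:
    explicit RawBuffer(std::size_t sizeInBytes) : words_(sizeInBytes / 4, 0u) {}

    std::uint32_t Load(std::uint32_t byteOffset) const {
        assert(byteOffset % 4 == 0);
        return words_.at(byteOffset / 4);
    }

    void Store(std::uint32_t byteOffset, std::uint32_t value) {
        assert(byteOffset % 4 == 0);
        words_.at(byteOffset / 4) = value;
    }

private:
    std::vector<std::uint32_t> words_;
};

// Assumed node layout (8 bytes): [payload][signed byte offset to next, 0 = end].
void WriteNode(RawBuffer& buf, std::uint32_t at,
               std::uint32_t payload, std::int32_t nextDelta) {
    buf.Store(at, payload);
    buf.Store(at + 4, static_cast<std::uint32_t>(nextDelta));
}

// Walk the list from 'head', following the relative offsets, and sum payloads.
std::uint32_t SumList(const RawBuffer& buf, std::uint32_t head) {
    std::uint32_t sum = 0;
    std::uint32_t at = head;
    for (;;) {
        sum += buf.Load(at);
        const std::int32_t delta = static_cast<std::int32_t>(buf.Load(at + 4));
        if (delta == 0) break;  // end-of-list marker
        at = static_cast<std::uint32_t>(static_cast<std::int64_t>(at) + delta);
    }
    return sum;
}
```

Nodes do not have to be stored in traversal order: a node at offset 0 can point forward to offset 16, which in turn points back to offset 8.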

Indirect argument buffers

· Reduce the CPU overhead of passing params for each Draw or Dispatch execution pass, by instead having the GPU read the call arguments from within a buffer resource.

· Use any of the following methods: DrawInstancedIndirect, DrawIndexedInstancedIndirect (the indexed version) and DispatchIndirect

· Obviously requires GPU RW access

Geometry & Tessellation Buffers

· HLSL PointStream<Tstruct>, LineStream<Tstruct>, TriangleStream<Tstruct> – stream output buffers for the Geometry Shader to emit vertices for a primitive. Use Append() to add a vertex to a strip, RestartStrip() to begin a new strip.

· HLSL InputPatch[n] - input control point array for Hull Shader entry point function & patch constant function; and for Geometry Shader

· HLSL OutputPatch[n] - input control point array for Domain Shader.

Texture Resources

· Image-like 1-3D texel arrays. E.g. a Texture3D has X,Y,Z coordinates as well as UVW normalized coordinates.

· RO resource types: Texture1D , Texture1DArray , Texture2D , Texture2DArray , Texture2DMS , Texture2DMSArray , Texture3D, TextureCube, TextureCubeArray

· Textures are stored in texture specific video memory locations via register(t#)

· Textures support the following filtering functions (mipmap interpolation , minification / magnification):

  • o Mip-map levels - multiple mipmap mip levels allow different resolution granularities in 1-3 dimensions
  • o slices - sub-selections across 1 to 3 dimensions
  • o MSAA - (multisample anti-aliasing) - a quality technique whereby each texel can be composed of up to 32 subsamples, controlled by sample count & quality

· HLSL TextureX resource exposes several methods:

  • o Sample methods - apply filtering – take a SamplerState and normalized float texture coordinates as parameters. The SampleBias, SampleGrad & SampleLevel methods support mipmapping
  • o SampleCmp & Gather methods – Boolean comparison & RGBA functions for shadow mapping
  • o Load methods – array indexer for accessing raw RO texture pixels and subsamples – for MSAA
  • o Get methods for querying metadata – e.g. dimensions, mip levels, MSAA sample positions etc.

Sampler State Objects

Used for filtering textures for pixel fragments. A D3D11_SAMPLER_DESC describes a texture sampling configuration; up to 16 samplers can be bound to each pipeline stage.

· A sampler modifies how the pixels are written to the polygon face when shaded – it determines which combination of pixels will be drawn from the original texture, e.g. based on position or screen depth.

· HLSL SamplerState objects are mapped to s# sampler registers.

· Sampler details:

  • o sampling location – AddressU, AddressV, AddressW – control wrapping, flipping, mirroring, value range clamping.
  • o level of detail – MinLOD , MaxLOD, MipLODBias
  • o filtering - D3D11_FILTER subtype can specify minification (eliminate sparkle effects) / magnification (eliminate blockiness) / mip-mapping (multi-resolution representations). Sampling quality can be point (raw) , linear (interpolated average smoothing) or anisotropic (angle based interpolation).
  • o border color - float BorderColor[4]
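The effect of the address settings on an out-of-range texture coordinate can be sketched in plain C++. The AddressMode enum and the ApplyAddressMode function are hypothetical names for illustration only. The three cases are meant to mirror the behaviour of the wrap, mirror and clamp modes (D3D11_TEXTURE_ADDRESS_WRAP, D3D11_TEXTURE_ADDRESS_MIRROR and D3D11_TEXTURE_ADDRESS_CLAMP) on a single normalized coordinate; the GPU performs the equivalent mapping in hardware.

```cpp
#include <cassert>
#include <cmath>

// Illustrative subset of the addressing modes a sampler can apply.
enum class AddressMode { Wrap, Mirror, Clamp };

// Map an arbitrary texture coordinate into the [0, 1] range.
float ApplyAddressMode(float u, AddressMode mode) {
    switch (mode) {
    case AddressMode::Wrap:
        // Repeat the texture: keep only the fractional part.
        return u - std::floor(u);
    case AddressMode::Mirror: {
        // Flip the texture at every integer boundary (triangle wave).
        const float t = std::fmod(std::fabs(u), 2.0f);
        return t <= 1.0f ? t : 2.0f - t;
    }
    case AddressMode::Clamp:
        // Stretch the edge texels outwards.
        return std::fmin(std::fmax(u, 0.0f), 1.0f);
    }
    assert(false && "unknown address mode");
    return u;
}
```

For example, a coordinate of 1.25 samples at 0.25 when wrapped, at 0.75 when mirrored and at 1.0 when clamped.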

Graphics Card memory Allocation

HLSL supports a register mapping scheme for resources and structure fields. An explicit register location can be specified for a data structure using the register keyword – e.g. cbuffer X : register(b#) {}. The register types that appear in the compiled shader assembly are as follows:

  • · v# - inputs (RO)
  • · r#, x# - temp data
  • · t# - textures
  • · cb#[i] – constant buffers
  • · icb#[i] – immediate constant buffers
  • · u# – unordered access
  • · o# – output to be passed to next stage input
HLSL – High Level Shader Language

The graphics pipeline programmable shaders are written in HLSL - C / C++ derived with a simplified API. No support for dynamic memory alloc, recursive functions, pointers or templates.

Inter-stage data passing – binding semantics specify metadata (stage IO data) which consist of up to 4 float or int vectors. The output attributes of the previous stage must match the input attributes of the next stage. HLSL variable & system value semantics (SV_ prefix) adorn parameters and provide the pipeline with binding of the required matching values.

Syntax

Primitive types: bool, int, uint, half (16bit float for backwards compatibility), float, double

Vectors and matrices: support 1-4 components. Can use the verbose type syntax (e.g. vector<int,2> , matrix<float,4,4> ) or the compressed syntax (e.g. int2, float4x4). Can be initialized via array initializers or constructors that take scalar or vector params. Components can be accessed via array index syntax or via xyzw or rgba ordered swizzle properties. A matrix indexed with a single array index implicitly casts to a vector. Use the mul op for matrix multiplications. To enforce row major layout, use the row_major modifier.

Data structures – a struct, interface (implicitly pure virtual class i.e. abstract) or class can contain primitive, vector and matrix members.

Function param semantics – in, out, inout, uniform (constant in)

Control flow – supports loops (for, while, do while) and conditionals (if, switch). If a branch is coherent (simultaneous invocations of a shader program all choose the same branch) then dynamic branching occurs (control flow executes a single branch), otherwise predication occurs (all branches are executed, taking more compute cycles).

Attributes – GPU compiler hints for flow control (branch / flatten, loop / unroll), tessellation shaders (domain, maxtessfactor, outputcontrolpoints, outputtopology, partitioning, patchconstantfunc), the geometry shader (maxvertexcount, instance), the compute shader (numthreads) or the pixel shader (earlydepthstencil)

Intrinsic Functions - HLSL contains various mathematical functions (mapped to graphics card instruction set): general math, vector & matrix manipulations, casting, synch and thread atomicity, pixel & tessellation manipulation.

Reflection - HLSL also supports reflection for querying metadata via ID3D11ShaderReflection, ID3D11ShaderReflectionConstantBuffer, ID3D11ShaderReflectionVariable, ID3D11ShaderReflectionType interfaces.

HLSL Semantics

· semantic strings decorate shader variables, function parameters & struct fields. They specify the intended I/O binding for passing matching parameters between shader pipeline stages.

· Commonly used vertex shader input semantics include POSITION (vertex position in object space), NORMAL (normal vector), COLOR (diffuse / specular color) or TEXCOORD (texture coordinates). boneId and boneweight are used for vertex skinning. Commonly used vertex shader output semantics include POSITION (vertex position transformed into screen space), COLOR (pass through) or TEXCOORD (pass through). If tessellation shaders are used, TESSFACTOR is also passed.

IO Semantics

Vertex Shader Input Semantics

  • · BINORMAL[n] (float4)
  • · BLENDINDICES[n] (uint)
  • · BLENDWEIGHT[n] (float)
  • · COLOR[n] (float4)
  • · NORMAL[n] (float4)
  • · POSITION[n] (float4)
  • · POSITIONT (float4)
  • · PSIZE[n] (float)
  • · TANGENT[n] (float4)
  • · TEXCOORD[n] (float4)
  • · boneId (uint4)
  • · boneweight (float4)

Vertex Shader Output Semantics

  • · COLOR[n] (float4)
  • · FOG (float)
  • · POSITION[n] (float4)
  • · PSIZE (float)
  • · TESSFACTOR[n] (float)
  • · TEXCOORD[n] (float4)

Pixel Shader Input Semantics

  • · COLOR[n] (float4)
  • · TEXCOORD[n] (float4)

Pixel Shader Output Semantics

  • · COLOR[n] (float4)
  • · DEPTH[n] (float)

System-Value Semantics - begin with an SV_ prefix. Pixel shaders can only write to SV_Depth and SV_Target parameters. SV_VertexID, SV_InstanceID, & SV_IsFrontFace can only be input into the first active shader in the pipeline that can interpret it and must be passed to subsequent stages.

The system-value semantics that each stage can read and write are as follows:

Input Assembler

  • · Read only: none
  • · Write only (generated here): SV_VertexID (uint); SV_InstanceID (uint) – for DrawInstanced calls; SV_OutputControlPointID (uint); SV_PrimitiveID (uint)

Vertex Shader

  • · Read only: SV_VertexID (uint); SV_InstanceID (uint)
  • · Write only: SV_ClipDistance[n] (float); SV_CullDistance[n] (float); SV_Position (float4)

Hull Shader

  • · Read only: SV_VertexID (uint); SV_OutputControlPointID (uint); SV_PrimitiveID (uint)
  • · Write only: SV_InsideTessFactor (float/float[2]) - how much to tessellate patch non-edge polygons; SV_TessFactor (float[2|3|4]) - how much to tessellate patch edge polygons

Domain Shader

  • · Read only: SV_DomainLocation (float2|3); SV_OutputControlPointID (uint); SV_PrimitiveID (uint); SV_InsideTessFactor (float/float[2]); SV_TessFactor (float[2|3|4])
  • · Write only: none

Geometry Shader

  • · Read only: SV_GSInstanceID (uint); SV_PrimitiveID (uint)
  • · Write only: SV_ClipDistance[n] (float); SV_CullDistance[n] (float); SV_Position (float4); SV_RenderTargetArrayIndex (uint); SV_ViewportArrayIndex (uint)

Pixel Shader

  • · Read only: SV_IsFrontFace (bool); SV_Position (float4); SV_RenderTargetArrayIndex (uint); SV_ViewportArrayIndex (uint)
  • · Write only: SV_Coverage (bool) – mask; SV_Depth (float); SV_Target[0..7] (float); SV_SampleIndex (uint)

Output Merger

  • · Read only: SV_Coverage (bool); SV_Depth (float); SV_Target[0..7] (float); SV_SampleIndex (uint)
  • · Write only: none

Compute Shader

  • · Read only: SV_DispatchThreadID (uint3); SV_GroupID (uint3); SV_GroupIndex (uint); SV_GroupThreadID (uint3)
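The relationship between the four compute shader thread identifiers can be sketched in C++. The UInt3 struct and the DispatchThreadId and GroupIndex functions are illustrative names. The arithmetic follows the documented definitions: the dispatch-wide ID is the group ID scaled by the numthreads dimensions plus the thread ID within the group, and the group index is the flattened form of the thread ID within the group.

```cpp
#include <cassert>
#include <cstdint>

// Minimal stand-in for an HLSL uint3.
struct UInt3 {
    std::uint32_t x, y, z;
};

// SV_DispatchThreadID = SV_GroupID * numthreads + SV_GroupThreadID (per axis).
UInt3 DispatchThreadId(UInt3 groupId, UInt3 groupThreadId, UInt3 numThreads) {
    assert(groupThreadId.x < numThreads.x &&
           groupThreadId.y < numThreads.y &&
           groupThreadId.z < numThreads.z);
    return UInt3{groupId.x * numThreads.x + groupThreadId.x,
                 groupId.y * numThreads.y + groupThreadId.y,
                 groupId.z * numThreads.z + groupThreadId.z};
}

// SV_GroupIndex = z * (X * Y) + y * X + x, with (X, Y, Z) taken from numthreads.
std::uint32_t GroupIndex(UInt3 groupThreadId, UInt3 numThreads) {
    return groupThreadId.z * numThreads.x * numThreads.y
         + groupThreadId.y * numThreads.x
         + groupThreadId.x;
}
```

With numthreads(8, 8, 1), thread (3, 4, 0) of group (2, 1, 0) has a dispatch thread ID of (19, 12, 0) and a group index of 35. The dispatch thread ID is the natural choice for indexing a global buffer or texture, while the group index suits group-shared memory.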

Compilation & linkage

· HLSL compiles into a SIMD vectorized assembly language byte code specific to the GPU architecture. Shaders are compiled before binding – either offline via fxc.exe or in code via D3DXCompileFromFile/Resource – and are then passed to ID3D11Device::Create*Shader to create Vertex, Hull, Domain, Geometry, Pixel or Compute shader objects.

· To prevent branched shader program combinatorial explosion, dynamic shader linkage allows selecting the appropriate shader program interface implementation during binding for each Draw or Dispatch invocation.


Accessing the Shader pipeline from CPU

A Win32 message pump app is used to host the DX COM API for calling into the pipeline.

2 DirectX classes serve as the API entry points:

  • · ID3D11Device (the device) is used to create & configure resources and attach shader programs.
  • · ID3D11DeviceContext (the DC) is used to bind & manipulate resources and invoke bound shaders. A default single immediate context provides the main rendering thread, but deferred contexts can provide command lists for async resource creation.

Creating resources - The ID3D11Device interface supports the Create family of ops for creating buffers and resources

Binding input resources - The ID3D11DeviceContext can dynamically manipulate resources via the Map & Unmap ops for reading & writing, and via UpdateSubresource for efficient writing to a resource. Resources can be copied via the CopyResource, CopySubresourceRegion & CopyStructureCount ops. MipMaps can be dynamically generated from shader resource views using the GenerateMips op.

Initialization process – get the device & DC, create resources, set up the Windows message loop and the render frame function. The Windows message loop essentially conducts 2 ops repeatedly: update and render.

Update - Timer based animation often requires calculation of angles which are bound to constant buffers – this can be achieved with the sinf function or via calculating the modulus 360 of an angle to prevent overflow.
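A minimal C++ sketch of such an update step follows. The function names, and the idea of feeding the result into a constant buffer field, are assumptions made for illustration. The angle is kept within one turn using a floating-point modulus, and a sine of the angle drives a simple pulsing scale factor.

```cpp
#include <cassert>
#include <cmath>

// Advance a rotation angle (in degrees) by elapsed time, wrapping at 360 so
// the value never grows without bound and loses float precision.
float AdvanceAngleDegrees(float angle, float degreesPerSecond,
                          float elapsedSeconds) {
    assert(elapsedSeconds >= 0.0f);
    float next = std::fmod(angle + degreesPerSecond * elapsedSeconds, 360.0f);
    if (next < 0.0f) {
        next += 360.0f;  // fmod keeps the sign of the dividend
    }
    return next;
}

// Derive an oscillating scale factor from the angle, e.g. for a pulsing model.
float PulseScale(float angleDegrees) {
    const float kPi = 3.14159265f;
    return 1.0f + 0.25f * std::sin(angleDegrees * kPi / 180.0f);
}
```

On each pass of the message loop, the host app would call something like AdvanceAngleDegrees with the frame's elapsed time, write the result into its constant buffer struct, and upload it before drawing.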

Render - The render frame function makes use of several function family groups in ID3D11DeviceContext:

  • · OMSetRenderTargets - bind the RenderTargetView & DepthStencilView
  • · Clear - clear render target(s)
  • · IASet - bind buffer data
  • · xSSet - set inputs – set shaders, buffers, shader resource views & samplers for specific pipeline stages
  • · Draw

Draw calls:

· pass stream of vertices into Input Assembler stage

  • · Auto – make primitive order dependent on vertex order, and set up the GPU to do 2 passes: in the 1st pass the Vertex Shader & Geometry Shader stages process the vertices, then the Stream Output stage feeds the data stream back into the Input Assembler stage for a 2nd pass. Multiple rendering passes are useful for applying vertex skinning or tessellation on a 1st pass, mitigating the amplification effects.
  • · Indexed – leverage Index buffer
  • · Instanced – multiple copies of same mesh model primitive – introduce variations
  • · Indirect – allow GPU to construct vertex buffer from a previously loaded resource, reducing CPU to GPU data flow

Posted On Wednesday, March 14, 2012 10:28 AM

Thursday, December 08, 2011 #

Daytona - Iterative MapReduce on Windows Azure


Overview

MapReduce is a framework for processing highly distributable problems across huge datasets using a large number of compute nodes. It is a generic mechanism that comprises 2 steps:

  • Map step: The master node takes the input, partitions it up into smaller sub-problems, and distributes them to worker nodes. The worker node processes the smaller problem, and passes the answer back to its master node.
  • Reduce step: The master node then collects the answers to all the sub-problems and combines them in some way to form the output – the answer to the problem it was originally trying to solve.
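The two steps can be sketched as a single-process C++ word count. This is only an in-memory illustration: the function names are hypothetical, the "worker nodes" are ordinary function calls executed sequentially, and the grouping of intermediate values by key stands in for the work a real runtime does between the two steps.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

using KeyValue = std::pair<std::string, int>;

// Map: one input partition (a piece of text) -> intermediate (word, 1) pairs.
std::vector<KeyValue> MapWords(const std::string& text) {
    std::vector<KeyValue> out;
    std::istringstream words(text);
    std::string word;
    while (words >> word) {
        out.emplace_back(word, 1);
    }
    return out;
}

// Reduce: all intermediate values for one key -> a single total.
int ReduceCounts(const std::vector<int>& counts) {
    assert(!counts.empty());  // every key has at least one value
    int total = 0;
    for (int c : counts) total += c;
    return total;
}

// Driver standing in for the master node: map each partition, group the
// intermediate pairs by key, then reduce each group.
std::map<std::string, int> WordCount(const std::vector<std::string>& partitions) {
    std::map<std::string, std::vector<int>> grouped;
    for (const std::string& partition : partitions) {
        for (const KeyValue& kv : MapWords(partition)) {
            grouped[kv.first].push_back(kv.second);
        }
    }
    std::map<std::string, int> result;
    for (const auto& entry : grouped) {
        result[entry.first] = ReduceCounts(entry.second);
    }
    return result;
}
```

Because each call to MapWords depends only on its own partition, and each call to ReduceCounts only on the values for its own key, both loops could be distributed across worker nodes without changing the result.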

Microsoft has developed an iterative MapReduce runtime for Windows Azure, code-named "Daytona". It can be downloaded here: http://research.microsoft.com/en-us/downloads/cecba376-3d3f-4eaf-bf01-20983857c2b1/default.aspx

Daytona can scale out computations for analysis of distributed data. Developers can use this to construct distributed computing cloud data-analysis services on Windows Azure.

This is Microsoft's Hadoop-style implementation of MapReduce; the competing distributed computing infrastructure + API of Dryad + LINQ to HPC is now redundant.

An application has already been developed that leverages this infrastructure - Excel DataScope: an analytics service that exposes a library for cluster analysis, outlier detection, classification, machine learning, and data visualization. Users can upload data in their Excel spreadsheet to the DataScope service or select a data set already in the cloud, and then select an analysis model from the Excel DataScope research ribbon to run against the selected data. Daytona will scale out the model processing to perform the analysis. The results can be returned to the Excel client or remain in the cloud for further processing and/or visualization.

Development steps:

  • Develop arbitrary data analytics algorithms as a set of Map and Reduce tasks.
  • Upload the data and algorithms library into Windows Azure blob storage.
  • Deploy the Daytona runtime to Windows Azure - configure the number of virtual machines and the storage for analysis results.
  • From a client, launch the algorithm execution.

Daytona will automatically deploy the iterative MapReduce runtime to all of the configured Azure VMs, sub-dividing the data into smaller chunks so that they can be processed (the “map” function of the algorithm) in parallel. Eventually, it recombines the processed data into the final solution (the “reduce” function of the algorithm). Azure storage serves as the source for the data that is being analyzed and as the output destination for the end results. Once the analytics algorithm has completed, you can retrieve the output from Azure storage or continue processing the output by using other analytics model(s).

Key Properties:

  • Designed for the cloud, specifically for Windows Azure - the scheduling, network communications, and fault tolerance logic leverages Azure fabric infrastructure.
  • Designed for cloud storage services - a streaming based, data-access layer for cloud data sources (currently, Windows Azure blob storage only), which can partition data dynamically and support parallel reads. Intermediate data can reside in memory or in local non-persistent disks with backups in blobs, so that it is possible to consume data with minimum overheads and with the ability to recover from failures. A distributed file system is not required as the Azure storage services automatic persistence and replication are leveraged.
  • Horizontally scalable and elastic - Computations are performed in parallel, so that scaling a computation just requires adding more virtual machines to the deployment compute pool and the infrastructure will take care of the rest. Allows focusing on data exploration, instead of concerns about acquiring compute capacity or hardware management.
  • Optimized for iterative, convergent algorithms - supported in the core runtime, which caches data between iterations to reduce communication overheads and provides different scheduling and relaxed fault tolerance mechanisms. This is exposed via an API for authoring iterative algorithms.

Architecture

The Daytona runtime consists of two worker roles deployed within a single Windows Azure service.

  • Master worker role - the service will run only a single instance of it.
  • Slave worker role - the service will run multiple instances of it. The Master will assign one or more tasks to a Slave instance.

Applications will submit one or more jobs to the Master. A Job consists of one or more tasks and a task can be one of the following task types:

  • Map task – invokes a user-defined function that processes a KVP to generate a set of intermediate KVPs.
  • Reduce task - invokes a user-defined function that processes all values associated with an intermediate key and generates a final set of KVPs.

Terminology

  • Application - an abstraction that contains one or more MapReduce computations composed by the user. An Application is composed of a (1) Package, (2) Controller and (3) Controller Arguments.
  • Package - The collection of assemblies that compose this Application, referenced by name.
  • Controller - Takes arguments passed in from an external client during application submittal and configures, creates and controls jobs on behalf of this application.
  • Job - A single MapReduce computation. An Application can contain multiple MapReduce jobs. Each job is defined by a job configuration and a set of parameters.
  • Task - The Map or Reduce tasks of a Job.
  • Client Interface – An interface that is used to submit applications to the Daytona runtime.

Operational Process Steps:

  • Submit, Partition & Assign - The Master picks an application that has been submitted using the client interface. The Master creates the user-defined Controller instance of the application. The Controller contains the logic for creating and submitting jobs for execution. The runtime will automatically generate the appropriate Map/Reduce tasks per job and schedule them for execution in the slaves. The Controller will leverage a user-defined DataPartitioner to split the input dataset into a number of data partitions. A map task is assigned to each data partition, while the number of reduce tasks is left for the user to define based on the expected size of output generated by the map tasks. Each slave instance is then assigned one or more map tasks by the Master.
  • Map & Combine - A slave instance which has been assigned a map task reads the content of the corresponding data partition and parses it into KVPs through a user-defined Reader. Each KVP is then sequentially passed to the user-defined Map function that produces a set of intermediate KVPs. The intermediate KVPs are partitioned among the count of reduce tasks specified by the user and serialized in the slave instance's memory or local file storage. (Optionally, users can specify a Combiner to locally merge intermediate values for each intermediate key. Users can also control which keys go to which reduce task by specifying a KeyPartitioner.) The slave instance then notifies the master about completion of the task.
  • Reduce - As soon as the master is notified of the completion of one of the map tasks, it assigns the reduce tasks to the slave instances. A slave which has been assigned a reduce task is provided with the details for downloading its input data (intermediate KVPs generated by map tasks). The data exchange is done via inter-role communication. When the slave has downloaded all its input data, it de-serializes and groups it by the intermediate keys, as multiple map tasks could have generated the same key. The keys are then sorted and sequentially passed one by one along with their values to the user-defined Reduce function. The output of the reduce function is then persisted through a user-defined writer.

The runtime uses a data caching layer and an optimized scheduling mechanism to support iterative MapReduce jobs, i.e. execution of the same job in a loop within an iterative MapReduce computation.

API

Algorithm deployment to Daytona requires creating a Map function and a Reduce function.

A Controller must be created to handle job configuration, submission and management on the Master node within the Daytona runtime.

Use the MapReduceClient class to interact with the runtime - responsible for packaging the various components, uploading them to the runtime and submitting the application to be run on the backend.

MapReduceClient Class

This class facilitates interacting with the Daytona runtime by providing methods for submitting an application and tracking its execution status. Optionally, this class can receive a set of string properties from the Controller of the application at the end of execution as a means of communicating results.

Controller Class

Is invoked and executed per application within the Daytona runtime to manage job submissions on behalf of its corresponding application. The controller composes the jobs and orchestrates their flow within an application. For example, a controller of a typical MapReduce application may create one or more jobs and execute them one after the other to form a workflow (or a chain of jobs). Similarly, the controller of an iterative application may execute a job in a looping construct. Additionally, it can be used to perform any pre and post processing activities of an application or to evaluate convergence criteria in the case of an iterative application.
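The looping construct and convergence check of an iterative application can be sketched in plain C++. This does not use the Daytona API: RunJob and RunUntilConverged are hypothetical functions, and one-dimensional k-means is chosen purely as an example of an iterative, convergent algorithm. Each call to RunJob plays the role of one MapReduce job (map each point to its nearest centroid, reduce each group to its mean), and the outer loop plays the role of the controller.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// One "job": map each point to the key of its nearest centroid, then reduce
// the points for each key to their mean, giving the next set of centroids.
std::vector<double> RunJob(const std::vector<double>& points,
                           const std::vector<double>& centroids) {
    std::vector<double> sum(centroids.size(), 0.0);
    std::vector<std::size_t> count(centroids.size(), 0);
    for (double p : points) {
        std::size_t best = 0;
        for (std::size_t c = 1; c < centroids.size(); ++c) {
            if (std::fabs(p - centroids[c]) < std::fabs(p - centroids[best])) {
                best = c;
            }
        }
        sum[best] += p;  // running sum per key, as a combiner would keep
        ++count[best];
    }
    std::vector<double> next(centroids);  // an empty group keeps its centroid
    for (std::size_t c = 0; c < centroids.size(); ++c) {
        if (count[c] > 0) next[c] = sum[c] / static_cast<double>(count[c]);
    }
    return next;
}

// Controller-style loop: rerun the job until the centroids stop moving
// (the convergence criterion) or an iteration limit is reached.
std::vector<double> RunUntilConverged(const std::vector<double>& points,
                                      std::vector<double> centroids,
                                      double epsilon, int maxIterations) {
    assert(!centroids.empty());
    for (int i = 0; i < maxIterations; ++i) {
        const std::vector<double> next = RunJob(points, centroids);
        double shift = 0.0;
        for (std::size_t c = 0; c < next.size(); ++c) {
            shift = std::fmax(shift, std::fabs(next[c] - centroids[c]));
        }
        centroids = next;
        if (shift < epsilon) break;  // converged
    }
    return centroids;
}
```

The input points are the same on every iteration and only the small centroid vector changes, which is the kind of workload where caching data between iterations pays off.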

JobConfiguration Class

Captures the static configuration and parameters of a Job. For example, the Map class type, the Reduce class type and other parameters such as the exception policy are configured using this. A single JobConfiguration can be shared between multiple jobs.

Job Class

Captures the dynamic configurations and parameters of a job and provides methods for running and closing a job.

CloudClient Class

Provides a serializable wrapper for accessing different cloud clients {CloudBlobClient, CloudTableClient, CloudQueueClient } belonging to a particular storage account.

IDataPartitioner<K,V> & IRecordReader<K,V> Interfaces

Data partitioning provides the functionality for splitting the input dataset into partitions as well as parsing the contents of each partition to generate KVPs. Data partitioning can be provided by implementing the interface IDataPartitioner, and each split generated should implement the empty interface IDataPartition.

The types implementing IDataPartitioner and IDataPartition should be marked as Serializable. This is because the Daytona runtime runs each application on a dedicated AppDomain and instances of these types need to be passed across AppDomains.

IMapper<KIN, VIN, KOUT, VOUT> Interface

Provides the functionality for processing the KVPs generated from parsing each individual partition. It then generates intermediate KVPs as a result of that processing.

IKeyPartitioner<K,V> Interface

Key partitioning provides the functionality for splitting/distributing the intermediate KVPs generated after execution of Map/Combine among the reduce tasks. The count of reduce tasks is provided to the partitioner through the Partition method.
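The hash-modulo approach to key partitioning can be sketched in a few lines of C++. PartitionKey is a hypothetical function written for this example. It is not the Daytona Partition method or the library's HashModuloKeyPartitioner, whose internals are not shown here. The point is only that the same key always lands on the same reduce task, so all of its values end up together.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

// Choose a reduce task for an intermediate key: hash the key, then take the
// remainder modulo the number of reduce tasks.
std::size_t PartitionKey(const std::string& key, std::size_t reduceTaskCount) {
    assert(reduceTaskCount > 0);
    return std::hash<std::string>{}(key) % reduceTaskCount;
}
```

Which task a particular key maps to depends on the hash function, so only the range and the repeatability of the result should be relied upon. A custom partitioner is useful when the key distribution is skewed and a plain hash would overload one reduce task.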

IReducer<KIN, VIN, KOUT, VOUT> Interface

A Combiner can perform an optional step immediately after the Map so as to reduce the input payload of Reduce tasks. Typically, the same implementation is shared for reduce and combine. Provides the functionality to process all values associated with an intermediate key and generate a final KVP as output.

IRecordWriter<K,V> Interface

Provides the functionality to write the output KVPs generated by the IReducer implementation. If the output is large and/or needs to be persisted, then one of the Azure storage services or SQL Azure may be used. If the output is small and needs to be sent to the master for a final merge then it can be written to memory.

Context, MapContext & ReduceContext Classes

The Context class provides additional information to various components during the configuration stage. The Context class is specialized through the MapContext and ReduceContext classes, which inherit from it and provide additional context only relevant in those two components.

The Library Namespace

Research.MapReduce.Library provides a set of commonly used functionality for data partitioning, key partitioning and output writing.

Type integrity

Must be maintained in the configuration across the various components.

public interface IDataPartitioner<K, V>

public interface IRecordReader<K, V>

public interface IMapper<KIN, VIN, KOUT, VOUT> where KOUT : IComparable<KOUT>

public interface IReducer<KIN, VIN, KOUT, VOUT> where KIN : IComparable<KIN>

public interface IRecordWriter<K, V>

IEnumerable<IReduceResult<K, V>> GetReduceOutputs<K, V>()

Walkthrough – A Word Count MapReduce Implementation

Controller

[ControllerAttribute(
    Name = "Word Count",
    Description = "Counts #unique words found in input data.")]
public sealed class WordCountController : Controller
{
    public override void Run()
    {
        string outputContainerName
            = "word-count-output" + Guid.NewGuid().ToString("N");

        var jobConf = new JobConfiguration
        {
            MapperType = typeof(WordCountMapper),
            CombinerType = typeof(WordCountCombiner),
            ReducerType = typeof(WordCountReducer),
            MapOutputStorage = MapOutputStoreType.Local,
            KeyPartitioner = typeof(HashModuloKeyPartitioner<string, int>),
            ExceptionPolicy = new TerminateOnFirstException(),
            JobTimeout = TimeSpan.FromMinutes(10)
        };

        var job = new Job(jobConf, this)
        {
            DataPartitioner = new BlobContainerTextPartitioner(
                this.CloudClient, "word-count-input"),
            RecordWriter = new BlobTextCsvWriter(
                this.CloudClient, outputContainerName),
            NoOfReduceTasks = 2
        };

        // Run the job:
        job.Run();

        this.Results.Add("OutputContainer", outputContainerName);
    }
}

Data Partitioner

Splits an input container that contains blobs holding text data in CSV format.

[Serializable]

public class BlobContainerTextPartitioner : IDataPartitioner<int, string>

{

    public string ContainerName { get; private set; }

    public CloudClient CloudClient { get; private set; }   

    public BlobContainerTextPartitioner(

        CloudClient cloudClient, string containerName)

    {   

        if (cloudClient == null)

        {

            throw new ArgumentNullException("cloudClient");

        }

        if (string.IsNullOrWhiteSpace(containerName)) // also covers null and empty

        {

             throw new ArgumentNullException("containerName");

        }

        this.CloudClient = cloudClient;

        this.ContainerName = containerName;

    }

    public IEnumerable<IDataPartition> GetPartitions()

    {

        // Return one BlobPartition per blob in the container.

        CloudBlobContainer container

            = CloudClient.BlobClient.GetContainerReference(this.ContainerName);

        foreach (IListBlobItem blobItem in container.ListBlobs())

        {

            CloudBlob blob = container.GetBlobReference(blobItem.Uri.AbsoluteUri);

            blob.FetchAttributes();

            yield return new BlobTextPartition(blob, 0, blob.Properties.Length, true);

        }

    }

    public IRecordReader<int, string> GetRecordReader(IDataPartition partition)

    {

        return new BlobTextReader(this.CloudClient, partition as BlobTextPartition);

    }

}

Mapper

public sealed class WordCountMapper : IMapper<int, string, string, int>

{

    public IEnumerable<KeyValuePair<string, int>> Map(

        int key, string value, MapContext<int, string> context)

    {

        foreach (string word in Regex

            .Split(value, "[^a-zA-Z0-9]", RegexOptions.Singleline)

            .Where(tuple => !string.IsNullOrEmpty(tuple)))

        {

            yield return new KeyValuePair<string, int>(word, 1);

        }   

    }

    public void Configure(MapContext<int, string> context)

    {

        throw new System.NotImplementedException();

    }

}

Reducer

public sealed class WordCountReducer : IReducer<string, int, string, string>

{

    public IEnumerable<KeyValuePair<string, string>> Reduce(

        string key,

        IEnumerable<int> values,

        ReduceContext<string, int> context)

    {

       return new KeyValuePair<string, string>[]

        {

            new KeyValuePair<string, string>(key, values.Sum().ToString())

        };

    }

    public void Configure(ReduceContext<string, int> context)

    {

        throw new System.NotImplementedException();

    }

}
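To make the data flow of the mapper and reducer above concrete, here is a minimal single-process sketch in standard C++. It does not use the Daytona API: map_line and reduce_counts are hypothetical names, the shuffle is simulated with a std::map, and the counts stay as integers instead of being converted to strings as WordCountReducer does.

```cpp
#include <cctype>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Map step: split a line on non-alphanumeric characters and emit (word, 1),
// mirroring the [^a-zA-Z0-9] split used by WordCountMapper.
std::vector<std::pair<std::string, int>> map_line(const std::string& line) {
    std::vector<std::pair<std::string, int>> out;
    std::string word;
    for (unsigned char c : line) {
        if (std::isalnum(c)) {
            word.push_back(static_cast<char>(c));
        } else if (!word.empty()) {
            out.emplace_back(word, 1);
            word.clear();
        }
    }
    if (!word.empty()) out.emplace_back(word, 1);
    return out;
}

// Shuffle + reduce step: group the emitted pairs by key and sum the values.
std::map<std::string, int> reduce_counts(
    const std::vector<std::pair<std::string, int>>& pairs) {
    std::map<std::string, int> totals;
    for (const auto& kv : pairs) totals[kv.first] += kv.second;
    return totals;
}
```

Like the C# mapper, the sketch is case-sensitive, so "the" and "THE" are counted as different words.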

RecordWriter

Writes output in CSV format to an Azure blob. If the output size is less than or equal to 4MB, it is also kept in a buffer which is sent back to the controller.

[Serializable]

public class BlobTextCsvWriter : IRecordWriter<string, string>

{

    [NonSerialized]

    private const int BufferSize = 4 * 1024 * 1024; // 4MB

    [NonSerialized]

    private bool bufferFull;

    [NonSerialized]

    protected CloudBlob blob;

    [NonSerialized]

    private byte[] buffer;

    [NonSerialized]

    private int noOfBytesWrittenInBuffer;

    public CloudClient CloudClient { get; private set; }

    public string ContainerName { get; private set; }

    public string DirectoryName { get; private set; }

    // CTOR

    public BlobTextCsvWriter(

        CloudClient cloudClient, string containerName, string directoryName = null)

    {

        if (cloudClient == null)

        {

            throw new ArgumentNullException("cloudClient");

        }

        if (string.IsNullOrWhiteSpace(containerName)) // also covers null and empty

        {

            throw new ArgumentNullException("containerName");

        }

        this.CloudClient = cloudClient;

        this.ContainerName = containerName;

        this.DirectoryName = directoryName;

    }

    public virtual void Write(

        string outputPartition, IEnumerable<KeyValuePair<string, string>> records)

    {

        CloudBlobContainer container

            = this.CloudClient.BlobClient.GetContainerReference(this.ContainerName);

        container.CreateIfNotExist();

        this.blob = (!string.IsNullOrEmpty(DirectoryName))

            ? container

                .GetDirectoryReference(DirectoryName)

                .GetBlobReference(outputPartition)

            : container.GetBlobReference(outputPartition);

        buffer = new byte[BufferSize];

        IEnumerator<KeyValuePair<string, string>> enumerator

            = records.GetEnumerator();

        WriteToBuffer(enumerator, buffer);

        using (Stream stream = blob.OpenWrite())

        {

            stream.Write(buffer, 0, noOfBytesWrittenInBuffer);

            if (bufferFull)

            {

                 using (StreamWriter sw = new StreamWriter(stream))

                {

                    // Re-write the current record as the buffer is full.

                    WriteRecord(sw, enumerator.Current);

                    while (enumerator.MoveNext())

                    {

                        WriteRecord(sw, enumerator.Current);

                    }

                }

            }

        }

    }

    public IReduceResult<string, string> GetResult()

    {

        if (bufferFull)

        {

            return new BlobTextResult(this.CloudClient, blob.Uri.AbsoluteUri);

        }

        else

        {

            byte[] localBuffer = new byte[noOfBytesWrittenInBuffer];

            Buffer.BlockCopy(buffer, 0, localBuffer, 0, noOfBytesWrittenInBuffer);

            return new BlobTextResult(

                this.CloudClient, blob.Uri.AbsoluteUri, localBuffer);

        }

    }

    private void WriteToBuffer(

        IEnumerator<KeyValuePair<string, string>> enumerator, byte[] buffer)

    {

        try

        {

            using (MemoryStream ms = new MemoryStream(buffer))

            {

                using (StreamWriter sw = new StreamWriter(ms))

                {

                    while (enumerator.MoveNext())

                    {

                        WriteRecord(sw, enumerator.Current);

                        sw.Flush();

                        noOfBytesWrittenInBuffer = (int)ms.Position;

                    }

                }

            }

        }

        catch (NotSupportedException)

        {

            bufferFull = true;

        }

    }

    private void WriteRecord(StreamWriter sw, KeyValuePair<string, string> record)

    {

        sw.Write(record.Key);

        sw.Write(",");

        sw.Write(record.Value);

        sw.WriteLine();

    }

}
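The writer above follows a "buffer first, spill to the blob if too big" pattern. A rough sketch of that decision logic in standard C++ is shown below, assuming the blob is simulated by a std::string and using "\n" as the line ending. WriteResult and write_csv are hypothetical names, not part of Daytona:

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

struct WriteResult {
    std::string blob;         // everything written to the (simulated) blob
    std::string inline_copy;  // copy sent back to the caller; empty on overflow
    bool buffer_full;         // true if the output did not fit in the buffer
};

// Write every record to the blob; keep an inline copy only while the
// whole output still fits into buffer_size bytes.
WriteResult write_csv(
    const std::vector<std::pair<std::string, std::string>>& records,
    std::size_t buffer_size) {
    WriteResult r{"", "", false};
    for (const auto& rec : records) {
        const std::string line = rec.first + "," + rec.second + "\n";
        r.blob += line;  // the blob always receives every record
        if (!r.buffer_full && r.inline_copy.size() + line.size() <= buffer_size) {
            r.inline_copy += line;
        } else {
            r.buffer_full = true;  // from here on only the blob is written
        }
    }
    if (r.buffer_full) r.inline_copy.clear();  // too large to return inline
    return r;
}
```

This mirrors GetResult above: small outputs travel back to the controller together with the blob URI, large outputs are referenced by URI only.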

Deployment

Create an Azure service & storage account using the Windows Azure developer portal:

  • Create a hosted Azure service onto which the Daytona service will be deployed.
  • Create an Azure storage account used by the Daytona service to store information related to applications such as inputs, outputs, results etc.

Update ServiceConfiguration.cscfg with information regarding Master & Slave roles

  • Master role - is responsible for picking up new applications from the storage, handling communication with all the slaves, assigning Map and Reduce tasks to available slaves, monitoring task execution etc.
  • Slave role - is responsible for executing assigned map and reduce tasks, handling communication with the master as well as other slaves, reporting to the master about task execution etc.
  • Instances - instance count of the Slave role as per the anticipated load and the number of cores allocated to your Azure project. The number of instances for the Master role must be 1.
  • DiagnosticsConnectionString - Azure storage account connection string which will be used for logging by the worker roles.
  • StorageConnectionString - Azure storage account connection string which will be used for storing the input and output data.
  • MapTaskSlotSize - The maximum number of map tasks that can be executed at a slave in parallel.
  • ReduceTaskSlotSize - The maximum number of reduce tasks that can be executed at a slave in parallel.

<?xml version="1.0" encoding="utf-8"?>

<ServiceConfiguration

    serviceName="Research.MapReduce.CloudHost"  

    xmlns="http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceConfiguration"

    osFamily="1"

    osVersion="*">

    <Role name="Research.MapReduce.CloudHost.Master">

        <!--Number of master instances must always be kept as 1-->

        <Instances count="1" />

        <ConfigurationSettings>

            <Setting

                name="DiagnosticsConnectionString"

                value="

                    DefaultEndpointsProtocol=https;

                    AccountName=XXXXXXXX;

                    AccountKey=XXXXXXXXXX"/>

            <Setting

                name="StorageConnectionString"

                value="

                    DefaultEndpointsProtocol=https;

                    AccountName=XXXXXXXX;

                    AccountKey= XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX "/>

       </ConfigurationSettings>

    </Role>

    <Role name="Research.MapReduce.CloudHost.Slave">

        <Instances count="2" />

        <ConfigurationSettings>

            <Setting

                name="DiagnosticsConnectionString"

                value="

                    DefaultEndpointsProtocol=https;

                    AccountName=XXXXXXXX;

                    AccountKey=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX "/>

            <Setting

                name="StorageConnectionString"

                value="

                    DefaultEndpointsProtocol=https;

                    AccountName=XXXXXXXX;

                    AccountKey=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"/>

            <Setting name="MapTaskSlotSize" value="4" />

            <Setting name="ReduceTaskSlotSize" value="1" />

        </ConfigurationSettings>

    </Role>

</ServiceConfiguration>

Update ServiceDefinition.csdef with information regarding the Role element's VMSize and LocalStorage

Publish & Deploy the Service

  • 2 options to deploy a Windows Azure cloud service: 1) using the Azure management portal, 2) using the Visual Studio IDE.
  • If deploying using the Azure management portal - Daytona provides a default "Medium" VM size precompiled package.

Running the Application

Upload Input - Daytona Map-Reduce expects input data to come from one of the Azure storage services or SQL Azure. Input can be uploaded to an Azure storage service using tools such as Cerebrata, CloudBerry etc. SQL Server Management Studio can be used to upload data to SQL Azure.

Submit - All the assemblies and references are bundled into a single package. To submit the WordCount application from a console app:

// A regular C# string literal cannot span lines, so the parts are concatenated.
var masterConnectionString =

    "DefaultEndpointsProtocol=http;" +

    "AccountName=<AccountName>;" +

    "AccountKey=<AccountKey>";

var client = new MapReduceClient(

    "word-count-" + Guid.NewGuid(),

    typeof(WordCountController).AssemblyQualifiedName);

client.Submit(masterConnectionString, false);

Abort - Once the application has been submitted to the Daytona runtime, it can still be aborted. Once the abort is requested, the runtime will kill the master job and send an abort request to all slaves.

client.RequestAbort(masterConnectionString);

Tracking and Result

bool concluded = client.WaitForCompletion(

    TimeSpan.FromMinutes(10), TimeSpan.FromSeconds(5));

if (concluded && client.Succeeded)

{

    foreach (KeyValuePair<string, string> result in client.Results)

    {

        Console.WriteLine("{0} \t: {1}", result.Key, result.Value);

    }

}

else

{

    Console.WriteLine(client.FailReason);

}
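The internals of WaitForCompletion are not shown in this post. Assuming its two TimeSpan arguments are an overall timeout and a polling interval, the general "poll until done or timed out" pattern could look like the following standard C++ sketch. The function wait_for_completion and its parameters are hypothetical stand-ins, not the Daytona client API:

```cpp
#include <chrono>
#include <thread>

// Hypothetical helper: call is_done() every poll_interval until it returns
// true or the timeout elapses. Returns true only if the job concluded in time.
template <typename Predicate>
bool wait_for_completion(Predicate is_done,
                         std::chrono::milliseconds timeout,
                         std::chrono::milliseconds poll_interval) {
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    for (;;) {
        if (is_done()) return true;
        if (std::chrono::steady_clock::now() >= deadline) return false;
        std::this_thread::sleep_for(poll_interval);
    }
}
```

The status check runs before the deadline test, so a job that has already finished is reported as concluded even with a zero timeout.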

The MapReduce client tool

The MapReduce client tool (mrclient.exe) is a command-line utility (CLU) that targets the Daytona MapReduce framework to perform common operations such as application submission and monitoring from the command line. Syntax:

mrclient [-cs <connection string>] -c <command> [<switch>...] [-h] [-?]

Commands:

  • createpkg - creates a new application package in the Azure storage.
  • deletepkg - deletes an application package from the Azure storage.
  • submitapp - submits a MapReduce application for execution.
  • listapps - lists applications submitted to the service which pass the filtering criteria specified by the switches.
  • getapp - gets the details of an application.
  • abortapp - requests the service to abort a running application.
  • deleteapp - deletes application information from the Azure storage.

Conclusion

We've had a look at the workings of Daytona - an iterative MapReduce runtime for Windows Azure.

Remember, this is Microsoft's counterpart to Hadoop; the competing distributed computing infrastructure + API of Dryad + LINQ to HPC is now redundant.

In this blog post, we've had a look at the development steps, key properties, architecture, terminology, operational steps, and API of Daytona. We've also done a walkthrough of implementing, deploying and running a Word Count MapReduce app for Azure.

Good luck using Daytona to scale out computations for analysis of distributed data and construct distributed computing cloud data-analysis services on Windows Azure!

Posted On Thursday, December 08, 2011 7:26 AM | Feedback (1)

Tuesday, December 06, 2011 #

The Windows Azure HPC Scheduler SDK

Overview

Windows HPC Server 2008 is infrastructure for high-end applications that require high performance computing clusters - i.e. for scaling out parallelizable workloads across many compute nodes in a grid. These compute nodes can be coordinated by a head node, which in turn can be proxied via a service broker node that exposes a SOA WCF interface for job scheduling. Additional functionality includes the ability to coordinate between job processes running on nodes via MPI (message passing interface).

The Windows Azure Scheduler is basically a complete & modular deployment of all HPC Server components to run in the cloud, not just compute nodes, and without having to rely on a VM role to host the Head Node.

Functionality: HPC Azure job scheduling and resource management, runtime support for MPI and SOA, web-based job submission interfaces, and persistent state management of job queue and resource configuration.

A basic deployment includes the following:

  • a SQL Azure database - stores state information about the job queue and resource configuration
  • a head node - schedules jobs and supports SOA workloads
  • a web front-end node - provides a web-based job submission Scheduler portal
  • compute nodes - run jobs

For example, you can deploy a SOA service and client to the head node, and an MPI application to the compute nodes.

It is possible to submit jobs in multiple ways:

  • using the Scheduler Portal
  • via RDC to the head node and then using the HPC Job Manager

Note that while the Azure HPC Scheduler supports LINQ to HPC, this has been discontinued and made redundant by Daytona Azure Iterative MapReduce functionality, so it will not be discussed further in this post.

Plug HPC functionality into Azure Roles

SDK plug-ins

The SDK provides different plug-ins which can be imported via ServiceDefinition.csdef file to extend Windows Azure role functionality for HPC support.

To recap Azure config:

  • ServiceDefinition.csdef specifies the settings that are used by Windows Azure to configure a hosted service (such as the type of role that you want to deploy – worker, web, or VM).
  • ServiceConfiguration.cscfg specifies the values for these settings (such as how many instances of each you want to deploy).

The minimum required config imports the appropriate plug-ins to enable the Azure Scheduler for job scheduling and the node manager for coordinating resource management. You can optionally enable the SOA session manager to manage sessions between the SOA client and the service hosts, and the SOA broker to manage messages between the SOA client and the service hosts.

  • HpcHeadNode - (Worker Role only). Enables the Azure Scheduler & the node manager.
  • HpcComputeNode - (Worker Role only). Enables the node manager.
  • HpcBrokerNode - (Worker Role only). Enables the SOA broker & the node manager.
  • HpcVmNode - (VM Role only). Enables the node manager
  • HpcWebFrontEnd - (Web Role only). Enables the Azure Scheduler Portal and a RESTful HTTP web service for job submission.
  • HpcWebFrontEndHeadNode - (Worker or web role) - enables the Azure Scheduler, the node manager, and the Azure Scheduler Portal + job submission web service.

Add the SDK plug-ins

3 types of extended roles are required for a minimum configuration:

  • Head node: Azure worker role with the HpcHeadNode plug-in added. The HpcHeadNode plug-in provides job scheduling and resource management functionality.
  • Compute node: Azure worker role with the HpcComputeNode plug-in added to support MPI and SOA applications + workload management features.
  • Front end: Azure web role with the HpcWebFrontEnd plug-in added to provide a built-in web portal and an HTTP RESTful web service for submission to the Azure Job Scheduler.

E.g. ServiceDefinition.csdef:

<?xml version="1.0" encoding="utf-8"?>

<ServiceDefinition

    name="AzureSampleService"    

    xmlns="http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceDefinition">

    <WebRole name="FrontEnd" vmsize="Small">

        <Sites>

            <Site name="HPCPortal"

                physicalDirectory="C:\Program Files\Windows Azure Scheduler SDK Preview\HpcPortal">

                <Bindings>

                    <Binding name="HPCWebServiceHttps"    

                        endpointName="Microsoft.Hpc.Azure.Endpoint.HPCWebServiceHttps"/>

                </Bindings>

            </Site>

        </Sites>

        <Endpoints>

            <InputEndpoint name="Endpoint1" protocol="http" port="80" />

        </Endpoints>

        <Imports>

            <Import moduleName="Diagnostics" />

            <Import moduleName="HpcWebFrontEnd" />

            <Import moduleName="RemoteAccess" />

        </Imports>

    </WebRole>

    <WorkerRole name="HeadNode" vmsize="Small">

        <Imports>

            <Import moduleName="Diagnostics" />

            <Import moduleName="HpcHeadNode" />

            <Import moduleName="RemoteAccess" />

            <Import moduleName="RemoteForwarder" />

        </Imports>

    </WorkerRole>

    <WorkerRole name="ComputeNode" vmsize="Small">

        <Imports>

            <Import moduleName="Diagnostics" />

            <Import moduleName="HpcComputeNode" />

            <Import moduleName="RemoteAccess" />

        </Imports>

    </WorkerRole>

</ServiceDefinition>

 

Build & Deployment Requirements

Windows Azure SDK v 1.5 - http://www.microsoft.com/windowsazure/sdk/.

Visual Studio 2010 Professional + Windows Azure Tools for Visual Studio - (To build & publish the sample applications)

Windows Azure Scheduler SDK - http://connect.microsoft.com/hpc (64-bit only).

HPC Pack 2008 R2 with SP3 release candidate client utilities (required for SOA Service Broker) - http://connect.microsoft.com/hpc.

HPC Pack 2008 R2 MS-MPI redistributable package - Required for MPI support - http://www.microsoft.com/download/en/details.aspx?displaylang=en&id=14737.

HPC Pack 2008 R2 SDK - To write applications that use the HPC Pack APIs - http://www.microsoft.com/download/en/details.aspx?id=26645.

Windows Azure subscription - to configure a hosted service, storage account, and SQL Azure database to deploy HPC scheduler applications - http://go.microsoft.com/fwlink/p/?LinkId=205528

Installing the Azure Scheduler SDK + Sample Apps

To install the SDK - open an elevated Command Prompt (Run as Administrator) & run the installation program: WindowsAzureSchedulerSDK_x64.msi. The HPC plug-in folders will be installed in the C:\Program Files\Windows Azure SDK\v1.5\bin\plugins folder.

To install the sample apps - unzip WAS-SDK-Samples-3921.zip, build AzureSampleService.sln, configure the AppConfigure project as the startup project & publish to Windows Azure. Arbitrarily set the number of instances of each node type, bearing the Azure billing plan in mind. Settings: Solution Configurations == Debug, Solution Platforms == Mixed Platforms. Projects in the solution:

  • AppConfigure - Configures and publishes the other projects
  • CertificateGenerator - Helper library that creates Windows Azure management certificates
  • AzureSampleService - Defines the roles for the Azure Scheduler and Azure service configuration files
  • ComputeNode - Defines the properties of compute nodes
  • FrontEnd - Defines the properties of the web front end node
  • HeadNode - Defines the properties of the head node

Including application binaries in the deployment

  • Specify which roles should include which application binaries.
  • For example, a service might deploy a SOA service and client to the head node, and an MPI application to the compute nodes.
  • To include binaries in the bin folder for a role, you can either specify it as the output directory for the application project, or add the application as a reference in one or more role projects.

Deploy the sample Azure Scheduler service - run the AppConfigure project in Visual Studio, provide your Azure subscription ID + arbitrary names for the hosted service, storage account, and scheduler deployment. Additionally specify a certificate .cer file (you need to upload this certificate to the relevant Azure subscription via the windows.azure.com portal). The data is specified in the appconfigure.exe.config file. Publish to Azure.

Validate the Deployment

  • Make a Remote Desktop connection to the role instances
  • Verify that Windows Azure Scheduler role binaries are deployed in the location: E:\plugins\<PluginName>\HPCPack\bin. Additionally, on compute node role and head node role instances, you can verify that the sample applications are installed in the E:\approot folder.
  • Run an HPC clusrun command on the nodes in the Windows Azure Scheduler.
  • Run the hpcjobmanager command to start HPC Job Manager - provides a GUI for submitting and monitoring jobs on the cluster.
  • Connect to the Windows Azure Scheduler Web Portal using a web browser, e.g. https://<prefix>.cloudapp.net/Portal, and submit a simple test job.
  • View events in Event Viewer in Applications and Services Logs\Microsoft\HPC\Scheduler\Operational.

Running MPI Applications

  • Configure the firewall to allow MPI communication between compute nodes - configure an application-based firewall exception on the nodes by using the hpcfwutil command.
  • To submit an MPI job to run on compute nodes, use job submit to start an mpiexec command that runs the .exe, e.g.
job submit /numnodes:<NumComputeNodes> mpiexec <.exe> <NumIntervals> <NumIterations>
  • To view the output of the calculation, use the task view <jobId> command.
  • Unlike MPI programs deployed to an on-premises Windows HPC Server cluster, MPI applications deployed with the Windows Azure Scheduler cannot be debugged with the Visual Studio add-in debugger.

Running SOA Applications

  • SOA clients use the Microsoft.Hpc.Scheduler and Microsoft.Hpc.Scheduler.Session APIs to send and receive messages from a service that is hosted on one or more compute nodes. These APIs leverage the Azure Scheduler infrastructure for identification of available service hosts, distribution of service requests, load-balancing, and error handling.
  • Use the hpcpack and hpcsync CLUs to upload service DLLs to an Azure storage account and then deploy the service to each compute node. hpcsync automatically deploys the DLLs to the expected location.
  • To package a SOA service:
hpcpack create <folder.zip> <folder>
  • To upload a package to an Azure storage account, where <yourStorage> is the name of your storage account and <yourKey> is the primary access key to your storage account:
hpcpack upload <folder.zip> /account:<yourStorage> /key:<yourKey>
  • To deploy the package to all compute nodes:
clusrun hpcsync
  • Run the SOA client on the head node - to submit service requests, run the SOA client that is installed on the head node.
  • To view the compute nodes that were allocated to process your service requests, you can use HPC Job Manager.
  • To view resource allocation - Start > All Programs > Microsoft HPC Pack 2008 R2 > HPC Job Manager > Job Management pane > All Jobs.
  • Select a job from the list. View Job dialog box > Allocated Nodes - lists the resources that were used to process the service requests.
  • A deployment that supports SOA workloads provides broker and session management functionality to help distribute incoming service requests to the available service hosts and then return the messages to the client. This functionality can be enabled in the ServiceConfiguration.cscfg file for the deployment. The necessary elements can be added to the configuration file by calling the EnableSOA() method (in the Microsoft.Hpc.Azure.ClusterConfig namespace).

Deployment Configuration Settings

Subscription ID - Windows Azure subscription ID. Obtain from the Windows Azure Management Portal.

Management certificate - name of an X.509 v3 certificate that contains a public key and is saved as a .cer file

Service name - The DNS name of a hosted service in your subscription - must be unique across Windows Azure services.

Storage account - A subdomain name used in URLs for a storage account in your subscription - must be lowercase alphanumerical & unique across Windows Azure services.

SQL Server location - The region selected for the SQL Azure Server that is used for the deployment

SQL database name

SQL Server administrator

SQL administrator password

Deployment name - The name of the job scheduler component that is installed on the head node role and that is used for the Windows Azure Scheduler deployment on the hosted service. Sets the CCP_SCHEDULER environment variable on the role instances

Administrator name - administrator of the job scheduler

Password

Head nodes - Number of instances of the HeadNode role that are in the sample Windows Azure Scheduler

Compute nodes - Number of instances of the ComputeNode role that are deployed in the sample Windows Azure Scheduler

Web frontend nodes - Number of instances of the FrontEnd role that are deployed in the sample Windows Azure Scheduler.

Conclusion

The Windows Azure Scheduler is basically a complete & modular deployment of all HPC Server components to run in the cloud, not just compute nodes, and without having to rely on a VM role to host the Head Node. Functionality includes the following: HPC Azure job scheduling and resource management, runtime support for MPI and SOA, web-based job submission interfaces, and persistent state management of job queue and resource configuration.

In this blog post, we have looked at using SDK plugins to extend Azure roles with HPC functionality. We have also examined build and deployment requirements and the installation process.

Good luck using the Windows Azure HPC Scheduler SDK to develop a complete HPC solution in the cloud!

Posted On Tuesday, December 06, 2011 6:36 AM | Feedback (0)

Sunday, December 04, 2011 #

Unit Testing a ConcurrentPriorityQueue

I’m leveraging a ConcurrentPriorityQueue – from http://code.msdn.microsoft.com/ParExtSamples.

This class is basically a thread-safe IProducerConsumerCollection wrapper for a binary heap that prioritizes smaller values. You use it as you would a dictionary, where the priority is the key, except you can have duplicate keys (i.e. values with the same priority).

I needed to demonstrate to a customer that it worked.

I set up my queue and my priority enum values:

var q = new ConcurrentPriorityQueue<int, string>();

var priorityValues = Enum.GetValues(typeof(JobPriority));

I then randomly enqueued different priorities from different threads:

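// System.Random is not thread-safe, so give each worker thread its own instance.
var random = new ThreadLocal<Random>(() => new Random(Guid.NewGuid().GetHashCode()));
Parallel.For(0, 1000, i =>
{
    var randomPriority = (JobPriority) priorityValues.GetValue(random.Value.Next(priorityValues.Length));
    Debug.WriteLine("enqueueing:" + randomPriority);
    q.Enqueue((int) randomPriority, randomPriority.ToString());
});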
 
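and finally I asserted that dequeueing occurs in order:
// The two declarations below are assumed; the snippet as posted omitted them.
KeyValuePair<int, string> printJobKVP;
var previousJobPriority = (JobPriority) int.MinValue; // assumes an int-backed enum
while (q.TryDequeue(out printJobKVP))
{
    var jobPriority = (JobPriority)Enum.Parse(typeof(JobPriority), printJobKVP.Value);
    Assert.IsTrue(jobPriority >= previousJobPriority);
    previousJobPriority = jobPriority;
}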
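The same style of check can be reproduced outside .NET. The sketch below is a stand-in written in standard C++, not the ParExtSamples class: LockedMinQueue is a hypothetical mutex-guarded min-heap, and dequeues_in_priority_order enqueues from several threads and then verifies that the drained priorities never decrease.

```cpp
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Minimal stand-in for a concurrent priority queue: a mutex around a min-heap.
class LockedMinQueue {
public:
    void enqueue(int priority) {
        std::lock_guard<std::mutex> lock(mutex_);
        heap_.push(priority);
    }
    bool try_dequeue(int& priority) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (heap_.empty()) return false;
        priority = heap_.top();
        heap_.pop();
        return true;
    }
private:
    std::mutex mutex_;
    std::priority_queue<int, std::vector<int>, std::greater<int>> heap_;
};

// Enqueue from several threads, then drain; returns true if the drained
// values never decrease and no item was lost.
bool dequeues_in_priority_order(int threads, int items_per_thread) {
    LockedMinQueue q;
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&q, t, items_per_thread] {
            // Priorities cycle through 0..4 in a thread-dependent order.
            for (int i = 0; i < items_per_thread; ++i) q.enqueue((i * 7 + t) % 5);
        });
    }
    for (auto& w : workers) w.join();

    int previous = -1, current = 0, drained = 0;
    while (q.try_dequeue(current)) {
        if (current < previous) return false;
        previous = current;
        ++drained;
    }
    return drained == threads * items_per_thread;
}
```

Calling dequeues_in_priority_order(4, 250) mirrors the 1000 enqueues in the C# test. Counting the drained items also catches lost updates, which the ordering assertion alone would miss.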
 

Posted On Sunday, December 04, 2011 1:05 PM | Feedback (0)

C++ AMP

Overview

  • C++ AMP is a GPGPU API - it allows you to define functions (kernels) that take some input, perform an expensive calculation on the GPU and return the output to the CPU. A GPU supports fast computation across many SIMD-like cores - NVidia Tesla supports 512 cores compared to the paltry 10 cores available on the CPU today - even Intel's Knights Corner will only support 60 cores next year. It is suitable only for certain classes of problems (i.e. data parallel algorithms) and not for others (e.g. algorithms with branching, recursion or other complex flow control).
  • Caveat - you pay a high cost for transferring the input data from the CPU to the GPU and the results back to the CPU, so the computation itself has to be long enough to justify the overhead transfer costs.
  • DirectX 11 offers the DirectCompute API for GPGPU - this requires you to code in HLSL (a C-like language for expressing pixel, vertex & tessellation shaders for graphics pipelines). C++ AMP abstracts away from that and is part of Visual C++. You don't need to use a different compiler or learn different syntax.
  • The C++ AMP programming model includes multidimensional arrays, indexing, memory transfer, tiling, and a mathematical function library. C++ AMP language extensions and compiler restrictions enable you to control how data is moved from the CPU to the GPU and back, which enables you to control the performance impact of moving the data back and forth.
  • Note: competitors of C++ AMP include OpenCL (a vendor-neutral standard) and NVIDIA's CUDA.
  • System Requirements - compile time: Visual Studio 2011 dev preview. Runtime: DirectX 11.
  • The C++ AMP Math Library provides support for double-precision functions.

Canonical Example – Matrix Addition

#include <amp.h>

#include <iostream>

using namespace concurrency;

using namespace std;

void CampMethod() {

    int aCPP[] = {1, 2, 3, 4, 5};

    int bCPP[] = {6, 7, 8, 9, 10};

    int sumCPP[5] = {0, 0, 0, 0, 0};

    // Create C++ AMP wrappers for GPU transport – in this case 1D vectors of int

    array_view<int, 1> a(5, aCPP);

    array_view<int, 1> b(5, bCPP);

    array_view<int, 1> sum(5, sumCPP);

    parallel_for_each(

        // Define the compute domain, which is the set of threads that are created.

        sum.grid,

        // Define the lambda expression to run on each thread on the accelerator – pass by val

        [=](index<1> idx) mutable restrict(direct3d)

        {

            sum[idx] = a[idx] + b[idx];

        }

    );

    // Print the results. The expected output is "7, 9, 11, 13, 15".

    for (int i = 0; i < 5; i++) {

        cout << sum[i] << "\n";

    }

}

  • This example uses C++ arrays to construct three C++ AMP array_view objects. You supply four values to construct an array_view object: the data values, the rank, the element type, and the length of the array_view object in each dimension. The rank and type are passed as type parameters. The data and length are passed as constructor parameters.
  • The parallel_for_each function provides a mechanism for iterating through the data elements, or compute domain. In this example, the compute domain is specified by sum.grid. The code that you want to execute is contained in a lambda statement, or kernel function. The restrict(direct3d) modifier tells the compiler to check that the kernel uses only the subset of C++ that can run on the hardware, or accelerator.
  • The index Class variable, idx, is declared with a rank of one to match the rank of the array_view object. It accesses the individual elements of the array_view objects.
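When experimenting with a kernel like this, it helps to have a CPU-only reference to compare the accelerator output against. The following is a minimal sketch in standard C++, assuming both inputs have the same length. The name add_elementwise is hypothetical and is not part of C++ AMP:

```cpp
#include <cstddef>
#include <vector>

// Sequential reference for the kernel above: one loop iteration per index
// in the compute domain, each writing sum[i] = a[i] + b[i].
// Assumes a and b have the same length.
std::vector<int> add_elementwise(const std::vector<int>& a,
                                 const std::vector<int>& b) {
    std::vector<int> sum(a.size());
    for (std::size_t i = 0; i < a.size() && i < b.size(); ++i) {
        sum[i] = a[i] + b[i];
    }
    return sum;
}
```

Each loop iteration corresponds to one thread in the parallel_for_each call. Because no iteration depends on another, the work is data parallel and maps cleanly onto the GPU.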

Shaping and Indexing Data: index, extent, and grid

  • You must define the data values and declare the shape of the data before you can run the kernel code. All data is defined to be an array (rectangular), and you can define the array to have any rank (number of dimensions). The data can be any size in any of the dimensions. If you use an array_view object, the origin can use non-zero index values. For convenience, the runtime library has specific types and functions for 3-dimensional arrays.
  • index Class
    • The index Class specifies a location in the array or array_view object by encapsulating the offset from the origin in each dimension into one object.
    • The following example creates a one-dimensional index that specifies the third element in a one-dimensional array_view object. The index is used to print the third element in the array_view object. The output is 3.

int aCPP[] = {1, 2, 3, 4, 5};

array_view<int, 1> a(5, aCPP);

index<1> idx(2);

cout << a[idx];

// Output: 3.

    • The following example creates a two-dimensional index that specifies the element where the row = 1 and the column = 2 in a two-dimensional array_view object. The first parameter in the index constructor is the row component, and the second parameter is the column component. The output, the value in the cell at index [1,2], is 6.

int aCPP[] = {1, 2, 3, 4, 5, 6};

// 2x3 2D matrix created from array input

array_view<int, 2> a(2, 3, aCPP);

index<2> idx(1, 2);

cout << a[idx];

// Output: 6

2 rows, 3 columns:

1 2 3

4 5 6

    • The following example creates a three-dimensional index that specifies the element where the depth = 0, the row = 1, and the column = 3 in a three-dimensional array_view object. Notice that the first parameter is the depth component, the second parameter is the row component, and the third parameter is the column component. The output is 8.

int aCPP[] = {

1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,

1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};

// 3D matrix: Length is 4 in the x dimension, 3 in the y dimension, and 2 in the z dimension.

array_view<int, 3> a(2, 3, 4, aCPP);

// Specifies the element at x = 3, y = 1, z = 0.

index<3> idx(0, 1, 3);

cout << a[idx] << "\n";

// Output: 8.
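As a rough CPU-side illustration of how a multidimensional index maps onto the flat C++ array, here is a sketch in plain standard C++. It does not use the C++ AMP API: the helper names flat_offset_2d and flat_offset_3d are made up for this illustration, and it assumes a row-major layout in which the last index component (the column) varies fastest, which is what the examples above imply.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helpers (not part of C++ AMP): row-major offset of an element.
// The last component (column) varies fastest.
std::size_t flat_offset_2d(std::size_t row, std::size_t col,
                           std::size_t rows, std::size_t cols)
{
    assert(row < rows && col < cols);
    return row * cols + col;
}

std::size_t flat_offset_3d(std::size_t depth, std::size_t row, std::size_t col,
                           std::size_t depths, std::size_t rows, std::size_t cols)
{
    assert(depth < depths && row < rows && col < cols);
    return (depth * rows + row) * cols + col;
}

// Element lookups on the same data as the array_view examples.
int element_2d()
{
    const int data[] = {1, 2, 3, 4, 5, 6};             // 2 rows x 3 columns
    return data[flat_offset_2d(1, 2, 2, 3)];           // row 1, column 2
}

int element_3d()
{
    const int data[] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
                        1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};  // 2 x 3 x 4
    return data[flat_offset_3d(0, 1, 3, 2, 3, 4)];     // depth 0, row 1, column 3
}
```

With this layout, element_2d() evaluates to 6 (flat offset 5) and element_3d() evaluates to 8 (flat offset 7).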

  • extent Class
    • The extent class is a multidimensional slice: it specifies the length of the data in each dimension of the array or array_view object.

int aCPP[] = {

1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,

1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};

// 3D matrix - 3 rows, 4 columns, depth is 2.

array_view<int, 3> a(2, 3, 4, aCPP);

cout << "The number of columns is " << a.extent[2] << "\n";

cout << "The number of rows is " << a.extent[1] << "\n";

cout << "The depth is " << a.extent[0]<< "\n";

    • You can construct an array or array_view object by using an extent object in the constructor.

int aCPP[] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24};

extent<3> e(2, 3, 4);

array_view<int, 3> a(e, aCPP);

  • grid Class
    • The grid Class specifies an extent at an index. It enables you to specify the set of threads to be created and to conveniently access a subset of your data by defining an extent at a specific location. The array / array_view class exposes a grid object that is defined to have the index at the origin of the array and the extent of the whole array.

int aCPP[] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};

// Length is 4 in the x dimension, 3 in the y dimension, and 2 in the z dimension.

array_view<int, 3> a(2, 3, 4, aCPP);
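To make the "extent at an index" idea a bit more concrete, here is a CPU-only sketch in standard C++. Extent2, Grid2 and sum_region are hypothetical names invented for this illustration and are not the C++ AMP types; the sketch only shows how an origin plus an extent selects a rectangular subset of row-major data.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for the concepts above; these are not the C++ AMP types.
struct Extent2 {
    std::size_t rows, cols;
    std::size_t size() const { return rows * cols; }   // total element count
};

struct Grid2 {
    std::size_t origin_row, origin_col;  // where the region starts
    Extent2 extent;                      // how far it reaches in each dimension
};

// Sum the elements of a row-major matrix that fall inside the given region.
int sum_region(const std::vector<int>& data, Extent2 whole, Grid2 region)
{
    assert(data.size() == whole.size());
    assert(region.origin_row + region.extent.rows <= whole.rows);
    assert(region.origin_col + region.extent.cols <= whole.cols);

    int total = 0;
    for (std::size_t r = 0; r < region.extent.rows; ++r)
        for (std::size_t c = 0; c < region.extent.cols; ++c)
            total += data[(region.origin_row + r) * whole.cols
                          + (region.origin_col + c)];
    return total;
}

int demo_sum_region()
{
    const std::vector<int> data = {1, 2,  3,  4,
                                   5, 6,  7,  8,
                                   9, 10, 11, 12};       // 3 rows x 4 columns
    const Extent2 whole = {3, 4};
    const Grid2 region = {1, 1, {2, 2}};  // 2 x 2 block starting at row 1, column 1
    return sum_region(data, whole, region);              // 6 + 7 + 10 + 11
}
```

demo_sum_region() adds up the 2 x 2 block 6, 7, 10, 11 and returns 34.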

Moving Data to the Accelerator: array and array_view

  • Two data containers used to move data to the accelerator are defined in the runtime library: the array class and the array_view class. The fundamental difference between the two is that the array class creates a deep copy of the data when the object is constructed, so an array is captured by reference in the lambda ([&]). The array_view class is a wrapper that copies the data when the kernel function accesses it, so an array_view is captured by value in the lambda ([=]).
  • array Class
    • When an array object is constructed, a deep copy of the data is created on the accelerator. The kernel function modifies the copy on the accelerator. When the execution of the kernel function is finished, you must manually copy the data back to the host. The following example multiplies each element in a vector by 10. After the kernel function is finished, the vector conversion operator is used to copy the data back into the vector object.

vector<int> data(5);

for (int count = 0; count < 5; count++)

{

    data[count] = count;

}

array<int, 1> a(5, data);

parallel_for_each(

    a.grid,

    //a is explicitly captured by reference. Other variables will be captured by value

    [=, &a](index<1> idx) restrict(direct3d)

    {

        a[idx] = a[idx] * 10;

    }

);

data = a;

for (int i = 0; i < 5; i++)

{

    cout << data[i] << "\n";

}

  • array_view Class
    • The array_view has nearly the same members as the array class, but the underlying behavior is not the same. Data passed to the array_view constructor is not replicated on the GPU as it is with an array constructor. Instead, the data is copied to the array_view object when the kernel function is executed. Therefore, if you create two array_view objects that use the same data, both array_view objects refer to the same memory space. When you do this, you need to synchronize any multithreaded access. Additionally, your kernel function must be dense, that is, it must update every element of the array_view object to update the array_view data.
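The ownership difference between the two containers can be sketched on the CPU in standard C++. OwnedCopy and View below are hypothetical stand-ins, not the C++ AMP classes, and the sketch models only who owns the storage; it says nothing about when data actually moves between host and accelerator.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, CPU-only stand-ins; not the C++ AMP classes.

// Copy semantics: owns a separate buffer, like a deep copy made at construction.
struct OwnedCopy {
    std::vector<int> buffer;
    explicit OwnedCopy(const std::vector<int>& src) : buffer(src) {}
};

// View semantics: a non-owning pointer plus length over the caller's data.
struct View {
    int* data;
    std::size_t length;
    View(int* d, std::size_t n) : data(d), length(n) {}
    int& operator[](std::size_t i) const { assert(i < length); return data[i]; }
};

// Multiply through an owned copy: the source is unchanged until copied back.
std::vector<int> times_ten_via_copy(std::vector<int>& source, bool copy_back)
{
    OwnedCopy a(source);
    for (std::size_t i = 0; i < a.buffer.size(); ++i)
        a.buffer[i] *= 10;
    if (copy_back)
        source = a.buffer;   // the explicit "copy the data back" step
    return a.buffer;
}

// Multiply through a view: writes land directly in the source storage.
void times_ten_via_view(std::vector<int>& source)
{
    View v(source.data(), source.size());
    for (std::size_t i = 0; i < v.length; ++i)
        v[i] *= 10;
}
```

Calling times_ten_via_copy with copy_back set to false leaves the caller's vector untouched, which mirrors why the array example needs the explicit data = a step afterwards.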

Executing Code over Data: parallel_for_each

  • The parallel_for_each function defines the code that you want to run on the accelerator against the data in the array or array_view object.
  • The parallel_for_each method takes two arguments, a compute domain and a lambda expression.
  • The compute domain is a grid object or a tiled_grid object that defines the set of threads to create for parallel execution. One thread is generated for each element in the compute domain. In this case, the grid object is one-dimensional and has five elements. Therefore, five threads are started. Each thread has access to all the elements in the compute domain.
  • The lambda expression defines the code to run on each thread. The capture clause, [=], specifies that the body of the lambda expression accesses all captured variables by value. In this example, the parameter list creates a one-dimensional index variable named idx. The value of idx.x is 0 in the first thread and increases by one in each subsequent thread.
  • The mutable keyword enables the body of a lambda expression to modify variables that are captured by value, in this case the sum variable.
  • The restrict(direct3d) modifier, or restriction clause, ensures compatibility with hardware targets, enables specialization for hardware targets, and enables code-generation optimizations. The limitations on functions that have the restrict modifier are listed in the restrict keyword section below.
  • The lambda expression can include the code to execute or it can call a separate kernel function. The kernel function must include the restrict(direct3d) modifier.

#include <amp.h>

#include <iostream>

using namespace concurrency;

using namespace std;

void AddElements(

    index<1> idx,

    array_view<int, 1> sum,

    array_view<int, 1> a,

    array_view<int, 1> b

) restrict(direct3d)

{

    sum[idx] = a[idx] + b[idx];

}

void AddArraysWithFunction() {

    int aCPP[] = {1, 2, 3, 4, 5};

    int bCPP[] = {6, 7, 8, 9, 10};

    int sumCPP[5] = {0, 0, 0, 0, 0};

    array_view<int, 1> a(5, aCPP);

    array_view<int, 1> b(5, bCPP);

    array_view<int, 1> sum(5, sumCPP);

    parallel_for_each(

        sum.grid,

        [=](index<1> idx) mutable restrict(direct3d)

        {

            AddElements(idx, sum, a, b);

        }

    );

    for (int i = 0; i < 5; i++) {

        cout << sum[i] << "\n";

    }

}
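For readers without an accelerator handy, the shape of this pattern (a compute domain, a lambda per index, and a separate kernel function) can be mimicked in standard C++. This is only a sequential sketch: for_each_index and add_elements are hypothetical names, and a real accelerator would run the invocations concurrently and in no guaranteed order.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sequential stand-in for parallel_for_each: it calls the kernel
// once per index in a one-dimensional compute domain.
template <typename Kernel>
void for_each_index(std::size_t domain_size, Kernel kernel)
{
    for (std::size_t idx = 0; idx < domain_size; ++idx)
        kernel(idx);
}

// Separate kernel function, called from the lambda (mirrors AddElements).
void add_elements(std::size_t idx, std::vector<int>& sum,
                  const std::vector<int>& a, const std::vector<int>& b)
{
    sum[idx] = a[idx] + b[idx];
}

std::vector<int> add_arrays()
{
    const std::vector<int> a = {1, 2, 3, 4, 5};
    const std::vector<int> b = {6, 7, 8, 9, 10};
    assert(a.size() == b.size());
    std::vector<int> sum(a.size(), 0);

    // Captured by reference here because std::vector owns its storage;
    // the array_view objects in the C++ AMP version are captured by value.
    for_each_index(sum.size(), [&](std::size_t idx) {
        add_elements(idx, sum, a, b);
    });
    return sum;   // 7, 9, 11, 13, 15
}
```

Each invocation touches only its own element of sum, which is the property that makes the real version safe to run in parallel.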

Simplifying & Accelerating Code: Tiles & Barriers

  • Tiling divides an array or array_view object into equal rectangular subsets, or tiles. For each thread, you have access to the global location of a data element relative to the whole array or array_view and access to the local location relative to the tile. Using the local index value simplifies your code because you don't have to write the code to translate index values from global to local. To use tiling, you call the grid::tile method on the compute domain in the parallel_for_each method, and you use a tiled_index object in the lambda expression.
  • The code has to access and keep track of values across the tile. You use the tile_static keyword and the tile_barrier::wait method to accomplish this. A variable that is declared with the tile_static keyword has a scope across an entire tile, and an instance of the variable is created for each tile. You must handle synchronization of tile-thread access to the variable. The tile_barrier::wait method stops execution of the current thread until all the threads in the tile have reached the call to tile_barrier::wait. So you can accumulate values across the tile by using tile_static variables. When all the threads in the tile have reached the barrier, you can finish any computations that require access to all the values.
  • The tile_static keyword is used to declare a variable that can be accessed by all threads in a tile that's in an array or array_view object. The lifetime of the variable starts when execution reaches the point of declaration and ends when the kernel function returns.
  • The following code example uses the sample data and replaces each value in a tile with the average of the values in that tile.

int sampledata[] = {

    2, 2, 9, 7, 1, 4,

    4, 4, 8, 8, 3, 4,

    1, 5, 1, 2, 5, 2,

    6, 8, 3, 2, 7, 2};

 

// The tiles are:

// 2 2    9 7    1 4

// 4 4    8 8    3 4

//

// 1 5    1 2    5 2

// 6 8    3 2    7 2

 

// Averages – create an initial null matrix.

int averagedata[] = {

    0, 0, 0, 0, 0, 0,

    0, 0, 0, 0, 0, 0,

    0, 0, 0, 0, 0, 0,

    0, 0, 0, 0, 0, 0,

};

array_view<int, 2> sample(4, 6, sampledata);

array_view<int, 2> average(4, 6, averagedata);

parallel_for_each(

    // Create threads for sample.grid and divide the grid into 2 x 2 tiles.

    sample.grid.tile<2,2>(),

    [=](tiled_index<2,2> idx) mutable restrict(direct3d)

    {

        // Create a 2 x 2 array to hold the values in this tile.

        tile_static int nums[2][2];

        // Copy the values for the tile into the 2 x 2 array.

        nums[idx.local.y][idx.local.x] = sample[idx.global];

        // When all the threads have executed and the 2 x 2 array is complete, find the average.

        idx.barrier.wait();

        int sum = nums[0][0] + nums[0][1] + nums[1][0] + nums[1][1];

        // Copy the average into the array_view.

        average[idx.global] = sum / 4;

    }

);

for (int i = 0; i < 4; i++) {

    for (int j = 0; j < 6; j++) {

        cout << average(i,j) << " ";

    }

    cout << "\n";

}

// Output.

// 3 3 8 8 3 3

// 3 3 8 8 3 3

// 5 5 2 2 4 4

// 5 5 2 2 4 4
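The two-phase structure of the tiled kernel (fill the tile-local array, wait at the barrier, then read the whole tile) can be traced on the CPU with plain loops. The function tile_average below is a hypothetical sketch in standard C++, not C++ AMP code; finishing the first loop nest before starting the second stands in for the barrier.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU-only sketch of the 2 x 2 tile averaging; not C++ AMP code.
// `rows` and `cols` must both be multiples of the tile size.
std::vector<int> tile_average(const std::vector<int>& sample,
                              std::size_t rows, std::size_t cols)
{
    const std::size_t tile = 2;
    assert(sample.size() == rows * cols);
    assert(rows % tile == 0 && cols % tile == 0);

    std::vector<int> average(sample.size(), 0);
    for (std::size_t tr = 0; tr < rows; tr += tile) {
        for (std::size_t tc = 0; tc < cols; tc += tile) {
            // Phase 1: every "thread" copies its element into tile-local
            // storage (the role played by the tile_static array).
            int nums[2][2];
            for (std::size_t ly = 0; ly < tile; ++ly)
                for (std::size_t lx = 0; lx < tile; ++lx)
                    nums[ly][lx] = sample[(tr + ly) * cols + (tc + lx)];

            // Completing phase 1 before phase 2 stands in for barrier.wait().

            // Phase 2: every "thread" reads the whole tile and writes the average.
            const int sum = nums[0][0] + nums[0][1] + nums[1][0] + nums[1][1];
            for (std::size_t ly = 0; ly < tile; ++ly)
                for (std::size_t lx = 0; lx < tile; ++lx)
                    average[(tr + ly) * cols + (tc + lx)] = sum / 4;
        }
    }
    return average;
}
```

Run on the 4 x 6 sample data above, this produces the same averaged matrix as the listed output, for example 3 for the top-left tile (2 + 2 + 4 + 4 = 12, divided by 4).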

Creating a C++ AMP Application

  • To create a project
    • File>New>Project>Visual C++>Win32 Console Application,
    • type something in the Name box
    • In Solution Explorer, delete stdafx.h, targetver.h, and stdafx.cpp from the project.
    • Open the .cpp file and put in your C++ AMP code.
  • Clear the Precompiled header check box.
    • In Solution Explorer, Project > Properties > Configuration Properties > C/C++ > Precompiled Headers. For the Precompiled Header property, select Not Using Precompiled Headers.
  • Build > Build Solution.

Debugging a C++ AMP Application

  • Debugging the CPU Code
    • In Solution Explorer, Project > Properties > Configuration Properties > Debugging. Verify that Local Windows Debugger is selected.
  • Debugging the GPU Code
    • In Solution Explorer, Project > Properties > Configuration Properties > Debugging. In the Debugger to launch list, select GPU C++ Direct3D Compute Debugger.
  • To use the GPU Threads window
    • To open the GPU Threads window, on the menu bar, choose Debug, Windows, GPU Threads.
    • The window shows the total number of active and blocked (by a barrier) GPU threads.
    • There may be multiple tiles allocated for a computation, where each tile contains a number of threads.
    • Because you are debugging locally on the software emulator (reference rasterizer), there will be an active GPU thread emulated by each core of your CPU.
    • A yellow arrow points to the row that includes the current thread. You can select a row and choose Switch To Thread.
    • The Call Stack window always displays the call stack of the current GPU thread.
  • To use the Parallel Stacks window
    • To open the Parallel Stacks window, on the menu bar, choose Debug, Windows, Parallel Stacks.
    • You can use the Parallel Stacks window to simultaneously inspect the stack frames of multiple GPU threads.
    • You can inspect the properties of a GPU thread that are available in the GPU Threads window in the DataTip of the Parallel Stacks window.
    • Use the Parallel Watch window to inspect the values of an expression across multiple threads at the same time - enter expressions whose values you want to inspect across all GPU threads (via the Add Watch column). You can filter & sort expressions.
    • You can export the content in the Parallel Watch window to Excel by choosing the Excel button.
    • You can flag specific GPU threads by flagging them in the GPU Threads window, the Parallel Watch window, or the DataTip in the Parallel Stacks window.
    • You can group, freeze (suspend) and thaw (resume) GPU threads the same way you do with CPU threads - from either the GPU Threads window or the Parallel Watch window.

restrict keyword

  • The restriction modifier is applied to function declarations. It enforces restrictions on the code in the function and the behavior of the function in applications that use the C++ AMP runtime. The restrict clause takes the following forms:
    • restrict(cpu) The function can run only on the host CPU. (default)
    • restrict(direct3d) The function can run only on the Direct3D target and cannot run on the CPU.
  • The following are not allowed in direct3d functions (i.e. you cannot use them on the GPU):
    • Recursion.
    • Variables declared with the volatile keyword.
    • Virtual functions.
    • Pointers to functions.
    • Pointers to member functions.
    • Pointers in structures.
    • Pointers to pointers.
    • goto statements.
    • Labeled statements.
    • try , catch , or throw statements.
    • Global variables.
    • Static variables. Use tile_static Keyword instead.
    • dynamic_cast casts.
    • The typeid operator.
    • asm declarations.
    • Varargs.
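One practical consequence of this list is that helper functions called from a kernel often need reshaping. As a sketch of the idea in standard C++, here is a recursive helper next to a loop-based rewrite that avoids recursion, function pointers, and static or global state. The function names are hypothetical, and the restrict clause itself is left off so that the code compiles with any standard compiler.

```cpp
#include <cassert>

// Recursive form: fine on the CPU, but recursion is on the disallowed list.
int power_recursive(int base, unsigned exponent)
{
    return exponent == 0 ? 1 : base * power_recursive(base, exponent - 1);
}

// Loop form: no recursion, no function pointers, no static or global state,
// so it stays within the listed restrictions.
int power_iterative(int base, unsigned exponent)
{
    int result = 1;
    for (unsigned i = 0; i < exponent; ++i)
        result *= base;
    return result;
}

// Sanity check that both forms agree on small inputs.
void check_forms_agree()
{
    for (unsigned e = 0; e < 10; ++e)
        assert(power_recursive(3, e) == power_iterative(3, e));
}
```

Whether a particular rewrite is accepted by the direct3d restriction still has to be confirmed by the compiler; this only illustrates the direction of the change.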

Learn More

Posted On Sunday, December 04, 2011 8:20 AM | Feedback (1)

Friday, November 18, 2011 #

Moving blogs

I'm moving to Microsoft Israel blog host - you can follow me over here

I will continue to cross post at Geeks with Blogs

Posted On Friday, November 18, 2011 11:41 AM | Feedback (2)

Saturday, October 29, 2011 #

Post-Build C++ Skill Rebuild

·        For the last decade, the majority of my dev work has leveraged the .NET Framework for construction of information systems. However, my interest has lain in numerical computing.

·        Is it possible to have an increasingly higher level of abstraction and at the same time achieve underlying high performance computing? The prevailing winds say no: C# is aimed at productivity, and C++ is for performance. Garbage collection was great, but do we still need it with the availability of smart pointers? Would you like to leverage GPGPU from .NET ? Yes, there is 3rd party Brahma.NET (LINQ to GPU - http://ananthonline.net/2010/09/29/new-brahma-syntaxusage/ ) ,  but not much from MS – Accelerator (http://research.microsoft.com/en-us/projects/accelerator/) never made it out of MS Research, and the company is pushing C++ AMP for GPGPU. That’s C++ AMP, not C# AMP. The Win 8 status of XNA is in doubt (via ominous silence), so the only way to program a sufficiently advanced UI for a metro app (i.e. something dazzling that gives competitive advantage by having high barriers of entry, not a XAML LOB form!) so far will be via DirectX, which is only supported in C++.

·        Aside from interop, I hadn’t done much in C++ since the 1990s (ATL and DirectX). I spent the last couple of weeks revising. This post describes the resources I used to get back up to speed in an unmanaged world.

October – My Grand C++ Cram

ANSI C++ Language Tutorial

·        I started with http://www.cplusplus.com/doc/tutorial/ - These tutorials explain the C++ language from its basics up to ANSI-C++ 2003, including basic concepts such as arrays or classes and advanced concepts such as polymorphism or templates.

·        Coming from C#, there are a lot of C++ oddities that this tutorial allowed me to revise and reacquaint myself with. I will blog about these differences in a future post.

·        Basics of C++

o   Structure of a program

o   Variables. Data types.

o   Constants

o   Operators

o   Basic Input/Output

·       Control Structures

o   Control Structures

o   Functions (I)

o   Functions (II)

·       Compound Data Types

o   Arrays

o   Character Sequences

o   Pointers

o   Dynamic Memory

o   Data Structures

o   Other Data Types

·       Object Oriented Programming

o   Classes (I)

o   Classes (II)

o   Friendship and inheritance

o   Polymorphism

·       Intermediate Concepts

o   Templates

o   Namespaces

o   Exceptions

o   Type Casting

o   Preprocessor directives

o   Input/Output with files

 

STL Reference

·        http://www.cplusplus.com/reference/

·        The Standard Template Library is kind of like System.Collections.Generic for C++. The STL is a collection of functions, constants, classes, objects and templates that extends the C++ language providing basic functionality to perform several tasks, like classes to interact with the operating system, data containers, manipulators to operate with them and algorithms commonly needed. The declarations of the different elements provided by the library are split in several headers that shall be included in the code in order to have access to its components.

·        C Library

·        C++ Standard Library: Standard Template Library (STL)

·        C++ Standard Library: Input/Output Stream Library - Provides functionality to use an abstraction called streams, specially designed to perform input and output operations on sequences of characters, like files or strings.
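To make the System.Collections.Generic comparison a little more tangible for C# developers, here is a small sketch using standard containers and algorithms. The mapping in the comments is a loose analogy on my part rather than an exact equivalence, and the function names are just illustrative.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <vector>

// std::vector<T> plays roughly the role of List<T>;
// std::map<K, V> is an ordered dictionary, closest to SortedDictionary<K, V>.
std::map<std::string, int> count_words(const std::vector<std::string>& words)
{
    std::map<std::string, int> counts;
    for (std::vector<std::string>::const_iterator it = words.begin();
         it != words.end(); ++it)
        ++counts[*it];   // operator[] inserts a zero-initialized value if absent
    return counts;
}

// Algorithms are free functions that work on iterator ranges,
// not member methods on the container.
std::vector<int> sorted_descending(std::vector<int> values)
{
    std::sort(values.begin(), values.end(), std::greater<int>());
    return values;
}
```

Note that sorted_descending takes its argument by value, so the caller's vector is copied and left unchanged, something a C# developer used to reference semantics may not expect.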

Visual C++ reference for Windows Runtime

·        A lot of detritus has accumulated in the C++ APIs pile over the past 2 decades. MS is pushing 'modern C++', forcing Win8 C++ developers to use better practices. Visual C++ in Visual Studio 11 for Windows Developer Preview has a new programming model for creating Metro style apps and components. One of the primary features of the new model is the abstract binary interface, or ABI, which defines an interface for inter-language communication. The new programming model is based on an updated version of COM, but the way that you program against this new model is simpler and more natural than old-style COM programming.

·        Windows Runtime objects and ref new

·        Numeric Types

·        Ref classes

·        Value structs

·        Strings

·        Properties

·        Interfaces and Parameterized Interfaces

·        Arrays

·        Enums

·        Delegates

·        Events

·        Collections

·        Exceptions

·        Platform::Uri and Guid types

·        Casting between C++ and WinRT types

·        Inheritance

·        Partial classes

·        Compiler and Linker options

·        Boxing

C++/CLI

·        A lot of the modern C++ stuff seems similar to C++/CLI, so I revised that for good measure - C++/CLI for the C# programmer - http://www.c-sharpcorner.com/UploadFile/b942f9/9697/

·        I also looked at the Differences between C++/CLI and WinRT C++ - http://social.msdn.microsoft.com/Forums/en-GB/winappswithnativecode/thread/0aec2f93-de7e-4f2b-8090-17ceb34ed759

 

C++ 11

·        http://en.wikipedia.org/wiki/C%2B%2B11

·        The next thing I looked into was the latest version of C++. The Wikipedia article was a very informative source. I would say the biggest change here is the inclusion of a very powerful form of lambda functions in C++. Here's the TOC:

·        1   Candidate changes for the impending standard update

·        2   Extensions to the C++ core language

·        3   Core language runtime performance enhancements

o   3.1   Rvalue references and move constructors

o   3.2   Generalized constant expressions

o   3.3   Modification to the definition of plain old data

·        4   Core language build time performance enhancements

o   4.1   Extern template

·        5   Core language usability enhancements

o   5.1   Initializer lists

o   5.2   Uniform initialization

o   5.3   Type inference

o   5.4   Range-based for-loop

o   5.5   Lambda functions and expressions

o   5.6   Alternative function syntax

o   5.7   Object construction improvement

o   5.8   Explicit overrides and final

o   5.9   Null pointer constant

o   5.10   Strongly typed enumerations

o   5.11   Right angle bracket

o   5.12   Explicit conversion operators

o   5.13   Template aliases

o   5.14   Unrestricted unions

o   5.15   Identifiers with special meaning

·        6   Core language functionality improvements

o   6.1   Variadic templates

o   6.2   New string literals

o   6.3   User-defined literals

o   6.4   Multitasking memory model

o   6.5   Thread-local storage

o   6.6   Explicitly defaulted and deleted special member functions

o   6.7   Type long long int

o   6.8   Static assertions

o   6.9   Allow sizeof to work on members of classes without an explicit object

o   6.10   Allow garbage collected implementations

·        7   C++ standard library changes

o   7.1   Upgrades to standard library components

o   7.2   Threading facilities

o   7.3   Tuple types

o   7.4   Hash tables

o   7.5   Regular expressions

o   7.6   General-purpose smart pointers

o   7.7   Extensible random number facility

o   7.8   Wrapper reference

o   7.9   Polymorphic wrappers for function objects

o   7.10   Type traits for metaprogramming

o   7.11   Uniform method for computing the return type of function objects

·        8   Features planned but removed or not included

·        9   Features to be removed or deprecated
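Since lambdas are the feature I singled out, here is a short sketch that exercises a few of the items from the list above together: lambdas with by-reference and by-value captures, type inference with auto, the range-based for loop, initializer lists, and static_assert. The function names are made up for the illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Sum of the squares of the even values.
int sum_even_squares(const std::vector<int>& values)
{
    int total = 0;
    // Lambda capturing `total` by reference.
    auto accumulate_if_even = [&total](int v) {
        if (v % 2 == 0) total += v * v;
    };
    // Range-based for loop with type inference.
    for (auto v : values)
        accumulate_if_even(v);
    return total;
}

// Lambda passed to a standard algorithm, capturing `threshold` by value.
std::ptrdiff_t count_above(const std::vector<int>& values, int threshold)
{
    return std::count_if(values.begin(), values.end(),
                         [threshold](int v) { return v > threshold; });
}

int demo_cpp11()
{
    const std::vector<int> values = {1, 2, 3, 4, 5, 6};   // initializer list
    static_assert(sizeof(long long) >= 8, "long long is at least 64 bits");
    return sum_even_squares(values);                      // 4 + 16 + 36
}
```

The capture clauses here are the same [=] and [&] forms that C++ AMP kernels use.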

 

C++ Programming Wiki booklets

·        I found some nice resources on esoteric C++ topics here:

·        STL - http://en.wikibooks.org/wiki/C++_Programming/STL

o   The Standard Template Library (STL) offers collections of algorithms, containers, iterators, and other fundamental components, implemented as templates, classes, and functions essential to extend functionality and standardization to C++. The STL's main focus is to standardize implementations, with emphasis on performance and correctness.

·        TMP - http://en.wikibooks.org/wiki/C%2B%2B_Programming/Templates/Template_Meta-Programming

o   Template meta-programming refers to uses of the C++ template system to perform computation at compile-time within the code.

·        Smart Pointers - http://en.wikibooks.org/wiki/C++_Programming/Operators/Pointers/Smart_Pointers

o   Using raw pointers to store allocated data and then cleaning them up in the destructor can generally be considered a very bad idea since it is error-prone. Even temporarily storing allocated data in a raw pointer and then deleting it when done with it should be avoided for this reason. For example, if your code throws an exception, it can be cumbersome to properly catch the exception and delete all allocated objects. Smart pointers can alleviate this headache by using the compiler and language semantics to ensure the pointer content is automatically released when the pointer itself goes out of scope.

·        RTTI - http://en.wikibooks.org/wiki/C++_Programming/RTTI

o   RTTI refers to the ability of the system to report on the dynamic type of an object and to provide information about that type at runtime (as opposed to at compile time). When utilized consistently, it can be a powerful tool to ease the work of the programmer in managing resources.

·        C++ Garbage Collection ( a little sparse) - http://en.wikibooks.org/wiki/C++_Programming/Compiler/Linker/Libraries/Garbage_Collection

·        Libraries (including popular 3rd party libraries) - http://en.wikibooks.org/wiki/C++_Programming/Compiler/Linker/Libraries

o   Additional functionality that goes beyond the standard libraries (like garbage collection) is available (often free) from third party libraries, preventing wheel (and square wheel) re-invention.

·        The Boost Library (nice overview!) - http://en.wikibooks.org/wiki/C++_Programming/Libraries/Boost

o   The Boost library ( http://www.boost.org/ ) provides free, peer-reviewed, open source libraries that extend the functionality of C++. Boost is like the BCL for C++.

·        Optimization - http://en.wikibooks.org/wiki/C++_Programming/Optimization

o   the use of virtual member functions, the right container, etc

·        Win32 - http://en.wikibooks.org/wiki/C++_Programming/Code/API/Win32

o   The most direct way to interact with Windows for software applications. Lower level access to a Windows system, mostly required for device drivers, is provided by the Windows Driver Model.

·        C++ Programming/Threading

o   http://en.wikibooks.org/wiki/C++_Programming/Threading
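Of the topics in this list, the smart pointer point is the one that most changes day-to-day code for someone coming from a garbage-collected language, so here is a minimal sketch of it. Tracked is a hypothetical type that just counts live instances; the example shows std::unique_ptr releasing an object during stack unwinding when an exception is thrown, with no explicit delete.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>

// Hypothetical resource that counts how many instances are alive.
struct Tracked {
    static int alive;
    Tracked()  { ++alive; }
    ~Tracked() { --alive; }
};
int Tracked::alive = 0;

// Throws after allocating; std::unique_ptr still releases the object
// during stack unwinding, so nothing leaks.
void allocate_then_throw()
{
    std::unique_ptr<Tracked> owner(new Tracked());
    assert(Tracked::alive == 1);
    throw std::runtime_error("simulated failure");
}   // no explicit delete anywhere

// Returns the number of live objects after the exception has been handled.
int alive_after_exception()
{
    try {
        allocate_then_throw();
    } catch (const std::runtime_error&) {
        // The unique_ptr destructor has already run by the time we get here.
    }
    return Tracked::alive;
}
```

With a raw pointer in allocate_then_throw, the same code would leave one Tracked object alive after the exception; with the smart pointer, alive_after_exception() returns 0.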

 

Concurrency Runtime

·        Next I read through the section on MSDN that discusses parallelism on the CPU - http://msdn.microsoft.com/en-us/library/ee207192.aspx - this set of APIs raises the level of abstraction so that you do not have to manage the infrastructure details that are related to concurrency. You can also use it to specify scheduling policies that meet the quality of service demands of your applications. Coming from TPL, there were a lot of evident programmatic similarities. The Concurrency Runtime goes a step further than the TPL though: it exposes UMS (user mode scheduling), allowing you to set task priority, and preemptively release the CPU from a non-busy thread.

·        Parallel Patterns Library (PPL) - http://msdn.microsoft.com/en-us/library/dd492418.aspx

o   builds on the scheduling and resource management components of the Concurrency Runtime

·        Asynchronous Agents Library - http://msdn.microsoft.com/en-us/library/dd492627.aspx

o   a template library that promotes an actor-based programming model and in-process message passing for coarse-grained dataflow and pipelining tasks. Very similar concepts to Axum, MPI.NET, TPL 4.5 DataFlow and F# DataFlowVariable<T>

 

·        Task Scheduler (Concurrency Runtime) - http://msdn.microsoft.com/en-us/library/dd984036.aspx

 

C++ AMP

·        Following this I got round to GPGPU – I read through

·        http://msdn.microsoft.com/en-us/library/hh265137(VS.110).aspx and , http://www.danielmoth.com/Blog/

·        C++ Accelerated Massive Parallelism accelerates execution of C++ code by taking advantage of the data-parallel hardware that is generally present as a GPU on a discrete graphics card. The C++ AMP programming model includes multidimensional arrays, indexing, memory transfer, tiling, and a mathematical function library. C++ AMP language extensions and compiler restrictions enable you to control how data is moved from the CPU to the GPU and back, which enables you to control the performance impact of moving the data back and forth.

·        Basically this is a C++ abstraction over HLSL for GPGPU - just like OpenCL is over CUDA. I must say that the programmatic concepts easily transferred across from what I had previously learned by going through MS Accelerator, but it is a lot more configurable.

·        I especially liked the tile_static modifier – a type of shared memory for a subset of GPU threads – think rollup and linear algebra block matrices !

 

Windows Programming in C++

·        Currently going through using the Win32 API - http://msdn.microsoft.com/en-us/library/ff381399(VS.85).aspx

o   Introduction to Windows Programming in C++ - describes some of the basic terminology and coding conventions used in Windows programming.

o   Module 1. Your First Windows Program - create a simple Windows program that shows a blank window.

o   Module 2. Using COM in Your Windows Program - introduces how COM underlies many of the modern Windows APIs.

o   Module 3. Windows Graphics - Windows graphics architecture, with a focus on Direct2D.

o   Module 4. User Input - mouse and keyboard input.

·        At some point, I'll get back to that Windows Internals book too!

Next Month

October's nearly over, and aside from reading the Roslyn docs and a JQuery book (and my day job!!), I've learned and relearned a lot of C++. So what's next?

COM

·        I'm raring to revise this stuff - it's been a while since my programming life revolved around COM (I vaguely remember monikers, IDL etc) - http://msdn.microsoft.com/en-us/library/ee663262(VS.85).aspx. Here are some links:

o   Component Object Model (COM) - COM is a platform-independent, distributed, object-oriented system for creating binary software components that can interact. COM is the foundation technology for Microsoft's OLE (compound documents) and ActiveX (Internet-enabled components) technologies.

o   Automation - enables software packages to expose their unique features to scripting tools and other applications.

o   Microsoft Interface Definition Language (MIDL) - defines interfaces between client and server programs. The MIDL compiler in the Platform SDK enables you to create the interface definition language (IDL) files and application configuration files (ACF) required for RPC interfaces and COM/DCOM interfaces. MIDL also supports the generation of type libraries for OLE Automation.

o   Structured Storage - provides file and data persistence in COM by handling a single file as a structured collection of objects known as storages and streams.

DirectX

·        I recently gave a course on XNA, so am ready to deep dive back into this platform - http://msdn.microsoft.com/en-us/library/ee663274(VS.85).aspx

 

The Boost C++ Libraries

·        You just cannot be proficient in C++ without Boost - http://en.highscore.de/cpp/boost/frontpage.html. As I stated, this is the BCL of C++.

o   Chapter 1: Introduction

o   Chapter 2: Smart Pointers

o   Chapter 3: Function Objects

o   Chapter 4: Event Handling

o   Chapter 5: String Handling

o   Chapter 6: Multithreading

o   Chapter 7: Asynchronous Input and Output

o   Chapter 8: Interprocess Communication

o   Chapter 9: Filesystem

o   Chapter 10: Date and Time

o   Chapter 11: Serialization

o   Chapter 12: Parser

o   Chapter 13: Containers

o   Chapter 14: Data Structures

o   Chapter 15: Error Handling

o   Chapter 16: Cast Operators

 

WDK – Windows Driver Kit

·        I know next to nothing about this !!! http://msdn.microsoft.com/en-us/library/ff557573(VS.85).aspx - I am very curious about how to write code in kernel mode.

 

Optimizing C++ Wikibook

·        To master something , you have to know how to tweak it - http://en.wikibooks.org/wiki/Optimizing_C%2B%2B

·        Optimization life cycle

·        Writing efficient code

o   Performance improving features

o   Performance worsening features

o   Constructions and destructions

o   Allocations and deallocations

o   Memory access

o   Thread usage

·        General optimization techniques

o   Input/Output

o   Memoization

o   Sorting

o   Other techniques

·        Code optimization

o   Allocations and deallocations

o   Run-time support

o   Instruction count

o   Constructions and destructions

o   Pipeline

o   Memory access

o   Faster operations

 

As Yoda stated – C# devs:  "You must unlearn what you have learned" - http://www.youtube.com/watch?v=FDezrybpuO8

Cheers,

Josh

 

Posted On Saturday, October 29, 2011 9:59 PM | Feedback (0)

Wednesday, September 21, 2011 #

SVG - an introduction

So someone moved your Silverlight cheese? Go and get some HTML5 cheese!

Before WPF/E & Avalon were anything more than vapourware, W3C had the SVG standard (Scalable Vector Graphics) for 2D vector graphics over the web. Using Javascript, you could manipulate 2D animations & transforms. You can embed SVG in HTML5 today.

Yeah, it's like going back to Silverlight 1.0, but you just have to deal with it!

Embrace change.

SVG Features
·        Designed for 2D graphics - display vector graphics & text along with raster graphics 
·        Use XML (Unicode text) to describe vector graphics - can be searched & indexed, provides a standard for graphical data interchange
·        SVG 1.0 is a W3C standard, & 2.0 is in the works
·        Powerful - can Manipulate the SVG DOM via XSLT and/or JavaScript on the client
·        The future direction of 2D GUI development that involves animations and transforms
 
Vector graphics
·        a set of instructions for drawing a series of geometric shapes at a set of coordinates
·        Different from Raster graphics - Compressed bitmap of pixel RGB values - most display devices are raster - until SVG, browsers only supported raster graphics
·        self understanding objects rather than pixels - can dynamically change shape & color, allows text to be searchable
·        used in CAD, Adobe Illustrator, Adobe PostScript language & Macromedia Flash - but unlike these binary encoded formats, SVG is XML text, and is thus searchable, parseable & dynamic on the client side (not just event driven)
·        Scalability - scaled without a loss of image quality, unlike raster images, which require anti-aliasing. Graphics aren't limited by fixed pixels and can change size without distortion - they adjust to the available screen resolution --> interoperability
SVG Advantages
·        File size - smaller than raster images, .SWF flash files, can be compressed to .svgz files --> less bandwidth consumption
·        Zoom functionality - viewer allows 4 magnification steps: x2, x4, x8, x16 - from the context menu, select zoom in, zoom out or original view - not available in traditional raster images
·        Panning functionality - hold down the alt key & left mouse button
·        Selective display of elements by setting an element's visibility attribute to "visible" or "hidden"
·        Open source code - just select view source !
·        International language support - based on Unicode & browser settings detection - don’t have to create a bitmap for each supported language
·        Accessible to search engines because it is text based - unlike raster images of text
·        Data driven graphics - use an XML datasource, SVG can be generated on the server or client
·        Resolution independent - image is rendered appropriate to the display device - no need to maintain multiple versions of a graphic for different devices
·        Rollovers are declarative - can do without JavaScript
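The "data driven graphics" point is easy to try out on the server side. Here is a minimal Python sketch that uses only the standard xml.etree.ElementTree module; the helper name bar_chart_svg, the bar sizes and the sample values are all made up for illustration, not taken from any SVG toolkit.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"


def bar_chart_svg(values, bar_width=20, gap=5, height=100):
    """Build a simple bar chart as an SVG string (hypothetical helper)."""
    ET.register_namespace("", SVG_NS)  # serialize <svg> without a prefix
    width = len(values) * (bar_width + gap)
    svg = ET.Element(f"{{{SVG_NS}}}svg",
                     {"width": str(width), "height": str(height)})
    peak = max(values)
    for i, value in enumerate(values):
        bar_height = height * value / peak
        ET.SubElement(svg, f"{{{SVG_NS}}}rect", {
            "x": str(i * (bar_width + gap)),
            # y grows downward, so a bar starts at height - bar_height
            "y": f"{height - bar_height:g}",
            "width": str(bar_width),
            "height": f"{bar_height:g}",
            "style": "fill: steelblue;",
        })
    return ET.tostring(svg, encoding="unicode")


chart = bar_chart_svg([3, 7, 5])
print(chart)
```

Because the output is plain XML text, the same data could just as well be turned into SVG on the client by a script walking the DOM.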
 
SVG 2.0
·        The SVG 1.0 feature set with modular DTDs
·        DOM level 3 text events - allow an event listener to determine which key was pressed
·        Text wrapping - achieved via the element <flowText> & its subelements <flowRegion>, <flowDiv>, <flowPara> & <flowSpan>
·        Z-ordering - unlike SVG 1.0, which uses a fixed rendering order whereby the object that is last in the tree order renders topmost, SVG 2.0 could use the CSS2 z-index style attribute
·        DOM Level 3 Load and save - access XML data on remote servers & use SAX to update the display - the XML is parsed into a new DOM tree, navigated & new nodes are created in the SVG graphic - implemented in Adobe SVG Viewer by the proprietary functions getURL, parseXML & postURL
·        Vector filter effects - current filter effects require the target element to be rasterized before being applied, consuming a lot of memory - vector filters take vectors as input & generate new vector data as output
·        Multiple namespace compatibility - ensure different APIs (e.g. MathML) can work together
·        Linking & synchronization of SMIL audio
·        W3C XHTML+MathML+SVG profile - combines namespaces, enabling mixing & validating the standards via xmlns namespace declarations in the same document
SVG Document structure
The root <svg> element
·        width & height attributes define the canvas size - both default to "100%"
·        The zoomAndPan attribute can be set to "magnify" (default) or "disable"
·        The <title> element's content will be available to viewer title bar
·        The <desc> element is informational - can be applied to any SVG element as a subelement
·        A shape's location & size are part of its structure while color & style are part of its presentation
·        HTML5 allows embedding SVG directly using <svg>...</svg> tag anywhere in the body:
<svg xmlns="http://www.w3.org/2000/svg" width="100" height="100" >
<title> xxx </title>
<desc> xxx </desc>
...
</svg>
Coordinate system
·        The point 0,0 is upper left corner.
·        The viewport is the canvas area specified by the <svg> element width & height attributes
·        units can be the following:
o   px - pixels (default)
o   em - the current font size (default font character height)
o   ex - the height of the letter x
o   pt - points (1/72 inch)
o   pc - picas (1/6 inch)
o   cm - centimeters
o   mm - millimeters
o   in - inches
·        The <svg> element viewBox attribute defines the onscreen size of objects by defining a new coordinate system and provides a way to scale content - contains the min x-value, min y-value, the width & the height of a user defined coordinate system - e.g. viewBox="0 0 80 80" - note: height & width must both be positive
·        The 3rd & 4th viewBox values are the width & height of the user coordinate system themselves (not maximum x & y values), so the scale factors are the viewport width divided by the viewBox width & the viewport height divided by the viewBox height
·        To preserve the aspect ratio (of width to height) between the viewBox & the viewport, use the <svg> element preserveAspectRatio attribute - values can be various combinations of {xMin, xMid, xMax} and {YMin, YMid, YMax} plus 1 of 3 specifiers
·        E.g. xMinYMid = align viewBox min x with viewport left edge and align viewBox midpoint y value with viewport mid y value
·        The meet specifier will scale the graphic according to the smaller dimension so that the entire graphic fits into the viewport
·        The slice specifier will scale the graphic according to the larger dimension & cut off the parts that lie outside the viewport
·        The none specifier will stretch & squash the graphic to fit precisely in the viewport
·        E.g.
<svg width="100"
height="100"
preserveAspectRatio="xMidYMid meet"
viewBox="0 0 90 90" >
·        <svg> elements can be nested with the appropriate preserveAspectRatio attribute value in order to establish a new viewport & coordinate system at any time for different document fragments - can be transformed as a whole or imported into another SVG document using the <image> element (does not support nested animation)
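The viewBox / preserveAspectRatio rules above boil down to a scale and a translate. The following is a rough Python sketch of that arithmetic as I understand it from the rules listed here; fit_viewbox is a hypothetical helper name, and it only handles the align values and the meet / slice / none cases described above.

```python
def fit_viewbox(viewport_w, viewport_h, viewbox, align="xMidYMid", mode="meet"):
    """Map viewBox user units onto the viewport.

    Returns (scale_x, scale_y, translate_x, translate_y).
    viewbox is (min_x, min_y, width, height).
    """
    min_x, min_y, vb_w, vb_h = viewbox
    sx = viewport_w / vb_w
    sy = viewport_h / vb_h
    if align != "none":
        # meet keeps everything visible, slice fills the viewport and crops
        s = min(sx, sy) if mode == "meet" else max(sx, sy)
        sx = sy = s
    tx = -min_x * sx
    ty = -min_y * sy
    if align != "none":
        # leftover space (negative when slicing) is shared out by the alignment
        extra_x = viewport_w - vb_w * sx
        extra_y = viewport_h - vb_h * sy
        x_part, y_part = align[:4], align[4:]
        tx += {"xMin": 0.0, "xMid": 0.5, "xMax": 1.0}[x_part] * extra_x
        ty += {"YMin": 0.0, "YMid": 0.5, "YMax": 1.0}[y_part] * extra_y
    return sx, sy, tx, ty


# A 50 x 50 viewBox shown in a 200 x 100 viewport
print(fit_viewbox(200, 100, (0, 0, 50, 50)))                # meet
print(fit_viewbox(200, 100, (0, 0, 50, 50), mode="slice"))  # slice
print(fit_viewbox(200, 100, (0, 0, 50, 50), align="none"))  # stretch
```

With meet the graphic is scaled by 2 and centered horizontally; with slice it is scaled by 4 and the top & bottom are cut off; with none the two axes get different scale factors.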
Using links in SVG
·        As in HTML, the <a> element can contain a graphic & its xlink:href attribute can reference a URL, even that of another SVG file
·        xlink:href is defined in XLink namespace - SVG only supports outbound XLinks
·        Xlink is expressed in global attributes and provides both simple & extended links
·        Need to add the XLink namespace declaration to the <svg> element:
<svg ... xmlns:xlink="http://www.w3.org/1999/xlink" >
·        To use a link in SVG, use an <a> element and enclose a <text> element with a style attribute that specifies "stroke:blue;" to signify a hyperlink:
<a xlink:href="..." target="new">
<text ... style="stroke:blue;" >
xxx
</text>
</a>
·        use the attribute value target="new" to open the link in a new window
·        Use XPointer links in a <use> element to enable reuse of elements defined in the <defs> element (provides fragment identification for generic XML documents & external parsed entities)
·        E.g. bare name XPointer:
<use xlink:href="#rectX" ... />
·        E.g. XPointer scheme:
<use xlink:href="#xpointer(id('rectX'))" ... />
·        An additional way to link is to specify an SVG <view> element that is referenced by a link
·        The <view> element has a viewBox attribute and allows the user to select scaling or translation (a link enabled zooming or panning effect)
·        E.g.:
<view id="viewNormal" viewBox="0 0 100 100"/>
<view id="viewDouble" viewBox="0 0 50 50"/>
<view id="viewTriple" viewBox="0 0 33 33"/>
<view id="viewPanLeft" viewBox="100 0 100 100"/>
<view id="viewPanRight" viewBox="-100 0 100 100"/>
 
...
<a xlink:href="#viewNormal">
...
</a>
·        To control cursor appearance and have it appear as a hand, nest the object of interest within an <a> element whose xlink:href attribute has an empty string value
·        e.g.
<a xlink:href="">
...
</a>
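To see why the XLink namespace declaration matters, here is a small Python sketch that reads link targets out of an SVG string with the standard xml.etree.ElementTree module. The sample document and the link_targets helper are invented for illustration.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
XLINK_NS = "http://www.w3.org/1999/xlink"

# Made-up sample document: one link to a <view>, one to another file
DOC = """<svg xmlns="http://www.w3.org/2000/svg"
     xmlns:xlink="http://www.w3.org/1999/xlink" width="100" height="100">
  <view id="viewDouble" viewBox="0 0 50 50"/>
  <a xlink:href="#viewDouble"><text x="10" y="20">zoom in</text></a>
  <a xlink:href="other.svg"><text x="10" y="40">other file</text></a>
</svg>"""


def link_targets(svg_text):
    """Return the xlink:href value of every <a> element, in document order."""
    root = ET.fromstring(svg_text)
    # A namespaced attribute is looked up by its full {namespace}name
    return [a.get(f"{{{XLINK_NS}}}href") for a in root.iter(f"{{{SVG_NS}}}a")]


print(link_targets(DOC))
```

If the xmlns:xlink declaration is removed from the <svg> element, the parser rejects the document because the xlink prefix is unbound.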
Basic graphical elements
The <circle> element
·        cx, cy - specify the centerpoint of the circle
·        r - radius
·        rx, ry - radii for an ellipse (used on the <ellipse> element in place of r)
·        E.g.
<circle cx="100" cy="100" r="20" style="..." />
 
The <line> element
·        x1, y1, x2, y2 - specify the endpoints of the line
·        E.g.
<line x1="0" y1="0" x2="100" y2="100" style="stroke: black;" />
 
The <rect> element
·        x, y, width, height
·        x, y values default to 0,0
·        rx, ry - specify radius for rounded corners
 
The <polyline> element
·        Describes a shape created by a number of straight lines
·        points attribute takes comma delimited pairs of x, y coordinates
·        It is best to set the style fill property to none
·        E.g.
<polyline points="10 25, 67 120, 45 89" style="stroke: black; fill:none;" />
 
The <polygon> element
·        Describes a closed multisided shape with straight edges
·        points attribute takes comma delimited pairs of x, y coordinates
·        Unlike <polyline>, the shape is automatically closed
·        E.g.
<polygon points="10 25, 67 120, 45 89" style="stroke: black; fill:none;" />
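The only difference between the two elements is the closing edge, which a few lines of Python can make concrete. This is just a sketch: parse_points and outline_length are hypothetical helper names, and the triangle coordinates are chosen so the numbers come out round.

```python
import math


def parse_points(points):
    """Split a points attribute into (x, y) tuples.

    Commas and whitespace are both treated as separators.
    """
    numbers = [float(n) for n in points.replace(",", " ").split()]
    if len(numbers) % 2:
        raise ValueError("points needs an even count of numbers")
    return list(zip(numbers[0::2], numbers[1::2]))


def outline_length(points, closed):
    """Total stroke length: closed=False is <polyline>, closed=True is <polygon>."""
    pts = parse_points(points)
    if closed:
        pts = pts + pts[:1]  # <polygon> adds an edge back to the first point
    return sum(math.hypot(b[0] - a[0], b[1] - a[1])
               for a, b in zip(pts, pts[1:]))


triangle = "0 0, 3 0, 3 4"
print(outline_length(triangle, closed=False))  # two edges
print(outline_length(triangle, closed=True))   # two edges plus the closing edge
```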
 
 
The <path> element
·        A general element - compactly specify a path for an arbitrary shape as a sequence of lines & curves
·        Can be used to define the outline of a clipping area or a transparency mask
·        d attribute - signifies data to draw - entire path is contained in 1 attribute --> reduces bandwidth load, memory requirements of loading entire DOM structure into XML Parser - consists of 1 letter commands:
·        M - move to coordinates x, y (starts a new sub-path)
·        L - draw a line to coordinates x, y
·        A - draw an elliptical arc (arguments are listed further down)
·        Z - close path - draw a straight line back to the beginning point of the current sub-path
·        H - draw a horizontal line to an x axis position
·        V - draw a vertical line to a y axis position
·        A - arc - specify 7 arguments: x, y radius of the ellipse on which the points lie, the x-axis-rotation of the ellipse, the large-arc-flag (0 if the arc's measure is < 180 degrees or 1 if >= 180 degrees), sweep-flag (0 if the arc is drawn in the negative angle direction and 1 if the angle is drawn in the positive angle direction) & the endpoint x, y coordinates
·        Q - Quadratic Bezier curves - specified by a curve control point x, y coordinate (acts like a magnet whose proximity to the curve directly affects the magnitude of attraction) and the x, y coordinate of the endpoint (the start point is specified by the previous command) - multiple sets of (4 numbers each) control & endpoints can be specified after a Q command to generate a polybezier curve
·        T - smooth quadratic curve (continuation of the previous curve) - only specify the endpoint x, y coordinate, as the curve control point x, y coordinate is automatically reflected from the control point on the previous command relative to the current point (according to the formulas: x2 = 2 * x - x1; y2 = 2 * y - y1)
·        C - Cubic Bezier curves - more complex as they have 2 control points, one for each endpoint - specify 3 sets of coordinates: the control point for the start point, the control point for the end point & the end point
·        S - smooth cubic Bezier curve - (continuation of the previous curve) - only specify the endpoint control point x, y coordinate and the endpoint x, y coordinate, as the start point curve control point x, y coordinate is automatically reflected from the endpoint control point on the previous command relative to the current point
·        Note: uppercase command letters signify absolute coordinates, while lowercase command letters signify coordinates relative to the current pen position
·        Paths can be made shorter by placing multiple space delimited coordinates after an L command letter
·        E.g.
<path d="M 10 10 L 100 100 50 67 M 9 9 A 50 50 0 0 0 100 100 M 40 40 Q 20 60 70 90 T 50 50 C 20 20 40 40 50 50 S 70 70 60 60" style="stroke: black; fill:none;" />
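To get a feel for how compact the d attribute is, here is a rough Python sketch that splits path data into commands and applies the reflection formula given for the T command. parse_path and reflect are hypothetical names, and the tokenizer is deliberately simple: it assumes numbers are separated by spaces or commas, as in the example above.

```python
import re

# One token is either a command letter or a number
TOKEN = re.compile(r"([MLHVAZQTCSmlhvazqtcs])|(-?\d*\.?\d+(?:[eE][-+]?\d+)?)")


def parse_path(d):
    """Split path data into (command_letter, [numbers]) pairs."""
    commands = []
    for letter, number in TOKEN.findall(d):
        if letter:
            commands.append((letter, []))
        elif not commands:
            raise ValueError("path data must start with a command letter")
        else:
            commands[-1][1].append(float(number))
    return commands


def reflect(control, current):
    """Implied control point for T (or S): x2 = 2 * x - x1, y2 = 2 * y - y1."""
    return (2 * current[0] - control[0], 2 * current[1] - control[1])


d = ("M 10 10 L 100 100 50 67 M 9 9 A 50 50 0 0 0 100 100 "
     "M 40 40 Q 20 60 70 90 T 50 50 C 20 20 40 40 50 50 S 70 70 60 60")
for letter, args in parse_path(d):
    print(letter, args)

# The Q command has control point (20, 60) and ends at (70, 90),
# so the T that follows it uses this reflected control point:
print(reflect((20, 60), (70, 90)))
```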
 
The <marker> element
·        Use to place shapes on a path - e.g. an arrow at the end
·        Multiple <marker> elements can be referenced by a <path> element
·        Does not display by itself - place in a <defs> element so that it can be used as a template
·        Considered part of the presentation, not the content - referenced in a <path> element style attribute using the marker-start, marker-mid (attached to every vertex in the path except for the 1st & last), marker-end or marker property (all) - e.g.
·        style="marker-start: url(#markerX); fill:none; stroke:black; "
·        Encloses a self contained graphic with its own private set of coordinates and other attributes
·        Attributes:
·        id
·        markerWidth, markerHeight
·        refX, refY - specify which marker point to align with the path (default = 0,0)
·        orient - specify how the marker is to match the orientation of the path - in degrees or auto (default = 0)
·        markerUnits - if set to strokeWidth then the marker grows in proportion to the path's stroke width, if set to userSpaceOnUse then marker will remain the same irrespective of the stroke width
·        viewBox
·        preserveAspectRatio
·        E.g.
<defs>
<marker id="markerX"
markerWidth="5"
markerHeight="5"
refX="5" refY="5"
orient="auto"
markerUnits="strokeWidth">
 
<circle ... />
</marker>
</defs>
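The orient and markerUnits attributes are easier to picture with numbers. Below is a small Python sketch under simplifying assumptions: it only covers a marker at the end of a straight segment (at a vertex between two segments the viewer has to pick a direction between them), and marker_angle & marker_size are made-up helper names.

```python
import math


def marker_angle(start, end):
    """Direction of a straight segment in degrees, as orient="auto" would follow it.

    Positive angles turn clockwise on screen because the SVG y axis points down.
    """
    return math.degrees(math.atan2(end[1] - start[1], end[0] - start[0]))


def marker_size(marker_width, marker_height, stroke_width, units="strokeWidth"):
    """Marker viewport size in user units for the two markerUnits settings."""
    if units == "strokeWidth":
        # the marker grows in proportion to the path's stroke width
        return marker_width * stroke_width, marker_height * stroke_width
    if units == "userSpaceOnUse":
        return marker_width, marker_height
    raise ValueError("units must be strokeWidth or userSpaceOnUse")


print(marker_angle((0, 0), (100, 100)))        # diagonal line, down and to the right
print(marker_size(5, 5, 2))                    # scales with a stroke width of 2
print(marker_size(5, 5, 2, "userSpaceOnUse"))  # ignores the stroke width
```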
 
Special elements
The <g> grouping element
·        Make document more structured & understandable
·        Treat subelements as a referencable unit - via a unique id attribute
·        Any styles applied to a group will flow down to its subelements
·        While a nested <svg> element cannot be transformed, a nested <g> element can be
·        May be nested and may have its own <title> & <desc> elements
·        The transform attribute should not be applied - better to specify x, y coordinates in the calling <use> element
·        E.g.
<g id="g1" class="classX" >
<line x1="0" y1="0" x2="100" y2="100" style="stroke: black;" />
<line x1="50" y1="50" x2="150" y2="150" style="stroke: black;" />
</g>
 
The <symbol> element
·        Another way of grouping elements
·        Unlike <g>, it acts as a template only (not automatically rendered)
·        Can specify viewBox and preserveAspectRatio attributes - <symbol> can fit into the viewport established by the <use> element
 
The <use> element
·        Utilize a template defined element
·        Repeat elements - copy & paste group elements
·        x, y attributes - specify coordinates
·        xlink:href attribute - specify the group to reuse. Note: can reference any valid file or URI - can place a set of common elements in one svg file & use them selectively in other svg files
·        Disadvantages: positioning is not absolute but relative to the original, the original style cannot be overridden, and both the original and the copy must appear in the document (the original does not act as a true template)
·        E.g.
<use xlink:href="#g1" x="50" y="50" />
<use xlink:href="xxx.svg#g1" x="50" y="50" />
 
<defs> element
·        Define the template elements
·        overcomes the disadvantages of <use>
·        Encloses all the original <g> elements - instructs SVG to define the groups as a template without displaying them --> SVG can process these groups more efficiently in a streaming environment
·        Note: applied in conjunction with the <use> element, which references <g> elements as normal
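The <defs> + <use> pairing is essentially a lookup by id. As a rough illustration, this Python sketch finds each <use> in a made-up document and resolves it to the template it references; resolve_uses is a hypothetical helper, and it only handles same-document references such as #g1.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
XLINK_NS = "http://www.w3.org/1999/xlink"

# Made-up sample: one template group in <defs>, stamped out twice
DOC = """<svg xmlns="http://www.w3.org/2000/svg"
     xmlns:xlink="http://www.w3.org/1999/xlink" width="200" height="200">
  <defs>
    <g id="g1">
      <line x1="0" y1="0" x2="100" y2="100" style="stroke: black;"/>
    </g>
  </defs>
  <use xlink:href="#g1" x="50" y="50"/>
  <use xlink:href="#g1" x="0" y="20"/>
</svg>"""


def resolve_uses(svg_text):
    """Return (template_id, child_count, x, y) for each same-document <use>."""
    root = ET.fromstring(svg_text)
    by_id = {el.get("id"): el for el in root.iter() if el.get("id")}
    placed = []
    for use in root.iter(f"{{{SVG_NS}}}use"):
        href = use.get(f"{{{XLINK_NS}}}href", "")
        if not href.startswith("#"):
            continue  # external references (xxx.svg#g1) are not fetched here
        target = by_id[href[1:]]
        placed.append((target.get("id"), len(target),
                       float(use.get("x", "0")), float(use.get("y", "0"))))
    return placed


print(resolve_uses(DOC))
```

The group inside <defs> is never drawn by itself; only the two <use> instances are.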
 
<foreignObject> element
·        Allows objects & code from non SVG namespaces (e.g. XHTML, SMIL) to be embedded in an SVG document
·        Consider placing an XHTML block of wrappable text within SVG (which cannot wrap text)
·        Possibly not supported in your browser of choice!
·        Causes a new viewport to be created with the following attributes: x, y, width, height, requiredExtensions
 
<image> element
·        Creates a new viewport that imports an external image file
·        Can include an entire .svg or raster image .jpg or .png file
·        Attributes: x, y, width, height
·        Raster files are scaled to fit the rectangle, while vector images use the rectangle as a viewport
·        Doesn't support nested animation of SVG
·        E.g.
<image xlink:href="xxx.jpg" x="60" y="30" width="50" height="100" />

 

Posted On Wednesday, September 21, 2011 10:13 AM | Feedback (0)

Thursday, September 15, 2011 #

Build - Not all it could be.

 
Programming for Metro
Metro is the future http://zd.net/rnT9VZ .NET is for old apps. WinRT replaces WPF & WCF. XAML is big. There were no Silverlight sessions.
HTML 5 and JavaScript (Blend for HTML) are back in fashion – I won't dwell on this because web apps are (inconsistently) simplistic in concept and over-complex in development.
Yes, all those XAML skills are portable, Silverlight (RIP) apps can run as Metro apps with a bit of fidgeting with namespaces – you just need to know what controls to replace in the Windows.UI.Xaml.Controls namespace http://msdn.microsoft.com/en-us/library/windows/apps/br227716(v=VS.85).aspx - (Some are ominously missing - Silverlight / WPF had a lot of controls). Here is a list of Metro Controls that do not have direct counterparts in Silverlight / WPF (or at least not the same name):
·        ApplicationBar - Represents the container control that holds application UI components for commanding and experiences.
·        CaptureElement - Represents a media capture from a capture device, for example recorded video content.
·        CarouselPanel - Represents a panel that presents its items on a surface with a viewport, and includes scrolling capabilities and item virtualization.
·        FlipView - Represents an items control that displays one item at a time, and which enables "flip" behavior for traversing its collection of items.
·        FlipViewItem - Represents the items wrapper class for a FlipView control.
·        Frame - Represents a content control that supports navigation.
·        GridView - Represents a specialized ordered list view.
·        GroupItem - Represents the root element for a subtree that is created for a group.
·        GroupStyle - Describes how to display the grouped items in a collection, such as the collection from GroupItems.
·        GroupStyleSelector - A class that you derive from to select the group style as a function of the parent group and its level.
·        ItemContainerGenerator - Provides mappings between the items of an ItemsControl and their container elements.
·        JumpViewer - Represents a scrollable control that incorporates two views that have a semantic relationship. For example, the JumpView might be an index of titles, and the ContentView might include details and summaries for each of the title entries. Views can be changed using zoom or other interactions.
·        JumpViewerLocation - Communicates information for items and view state in a JumpViewer, such that hosts for scrolling and virtualization (such as ListViewBase) can get correct item and bounds information.
·        ListViewItemTemplateSettings - Provides calculated values that can be referenced as TemplatedParent sources when defining templates for a ListViewItem.
·        NotifyEventArgs - Provides data for the ScriptNotify event.
·        OrientedVirtualizingPanel - Adds infrastructure (provides base class) for virtualizing layout containers that support spatial cues, such as VirtualizingStackPanel and WrapGrid.
·        Page - Encapsulates a page of content that can be navigated to.
·        ProgressRing - Represents a control that indicates that an operation is ongoing. The typical visual appearance is a ring-shaped "spinner" that cycles an animation as progress continues.
·        RichTextBlockOverflow - Provides the overflow area for linked text containers, as a companion class to RichTextBlock.
·        StyleSelector - A class that you derive from to select the item style as a function of the content data and its specific item container.
·        ToggleSwitch - Represents a switch that can be toggled between two states.
·        UIElementCollection - Represents an ordered collection of UIElement objects.
·        VariableSizedWrapGrid - Provides a grid-style layout panel where each tile/cell can be variable size based on content.
·        WebView - Provides a UI element that hosts HTML content within the application.
·        WebViewBrush - Provides a brush that renders the currently hosted HTML.
·        WrapGrid - Positions child elements sequentially from left to right or top to bottom. When elements extend beyond the container edge, elements are positioned in the next row or column.
On the other hand, you can also see what's new in WPF 4.5 - http://msdn.microsoft.com/en-us/library/bb613588(v=VS.110).aspx (flogging a dead horse?).
 
But when do you expect to be developing for Metro? Win8 is at least a year away, and a service pack 1 (the corporate seal of approval) a year after that. Win8 seems targeted at consumers - it's very touchy-feely, social networking enhanced, pastel colored – geared towards tablets. How will it benefit corporations – you know, the guys who actually shell out the cash for us to develop the apps? They just wound up the Win 7 upgrade cycle. Corporations spend money on software to enhance information systems for 2 reasons: 1) to increase revenue and 2) to decrease costs. It's unclear how Metro will help out here. The workplace is about productivity, it's not a gadget fashion show! To top that off, we may be heading into the Great Depression II – Europe is collapsing and the US is drowning in debt – if tens of thousands of people are getting laid off, why would an organization blow cash on upgrading all its office computers now?
 
The Revenge of C++
Should I ditch C# / F# and jump back into C++?
 
GPGPU - C++ AMP (Accelerated Massive Parallelism) - http://blogs.msdn.com/b/nativeconcurrency/archive/2011/09/13/c-amp-in-a-nutshell.aspx  GPU programming is impressive and just in time for Intel MIC – why isn't there a C# version in the .NET Framework? As it turns out, we have had MS Accelerator stuck in MS Research for years http://research.microsoft.com/en-us/projects/accelerator/ , and the 3rd party LINQ to GPU Brahma.NET http://www.infoq.com/news/2010/05/Brahma
 
Will XNA vnext leverage VS2012 DX Graphics frame Debugger, or MSBuild for HLSL ? DirectXMath is a Win8 replacement for XNA Math API - http://msdn.microsoft.com/en-us/library/windows/desktop/hh437833!  If all the new 3D debugging features are dependent on DirectX 11, What is the future of XNA, which is mired back on DirectX 9 ?
 
My first instinct was to go back to basics - Learn to Program for Windows in C++ - http://msdn.microsoft.com/en-us/library/ff381399(VS.85).aspx
 
Win8 Metro claims language parity: - I had a look at Getting started with Windows Metro style app development - http://msdn.microsoft.com/library/windows/apps/br211386/ - It's plain to see that C++ requires SIGNIFICANTLY more code than C#
·        For Binding – ICustomPropertyProvider
·        For Async - AsyncOperationWithProgressCompletedHandler
 
And what about COM interop marshaling overhead? - even with easily discoverable metadata, WinRT is still COM.
 
What do we have to learn from Build:
 
Is that it?
I was expecting revolutionary evolution, not evolutionary revolution ...
 
I love the buzz of being on the cutting edge of technology, and leveraging the power of the highest level of abstraction. That's why I'm in this industry. As a .NET dev, I can pick & choose which APIs I am interested in and tailor my choice of which customer project to pick up (Unfortunately, I can't find an F# customer...). When a great API such as WPF / Silverlight is superseded, and there will never be a v.next, then for me it's time to move on to the next cutting edge.
 
There is so much room for improvement in .NET. (See here for the list: http://geekswithblogs.net/JoshReuben/archive/2010/09/13/series-net-5.0---not-quite-there-yet.aspx). There is also a lot that can be said for the state of software engineering mechanisms in general: today there is no MDA story that works, we have no silver bullet for capturing compilable requirements, and code is way too verbose and is repeated across projects ad nauseam! It's sad for me that Build has decided to reinvent a lot of the BCL, instead of building on what has gone before, abstracting it and improving it. That's not evolution - it's recycling. It also stinks of internal politics between divisions.
 
Do you recall the massive learning curve that accompanied .NET 1.0 and 3.0, and to a lesser extent LINQ, TPL & Rx? That's what I crave - exponential improvement. Compare that to Metro + WinRT, which is another way to write nearly the same old XAML with what boils down to just Win8 controls. Immersive - WPF had multitouch. If I wanted to use the Facebook API, well I would just use it. TPL Dataflow - come on, we saw this at PDC - it's not a new reveal. LINQ to HPC - it's been on the boil for the last 2-3 years.
 
Look at it this way: how much did you have to read to get your head around WCF, WPF, ASP.NET (whatever), compared to the great, but comparatively minor modifications in .NET 4.5 - I just don't feel there are new revolutionary concepts or programming paradigm shifts to wrap my head around.
 
Check out the sessions, and see for yourself!:  http://bit.ly/qzP32k
 

Posted On Thursday, September 15, 2011 11:43 PM | Feedback (2)

Copyright © JoshReuben
