Overview
OpenCL
is a GPGPU API
that abstracts over acceleration devices (be they CPU, GPU or FPGA)
to provide
data-parallelism
(as well as task-parallelism) behavior.
heterogeneous
portability is achieved
by avoiding
high level abstractions and exposing
the hardware in a
context that explicitly defines its work scheduling capabilities.
An
OpenCL application consists of two parts:
-
the
host
program that
runs on the CPU - API
functions
to discover devices and their capabilities &
create a
context, create command-queues, define
memory objects, &
submit
commands to the
queue, optionally with sync constraints
-
kernels
(typically
simple functions that transform input memory objects {work
items} into
output memory objects)
that
run on the GPU.
Defined
in
the
OpenCL programming language: an
extended
subset of ISO C99.
The Host
Program
specifies
the context for the kernels - defines the NDRange (a
1 to 3 dimensional hierarchical process space) and
the command queues
that specify
how and when the kernels execute on
devices and access memory objects.
A kernel program
object is loaded /
generated & built
at runtime within the host program. Memory
objects are explicitly defined on the host and moved between the host
and the OpenCL devices.
Command
queues
communicate between the host and a device -
3
types of commands:
Command
Queues are Async - When
the kernel has completed its work, memory objects produced in the
computation are
copied back to
the host. Default is
In-order
execution - serialized
execution order of commands in a queue, but
synchronization
commands can be used to
control execution order of
multiple kernels to achieve
load balancing.
Memory
Objects
2
types of memory
objects:
-
1)
buffer objects - a contiguous block of memory - A programmer can map
data structures onto this buffer and access the buffer through
pointers -->
flexibility.
-
2)
image objects
- storage format
optimized to a specific OpenCL device. opaque - OpenCL provides
functions to manipulate images, but contents are hidden from the
kernel program. Can
address
subregions as distinct memory objects.
memory
region hierarchy:
-
Host
memory
-
Global
WorkGroup memory: R/W access to all work-items in all work-groups.
consistent at a work-group barrier.
-
Local
WorkItem memory: shared by all work-items in that work-group. values
seen by a set of work-items are guaranteed to be consistent at
work-group synchronization points.
-
Private
WorkItem
memory:
not visible to
the host. cannot be reordered.
memory
consistency
-
memory
values
are NOT
guaranteed to be consistent across all
work-items at all times.
-
When
all the work-items associated with a kernel complete, load / store
ops
for
the
kernel's
memory objects are completed before the kernel command is signaled
as completed.
For the in-order queue, this is sufficient for
consistency. For an out-of-order queue - 2
synchronization
point
mechanisms:
1) force
consistency at specific synchronization points such as a
command-queue barrier. 2)
explicitly manage
consistency through event mechanisms
OpenCL
can interop with OpenGL:
-
sharing
a device context
-
sharing
memory objects
-
signaling
events
The Kernel Programming Language
An
extension of C99.
Features
removed: recursion, pointers, bitfields
features
added:
-
Address
space qualifiers
-
Vector
types and operations
-
A
large set of built-in functions
-
Atomic
functions for unsigned integer and single-precision scalar variables
in global and local memory
am
embedded profile version relaxes
its
floating-point standards &
relative errors and
makes double support
optional.
A
Typical OpenCL Host
Code Flow:
Discover
& probe
component hardware
capabilities:
-
clGetPlatformIDs() - query available platforms and select one (eg 1st) cl_platform_id
-
clCreateContextFromType() - create a cl_context from a cl_context_properties array containing a cl_platform_id
-
clGetContextInfo() - query CL_CONTEXT_DEVICES to get the size_t of the cl_context device buffer array
-
create a new cl_device_id array pointer to allocate memory for the devices buffer array according to the size_t
-
clGetContextInfo() - call again with the size_t and the array pointer to populate the buffer
Create
the blocks of instructions (kernels) that will run on the platform:
-
use standard C ifstream to open a .cl kernel source file for reading, ostringstream to read a char stream from file via rdbuf() and convert the stream to a char array via good old c_str()
-
clCreateProgramWithSource() - create the cl_program against the cl_context & cl_device from the source code string
-
clBuildProgram() - compile it
-
clCreateKernel() - Create a cl_kernel from the cl_program
Set
up and manipulate memory objects involved in the computation:
-
create 3 vanilla float arrays for kernel IO data mapping (eg 2 in, 1 out)
-
create a cl_mem[3] array of memory buffers.
-
clCreateBuffer()
x 3
– map the buffers:
Copy
the arrays into memory
objects allocated in device memory to
be passed to the kernel where
they can be accessed. input
buffers are created
CL_MEM_READ_ONLY |
CL_MEM_COPY_HOST_PTR
and mapped & initialized from host arrays, while output buffers
are created CL_MEM_READ_WRITE
and initialized
NULL. Create a
cl_mem[3] array of memory
objects (2 in to be
populated from the float arrays, 3rd
to be populated when the kernel runs) against the cl_context
to be used as arguments
to the cl_kernel
Execute
the kernels (optionally
with task parallelism synchronization) and collect
the results.
-
calculate
processing
space extent -
size_t
global and local
workgroup dimensions
-
clEnqueueNDRangeKernel() - enqueue the c_kernel in the cl_command_queue, specifying the processing space dimensions
-
clEnqueueReadBuffer() - Read the output buffer (the mapped cl_mem ) back to the Host from the cl_command_queue
-
check the results in the output array !
A
Canonical 2-in 1-out kernel .cl
file:
__kernel
void sum_kernel(
__global
const float *x,
__global
const float *y,
__global
float *retval){
int g_id = get_global_id(0);
retval[g_id] = x[g_id] + y[g_id];
}
OpenCL Programming Idioms
There
are 2 common programming idioms found in OpenCL coding:
-
1)
in C style, most cl_X functions return a
cl_int
error code (or success code in the case of CL_SUCCESS).
-
2)
context query operations work by populating single item arrays and
generally require 2 calls to clGetContextInfo()
for a specific context code: the first to retrieve the array size_t
for constructing the appropriate storage space to hold the item, the
second to populate the single item array – thats
how it is !