ThreadWeave

Introduction

In a nutshell, ThreadWeave aims to bring the type of functionality found in scipy.weave to the GPU, or, more generally, to SIMD architectures. That is, ThreadWeave allows one to write concise inline SIMD kernels directly in Python, providing both complete imperative backwards compatibility and a useful set of higher-level abstractions.

Most significantly, nd-arrays are first class citizens of the ThreadWeave world. They can be passed around, indexed, have their attributes read, and have boundary handling modes applied to them, all within the kernel.

An example says more than a thousand words, so here are two short ones:

import numpy as np
from threadweave.backend import OpenCL as Backend
from threadweave.stencil import laplacian

with Backend.Context(device = 0) as ctx:

    laplacian_kernel = ctx.kernel("""
        (<type>[d,n,m] output) << [d,n,m] << (<type>[d,n,m] input):
            n:      serial
            d:      variable
            input:  padded(stencil)
        {
            <type> r = 0;
            for (s in stencil(input))
                r += input(s.d,s.n,s.m) * s.weight;
            output(d,n,m) = r;
        }""")

    shape = (2,6,6)
    input = ctx.array(np.arange(np.prod(shape)).astype(np.float32).reshape(shape) ** 2)
    print input    #some arbitrary input data

    stencil = laplacian(2,5)[np.newaxis,:,:] #construct 2-dim 5-pts laplacian, with broadcasting
    output = laplacian_kernel(input, stencil = stencil)
    print output

Without going into the details, this example shows the declaration and invocation of a Laplacian filter applied to a stack of 2d arrays; these few lines of code are converted into 70 lines of dense CUDA or OpenCL boilerplate. Of course, good libraries already exist for performing convolutions; the added value, aside from demonstration purposes, is the greatly expanded flexibility ThreadWeave allows. By acting on a stack of 2d arrays we can expose additional parallelism, and we have full control over boundary handling as well as over the way in which the axes are iterated. Also, the iteration over the stencil is transparently unrolled.

Aside from eliminating boilerplate from imperative C code, the functionality at the core of ThreadWeave also makes for a good platform for building higher level abstractions:

import numpy as np
from threadweave.backend import CUDA as Backend

with Backend.Context(device = 0) as ctx:
    #some example data
    shape = 2,3,4
    A = ctx.arange(np.prod(shape),np.float32).reshape(shape)
    shape = 2,4,3
    B = ctx.arange(np.prod(shape),np.float32).reshape(shape)

    #declare product, contracted over j
    stacked_matrix_product = ctx.tensor_product('nij,njk->nik')
    C = stacked_matrix_product(A, B)

    print C
    print ctx.tensor_product('nii->ni')(C)   #print the diagonals of C too, for good measure

This short example declares and invokes a contracting tensor product; more specifically, it multiplies a stack of matrices. The function ctx.tensor_product is a work-alike of the invaluable numpy.einsum, and can be used to concisely express inner products, outer products, diagonal extractions, transposes, and any other operation which can be expressed as a tensor product with Einstein summation convention.
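For comparison, the same computation expressed with numpy.einsum on host arrays looks like this (a host-side reference only; the device arrays themselves are created through the context, as in the example above):

import numpy as np

# Host-side reference computation, mirroring the ThreadWeave example above.
A_host = np.arange(np.prod((2,3,4)), dtype=np.float32).reshape(2,3,4)
B_host = np.arange(np.prod((2,4,3)), dtype=np.float32).reshape(2,4,3)

C_host = np.einsum('nij,njk->nik', A_host, B_host)   # stack of matrix products
diagonals = np.einsum('nii->ni', C_host)             # diagonal of each matrix in the stack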

By building on top of the abstractions at the core of ThreadWeave, the functionality in ctx.tensor_product can be implemented in just a handful of lines (in its current fully functional yet suboptimal form, that is).

To give an impression of how this happens: both tensor_product and laplacian_kernel are objects of the same type, KernelFactory, which wraps a KernelDeclaration plus a cache of compiled KernelInstances; these instances are created by letting a backend-specific CodeGenerator act on the KernelDeclaration for a given set of template arguments.

You get the idea; quite a few of the same things go on behind the scenes, and they are identical for both examples. The methods Context.tensor_product and Context.kernel merely wrap two different front-ends for parsing their respective grammars into a KernelDeclaration; many more can be created with relatively little effort to cater to specific wants. This frees one from grave discussions, with oneself or others, over syntactical details: if your problem domain asks for a different syntax, it should be easy to implement on top of ThreadWeave.
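To give a rough picture of that machinery, the structure can be sketched as follows (a minimal sketch using the names introduced above; the method names and cache key shown here are hypothetical, not the actual ThreadWeave source):

class KernelFactory:
    """Wraps a KernelDeclaration plus a cache of compiled KernelInstances."""
    def __init__(self, context, declaration):
        self.context = context
        self.declaration = declaration   # backend-agnostic description of the kernel
        self.instances = {}              # template arguments -> compiled KernelInstance

    def __call__(self, *args, **kwargs):
        # Template arguments (shapes, types, stencils) are derived from the call.
        key = self.declaration.template_arguments(args, kwargs)
        if key not in self.instances:
            # A backend-specific CodeGenerator turns the declaration into source code.
            source = self.context.code_generator.generate(self.declaration, key)
            self.instances[key] = self.context.compile(source)
        return self.instances[key](*args, **kwargs)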

 

Feature summary

Here is a concise listing of the core features that ThreadWeave provides at present.

- Nd-array awareness:
  - Passing nd-arrays in and out of kernels
  - Indexing of nd-arrays
  - Various boundary handling modes (padded, wrapped, clamped, fixed)
  - Access to array attributes such as shape with numpythonic syntax
- Axes:
  - Declaratively relate kernel and array axes to one another. This gives a highly flexible and compact notation.
  - Decoupling between the grid of work items to be performed and the grid of threads that perform them. This allows one to overcome hardware limitations on thread grids, and more importantly, to declaratively optimize the relation between the work to be performed and the arrangement of threads that accomplish it.
- Runtime features:
  - JIT compilation of kernels templated on shape, type and stencil arguments
  - Runtime type and shape validation of kernel arguments
  - Output shape inference and allocation (if not given to the kernel)
  - Computation of sensible defaults for grid/block parameters
- Abstracting away the differences between backends: they can be switched out by changing a single import (see the snippet after this list). As of now, CUDA and OpenCL are supported. This applies both to the Python side of things, where ThreadPy presents a uniform interface, and to papering over the superficial syntactical differences between the two C dialects, like atomicAdd / atomic_add.
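For example, the two code examples above differ only in their backend import; switching between OpenCL and CUDA amounts to changing that single line:

from threadweave.backend import OpenCL as Backend    # OpenCL backend
#from threadweave.backend import CUDA as Backend     # ...or the CUDA backend, same interface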

 

ThreadWeave Concepts

Nd-arrays

The two primary SIMD languages, CUDA and OpenCL, are both C dialects, which is a natural choice in many ways. However, C was never designed with SIMD in mind, and although both dialects offer some abstractions for dealing with multiple threads, one is left with the minimal array abstractions C offers for the multiple-data part of the equation. By extending C with nd-array capabilities, we not only gain the immediate benefits of this tried and true abstraction, but also enable the introduction of novel concepts aimed at harnessing the typically tight relationship between the grid of threads and the data they act on.

ThreadWeave facilitates working with nd-arrays both within the kernel code, as well as during declaration and invocation.

In order to provide access to arrays within the kernel, a pointer to the device memory is passed into the kernel (shocker). In addition, all shape information is passed into the kernel, either at compile time or at runtime. This information is then used to compute all strides, which in turn are used to provide clean syntax for indexing nd-arrays. All the array properties passed into or computed within the kernel are accessible in the kernel through numpythonic syntax.
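To make the stride bookkeeping concrete: for a C-contiguous array the strides follow directly from the shape, and an nd-index maps to a flat offset as sketched below (a plain-Python illustration of the general scheme, not the code ThreadWeave actually emits):

def c_contiguous_strides(shape):
    # Strides in elements for a C-contiguous array: the last axis varies fastest.
    strides = [1] * len(shape)
    for axis in reversed(range(len(shape) - 1)):
        strides[axis] = strides[axis + 1] * shape[axis + 1]
    return strides

def flat_index(index, strides):
    # input(d, n, m) resolves to input_ptr[d*strides[0] + n*strides[1] + m*strides[2]]
    return sum(i * s for i, s in zip(index, strides))

strides = c_contiguous_strides((2, 6, 6))    # -> [36, 6, 1]
offset = flat_index((1, 2, 3), strides)      # -> 51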

Each array axis can be given a boundary handling mode. By default, array access is not bounds checked, but when working with stencil operations, for instance, one needs a consistent way of handling array boundaries. All typical boundary handling modes are planned functionality (clamped, fixed, reflected); for now, ThreadWeave only supports wrapped and padded boundaries. By marking an array as padded (with a given stencil), it is treated as a view on a larger array (with enough padding that all access through the stencil is safe). This is both a very flexible way of specifying boundary conditions and the most efficient one in terms of overhead inside the kernel.
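The padded mode is easiest to picture on the host side. A small numpy analogy of the idea of a view on a larger allocation (illustration only; this is not ThreadWeave's device-side implementation, and the radius value is just an assumption for the sketch):

import numpy as np

radius = 1    # maximum stencil offset along each padded axis (illustrative value)
inner_shape = (2, 6, 6)
padded = np.zeros((inner_shape[0],
                   inner_shape[1] + 2*radius,
                   inner_shape[2] + 2*radius), dtype=np.float32)

# The 'logical' array the kernel indexes is a view on the larger padded allocation:
view = padded[:, radius:-radius, radius:-radius]
view[...] = np.arange(np.prod(inner_shape), dtype=np.float32).reshape(inner_shape)

# Any stencil access view[k, n + dn, m + dm] with |dn|, |dm| <= radius stays inside
# 'padded', so no bounds checking is required inside the kernel.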

At the moment, only C-contiguous arrays are supported (plus a restricted notion of views on these, in the form of padded arrays). Support for arbitrary strides would be nice to have, but the nd-array classes included in the current stable branch of pycuda and pyopencl do not support such. This functionality is however around the corner, in the form on the package compyte, which provides a closer analogue of numpy’s ndarray, with a shared code base between pycuda and pyopencl. It’s probably best to wait for the public release of compyte and its integration into pycuda/pyopencl rather than reinventing that wheel.

Kernels and axes

In ThreadWeave, kernels are declared as having an arbitrary number of axes, with these axes spanning a grid of repetitive work items which all have the same kernel body applied to them. As opposed to the typical SIMD model, not all of these work items are necessarily performed in parallel; many kernels benefit from having a thread act serially along an axis, so as to minimize thread launch overhead and maximize reuse of computed terms that are constant between work items. This is in contrast to the CUDA/OpenCL model, where the kernel grid has a one-to-one mapping to a grid of physical threads.

An axis can have its iteration type set to serial, parallel, or hybrid. Hybrid iteration means the work items along the axis are divided among a number of threads, each of which is responsible for looping over a part of the axis; the other two iteration modes are the logical extremes of that scenario. Being able to declaratively change an axis' iteration behavior lets one rapidly find a balance between the benefits and drawbacks of exposing parallelism, without rewriting a single line of kernel body code.
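To make the three modes concrete, here is a host-side sketch of which work items along a single axis each thread would handle under each mode (a conceptual illustration only; the exact partitioning scheme ThreadWeave uses for hybrid iteration is not specified here, and a strided partition is assumed):

def work_items(thread_id, num_threads, size, mode):
    # Which items along an axis of length `size` does one thread handle?
    if mode == 'parallel':    # one work item per thread
        return [thread_id] if thread_id < size else []
    if mode == 'serial':      # a single thread loops over the whole axis
        return list(range(size)) if thread_id == 0 else []
    if mode == 'hybrid':      # each thread loops over a part of the axis
        return list(range(thread_id, size, num_threads))
    raise ValueError(mode)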

Within a kernel declaration, we can create size relations between kernel axes and array axes. This allows the kernel shape and output arguments’ shape to be deduced from the input arguments’ shape. In typical use, most array axes are directly related to a kernel axis in this manner.

Stencil operations

Stencil operations are a staple of GPU computing, and as such, ThreadWeave attempts to accommodate them. Most importantly, ThreadWeave supports special syntax for iterating over (unrolled) stencils. The syntax is ‘for (voxel_id in stencil_id)’, where any properties of the individual voxels of a stencil can be accessed through the loop dummy variable, as in voxel_id.weight, for instance.

For a comprehensive hands-on tutorial on stencil operations in ThreadWeave, see the watershed segmentation unit test in the source.

Context objects

ThreadWeave adds a thin wrapper around native pycuda/pyopencl context objects. This is useful for abstracting away some of the superficial differences between the two, as well as hiding some of their rather C-like characteristics behind a more pythonic interface. This context object aims to be a work-alike of the numpy namespace, providing a comprehensive toolbox of functionality in a familiar interface. For more information, see ThreadPy.

 

Frontend Syntax

ThreadWeave at present supports several front-end syntaxes for accessing its functionality: a general imperative syntax, which allows for the most flexibility, and two more derived ways of declaring kernels, for pure elementwise kernels and tensor products.

General imperative syntax

The imperative syntax is largely familiar C code, with some extensions. The most visible difference is in the signature of the kernel declaration. Fully writing out all the details of this syntax would be more cluttered than just reading the relevant parser definitions in the source (which are quite concise and readable, thanks to pyparsing), and arguably the syntax is most easily grasped by looking at the examples; nonetheless, a coarse description is given here.

The global structure of the syntax is thus:

Colon_terminated_signature_line:
    Indented_annotations_block
Curly_braced_body

The signature has the following grammar:

(output arguments) << [kernel axes] << (input arguments)

This is supposed to be read as 'the input arguments, mapped over a kernel of the given shape, yield the output arguments'.

Kernel axis identifiers should be single lowercase characters. Constraints on the size of an axis can be written inline with its declaration.

Arguments have the form typename[array axes] array_identifier, with an optional assignment of a default value for output arguments (a constant value or a copy of an input argument).
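Putting these pieces together, the signature of the Laplacian example earlier reads:

(<type>[d,n,m] output) << [d,n,m] << (<type>[d,n,m] input):

that is, one input and one output array, each with its three axes bound to the kernel axes d, n and m.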

The array and axis identifiers declared in the signature can be further specified with annotations in the annotations section that follows.

For axes, we can specify an iteration mode (serial, parallel, hybrid) and binding behavior (compile-time or run-time). By default, templated size arguments bind at compile time; this means fewer arguments to pass in and enables more compiler optimizations. However, while kernels are typically called many times with the same size parameters, this need not be the case, and template arguments can be annotated as variable to defer binding of their size value to runtime, avoiding excessive compilation overhead.
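This is exactly what the annotation block of the Laplacian example does: n: serial selects serial iteration for axis n, and d: variable defers binding of the size of axis d to runtime:

n:      serial
d:      variable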

For arrays, the o

