ThreadWeave

Introduction

In a nutshell, ThreadWeave aims to bring the type of functionality found in scipy.weave to the GPU, or, more generally, to SIMD architectures. That is, ThreadWeave allows one to write concise inline SIMD kernels directly in Python, providing both complete imperative backwards compatibility and a useful set of higher-level abstractions.

Most significantly, nd-arrays are first class citizens of the ThreadWeave world. They can be passed around, indexed, have their attributes read, and have boundary handling modes applied to them, all within the kernel.

An example says more than a thousand words, so here are two short ones:

import numpy as np
from threadweave.backend import OpenCL as Backend
from threadweave.stencil import laplacian

with Backend.Context(device = 0) as ctx:

    laplacian_kernel = ctx.kernel("""
        (<type>[d,n,m] output) << [d,n,m] << (<type>[d,n,m] input):
            n:      serial
            d:      variable
            input:  padded(stencil)
        {
            <type> r = 0;
            for (s in stencil(input))
                r += input(s.d,s.n,s.m) * s.weight;
            output(d,n,m) = r;
        }""")

    shape = (2,6,6)
    input = ctx.array(np.arange(np.prod(shape)).astype(np.float32).reshape(shape) ** 2)
    print input    #some arbitrary input data

    stencil = laplacian(2,5)[np.newaxis,:,:] #construct 2-dim 5-pts laplacian, with broadcasting
    output = laplacian_kernel(input, stencil = stencil)
    print output

Without going into the details, this example shows the declaration and invocation of a Laplacian filter applied to a stack of 2d arrays; these few lines of code are converted into 70 lines of dense CUDA or OpenCL boilerplate. Of course, good libraries already exist for performing convolutions; the added value, aside from demonstration purposes, is the greatly expanded flexibility ThreadWeave allows. By acting on a stack of 2d arrays we can expose additional parallelism, and we have full control over boundary handling as well as over the way in which the axes are iterated. Also, the iteration over the stencil is transparently unrolled.

Aside from eliminating boilerplate from imperative C code, the functionality at the core of ThreadWeave also makes for a good platform for building higher level abstractions:

import numpy as np
from threadweave.backend import CUDA as Backend

with Backend.Context(device = 0) as ctx:
    #some example data
    shape = 2,3,4
    A = ctx.arange(np.prod(shape),np.float32).reshape(shape)
    shape = 2,4,3
    B = ctx.arange(np.prod(shape),np.float32).reshape(shape)

    #declare product, contracted over j
    stacked_matrix_product = ctx.tensor_product('nij,njk->nik')
    C = stacked_matrix_product(A, B)

    print C
    print ctx.tensor_product('nii->ni')(C)   #print the diagonals of C too, for good measure

This short example declares and invokes a contracting tensor product; more specifically, it multiplies a stack of matrices. The function ctx.tensor_product is a work-alike of the invaluable numpy.einsum, and can be used to concisely express inner products, outer products, diagonal extractions, transposes, and any other operation which can be expressed as a tensor product with Einstein summation convention.
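For comparison, the same computation expressed with numpy.einsum on host arrays looks like this (a host-side reference only; the device arrays themselves are created through the context, as in the example above):

import numpy as np

# Host-side reference computation, mirroring the ThreadWeave example above.
A_host = np.arange(np.prod((2,3,4)), dtype=np.float32).reshape(2,3,4)
B_host = np.arange(np.prod((2,4,3)), dtype=np.float32).reshape(2,4,3)

C_host = np.einsum('nij,njk->nik', A_host, B_host)   # stack of matrix products
diagonals = np.einsum('nii->ni', C_host)             # diagonal of each matrix in the stack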

By building on top of the abstractions at the core of ThreadWeave, the functionality in ctx.tensor_product can be implemented in just a handful of lines (in its current fully functional yet suboptimal form, that is).

To give an impression of how this happens: both tensor_product and laplacian_kernel are objects of the same type, KernelFactory, which wraps a KernelDeclaration plus a cache of compiled KernelInstances; these instances are created by letting a backend-specific CodeGenerator act on the KernelDeclaration for a given set of template arguments.

You get the idea; quite a few of the same things go on behind the scenes, and they are identical for both examples. The methods Context.tensor_product and Context.kernel merely wrap two different front-ends for parsing their respective grammars into a KernelDeclaration; many more can be created with relatively little effort to cater to specific wants. This frees one from grave discussions, with oneself or others, over syntactical details: if your problem domain asks for a different syntax, it should be easy to implement on top of ThreadWeave.
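To give a rough picture of that machinery, the structure can be sketched as follows (a minimal sketch using the names introduced above; the method names and cache key shown here are hypothetical, not the actual ThreadWeave source):

class KernelFactory:
    """Wraps a KernelDeclaration plus a cache of compiled KernelInstances."""
    def __init__(self, context, declaration):
        self.context = context
        self.declaration = declaration   # backend-agnostic description of the kernel
        self.instances = {}              # template arguments -> compiled KernelInstance

    def __call__(self, *args, **kwargs):
        # Template arguments (shapes, types, stencils) are derived from the call.
        key = self.declaration.template_arguments(args, kwargs)
        if key not in self.instances:
            # A backend-specific CodeGenerator turns the declaration into source code.
            source = self.context.code_generator.generate(self.declaration, key)
            self.instances[key] = self.context.compile(source)
        return self.instances[key](*args, **kwargs)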

 

Feature summary

Here is a concise listing of the core features that ThreadWeave provides at present.

- Nd-array awareness:
  - Passing nd-arrays in and out of kernels
  - Indexing of nd-arrays
  - Various boundary handling modes (padded, wrapped, clamped, fixed)
  - Access to array attributes such as shape with numpythonic syntax
- Axes:
  - Declaratively relate kernel and array axes to one another. This gives a highly flexible and compact notation.
  - Decoupling between the grid of work items to be performed and the grid of threads that perform them. This allows one to overcome hardware limitations on thread grids, and more importantly, to declaratively optimize the relation between the work to be performed and the arrangement of threads that accomplish it.
- Runtime features:
  - JIT compilation of kernels templated on shape, type and stencil arguments
  - Runtime type and shape validation of kernel arguments
  - Output shape inference and allocation (if not given to the kernel)
  - Computation of sensible defaults for grid/block parameters
- Abstracting away the differences between backends: they can be switched out by changing a single import (see the snippet after this list). As of now, CUDA and OpenCL are supported. This applies both to the Python side of things, where ThreadPy presents a uniform interface, and to papering over the superficial syntactical differences between the two C dialects, like atomicAdd / atomic_add.
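For example, the two code examples above differ only in their backend import; switching between OpenCL and CUDA amounts to changing that single line:

from threadweave.backend import OpenCL as Backend    # OpenCL backend
#from threadweave.backend import CUDA as Backend     # ...or the CUDA backend, same interface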

 

ThreadWeave Concepts

Nd-arrays

The two primary SIMD languages, CUDA and OpenCL, are both C dialects, which is a natural choice in many ways. However, C was never designed with SIMD in mind, and although both dialects offer some abstractions for dealing with multiple threads, one is left with the minimal array abstractions C offers for the multiple-data part of the equation. By extending C with nd-array capabilities, we not only gain the immediate benefits of this tried and true abstraction, but also enable the introduction of novel concepts aimed at harnessing the typically tight relationship between the grid of threads and the data they act on.

ThreadWeave facilitates working with nd-arrays both within the kernel code, as well as during declaration and invocation.

In order to provide access to arrays within the kernel, a pointer to the device memory is passed into the kernel (shocker). In addition, all shape information is passed into the kernel, either at compile time or at runtime. This information is then used to compute all strides, which in turn are used to provide clean syntax for indexing nd-arrays. All the array properties passed into or computed within the kernel are accessible in the kernel through numpythonic syntax.
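To make the stride bookkeeping concrete: for a C-contiguous array the strides follow directly from the shape, and an nd-index maps to a flat offset as sketched below (a plain-Python illustration of the general scheme, not the code ThreadWeave actually emits):

def c_contiguous_strides(shape):
    # Strides in elements for a C-contiguous array: the last axis varies fastest.
    strides = [1] * len(shape)
    for axis in reversed(range(len(shape) - 1)):
        strides[axis] = strides[axis + 1] * shape[axis + 1]
    return strides

def flat_index(index, strides):
    # input(d, n, m) resolves to input_ptr[d*strides[0] + n*strides[1] + m*strides[2]]
    return sum(i * s for i, s in zip(index, strides))

strides = c_contiguous_strides((2, 6, 6))    # -> [36, 6, 1]
offset = flat_index((1, 2, 3), strides)      # -> 51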

Each array axis can be given a boundary handling mode. By default, array access is not bounds checked, but when working with stencil operations, for instance, one needs a consistent way of handling array boundaries. All typical boundary handling modes are planned functionality (clamped, fixed, reflected); for now, ThreadWeave only supports wrapped and padded boundaries. By marking an array as padded (with a given stencil), it is treated as a view on a larger array (with enough padding that all access through the stencil is safe). This is both a very flexible way of specifying boundary conditions and the most efficient one in terms of overhead inside the kernel.
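The padded mode is easiest to picture on the host side. A small numpy analogy of the idea of a view on a larger allocation (illustration only; this is not ThreadWeave's device-side implementation, and the radius value is just an assumption for the sketch):

import numpy as np

radius = 1    # maximum stencil offset along each padded axis (illustrative value)
inner_shape = (2, 6, 6)
padded = np.zeros((inner_shape[0],
                   inner_shape[1] + 2*radius,
                   inner_shape[2] + 2*radius), dtype=np.float32)

# The 'logical' array the kernel indexes is a view on the larger padded allocation:
view = padded[:, radius:-radius, radius:-radius]
view[...] = np.arange(np.prod(inner_shape), dtype=np.float32).reshape(inner_shape)

# Any stencil access view[k, n + dn, m + dm] with |dn|, |dm| <= radius stays inside
# 'padded', so no bounds checking is required inside the kernel.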

At the moment, only C-contiguous arrays are supported (plus a restricted notion of views on these, in the form of padded arrays). Support for arbitrary strides would be nice to have, but the nd-array classes included in the current stable branch of pycuda and pyopencl do not support such. This functionality is however around the corner, in the form on the package compyte, which provides a closer analogue of numpy’s ndarray, with a shared code base between pycuda and pyopencl. It’s probably best to wait for the public release of compyte and its integration into pycuda/pyopencl rather than reinventing that wheel.

Kernels and axes

In ThreadWeave, kernels are declared as having an arbitrary number of axes, with these axes spanning a grid of repetitive work items which all have the same kernel body applied to them. As opposed to the typical SIMD model, not all of these work items are necessarily performed in parallel; many kernels benefit from having a thread act serially along an axis, so as to minimize thread launch overhead and maximize reuse of computed terms that are constant between work items. This is in contrast to the CUDA/OpenCL model, where the kernel grid has a one-to-one mapping to a grid of physical threads.

An axis can have its iteration type set to serial, parallel, or hybrid. Hybrid iteration means the work items along the axis are divided among a number of threads, each of which is responsible for looping over a part of the axis; the other two iteration modes are the logical extremes of that scenario. Being able to declaratively change an axis' iteration behavior lets one rapidly find a balance between the benefits and drawbacks of exposing parallelism, without rewriting a single line of kernel body code.
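To make the three modes concrete, here is a host-side sketch of which work items along a single axis each thread would handle under each mode (a conceptual illustration only; the exact partitioning scheme ThreadWeave uses for hybrid iteration is not specified here, and a strided partition is assumed):

def work_items(thread_id, num_threads, size, mode):
    # Which items along an axis of length `size` does one thread handle?
    if mode == 'parallel':    # one work item per thread
        return [thread_id] if thread_id < size else []
    if mode == 'serial':      # a single thread loops over the whole axis
        return list(range(size)) if thread_id == 0 else []
    if mode == 'hybrid':      # each thread loops over a part of the axis
        return list(range(thread_id, size, num_threads))
    raise ValueError(mode)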

Within a kernel declaration, we can create size relations between kernel axes and array axes. This allows the kernel shape and output arguments’ shape to be deduced from the input arguments’ shape. In typical use, most array axes are directly related to a kernel axis in this manner.

Stencil operations

Stencil operations are a staple of GPU computing, and as such, ThreadWeave attempts to accommodate them. Most importantly, ThreadWeave supports special syntax for iterating over (unrolled) stencils. The syntax is ‘for (voxel_id in stencil_id)’, where any properties of the individual voxels of a stencil can be accessed through the loop dummy variable, as in voxel_id.weight, for instance.

For a comprehensive hands-on tutorial on stencil operations in ThreadWeave, see the watershed segmentation unit test in the source.

Context objects

ThreadWeave adds a thin wrapper around native pycuda/pyopencl context objects. This is useful for abstracting away some of the superficial differences between the two, as well as hiding some of their rather C-like characteristics behind a more pythonic interface. This context object aims to be a work-alike of the numpy namespace, providing a comprehensive toolbox of functionality in a familiar interface. For more information, see ThreadPy.

 

Frontend Syntax

ThreadWeave at present supports several front-end syntaxes for accessing its functionality: a general imperative syntax, which allows for the most flexibility, and two more derived ways of declaring kernels, for pure elementwise kernels and tensor products.

General imperative syntax

The imperative syntax is largely familiar C code, with some extensions. The most visible difference is in the signature of the kernel declaration. Fully writing out all the details of this syntax would be more cluttered than just reading the relevant parser definitions in the source (which are quite concise and readable, thanks to pyparsing), and arguably the syntax is most easily grasped by looking at the examples; nonetheless, a coarse description is given here.

The global structure of the syntax is thus:

Colon_terminated_signature_line:
    Indented_annotations_block
Curly_braced_body

The signature has the following grammar:

(output arguments) << [kernel axes] << (input arguments)

This is supposed to be read as 'the input arguments, mapped over a kernel of the given shape, yield the output arguments'.

Kernel axis identifiers should be single lowercase characters. Constraints on the size of an axis can be written inline with its declaration.

Arguments have the form typename[array axes] array_identifier, with an optional assignment of a default value for output arguments (a constant value or a copy of an input argument).
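Putting these pieces together, the signature of the Laplacian example earlier reads:

(<type>[d,n,m] output) << [d,n,m] << (<type>[d,n,m] input):

that is, one input and one output array, each with its three axes bound to the kernel axes d, n and m.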

The array and axis identifiers declared in the signature can be further specified with annotations in the annotations section that follows.

For axes, we can specify an iteration mode (serial, parallel, hybrid) and binding behavior (compile-time or run-time). By default, templated size arguments bind at compile time; this means fewer arguments to pass in and enables more compiler optimizations. However, while kernels are typically called many times with the same size parameters, this need not be the case, and template arguments can be annotated as variable to defer binding of their size value to runtime, avoiding excessive compilation overhead.
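This is exactly what the annotation block of the Laplacian example does: n: serial selects serial iteration for axis n, and d: variable defers binding of the size of axis d to runtime:

n:      serial
d:      variable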

For arrays, the o

