Introduction
stdcuda is a library of data-parallel algorithms with an STL-like interface.
What is stdcuda?
stdcuda is designed to give CUDA programmers convenient access to parallel algorithms through a templated interface similar to the C++ Standard Template Library. stdcuda provides a suite of
commonly encountered data-parallel algorithms that can serve as primitive building blocks of larger systems.
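To make the idea of "primitive building blocks" concrete, the following host-only sketch shows what a reduction and a prefix sum (scan) compute, using the C++ standard library rather than stdcuda. The function names reference_reduce and reference_inclusive_scan are hypothetical and exist only for illustration. The inclusive variant of the scan is an assumption, since this page does not say whether stdcuda's scan is inclusive or exclusive.

```cpp
#include <numeric>
#include <vector>

// Serial reference for a reduction: combine all elements into a single value.
// Hypothetical helper for illustration; not part of stdcuda.
int reference_reduce(const std::vector<int>& in, int init)
{
  return std::accumulate(in.begin(), in.end(), init);
}

// Serial reference for an inclusive prefix sum:
// out[i] = in[0] + in[1] + ... + in[i].
// Hypothetical helper for illustration; not part of stdcuda.
std::vector<int> reference_inclusive_scan(const std::vector<int>& in)
{
  std::vector<int> out(in.size());
  std::partial_sum(in.begin(), in.end(), out.begin());
  return out;
}
```

For the input 3, 1, 4, 1, 5, reference_reduce with an initial value of 0 yields 14, and reference_inclusive_scan yields 3, 4, 8, 9, 14. A parallel implementation should produce the same values as these serial references, which makes them convenient for checking results copied back from the device.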
Features
stdcuda exposes the high-performance computing capabilities of emerging CUDA-capable parallel platforms through a familiar serial programmatic interface. A few of these features include:
vector_dev provides convenient device memory management similar to std::vector.
scan provides an efficient parallel prefix-sum.
reduce provides an efficient parallel reduction.
All functions are implemented through header files, without the hassles common to linked libraries.

Examples

Managing Device Arrays with vector_dev
// vector_example.cu
// This example demonstrates how stdcuda manages device memory
#include <cstdio>   // printf
#include <cstdlib>  // srand, rand
// also include the stdcuda header that declares vector_dev

// stdcuda classes and functions reside in the stdcuda namespace
using namespace stdcuda;

int main(void)
{
  // create a vector of ints residing on a CUDA device
  vector_dev<int> data(10000);

  // fill it with random values
  srand(13);
  for(int i = 0; i != data.size(); ++i)
  {
    data[i] = rand();
  }

  // print the 1024th value
  int val = data[1024];
  printf("The 1024th value is %i\n", val);

  return 0;
}

Host-to-Device Copy
// copy_example.cu
// This example demonstrates how to copy an array from the host to the device
#include <cstdio>   // printf
#include <cstdlib>  // srand, rand
#include <vector>   // std::vector
// also include the stdcuda header that declares vector_dev

int main(void)
{
  // Because per-element access to a vector_dev is slow, we should initialize
  // a vector on the host and copy it to a device vector en masse to
  // amortize the transfer cost

  // create a vector of ints residing on the host
  std::vector<int> h_data(10000);

  // fill it with random values
  srand(13);
  for(int i = 0; i != h_data.size(); ++i)
  {
    h_data[i] = rand();
  }

  // create a vector of ints residing on the device and copy from h_data
  stdcuda::vector_dev<int> d_data(h_data.begin(), h_data.end());

  // check to ensure the 1024th elements of each match
  if(h_data[1024] == d_data[1024])
  {
    printf("No problems!\n");
  }

  return 0;
}

Parallel Reduction
// reduction_example.cu
// This example demonstrates how to compute the sum
// of a large array of numbers with a parallel reduction
#include <cstdio>   // printf
// also include the stdcuda headers that declare vector_dev and reduce

int main(void)
{
  // create a vector of ints residing on the device
  stdcuda::vector_dev<int> d_data(10000);

  // initialize as before
  ...

  // find the sum of the elements of d_data with a reduction
  // (the cast keeps the %u format specifier valid whatever type size() returns)
  printf("Reducing %u elements:\n", (unsigned int)d_data.size());
  int sum = stdcuda::reduce(d_data.begin(), d_data.end(), 0);
  printf("The sum is %i\n", sum);

  return 0;
}

Counting
// counting_example.cu
// This example demonstrates how to count the number of occurrences
// of some element in a large array of numbers with match and pop_count
#include <cstdio>   // printf
// also include the stdcuda headers that declare vector_dev, match, and pop_count

int main(void)
{
  // create a vector_dev as before
  stdcuda::vector_dev<int> d_data(10000);

  // initialize as before
  ...

  // create an array to hold a bit vector, one entry per element of d_data
  // (the element type int is an assumption)
  stdcuda::vector_dev<int> matches(d_data.size());

  // identify all matches of the number 10
  stdcuda::match(d_data.begin(), d_data.end(), matches.begin(), 10);

  // count all non-zero elements of the matches array
  int result = stdcuda::pop_count(matches.begin(), matches.end());
  printf("%i occurrences.\n", result);

  return 0;
}

Stream compaction

Using stdcuda
In order to use stdcuda functions in your CUDA code, you need only check out the source and make it accessible via your include path:
$ svn checkout http://stdcuda.googlecode.com/svn/trunk/stdcuda stdcuda

Related Libraries
CUDPP provides a plan-based interface to several data-parallel primitives such as scan, stream compaction, and sparse matrix-vector multiplication using CUDA. CUDPP is carefully tuned with the objective of peak performance on GPU hardware.
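Stream compaction, one of the primitives named above, is commonly built from a scan. The following host-only sketch illustrates that construction in plain C++: flag the elements to keep, take an exclusive prefix sum of the flags to obtain output positions, then scatter. The function compact_not_equal is hypothetical and is neither an stdcuda nor a CUDPP function. On a GPU, each of the three loops would typically become a data-parallel step.

```cpp
#include <cstddef>
#include <vector>

// Keep only the elements of `in` that differ from `unwanted`, preserving order.
// Hypothetical helper for illustration; not part of stdcuda or CUDPP.
std::vector<int> compact_not_equal(const std::vector<int>& in, int unwanted)
{
  const std::size_t n = in.size();

  // Step 1: flag each element to keep with 1 and every other element with 0.
  std::vector<int> flags(n);
  for (std::size_t i = 0; i != n; ++i)
  {
    flags[i] = (in[i] != unwanted) ? 1 : 0;
  }

  // Step 2: exclusive prefix sum of the flags.
  // offsets[i] is the number of kept elements that precede position i,
  // which is exactly the output index of element i if it is kept.
  std::vector<std::size_t> offsets(n);
  std::size_t running = 0;
  for (std::size_t i = 0; i != n; ++i)
  {
    offsets[i] = running;
    running += static_cast<std::size_t>(flags[i]);
  }

  // Step 3: scatter each kept element to its output position.
  // After the scan, `running` holds the total number of kept elements.
  std::vector<int> out(running);
  for (std::size_t i = 0; i != n; ++i)
  {
    if (flags[i] != 0)
    {
      out[offsets[i]] = in[i];
    }
  }

  return out;
}
```

For the input 0, 7, 0, 3, 9, 0 with 0 as the unwanted value, the flags are 0, 1, 0, 1, 1, 0, the offsets are 0, 0, 1, 1, 2, 3, and the result is 7, 3, 9.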