Vulkan compute convolution/cross-correlation

2018-04-08 code

A simple headless Vulkan compute shader example using the C++ vulkan.hpp API.

The example computes the valid 2D cross-correlation of some sample data with a kernel. The kernel is not flipped (an operation that is often also called convolution), and the output covers only the positions where the kernel fits entirely inside the input. The square input and kernel sizes, as well as the work group size, can be specified at runtime.
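As a CPU-side reference for what is being computed, the following is a minimal sketch of a valid 2D cross-correlation. It is not the code from the repository: the function name `crossCorrelateValid` and the row-major `std::vector<float>` layout are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Valid 2D cross-correlation of a square n x n input with a square k x k kernel.
// Both buffers are assumed to be stored row-major (an assumption of this sketch).
// The output is (n - k + 1) x (n - k + 1).
std::vector<float> crossCorrelateValid(const std::vector<float>& input, std::size_t n,
                                       const std::vector<float>& kernel, std::size_t k)
{
    assert(k >= 1 && k <= n);
    assert(input.size() == n * n && kernel.size() == k * k);

    const std::size_t m = n - k + 1; // output edge length
    std::vector<float> output(m * m, 0.0f);
    for (std::size_t y = 0; y < m; ++y) {
        for (std::size_t x = 0; x < m; ++x) {
            float sum = 0.0f;
            for (std::size_t j = 0; j < k; ++j) {
                for (std::size_t i = 0; i < k; ++i) {
                    // No kernel flip: kernel (i, j) pairs with input (x + i, y + j).
                    sum += input[(y + j) * n + (x + i)] * kernel[j * k + i];
                }
            }
            output[y * m + x] = sum;
        }
    }
    return output;
}
```

A true convolution would instead read the kernel at `(k - 1 - i, k - 1 - j)`. For symmetric kernels both operations give the same result.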

Building

The source code is available from GitLab, or together with pre-built binaries as vulkan_compute_convolution.tar.gz.

To build the code you need:

GNU Make
a C++14 compiler
the Vulkan 1.0 library and headers (e.g. Debian stretch-backports)
glslangValidator to compile the shader (e.g. from the LunarG Vulkan SDK)

For benchmarking:

Google benchmark (libbenchmark-dev on Debian)
R

Performance

On an Intel i5-3317U, running the computation on the GPU with the Vulkan API is up to twice as fast as running it on the CPU.

The FLOPS (floating-point operations per second) are given relative to the mean of all CPU samples. The plot is split by the two kernel sizes, 3 and 5.

As expected, the FLOPS of the CPU are relatively constant with respect to the input size, since the CPU is easily fully utilized. The GPU, on the other hand, appears to be underutilized, as its performance increases with the input size and thus the workload.
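The relative FLOPS figures can be derived from a measured run time roughly as sketched below. How the benchmark actually counts operations is not stated here, so the count of one multiply and one add per kernel element is an assumption, as are the function names.

```cpp
#include <cassert>
#include <cstddef>

// Floating-point operations for one pass over a square output of edge length
// outputEdge with a square k x k kernel.
// Assumption: one multiply and one add are counted per kernel element.
double flopCount(std::size_t outputEdge, std::size_t k)
{
    assert(outputEdge >= 1 && k >= 1);
    const double outputs = static_cast<double>(outputEdge) * static_cast<double>(outputEdge);
    return 2.0 * outputs * static_cast<double>(k * k);
}

// FLOPS for a single measured run.
double flopsPerSecond(std::size_t outputEdge, std::size_t k, double seconds)
{
    assert(seconds > 0.0);
    return flopCount(outputEdge, k) / seconds;
}

// FLOPS of one sample relative to the mean FLOPS of all CPU samples.
double relativeFlops(double sampleFlops, double cpuMeanFlops)
{
    assert(cpuMeanFlops > 0.0);
    return sampleFlops / cpuMeanFlops;
}
```

For the valid cross-correlation of an n x n input, `outputEdge` is `n - k + 1`. For a padded variant whose output has the same size as the input, it is `n`.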

OpenCL

Since the main limitation on the performance is the GPU, we expect similar performance from Vulkan and OpenCL.

The convolution samples of the AMD-APP-SDK-v3.0.130.136 are used for comparison.

The OpenCL sample is somewhat different: the input is padded such that the output has the same size as the input. This is compensated for in the FLOPS calculation. Furthermore, the data type is int instead of float; the values are converted to float for the computation and then back to int for storage in the output.
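To make the difference concrete, here is a rough CPU sketch of a same-size cross-correlation on int data. It is not taken from the AMD sample: zero padding, an odd kernel size centered on each output element, and truncation when converting back to int are all assumptions of this sketch, and the name `crossCorrelateSameInt` is made up.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Same-size 2D cross-correlation of an n x n int input with a k x k float kernel.
// Assumptions: row-major storage, odd k, zero padding outside the input,
// truncation toward zero when storing the float sum as int.
std::vector<int> crossCorrelateSameInt(const std::vector<int>& input, std::size_t n,
                                       const std::vector<float>& kernel, std::size_t k)
{
    assert(k % 2 == 1);
    assert(input.size() == n * n && kernel.size() == k * k);

    const std::ptrdiff_t size = static_cast<std::ptrdiff_t>(n);
    const std::ptrdiff_t ksize = static_cast<std::ptrdiff_t>(k);
    const std::ptrdiff_t r = ksize / 2; // kernel radius

    std::vector<int> output(n * n, 0);
    for (std::ptrdiff_t y = 0; y < size; ++y) {
        for (std::ptrdiff_t x = 0; x < size; ++x) {
            float sum = 0.0f;
            for (std::ptrdiff_t j = 0; j < ksize; ++j) {
                for (std::ptrdiff_t i = 0; i < ksize; ++i) {
                    const std::ptrdiff_t sy = y + j - r;
                    const std::ptrdiff_t sx = x + i - r;
                    // Positions outside the input contribute zero (padding).
                    if (sy < 0 || sy >= size || sx < 0 || sx >= size) continue;
                    const int value = input[static_cast<std::size_t>(sy * size + sx)];
                    sum += static_cast<float>(value) * kernel[static_cast<std::size_t>(j * ksize + i)];
                }
            }
            // Convert back to int for storage.
            output[static_cast<std::size_t>(y * size + x)] = static_cast<int>(sum);
        }
    }
    return output;
}
```

Compared with the valid variant, every one of the n x n output elements is computed, which is why the operation count has to be adjusted when comparing FLOPS.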

Comparing the Vulkan sample with the AMD APP SDK OpenCL sample for an input size of 512, there is a notable difference for the kernel size 5 (~ 7%). Whether this difference is due to the drivers, the different data types, or something else entirely is not clear.

Interestingly, using the local data share (LDS) in the OpenCL sample seems to have little effect (~ 3%).

Code size

Judged by source line count, the complexity of the two implementations is comparable: 750 lines of code for the Vulkan sample and 880 for the OpenCL one. Again, the OpenCL example is a bit different, mainly in that it also supports a separable convolution.
