Chapter 3: the compute pipeline
This chapter is currently being written
In the previous chapter, we saw how memory works and how resources are managed in Vulkan. This allowed us to understand the simplest pipeline out there: the transfer pipeline, which enables transfers of resources (and, to a limited extent, modifications of their contents).
In this chapter, we finally harness the power of GPUs for running custom computations. The compute pipeline is our first truly interesting pipeline!
Along the way, we will encounter the GLSL shader programming language.
A. A high-level overview
The compute pipeline enables using the GPU as a general-purpose computing device. We do this by writing special, GPU-compatible programs called shaders (this name reflects the origin of shaders as programs for controlling graphical operations; compute shaders may also be called kernels). Shaders are built using a domain-specific language such as GLSL (website). We then compile these programs into a binary intermediate language called SPIR-V (website). GPUs do not run SPIR-V natively. Instead, their drivers contain a compiler for turning SPIR-V code into the machine language that corresponds to their actual architecture. Although the compilation from GLSL to SPIR-V can be done at compile-time, the compilation from SPIR-V to machine language is device-specific and has to be done at run-time.
Just like we may call a program with different arguments, we may call a compute shader with different parameters. These parameters come in two forms: push constants and descriptors. Push constants are small pieces of data updated directly from the CPU, whereas descriptors give us a way of interacting with resources found in GPU memory (a distinction is made between read-only resources — referred to as uniforms — and read/write resources — referred to as storage resources). A shader has a given prototype that represents what kind of push constants and descriptors it expects. We describe this prototype explicitly in the form of a pipeline layout object.
We then build a compute pipeline, an object centralizing all the information required to run our shader. In particular, it regroups the shader itself and its prototype. To run the shader, we record a command buffer and bind the pipeline to it through a special command. Note that we have not yet set the values of the shader's arguments — push constants or descriptors. If Vulkan expected us to set them while building the compute pipeline object, we would have no choice but to build a new pipeline every time we wanted to run the same shader with different parameters. However, building a pipeline is costly! Therefore, Vulkan defines special commands for providing values for push constants and binding descriptor sets. We simply pass them to the command buffer to which the pipeline was bound.
Finally, we record special dispatch commands to run the computation, which executes once the command buffer is submitted to a queue.
In summary, we go through the following steps:
- We build our compute shader and compile it to SPIR-V (outside of Vulkan)
- We create a shader module object wrapping the SPIR-V code
- We describe the interface of the compute shader as a pipeline layout
- We create a compute pipeline object that regroups the shader object and the pipeline layout
- We record a command buffer:
  - We bind the pipeline object
  - We bind the descriptor sets/push constants used by the pipeline object
  - We record a dispatch command
- We submit the command buffer to a queue
Up to this point, we glossed over important notions regarding how compute tasks are dispatched. To benefit from the parallelism of GPUs, we need to explicitly split up the problem into smaller subtasks:
- The compute shader needs to run a certain number of times to cover the entire problem area. We refer to each run as an invocation. All invocations run the same code, but the shader language lets us know exactly which invocation is currently being handled through a special variable. That way, different invocations can behave differently: for instance, vec[<invocation_id>] returns a different item of vec for each invocation.
- There is also the notion of workgroups. A large task cannot run efficiently as one monolithic block, so we split it up into smaller workgroups. All invocations of a given workgroup run on the same compute unit and share its caches.
Compute tasks are used for parallel computations. These problems have a certain dimensionality to them. For instance, summing the contents of a vector would be a one-dimensional problem, whereas running a kernel on a matrix would be a two-dimensional problem. This dimensionality is reflected in workgroups: in 2D problems, for instance, we want 2D-neighbors to share their caches as much as possible. In fact, workgroups can be split along up to three dimensions.
The dimensions of a workgroup are defined directly in the shader code. It is then the responsibility of the user to dispatch enough workgroups so as to cover all the data. Furthermore, dispatching itself can be done in 1D, 2D or 3D. This is useful in some situations, but my understanding is that it is mostly a quality-of-life feature. For instance, assume that we are running a convolution on a 16x16 matrix and that our workgroups are of size 8x8. Then, our matrix will be split into four 8x8 tiles. Doing a dispatch of the form 2x2, we run precisely as many invocations as required. To access the current item of the matrix for each invocation, we can do something like m[<workgroup_id.x>*8 + <local_id.x>][<workgroup_id.y>*8 + <local_id.y>]. Without this feature, we would need to dispatch 4 workgroups flatly and we would have to compute equivalents to <workgroup_id.x> and <workgroup_id.y> manually from a global workgroup id. This computation would be slightly less obvious: something like m[<workgroup_id>%2*8 + <local_id.x>][<workgroup_id>/2*8 + <local_id.y>].
B. The compute pipeline in more detail
B.1. From GLSL shaders to Vulkan shader objects
B.1.1. Writing compute shaders
Compute shaders can be written in different languages. Throughout the rest of this series, we assume the use of GLSL (documentation), although using another language is fine. The inner workings of this language are not the focus of this series, but we will discuss it to some extent. You may want to take a look at this collection of CUDA puzzles to build basic intuition about writing compute shaders (CUDA is not part of Vulkan, but the fundamentals are the same everywhere). Once the basics are in place, writing shaders is relatively straightforward: GLSL feels like a more limited version of C with first-class support for vectors and matrixes and a notion of input and output resources.
Below is a very minimal example of what a GLSL compute shader may look like:
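A sketch of such a shader, doubling every cell of a 16x16 matrix stored row by row in a storage buffer (the buffer name, binding and matrix size are illustrative):

```glsl
#version 450

// Each workgroup covers an 8x8 tile; a 2x2 dispatch covers the whole matrix.
layout(local_size_x = 8, local_size_y = 8, local_size_z = 1) in;

layout(set = 0, binding = 0, std430) buffer Matrix {
    float cells[];
} m;

void main() {
    // Derive the cell handled by this invocation from its global id.
    uint index = gl_GlobalInvocationID.y * 16 + gl_GlobalInvocationID.x;
    m.cells[index] = 2.0 * m.cells[index];
}
```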
Identifying the current invocation
layout(local_size_x = 8, local_size_y = 8, local_size_z = 1) in; defines the size of the workgroup for this shader. In order to identify the current invocation, we can rely on a set of special shader variables defined by GLSL:
- in uvec3 gl_WorkGroupSize: the dimensions of a workgroup.
- in uvec3 gl_NumWorkGroups: how many workgroups have been dispatched.
- in uvec3 gl_WorkGroupID: id of the workgroup this invocation belongs to.
- in uvec3 gl_LocalInvocationID: id of the invocation within this workgroup.
- in uvec3 gl_GlobalInvocationID: global id of this invocation (= gl_WorkGroupID*gl_WorkGroupSize + gl_LocalInvocationID).
- in uint gl_LocalInvocationIndex: same use as gl_LocalInvocationID but a single number and not a vector (you can probably ignore this one).
Interface blocks: describing the prototype of a shader
It is not enough to describe the push constants and descriptors in the pipeline layout: we also need to declare them in the shader itself! Declarations of parameters always start with a layout section. This lets GLSL know how the different parameters are to be accessed from memory, through the following fields (see the doc for the gory details — I put more precise links whenever possible):
- binding (doc): binding = <n> (default value: 0). Identifier of the resource that will also be mentioned in the pipeline layout when describing the corresponding resource. Remember that shader parameters are described twice: once in the shader itself and once in the pipeline layout. The binding is a user-defined id that is present in both of these descriptions and builds an explicit correspondence between two descriptions of the same parameter. Push constants do not need that field, as each shader is limited to one such resource.
- set (doc): set = <n> (default value: 0). Identifier of the descriptor set from which to load the resource. It is used in addition to binding to designate the actual resource to load. More information on descriptor sets will follow in the section about pipeline layouts. For the same reason as before, push constants do not need that field.
- push constant marker: push_constant, marker indicating that the resource is a push constant and has no corresponding descriptor.
- memory layout (doc): either shared (default value except for push constants), packed, std140 or std430 (default value for push constants). Vulkan can be used from different programming languages with different in-memory representations of structures: we may have a contiguous sequence of fields or have offsets between different members (see this article by Eric S. Raymond for more information in the context of C). The memory layout qualifiers provide a way of specifying how to decode a raw blob of memory corresponding to the structure. It is the developer's responsibility to ensure that the data actually matches the expectations of the shader (refer to the doc for the details). Note that std430 can only be used for push constants or storage buffers. Obviously, this field is only relevant for data described with a structure type.
- matrix storage order (doc, meaning): either column_major (default value) or row_major. Only relevant for objects containing matrixes.
- image formats (doc): many to choose from, e.g. rgba8 or r32ui. Only relevant for images. Should agree with the VkFormat of the image in question.
- align: align = <n>. Gives a minimum alignment (in bytes) for members of a structure.
Individual fields of structured objects may come with a layout of their own (through known keywords for adding information about alignment or matrix storage order, or the yet unseen offset keyword to specify the offset of individual structure members).
Furthermore, the behavior of shader parameters can be refined through memory qualifiers. The more information about the behavior of the code there is, the more optimizations can be applied by the driver. The common qualifiers for shader parameters are readonly (the object cannot be written to) and writeonly (the object cannot be read from). Refer to the doc for a more extensive coverage of this topic.
Additionally, uniform is specified for almost all parameters of the shader: uniform buffers, uniform images, storage images or push constants — but not storage buffers, which are described with buffer instead. As you can see, the semantics of this keyword are not perfectly aligned between Vulkan and GLSL.
The type of the resource also needs to be specified using keywords such as buffer or image2D. Structure types work a bit differently. Consider for instance layout(push_constant, std430) uniform pc_struct { vec4 data; } pc;. This describes a push constant named pc that is defined with a structured type. The type itself is given the name pc_struct. It contains the single field data. We could also have defined the structure type prior to its use instead of going for an inline definition (see the doc).
Note that defining unnamed shader parameters with a structured type is allowed, e.g., layout(push_constant, std430) uniform pc_struct { vec4 data; }; (note the disappearance of pc). Doing so pulls all their fields into the toplevel namespace (i.e., references to data in the main function would resolve to the field of this parameter).
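Putting the pieces together, a shader's parameter declarations may look as follows (all names, sets and bindings are made up for illustration):

```glsl
// A push constant (no set/binding needed; one per shader).
layout(push_constant, std430) uniform PC { uint count; } pc;

// A read-only uniform buffer (std140 is the only block layout allowed here).
layout(set = 0, binding = 0, std140) uniform Params { vec4 scale; } params;

// Two storage buffers, refined with memory qualifiers. Being unnamed,
// their fields src and dst live in the toplevel namespace.
layout(set = 0, binding = 1, std430) readonly buffer InBuf { float src[]; };
layout(set = 0, binding = 2, std430) writeonly buffer OutBuf { float dst[]; };
```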
Shared variables
Shared variables are a feature exclusive to compute shaders. Declaring variables with the shared qualifier shares them among all invocations of a workgroup. Accesses to them have to be synchronized inside of GLSL, as further discussed in the doc.
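As an illustrative (incomplete) sketch, a workgroup may stage data in a shared tile and synchronize with barrier() before reading it back:

```glsl
// Assumes an 8x8 workgroup; the tile contents are placeholders.
shared float tile[8][8];

void main() {
    uvec3 l = gl_LocalInvocationID;
    tile[l.y][l.x] = 0.0; // in a real shader, load a value from some resource
    barrier();            // wait until every invocation of the workgroup has written its cell
    // from here on, any invocation may read any cell of `tile`
}
```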
Interacting with images from shaders
GLSL defines special functions for interacting with images. The details are in the doc.
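For illustration, the core of a compute shader inverting an image could look as follows (bindings and names are made up; this is an untested sketch):

```glsl
layout(set = 0, binding = 0, rgba8) uniform readonly image2D src;
layout(set = 0, binding = 1, rgba8) uniform writeonly image2D dst;

void main() {
    ivec2 p = ivec2(gl_GlobalInvocationID.xy);
    vec4 texel = imageLoad(src, p);        // read the texel at p
    imageStore(dst, p, vec4(1.0) - texel); // write its negative
}
```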
B.1.2. Compiling compute shaders
glslangValidator is the GLSL to SPIR-V compiler provided by Khronos, the consortium behind the Vulkan standard. glslc is a wrapper developed by Google for this compiler, which makes its syntax closer to that of gcc. Compiling GLSL shaders to SPIR-V is straightforward: we just set up a Makefile or something similar and we are good to go. We only need to pay attention to where the generated SPIR-V files are written, as we will need to upload them to the GPU.
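As a sketch, a Makefile rule along these lines does the job (file names are illustrative; glslc infers the shader stage from the .comp extension):

```make
# Compile every GLSL compute shader (*.comp) into SPIR-V (*.spv).
%.spv: %.comp
	glslc $< -o $@
```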
B.1.3. Building shader modules
Vulkan devices are expected to know how to handle SPIR-V files. In practice, this means that they are fitted with a compiler from SPIR-V to their machine language. In order to build a shader module in Vulkan, we have to do two things:
- The SPIR-V code has to be uploaded to the device
- The SPIR-V has to be compiled to a form adapted to the device
Luckily for us, we do not have to worry about these boring details: vkCreateShaderModule handles everything transparently, from the upload of the code to the compilation of the SPIR-V code. We only need to provide a pointer to our SPIR-V code and its length in bytes. Magic! Note that the shader is not compiled as soon as the shader module is created, but when the pipeline is created: that way, the shader can be compiled to be as efficient as possible for a specific pipeline layout.
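As an untested sketch (variable names are illustrative, error handling omitted), creating a shader module looks as follows:

```c
VkShaderModuleCreateInfo info = {
    .sType = VK_STRUCTURE_TYPE_SHADER_MODULE_CREATE_INFO,
    .codeSize = spirv_size, /* length of the SPIR-V code in bytes */
    .pCode = spirv_code,    /* pointer to the SPIR-V code (const uint32_t *) */
};
VkShaderModule shader_module;
vkCreateShaderModule(device, &info, NULL, &shader_module);
```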
B.2. The pipeline layout, a GPU computation's prototype
The pipeline layout defines the interface of a shader explicitly for Vulkan (although we described this interface a first time in the shader itself, Vulkan does not extract this information directly from there).
vkCreatePipelineLayout is used to create a pipeline layout. We have two kinds of shader parameters to describe: there are push constants as well as traditional resources.
Push constants are specified through a notion of push constant ranges: in situations where we have multiple shaders, each shader may have access to a different range of the memory bound through the push constant mechanism (different ranges may overlap, and your GLSL code should specify a matching offset). When using the compute pipeline, we are limited to a single shader, so this becomes irrelevant. Remember that only a very limited amount of memory can be bound through this mechanism (at least 128 bytes, sometimes a bit more).
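As an untested sketch (sizes and variable names are illustrative), a pipeline layout with one descriptor set layout and one push constant range could be created as follows:

```c
VkPushConstantRange pc_range = {
    .stageFlags = VK_SHADER_STAGE_COMPUTE_BIT,
    .offset = 0,
    .size = 16, /* e.g. a single vec4 */
};
VkPipelineLayoutCreateInfo info = {
    .sType = VK_STRUCTURE_TYPE_PIPELINE_LAYOUT_CREATE_INFO,
    .setLayoutCount = 1,
    .pSetLayouts = &descriptor_set_layout,
    .pushConstantRangeCount = 1,
    .pPushConstantRanges = &pc_range,
};
vkCreatePipelineLayout(device, &info, NULL, &pipeline_layout);
```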
Descriptor set layouts describe the other shader parameters. The notion of descriptor sets is needlessly complex in the context of compute shaders: it is built for the more demanding needs of the graphics pipeline. TODO vkCreateDescriptorSetLayout VkDescriptorSetLayoutBinding
B.3. The pipeline object
TODO VkDescriptorPool VkComputePipelineCreateInfo VkPipelineShaderStageCreateInfo
Derivatives: don't
B.4. Binding parameters and dispatching computations
We start by creating a command buffer that we fill up after having called vkBeginCommandBuffer. In fact, we can immediately call vkCmdBindPipeline to bind the pipeline we previously created to the buffer: further commands will know that they refer to it.
The main thing left to do is to bind values to the parameters of the shader. Again, we have to consider both push constants and other, classical resources. TODO vkCmdPushConstants vkCmdBindDescriptorSets vkUpdateDescriptorSets
vkCmdDispatch dispatches the computation using XxYxZ workgroups (there are variants of this command, but they are very niche ones; see vkCmdDispatchIndirect and vkCmdDispatchBase).
Once all of this is done, we call vkEndCommandBuffer to indicate that we are done recording our command buffer. To run our computation on the GPU, we submit the command buffer through vkQueueSubmit on a queue that supports compute operations — and just like that, we are done!
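The whole flow of this section can be summarized by the following untested sketch (variable names are illustrative; error handling and synchronization are omitted):

```c
vkBeginCommandBuffer(cmd_buf, &begin_info);
vkCmdBindPipeline(cmd_buf, VK_PIPELINE_BIND_POINT_COMPUTE, pipeline);
vkCmdPushConstants(cmd_buf, pipeline_layout, VK_SHADER_STAGE_COMPUTE_BIT,
                   0 /* offset */, sizeof(pc), &pc);
vkCmdBindDescriptorSets(cmd_buf, VK_PIPELINE_BIND_POINT_COMPUTE, pipeline_layout,
                        0 /* first set */, 1, &descriptor_set, 0, NULL);
vkCmdDispatch(cmd_buf, 2, 2, 1); /* 2x2x1 workgroups, as in the 16x16 example */
vkEndCommandBuffer(cmd_buf);
vkQueueSubmit(compute_queue, 1, &submit_info, VK_NULL_HANDLE);
```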