Chapter 3: the compute pipeline
In the previous chapter, we saw how memory works and how resources are managed in Vulkan. This allowed us to understand the simplest pipeline out there: the transfer pipeline, which enables transfers of resources (and, to a limited extent, modification of their contents).
In this chapter, we finally harness the power of GPUs for running custom computations. The compute pipeline is the first truly interesting pipeline we will get to manipulate!
Along the way, we will encounter the GLSL shader programming language.
A. A high-level overview
The compute pipeline enables using the GPU as a general-purpose computing device. We do this by writing special, GPU-compatible programs called shaders (this name reflects the origin of shaders as programs for controlling graphical operations; compute shaders may also be called kernels). Shaders are built using a domain-specific language such as GLSL (website). We then compile these programs into a binary intermediate language called SPIR-V (website). GPUs do not run SPIR-V natively. Instead, their drivers contain a compiler for turning SPIR-V code into the machine language that corresponds to their actual architecture. Although the compilation from GLSL to SPIR-V can be done at compile-time, the compilation from SPIR-V to machine language is device-specific and has to be done at run-time.
Just like we may call a program with different arguments, we may call a compute shader with different parameters. These parameters come in two forms: push constants and descriptors. Push constants are small pieces of data updated directly from the CPU, whereas descriptors give us a way of interacting with resources found in GPU memory (a distinction is made between read-only resources — referred to as uniforms — and read/write resources — referred to as storage resources). A shader has a given prototype that represents what kind of push constants and descriptors it expects. We describe this prototype explicitly in the form of a pipeline layout object. As we will see in the graphics chapter, describing the pipeline layout as a shader's prototype is not exact: it really describes which resources are available to shaders in the pipeline. Different shaders have different prototypes, but they all refer to the same layout. As there is only one shader in the compute pipeline, the analogy works here, but it does not carry over in the graphical case. Anyway, I find this little lie useful for building intuition in this simpler case.
We then build a compute pipeline object, a construct that centralizes all the information required to run our shader. In particular, it contains both the shader itself and its prototype. To run the shader, we record a command buffer and bind the pipeline to it through a special command. Note that we have not yet set the values of shader arguments — push constants or descriptors. If Vulkan expected us to set them while building the compute pipeline object, we would have no other choice than to build a new pipeline every time we wanted to run the same shader with different parameters. However, building a pipeline is costly! Therefore, Vulkan defines special commands for providing values for push constants and binding descriptor sets. We simply pass them to the command buffer to which the pipeline was bound.
Finally, we record special dispatch commands into the command buffer and submit it to a queue to actually run the computation.
In summary, we go through the following steps:
- We build our compute shader and compile it to SPIR-V (outside of Vulkan)
- We create a shader module object from the SPIR-V (this is what lets the driver compile it to machine language)
- We describe the interface of the compute shader as a pipeline layout
- We create a compute pipeline object that regroups the shader object and the pipeline layout
- We record a command buffer:
  - We bind the pipeline object
  - We bind the descriptor sets and push constants used by the pipeline object
  - We record a dispatch command
- We submit the command buffer to a queue that supports compute operations
Up to this point, we glossed over important notions regarding how compute tasks are dispatched. To benefit from the parallelism of GPUs, we need to explicitly split the problem up into smaller subtasks:
- The compute shader needs to run a certain number of times to cover the entire problem area. We refer to each run as an invocation. All invocations run the same code, but the shader language lets us know exactly which invocation is currently being handled through a special variable. That way, different invocations can behave differently: for instance, vec[<invocation_id>] returns a different item of vec for each invocation.
- There is also the notion of workgroups. A large task cannot run efficiently on a single compute unit, so we split it up into smaller workgroups. All invocations of a given workgroup run together and share the same caches.
Compute tasks are used for parallel computations. These problems have a certain dimensionality to them. For instance, summing the contents of a vector would be a one-dimensional problem, whereas running a kernel on a matrix would be a two-dimensional problem. This dimensionality is reflected in workgroups. For instance, in 2D problems, we want 2D-neighbors to share their caches as much as possible. In fact, workgroups support splitting in up to three dimensions.
The dimensions of a workgroup are defined directly in the shader code. It is then the responsibility of the user to dispatch enough workgroups so as to cover all the data. Furthermore, dispatching itself can be done in 1D, 2D or 3D. This is useful in some situations, but my understanding is that it is mostly a quality-of-life feature. For instance, assume that we are doing a convolution on a 16x16 matrix and that our workgroups are of size 8x8. Then, our matrix will be split into four 8x8 tiles. Doing a dispatch of the form 2x2, we run precisely as many invocations as required. To access the current item of the matrix for each invocation, we can do something like m[<workgroup_id.x>*8 + <local_id.x>][<workgroup_id.y>*8 + <local_id.y>]. Without this feature, we would need to dispatch 4 workgroups flatly and we would have to compute equivalents to <workgroup_id.x> and <workgroup_id.y> manually from a global workgroup id. This computation would be slightly less obvious: something like m[<workgroup_id>%2*8 + <local_id.x>][<workgroup_id>/2*8 + <local_id.y>].
For further reference, here is a representation of the compute pipeline (modified version of the graph found on this page):
B. The compute pipeline in more detail
B.1. From GLSL shaders to Vulkan shader objects
B.1.1. Writing compute shaders
Compute shaders can be written in different languages. Throughout the rest of this series, we assume the use of GLSL (documentation), although using another language is fine. Although the inner workings of this language are not the focus of this series, we will discuss it to some extent. You may want to take a look at this collection of CUDA puzzles to build basic intuition about writing compute shaders (CUDA is not part of Vulkan but the fundamentals are the same everywhere). Once the basics are in place, writing shaders is relatively straightforward: GLSL feels like a more limited version of C with first-class support for vectors and matrixes and a notion of input and output resources.
Below is a very minimal example of what a GLSL compute shader may look like:
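This sketch doubles every value of a storage buffer (the buffer and variable names are illustrative):

```glsl
#version 450

// One 8x8 workgroup tile.
layout(local_size_x = 8, local_size_y = 8, local_size_z = 1) in;

// A storage buffer holding the data to process.
layout(set = 0, binding = 0, std430) buffer data_buffer {
    float data[];
};

void main() {
    // Flattened global index of this invocation.
    uint width = gl_NumWorkGroups.x * gl_WorkGroupSize.x;
    uint i = gl_GlobalInvocationID.y * width + gl_GlobalInvocationID.x;
    data[i] = 2.0 * data[i];
}
```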
Identifying the current invocation
layout(local_size_x = 8, local_size_y = 8, local_size_z = 1) in; defines the size of the workgroup for this shader. In order to identify the index of the current invocation, we can rely on special shader variables defined by GLSL:
- uvec3 gl_WorkGroupSize: the dimensions of a workgroup.
- uvec3 gl_NumWorkGroups: how many workgroups have been dispatched.
- uvec3 gl_WorkGroupID: id of the workgroup this invocation belongs to.
- uvec3 gl_LocalInvocationID: id of the invocation within this workgroup.
- uvec3 gl_GlobalInvocationID: global id of this invocation (= gl_WorkGroupID*gl_WorkGroupSize + gl_LocalInvocationID).
- uint gl_LocalInvocationIndex: same information as gl_LocalInvocationID, but flattened into a single number instead of a vector (you can probably ignore this one).
Interface blocks: describing the prototype of a shader
It is not enough to describe the push constants and descriptors in the pipeline layout: we also need to declare them in the shader itself! Declarations of parameters always start with a layout section. This lets GLSL know how the different parameters are to be accessed from memory, through the following fields (see the doc for the gory details — I put more precise links whenever possible):
- binding (doc): binding = <n> (default value: 0). Identifier of the resource that will also be mentioned in the pipeline layout when describing the corresponding resource. Remember that shader parameters are described twice: once in the shader itself and once in the pipeline layout. The binding is a user-defined id that is present in both of these descriptions and builds an explicit correspondence between two descriptions of the same parameter. Push constants do not need that field, as each shader is limited to one such resource.
- set (doc): set = <n> (default value: 0). Identifier of the descriptor set from which to load the resource. It is used in addition to binding to designate the actual resource to load. More information on descriptor sets will follow in the section about pipeline layouts. For the same reason as before, push constants do not need that field.
- push constant marker: push_constant indicates that the resource is a push constant and has no corresponding descriptor.
- memory layout (doc): either shared (default value except for push constants), packed, std140 or std430 (default value for push constants). Vulkan can be used from different programming languages with different in-memory representations of structures: we may have a contiguous sequence of fields or have offsets between different members (see this article by Eric S. Raymond for more information in the context of C). The memory layout qualifiers provide a way of specifying how to decode a raw blob of memory corresponding to the structure. It is the developer's responsibility to ensure that the data actually matches the expectations of the shader (refer to the doc for the details). Note that std430 can only be used for push constants or storage buffers. Obviously, this field is only relevant for data described with a structure type.
- matrix storage order (doc, meaning): either column_major (default value) or row_major. Only relevant for objects containing matrixes.
- image formats (doc): many to choose from, e.g. rgba8 or r32ui. Only relevant for images. Should agree with the VkFormat of the image in question.
- align: align = <n>. Gives a minimum alignment (in bytes) for members of a structure.
Individual fields of structured objects may come with a layout of their own (through known keywords for adding information about alignment or matrix storage order, or the yet unseen offset keyword to specify the offset of individual structure members).
Furthermore, the behavior of shader parameters can be refined through memory qualifiers. The more information about the behavior of the code there is, the more optimizations can be applied by the driver. The common qualifiers for shader parameters are readonly (the object cannot be written to) and writeonly (the object cannot be read from). Refer to the doc for a more extensive coverage of this topic.
Additionally, uniform is specified for almost all parameters of the shader: uniform buffers, uniform images, storage images or push constants — but not storage buffers, which are described with buffer instead. As you can see, the semantics of this keyword are not perfectly aligned between Vulkan and GLSL.
The type of the resource also needs to be specified using keywords such as buffer or image2D. Structure types work a bit differently. Consider for instance layout(push_constant, std430) uniform pc_struct { vec4 data; } pc;. This describes a push constant named pc that is defined with a structured type. The type itself is given the name pc_struct. It contains the single field data. We could also have defined the structure type prior to its use instead of going for an inline definition (see the doc).
Note that defining unnamed shader parameters with a structured type is allowed, e.g., layout(push_constant, std430) uniform pc_struct { vec4 data; }; (note the disappearance of pc). Doing so pulls all their fields into the top-level namespace (i.e., references to data in the main function would resolve to the field of this parameter).
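Putting the pieces of this section together, a shader's parameter declarations might look like this (the names, bindings and types are illustrative):

```glsl
#version 450
layout(local_size_x = 64) in;

// Push constant block: no set/binding, at most one per shader.
layout(push_constant, std430) uniform push_block {
    uint element_count;
} pc;

// Read-only uniform buffer at set 0, binding 0.
layout(set = 0, binding = 0, std140) uniform params_block {
    vec4 scale;
} params;

// Read/write storage buffer at set 0, binding 1 (unnamed: `data` is pulled
// into the top-level namespace).
layout(set = 0, binding = 1, std430) buffer data_block {
    vec4 data[];
};

// Write-only storage image at set 0, binding 2, with an explicit format.
layout(set = 0, binding = 2, rgba8) writeonly uniform image2D out_image;

void main() {
    uint i = gl_GlobalInvocationID.x;
    if (i < pc.element_count)
        data[i] *= params.scale;
}
```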
Samplers
Images can be accessed directly or through samplers. A sampler is an interface that controls how an underlying raw image is accessed (doc). In GLSL, the type of such resources is marked with, e.g., sampler2D instead of image2D. Precisely how the sampling is performed is described on the Vulkan side: samplers are created with vkCreateSampler, and are mostly used when images do not represent data but textures to be applied to geometry, i.e., graphics tasks (see this page for an illustration of what this looks like in practice).
From the point of view of the shader, samplers can be separate or combined. Combined samplers are tied to a specific image, whereas separate samplers can be applied to any accessible and compatible image. See this page for more information.
Shared variables
Shared variables are a feature that is exclusive to compute shaders. Declaring variables with the shared qualifier shares them among all members of a workgroup. Accesses to them have to be synchronized inside of GLSL, as further discussed in the doc.
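As a sketch, a workgroup could cooperatively load a tile of a buffer into shared memory before processing it (the names are illustrative; note the barrier() call that synchronizes the workgroup between the write and the reads):

```glsl
#version 450
layout(local_size_x = 8, local_size_y = 8) in;

layout(set = 0, binding = 0, std430) buffer data_block {
    float data[];
};

// One tile of data shared by all invocations of the workgroup.
shared float tile[8][8];

void main() {
    uvec2 l = gl_LocalInvocationID.xy;
    uint width = gl_NumWorkGroups.x * gl_WorkGroupSize.x;
    uint idx = gl_GlobalInvocationID.y * width + gl_GlobalInvocationID.x;

    // Each invocation loads one item into the shared tile...
    tile[l.y][l.x] = data[idx];
    // ...and waits for the rest of the workgroup before reading from it.
    barrier();

    // Example use: add the horizontally mirrored item of the tile.
    data[idx] = tile[l.y][l.x] + tile[l.y][7u - l.x];
}
```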
Interacting with images from shaders
GLSL defines special functions for interacting with images. The details are in the doc.
Other GLSL functions
There are many other functions that are defined by GLSL for use in shaders. Here is the full list.
B.1.2. Compiling compute shaders
glslangValidator is the GLSL to SPIR-V compiler provided by Khronos, the consortium behind the Vulkan standard. glslc is a wrapper developed by Google for this compiler, which makes its syntax closer to that of gcc. Compiling GLSL shaders to SPIR-V is straightforward: we just set up a Makefile or something similar and we are good to go. We only need to keep track of where the generated SPIR-V files end up, as we will later need to upload them to the GPU.
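For instance, a minimal Makefile for this could look as follows (the directory names and file extensions are illustrative; recipe lines must be indented with tabs):

```make
# Compile every GLSL compute shader in shaders/ to SPIR-V in build/.
SHADERS := $(wildcard shaders/*.comp)
SPIRV   := $(patsubst shaders/%.comp,build/%.spv,$(SHADERS))

all: $(SPIRV)

build/%.spv: shaders/%.comp
	mkdir -p build
	glslc $< -o $@
```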
B.1.3. Building shader modules
Vulkan devices are expected to know how to handle SPIR-V files. In practice, this means that they are fitted with a compiler from SPIR-V to their machine language. In order to build a shader module in Vulkan, we have to do two things:
- The SPIR-V code has to be uploaded to the device
- The SPIR-V has to be compiled to a form adapted to the device
Luckily for us, we do not have to worry about these boring details: vkCreateShaderModule handles everything transparently, from the upload of the code to its compilation. We only need to provide a pointer to our SPIR-V code and a measure of its length in bytes. Magic! Note that the shader is typically not compiled as soon as the shader module is created, but rather when the pipeline is created: that way, the shader can be compiled so as to be as efficient as possible for a specific pipeline layout.
B.2. The pipeline layout, a GPU computation's prototype*
*Terms and conditions may apply. This view only works in the compute case. The truth is that the layout defines which resources may be referred to by shaders of the pipeline, as discussed in the first section of this chapter. There is a single shader in the compute pipeline, so it does not make a difference here.
The pipeline layout defines the interface of a shader explicitly for Vulkan (although we described this interface a first time in the shader itself, Vulkan does not extract this information directly from there).
vkCreatePipelineLayout is used to create a pipeline layout, i.e., to describe the interface of a shader. The interface of a shader is defined by its parameters, of which there are two kinds: push constants and classical resources.
Let's first consider push constants. The push constant mechanism allows us to bind a limited amount of memory for the fastest form of data transfer there is (the minimum allowed amount of memory is 128 bytes; you can rarely assume that more will be available). In situations where we have multiple shaders, each shader may have access to a different range of the memory thus bound (different ranges may overlap, and your GLSL code should specify an appropriate offset for push constants, reflecting the offset of the accessed range). When using the compute pipeline, we are limited to a single shader only, so this may all seem a bit contrived. The compute pipeline layout nonetheless relies on the notion of push constant ranges to integrate push constant access for compute shaders.
Classical resources are described through the notion of descriptor set layouts, created with vkCreateDescriptorSetLayout. Just like push constant ranges, the notion of descriptor sets seems contrived when considered in the context of the compute pipeline, but it was defined with more general usecases in mind. Remember that we saw how to specify the id of the set a resource is part of in the layout section of a shader's parameter declarations. Descriptor set layouts federate descriptions of sets of resource bindings. Each individual resource is defined through a VkDescriptorSetLayoutBinding. In particular, the VkDescriptorSetLayoutBinding contains a binding field that we should set to a value matching the one we picked for this parameter in the GLSL shader. In addition, we need to give the type of bound resources, as well as the stage they are active for (for compute tasks, the compute stage only).
For compute shaders, it is best to put everything in the first descriptor set. We will discuss more advanced uses of descriptor sets at a later point.
A descriptor set layout binding may bind several resources of the same type. How many resources are bound is controlled through the descriptorCount field. In GLSL, resources whose descriptor set layouts have their descriptorCount set to something greater than one are described as arrays of resources.
Note that unlike any other resource, it is actually possible to bind a sampler (be it combined or separate) at layout creation time. This is only possible for samplers that remain unchanged throughout the lifetime of the pipeline (although the underlying image may of course always change, both for separate and combined samplers). This is done by providing an array of descriptorCount VkSamplers.
B.3. The pipeline object
B.3.1. Creating compute pipelines
We create compute pipelines through vkCreateComputePipelines. This function makes it possible to create several compute pipelines at once. For each pipeline, a layout and a shader stage are expected.
B.3.2. Pipeline cache
Creating a pipeline is costly (mostly because each bound shader needs to be compiled). Pipeline caching is about saving this result between runs. A VkPipelineCache object is created through vkCreatePipelineCache. In order to cache the results of vkCreateComputePipelines, we need to pass a handle to a cache object to this function. We can accelerate future runs of this program by writing the result of vkGetPipelineCacheData to a file. This data can be loaded while creating the pipeline cache object in the first place.
B.3.3. Pipeline derivatives
Vulkan offers a way of creating new pipelines from existing ones through the pipeline derivation mechanism. Major implementations do not really benefit from this feature, and the word out there is that it is best to ignore it.
B.4. Binding parameters and dispatching computations
We start by creating a command buffer that we fill up after having called vkBeginCommandBuffer. In fact, we can immediately call vkCmdBindPipeline to bind the pipeline we previously created to the buffer: further commands will know that they refer to it. This enables us to start binding values to the parameters used by the pipeline's shader.
Let's start with push constants. vkCmdPushConstants lets us pass arbitrary data to the push constant data range. We do not have to update the whole range at once: this function takes an offset and a size argument that allow us to update only a subset of the whole range at a time. Furthermore, we explicitly specify which shader stages are concerned with the changes we perform.
The situation is more complex for classical resources. In our layout, we have descriptor set layouts. We now bind actual descriptor sets to the pipeline, according to the layout. However, we cannot create a descriptor set just like that: we have to allocate it instead. This is done through vkAllocateDescriptorSets. Note that this function allocates several sets at a time. For each set, we have to pass the relevant layout. For compute shaders used in isolation, we only need to allocate one descriptor set.
A descriptor set is allocated from a descriptor pool, a form of memory pool specialized for this purpose. These are created using vkCreateDescriptorPool. When creating such an object, we are expected to define some limits explicitly: how many descriptor sets may be allocated from the pool, as well as limitations per individual descriptor type (the latter being given as VkDescriptorPoolSize structures). Declaring a large amount of resources upfront instead of asking precisely for the resources that will be used is acceptable in most cases, although it is not optimal. Setting the VK_DESCRIPTOR_POOL_CREATE_FREE_DESCRIPTOR_SET_BIT flag enables freeing individual descriptor sets through vkFreeDescriptorSets. Indeed, this is not possible by default! Instead, you are only allowed to free all the resources allocated on the pool at once through vkResetDescriptorPool. Not activating this flag seems to be the best choice in most situations.
Once our descriptor set has been allocated, we can set its value using vkUpdateDescriptorSets, which can do many writing/copying operations in one call. Each of these operations is described explicitly in special structures. For instance, assume that we have created an array of VkDescriptorBufferInfo (a new structure used exclusively for describing such bindings), and that we had previously described in our pipeline layout a resource (with a certain set and binding id within that set) that corresponds to an array of buffers. Then, we update the descriptor set with a VkWriteDescriptorSet operation. We say that we want to copy data from our VkDescriptorBufferInfo array, and that the data to update is located at binding id x in a given set — we also specify how many items we would like to update in the array; in fact, we can overwrite only part of the binding if we so want, and we can do so at an offset from the start. There are two other kinds of resources that we can pass to VkWriteDescriptorSet instead of VkDescriptorBufferInfo: VkDescriptorImageInfo and VkBufferView (created through vkCreateBufferView). The former is obviously used for images, whereas the latter is a variant of the structure we previously used for buffer descriptors. Both of these kinds of resources make use of views (indeed, the descriptor info for images is made of a sampler, an image layout and an image view created through vkCreateImageView). Views are a kind of fat pointer that stores information regarding how to access and interpret the raw data it points to. For instance, for images, we could restrict the pointer to a subset of the image (e.g., a specific mip level and/or layer), reinterpret the data as another format (at our own risk), invert the values of some components, among other funky stuff. Buffer views are rarely used in practice; see this discussion about buffer textures for more information about their only (AFAIK) classical usecase.
Alternatively, we could describe a copy of existing descriptors using VkCopyDescriptorSet. This operation follows similar principles.
Now that we have a descriptor set, we can bind it to our pipeline through vkCmdBindDescriptorSets. In fact, we may bind consecutive sets tied to the same, explicitly provided layout in a single call. Additionally, this is where we apply dynamic offsets if we use dynamic buffers or images, a concept whose explanation I currently defer to the wonderful vkguide.dev.
vkCmdDispatch records the dispatch of the computation over XxYxZ workgroups (there are variants of this command, but they are very niche ones; see vkCmdDispatchIndirect and vkCmdDispatchBase).
Once all of this is done, we call vkEndCommandBuffer to indicate that we are done recording our command buffer. To run our computation on the GPU, we submit the command buffer through vkQueueSubmit on a queue that supports compute operations — and just like that, we are done!