Chapter 3: the compute pipeline
In the previous chapter, we saw how Vulkan handles memory and resources. This allowed us to understand the simplest pipeline out there: the transfer pipeline, which enables transfers of resources and opens some limited ways of modifying their contents.
In this chapter, we finally get to harness the power of GPUs for running arbitrary computations. The compute pipeline is the first truly interesting one that we get to tame!
Along the way, we will discuss GLSL, the device-independent shader language that we use to write programs for GPUs.
A. A high-level overview
A.1. Building and executing a compute shader
The compute pipeline allows us to run arbitrary computations on GPUs. Before we get to that point, however, we need to go through a series of steps.
First, we need to build programs that can run on such devices. We call these special programs shaders, a name that reflects their origin as means of controlling graphical operations (compute shaders are also called kernels). We build shaders using domain-specific languages such as GLSL (wiki). Ultimately, Vulkan only supports the SPIR-V standard binary intermediate language (website), so we have to compile our GLSL code into this form before using it (technically, we could write SPIR-V directly using some assembly for it, but it would be extremely tedious). GPUs do not run SPIR-V code directly. Instead, their drivers compile it into their native machine language. Although the first compilation (from GLSL to SPIR-V) can be done ahead of time, the second one (from SPIR-V to machine language) is device-specific and has to be done at run-time.
Vulkan defines shader modules, which are basically pointers to arbitrary SPIR-V code (that has to be compiled to a device-specific language later on). We cannot run shader modules directly: we need to wrap them in a more general construct called a compute pipeline. Note that, unlike the transfer pipeline, which was an abstract concept, the compute pipeline is a concrete Vulkan object.
To build a compute pipeline, we need to define its pipeline layout, which represents the type of resources that our compute shader will be able to access. Our shader takes some parameters, and the pipeline layout gives us a way of controlling these parameters (each parameter of our shader is bound to an entry in the pipeline layout, and we will later bind pipeline layout entries to actual resources of the right type). Parameters come in two forms: push constants and descriptors. Push constants map into a small memory region that is updated directly from the CPU, whereas descriptors will ultimately be bound to buffers, images, and other resources found in GPU memory (Vulkan makes a distinction between read-only resources — referred to as uniforms — and read/write resources — referred to as storage resources).
Pipeline creation is a costly operation: it is at this stage that the compilation from SPIR-V to machine language takes place. Pipeline objects may be cached to avoid paying the cost of compilation on later runs of the program.
To run a shader, we record a command buffer and bind the shader's pipeline to it through a special command. We then use commands for providing parameters to the shader: we set push constants and bind descriptors to resources. Finally, we use special dispatch commands for running the computation on the GPU. The very last step is to submit this command buffer to a compatible queue.
The pipeline layout does not actually store descriptors directly. Instead, it adds a level of indirection through descriptor set layouts, where each set regroups related descriptors, and we bind descriptors one set at a time. Having multiple descriptor sets helps us avoid rebinding descriptors that do not change across consecutive operations in a command buffer, which is good for performance. A typical pattern is to start recording a command buffer, immediately bind the global resource descriptors, loop on a bunch of objects and, for each of them, bind their local resource descriptors and dispatch a compute call (after this loop, the command buffer is closed and submitted to a compute queue).
In summary, we would go through the following steps for running a single compute operation:
- We build our compute shader and compile it to SPIR-V (outside of Vulkan)
- We create a compute shader object based on the resulting SPIR-V code
- We describe what kind of resources our shader should be able to access in a pipeline layout object
- We create a compute pipeline object that regroups the shader object and the pipeline layout (this is a costly step as it is here that the SPIR-V code gets compiled to machine language; we should use caching here)
- We record a command buffer:
  - We bind the pipeline object to it
  - We provide values for the push constants and bind the descriptors used by the pipeline object
  - We record a dispatch command to run a compute operation (we may use several of them in the same command buffer to run several compute operations)
- We submit the command buffer to a queue that supports compute operations
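The command buffer part of these steps can be sketched in C along the following lines. This is a non-runnable sketch: the `cmdBuf`, `pipeline`, `pipelineLayout`, `descriptorSet` and `queue` handles are assumed to have been created beforehand, the push constant range is assumed to span 16 bytes, and all error handling is omitted.

```c
#include <vulkan/vulkan.h>

/* Sketch: record and submit one compute dispatch.
 * All handles are assumed to exist already; error checking is omitted. */
void run_compute(VkCommandBuffer cmdBuf, VkPipeline pipeline,
                 VkPipelineLayout pipelineLayout,
                 VkDescriptorSet descriptorSet, VkQueue queue,
                 const float *pushData) {
    VkCommandBufferBeginInfo beginInfo = {
        .sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO,
    };
    vkBeginCommandBuffer(cmdBuf, &beginInfo);

    /* Bind the compute pipeline, then its parameters */
    vkCmdBindPipeline(cmdBuf, VK_PIPELINE_BIND_POINT_COMPUTE, pipeline);
    vkCmdBindDescriptorSets(cmdBuf, VK_PIPELINE_BIND_POINT_COMPUTE,
                            pipelineLayout, 0 /* first set */, 1,
                            &descriptorSet, 0, NULL);
    vkCmdPushConstants(cmdBuf, pipelineLayout, VK_SHADER_STAGE_COMPUTE_BIT,
                       0 /* offset */, 16 /* size, assumed */, pushData);

    /* Run 2x2x1 workgroups */
    vkCmdDispatch(cmdBuf, 2, 2, 1);
    vkEndCommandBuffer(cmdBuf);

    /* Submit to a compute-capable queue */
    VkSubmitInfo submitInfo = {
        .sType = VK_STRUCTURE_TYPE_SUBMIT_INFO,
        .commandBufferCount = 1,
        .pCommandBuffers = &cmdBuf,
    };
    vkQueueSubmit(queue, 1, &submitInfo, VK_NULL_HANDLE);
}
```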
For further reference, here is a representation of the compute pipeline (modified version of the graph found on this page; you should be able to understand everything here by the end of this chapter):
A.2. Workgroups and invocations
Compute tasks work by running the same shader (with the same parameters) in a massively parallel manner. We refer to each run as an invocation. Since every invocation runs the exact same code with the exact same parameters, how can each one operate on a different piece of some large data? To solve this conundrum, the shader language exposes special variables that vary on a per-invocation basis: an expression like vec[<invocation_id>] accesses a different item of vec in each invocation.
Large tasks cannot run efficiently on a single on-device chip, so we split them up into smaller units called workgroups (and we always have to use at least one workgroup). All invocations of a given workgroup share the same caches.
Compute tasks have a certain dimensionality to them. For instance, summing the contents of a vector would be a one-dimensional problem, whereas running a kernel on a matrix would be a two-dimensional problem. This gets reflected in the structure of workgroups: for 2D problems, we want 2D-neighbors to share their caches as much as possible, whereas in a 1D setting we would consider linear neighborhoods. In fact, workgroups support splitting in up to three dimensions. The dimensions of a workgroup are defined directly in the shader's code (yes, we have to recompile a shader just to change workgroup dimensions). We are responsible for dispatching enough workgroups to cover all the data (we control this through the arguments of our dispatch command).
Furthermore, dispatching itself can be done in 1D, 2D or 3D, which is mostly a quality of life feature. For instance, assume that we are doing a convolution on a 16x16 matrix and that our workgroups are of size 8x8. Then, our matrix will be split into four 8x8 tiles. Doing a 2x2 dispatch, we run precisely as many invocations as required (2x2x8x8 = 16x16). To access the current item of the matrix for each invocation, we can do something like m[<workgroup_id.x>*8 + <local_id.x>][<workgroup_id.y>*8 + <local_id.y>] from GLSL. Without this feature, we would have to dispatch 4 workgroups flatly and we would then have to manually compute equivalents to <workgroup_id.x> and <workgroup_id.y> from a global workgroup id. This would be tedious: something like m[<workgroup_id>%2*8 + <local_id.x>][<workgroup_id>/2*8 + <local_id.y>].
B. A deeper dive
B.1. From GLSL shaders to Vulkan shader modules
B.1.1. Writing compute shaders
We can write compute shaders in different languages. In this series, we use GLSL (documentation). Before looking at this language in more detail, you may want to check out this collection of CUDA puzzles to build some basic intuition about compute shaders (CUDA is not part of Vulkan but the fundamentals are the same everywhere). Once the basics are in place, writing shaders is relatively straightforward: GLSL feels like a more limited version of C with first-class support for vectors and matrices.
Below is a very minimal example of a GLSL compute shader:
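A minimal sketch could look as follows, consistent with the 8x8 workgroup size discussed below. The buffer and variable names (`buf`, `data`) are made up for the example; it doubles every element of a storage buffer of floats, assuming the dispatch covers the whole buffer.

```glsl
#version 450

// One workgroup covers 8x8 invocations (the z size defaults to 1).
layout(local_size_x = 8, local_size_y = 8, local_size_z = 1) in;

// A storage buffer holding the data we operate on.
layout(set = 0, binding = 0, std430) buffer buf {
    float data[];
};

void main() {
    // Each invocation handles one element, identified by its global id.
    uint i = gl_GlobalInvocationID.y * (gl_NumWorkGroups.x * gl_WorkGroupSize.x)
           + gl_GlobalInvocationID.x;
    data[i] = 2.0 * data[i];
}
```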
Identifying the current invocation
layout(local_size_x = 8, local_size_y = 8, local_size_z = 1) in; defines the size of the workgroups for this shader. As this is hardcoded in the shader's code, we must build different shaders just to change the workgroups' dimensions (see this discussion for more detail; we may use specialization constants to mitigate that cost: read on!). We rely on special GLSL variables to identify the index of the current invocation:
- uvec3 gl_WorkGroupSize: the dimensions of the workgroup
- uvec3 gl_NumWorkGroups: the number of dispatched workgroups
- uvec3 gl_WorkGroupID: the id of the workgroup this invocation belongs to
- uvec3 gl_LocalInvocationID: the id of the invocation within this workgroup
- uvec3 gl_GlobalInvocationID: the global id of this invocation (= gl_WorkGroupID*gl_WorkGroupSize + gl_LocalInvocationID)
- uint gl_LocalInvocationIndex: the same information as gl_LocalInvocationID, but flattened into a single number instead of a vector (you can probably ignore this one, as it carries the same information while being more annoying to use)
Interface blocks: describing the prototype of a shader
Shaders take parameters, which we describe using interface blocks on the GLSL side. Such blocks always start with a layout section, which is about how the different parameters are to be accessed from memory. It may contain the following fields (see the doc for the gory details — I put more precise links whenever possible):
- binding index (doc): binding = <n> (default value: 0). This is an arbitrary identifier for a resource. We can choose any value we want as long as we are coherent in our choice: we have to repeat this value in the pipeline layout entry corresponding to the resource that should be bound to this parameter. This field is not used for push constants as there is at most one of those per shader, and the pipeline layout describes it unambiguously.
- set index (doc): set = <n> (default value: 0). This is the identifier of the descriptor set from which to load the resource. It is used in addition to binding to fully identify the actual resource to load. Again, this field is not used for push constants.
- push constant marker: push_constant, a marker indicating that a resource is a push constant.
- memory layout (doc): either shared (default value except for push constants), packed, std140 or std430 (default value for push constants). Vulkan can be used from different programming languages with different in-memory representations of structures: we may have a contiguous sequence of fields or have offsets between different members (see this article by Eric S. Raymond for more information in the context of C). The memory layout qualifiers provide a way of specifying how to decode a raw blob of memory corresponding to a structure. We are responsible for ensuring that the data actually matches the expectations of the shader (refer to the doc for the details). Note that std430 can only be used for push constants or storage buffers. Memory layout is only relevant for data with a structure type.
- matrix storage order (doc, meaning): either column_major (default value) or row_major. Only relevant for objects containing matrices.
- image formats (doc): many to choose from, e.g., rgba8 or r32ui. Only relevant for images. Should agree with the VkFormat of the image in question.
- alignment: align = <n>. Gives a minimum alignment (in bytes) for structure members.
Individual fields of structured objects may come with a layout of their own, which is specified using either align, one of the matrix order constants, or the offset keyword (which is used to specify the offset of individual structure members and can only be applied to individual members).
Furthermore, we can refine the behavior of shader parameters through memory qualifiers, with the most common options being the readonly and the writeonly qualifiers. Refer to the doc for a more extensive coverage of this topic. These qualifiers provide information to the driver, which helps it produce optimized code.
Additionally, we specify uniform for almost all parameters of the shader: uniform buffers, uniform images, storage images or push constants — but not for storage buffers, which we describe with buffer instead. As you can see, these words have different meanings in Vulkan and in GLSL.
We specify the type of each of our parameters using keywords such as vec3 or image2D. Structure types are more complex: consider for instance layout(push_constant, std430) uniform pc_struct { vec4 data; } pc;. This describes a push constant (pc) of a structured type (pc_struct) containing a single field (data). We can leave out the parameter name for structured types, as in layout(push_constant, std430) uniform pc_struct { vec4 data; }; (note the disappearance of pc). Doing so pulls all of the structure's fields into the toplevel namespace (i.e., a reference to data in the main function resolves to the field of this parameter). We may also define structure types prior to their use instead of going for inline definitions (see the doc).
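Putting the pieces together, here is a sketch of what a shader's full set of parameter declarations might look like. All names and set/binding choices are made up for the example:

```glsl
#version 450

layout(local_size_x = 8, local_size_y = 8, local_size_z = 1) in;

// Push constant block: no set/binding qualifiers needed.
layout(push_constant, std430) uniform pc_struct {
    vec4 data;
} pc;

// Read-only uniform buffer, descriptor set 0, binding 0.
layout(set = 0, binding = 0, std140) uniform Params {
    mat4 transform;
} params;

// Read/write storage buffer, set 0, binding 1; `writeonly` tells the
// driver we never read from it, enabling optimizations.
layout(set = 0, binding = 1, std430) writeonly buffer Output {
    float results[];
};

// Storage image, set 1, binding 0, with an explicit image format.
layout(set = 1, binding = 0, rgba8) uniform image2D img;

void main() {}
```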
Samplers
Images can be accessed directly (image<n>D, with n = 1, 2 or 3) or through samplers (sampler<n>D; doc). Samplers add a level of indirection to image accesses. They are mostly used to apply effects to images that represent textures to be applied to geometry and are more commonly seen in the graphics pipeline. Take a look at this page for illustrations of what samplers can help us achieve. We control the behavior of a sampler on the Vulkan side: we create them with vkCreateSampler, and we bind them to a descriptor in the pipeline layout, like any other resource.
From the point of view of the shader, samplers can be separate or combined. Combined samplers are tied to a specific image, whereas separate samplers can be applied to any (compatible) image. See this page for more information.
Shared variables
Shared variables are exclusive to compute shaders. Declaring variables with the shared qualifier means that all members of a workgroup share a single variable. Accesses to it have to be synchronized inside GLSL, as further discussed in the doc.
We discussed two mechanisms for providing run-time parameters to shaders: uniforms and push constants. A limitation of these two mechanisms is that they are used after the shaders have already been compiled, leaving no chance for the compiler to optimize the code based on the provided values. Of course, this also has upsides: for real-time workloads, where command buffers are run and re-run continuously, we would not want to pay the heavy price of a compilation step every time we update our push constants. Recompiling shaders would only be worth it for parameters that are updated infrequently enough for this price to be amortized.
Specialization constants are constants that we provide at run-time to specialize SPIR-V code before it gets compiled to machine code (doc). They work with parameters of type bool, int, uint, float, or double, and are used like layout(constant_id = 8) const int my_specialization_const = 12; (here, 12 is the default value that is used if no explicit value is provided for the constant; also, note that const can be used for any variable or parameter, but it is mandatory for specialization constants).
Specialization is especially interesting for distributing programs: Vulkan applications are usually published with their shaders precompiled as SPIR-V binaries (this first compilation step can be done once and for all by the developer, no need to burden the users' machines with it). Specialization constants allow us to distribute one SPIR-V binary for a whole family of shaders; the alternative would be to generate an array of SPIR-V shaders by changing the parameters directly in the GLSL code, and precompiling each variant separately (this would only work for values that do not vary over a large range — we cannot ship millions of shaders) or to run a GLSL to SPIR-V compilation on the end user's machine for each specialization (which is definitively more costly than specialization).
Workgroup sizes are an interesting target for specialization constants: we can pick a workgroup size tailored for a specific device at run-time, and specialize our shader without going back to GLSL (the doc gives a clear example of how to do this).
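On the GLSL side, this might look like the following sketch, where the workgroup dimensions are tied to specialization constants 0 and 1 via the dedicated local_size_*_id qualifiers:

```glsl
#version 450

// Workgroup dimensions come from specialization constants 0 and 1;
// in the absence of explicit values, the implementation defaults apply.
layout(local_size_x_id = 0, local_size_y_id = 1) in;

// An ordinary specialization constant with default value 12.
layout(constant_id = 2) const int my_specialization_const = 12;

void main() {}
```

On the Vulkan side, the actual values for constant ids 0, 1 and 2 would then be supplied through the VkSpecializationInfo structure at pipeline creation time.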
Other GLSL functions
There are many other GLSL functions for use in shaders. Here are those for interacting with images, and here is the full list.
B.1.2. Compiling compute shaders
glslangValidator is the GLSL to SPIR-V compiler provided by Khronos, the consortium behind the Vulkan standard. glslc is a wrapper developed by Google for this compiler. It makes its syntax closer to that of gcc. Compiling GLSL shaders to SPIR-V is straightforward: we just set up a Makefile or something similar and we are good to go.
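For instance, a minimal Makefile could look like this (a sketch assuming compute shaders use the .comp extension, from which glslc infers the shader stage, and that glslc from the Vulkan SDK is on the PATH):

```make
# Compile every GLSL compute shader in the directory to SPIR-V.
SHADERS := $(wildcard *.comp)
SPV     := $(SHADERS:.comp=.spv)

all: $(SPV)

%.spv: %.comp
	glslc $< -o $@

clean:
	rm -f $(SPV)
```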
B.1.3. Building shader modules
vkCreateShaderModule takes a pointer to SPIR-V code and its length in bytes. The handle that it returns corresponds to this code, and not to the result of its compilation into the device's machine language: this last — and costly — compilation step will only take place when we actually create the pipeline object.
B.2. The pipeline layout
The pipeline layout specifies which resources are available for use in our shaders. Since this is the compute pipeline, which contains a single shader only, the pipeline layout reads a lot like a rehash of the interface of our shader (things will get more interesting for the graphics pipeline). We create a pipeline layout through vkCreatePipelineLayout, which describes push constants and classical resources alike.
Let's first consider push constants. The push constant mechanism is the fastest way to transfer data from CPU to GPU (the values are encoded directly in the command buffer), but it only supports a very limited amount of data: at least 128 bytes and rarely much more. This data can be updated efficiently between individual dispatches in our command buffer. In the pipeline layout, we specify the range of push constant memory accessible to the compute shader (this helps the driver optimize things). In the graphics pipeline, each shader can have its own range, and ranges can overlap. The GLSL declaration for a push constant parameter must specify an offset matching the start of the accessible range.
A drawback of push constants is that updating their values requires recording a new command buffer. Recording command buffers is not all that expensive, however. So, should we use push constants or not? There is no clear-cut answer to this question; you can check this stackoverflow post for more details (short version: if your engine is already recording new command buffers on every frame, then they come at no cost; otherwise, you should stick with your current architecture, as it will probably not make a huge difference either way).
We describe classical resources through descriptor set layouts, which we create with vkCreateDescriptorSetLayout. A single descriptor set layout stores a set of resource bindings represented by VkDescriptorSetLayoutBinding structures. Note that each binding contains a binding field: we should set it to the value we used for the corresponding parameter in the shader's code (for a shader parameter to be matched to an entry in the pipeline layout, we also need coherent set values: the set-id of a resource is simply the index of its parent set layout in the array that we pass to the pipeline layout). Additionally, we must specify the type of each resource we bind, as well as the stages they are accessible to (again, for compute tasks, there is only one: the compute stage). A single descriptor set layout binding may stand for an array of resources of a given type. We specify the size of these arrays using the descriptorCount field (1 when the binding is not actually an array).
Samplers (combined or separate) are the only resources that can actually be bound when creating their descriptor set layout binding. We do this by providing an array of VkSampler handles (one per descriptorCount, so a single one most of the time). This only works for samplers that remain unchanged throughout the pipeline's lifetime. The underlying image of an immutable combined sampler may however be updated.
B.3. The pipeline object
B.3.1. Creating compute pipelines
We create our compute pipeline through vkCreateComputePipelines. We can create several pipelines in a single call to this function. For each pipeline, a layout and a shader stage are expected.
Shader modules are raw SPIR-V code. To create a pipeline, we need to provide a VkPipelineShaderStageCreateInfo instead of a simple module. This is just a shader module with some added metadata: the name of its main function and what kind of shader it is (a compute one in our case). If we use specialization constants, it is at this point that we set their values (through a VkSpecializationInfo structure; we say how many constants there are in total, we provide a buffer with data, and we specify at what offset in the buffer each constant can be read).
B.3.2. Pipeline caching
Creating a pipeline is costly (each shader in the pipeline gets compiled to machine language). Pipeline caching allows saving compiled pipelines for later reuse. We create VkPipelineCache objects through vkCreatePipelineCache (either empty or initialized with data cached from a previous run), and we serialize their contents through vkGetPipelineCacheData. Using pipeline caches to accelerate pipeline creation is straightforward: we just pass the appropriate cache handle to vkCreateComputePipelines (if the cache is empty, this fills it; otherwise, it may just use it as is, or maybe update it if the driver was updated since the last run).
B.3.3. Pipeline derivatives
Vulkan offers a way of creating new pipelines from related existing ones through the pipeline derivation mechanism. This feature is not supported by major implementers and the word out there is that it is best to ignore it.
B.4. Binding parameters and dispatching computations
Parameter binding occurs within a command buffer (i.e., between a call to vkBeginCommandBuffer and a call to vkEndCommandBuffer).
Additionally, we need to bind our previously created pipeline to the command buffer using vkCmdBindPipeline. Pipeline binding is a relatively expensive operation, so we should group operations that make use of the same pipeline to minimize pipeline switches. We must of course bind a compute pipeline before registering any dispatch command, but we may bind resources (via descriptor sets) even when no pipeline is bound: the only thing that matters is that the required resources are bound whenever we use a pipeline object.
We bind descriptor sets to pipelines through vkCmdBindDescriptorSets (in fact, we may bind whole ranges of descriptor sets at a time). Resource bindings for graphical operations never impact those for compute operations (distinct pipelines use distinct binding points). Only one pipeline of a given type (e.g., compute) can be bound at a time. Binding a new pipeline replaces the previously bound pipeline of that type, but does not directly impact resource bindings. However, assuming that the layout of the newly bound pipeline differs from that of the previous one, then we need to bind new descriptor sets matching it before issuing dispatches anyways. Remember that a pipeline layout contains an array of descriptor set layouts, and that each such layout has an index (its position in that array). When we bind a descriptor set, two things happen:
- If the layout of the descriptor set being bound differs from that which it replaces, then all descriptor sets with a higher index become undefined (and need to be bound to something before a pipeline call uses them if we want to avoid undefined behavior).
- All descriptor sets with a lower index that are not compatible with the current pipeline's layout become undefined.
As always, the full detail can be found in the specification. The key takeaway is that less frequently changing descriptor sets should be placed closer to the beginning of the pipeline layout.
B.4.1. Binding push constants
vkCmdPushConstants updates the push constant data range with arbitrary data. We do not have to update the whole range at once: this function takes an offset and a size argument for partial updates. Furthermore, we explicitly specify which shader stages are concerned with the changes we perform (the compute stage is the only viable option for the compute pipeline). We must pass the layout of the pipeline for which this update is meant, and mention any stage for which even one of the updated bytes is accessible according to that layout (even if it is not accessed in practice).
Note that updating the push constant data range is only possible between different pipeline calls (such as dispatches for compute operations, or, as we will see, renders for graphical ones). To update the value for a precise dispatch (for instance for the real-time rendering of a scene with a moving object whose position is passed using push constants — we could also think of a compute-based example), we must record a new command buffer, as discussed above. Also, the push constant range is initially undefined: we have to update all relevant parts to avoid reading garbage.
There is a single push constant range for a command buffer. Even different types of pipelines interfere here. Remember that when we update a push constant range, we specify the pipeline layout that we are using. At any pipeline call, all accessible push constant bytes must have been last updated with the pipeline's layout or one with identical push constant ranges.
B.4.2. Binding other resources
A pipeline's layout contains descriptor set layouts. For each descriptor set layout, we will eventually need to bind an actual descriptor set. We cannot create descriptor sets in one step: they are objects that need to be allocated before we can set their contents or bind them to a pipeline.
We allocate descriptor sets (possibly several at once) using vkAllocateDescriptorSets. For each set, we provide the descriptor set layout that structures it. We also specify a descriptor pool, a form of memory pool specialized for descriptor sets that we create using vkCreateDescriptorPool. When creating a descriptor pool, we must specify limits: the maximum number of descriptor sets that may be allocated from it, as well as per descriptor type restrictions (specified through VkDescriptorPoolSize objects). Declaring a generous amount of resources upfront instead of computing precise requirements is usually acceptable. When creating a descriptor pool, we may specify the VK_DESCRIPTOR_POOL_CREATE_FREE_DESCRIPTOR_SET_BIT flag, to enable freeing individual descriptor sets via vkFreeDescriptorSets (by default, you are only allowed to free all the resources allocated on the pool at once through vkResetDescriptorPool). Activating this flag may lead to a more complex and slower allocator being used, so only do it when you have to.
We initialize or update the contents of descriptor sets through vkUpdateDescriptorSets, which supports multiple write and copy operations in a single call: writing describes resource bindings by linking resources directly, while copying duplicates such bindings from an existing descriptor set. We describe these operations using the VkWriteDescriptorSet and VkCopyDescriptorSet structures. Each operation focuses on a single binding within a descriptor set. Remember that each binding holds a whole array of resources (quite often, this array contains a single element, but still). Operations can act on subranges of their array, so we do not have to overwrite whole bindings every time.
Write operations support all types of descriptor set resources. Depending on the actual type of resource that we are binding, we will have to use one of three kinds of write operation descriptors:
- Buffers use VkDescriptorBufferInfo. This structure allows us to pick a range in an existing buffer object: we can bind a resource to a subset of a larger buffer object (remember that the pipeline layout does not care about the size of buffer parameters: it is only here that we set it)
- Images and samplers use VkDescriptorImageInfo. This structure includes a sampler, an image layout, and an image view. We met samplers and image layouts in the previous chapter, but image views are newcomers. Views act as fat pointers that control access to the underlying raw data. For example, we can define image views that see only subsets of an image (e.g., a specific miplevel and/or layer) or reinterpret the data as another format (at our own risk), invert the values of some components, etc. We create image views through vkCreateImageView. Not all fields in the descriptor are used for write operations. For instance, for simple images (not combined samplers!), we do not have to provide a valid sampler, as this field is ignored by Vulkan.
- Texel buffers use VkBufferView objects (which we create through vkCreateBufferView). Texel buffers are the buffer equivalent of image views. They are not used too commonly, as they are quite limited: their primary feature is automatic format conversion, and they may be handled as single-layer, no-mipmap 1D images behind the scenes. In GLSL, texel buffer shader parameters are declared as uniform textureBuffer. The doc handles the topic more thoroughly.
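The whole sequence — creating a pool, allocating a set, and pointing one of its bindings at a buffer — can be sketched in C as follows. This is a non-runnable sketch: `device`, `setLayout` and `buffer` are assumed to have been created beforehand (with binding 0 of the layout declared as a storage buffer), and error handling is omitted.

```c
#include <vulkan/vulkan.h>

/* Sketch: allocate one descriptor set and bind a whole buffer to its
 * binding 0. All handles are assumed to exist; error checking omitted. */
void bind_buffer(VkDevice device, VkDescriptorSetLayout setLayout,
                 VkBuffer buffer) {
    /* A pool sized for exactly one storage-buffer descriptor */
    VkDescriptorPoolSize poolSize = {
        .type = VK_DESCRIPTOR_TYPE_STORAGE_BUFFER,
        .descriptorCount = 1,
    };
    VkDescriptorPoolCreateInfo poolInfo = {
        .sType = VK_STRUCTURE_TYPE_DESCRIPTOR_POOL_CREATE_INFO,
        .maxSets = 1,
        .poolSizeCount = 1,
        .pPoolSizes = &poolSize,
    };
    VkDescriptorPool pool;
    vkCreateDescriptorPool(device, &poolInfo, NULL, &pool);

    /* Allocate one set, structured by the descriptor set layout */
    VkDescriptorSetAllocateInfo allocInfo = {
        .sType = VK_STRUCTURE_TYPE_DESCRIPTOR_SET_ALLOCATE_INFO,
        .descriptorPool = pool,
        .descriptorSetCount = 1,
        .pSetLayouts = &setLayout,
    };
    VkDescriptorSet set;
    vkAllocateDescriptorSets(device, &allocInfo, &set);

    /* Write operation: binding 0 of the set now points at the buffer */
    VkDescriptorBufferInfo bufferInfo = {
        .buffer = buffer,
        .offset = 0,
        .range = VK_WHOLE_SIZE, /* bind the whole buffer */
    };
    VkWriteDescriptorSet write = {
        .sType = VK_STRUCTURE_TYPE_WRITE_DESCRIPTOR_SET,
        .dstSet = set,
        .dstBinding = 0,
        .descriptorCount = 1,
        .descriptorType = VK_DESCRIPTOR_TYPE_STORAGE_BUFFER,
        .pBufferInfo = &bufferInfo,
    };
    vkUpdateDescriptorSets(device, 1, &write, 0, NULL);
}
```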
B.4.3. Dynamic descriptors
The dynamic descriptors mechanism helps us avoid paying the CPU overhead of descriptor allocation as often throughout an individual command buffer. In short, it introduces dynamic variants for all types of descriptors. We bind their values to whole arrays/meta-resources, and we use a dynamic offset into these (which we specify while binding the descriptor sets) to setup the binding to the actual resource. We may use a different offset for each dynamic resource. We must also take special care of the device's alignment requirements. Dynamic descriptors are most useful for descriptors that vary per-object in a single command buffer. This mechanism is transparent to GLSL: dynamic descriptors are handled just as their static counterparts. You can find a more detailed treatment of this topic over at vkguide.dev.
B.4.4. Dispatching and running
vkCmdDispatch takes three arguments, groupCountX, groupCountY and groupCountZ, and launches that many workgroups along each dimension (there are variants of this command, but they are quite niche; see vkCmdDispatchIndirect and vkCmdDispatchBase).
Once all of this is done, we call vkEndCommandBuffer to finalize the registration of our command buffer. We then submit it for processing through vkQueueSubmit. Of course, we must target a queue that supports compute operations — and, just like that, we are done!
X. Bonus material
- Vulkan Shader Resource Binding page by NVIDIA.
- Comparing Uniform Data Transfer Methods in Vulkan post by Kyle Halladay.