Chapter 1: the core concepts
In this chapter, we look at how the Vulkan API is structured. We stay generic and do not yet focus on specific workloads such as graphics rendering or compute. This chapter is redundant by design: the first section is a high-level overview of Vulkan's structure; later sections cover the same ground in more detail.
A. A whirlwind tour
To interact with the Vulkan API, we use a handle known as the Vulkan instance. When creating this instance, we specify which Vulkan extensions we require and how much debugging information we want.
Through the instance, we are able to detect all Vulkan-compatible physical devices on the system (typically, GPUs). Physical devices can be probed for information, e.g., about the memory they are fitted with and their support for optional features (such as raytracing). A Vulkan device can be created from any detected physical device.
Vulkan devices do not run in sync with the CPU. We control them by sending commands through their interfaces. Interfaces start processing commands in their order of arrival, functioning as queues for such commands. In fact, they are literally called device queues. A single device exposes multiple such queues, each of them supporting some types of tasks and running mostly independently from the others (barring explicit synchronization and contention for computational resources). For instance, it is common to have queues that are limited to data transfers. We could imagine using one such queue in parallel with two others dedicated to graphical operations: one would handle the real-time production of the frames of the graphical application, while the other would handle the asynchronous updates of assets that only need to be refreshed periodically (e.g., a minimap; as this resource would be shown on the frames rendered by the other queue, some synchronization would be required).
Device queues do not handle individual commands directly. We send commands in bulk through command buffers built using the command recording mechanism: there are special Vulkan functions for adding commands to command buffers. Special memory allocators (command pools) are used to reserve memory for the buffers, both in host memory and on the targeted device.
A Vulkan device exposes multiple memories with different characteristics. Some are standard device-local memories, while others are merely representations of synchronization mechanisms between the device and host memory (the memory may really live CPU-side while the device sees it as device-local: each memory access implicitly corresponds to a transfer). A device defines a set of memory heaps (the actual memories), and each heap has a set of memory types which describe the behavior that an allocation can have. Indeed, all memory has to be managed (allocated/freed) explicitly.
There are situations where synchronization is required. For instance, we may need to wait for a data transfer between the CPU and the device (CPU-GPU synchronization), or one device operation may need to wait for the results of another (GPU-GPU synchronization). For this purpose, Vulkan defines a number of synchronization primitives alongside related commands that can be inserted into command buffers. There are also functions for interacting with synchronization primitives from the host side.
B. A more detailed look
B.1. Instance
The Vulkan instance is the handle through which we access the Vulkan API. Its creation is very straightforward: see vkCreateInstance. The only thing worth noting is that instance extensions and validation layers are set up at this point.
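As a minimal sketch (the application name is arbitrary, and the extension and layer names below are just common examples), instance creation could look like this:

```c
#include <vulkan/vulkan.h>

// Minimal instance creation; later sketches in this chapter reuse `instance`.
VkApplicationInfo appInfo = {
    .sType = VK_STRUCTURE_TYPE_APPLICATION_INFO,
    .pApplicationName = "demo",
    .apiVersion = VK_API_VERSION_1_2,
};

const char *extensions[] = { "VK_KHR_surface" };
const char *layers[]     = { "VK_LAYER_KHRONOS_validation" };

VkInstanceCreateInfo createInfo = {
    .sType = VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO,
    .pApplicationInfo = &appInfo,
    .enabledLayerCount = 1,
    .ppEnabledLayerNames = layers,
    .enabledExtensionCount = 1,
    .ppEnabledExtensionNames = extensions,
};

VkInstance instance;
if (vkCreateInstance(&createInfo, NULL, &instance) != VK_SUCCESS) {
    /* handle the error */
}
```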
Extensions extend Vulkan with additional capabilities. vkEnumerateInstanceExtensionProperties lists the extensions supported by your Vulkan implementation. For instance, the most common instance extension is probably VK_KHR_surface. It enables rendering to the windows of the desktop environment.
Layers are key to debugging Vulkan programs. By default, Vulkan is tailored for efficiency and does not stop for checks. Layers can be activated to add such checks. They work by hijacking calls to actual Vulkan functions, quite literally wrapping them in a verification layer. Layers can detect issues such as incorrect function parameters, memory leaks and thread safety violations. vkEnumerateInstanceLayerProperties lists the layers available on your system. We disable them in release builds for performance reasons.
By default, layers write everything very verbosely to the standard output. Callbacks can be set up through vkCreateDebugUtilsMessengerEXT to change this so long as the VK_EXT_debug_utils instance extension has been activated. For instance, useless messages can be filtered out and the rest can be logged to a file.
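A sketch of such a callback, assuming VK_EXT_debug_utils was enabled on `instance` (the filtering choices are arbitrary):

```c
#include <stdio.h>

// Print only warnings and errors; VK_FALSE means "do not abort the call".
static VKAPI_ATTR VkBool32 VKAPI_CALL debugCallback(
    VkDebugUtilsMessageSeverityFlagBitsEXT severity,
    VkDebugUtilsMessageTypeFlagsEXT type,
    const VkDebugUtilsMessengerCallbackDataEXT *data,
    void *userData)
{
    fprintf(stderr, "[vulkan] %s\n", data->pMessage);
    return VK_FALSE;
}

VkDebugUtilsMessengerCreateInfoEXT messengerInfo = {
    .sType = VK_STRUCTURE_TYPE_DEBUG_UTILS_MESSENGER_CREATE_INFO_EXT,
    .messageSeverity = VK_DEBUG_UTILS_MESSAGE_SEVERITY_WARNING_BIT_EXT
                     | VK_DEBUG_UTILS_MESSAGE_SEVERITY_ERROR_BIT_EXT,
    .messageType = VK_DEBUG_UTILS_MESSAGE_TYPE_GENERAL_BIT_EXT
                 | VK_DEBUG_UTILS_MESSAGE_TYPE_VALIDATION_BIT_EXT
                 | VK_DEBUG_UTILS_MESSAGE_TYPE_PERFORMANCE_BIT_EXT,
    .pfnUserCallback = debugCallback,
};

// Extension functions are not exposed directly; fetch the entry point first.
PFN_vkCreateDebugUtilsMessengerEXT createMessenger =
    (PFN_vkCreateDebugUtilsMessengerEXT)
    vkGetInstanceProcAddr(instance, "vkCreateDebugUtilsMessengerEXT");
VkDebugUtilsMessengerEXT messenger;
createMessenger(instance, &messengerInfo, NULL, &messenger);
```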
B.2. Devices
vkEnumeratePhysicalDevices lists available Vulkan-compatible physical devices as lightweight handles of type VkPhysicalDevice. These handles cannot be used directly to control the devices. For this purpose, we would need to establish a deeper connection with the device through the creation of a VkDevice.
We only want to pay the price of creating a VkDevice for physical devices that match our needs. Vulkan offers various functions for getting information about devices. vkGetPhysicalDeviceProperties returns general information about a specific device, e.g., whether it is an actual GPU or a software emulation thereof, or the maximum image size it can manage. vkGetPhysicalDeviceFeatures returns information regarding the support of features such as 64-bit floats, texture compression or geometry shaders. vkGetPhysicalDeviceMemoryProperties returns two lists, one of memory heaps and another of memory types. Memory heaps are the actual memories (chiefly characterized by their size) whereas memory types describe how the heaps can be accessed. A memory type is always bound to exactly one memory heap. We discuss memory in more detail in the next chapter.
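A sketch of how these queries fit together (the selection criteria are arbitrary):

```c
// Enumerate the physical devices, then probe each one.
uint32_t count = 0;
vkEnumeratePhysicalDevices(instance, &count, NULL);
VkPhysicalDevice devices[8];
if (count > 8) count = 8; // keep the sketch simple: cap at 8 devices
vkEnumeratePhysicalDevices(instance, &count, devices);

for (uint32_t i = 0; i < count; i++) {
    VkPhysicalDeviceProperties props;
    vkGetPhysicalDeviceProperties(devices[i], &props);

    VkPhysicalDeviceFeatures features;
    vkGetPhysicalDeviceFeatures(devices[i], &features);

    VkPhysicalDeviceMemoryProperties memory;
    vkGetPhysicalDeviceMemoryProperties(devices[i], &memory);

    // Example criterion: a discrete GPU that supports geometry shaders.
    if (props.deviceType == VK_PHYSICAL_DEVICE_TYPE_DISCRETE_GPU
            && features.geometryShader) {
        /* candidate found */
    }
}
```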
Queues can be seen as ports for communicating with a VkDevice. Commands are sent to specific queues, with each queue supporting only some kinds of commands. Using different queues helps maximize parallelism insofar as it limits clogging. It is generally a good idea to use a specialized queue for each kind of task (e.g., one queue for graphics rendering and another one for resource transfers, as we do not want slowdowns in the rendering pipeline to impact transfers). For commands that rely on the same device capabilities, whether several queues are beneficial becomes more situational. Forcefully splitting a rendering task across queues induces a need for synchronization that usually negates any potential benefit of parallelism.
vkGetPhysicalDeviceQueueFamilyProperties returns information about the queues offered by a given VkPhysicalDevice. This information is organized using the notion of queue families, representing collections of queues sharing the same properties. For each queue family, this function returns how many members it contains alongside its capabilities (e.g., whether it supports graphical operations or video encoding).
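For instance, here is a sketch of how one might locate a queue family with graphics support:

```c
// Query the queue families, then look for one with graphics support.
uint32_t familyCount = 0;
vkGetPhysicalDeviceQueueFamilyProperties(physicalDevice, &familyCount, NULL);
VkQueueFamilyProperties families[16];
if (familyCount > 16) familyCount = 16; // sketch: cap at 16 families
vkGetPhysicalDeviceQueueFamilyProperties(physicalDevice, &familyCount, families);

uint32_t graphicsFamily = UINT32_MAX; // sentinel: no family found yet
for (uint32_t i = 0; i < familyCount; i++) {
    if (families[i].queueFlags & VK_QUEUE_GRAPHICS_BIT) {
        graphicsFamily = i;
        break;
    }
}
```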
vkCreateDevice actually creates the device. In particular, this function expects information regarding which queues will be active: we have to fill out one VkDeviceQueueCreateInfo per queue family in use. In this structure, we specify how many queues of the family to activate as well as their priorities (0.0f for lowest, 1.0f for highest). There is no way of requesting more queues once the device has been created!
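A sketch reusing the `graphicsFamily` index found above:

```c
// Activate two queues from the graphics family, with different priorities.
float priorities[] = { 1.0f, 0.5f };
VkDeviceQueueCreateInfo queueInfo = {
    .sType = VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO,
    .queueFamilyIndex = graphicsFamily,
    .queueCount = 2,
    .pQueuePriorities = priorities,
};

VkDeviceCreateInfo deviceInfo = {
    .sType = VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO,
    .queueCreateInfoCount = 1,
    .pQueueCreateInfos = &queueInfo,
};

VkDevice device;
vkCreateDevice(physicalDevice, &deviceInfo, NULL, &device);

// Retrieve a handle to the first of the two activated queues.
VkQueue queue;
vkGetDeviceQueue(device, graphicsFamily, 0, &queue);
```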
B.3. Commands
We send each command to a specific queue. This is done in four steps (a sketch combining all four follows their detailed descriptions below):
- We create a command pool, a special memory allocator reserved for command buffers
- We allocate command buffers
- We fill them with all sorts of commands
- We actually send the command buffers to the device
1. vkCreateCommandPool creates a memory pool for a whole family of queues. This step is trivial: we do not even need to specify how much memory we would like to set aside.
2. vkAllocateCommandBuffers allocates a bunch of command buffers in one fell swoop. For each kind of command buffer, we have to specify how many buffers of that kind we require and whether the buffers are primary or secondary, the difference being that primary buffers are directly submitted to queues whereas secondary ones are executed by primary command buffers and not submitted directly (you can view them as functions of sorts). vkFreeCommandBuffers undoes this allocation. We call this function when we are done using the command buffers.
3. To fill the command buffer, we sandwich the appropriate commands between calls to vkBeginCommandBuffer and vkEndCommandBuffer, which both take the targeted buffer as argument. Commands are functions of the form vkCmd* (in fact, they are not commands in and of themselves but functions that generate commands in the currently open command buffer). There are many commands to choose from, as can be seen by scrolling a bit on this page. We need to make sure that the queue we target supports our commands (as discussed in the previous section).
4. vkQueueSubmit tells a device to run the commands. It takes a queue and an array of command buffers to execute, plus some synchronization-related arguments — more on them in the next section.
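Here is a sketch of the four steps put together, assuming the `device`, `graphicsFamily` and `queue` from the previous section (the recorded commands are left as a placeholder):

```c
// Step 1: create a command pool for the graphics queue family.
VkCommandPoolCreateInfo poolInfo = {
    .sType = VK_STRUCTURE_TYPE_COMMAND_POOL_CREATE_INFO,
    .flags = VK_COMMAND_POOL_CREATE_RESET_COMMAND_BUFFER_BIT,
    .queueFamilyIndex = graphicsFamily,
};
VkCommandPool pool;
vkCreateCommandPool(device, &poolInfo, NULL, &pool);

// Step 2: allocate a single primary command buffer from the pool.
VkCommandBufferAllocateInfo allocInfo = {
    .sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_ALLOCATE_INFO,
    .commandPool = pool,
    .level = VK_COMMAND_BUFFER_LEVEL_PRIMARY,
    .commandBufferCount = 1,
};
VkCommandBuffer cmd;
vkAllocateCommandBuffers(device, &allocInfo, &cmd);

// Step 3: record commands between begin/end.
VkCommandBufferBeginInfo beginInfo = {
    .sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO,
};
vkBeginCommandBuffer(cmd, &beginInfo);
// ... vkCmd* calls go here ...
vkEndCommandBuffer(cmd);

// Step 4: submit the buffer to the queue (no synchronization yet).
VkSubmitInfo submitInfo = {
    .sType = VK_STRUCTURE_TYPE_SUBMIT_INFO,
    .commandBufferCount = 1,
    .pCommandBuffers = &cmd,
};
vkQueueSubmit(queue, 1, &submitInfo, VK_NULL_HANDLE);
```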
vkResetCommandBuffer resets a command buffer, freeing it for reuse and sparing us the allocation of a new one (not a cheap operation, apparently; I would guess that reuse limits fragmentation, but do not quote me on that). This function is only available if the flag VK_COMMAND_POOL_CREATE_RESET_COMMAND_BUFFER_BIT was passed to the command pool during its creation. Alternatively, vkResetCommandPool resets entire command pools, freeing all the command buffers they hold. We have to ensure that no command buffer is in flight at this point.
B.4. Synchronization
When discussing vkQueueSubmit, we postponed the description of synchronization. GPUs are massively parallel devices, so it is only natural that we are faced with questions of synchronization at some point.
Before starting, a quick caveat: the order of submission to queues only fixes the order in which the device starts handling the commands. Past this point, anything goes. It is common for later commands to complete before earlier ones. We manage synchronization entirely manually, supported by the Vulkan constructs introduced below.
B.4.a. Fences
VkFences are used for GPU-CPU synchronization. The GPU signals a fence to indicate that a task is over; the CPU can only observe its state. For instance, a fence can be passed to vkQueueSubmit to be signaled once all the submitted commands have completed. We read the state of a fence through vkGetFenceStatus or vkWaitForFences. Fences can be recycled through vkResetFences.
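A sketch, reusing `queue` and `submitInfo` from the previous section:

```c
// Create an (unsignaled) fence.
VkFenceCreateInfo fenceInfo = { .sType = VK_STRUCTURE_TYPE_FENCE_CREATE_INFO };
VkFence fence;
vkCreateFence(device, &fenceInfo, NULL, &fence);

// The GPU will signal the fence once the submitted commands complete.
vkQueueSubmit(queue, 1, &submitInfo, fence);

// Block on the CPU until the fence is signaled, then recycle it.
vkWaitForFences(device, 1, &fence, VK_TRUE, UINT64_MAX);
vkResetFences(device, 1, &fence);
```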
B.4.b. (Binary) semaphores
To understand semaphores, a detour through the notion of Vulkan pipelines is helpful. Vulkan tasks are split into different steps. For instance, when rendering, Vulkan starts by computing the on-screen mapping of vertices before turning to pixel colors (there are many more such steps that we will introduce in due time). We call these steps "stages" and the sequences of such stages "pipelines". Different categories of tasks go through different pipelines (we do not need to compute pixel colors when transferring resources). There exist only a handful of pipelines and all stages are defined by Vulkan. For instance, the stage where we compute pixel colors is called VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT, for reasons that will become clear later.
vkQueueSubmit relies on VkSemaphores, a GPU-GPU synchronization construct. Every submitted command buffer comes with a set of wait semaphores and a set of signal semaphores. Before the commands reach the stages specified by pWaitDstStageMask, all wait semaphores need to be signaled. The signal semaphores are signaled once all commands complete. Note that a command buffer submitted to some queue may depend on a semaphore signaled on another queue of the same device.
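A sketch of a submission that waits on one semaphore and signals another (`transferDone` and `renderDone` are hypothetical names for semaphores created beforehand with vkCreateSemaphore):

```c
// Hold back the commands at the fragment shader stage until
// `transferDone` is signaled; signal `renderDone` upon completion.
VkPipelineStageFlags waitStage = VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT;
VkSubmitInfo submitInfo = {
    .sType = VK_STRUCTURE_TYPE_SUBMIT_INFO,
    .waitSemaphoreCount = 1,
    .pWaitSemaphores = &transferDone,
    .pWaitDstStageMask = &waitStage,
    .commandBufferCount = 1,
    .pCommandBuffers = &cmd,
    .signalSemaphoreCount = 1,
    .pSignalSemaphores = &renderDone,
};
vkQueueSubmit(queue, 1, &submitInfo, VK_NULL_HANDLE);
```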
B.4.c. Timeline semaphores
In recent versions of Vulkan, fences and semaphores can be replaced by a single construct called a timeline semaphore, an extension of classical semaphores. We will not discuss these in more detail, as we can still use the good old fences/semaphores.
B.4.d. Barriers
The synchronization primitives we met up to this point were coarse-grained: we could signal semaphores only when whole command buffers were done executing. What if we are dealing with a large single task with lots of tightly integrated subtasks? Surely we could benefit from finer-grained controls! This is precisely the role of vkCmdPipelineBarrier, a synchronization construct for ordering commands within the same queue. Unlike semaphores and fences, barriers are not Vulkan objects but special commands inserted directly into command buffers. They come in two flavors, execution and memory barriers.
Execution barriers say "before executing any stage X, Y or Z command that comes after the barrier, ensure that all stage A, B and C commands that came before it are done executing". Remember that although commands are submitted linearly to a queue and start executing in their submission order, everything effectively runs in parallel and may complete in any order. Execution barriers help with putting some order to this madness. To set an execution barrier, simply provide arguments for srcStageMask and dstStageMask in vkCmdPipelineBarrier.
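For example, a sketch of an execution barrier ensuring all earlier transfers complete before later compute work starts:

```c
// "Before executing any compute shader command recorded after this point,
//  ensure that all transfer commands recorded before it are done."
vkCmdPipelineBarrier(cmd,
    VK_PIPELINE_STAGE_TRANSFER_BIT,        // srcStageMask
    VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,  // dstStageMask
    0,            // dependency flags
    0, NULL,      // no global memory barriers
    0, NULL,      // no buffer memory barriers
    0, NULL);     // no image memory barriers
```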
Memory barriers have to do with resource synchronization. GPUs have a central memory and local caches. These caches can go out of sync, and synchronization is manually enforced through memory barriers. These barriers allow us to control when data is pushed to the main memory from local caches and when it is pulled in the other direction. Building a memory barrier starts like building an execution barrier: we set srcStageMask and dstStageMask in vkCmdPipelineBarrier. The barrier will wait for commands matching srcStageMask that came before it to complete before unblocking commands matching dstStageMask that come after it. To turn an execution barrier into a memory barrier, we insert objects of type VkMemoryBarrier in the barrier's arguments. Data modified by writes in commands matching the combination of srcStageMask and srcAccessMask is pushed to main memory after being written to the local cache, and data read in commands matching the combination of dstStageMask and dstAccessMask is pulled from main memory before the read occurs. Access masks are built out of VkAccessFlagBits such as VK_ACCESS_SHADER_WRITE_BIT.
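The previous sketch extended into a memory barrier, making transfer writes visible to subsequent compute shader reads:

```c
// Flush transfer writes to main memory, and make sure later compute
// shader reads pull fresh data rather than stale cached values.
VkMemoryBarrier barrier = {
    .sType = VK_STRUCTURE_TYPE_MEMORY_BARRIER,
    .srcAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT, // writes to push
    .dstAccessMask = VK_ACCESS_SHADER_READ_BIT,    // reads to serve
};
vkCmdPipelineBarrier(cmd,
    VK_PIPELINE_STAGE_TRANSFER_BIT,        // srcStageMask
    VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,  // dstStageMask
    0,
    1, &barrier,  // one global memory barrier
    0, NULL,
    0, NULL);
```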
Note that memory barriers are simply an extension of execution barriers and that even execution barriers have to do with memory synchronization. Indeed, all barriers are used to avoid parallelism issues, which always boil down to issues with memory syncing. Memory barriers are more explicit about this and finer-grained (thanks to access masks).
There are specialized memory barriers for special kinds of resources — we will discuss those in more detail in the next chapter.
Also note that the queue is not aware of the boundaries between command buffers: it only sees a stream of commands. Barriers apply to all commands submitted to the queue, not just to those of the buffer they are a part of.
I recommend checking this post by Maister for more fleshed out examples, tricks and caveats.
B.4.e. Events
VkEvents are yet another synchronization primitive, used to insert fine-grained dependencies between the CPU and the GPU or within the same queue. What is interesting about them is that they can express patterns like (CMD A, SET_EVENT x, CMD B, WAIT_EVENT x, CMD C): here, commands A and B may execute in parallel, and only CMD C will effectively wait for event x. Events are less useful than the other primitives. Knowing that they exist does not hurt but do not think about them too much.
Events can be set from the CPU (vkSetEvent) or from the GPU (vkCmdSetEvent). Only GPUs can wait on events (vkCmdWaitEvents) although the status of an event can be checked from the CPU using vkGetEventStatus. Events can also be reset from the CPU (vkResetEvent) or the GPU (vkCmdResetEvent).
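A sketch of the pattern above, with compute dispatches standing in for CMD A/B/C and assuming `event` was created beforehand with vkCreateEvent:

```c
vkCmdDispatch(cmd, 64, 1, 1);  // CMD A
vkCmdSetEvent(cmd, event, VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT);
vkCmdDispatch(cmd, 64, 1, 1);  // CMD B: free to overlap with CMD A
vkCmdWaitEvents(cmd, 1, &event,
    VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,  // stages that set the event
    VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,  // stages held back by the wait
    0, NULL, 0, NULL, 0, NULL);            // optional memory barriers
vkCmdDispatch(cmd, 64, 1, 1);  // CMD C: effectively waits for CMD A only
```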
B.5. Cleanup
We must destroy the objects we created once we are done with them. We do this through functions of the form vkDestroyXxx (e.g., vkDestroyCommandPool). Similarly, we must free objects allocated through functions of the form vkAllocateXxx. We do this through functions of the form vkFreeXxx.
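As a sketch, tearing down the objects created in this chapter could look like this (waiting for the device to go idle first, so that nothing is still in flight):

```c
vkDeviceWaitIdle(device); // make sure no command buffer is still executing

vkFreeCommandBuffers(device, pool, 1, &cmd);
vkDestroyCommandPool(device, pool, NULL);
vkDestroyFence(device, fence, NULL);
vkDestroyDevice(device, NULL);
vkDestroyInstance(instance, NULL);
```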