Chapter 1: the foundational concepts
In this chapter, we see how the Vulkan API is structured. We stay generic and do not yet focus on specific workloads such as graphics rendering or general-purpose compute. Like every chapter from here on, this one is redundant by design: the first section is a high-level overview of Vulkan's structure; the second one covers the same ground in more detail.
A. A high-level overview
All interaction with the Vulkan API goes through a handle known as a Vulkan instance. When creating an instance, we specify which Vulkan extensions we require and what degree of debugging information we want (through a mechanism known as layers).
The instance can give us information about all the Vulkan-compatible physical devices detected on the system. Physical devices can be probed for information, e.g., about the memory they are fitted with and their support for optional features such as ray tracing. A Vulkan device (a handle used for interacting with a physical device) can be created from any physical device.
Vulkan devices do not run in sync with the CPU. Our only way of controlling them is by sending commands through their interfaces. Interfaces start processing commands in their order of arrival, functioning as queues for such commands. In fact, they are literally called device queues. A single device exposes multiple such queues, each of them supporting certain types of tasks and running mostly independently from the others (barring explicit synchronization and contention for computational resources).
Device queues do not handle individual commands directly. We send commands in bulk through command buffers built using the command recording mechanism: there are special Vulkan functions for adding commands to command buffers. Special memory allocators (command pools) are used to reserve memory for the buffers, both in host memory and on the targeted device.
For instance, if we wanted to build a renderer for a game with a main view and a minimap, we could use a queue limited to data transfers in parallel with two other queues supporting graphical operations:
- The transfer queue would be used to load the game's assets
- One of the graphical queues would be used for rendering the minimap; we do not have to refresh it on every frame, as every 10 seconds should be enough
- The other graphical queue would be used for rendering the main view, with the minimap produced by the first graphical queue displayed in a corner
GPUs are highly parallel devices, which can lead to all kinds of trouble with the scenario described above. What if the code rendering the main view accesses the data of the minimap while the minimap is being updated? As you can imagine, we have to make careful use of synchronization commands. These come in two different flavors:
- GPU-CPU: useful for, e.g., notifying the CPU of the end of a rendering task
- GPU-GPU: useful for, e.g., ensuring that the minimap is never displayed while it is being refreshed in our example above
For this purpose, Vulkan defines a bunch of synchronization primitives used through special commands that can be inserted in command buffers. There are also plain functions (not commands) for controlling some of these primitives from the CPU.
B. A deeper dive
B.1. Instance
A Vulkan instance is a handle that allows us to interact with the Vulkan API. Its creation is very straightforward: see vkCreateInstance. The most noteworthy aspect is that we configure two things at this point: instance extensions and validation layers.
Instance extensions extend Vulkan with additional capabilities. vkEnumerateInstanceExtensionProperties lists extensions supported by your Vulkan version. The most common instance extension is probably VK_KHR_surface, which enables rendering on the desktop environment's windows.
Validation layers are key to debugging Vulkan programs. Vulkan is tailored for efficiency and does not stop for any checks by default. Layers can be activated to check for issues such as incorrect function parameters, memory leaks and thread safety violations. They work by hijacking calls to actual Vulkan functions, quite literally wrapping them in a verification layer. vkEnumerateInstanceLayerProperties lists layers supported by your Vulkan version. We disable all layers in release builds for performance reasons.
By default, layers write everything very verbosely to the standard output. The VK_EXT_debug_utils instance extension offers the vkCreateDebugUtilsMessengerEXT function, which allows us to set custom callbacks for layers instead. For instance, useless messages can be filtered out and the rest can be logged to a file.
vkCreateInstance also takes (through its VkInstanceCreateInfo argument) a VkApplicationInfo struct, which stores more information about the application you are developing: title, version, … Everything is rather self-explanatory.
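Putting all of this together, here is a minimal sketch of instance creation. The application name and the exact extension/layer selection are placeholders; real code would first check their availability with vkEnumerateInstanceExtensionProperties and vkEnumerateInstanceLayerProperties.

```c
#include <vulkan/vulkan.h>

VkApplicationInfo appInfo = {0};
appInfo.sType = VK_STRUCTURE_TYPE_APPLICATION_INFO;
appInfo.pApplicationName = "minimap-demo";          /* placeholder title */
appInfo.apiVersion = VK_API_VERSION_1_0;

/* Hypothetical selection: windowing support + the standard validation layer */
const char *extensions[] = { VK_KHR_SURFACE_EXTENSION_NAME };
const char *layers[]     = { "VK_LAYER_KHRONOS_validation" };

VkInstanceCreateInfo createInfo = {0};
createInfo.sType = VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO;
createInfo.pApplicationInfo = &appInfo;
createInfo.enabledExtensionCount = 1;
createInfo.ppEnabledExtensionNames = extensions;
createInfo.enabledLayerCount = 1;                   /* set to 0 in release builds */
createInfo.ppEnabledLayerNames = layers;

VkInstance instance;
if (vkCreateInstance(&createInfo, NULL, &instance) != VK_SUCCESS) {
    /* handle the error */
}
```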
B.2. Devices
vkEnumeratePhysicalDevices lists all detected Vulkan-compatible physical devices as lightweight handles of type VkPhysicalDevice. These handles cannot be used directly to control the devices. For this purpose, we need to establish a deeper connection with the device through the creation of a VkDevice.
We only want to pay the price of creating a VkDevice for physical devices that match our needs. Vulkan offers several functions for getting information about devices (a selection sketch follows this list):
- vkGetPhysicalDeviceProperties returns general information about a specific device, e.g., whether it is an actual GPU or a software emulation of one, or what image sizes it supports.
- vkGetPhysicalDeviceFeatures returns information regarding the support of specific features such as 64-bit floats, texture compression or geometry shaders.
- vkGetPhysicalDeviceMemoryProperties returns two lists, one of memory heaps and another of memory types. Memory heaps are the actual memories (chiefly characterized by their size) whereas memory types describe how the heaps can be accessed. A memory type is always bound to exactly one memory heap. We discuss memory in more depth in the next chapter.
- vkGetPhysicalDeviceQueueFamilyProperties returns information about the queues offered by a given VkPhysicalDevice. Remember that we interact with devices through these queues, which can be seen as ports of sorts. Information about them is organized using the notion of queue families, representing collections of queues sharing the same properties. For each queue family, this function returns how many members it contains alongside its capabilities (e.g., whether it supports graphical operations or video encoding).
- vkEnumerateDeviceExtensionProperties returns information about supported device-level extensions. These extensions can unlock features that are not part of the base Vulkan specification, such as ray tracing.
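As an illustration, here is how device selection might look. This sketch simply picks the first physical device exposing a graphics-capable queue family, with fixed-size arrays to keep the example short.

```c
/* Pick the first physical device with a graphics-capable queue family */
uint32_t deviceCount = 16;
VkPhysicalDevice devices[16];
vkEnumeratePhysicalDevices(instance, &deviceCount, devices);

VkPhysicalDevice chosen = VK_NULL_HANDLE;
uint32_t graphicsFamily = 0;
for (uint32_t i = 0; i < deviceCount && chosen == VK_NULL_HANDLE; ++i) {
    uint32_t familyCount = 16;
    VkQueueFamilyProperties families[16];
    vkGetPhysicalDeviceQueueFamilyProperties(devices[i], &familyCount, families);
    for (uint32_t f = 0; f < familyCount; ++f) {
        if (families[f].queueFlags & VK_QUEUE_GRAPHICS_BIT) {
            chosen = devices[i];
            graphicsFamily = f;   /* index of the matching queue family */
            break;
        }
    }
}
```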
When creating a device, we have to declare the queues that we will use upfront; the other queues will be disabled. Remember that commands are sent to specific queues, with each queue supporting only some kinds of commands. Using different queues helps maximize parallelism insofar as it limits clogging. It is generally a good idea to use a dedicated queue for each kind of task (e.g., one queue for graphics rendering and another one for resource transfers, as we do not want slowdowns in the rendering pipeline to impact transfers). For commands that rely on the same device capabilities, whether several queues are beneficial is more situational: forcefully splitting a rendering task across queues induces a need for synchronization that usually negates any potential benefit of parallelism.
vkCreateDevice actually creates the device. We fill out one VkDeviceQueueCreateInfo per used queue family. In this structure, we specify how many queues of the family to activate as well as their priority (0.0f for lowest, 1.0f for highest). Priority is only a hint for the driver; it may or may not have an effect. If you do not care about it, you can just set all the priorities to the same value, say 0.0f. Note that there is no way of activating additional queues once the device has been created!
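Continuing the selection sketch above, creating a device with a single graphics queue could look like this (chosen and graphicsFamily come from the previous sketch):

```c
float priority = 0.0f;                       /* same priority everywhere; only a hint */
VkDeviceQueueCreateInfo queueInfo = {0};
queueInfo.sType = VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO;
queueInfo.queueFamilyIndex = graphicsFamily;
queueInfo.queueCount = 1;                    /* no way to activate more later! */
queueInfo.pQueuePriorities = &priority;

VkDeviceCreateInfo deviceInfo = {0};
deviceInfo.sType = VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO;
deviceInfo.queueCreateInfoCount = 1;
deviceInfo.pQueueCreateInfos = &queueInfo;

VkDevice device;
vkCreateDevice(chosen, &deviceInfo, NULL, &device);

/* Retrieve a handle to queue 0 of the activated family */
VkQueue graphicsQueue;
vkGetDeviceQueue(device, graphicsFamily, 0, &graphicsQueue);
```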
B.3. Commands
We control devices by sending commands to specific queues. This is done in four steps:
- We create a command pool, a special memory allocator reserved for command buffers
- We allocate command buffers
- We fill them with all sorts of commands
- We actually send the command buffers to the device
1. vkCreateCommandPool creates a memory pool for a whole family of queues. This step is trivial: we do not even need to specify what amount of memory we would like to set aside. A single command pool can serve many command buffer allocations.
2. vkAllocateCommandBuffers allocates a bunch of command buffers in one fell swoop. For each kind of command buffer, we specify how many buffers of that kind we require and whether the buffers are primary or secondary, the difference being that primary buffers are directly submitted to queues whereas secondary ones are executed by primary command buffers and not submitted directly (you can view them as functions of sorts). Again, there is no need to specify a size or such at this stage. vkFreeCommandBuffers undoes this allocation; we call it when we are done using the buffers.
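A sketch of steps 1 and 2, assuming the device and graphicsFamily from the previous section; the reset flag anticipates the vkResetCommandBuffer discussion below:

```c
VkCommandPoolCreateInfo poolInfo = {0};
poolInfo.sType = VK_STRUCTURE_TYPE_COMMAND_POOL_CREATE_INFO;
poolInfo.queueFamilyIndex = graphicsFamily;
poolInfo.flags = VK_COMMAND_POOL_CREATE_RESET_COMMAND_BUFFER_BIT; /* allow per-buffer resets */

VkCommandPool pool;
vkCreateCommandPool(device, &poolInfo, NULL, &pool);

VkCommandBufferAllocateInfo allocInfo = {0};
allocInfo.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_ALLOCATE_INFO;
allocInfo.commandPool = pool;
allocInfo.level = VK_COMMAND_BUFFER_LEVEL_PRIMARY; /* directly submittable */
allocInfo.commandBufferCount = 1;

VkCommandBuffer cmd;
vkAllocateCommandBuffers(device, &allocInfo, &cmd);
```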
3. To fill the command buffer, we sandwich the appropriate commands between calls to vkBeginCommandBuffer and vkEndCommandBuffer, which both take the targeted buffer as argument. Commands are functions of the form vkCmd* (in fact, they are not commands in and of themselves but functions that generate commands in the currently open command buffer). There are many commands to choose from, as can be seen by scrolling a bit on this page. We need to make sure that the queue we intend to target supports our commands (remember that commands used for rendering are only available on graphics queues, for instance). vkBeginCommandBuffer also takes some arguments. There is VkCommandBufferInheritanceInfo, which describes the state that secondary command buffers inherit from the primary buffers they are called from (I will not go into much more detail, as secondary command buffers are not really at the core of Vulkan). Then there is VkCommandBufferUsageFlagBits, which defines a bunch of flags. The most important one, VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT, indicates that a command buffer is used only once (it is optional, but it enables optimizations). The other two flags are very niche: VK_COMMAND_BUFFER_USAGE_SIMULTANEOUS_USE_BIT is for command buffers that can be resubmitted before they are done executing (this may reduce performance and seems to be generally frowned upon; also, synchronization between different instances of the buffer is obviously required and has to be handled manually), and VK_COMMAND_BUFFER_USAGE_RENDER_PASS_CONTINUE_BIT is for secondary command buffers that are submitted as part of render passes (a topic that we will discuss in the graphics chapter).
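For instance, recording a single transfer command might look as follows; srcBuffer, dstBuffer and size are hypothetical, assumed to have been created beforehand:

```c
VkCommandBufferBeginInfo beginInfo = {0};
beginInfo.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO;
beginInfo.flags = VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT; /* used once, then reset */

vkBeginCommandBuffer(cmd, &beginInfo);

VkBufferCopy region = {0};
region.size = size;                     /* bytes to copy; offsets left at 0 */
vkCmdCopyBuffer(cmd, srcBuffer, dstBuffer, 1, &region);

vkEndCommandBuffer(cmd);
```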
4. vkQueueSubmit tells a device to run command buffers. It takes a queue and an array of command buffers to execute, plus some synchronization-related arguments — more on them in the next section. The commands in the buffers should be supported by the targeted queue (mind its type!).
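Submission itself is short; the synchronization-related fields of VkSubmitInfo are left empty here and revisited below:

```c
VkSubmitInfo submitInfo = {0};
submitInfo.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
submitInfo.commandBufferCount = 1;
submitInfo.pCommandBuffers = &cmd;
/* wait/signal semaphores omitted for now; see the synchronization section */
vkQueueSubmit(graphicsQueue, 1, &submitInfo, VK_NULL_HANDLE);
```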
vkResetCommandBuffer resets a command buffer, freeing it for reuse and sparing us the allocation of a new one (allocation is not a cheap operation, apparently; I guess that reuse limits fragmentation but do not quote me on that). This function is only available if the flag VK_COMMAND_POOL_CREATE_RESET_COMMAND_BUFFER_BIT was passed to the command pool during its creation. Alternatively, vkResetCommandPool resets entire command pools, freeing all the command buffers they hold. We have to ensure that no command buffer is in flight at this point.
B.4. Synchronization
When discussing vkQueueSubmit, we postponed the description of synchronization. GPUs are massively parallel devices, so it is only natural that we are faced with questions of synchronization at some point.
Before starting, a quick caveat: the order of submission to queues only fixes the order in which the device starts handling the commands, i.e., the order in which it takes note of their existence. For the order in which they are actually executed, anything goes. It is common for later commands to complete before earlier ones. Our only tool for getting some order into this madness is synchronization. We manage synchronization entirely manually, with the help of the constructs introduced below.
B.4.a. Fences
VkFences are used for GPU-CPU synchronization. Only the GPU can signal a fence, which it does when it is done with a task; the CPU can read the status of such objects through vkGetFenceStatus or vkWaitForFences. For instance, fences are used in vkQueueSubmit to indicate that all the submitted commands have been completed. Fences can be recycled through vkResetFences.
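A typical round-trip, reusing the submitInfo from the previous section:

```c
VkFenceCreateInfo fenceInfo = {0};
fenceInfo.sType = VK_STRUCTURE_TYPE_FENCE_CREATE_INFO;

VkFence fence;
vkCreateFence(device, &fenceInfo, NULL, &fence);

/* The fence is signaled once every command buffer in this submission completes */
vkQueueSubmit(graphicsQueue, 1, &submitInfo, fence);

/* Block the CPU until the fence is signaled (timeout in nanoseconds, from <stdint.h>) */
vkWaitForFences(device, 1, &fence, VK_TRUE, UINT64_MAX);
vkResetFences(device, 1, &fence);      /* recycle for the next submission */
```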
B.4.b. (Binary) semaphores
To understand semaphores, a detour through the notion of Vulkan pipelines is helpful. Vulkan tasks are split into different steps. For instance, when rendering, Vulkan starts by projecting the geometry of the vertices that make up the scene from world-space to screen-space. Computing the colors of the screen's pixels occurs much later, after a bunch of similar intermediate steps. We call such steps "stages", and the sequence of all stages makes up a pipeline. Different categories of tasks go through different pipelines (we do not need to compute pixel colors when using GPUs as general-purpose computing devices, for instance). There exist only a handful of pipelines, and all pipelines and their stages are defined by Vulkan. For instance, the stage where we compute pixel colors is called VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT, for reasons that will become clear later.
vkQueueSubmit relies on VkSemaphores, a GPU-GPU synchronization construct. Every submission comes with a set of wait semaphores and another set of signal semaphores. Before any commands matching the nth entry of pWaitDstStageMask can execute, the nth wait semaphore needs to be signaled; commands in earlier stages may run right away. As for the signal semaphores, they are signaled once all submitted commands complete. Note that a command buffer submitted to some queue may depend on a semaphore signaled on another queue of the same device.
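Returning to the minimap example, here is a sketch under the assumption that minimapCmd, mainViewCmd and the two queues already exist; the main view waits at the fragment-shader stage until the minimap is ready:

```c
/* Created once with vkCreateSemaphore (not shown) */
VkSemaphore minimapReady;

/* First submission: render the minimap, signal the semaphore when done */
VkSubmitInfo minimapSubmit = {0};
minimapSubmit.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
minimapSubmit.commandBufferCount = 1;
minimapSubmit.pCommandBuffers = &minimapCmd;
minimapSubmit.signalSemaphoreCount = 1;
minimapSubmit.pSignalSemaphores = &minimapReady;
vkQueueSubmit(minimapQueue, 1, &minimapSubmit, VK_NULL_HANDLE);

/* Second submission: the main view waits on the semaphore, but only
   at the fragment-shader stage; earlier stages may run right away */
VkPipelineStageFlags waitStage = VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT;
VkSubmitInfo mainSubmit = {0};
mainSubmit.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
mainSubmit.commandBufferCount = 1;
mainSubmit.pCommandBuffers = &mainViewCmd;
mainSubmit.waitSemaphoreCount = 1;
mainSubmit.pWaitSemaphores = &minimapReady;
mainSubmit.pWaitDstStageMask = &waitStage;
vkQueueSubmit(mainViewQueue, 1, &mainSubmit, VK_NULL_HANDLE);
```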
B.4.c. Timeline semaphores
In recent versions of Vulkan, fences and semaphores can be replaced by a single construct called a timeline semaphore, an extension of classical semaphores bringing some advantages in specific situations. We do not discuss these in more detail here, as the good old fences/semaphores work well enough for most purposes.
B.4.d. Barriers
The synchronization primitives we have met up to this point are coarse-grained: we could signal semaphores only when whole command buffers were done executing. What if we are dealing with a single large task with lots of tightly integrated subtasks? Surely we could benefit from finer-grained controls! This is precisely the role of pipeline barriers (inserted through vkCmdPipelineBarrier), synchronization constructs for ordering commands within the same queue. Unlike semaphores and fences, barriers are not Vulkan objects but special commands inserted directly into command buffers. They come in two flavors: execution barriers and memory barriers.
Execution barriers say "before executing any stage X, Y or Z command that comes after the barrier command itself, ensure that all stage A, B and C commands that came before it are done executing". Remember that although commands are submitted linearly to a queue and start executing in their submission order, everything effectively runs in parallel and may complete in any order. Execution barriers help with putting some order to this madness. We configure execution barriers through the srcStageMask and dstStageMask arguments of vkCmdPipelineBarrier. This function offers more arguments, but we just give them default values: both execution barriers and memory barriers are set up using the same function, and many of the arguments are only relevant for the latter.
Memory barriers have to do with resource synchronization. GPUs have a central memory and local caches. These caches can go out of sync, and synchronization is manually enforced through memory barriers. These barriers allow us to control when data is pushed from local caches to the main memory and when it is pulled in from it. Building a memory barrier starts like building an execution barrier: we set srcStageMask and dstStageMask in vkCmdPipelineBarrier. The barrier will wait for commands matching srcStageMask that come before it to complete before unblocking commands matching dstStageMask that come after it. To turn an execution barrier into a memory barrier, we insert objects of type VkMemoryBarrier in the function's arguments. Data modified by previous commands matching the combination of srcStageMask and srcAccessMask is pushed to main memory after being written to the local cache, and data read by subsequent commands matching the combination of dstStageMask and dstAccessMask is pulled from main memory before the reads occur. Access masks are built out of VkAccessFlagBits such as VK_ACCESS_SHADER_WRITE_BIT.
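A sketch of a common pattern: making writes from a compute stage visible to fragment-shader reads within the same queue (cmd is an open command buffer):

```c
VkMemoryBarrier barrier = {0};
barrier.sType = VK_STRUCTURE_TYPE_MEMORY_BARRIER;
barrier.srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT; /* push earlier shader writes to main memory */
barrier.dstAccessMask = VK_ACCESS_SHADER_READ_BIT;  /* pull fresh data before later reads */

vkCmdPipelineBarrier(cmd,
    VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,   /* srcStageMask: commands to wait on */
    VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT,  /* dstStageMask: commands to block */
    0,                                      /* dependencyFlags */
    1, &barrier,                            /* memory barriers */
    0, NULL,                                /* buffer memory barriers (next chapter) */
    0, NULL);                               /* image memory barriers (next chapter) */
```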
Note that memory barriers are simply an extension of execution barriers, and that even execution barriers are a memory synchronization tool. Indeed, all barriers are used to avoid parallelism issues, which always boil down to issues with memory syncing. Memory barriers are more explicit about this and finer-grained (thanks to access masks).
There are specialized memory barriers for special kinds of resources — we will discuss those in more detail in the next chapter. Also, note that several barriers may be submitted in a single vkCmdPipelineBarrier call, so long as they share the same srcStageMask and dstStageMask (although you can't submit both an execution barrier and memory barriers at the same time, since memory barriers are refinements of execution barriers).
The final argument is dependencyFlags. The only flag defined in Vulkan 1.0 is VK_DEPENDENCY_BY_REGION_BIT, which is used in conjunction with another, graphics-specific feature called tiling for even finer-grained dependencies. Tiling is basically about splitting images into small tiles and doing the rendering on each tile individually (more details in the graphics chapter, no need to sweat it out if this is not clear yet). A tile does not depend on previous commands having completed for the entire image: it only depends on the effects of previous commands on the small portion of the overall image it is concerned with. VK_DEPENDENCY_BY_REGION_BIT enables tracking dependencies on a per-tile basis.
Also note that the queue is not aware of the boundaries between command buffers: it only sees a stream of commands. Barriers apply to all commands submitted to the queue, not just to those of the buffer they are a part of.
I recommend checking this post by Maister for more fleshed out examples, tricks and caveats (or this one by Jeremy Ong).
B.4.e. Events
VkEvents are yet another synchronization primitive, used to insert fine-grained dependencies between the CPU and the GPU or within the same queue. What is interesting about them is that they can do things like (CMD A, SET_EVENT x, CMD B, WAIT_EVENT x, CMD C): here, commands A and B may execute in parallel; only CMD C will effectively wait for event x. Events are less useful than the other primitives. Knowing that they exist does not hurt, but do not think about them too much.
Events can be set from the CPU (vkSetEvent) or from the GPU (vkCmdSetEvent). Only GPUs can wait on events (vkCmdWaitEvents) although the status of an event can be checked from the CPU using vkGetEventStatus. Events can also be reset from the CPU (vkResetEvent) or the GPU (vkCmdResetEvent).
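The (CMD A, SET_EVENT x, CMD B, WAIT_EVENT x, CMD C) pattern above could be recorded like this, using hypothetical compute dispatches as the commands:

```c
VkEventCreateInfo eventInfo = {0};
eventInfo.sType = VK_STRUCTURE_TYPE_EVENT_CREATE_INFO;
VkEvent event;
vkCreateEvent(device, &eventInfo, NULL, &event);

vkCmdDispatch(cmd, 64, 1, 1);                      /* CMD A */
vkCmdSetEvent(cmd, event, VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT);
vkCmdDispatch(cmd, 64, 1, 1);                      /* CMD B: free to overlap with A */
vkCmdWaitEvents(cmd, 1, &event,
    VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,          /* srcStageMask */
    VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,          /* dstStageMask */
    0, NULL, 0, NULL, 0, NULL);                    /* no memory barriers */
vkCmdDispatch(cmd, 64, 1, 1);                      /* CMD C: waits for the event */
```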
B.5. Cleaning up
We must destroy the objects we created once we are done with them. We do this through functions of the form vkDestroyXxx (e.g., vkDestroyCommandPool). Similarly, we must free objects allocated through functions of the form vkAllocateXxx using functions of the form vkFreeXxx.
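For the objects created in this chapter, cleanup might look as follows; note that children are destroyed before their parents, and that we wait for the device to go idle first:

```c
vkDeviceWaitIdle(device);                     /* make sure nothing is in flight */

vkDestroyFence(device, fence, NULL);
vkFreeCommandBuffers(device, pool, 1, &cmd);  /* allocated, hence freed */
vkDestroyCommandPool(device, pool, NULL);
vkDestroyDevice(device, NULL);
vkDestroyInstance(instance, NULL);            /* the instance goes last */
```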