Chapter 1: The basics

In this chapter, we discuss the overall structure of the Vulkan API. We do not yet focus on specific workloads such as graphics rendering or computing.

A. A high-level overview

Any interaction with the Vulkan API happens through a handle known as a Vulkan instance. When creating an instance, we specify which Vulkan extensions and what degree of debugging information we require (through a mechanism known as layers).

We can ask the instance for information about all the Vulkan-compatible physical devices (which mostly means GPUs) detected by the runtime, and we can further probe physical devices for information about, e.g., the memory they are fitted with or their support for optional features such as raytracing. We may create a Vulkan logical device (a handle used for interacting with a physical device) for any of them.

Vulkan devices do not run in sync with the CPU. Our only way of controlling them is by sending commands through their interfaces, which are structured as a set of device queues to which we can push commands; commands start being processed in their order of arrival. Each queue supports some types of tasks and runs mostly independently from the others (barring explicit synchronization and contention for computational resources).

Device queues do not handle individual commands. We send commands in bulk through command buffers built using the command recording mechanism. We use special memory allocators called command pools to reserve memory for the buffers both in host memory and on the targeted device.

For instance, if we wanted to build a renderer for a game with a main view and a minimap, we could use a queue limited to data transfers in parallel with two other queues supporting graphical operations.

GPUs are highly parallel devices, which can lead to all kinds of trouble with the scenario described above. What if the code rendering the main view accesses the data of the minimap while the minimap is being updated? As you can imagine, we have to make careful use of synchronization commands. These come in two different flavors: commands inserted into command buffers and functions called directly from the CPU.

For this purpose, Vulkan defines a bunch of synchronization primitives used through special commands that can be inserted in command buffers. There are also functions (not commands) for controlling some synchronization primitives from the CPU.

You may be wondering how resource management and transfer work (be it between queues or even between the CPU and the GPU). We leave this topic for the next chapter.

B. A deeper dive

B.1. Instance

A Vulkan instance is a gateway to the Vulkan runtime. Its creation is quite straightforward, though it is verbose, as most things in Vulkan are: see vkCreateInstance. We mainly configure two things at this point: instance extensions and validation layers.

Instance extensions extend Vulkan with additional capabilities. vkEnumerateInstanceExtensionProperties lists extensions supported by your Vulkan version. The most common instance extension is probably VK_KHR_surface, which enables rendering on the desktop environment's windows (as discussed in a later chapter).

Validation layers are key to debugging Vulkan programs. Vulkan is tailored for efficiency and does not stop for any checks by default. Layers can be activated to check for issues such as incorrect function parameters, memory leaks and thread safety violations. They work by hijacking calls to actual Vulkan functions, quite literally wrapping them in a verification layer. vkEnumerateInstanceLayerProperties lists layers supported by our Vulkan version. We disable all layers in release builds for performance reasons.

By default, layers are very verbose and write everything to the standard output. The VK_EXT_debug_utils instance extension offers the vkCreateDebugUtilsMessengerEXT function, which allows us to define custom callbacks instead. For instance, we could filter out useless messages and log the rest to a file.
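As a sketch, a messenger that drops everything below warning severity could look as follows (assuming VK_EXT_debug_utils was requested at instance creation; error handling is omitted):

```c
#include <stdio.h>
#include <vulkan/vulkan.h>

/* A minimal callback that logs warnings and errors and ignores the rest.
   Returning VK_FALSE tells the layer not to abort the triggering call. */
static VKAPI_ATTR VkBool32 VKAPI_CALL debug_callback(
    VkDebugUtilsMessageSeverityFlagBitsEXT severity,
    VkDebugUtilsMessageTypeFlagsEXT types,
    const VkDebugUtilsMessengerCallbackDataEXT *data,
    void *user_data) {
    fprintf(stderr, "[vulkan] %s\n", data->pMessage);
    return VK_FALSE;
}

static VkDebugUtilsMessengerEXT create_messenger(VkInstance instance) {
    VkDebugUtilsMessengerCreateInfoEXT info = {
        .sType = VK_STRUCTURE_TYPE_DEBUG_UTILS_MESSENGER_CREATE_INFO_EXT,
        .messageSeverity = VK_DEBUG_UTILS_MESSAGE_SEVERITY_WARNING_BIT_EXT |
                           VK_DEBUG_UTILS_MESSAGE_SEVERITY_ERROR_BIT_EXT,
        .messageType = VK_DEBUG_UTILS_MESSAGE_TYPE_GENERAL_BIT_EXT |
                       VK_DEBUG_UTILS_MESSAGE_TYPE_VALIDATION_BIT_EXT |
                       VK_DEBUG_UTILS_MESSAGE_TYPE_PERFORMANCE_BIT_EXT,
        .pfnUserCallback = debug_callback,
    };
    /* Extension functions are not part of the core loader interface and
       must be fetched by name. */
    PFN_vkCreateDebugUtilsMessengerEXT create =
        (PFN_vkCreateDebugUtilsMessengerEXT)vkGetInstanceProcAddr(
            instance, "vkCreateDebugUtilsMessengerEXT");
    VkDebugUtilsMessengerEXT messenger = VK_NULL_HANDLE;
    if (create != NULL)
        create(instance, &info, NULL, &messenger);
    return messenger;
}
```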

vkCreateInstance also takes (through its VkInstanceCreateInfo parameter) a VkApplicationInfo struct, which stores metadata about the application we are developing: title, version, etc.
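Putting the pieces together, a minimal instance creation could look like this (the application name is a placeholder, and error handling on the returned VkResult is omitted):

```c
#include <vulkan/vulkan.h>

/* Sketch: one validation layer in debug builds, no instance extensions. */
static VkInstance create_instance(void) {
    VkApplicationInfo app = {
        .sType = VK_STRUCTURE_TYPE_APPLICATION_INFO,
        .pApplicationName = "my-app",  /* hypothetical title */
        .applicationVersion = VK_MAKE_VERSION(1, 0, 0),
        .apiVersion = VK_API_VERSION_1_0,
    };
    const char *layers[] = { "VK_LAYER_KHRONOS_validation" };
    VkInstanceCreateInfo info = {
        .sType = VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO,
        .pApplicationInfo = &app,
#ifndef NDEBUG
        .enabledLayerCount = 1,        /* layers disabled in release builds */
        .ppEnabledLayerNames = layers,
#endif
    };
    VkInstance instance = VK_NULL_HANDLE;
    vkCreateInstance(&info, NULL, &instance);
    return instance;
}
```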

B.2. Devices

vkEnumeratePhysicalDevices lists all detected Vulkan-compatible physical devices as lightweight handles of type VkPhysicalDevice. These handles cannot be used directly to control the devices. For this purpose, we need to establish a deeper connection with them through the creation of VkDevice objects (also called "logical devices").

We only want to pay the (relatively high) price of creating a VkDevice for physical devices that match our needs. Vulkan offers several functions for getting information about physical devices (vkGetPhysicalDeviceProperties, vkGetPhysicalDeviceFeatures, vkGetPhysicalDeviceQueueFamilyProperties, etc.); we use them to pick suitable candidates.
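As an illustrative sketch, picking the first discrete GPU that offers a graphics-capable queue family might look like this (the selection criteria are arbitrary; a real application checks whatever it actually needs):

```c
#include <vulkan/vulkan.h>

/* Fixed-size arrays keep the sketch allocation-free; 16 is an arbitrary cap. */
static VkPhysicalDevice pick_device(VkInstance instance) {
    uint32_t count = 0;
    vkEnumeratePhysicalDevices(instance, &count, NULL);
    VkPhysicalDevice devices[16];
    if (count > 16) count = 16;
    vkEnumeratePhysicalDevices(instance, &count, devices);

    for (uint32_t i = 0; i < count; i++) {
        VkPhysicalDeviceProperties props;
        vkGetPhysicalDeviceProperties(devices[i], &props);
        if (props.deviceType != VK_PHYSICAL_DEVICE_TYPE_DISCRETE_GPU)
            continue;

        uint32_t families = 0;
        vkGetPhysicalDeviceQueueFamilyProperties(devices[i], &families, NULL);
        VkQueueFamilyProperties queues[16];
        if (families > 16) families = 16;
        vkGetPhysicalDeviceQueueFamilyProperties(devices[i], &families, queues);
        for (uint32_t q = 0; q < families; q++)
            if (queues[q].queueFlags & VK_QUEUE_GRAPHICS_BIT)
                return devices[i];  /* discrete GPU with a graphics queue */
    }
    return VK_NULL_HANDLE;
}
```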

When creating a device, we have to declare the queues that we will use upfront; the other queues will be disabled. Remember that commands are sent to specific queues, with each queue supporting only some kinds of commands. Using different queues helps maximize parallelism insofar as it limits clogging: it is generally a good idea to use specialized queues for different kinds of tasks (e.g., one queue for graphics rendering and another for resource transfers, as we do not want slowdowns in the rendering pipeline to impact transfers). For commands that rely on the same device capabilities, whether several queues are beneficial is more situational: forcefully splitting a rendering task across queues induces a need for synchronization that usually negates any potential benefit of the added parallelism.

vkCreateDevice actually creates the device. We fill out one VkDeviceQueueCreateInfo per queue family used. In this structure, we specify how many queues of the family to activate as well as their priority level (0.0f for lowest, 1.0f for highest). Priority is only a hint for the driver; it may or may not have an effect. If we do not care, we can just set all the priorities to the same value, say 0.0f. Note that there is no way of activating additional queues once the device has been created!
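A sketch of device creation, assuming we previously identified a graphics queue family and a transfer queue family (the family indices are hypothetical inputs):

```c
#include <vulkan/vulkan.h>

/* Activate two graphics queues and one transfer queue, all with the
   same (lowest) priority. */
static VkDevice create_device(VkPhysicalDevice physical,
                              uint32_t graphics_family,
                              uint32_t transfer_family) {
    float priorities[] = { 0.0f, 0.0f };
    VkDeviceQueueCreateInfo queues[] = {
        {
            .sType = VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO,
            .queueFamilyIndex = graphics_family,
            .queueCount = 2,
            .pQueuePriorities = priorities,
        },
        {
            .sType = VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO,
            .queueFamilyIndex = transfer_family,
            .queueCount = 1,
            .pQueuePriorities = priorities,
        },
    };
    VkDeviceCreateInfo info = {
        .sType = VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO,
        .queueCreateInfoCount = 2,
        .pQueueCreateInfos = queues,
    };
    VkDevice device = VK_NULL_HANDLE;
    vkCreateDevice(physical, &info, NULL, &device);
    return device;
}
```

Once the device exists, handles to the activated queues are retrieved with vkGetDeviceQueue(device, family, index, &queue).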

B.3. Commands

We control devices by sending commands to specific queues. This is done in four steps:

  1. We create a command pool, a special memory allocator reserved for command buffers
  2. We allocate command buffers
  3. We fill them with all sorts of commands
  4. We send the command buffers to compatible device queues

1. vkCreateCommandPool creates a memory pool for a whole family of queues. This step is trivial: we do not even need to specify what amount of memory we would like to set aside (command pools work through behind-the-scenes magic). A single command pool can serve many command buffer allocations over its lifetime.
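A minimal pool creation sketch (the reset flag is optional; it matters again below when we discuss vkResetCommandBuffer):

```c
#include <vulkan/vulkan.h>

/* One pool per queue family we intend to record commands for. */
static VkCommandPool create_pool(VkDevice device, uint32_t family) {
    VkCommandPoolCreateInfo info = {
        .sType = VK_STRUCTURE_TYPE_COMMAND_POOL_CREATE_INFO,
        /* Allows resetting individual buffers allocated from this pool. */
        .flags = VK_COMMAND_POOL_CREATE_RESET_COMMAND_BUFFER_BIT,
        .queueFamilyIndex = family,
    };
    VkCommandPool pool = VK_NULL_HANDLE;
    vkCreateCommandPool(device, &info, NULL, &pool);
    return pool;
}
```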

2. vkAllocateCommandBuffers allocates a bunch of command buffers in one fell swoop. For each kind of command buffer, we specify how many buffers of that kind we require and whether the buffers are primary or secondary, the difference being that primary buffers are directly submitted to queues whereas secondary ones are executed by primary command buffers rather than submitted directly (you can view them as functions of sorts). Again, no need to specify a size or such at this stage. vkFreeCommandBuffers undoes this allocation; we call it when we are done using the buffers.
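For instance, allocating three primary command buffers from a pool could look like this:

```c
#include <vulkan/vulkan.h>

/* Allocate three primary command buffers in a single call. */
static void allocate_buffers(VkDevice device, VkCommandPool pool,
                             VkCommandBuffer buffers[3]) {
    VkCommandBufferAllocateInfo info = {
        .sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_ALLOCATE_INFO,
        .commandPool = pool,
        .level = VK_COMMAND_BUFFER_LEVEL_PRIMARY,
        .commandBufferCount = 3,
    };
    vkAllocateCommandBuffers(device, &info, buffers);
    /* Later, when done: vkFreeCommandBuffers(device, pool, 3, buffers); */
}
```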

3. To fill a command buffer, we sandwich the appropriate commands between calls to vkBeginCommandBuffer and vkEndCommandBuffer, which both take the targeted buffer as argument. Commands are functions of the form vkCmd* (in fact, they are not commands in and of themselves but functions that record commands into the currently open command buffer). There are many commands to choose from, as can be seen by scrolling a bit on this page. We need to make sure that the queue we target supports our commands (commands used for rendering are only available on graphics queues, for instance).

vkBeginCommandBuffer takes two noteworthy arguments. The first is VkCommandBufferInheritanceInfo, which describes the state that secondary command buffers inherit from the primary buffers they are called from. I will not go into much more detail here, as secondary command buffers are one of those features that we should avoid: they are not well supported in practice, and AMD officially discourages their use. The second is VkCommandBufferUsageFlagBits, which defines a bunch of flags. The most important one indicates that a command buffer is used only once (signalling that some nice optimizations are safe). The other two flags are very niche: one is for command buffers that can be resubmitted before they are done executing (this may reduce performance and seems to be generally frowned upon; also, synchronization between the different executions of the buffer obviously has to be handled manually), and the other is for secondary command buffers that are submitted as part of render passes (a topic that we will discuss in the graphics chapter).
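A recording sketch using the one-time-submit flag; vkCmdFillBuffer stands in for whatever commands we actually need, and `target` is a hypothetical VkBuffer:

```c
#include <vulkan/vulkan.h>

/* Record a throwaway command buffer: begin, one command, end. */
static void record(VkCommandBuffer cb, VkBuffer target) {
    VkCommandBufferBeginInfo begin = {
        .sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO,
        /* This buffer will be submitted once, then reset or freed. */
        .flags = VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT,
    };
    vkBeginCommandBuffer(cb, &begin);
    /* An example vkCmd* function: zero out the whole target buffer. */
    vkCmdFillBuffer(cb, target, 0, VK_WHOLE_SIZE, 0);
    vkEndCommandBuffer(cb);
}
```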

4. vkQueueSubmit tells a device to run command buffers. It takes a queue and an array of command buffers to execute, plus some synchronization-related arguments — more on them in the next section. The commands in the buffers should be supported by the targeted queue (we must pay attention to its type).
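A minimal submission, with no synchronization attached yet:

```c
#include <vulkan/vulkan.h>

/* Submit a single command buffer: no semaphores, no fence. */
static void submit(VkQueue queue, VkCommandBuffer cb) {
    VkSubmitInfo info = {
        .sType = VK_STRUCTURE_TYPE_SUBMIT_INFO,
        .commandBufferCount = 1,
        .pCommandBuffers = &cb,
    };
    vkQueueSubmit(queue, 1, &info, VK_NULL_HANDLE);
}
```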

vkResetCommandBuffer resets a command buffer, freeing it for reuse and sparing us the allocation of a new one (allocation is not a cheap operation, apparently; I guess that reuse limits fragmentation, but do not quote me on that). This function is only available if the flag VK_COMMAND_POOL_CREATE_RESET_COMMAND_BUFFER_BIT was passed to the command pool during its creation. Alternatively, vkResetCommandPool can always be used. It resets entire command pools, freeing all the command buffers they hold. We have to ensure that no command buffer is in flight at this point.

B.4. Synchronization

When discussing vkQueueSubmit, we postponed the description of synchronization. As GPUs are massively parallel devices, it is only natural that questions of synchronization play a large role in Vulkan.

Before starting, a quick caveat: the order of submission to queues only fixes the order in which the device starts handling the commands, i.e., the order in which it takes note of their existence. For the order in which they are actually executed, anything goes. It is common for later commands to complete before earlier ones (even on a single queue!). Our only tool for getting some order into this madness is synchronization, which we manage with the help of the constructs introduced below.

B.4.a. Fences

We use VkFences for GPU-CPU synchronization; the CPU can wait on fences, query them and reset them, but only the GPU signals them. The GPU signals a fence when it is done with a task. The CPU can read the status of such objects through vkGetFenceStatus or block on them through vkWaitForFences. For instance, vkQueueSubmit takes an optional fence that gets signaled once all the submitted commands have completed. We can recycle fences through vkResetFences.
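A typical pattern, sketched below: submit some work with a fence attached, then block the CPU until the GPU signals it:

```c
#include <stdint.h>
#include <vulkan/vulkan.h>

/* Submit a command buffer and wait for the GPU to finish it. */
static void submit_and_wait(VkDevice device, VkQueue queue,
                            VkCommandBuffer cb) {
    VkFenceCreateInfo fence_info = {
        .sType = VK_STRUCTURE_TYPE_FENCE_CREATE_INFO,
    };
    VkFence fence = VK_NULL_HANDLE;
    vkCreateFence(device, &fence_info, NULL, &fence);

    VkSubmitInfo submit = {
        .sType = VK_STRUCTURE_TYPE_SUBMIT_INFO,
        .commandBufferCount = 1,
        .pCommandBuffers = &cb,
    };
    /* The fence is signaled once all submitted commands complete. */
    vkQueueSubmit(queue, 1, &submit, fence);

    vkWaitForFences(device, 1, &fence, VK_TRUE, UINT64_MAX);
    vkResetFences(device, 1, &fence);  /* recycle it, or destroy it: */
    vkDestroyFence(device, fence, NULL);
}
```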

B.4.b. (Binary) semaphores

Understanding semaphores requires a detour through the notion of Vulkan pipelines. Vulkan tasks are split into different steps. For instance, when rendering a 3D scene, Vulkan starts by projecting the geometry of the scene's vertices from world space to screen space. Computing the colors of the screen's pixels occurs much later, after a bunch of similar intermediate steps. We call such steps "stages", and the sum of all stages makes up a pipeline. Different categories of tasks go through different pipelines (we do not need to compute pixel colors when using GPUs as general-purpose computing devices, for instance). There exist only a handful of pipelines, and all pipelines and their stages are defined by Vulkan (e.g., the stage where we compute pixel colors is called VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT, for reasons that will become clear later).

vkQueueSubmit relies on VkSemaphores, a GPU-GPU synchronization construct. Every submission comes with a set of wait semaphores and another set of signal semaphores. Before the submitted commands can execute any stage listed in the nth pWaitDstStageMask entry (a mask of pipeline stages), the nth wait semaphore needs to be signaled. As to the signal semaphores, they are all signaled at the same time, once all the submitted commands complete. Note that a command buffer submitted to some queue may wait on a semaphore signaled from another queue (so long as both queues belong to the same device).
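As a sketch, making a draw submitted on one queue wait (at the fragment shader stage only) for an upload submitted on another; the queue and buffer names are hypothetical:

```c
#include <vulkan/vulkan.h>

/* Chain two submissions across queues of the same device through
   a single semaphore. */
static void chained_submits(VkQueue transfer_queue, VkQueue graphics_queue,
                            VkCommandBuffer upload_cb, VkCommandBuffer draw_cb,
                            VkSemaphore sem) {
    VkSubmitInfo upload = {
        .sType = VK_STRUCTURE_TYPE_SUBMIT_INFO,
        .commandBufferCount = 1,
        .pCommandBuffers = &upload_cb,
        .signalSemaphoreCount = 1,
        .pSignalSemaphores = &sem,   /* signaled when the upload completes */
    };
    vkQueueSubmit(transfer_queue, 1, &upload, VK_NULL_HANDLE);

    VkPipelineStageFlags wait_stage = VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT;
    VkSubmitInfo draw = {
        .sType = VK_STRUCTURE_TYPE_SUBMIT_INFO,
        .waitSemaphoreCount = 1,
        .pWaitSemaphores = &sem,
        .pWaitDstStageMask = &wait_stage, /* only this stage waits */
        .commandBufferCount = 1,
        .pCommandBuffers = &draw_cb,
    };
    vkQueueSubmit(graphics_queue, 1, &draw, VK_NULL_HANDLE);
}
```

Earlier stages of the draw (vertex processing, for instance) are free to run before the upload finishes; only the fragment shader stage is held back.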

B.4.c. Barriers

The synchronization primitives we met up to this point were coarse-grained: we could signal semaphores only when whole command buffers were done executing. What if we are dealing with a large single task made of lots of tightly integrated subtasks? Surely we could benefit from finer-grained control! This is precisely the role of pipeline barriers, synchronization constructs for ordering commands within the same queue. Unlike semaphores and fences, barriers are not Vulkan objects but special commands (recorded through vkCmdPipelineBarrier) inserted directly into command buffers. They come in two flavors: execution barriers and memory barriers.

Execution barriers say "before executing any stage X, Y or Z command that comes after the barrier command itself, ensure that all stage A, B and C commands that came before it are done executing" (remember that although commands are submitted linearly to a queue and start executing in their submission order, everything effectively runs in parallel and may complete in any order). We configure execution barriers through the srcStageMask and dstStageMask arguments of vkCmdPipelineBarrier. This function offers more arguments, but we just give them default values, as they are only relevant for memory barriers. Indeed, both execution barriers and memory barriers are set up using the same function.
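A pure execution barrier might look like this: all compute-shader commands recorded after the barrier wait for all transfer commands recorded before it (the stage choice is just an example):

```c
#include <vulkan/vulkan.h>

/* Execution-only barrier: no memory barriers attached. */
static void execution_barrier(VkCommandBuffer cb) {
    vkCmdPipelineBarrier(cb,
        VK_PIPELINE_STAGE_TRANSFER_BIT,        /* srcStageMask: wait on these */
        VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,  /* dstStageMask: block these  */
        0,                                     /* dependencyFlags */
        0, NULL,                               /* no VkMemoryBarrier */
        0, NULL,                               /* no buffer barriers */
        0, NULL);                              /* no image barriers */
}
```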

Memory barriers have to do with resource synchronization. GPUs have a central memory and local caches, and these caches can go out of sync; synchronization is manually enforced through memory barriers, which give us control over when data gets pushed from local caches to the main memory and when it gets pulled from it into local caches. Building a memory barrier starts like building an execution barrier: we set srcStageMask and dstStageMask in vkCmdPipelineBarrier. The barrier waits for all commands matching srcStageMask that were recorded before it to complete before unblocking the commands matching dstStageMask that were recorded after it. To turn an execution barrier into a memory barrier, we insert objects of type VkMemoryBarrier in the function's arguments. Data modified by previous commands matching the combination of srcStageMask and srcAccessMask is pushed to main memory after being written to the local cache, and data read by subsequent commands matching the combination of dstStageMask and dstAccessMask is pulled from main memory before the reads occur. Access masks are exclusive to memory barriers. They are built out of VkAccessFlagBits (such as VK_ACCESS_SHADER_WRITE_BIT) and allow us to refine our requests: not only can we match commands by the stage they occur in, but also by the type of memory access they perform.
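Extending an execution barrier into a memory barrier with a VkMemoryBarrier, e.g., to make transfer writes visible to later compute-shader reads:

```c
#include <vulkan/vulkan.h>

/* Memory barrier: flush transfer writes, make them visible to
   subsequent shader reads. */
static void memory_barrier(VkCommandBuffer cb) {
    VkMemoryBarrier barrier = {
        .sType = VK_STRUCTURE_TYPE_MEMORY_BARRIER,
        .srcAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT, /* push these writes  */
        .dstAccessMask = VK_ACCESS_SHADER_READ_BIT,    /* before these reads */
    };
    vkCmdPipelineBarrier(cb,
        VK_PIPELINE_STAGE_TRANSFER_BIT,
        VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,
        0,
        1, &barrier,   /* one VkMemoryBarrier */
        0, NULL,
        0, NULL);
}
```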

Memory barriers are simply an extension of execution barriers, and even execution barriers are actually about memory. Indeed, the point of barriers is to prevent parallelism issues, which always boil down to memory syncing issues; memory barriers are just more explicit about it.

There are also memory barriers specialized for certain types of resources; we will discuss those in more detail in the next chapter. Also, note that we may submit several memory barriers in a single vkCmdPipelineBarrier call, so long as they share the same srcStageMask and dstStageMask.

The last remaining argument of vkCmdPipelineBarrier is dependencyFlags. The only relevant value defined in Vulkan 1.0 is VK_DEPENDENCY_BY_REGION_BIT, which is used in conjunction with another, graphics-specific feature called tiling for even finer-grained dependencies. Tiling is basically about splitting images into small tiles and doing the rendering on each tile individually (more details to follow in the graphics chapter). When we consider a single tile, we do not care about previous commands having run to completion for the entire image; we only care about their effects regarding that tile's small portion of the overall image. VK_DEPENDENCY_BY_REGION_BIT enables tracking dependencies on a per-tile basis.

Also note that queues are not aware of the boundaries between command buffers: they only see a stream of commands. Barriers apply to all commands submitted to the queue, not just to those of the buffer they happen to be submitted in.

I recommend checking this post by Maister for more fleshed out examples, tricks and caveats (also see this one by Jeremy Ong).

B.4.d. Events

VkEvents are yet another synchronization primitive, used to insert fine-grained dependencies between the CPU and the GPU, or within some queue. What is interesting about them is that they allow patterns like (CMD A, SET_EVENT x, CMD B, WAIT_EVENT x, CMD C): here, commands A and B may execute in parallel; only CMD C effectively waits for event x. Events are less useful than the other primitives. Knowing that they exist does not hurt, but there is no need to think about them too much.

Events can be set from the CPU (vkSetEvent) or from the GPU (vkCmdSetEvent). Only GPUs can wait on events (vkCmdWaitEvents) although the status of an event can be checked from the CPU using vkGetEventStatus. Events can also be reset from the CPU (vkResetEvent) or the GPU (vkCmdResetEvent).
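The (CMD A, SET_EVENT x, CMD B, WAIT_EVENT x, CMD C) pattern from above could be recorded as follows (the stage masks are illustrative, and the vkCmd* placeholders stand for real work):

```c
#include <vulkan/vulkan.h>

/* A and B may overlap; only C waits for the event. */
static void event_pattern(VkCommandBuffer cb, VkEvent event) {
    /* ... record CMD A ... */
    vkCmdSetEvent(cb, event, VK_PIPELINE_STAGE_TRANSFER_BIT);
    /* ... record CMD B ... */
    vkCmdWaitEvents(cb, 1, &event,
        VK_PIPELINE_STAGE_TRANSFER_BIT,        /* srcStageMask */
        VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,  /* dstStageMask */
        0, NULL, 0, NULL, 0, NULL);            /* optional memory barriers */
    /* ... record CMD C ... */
}
```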

B.4.e. Wait idle

vkQueueWaitIdle and vkDeviceWaitIdle are radical options for ensuring that a queue or a device is inactive (all commands submitted to them are done). They are blocking functions that do not care about subtlety. In practice, we almost always favor fences for GPU-CPU synchronization.

B.5. Cleaning up

We must destroy the objects we created once we are done with them. We do this through functions of the form vkDestroyXxx (e.g., vkDestroyCommandPool). Similarly, we must free objects allocated through functions of the form vkAllocateXxx using functions of the form vkFreeXxx. This is something we have to do all the time in Vulkan, though this guide won't be explicit about it in the future. Just remember that everything that is created/allocated must eventually be destroyed/freed.