Chapter 2: resources and transfers

In the previous chapter, we discussed the basics of Vulkan. We learned how to:

Before turning to concrete examples of workflows such as rendering or computing, we need to introduce resources and transfers. Indeed, all the interesting operations involving GPUs are about data (resources), and we require a way of exchanging data between the host and the GPU (transfers).

Along the way, we discuss how Vulkan represents memory, as this concept is not self-evident in a Vulkan context (as a start, a single device typically exposes several different memories). We also introduce two important kinds of resources: the very generic data buffers and the more specialized images.

Remember that Vulkan provides a unified API for interacting with very different forms of GPUs: not only classical GPUs (sometimes called discrete GPUs), but also integrated graphics chipsets or mobile GPUs (see the two foldables below for more information about them). Aspects of Vulkan that may seem needlessly contrived at first glance are often explained by its role as a facilitator for writing a single program that runs efficiently both on classical GPUs and on more exotic devices.

Integrated graphics chipsets

Integrated graphics chipsets are simple implementations of GPUs that share resources with the CPU. Although they are less powerful than equivalent discrete GPUs, they are cheaper and less power hungry. Such devices typically use CPU-addressable memory directly.

Mobile GPUs and tiled architectures

Mobile GPUs are subject to energy efficiency constraints. Although modern high-end desktop GPUs consume up to 450W of power, mobile phones (CPU + GPU) typically stay below 8W.

A common optimization found in such architectures is tile-based rendering, an approach that greatly reduces the weight of memory-related operations. Classical GPUs handle rendering for an entire image in one go. In contrast, tiled architectures break up the image into a set of small tiles (typically 16x16 pixels) that go through the pipeline individually, only assembling the results into a whole image at the end. The data required to compute the results for a tile is so limited that it may fit entirely in the cache (which is the key to limiting the memory load).

See this Arm documentation page or this blog post for more information.

A. A high-level overview

A Vulkan device exposes multiple memories with different characteristics. Some memories are only accessible to the GPU, others are physically shared between the GPU and the CPU, others still are physically located on the GPU but provide a mechanism for the host to view and set their contents. The actual memories of a device are called memory heaps. Each allocation on a heap is attached to a behavior called a memory type. Each heap supports a certain number of memory types. Different allocations on the same heap may be attached to different memory types.

We use allocation to reserve a portion of some memory of a device (just like allocating memory from the CPU reserves some portion of the machine's working memory). Allocating memory is a costly operation. In fact, devices define strict (and typically quite low) limits regarding how many concurrent allocations we can use. We will not be getting away with one allocation per object. Instead, we should allocate in bulk and reuse memory as much as possible, be it through manual management or the use of a third-party, custom allocator that handles sub-allocation transparently. Allocations are tied to a memory type (from which we can deduce the targeted heap); a memory type really describes the behavior of individual allocations. The next paragraph gives more information as to how these behaviors may differ.

Data exchanges between the CPU and a device occur through the operation known as mapping. Mapping requires a memory heap that is shared between the device and the CPU. Setting up a mapping returns a pointer to CPU-addressable memory, which points either to memory that is physically shared between the CPU and the device, or to memory that is kept in sync with device-side memory, depending on the targeted heap. In the latter case, synchronization is either handled implicitly by the driver or explicitly by us (this depends on the targeted type; it is common to have a heap offering both a type for implicit synchronization and another one for explicit synchronization). Explicit synchronization is more efficient but more cumbersome. To send data to the device, we just write to the portion of memory designated by the pointer (and push those changes to the device explicitly when using explicit synchronization).

Mapping is the only way of sending data from CPU-land to GPU-land. However, the heaps that are mappable are typically not the fastest ones available. To get the data onto faster heaps, we use transfer operations. Loading data onto a GPU is therefore commonly done in two steps: first the import through mapping, and then the transfer to a more efficient heap. Vulkan defines distinct, specialized transfer operations for every type of resource it supports.

In this chapter, we introduce two kinds of resources: buffers and images. Buffers are the simplest kind of resource there is. They are simply wrappers around a linear array of data. Images are more complex. They may be 1D, 2D or 3D, they may be made of different kinds of pixels (e.g., black and white or RGB), they may use mipmapping (with different parameters), they may hold multisampling information and they may be laid out in memory in different ways.

Resources cannot be freely accessed from any queue without synchronization. Synchronization may be handled implicitly (concurrent resources) or explicitly (exclusive resources). For concurrent resources, the queue families granted access must be declared upfront, and the driver handles everything behind the scenes from then on. Exclusive resources, on the other hand, are tied to the notions of ownership and ownership transfers. In general, a queue should access an exclusive resource only if this resource is currently owned by the family of this queue. In practice, we can skip the ownership transfer when we do not care about the previous contents of the resource (although we should still make sure that the resource is not being modified from two families at once), and the first queue family ever to access a resource gains ownership implicitly, without the need for any transfer. Concurrent resources may come with some overhead, so it is safer to use exclusive resources and manual ownership transfers for performance. Note that ownership transfers and resource transfers are different concepts. The former are handled through specialized memory barriers (a pair of them, actually: one for releasing ownership and one for acquiring it, plus something like a semaphore to ensure that the release indeed happens before the acquisition; my mental model for this is that we push changes from the cache used by the queue family we release from, and pull into the cache of the queue family we acquire from). The latter are handled through special transfer commands.

Resource definition and resource initialization are not done in a single step. When we create the handle for a resource, we describe all the relevant information about the structure of this resource (e.g., its size), but we still have to create an actual binding to memory. The data eventually lives in some memory heap offered by the device. Note that resource handles and their associated memory mate for life (except for the "sparse" variants of resources, which may even be spread across multiple sections of memory, but are not supported by all GPUs, as we will see).

Each resource is tied to a set of supported usages (which we declare upfront and can never change). For instance, there is a special usage flag for indicating that a resource is a valid target for data transfers (there is also one for valid sources for transfers, and many others that we will meet in due time).

In addition to actual transfer operations (with names such as vkCmdCopyBuffer or vkCmdCopyImageToBuffer), there are other operations that are supported by transfer queues even though they do not properly correspond to transfers. These functions are about updating the contents of resources in some way (e.g., vkCmdFillBuffer fills a region of a buffer with copies of a single value, and vkCmdBlitImage can be used to create a scaled version of an image).

B. A deeper dive

B.1. Memories

B.1.1. Heaps and types

A Vulkan device exposes multiple memories with different characteristics. vkGetPhysicalDeviceMemoryProperties returns information about the memories owned by a physical device as a set of VkMemoryHeaps (the actual memories) and a set of VkMemoryTypes (behaviors that allocations can adopt). Although each memory type refers to a single heap, a heap may be tied to several types, and distinct allocations on the same heap may use different types.

The main information found in a VkMemoryHeap is its size and whether it is "device-local" (indicated by a flag), which is short for "among the fastest available heaps for accesses from the device itself". We can roughly think of device-local heaps as heaps physically located on the device; note, however, that each device is required to expose at least one device-local heap, so even devices that really are CPU-side emulations of GPUs will include such memories. VkMemoryTypes are more complex. They include a bitmask of flags of the form VK_MEMORY_PROPERTY_XXX_BIT that specify if the memory is:

The precise meaning of different flag combinations varies subtly from vendor to vendor. This blog post by Adam Sawicki gives concrete information as to the situation on the ground.

Each device is guaranteed to include one type that is both HOST_VISIBLE and HOST_COHERENT, as well as one that is DEVICE_LOCAL (there may even be a single type in total so long as it meets both these requirements).

B.1.2. Allocating memory

We use vkAllocateMemory to reserve a chunk of memory of a given size in some heap through a specific memory type. Successful allocations return VkDeviceMemory handles. In Vulkan, resources such as buffers or images are created in two steps: the declaration of the resource and the binding to memory that holds the data. We actually use a VkDeviceMemory object (plus an offset value) to initiate such bindings. When we are done using an allocation, we deallocate the memory chunk using vkFreeMemory.

Creating Vulkan memory allocations is slow and subject to limitations. Most drivers only support a limited number of simultaneous allocations (at least 4096, but rarely much more; see the maxMemoryAllocationCount limit in VkPhysicalDeviceProperties to get the exact value for a device). Even relatively simple tasks often make use of more resources than this. Consequently, the naïve strategy of allocating memory once per resource is not viable. Instead, we must sub-allocate: we reserve memory in bulk and manually split it into different sections dedicated to individual resources.

The default memory allocator used by Vulkan is simple and naïve. We can import a third-party allocator such as Vulkan Memory Allocator to handle sub-allocation mostly transparently. See this blog post by Kyle Halladay for a walkthrough of a custom allocator implementation (it starts with general information about memory types in the wild).

B.1.3. Mapping memory

We use mapping to exchange data between the CPU and the GPU. Remember that HOST_VISIBLE memories can be written to directly from the CPU (the connection to the GPU is handled implicitly). vkMapMemory creates mappings, returning classical CPU pointers for memory slices through its ppData argument. We can then write to this memory from the CPU, and the changes get reflected to the bound GPU memory. vkUnmapMemory undoes this mapping.

Memory that is HOST_VISIBLE but not HOST_COHERENT needs to be synced manually. We use vkFlushMappedMemoryRanges to push the CPU's version of the data to the device, whereas vkInvalidateMappedMemoryRanges updates the CPU copy of the data with the device's version. For HOST_COHERENT memory, the driver takes care of this synchronization implicitly (at the risk of performing unnecessary, costly synchronization work behind the scenes).

B.2. Resources

Mapping only works for the typically slow HOST_VISIBLE memories. It is required to bring data onto the GPU, but we need something more to move this data to more efficient, non-mappable memory for frequent access. This is what the transfer operations are about. However, these do not work with raw memory but with resource objects, such as buffers or images. Before discussing the operations themselves, we need to introduce the objects they work with.

In this section, we introduce two kinds of resources: buffers and images. We describe how these resources are defined, but not yet how they are used. We will encounter some of the functions that make use of these resources in the next section.

B.2.1. Buffers

Buffers are a resource type with a bland personality. They consist of linear arrays of arbitrary data. We create buffers using vkCreateBuffer, specifying a size, an intended usage and a queue family ownership mode.

The usage flags of a buffer are set at the time of its creation. These flags determine which operations can be done on the object. For instance, if we want to be able to transfer a buffer between heaps, then we must set its VK_BUFFER_USAGE_TRANSFER_SRC_BIT flag (we also need to create another buffer object on the other heap, this time with the VK_BUFFER_USAGE_TRANSFER_DST_BIT flag).

For synchronizing accesses to the same buffer from different queues, we have two options: either we put the buffer in exclusive mode (VK_SHARING_MODE_EXCLUSIVE), where we accept the burden of enforcing synchronization between distinct queue families, or we use concurrent mode (VK_SHARING_MODE_CONCURRENT), where we declare a set of queue families that are allowed to access the resource and let the driver handle synchronization issues automatically for them. The concurrent mode may add some overhead and may do some unnecessary transfers behind the scenes, so exclusive mode should be preferred (although this is not that significant; we can get away with some laziness).

Ownership transfers are done through a specialized version of memory barriers (VkBufferMemoryBarriers, more detail in the last section of this chapter). More exactly, ownership is transferred through a pair of barriers: one is passed to the queue releasing ownership, and the other is passed to the queue acquiring it. Of course, it would be silly to acquire ownership before it has been fully released, so we should also synchronize these two operations (e.g., via a semaphore). Ownership really is about synchronization between different levels of memories in the hierarchy that goes from local caches to a device's main memory. The fact that we use a barrier for this purpose should therefore not come as a surprise. Handling pushes to main memory and pulls from local cache is precisely what memory barriers are about. The only thing that may be surprising is what this has to do with queue families. Do queue families share some kind of cache? As it turns out, yes, they do (at least conceptually)! The details can be found in the Vulkan memory model.

We tie a buffer to its data via vkBindBufferMemory. vkGetBufferMemoryRequirements gives us information about the memory requirements for a buffer: size, alignment and compatible memory types. We must bind the buffer before using it, and there is no way to undo that binding short of destroying the buffer. In particular, it is illegal to rebind an already bound buffer.

Sparse buffers

We must bind classical resources contiguously to one memory object before their first use, and this binding is final. Sparse resources such as sparse buffers relax these restrictions: they can bind non-contiguously to multiple memory objects, and we may update their bindings throughout their lifetime.

Not all devices support sparse binding. To check if a specific physical device does, we check the result of vkGetPhysicalDeviceFeatures. To create a sparse buffer, we pass the VK_BUFFER_CREATE_SPARSE_BINDING_BIT flag at buffer creation time. Additionally, we may pass the VK_BUFFER_CREATE_SPARSE_RESIDENCY_BIT (the buffer may be used even though only part of it is bound) and VK_BUFFER_CREATE_SPARSE_ALIASED_BIT (the same memory may be bound multiple times by the same or different buffers) flags, so long as vkGetPhysicalDeviceFeatures allows it.

vkQueueBindSparse creates or updates a sparse binding. For more detail and examples, including information about alignment or the handling of subresources, check this documentation entry.

B.2.2. Images

There are many similarities between images and buffers. We create images with vkCreateImage and bind them to memory using vkBindImageMemory. When creating images, we specify a usage (with a different family of flags, but it is the same idea) and a sharing mode (again, exclusive or concurrent). We also check for memory requirements like for buffers, using vkGetImageMemoryRequirements.

However, there are also differences. When creating an image, we do not specify its size directly. Indeed, the image type is quite complex and the driver may store the data of an image in different manners throughout its lifetime, making its memory footprint variable. Here are the properties that are exclusive to images.

Devices may not support all settings for images. We use vkGetPhysicalDeviceImageFormatProperties to check whether a precise combination of type + format + tiling + usage (+ image create flags, as required for sparse images and some other niche features) is supported by a physical device. If the combination is supported at all, additional information about the maximum possible dimensions of the image, mipmapping levels, layers and samples are returned, as well as an upper bound on the maximum amount of memory that such an image is allowed to use on the device (the system may not report less than 2GiB).

Additionally, vkGetPhysicalDeviceFormatProperties indicates which commands/usages are supported for images of specific formats.

Just like for buffers, queue family ownership transfers work through a specialized version of VkMemoryBarrier called VkImageMemoryBarrier. This structure is also used for performing layout transitions, as we will discuss in the last section of this chapter.

Mipmapping and subresources

When we use mipmapping, we introduce a bunch of subimages attached to the main resource (like this or like this, depending on how we handle it). We can rely on vkGetImageSubresourceLayout to get information regarding how these images are laid out in memory, so long as they use linear tiling (which I repeatedly told you to avoid).

See this tutorial for more information about mipmapping, including the generation of mipmap data (you should probably finish reading this chapter first).

Sparse images

Just like there are sparse buffers, there are sparse images. These work in mostly the same way as sparse buffers. Subblocks of the image (and of its subresources) can be split between different bindings at a given level of granularity. See the available flags: more of the same (note that not all flags on this page are related to sparse bindings). vkGetImageSparseMemoryRequirements returns information about memory requirements for sparse images.

Again, we use vkQueueBindSparse for handling bindings. For more detail and examples, including information about alignment requirements, check this documentation entry.

B.3. The transfer pipeline

Conceptually, all Vulkan commands submitted through command buffers go through pipelines. The transfer pipeline is the simplest of them all, to the point where calling it a pipeline feels a bit wrong:

Top of pipe
Transfer
Bottom of pipe

The command enters the pipeline, executes (in the Transfer stage, which does the heavy lifting; this is the only stage in this pipeline that feels like an actual stage) and exits the pipeline. That is all!

Transfer operations are so fundamental that they are supported by every queue family that supports graphics or compute operations (for such families, support is implicit and VK_QUEUE_TRANSFER_BIT need not even be set); in practice, no check is necessary. There are distinct commands for handling buffers and images, as discussed below.

B.3.1. Buffer operations

Buffer transfer commands do not operate on full buffers but on regions thereof. A buffer region is characterized by an offset and a size. To work on the full buffer, we set the offset of a region to 0 and the size to that of the full buffer. The read regions should not overlap with the regions that are written to (otherwise, we lose any guarantees regarding the state of the overlapping region).

Only a handful of commands are available for buffers:

B.3.2. Image operations

The notion of region gets more complex for images. An image region may be 1D, 2D or 3D, and it is tied to a specific subresource of the image.

The following commands are available for images:

Additionally, there is a pair of commands for turning buffers into images and the other way around:

B.4. Specialized synchronization primitives

Here are the last two synchronization mechanisms we did not see in the previous chapter, VkBufferMemoryBarrier (mostly useless, bar for ownership transfers) and VkImageMemoryBarrier (used for ownership transfers and layout transitions).

These constructs are variants of VkMemoryBarrier, which we met previously. In theory, they can be used as memory barriers that only concern a single specific resource. In practice, this feature is not supported by any major driver (at least according to some sources): the resulting behavior is that of a classical memory barrier, so we are just as well off using classical barriers for this purpose (we would not be worse off with the specialized ones, however, and exotic or future devices could benefit from more specialized barriers being used, so why not?).

We execute both of these primitives by passing them as arguments to vkCmdPipelineBarrier.

B.4.1. Buffer barriers

In practice, we only use VkBufferMemoryBarrier for one purpose: ownership transfers of buffers or subsets thereof. We only need to specify the index of the previous owner (remember that we are dealing with queue families) and that of the new owner and we are set (except that we need to do it twice, once for release and once for acquisition, and we also need to add synchronization between these two steps, as discussed above).

B.4.2. Image barriers

Just like buffer barriers, VkImageMemoryBarrier can be used for ownership transfers. However, this primitive has an additional trick up its sleeve, namely layout transitions.

When we want an image resource to take on a new role, we do a layout transition. For this, we simply specify the original layout of the image as well as its new one. Note that layout transitions happen in local cache: we need to make sure that it is in sync with the main memory beforehand. Additionally, their effects are not automatically pushed to main memory. The layout transition occurs after the commands that the barrier waits on are done executing, and before the instructions waiting on the barrier run.