Chapter 2: resources and transfers
In the previous chapter, we discussed the basics of Vulkan. We learned how to:
- See available devices and their properties
- Create a connection to a device
- Send commands to a device
- Synchronize commands
Before turning to concrete examples of workflows such as rendering or computing, we need to introduce resources and transfers. Indeed, all the interesting operations involving GPUs are about data (resources), and we require a way of exchanging data between the host and the GPU (transfers).
Along the way, we discuss how Vulkan represents memories. The concept of memory in Vulkan is not self-evident. As a start, a single device typically exposes several different memories. We also introduce two important kinds of resources: the generic buffers and the more specialized images.
Remember that Vulkan provides a unified API for interacting with very different forms of GPUs: not only classical GPUs (sometimes called discrete GPUs), but also integrated graphics chipsets or mobile GPUs (see the two foldables below for more information about them). Some aspects of Vulkan that may seem needlessly contrived at first glance are often explained by its role as a facilitator for writing a single program that runs efficiently both on classical GPUs and on more exotic devices.
Integrated graphics chipsets
Integrated graphics chipsets are simple implementations of GPUs that share resources with the CPU. Although they are less powerful than equivalent discrete GPUs, they are cheaper and less power hungry. Such devices typically use CPU-addressable memory directly.
Mobile GPUs and tiled architectures
Mobile GPUs are subject to energy efficiency constraints. Although modern high-end desktop GPUs consume up to 450W of power, mobile phones (CPU + GPU) typically stay below 8W.
A common optimization found in such architectures is tile-based rendering, an approach that limits the amount of memory accesses. Classical GPUs handle rendering for an entire image in one go. In contrast, tiled architectures break up the image into a set of small tiles (typically 16x16 pixels) that go through the pipeline individually, only assembling the results into a whole image at the end. The data required to compute the results for a tile is so limited that it may fit entirely in the cache (which is the key to limiting the memory load).
See this Arm documentation page or this blog post for more information.
A. A high-level overview
A Vulkan device exposes multiple memories with different characteristics. Some memories are only accessible to the GPU, others are physically shared between the GPU and the CPU, others still are physically on the GPU but provide a mechanism for the host to view and set their contents. The actual memories of a device are called memory heaps. Each allocation on a heap is attached to a behavior called a memory type. Each heap supports a limited amount of memory types. Different allocations on the same heap may be attached to different memory types.
We use allocation to reserve a portion of a device's memories (just like allocation can be used to reserve portions of the machine's working memory from the CPU). Memory allocations should not be used carelessly, as they are costly operations. In fact, a device sets strict (and typically quite low) limits on how many concurrent allocations it supports. We will not get away with doing one allocation per object. Instead, we should allocate in bulk and reuse memory as much as possible, be it through manual management or through a third-party custom allocator that handles sub-allocation transparently. Allocations target a memory type (from which we can deduce the targeted heap); memory types really describe the behavior of individual allocations. The next paragraph gives more information as to how these behaviors may differ.
Data exchanges between the CPU and a device occur through the operation known as mapping. Mapping requires a memory heap that is shared between the device and the CPU. Setting up a mapping returns a pointer to CPU-addressable memory, which points either to memory that is physically shared between the CPU and the device, or to memory that is kept in sync with device-side memory, depending on the targeted heap. In the latter case, synchronization is either handled implicitly by the driver or explicitly by us (this depends on the targeted type; it is common to have a heap offering both a type for implicit synchronization and another one for explicit synchronization). Explicit synchronization is more efficient but more cumbersome. To send data to the device, we just write to the portion of memory designated by the pointer (and push those changes to the device explicitly when using explicit synchronization).
Mapping is the only way of sending data from CPU-land to GPU-land. However, the heaps that are mappable are typically not the fastest ones available. To get the data onto faster heaps, we use transfer operations. Loading data onto a GPU is therefore commonly done in two steps: first the import through mapping, and then the transfer to a more efficient heap. Vulkan defines distinct, specialized transfer operations for every type of resource it supports.
In this chapter, we introduce two kinds of resources: buffers and images. Buffers are the simplest kind of resource there is. They are simply wrappers around a linear array of data. Images are more complex. They may be 1D, 2D or 3D, they may be made of different kinds of pixels (e.g., black and white or RGB), they may use mipmapping (with different parameters), they may hold multisampling information and they may be laid out in memory in different ways.
Resources cannot be freely accessed from any queue without synchronization. Synchronization may be handled implicitly (concurrent resources) or explicitly (exclusive resources). For concurrent resources, the queue families granted access must be declared upfront, and the driver handles everything behind the scenes from then on. Exclusive resources, on the other hand, are tied to the notions of ownership and ownership transfers. In general, a queue should access an exclusive resource only if the resource is currently owned by that queue's family. (In practice, we can skip the ownership transfer when we do not care about the previous contents of the resource, although we should still make sure that the resource is not being modified from two families at once; also, the first queue family to ever access a resource gains ownership implicitly, without the need for any transfer.) Concurrent resources may come with some overhead, so for performance it is safer to use exclusive resources and manual ownership transfers. Note that ownership transfers and resource transfers are different concepts: the former are handled through specialized memory barriers (a pair of them, actually: one for releasing ownership and one for acquiring it, plus something like a semaphore to ensure that the release indeed happens before the acquisition; my mental model for rationalizing this is that we push changes from the cache used by the queue family we release from, and pull them into the cache of the queue family we acquire from), while the latter are handled through special transfer commands.
Resource definition and resource initialization are not done in a single step. When we create the handle for a resource, we describe all the relevant information about the structure of this resource (e.g., its size), but we still have to create an actual binding to memory. The data may live in any memory heap offered by the device. Note that resource handles and memory mate for life (except for the "sparse" variants of resources which may even be spread across multiple sections of memory, but are not supported by all GPUs).
Each resource is tied to a set of supported usages (which is declared upfront and can never be changed). For instance, there is a special usage flag for indicating that a resource is a valid target for data transfers (there is also one for valid sources, and many others that we will meet in due time).
In addition to actual transfer operations (with names such as vkCmdCopyBuffer or vkCmdCopyImageToBuffer), there are other operations that are registered as transfer operations even though they do not necessarily imply any real transfer. These functions are still about updating the contents of resources in some way (e.g., vkCmdFillBuffer fills a region of a buffer with copies of a single value, and vkCmdBlitImage can be used to create a scaled version of an image).
B. A deeper dive
B.1. Memories
B.1.1. Heaps and types
A Vulkan device exposes multiple memories with different characteristics. vkGetPhysicalDeviceMemoryProperties returns information about the memories owned by a physical device as a set of VkMemoryHeaps (the actual memories) and a set of VkMemoryTypes (behaviors that allocations can adopt). Although each memory type refers to a single heap, a heap may be referred to by several types (distinct allocations on this heap may be attached to one of several distinct behaviors).
The main information found in a VkMemoryHeap is its size and whether it is "device-local", which is short for "among the fastest available heaps for accesses from this device". I usually think of device-local heaps as heaps that are physically located on the device, although this intuition is imperfect: each device is required to include at least one device-local heap, so even devices that really are CPU-side emulations of GPUs will include such memories. VkMemoryTypes are more complex. They include a bitmask for flags of the form VK_MEMORY_PROPERTY_XXX_BIT that specify if the memory is:
- DEVICE_LOCAL: these types have the fastest access times (from the device). Non-DEVICE_LOCAL types represent slower mechanisms that interact with CPU-side memory, requiring some synchronization. This flag is a prerequisite for:
  - LAZILY_ALLOCATED: used for tiled architectures, where transient resources do not need to be allocated in full, as the memory allocated for managing one tile can be reused for different portions of the image (the system handles this automatically). Precludes HOST_VISIBLE.
- HOST_VISIBLE: mappable from the CPU (precludes LAZILY_ALLOCATED); always comes with at least one of the following specializations:
  - HOST_COHERENT: mapped memory is kept in sync automatically, although at the cost of eager and oftentimes excessive synchronization by the driver (leading to degraded performance).
  - HOST_CACHED: mapped memory is cached on the host, making CPU-side access efficient, though the memory may fall out-of-sync; requires manual synchronization.
The precise meaning of different flag combinations varies subtly from vendor to vendor. This blog post by Adam Sawicki gives concrete information as to the situation on the ground.
Each device is guaranteed to include one type that is both HOST_VISIBLE and HOST_COHERENT, as well as one that is DEVICE_LOCAL (these requirements are compatible, so there may be only one type in total).
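In practice, choosing among the reported types boils down to a bitmask scan: we look for a type that is both allowed for the resource at hand (a bitmask of allowed type indices, as reported later by functions such as vkGetBufferMemoryRequirements) and that carries all the property flags we want. A minimal selection helper, written here against plain integers rather than the actual VkPhysicalDeviceMemoryProperties structures (a sketch; the flag values 0x1, 0x2 and 0x4 used in the example below match the real values of DEVICE_LOCAL, HOST_VISIBLE and HOST_COHERENT):

```c
#include <stdint.h>

/* Returns the index of the first memory type that is allowed by
 * `allowed_type_bits` (a bitmask where bit i stands for type i) and
 * whose flags include all of `required_flags`; -1 if none qualifies.
 * `type_flags[i]` plays the role of VkMemoryType::propertyFlags. */
int find_memory_type(uint32_t type_count, const uint32_t *type_flags,
                     uint32_t allowed_type_bits, uint32_t required_flags) {
    for (uint32_t i = 0; i < type_count; i++) {
        if ((allowed_type_bits & (1u << i)) &&
            (type_flags[i] & required_flags) == required_flags)
            return (int)i;
    }
    return -1;
}
```

For instance, with three types whose flags are {DEVICE_LOCAL, HOST_VISIBLE|HOST_COHERENT, DEVICE_LOCAL|HOST_VISIBLE|HOST_COHERENT}, asking for a HOST_VISIBLE|HOST_COHERENT type returns index 1.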
B.1.2. Allocating memory
We use vkAllocateMemory to reserve a memory chunk of a specific size in some heap through a specific memory type, as discussed above. Successful allocations return valid VkDeviceMemory handles. In Vulkan, resources such as buffers or images are created in two steps: the declaration of the resource and the binding to the memory that holds the actual data. This binding step requires one such device memory object and an offset within it. When we are done using it, we deallocate the memory chunk using vkFreeMemory.
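As a sketch (assuming a valid VkDevice `device` and a memory type index `memory_type` picked beforehand; error handling omitted, and this cannot run without an actual device and driver), an allocation might look like:

```c
#include <vulkan/vulkan.h>

/* Sketch: reserve 64 MiB through memory type `memory_type` (an index
 * into the types returned by vkGetPhysicalDeviceMemoryProperties). */
VkDeviceMemory allocate_chunk(VkDevice device, uint32_t memory_type) {
    VkMemoryAllocateInfo info = {
        .sType           = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO,
        .allocationSize  = 64u << 20,   /* allocate in bulk, as advised */
        .memoryTypeIndex = memory_type, /* picks the type, hence the heap */
    };
    VkDeviceMemory memory;
    vkAllocateMemory(device, &info, NULL, &memory);
    return memory;
    /* eventually: vkFreeMemory(device, memory, NULL); */
}
```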
Creating Vulkan memory allocations is slow and subject to limitations. Most drivers only support a limited number of simultaneous allocations (at least 4096 and rarely much more; check vkGetPhysicalDeviceProperties's limits field for the actual limit of a concrete device). Even relatively simple tasks often make use of more resources than this. Consequently, the naive strategy of allocating memory once per resource is not viable. Instead, we must sub-allocate: we reserve memory in bulk and manually split it into different sections dedicated to individual resources.
The default memory allocator used by Vulkan is simple and naive. We can import a third-party allocator such as Vulkan Memory Allocator to handle sub-allocation mostly transparently. See this blog post by Kyle Halladay for a walkthrough of a custom allocator implementation (it starts with general information about memory types in the wild).
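When sub-allocating by hand, one recurring detail is alignment: each resource placed inside a big chunk must start at an offset that is a multiple of the alignment the driver reports for it (through the memory requirement queries we meet below). Since Vulkan alignment requirements are powers of two, the rounding is a one-liner:

```c
#include <stdint.h>

/* Rounds `offset` up to the next multiple of `alignment` (which must be
 * a power of two, as Vulkan alignment requirements always are). Used to
 * place consecutive resources inside one big VkDeviceMemory chunk. */
uint64_t align_up(uint64_t offset, uint64_t alignment) {
    return (offset + alignment - 1) & ~(alignment - 1);
}
```

For instance, placing a resource with alignment 64 right after a resource ending at offset 257 means starting it at align_up(257, 64) = 320.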
B.1.3. Mapping memory
We use mapping to exchange data between the CPU and the GPU. Remember that HOST_VISIBLE memories can be written to directly from the CPU (the connection to the GPU is handled implicitly). vkMapMemory creates a mapping for a certain slice of a memory object, returning a (classical CPU) pointer through its ppData argument. The CPU can then write through this pointer and the changes are reflected in the bound GPU memory. vkUnmapMemory undoes the mapping.
Memory that is HOST_VISIBLE but not HOST_COHERENT needs to be synced manually. We use vkFlushMappedMemoryRanges to push the CPU's writes to the device, whereas vkInvalidateMappedMemoryRanges pulls device-side changes so that they become visible to the CPU. In effect, the driver calls these functions implicitly for HOST_COHERENT memory (at the risk of inserting unnecessary calls to these costly functions).
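Putting the pieces together, a typical upload through a mapping might look like the following sketch (assuming a valid `device` and a `memory` allocation from a HOST_VISIBLE type; error handling omitted, and this cannot run without an actual device):

```c
#include <string.h>
#include <vulkan/vulkan.h>

/* Sketch: copy `size` bytes from `src` into the start of `memory`.
 * The flush is only required when the memory type is not HOST_COHERENT
 * (it is harmless, if wasteful, otherwise). */
void upload_to_mapped(VkDevice device, VkDeviceMemory memory,
                      const void *src, VkDeviceSize size) {
    void *ptr;
    vkMapMemory(device, memory, 0 /* offset */, size, 0 /* flags */, &ptr);
    memcpy(ptr, src, size);

    VkMappedMemoryRange range = {
        .sType  = VK_STRUCTURE_TYPE_MAPPED_MEMORY_RANGE,
        .memory = memory,
        .offset = 0,
        .size   = VK_WHOLE_SIZE,
    };
    vkFlushMappedMemoryRanges(device, 1, &range); /* push CPU writes */
    vkUnmapMemory(device, memory);
}
```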
B.2. Resources
Mapping only works for the typically slow HOST_VISIBLE memories. It is required to bring data onto the GPU, but we need something more to move this data to more efficient, non-mappable memory for frequent access. This is what the transfer pipeline is about. The transfer operations do not work on raw memory but on resource objects such as buffers and images. We will soon get to these operations, but first we need to discuss how the resources they work on behave.
In this section, we introduce two kinds of resources: buffers and images. We describe how these resources are defined, but not yet how they are used. We will encounter some of the functions that make use of these resources in the next section.
B.2.1. Buffers
Buffers are a resource type with a bland personality. They consist of linear arrays of data that may correspond to any kind of data. We create buffers using vkCreateBuffer. A buffer is characterized by its size, its intended usage and its queue family ownership mode.
The usage flags of a buffer are set at the time of its creation. These flags determine which operations can be done on the object. For instance, if we want to be able to transfer a buffer to another heap, we need to set the VK_BUFFER_USAGE_TRANSFER_SRC_BIT flag (we also need to create another buffer object on the other heap, this time with the VK_BUFFER_USAGE_TRANSFER_DST_BIT flag).
For synchronization, two options are available: either the buffer is in exclusive mode (VK_SHARING_MODE_EXCLUSIVE), where we accept the burden of ensuring that only one queue family accesses the buffer at a time, or the buffer is in concurrent mode (VK_SHARING_MODE_CONCURRENT), where we declare a set of queue families that are allowed to access the resource, and the driver handles synchronization issues automatically for queues from these families. The concurrent mode may add some overhead and may do some unnecessary transfers behind the scenes, so exclusive mode should be preferred (although the word on the street seems to be that it does not matter all that much, so you can probably get away with some laziness).
Ownership transfers are done through a specialized version of memory barriers (VkBufferMemoryBarriers, more detail in the last section of this chapter). More exactly, ownership is transferred through a pair of barriers: one is passed to the queue releasing ownership, and the other is passed to the queue acquiring it. Of course, it would be silly to acquire ownership before it has been fully released, so we should also synchronize these two operations, which can, e.g., be done with a semaphore. Ownership really is about synchronization between different levels of memory in the hierarchy that goes from local caches to the main memory of a device. The fact that we use a barrier for this purpose should therefore not come as a surprise. Handling pushes to main memory and pulls from local caches is precisely what memory barriers are about. The only thing that may be surprising is what this has to do with queue families. Do queue families share some kind of cache? As it turns out, yes, they do (at least conceptually)! The details can be found in the Vulkan memory model.
A buffer object is not attached to specific memory at the time of its creation. Instead, we must call vkBindBufferMemory to tie a buffer to its data. We use vkGetBufferMemoryRequirements to get information about memory requirements for such bindings: size, alignment and compatible memory types. We must bind the buffer before using it, and there is no way to undo that binding short of destroying the buffer. In particular, it is illegal to bind an already bound buffer.
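The creation-then-binding dance might look like the following sketch (assuming a valid `device` and an existing allocation `memory` whose type is compatible with the reported requirements; error handling omitted, not runnable without a device):

```c
#include <vulkan/vulkan.h>

/* Sketch: create a 1 MiB buffer usable as a transfer source and bind it
 * at the start of `memory`. */
VkBuffer create_staging_buffer(VkDevice device, VkDeviceMemory memory) {
    VkBufferCreateInfo info = {
        .sType       = VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO,
        .size        = 1 << 20,
        .usage       = VK_BUFFER_USAGE_TRANSFER_SRC_BIT,
        .sharingMode = VK_SHARING_MODE_EXCLUSIVE,
    };
    VkBuffer buffer;
    vkCreateBuffer(device, &info, NULL, &buffer);

    VkMemoryRequirements reqs;
    vkGetBufferMemoryRequirements(device, buffer, &reqs);
    /* reqs.size may exceed info.size, reqs.alignment constrains the
     * offset (0 trivially satisfies it), and reqs.memoryTypeBits
     * restricts which memory types `memory` may come from. */
    vkBindBufferMemory(device, buffer, memory, 0 /* offset */);
    return buffer;
}
```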
Sparse buffers
We must bind classical resources contiguously to one memory object before their first use, and this binding is final. Sparse resources such as sparse buffers relax these restrictions: they can bind non-contiguously to multiple memory objects, and we may update their bindings throughout their lifetime.
Not all devices support sparse binding. To check if a specific physical device does, we check the result of vkGetPhysicalDeviceFeatures. To create a sparse buffer, we pass flag VK_BUFFER_CREATE_SPARSE_BINDING_BIT to the buffer at the time of its creation. Additionally, we may pass the VK_BUFFER_CREATE_SPARSE_RESIDENCY_BIT (the buffer may be used even though only part of it is bound) and VK_BUFFER_CREATE_SPARSE_ALIASED_BIT (the same memory may be bound multiple times by the same or different buffers) flags, so long as vkGetPhysicalDeviceFeatures allows it.
vkQueueBindSparse creates or updates a binding. For more details and examples, including information about alignment or the handling of subresources, check this documentation entry.
B.2.2. Images
There are many similarities between images and buffers. We create images with vkCreateImage and bind them to memory using vkBindImageMemory. When creating images, we specify their usage (with a different family of flags but it is the same idea) and their sharing mode (again, exclusive or concurrent). We also check for memory requirements like for buffers, using vkGetImageMemoryRequirements.
However, there are also differences. When creating an image, we do not specify its size directly. Indeed, the image type is quite complex and the driver may store the data of an image in different manners throughout its lifetime, making its memory footprint variable. Here are the properties that are exclusive to images.
- Type: dimensionality of the image (Vulkan supports 1D, 2D and 3D "images")
- Extent: actual dimensions of the image, e.g., 1920x1080
- Format: type of a pixel, e.g., 8 bits for red + 8 for green + 8 for blue (+ 8 for transparency level) or just 16 bits per pixel for a high-precision grayscale image
- Mipmap levels count: number of mipmapping levels
- Layer count: actual number of images (an image may stand for an array of images, referred to as layers; e.g., cubemaps use 6 layers)
- Sample count: number of samples per pixel (used for multisampling). You may find this field to be out of place. After all, the number of samples is only relevant internally to the renderer and we do not commonly handle images with multiple samples per pixel. However, Vulkan gives us access to the internals of the renderer, including stages where we forward images with multisampling information before resolving them.
- Tiling: immutable constraints regarding how the data is laid out in memory: linear (pixels are laid out in memory in row-major order, like a classical array) or optimized (the driver may do anything with the memory). In practice, we always use optimized (we do not even import images from the CPU as linear images but as buffers; linear is slow and comes with a bunch of restrictions)
- Initial layout: just like tiling, layout has to do with the organization of memory. Unlike tiling, it can change over time (we will see how at the end of this chapter). Layouts reflect the intended usage of an image at the current time. For instance, we would use VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL to indicate that an image is the source in a transfer (whatever changes in layout mean for the hardware — maybe the data is shuffled around in memory, maybe the actual memory layout is left unchanged). Assuming that the tiling is set to optimized, the initial layout of an image may only be the undefined one (and we should always use optimized tiling)
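A typical image creation call, putting these properties together, might look like this sketch (assuming a valid `device`; error handling omitted, not runnable without a device):

```c
#include <vulkan/vulkan.h>

/* Sketch: create a 1920x1080 RGBA image meant to live in fast device
 * memory, be filled through a transfer, and then be sampled. */
VkImage create_sampled_image(VkDevice device) {
    VkImageCreateInfo info = {
        .sType         = VK_STRUCTURE_TYPE_IMAGE_CREATE_INFO,
        .imageType     = VK_IMAGE_TYPE_2D,
        .format        = VK_FORMAT_R8G8B8A8_SRGB,
        .extent        = { .width = 1920, .height = 1080, .depth = 1 },
        .mipLevels     = 1,
        .arrayLayers   = 1,
        .samples       = VK_SAMPLE_COUNT_1_BIT,
        .tiling        = VK_IMAGE_TILING_OPTIMAL,
        .usage         = VK_IMAGE_USAGE_TRANSFER_DST_BIT |
                         VK_IMAGE_USAGE_SAMPLED_BIT,
        .sharingMode   = VK_SHARING_MODE_EXCLUSIVE,
        /* With optimized tiling, the initial layout must be undefined. */
        .initialLayout = VK_IMAGE_LAYOUT_UNDEFINED,
    };
    VkImage image;
    vkCreateImage(device, &info, NULL, &image);
    return image;
}
```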
Devices may not support all settings for images. We use vkGetPhysicalDeviceImageFormatProperties to check whether a precise combination of type + format + tiling + usage (+ image create flags, as required for sparse images and some other niche features) is supported by a physical device. If the combination is supported at all, additional information about the maximum possible dimensions of the image, mipmapping levels, layers and samples are returned, as well as an upper bound on the maximum amount of memory that such an image is allowed to use on the device (the system may not report less than 2GiB).
Additionally, vkGetPhysicalDeviceFormatProperties indicates which commands/usages are supported for images of specific formats.
Just like for buffers, queue family ownership transfers work through a specialized version of VkMemoryBarrier called VkImageMemoryBarrier. In addition, this structure is also used for performing layout transitions, as we will discuss in the last section of this chapter.
Mipmapping and subresources
When we use mipmapping, we introduce a bunch of subimages attached to the main resource (like this or like this, depending on how we handle it). vkGetImageSubresourceLayout can be used to get information regarding how these images are laid out in memory, so long as they use linear tiling (which I repeatedly told you to avoid).
See this tutorial for more information about mipmapping (although you should probably finish reading this chapter first).
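As an aside, the number of levels in a full mipmap chain is easy to compute: each level halves both dimensions (rounding down) until reaching 1x1. A small helper (not a Vulkan function; this is the value one would pass as an image's mipmap level count):

```c
#include <stdint.h>

/* Number of mipmap levels in a full chain for a width x height image,
 * i.e., floor(log2(max(width, height))) + 1. */
uint32_t mip_level_count(uint32_t width, uint32_t height) {
    uint32_t largest = width > height ? width : height;
    uint32_t levels = 1;
    while (largest > 1) {
        largest /= 2;
        levels++;
    }
    return levels;
}
```

For example, a 1920x1080 image has an 11-level chain (1920, 960, 480, ..., 3, 1 along its longest dimension).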
Sparse images
Just like there are sparse buffers, there are sparse images. These work in mostly the same way as sparse buffers. Subblocks of the image (and of its subresources) can be split between different bindings at a given level of granularity. See the available flags: more of the same (note that some of the flags on this page are not related to sparse bindings). vkGetImageSparseMemoryRequirements returns information about memory requirements for sparse images.
Again, we use vkQueueBindSparse for handling bindings. For more details and examples, including information about requirements in alignment, check this documentation entry.
B.3. The transfer pipeline
Conceptually, all Vulkan commands submitted through command buffers go through pipelines. The transfer pipeline is the simplest of them all, to the point where calling it a pipeline feels a bit wrong:
The command enters the pipeline, executes (in the Transfer stage, which does the heavy lifting; this is the only stage in this pipeline that feels like an actual stage) and exits the pipeline. That is all!
Transfer operations are fundamental, to the point where they are supported by all queues. There are distinct commands for handling buffers and images, as introduced below.
B.3.1. Buffer operations
Buffer transfer commands do not operate on full buffers but on regions thereof. A buffer region is characterized by an offset and a size. To work on the full buffer, we set the offset of a region to 0 and the size to that of the full buffer. Regions that are read from should not overlap with regions that are written to; otherwise, we lose all guarantees regarding the values read from the region where the overlap occurs.
Only a handful of commands are available for buffers:
- vkCmdCopyBuffer: copies the contents of regions of a buffer to regions of another buffer.
- vkCmdFillBuffer: fills a region of a buffer with repeated copies of a 32 bit value.
- vkCmdUpdateBuffer: refreshes a region of a buffer by pulling in changes from host memory (this only works for small amounts of data: up to 65536 bytes).
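A copy between two buffer regions can be recorded as in the following sketch (assuming a command buffer `cmd` in the recording state, a source buffer with TRANSFER_SRC usage and a destination buffer with TRANSFER_DST usage; not runnable without a device):

```c
#include <vulkan/vulkan.h>

/* Sketch: record a copy of `size` bytes from the start of `src` to the
 * start of `dst`. */
void record_buffer_copy(VkCommandBuffer cmd, VkBuffer src, VkBuffer dst,
                        VkDeviceSize size) {
    VkBufferCopy region = {
        .srcOffset = 0, /* region offsets within each buffer */
        .dstOffset = 0,
        .size      = size,
    };
    vkCmdCopyBuffer(cmd, src, dst, 1, &region);
}
```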
B.3.2. Image operations
The notion of region gets more complex for images. Image regions may be 1D, 2D or 3D, and each region is tied to a specific subresource of the image.
The commands available for images are:
- vkCmdCopyImage: copies the contents of regions of an image to regions of another image.
- vkCmdBlitImage: a powerful image manipulation command (for images with only one sample, of a supported format). It is a copy command with a twist: the source and destination regions need not have the same size! This comes in handy for building mipmaps. Different filters may be specified for scaling (note that linear filtering is only available if the format supports VK_FORMAT_FEATURE_SAMPLED_IMAGE_FILTER_LINEAR_BIT).
- vkCmdClearColorImage: overwrites regions of an image with the requested color.
- vkCmdClearDepthStencilImage: overwrites regions of images representing depth/stencil information with the requested depth and stencil values (e.g., the value standing for "no depth information").
- vkCmdResolveImage: resolves multisampling for regions of an image by leaving only one sample with an appropriate value per pixel. This command takes two important inputs: a source image with multiple samples and a destination image with only one sample.
Additionally, there is a pair of commands for turning buffers into images and the other way around:
- vkCmdCopyBufferToImage: copies regions of a buffer to regions of an image.
- vkCmdCopyImageToBuffer: copies regions of an image to regions of a buffer.
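As a sketch of the buffer-to-image direction (vkCmdCopyBufferToImage, assuming a recording command buffer `cmd`, a staging buffer with TRANSFER_SRC usage, and an image already in the TRANSFER_DST_OPTIMAL layout; not runnable without a device), uploading tightly packed pixels into the base mip level of a 2D image might look like:

```c
#include <vulkan/vulkan.h>

/* Sketch: record an upload of width x height pixels from `staging` into
 * mip level 0, layer 0 of `image`. */
void record_upload(VkCommandBuffer cmd, VkBuffer staging, VkImage image,
                   uint32_t width, uint32_t height) {
    VkBufferImageCopy region = {
        .bufferOffset      = 0,
        .bufferRowLength   = 0, /* 0: rows are tightly packed */
        .bufferImageHeight = 0,
        .imageSubresource  = {
            .aspectMask     = VK_IMAGE_ASPECT_COLOR_BIT,
            .mipLevel       = 0,
            .baseArrayLayer = 0,
            .layerCount     = 1,
        },
        .imageOffset = { 0, 0, 0 },
        .imageExtent = { width, height, 1 },
    };
    vkCmdCopyBufferToImage(cmd, staging, image,
                           VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL,
                           1, &region);
}
```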
B.4. Specialized synchronization primitives
Here are the last two synchronization mechanisms we did not see in the previous chapter, VkBufferMemoryBarrier (mostly useless, bar for ownership transfers) and VkImageMemoryBarrier (used for ownership transfers and layout transitions).
These constructs are variants of VkMemoryBarrier, which we met previously. In theory, they can be used as memory barriers that only concern a single specific resource. In practice, this feature is not supported by any major driver (at least according to this or that), and the resulting behavior is that of a classical memory barrier, so we are just as well off using classical barriers for this purpose (we would not be worse off using the specialized ones either, however, and exotic or future devices could benefit from them, so why not?).
Both of these primitives are executed by passing them as arguments to vkCmdPipelineBarrier.
B.4.1. Buffer barriers
In practice, we only use VkBufferMemoryBarrier for one purpose: ownership transfers of buffers or subsets thereof. We only need to specify the index of the previous owner (remember that we are dealing with queue families) and that of the new owner and we are set (except that we need to do it twice, once for release and once for acquisition, and we also need to add synchronization between these two steps, as discussed above).
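The releasing half of such a transfer might be recorded as in this sketch (the family indices, the access and stage masks, and the command buffer `cmd` are illustrative assumptions; not runnable without a device):

```c
#include <vulkan/vulkan.h>

/* Sketch: release ownership of a whole buffer from `src_family` to
 * `dst_family`. The matching acquire barrier (same structure, with the
 * access-mask roles swapped) must be recorded on a queue of
 * `dst_family`, with e.g. a semaphore ordering the two submissions. */
void record_release(VkCommandBuffer cmd, VkBuffer buffer,
                    uint32_t src_family, uint32_t dst_family) {
    VkBufferMemoryBarrier release = {
        .sType               = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER,
        .srcAccessMask       = VK_ACCESS_TRANSFER_WRITE_BIT,
        .dstAccessMask       = 0, /* ignored on the releasing side */
        .srcQueueFamilyIndex = src_family,
        .dstQueueFamilyIndex = dst_family,
        .buffer              = buffer,
        .offset              = 0,
        .size                = VK_WHOLE_SIZE,
    };
    vkCmdPipelineBarrier(cmd,
                         VK_PIPELINE_STAGE_TRANSFER_BIT,
                         VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT,
                         0, 0, NULL, 1, &release, 0, NULL);
}
```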
B.4.2. Image barriers
Just like buffer barriers, VkImageMemoryBarrier can be used for ownership transfers. However, this primitive has an additional trick up its sleeve, namely layout transitions.
When we want to change the current role of an image resource, we do a layout transition. To actually do a layout transition, we simply specify the original layout of the image as well as its new one. Note that layout transitions happen in local cache: we need to make sure that the local cache is in sync with the main memory before running the operation. Additionally, its effects are not automatically pushed to main memory. The layout transition occurs after the commands that the barrier waits on are done executing, and before the instructions waiting on the barrier run.
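A common layout transition, preparing a freshly created image to receive a copy, might be recorded as in this sketch (stage and access masks are illustrative assumptions for this particular transition; not runnable without a device):

```c
#include <vulkan/vulkan.h>

/* Sketch: transition a single-mip, single-layer color image from the
 * UNDEFINED layout to TRANSFER_DST_OPTIMAL. The IGNORED family indices
 * mean that no ownership transfer takes place here. */
void record_transition(VkCommandBuffer cmd, VkImage image) {
    VkImageMemoryBarrier barrier = {
        .sType               = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER,
        .srcAccessMask       = 0, /* previous contents are discarded */
        .dstAccessMask       = VK_ACCESS_TRANSFER_WRITE_BIT,
        .oldLayout           = VK_IMAGE_LAYOUT_UNDEFINED,
        .newLayout           = VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL,
        .srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED,
        .dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED,
        .image               = image,
        .subresourceRange    = {
            .aspectMask     = VK_IMAGE_ASPECT_COLOR_BIT,
            .baseMipLevel   = 0,
            .levelCount     = 1,
            .baseArrayLayer = 0,
            .layerCount     = 1,
        },
    };
    vkCmdPipelineBarrier(cmd,
                         VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT,
                         VK_PIPELINE_STAGE_TRANSFER_BIT,
                         0, 0, NULL, 0, NULL, 1, &barrier);
}
```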