Chapter 2: resources and transfers
In the previous chapter, we discussed the basics of Vulkan. We learned how to:
- See available devices and their properties
- Create a connection to a device
- Send commands to a device
- Synchronize commands
Before turning to concrete examples of workflows such as rendering or computing, we need to introduce resources and transfers. Indeed, all the interesting operations involving GPUs are about data (resources), and we require a way of exchanging data between the host and the GPU (transfers).
Along the way, we discuss how Vulkan represents memories. The concept of memory in Vulkan is not self-evident. As a start, a single device typically exposes several different memories. We also introduce two important kinds of resources: the generic buffers and the more specialized images.
Remember that Vulkan provides a unified API for interacting with very different forms of GPUs: not only classical GPUs (sometimes called discrete GPUs), but also integrated graphics chipsets or mobile GPUs (see the two foldables below for more information about them). Some aspects of Vulkan that may seem needlessly contrived at first glance are often explained by its role as a facilitator for writing a single program that runs efficiently both on classical GPUs and on more exotic devices.
Integrated graphics chipsets
Integrated graphics chipsets are simple implementations of GPUs that share resources with the CPU. Although they are less powerful than equivalent discrete GPUs, they are cheaper and less power hungry. Such devices typically use CPU-addressable memory directly.
Mobile GPUs and tiled architectures
Mobile GPUs are subject to energy efficiency constraints. Although modern high-end desktop GPUs consume up to 450W of power, mobile phones (CPU + GPU) typically stay below 8W.
A common optimization found in such architectures is tile-based rendering, an approach that limits the amount of memory accesses. Classical GPUs handle rendering for an entire image in one go. In contrast, tiled architectures break up the image into a set of small tiles (typically 16x16 pixels) that go through the pipeline individually, only assembling the results into a whole image at the end. The data required to compute the results for a tile is so limited that it may fit entirely in the cache (which is the key to limiting the memory load).
See this Arm documentation page or this blog post for more information.
A. A high-level overview
A Vulkan device exposes multiple memories with different characteristics. Some memories are only accessible to the GPU, others are physically shared between the GPU and the CPU, others still are physically on the GPU but provide a mechanism for the host to view and set their contents. The actual memories of a device are called memory heaps. Each allocation on a heap is attached to a behavior called a memory type. Each heap supports a limited amount of memory types. Different allocations on the same heap may be attached to different memory types.
We use allocation to reserve a portion of a device's memories (just like allocation can be used to reserve portions of the machine's working memory from the CPU). Memory allocations should not be used carelessly, as they are costly operations. In fact, a device sets strict (and typically quite low) limits on how many concurrent allocations it supports. We will not get away with doing one allocation per object. Instead, we should allocate in bulk and reuse memory as much as possible, be it through manual management or through a third-party custom allocator that handles sub-allocation transparently. Allocations target a memory type (from which we can deduce the targeted heap); memory types really describe the behavior of individual allocations. The next paragraph gives more information as to how these behaviors may differ.
Data exchanges between the CPU and a device occur through the operation known as mapping. Mapping requires a memory heap that is shared between the device and the CPU. Setting up a mapping returns a pointer to CPU-addressable memory, which points either to memory that is physically shared between the CPU and the device, or to memory that is kept in sync with device-side memory, depending on the targeted heap. In the latter case, synchronization is either handled implicitly by the driver or explicitly by us (this depends on the targeted type; it is common to have a heap offering both a type for implicit synchronization and another one for explicit synchronization). Explicit synchronization is more efficient but more cumbersome. To send data to the device, we just write to the portion of memory designated by the pointer (and push those changes to the device explicitly when using explicit synchronization).
Mapping is the only way of sending data from CPU-land to GPU-land. However, the heaps that are mappable are typically not the fastest ones available. To get the data onto faster heaps, we use transfer operations. Loading data onto a GPU is therefore commonly done in two steps: first the import through mapping, and then the transfer to a more efficient heap. Vulkan defines distinct, specialized transfer operations for every type of resource it supports.
In this chapter, we introduce two kinds of resources: buffers and images. Buffers are the simplest kind of resource there is. They are simply wrappers around a linear array of data. Images are more complex. They may be 1D, 2D or 3D, they may be made of different kinds of pixels (e.g., black and white or RGB), they may use mipmapping (with different parameters), they may hold multisampling information and they may be laid out in memory in different ways.
Resources cannot be freely accessed from any queue without synchronization. Synchronization may be handled implicitly (concurrent resources) or explicitly (exclusive resources). For concurrent resources, the queue families granted access must be declared upfront, and the driver handles everything behind the scenes from then on. Exclusive resources, on the other hand, are tied to the notions of ownership and ownership transfers. In general, a queue should access an exclusive resource only if the resource is currently owned by that queue's family. (In practice, we can skip the ownership transfer when we do not care about the previous contents of the resource, although we should still make sure that the resource is not being modified from two families at once; also, the first queue family to ever access a resource gains ownership implicitly, without the need for any transfer.) Concurrent resources may come with some overhead, so for performance it is safer to use exclusive resources and manual ownership transfers. Note that ownership transfers and resource transfers are different concepts: the former are handled through specialized memory barriers (a pair of them, actually: one for releasing ownership and one for acquiring it, plus something like a semaphore to ensure that the release indeed happens before the acquisition; my mental model for rationalizing this is that we push changes from the cache used by the queue family we release from, and pull them into the cache of the queue family we acquire from), while the latter are handled through special transfer commands.
Resource definition and resource initialization are not done in a single step. When we create the handle for a resource, we describe all the relevant information about the structure of this resource (e.g., its size), but we still have to create an actual binding to memory. The data may live in any memory heap offered by the device. Note that resource handles and memory mate for life (except for the "sparse" variants of resources which may even be spread across multiple sections of memory, but are not supported by all GPUs).
Each resource is tied to a set of supported usages (which is declared upfront and can never be changed). For instance, there is a special usage flag for indicating that a resource is a valid target for data transfers (there is also one for valid sources, and many others that we will meet in due time).
In addition to actual transfer operations (with names such as vkCmdCopyBuffer or vkCmdCopyImageToBuffer), there are other operations that are registered as transfer operations even though they do not necessarily imply any real transfer. These functions are still about updating the contents of resources in some way (e.g., vkCmdFillBuffer fills a region of a buffer with copies of a single value, and vkCmdBlitImage can be used to create a scaled version of an image).
B. A deeper dive
B.1. Memories
B.1.1. Heaps and types
A Vulkan device exposes multiple memories with different characteristics. vkGetPhysicalDeviceMemoryProperties returns information about the memories owned by a physical device as a set of VkMemoryHeaps (the actual memories) and a set of VkMemoryTypes (behaviors that allocations can adopt). Although each memory type refers to a single heap, a heap may be referred to by several types (distinct allocations on this heap may be attached to one of several distinct behaviors).
The main information found in a VkMemoryHeap is its size and whether it is "device-local", which is short for "among the fastest available heaps for accesses from this device". I usually think of device-local heaps as heaps that are physically located on the device, although this intuition is imperfect: each device is required to include at least one device-local heap, so even devices that really are CPU-side emulations of GPUs will include such memories. VkMemoryTypes are more complex. They include a bitmask for flags of the form VK_MEMORY_PROPERTY_XXX_BIT that specify if the memory is:
- DEVICE_LOCAL: these types have the fastest access times (from the device). Non-DEVICE_LOCAL types represent slower mechanisms that interact with CPU-side memory, requiring some synchronization. This flag is a prerequisite for:
  - LAZILY_ALLOCATED: used for tiled architectures, where transient resources do not need to be allocated in full, as the memory allocated for managing one tile can be reused for different portions of the image (the system handles this automatically). Precludes HOST_VISIBLE.
- HOST_VISIBLE: mappable from the CPU (precludes LAZILY_ALLOCATED); always comes with at least one of the following specializations:
  - HOST_COHERENT: mapped memory is kept in sync automatically, although at the cost of eager and oftentimes excessive synchronization by the driver (leading to degraded performance).
  - HOST_CACHED: mapped memory is cached on the host, making CPU-side access efficient, though the memory may fall out-of-sync; requires manual synchronization.
The precise meaning of different flag combinations varies subtly from vendor to vendor. This blog post by Adam Sawicki gives concrete information as to the situation on the ground.
Each device is guaranteed to include one type that is both HOST_VISIBLE and HOST_COHERENT, as well as one that is DEVICE_LOCAL (these requirements are compatible, so there may be only one type in total).
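In practice, choosing among the reported types boils down to a bitmask scan: we look for a type that is both allowed for the resource at hand (a bitmask of allowed type indices, as reported later by functions such as vkGetBufferMemoryRequirements) and that carries all the property flags we want. A minimal selection helper, written here against plain integers rather than the actual VkPhysicalDeviceMemoryProperties structures (a sketch; the flag values 0x1, 0x2 and 0x4 used in the example below match the real values of DEVICE_LOCAL, HOST_VISIBLE and HOST_COHERENT):

```c
#include <stdint.h>

/* Returns the index of the first memory type that is allowed by
 * `allowed_type_bits` (a bitmask where bit i stands for type i) and
 * whose flags include all of `required_flags`; -1 if none qualifies.
 * `type_flags[i]` plays the role of VkMemoryType::propertyFlags. */
int find_memory_type(uint32_t type_count, const uint32_t *type_flags,
                     uint32_t allowed_type_bits, uint32_t required_flags) {
    for (uint32_t i = 0; i < type_count; i++) {
        if ((allowed_type_bits & (1u << i)) &&
            (type_flags[i] & required_flags) == required_flags)
            return (int)i;
    }
    return -1;
}
```

For instance, with three types whose flags are {DEVICE_LOCAL, HOST_VISIBLE|HOST_COHERENT, DEVICE_LOCAL|HOST_VISIBLE|HOST_COHERENT}, asking for a HOST_VISIBLE|HOST_COHERENT type returns index 1.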
B.1.2. Allocating memory
We use vkAllocateMemory to reserve a memory chunk of a specific size in some heap through a specific memory type, as discussed above. Successful allocations return valid VkDeviceMemory handles. In Vulkan, resources such as buffers or images are created in two steps: the declaration of the resource and the binding to the memory that holds the actual data. This binding step requires one such device memory object and an offset within it. When we are done using it, we deallocate the memory chunk using vkFreeMemory.
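As a sketch (assuming a valid VkDevice `device` and a memory type index `memory_type` picked beforehand; error handling omitted, and this cannot run without an actual device and driver), an allocation might look like:

```c
#include <vulkan/vulkan.h>

/* Sketch: reserve 64 MiB through memory type `memory_type` (an index
 * into the types returned by vkGetPhysicalDeviceMemoryProperties). */
VkDeviceMemory allocate_chunk(VkDevice device, uint32_t memory_type) {
    VkMemoryAllocateInfo info = {
        .sType           = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO,
        .allocationSize  = 64u << 20,   /* allocate in bulk, as advised */
        .memoryTypeIndex = memory_type, /* picks the type, hence the heap */
    };
    VkDeviceMemory memory;
    vkAllocateMemory(device, &info, NULL, &memory);
    return memory;
    /* eventually: vkFreeMemory(device, memory, NULL); */
}
```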
Creating Vulkan memory allocations is slow and subject to limitations. Most drivers only support a limited number of simultaneous allocations (at least 4096 and rarely much more; check vkGetPhysicalDeviceProperties's limits field for the actual limit of a concrete device). Even relatively simple tasks often make use of more resources than this. Consequently, the naive strategy of allocating memory once per resource is not viable. Instead, we must sub-allocate: we reserve memory in bulk and manually split it into different sections dedicated to individual resources.
The default memory allocator used by Vulkan is simple and naive. We can import a third-party allocator such as Vulkan Memory Allocator to handle sub-allocation mostly transparently. See this blog post by Kyle Halladay for a walkthrough of a custom allocator implementation (it starts with general information about memory types in the wild).
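When sub-allocating by hand, one recurring detail is alignment: each resource placed inside a big chunk must start at an offset that is a multiple of the alignment the driver reports for it (through the memory requirement queries we meet below). Since Vulkan alignment requirements are powers of two, the rounding is a one-liner:

```c
#include <stdint.h>

/* Rounds `offset` up to the next multiple of `alignment` (which must be
 * a power of two, as Vulkan alignment requirements always are). Used to
 * place consecutive resources inside one big VkDeviceMemory chunk. */
uint64_t align_up(uint64_t offset, uint64_t alignment) {
    return (offset + alignment - 1) & ~(alignment - 1);
}
```

For instance, placing a resource with alignment 64 right after a resource ending at offset 257 means starting it at align_up(257, 64) = 320.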
B.1.3. Mapping memory
We use mapping to exchange data between the CPU and the GPU. Remember that HOST_VISIBLE memories can be written to directly from the CPU (the connection to the GPU is handled implicitly). vkMapMemory creates a mapping for a certain slice of a memory object, returning a (classical CPU) pointer through its ppData argument. The CPU can then write through this pointer and the changes are reflected in the bound GPU memory. vkUnmapMemory undoes the mapping.
Memory that is HOST_VISIBLE but not HOST_COHERENT needs to be synced manually. We use vkFlushMappedMemoryRanges to push the CPU's writes to the device, whereas vkInvalidateMappedMemoryRanges pulls device-side changes so that they become visible to the CPU. In effect, the driver calls these functions implicitly for HOST_COHERENT memory (at the risk of inserting unnecessary calls to these costly functions).
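Putting the pieces together, a typical upload through a mapping might look like the following sketch (assuming a valid `device` and a `memory` allocation from a HOST_VISIBLE type; error handling omitted, and this cannot run without an actual device):

```c
#include <string.h>
#include <vulkan/vulkan.h>

/* Sketch: copy `size` bytes from `src` into the start of `memory`.
 * The flush is only required when the memory type is not HOST_COHERENT
 * (it is harmless, if wasteful, otherwise). */
void upload_to_mapped(VkDevice device, VkDeviceMemory memory,
                      const void *src, VkDeviceSize size) {
    void *ptr;
    vkMapMemory(device, memory, 0 /* offset */, size, 0 /* flags */, &ptr);
    memcpy(ptr, src, size);

    VkMappedMemoryRange range = {
        .sType  = VK_STRUCTURE_TYPE_MAPPED_MEMORY_RANGE,
        .memory = memory,
        .offset = 0,
        .size   = VK_WHOLE_SIZE,
    };
    vkFlushMappedMemoryRanges(device, 1, &range); /* push CPU writes */
    vkUnmapMemory(device, memory);
}
```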
B.2. Resources
Mapping only works for the typically slow HOST_VISIBLE memories. It is required to bring data onto the GPU, but we need something more to move this data to more efficient, non-mappable memory for frequent access. This is what the transfer pipeline is about. The transfer operations do not work on raw memory but on resource objects such as buffers and images. We will soon get to these operations, but first we need to discuss how the resources they work on behave.
In this section, we introduce two kinds of resources: buffers and images. We describe how these resources are defined, but not yet how they are used. We will encounter some of the functions that make use of these resources in the next section.
B.2.1. Buffers
Buffers are a resource type with a bland personality. They consist of linear arrays of data that may correspond to any kind of data. We create buffers using vkCreateBuffer. A buffer is characterized by its size, its intended usage and its queue family ownership mode.
The usage flags of a buffer are set at the time of its creation. These flags determine which operations can be done on the object. For instance, if we want to be able to transfer a buffer to another heap, we need to set the VK_BUFFER_USAGE_TRANSFER_SRC_BIT flag (we also need to create another buffer object on the other heap, this time with the VK_BUFFER_USAGE_TRANSFER_DST_BIT flag).
For synchronization, two options are available: either the buffer is in exclusive mode (VK_SHARING_MODE_EXCLUSIVE), where we accept the burden of ensuring that only one queue family accesses the buffer at a time, or the buffer is in concurrent mode (VK_SHARING_MODE_CONCURRENT), where we declare a set of queue families that are allowed to access the resource, and the driver handles synchronization issues automatically for queues from these families. The concurrent mode may add some overhead and may do some unnecessary transfers behind the scenes, so exclusive mode should be preferred (although the word on the street seems to be that it does not matter all that much, so you can probably get away with some laziness).
Ownership transfers are done through a specialized version of memory barriers (VkBufferMemoryBarriers, more detail in the last section of this chapter). More exactly, ownership is transferred through a pair of barriers: one is passed to the queue releasing ownership, and the other is passed to the queue acquiring it. Of course, it would be silly to acquire ownership before it has been fully released, so we should also synchronize these two operations, which can, e.g., be done with a semaphore. Ownership really is about synchronization between different levels of memory in the hierarchy that goes from local caches to the main memory of a device. The fact that we use a barrier for this purpose should therefore not come as a surprise. Handling pushes to main memory and pulls from local caches is precisely what memory barriers are about. The only thing that may be surprising is what this has to do with queue families. Do queue families share some kind of cache? As it turns out, yes, they do (at least conceptually)! The details can be found in the Vulkan memory model.
A buffer object is not attached to specific memory at the time of its creation. Instead, we must call vkBindBufferMemory to tie a buffer to its data. We use vkGetBufferMemoryRequirements to get information about memory requirements for such bindings: size, alignment and compatible memory types. We must bind the buffer before using it, and there is no way to undo that binding short of destroying the buffer. In particular, it is illegal to bind an already bound buffer.
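The creation-then-binding dance might look like the following sketch (assuming a valid `device` and an existing allocation `memory` whose type is compatible with the reported requirements; error handling omitted, not runnable without a device):

```c
#include <vulkan/vulkan.h>

/* Sketch: create a 1 MiB buffer usable as a transfer source and bind it
 * at the start of `memory`. */
VkBuffer create_staging_buffer(VkDevice device, VkDeviceMemory memory) {
    VkBufferCreateInfo info = {
        .sType       = VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO,
        .size        = 1 << 20,
        .usage       = VK_BUFFER_USAGE_TRANSFER_SRC_BIT,
        .sharingMode = VK_SHARING_MODE_EXCLUSIVE,
    };
    VkBuffer buffer;
    vkCreateBuffer(device, &info, NULL, &buffer);

    VkMemoryRequirements reqs;
    vkGetBufferMemoryRequirements(device, buffer, &reqs);
    /* reqs.size may exceed info.size, reqs.alignment constrains the
     * offset (0 trivially satisfies it), and reqs.memoryTypeBits
     * restricts which memory types `memory` may come from. */
    vkBindBufferMemory(device, buffer, memory, 0 /* offset */);
    return buffer;
}
```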
Sparse buffers
We must bind classical resources contiguously to one memory object before their first use, and this binding is final. Sparse resources such as sparse buffers relax these restrictions: they can bind non-contiguously to multiple memory objects, and we may update their bindings throughout their lifetime.
Not all devices support sparse binding. To check if a specific physical device does, we check the result of vkGetPhysicalDeviceFeatures. To create a sparse buffer, we pass flag VK_BUFFER_CREATE_SPARSE_BINDING_BIT to the buffer at the time of its creation. Additionally, we may pass the VK_BUFFER_CREATE_SPARSE_RESIDENCY_BIT (the buffer may be used even though only part of it is bound) and VK_BUFFER_CREATE_SPARSE_ALIASED_BIT (the same memory may be bound multiple times by the same or different buffers) flags, so long as vkGetPhysicalDeviceFeatures allows it.
vkQueueBindSparse creates or updates a binding. For more details and examples, including information about alignment or the handling of subresources, check this documentation entry.
B.2.2. Images
There are many similarities between images and buffers. We create images with vkCreateImage and bind them to memory using vkBindImageMemory. When creating images, we specify their usage (with a different family of flags but it is the same idea) and their sharing mode (again, exclusive or concurrent). We also check for memory requirements like for buffers, using vkGetImageMemoryRequirements.
However, there are also differences. When creating an image, we do not specify its size directly. Indeed, the image type is quite complex and the driver may store the data of an image in different manners throughout its lifetime, making its memory footprint variable. Here are the properties that are exclusive to images.
- Type: dimensionality of the image (Vulkan supports 1D, 2D and 3D "images")
- Extent: actual dimensions of the image, e.g., 1920x1080
- Format: type of a pixel, e.g., 8 bits for red + 8 for green + 8 for blue (+ 8 for transparency level) or just 16 bits per pixel for a high-precision grayscale image
- Mipmap levels count: number of mipmapping levels
- Layer count: actual number of images (an image may stand for an array of images, referred to as layers; e.g., cubemaps use 6 layers)
- Sample count: number of samples per pixel (used for multisampling). You may find this field to be out of place. After all, the number of samples is only relevant internally to the renderer and we do not commonly handle images with multiple samples per pixel. However, Vulkan gives us access to the internals of the renderer, including stages where we forward images with multisampling information before resolving them.
- Tiling: immutable constraints regarding how the data is laid out in memory: linear (pixels are laid out in memory in row-major order, like a classical array) or optimized (the driver may do anything with the memory). In practice, we always use optimized (we do not even import images from the CPU as linear images but as buffers; linear is slow and comes with a bunch of restrictions)
- Initial layout: just like tiling, layout has to do with the organization of memory. Unlike tiling, it can change over time (we will see how at the end of this chapter). Layouts reflect the intended usage of an image at the current time. For instance, we would use VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL to indicate that an image is the source in a transfer (whatever changes in layout mean for the hardware — maybe the data is shuffled around in memory, maybe the actual memory layout is left unchanged). Assuming that the tiling is set to optimized, the initial layout of an image may only be the undefined one (and we should always use optimized tiling)
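A typical image creation call, putting these properties together, might look like this sketch (assuming a valid `device`; error handling omitted, not runnable without a device):

```c
#include <vulkan/vulkan.h>

/* Sketch: create a 1920x1080 RGBA image meant to live in fast device
 * memory, be filled through a transfer, and then be sampled. */
VkImage create_sampled_image(VkDevice device) {
    VkImageCreateInfo info = {
        .sType         = VK_STRUCTURE_TYPE_IMAGE_CREATE_INFO,
        .imageType     = VK_IMAGE_TYPE_2D,
        .format        = VK_FORMAT_R8G8B8A8_SRGB,
        .extent        = { .width = 1920, .height = 1080, .depth = 1 },
        .mipLevels     = 1,
        .arrayLayers   = 1,
        .samples       = VK_SAMPLE_COUNT_1_BIT,
        .tiling        = VK_IMAGE_TILING_OPTIMAL,
        .usage         = VK_IMAGE_USAGE_TRANSFER_DST_BIT |
                         VK_IMAGE_USAGE_SAMPLED_BIT,
        .sharingMode   = VK_SHARING_MODE_EXCLUSIVE,
        /* With optimized tiling, the initial layout must be undefined. */
        .initialLayout = VK_IMAGE_LAYOUT_UNDEFINED,
    };
    VkImage image;
    vkCreateImage(device, &info, NULL, &image);
    return image;
}
```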
Devices may not support all settings for images. We use vkGetPhysicalDeviceImageFormatProperties to check whether a precise combination of type + format + tiling + usage (+ image create flags, as required for sparse images and some other niche features) is supported by a physical device. If the combination is supported at all, additional information about the maximum possible dimensions of the image, mipmapping levels, layers and samples are returned, as well as an upper bound on the maximum amount of memory that such an image is allowed to use on the device (the system may not report less than 2GiB).
Additionally, vkGetPhysicalDeviceFormatProperties indicates which commands/usages are supported for images of specific formats.
Just like for buffers, queue family ownership transfers work through a specialized version of VkMemoryBarrier called VkImageMemoryBarrier. In addition, this structure is also used for performing layout transitions, as we will discuss in the last section of this chapter.
Mipmapping and subresources
When we use mipmapping, we introduce a bunch of subimages attached to the main resource (like this or like this, depending on how we handle it). vkGetImageSubresourceLayout can be used to get information regarding how these images are laid out in memory, so long as they use linear tiling (which I repeatedly told you to avoid).
See this tutorial for more information about mipmapping (although you should probably finish reading this chapter first).
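As an aside, the number of levels in a full mipmap chain is easy to compute: each level halves both dimensions (rounding down) until reaching 1x1. A small helper (not a Vulkan function; this is the value one would pass as an image's mipmap level count):

```c
#include <stdint.h>

/* Number of mipmap levels in a full chain for a width x height image,
 * i.e., floor(log2(max(width, height))) + 1. */
uint32_t mip_level_count(uint32_t width, uint32_t height) {
    uint32_t largest = width > height ? width : height;
    uint32_t levels = 1;
    while (largest > 1) {
        largest /= 2;
        levels++;
    }
    return levels;
}
```

For example, a 1920x1080 image has an 11-level chain (1920, 960, 480, ..., 3, 1 along its longest dimension).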
Sparse images
Just like there are sparse buffers, there are sparse images. These work in mostly the same way as sparse buffers. Subblocks of the image (and of its subresources) can be split between different bindings at a given level of granularity. See the available flags: more of the same (note that some of the flags on this page are not related to sparse bindings). vkGetImageSparseMemoryRequirements returns information about memory requirements for sparse images.
Again, we use vkQueueBindSparse for handling bindings. For more details and examples, including information about requirements in alignment, check this documentation entry.
B.3. The transfer pipeline
Conceptually, all Vulkan commands submitted through command buffers go through pipelines. The transfer pipeline is the simplest of them all, to the point where calling it a pipeline feels a bit wrong:
The command enters the pipeline, executes (in the Transfer stage, which does the heavy lifting; this is the only stage in this pipeline that feels like an actual stage) and exits the pipeline. That is all!
Transfer operations are fundamental, to the point where they are supported by all queues. There are distinct commands for handling buffers and images, as introduced below.
B.3.1. Buffer operations
Buffer transfer commands do not operate on full buffers but on regions thereof. A buffer region is characterized by an offset and a size. To work on the full buffer, we set the offset of a region to 0 and the size to that of the full buffer. Regions that are read from should not overlap with regions that are written to; otherwise, we lose all guarantees regarding the values read from the region where the overlap occurs.
Only a handful of commands are available for buffers:
- vkCmdCopyBuffer: copies the contents of regions of a buffer to regions of another buffer.
- vkCmdFillBuffer: fills a region of a buffer with repeated copies of a 32 bit value.
- vkCmdUpdateBuffer: refreshes a region of a buffer by pulling in changes from host memory (this only works for small amounts of data: up to 65536 bytes).
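A copy between two buffer regions can be recorded as in the following sketch (assuming a command buffer `cmd` in the recording state, a source buffer with TRANSFER_SRC usage and a destination buffer with TRANSFER_DST usage; not runnable without a device):

```c
#include <vulkan/vulkan.h>

/* Sketch: record a copy of `size` bytes from the start of `src` to the
 * start of `dst`. */
void record_buffer_copy(VkCommandBuffer cmd, VkBuffer src, VkBuffer dst,
                        VkDeviceSize size) {
    VkBufferCopy region = {
        .srcOffset = 0, /* region offsets within each buffer */
        .dstOffset = 0,
        .size      = size,
    };
    vkCmdCopyBuffer(cmd, src, dst, 1, &region);
}
```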
B.3.2. Image operations
The notion of region gets more complex for images. Image regions may be 1D, 2D or 3D, and each region is tied to a specific subresource of the image.
The commands available for images are:
- vkCmdCopyImage: copies the contents of regions of an image to regions of another image.
- vkCmdBlitImage: a powerful image manipulation command (for images with only one sample, of a supported format). It is a copy command with a twist: the source and destination regions need not have the same size! This comes in handy for building mipmaps. Different filters may be specified for scaling (note that linear filtering is only available if the format supports VK_FORMAT_FEATURE_SAMPLED_IMAGE_FILTER_LINEAR_BIT).
- vkCmdClearColorImage: overwrites regions of an image with the requested color.
- vkCmdClearDepthStencilImage: overwrites regions of images representing depth/stencil information with the requested depth and stencil values (e.g., the value standing for "no depth information").
- vkCmdResolveImage: resolves multisampling for regions of an image by leaving only one sample with an appropriate value per pixel. This command takes two important inputs: a source image with multiple samples and a destination image with only one sample.
Additionally, there is a pair of commands for turning buffers into images and the other way around:
- vkCmdCopyBufferToImage: copies regions of a buffer to regions of an image.
- vkCmdCopyImageToBuffer: copies regions of an image to regions of a buffer.
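As a sketch of the buffer-to-image direction (vkCmdCopyBufferToImage, assuming a recording command buffer `cmd`, a staging buffer with TRANSFER_SRC usage, and an image already in the TRANSFER_DST_OPTIMAL layout; not runnable without a device), uploading tightly packed pixels into the base mip level of a 2D image might look like:

```c
#include <vulkan/vulkan.h>

/* Sketch: record an upload of width x height pixels from `staging` into
 * mip level 0, layer 0 of `image`. */
void record_upload(VkCommandBuffer cmd, VkBuffer staging, VkImage image,
                   uint32_t width, uint32_t height) {
    VkBufferImageCopy region = {
        .bufferOffset      = 0,
        .bufferRowLength   = 0, /* 0: rows are tightly packed */
        .bufferImageHeight = 0,
        .imageSubresource  = {
            .aspectMask     = VK_IMAGE_ASPECT_COLOR_BIT,
            .mipLevel       = 0,
            .baseArrayLayer = 0,
            .layerCount     = 1,
        },
        .imageOffset = { 0, 0, 0 },
        .imageExtent = { width, height, 1 },
    };
    vkCmdCopyBufferToImage(cmd, staging, image,
                           VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL,
                           1, &region);
}
```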
B.4. Specialized synchronization primitives
Here are the last two synchronization mechanisms we did not see in the previous chapter, VkBufferMemoryBarrier (mostly useless, bar for ownership transfers) and VkImageMemoryBarrier (used for ownership transfers and layout transitions).
These constructs are variants of VkMemoryBarrier, which we met previously. In theory, they can be used as memory barriers that only concern a single specific resource. In practice, this feature is not supported by any major driver (at least according to this or that), and the resulting behavior is that of a classical memory barrier, so we are just as well off using classical barriers for this purpose (we would not be worse off using the specialized ones either, however, and exotic or future devices could benefit from them, so why not?).
Both of these primitives are executed by passing them as arguments to vkCmdPipelineBarrier.
B.4.1. Buffer barriers
In practice, we only use VkBufferMemoryBarrier for one purpose: ownership transfers of buffers or subsets thereof. We only need to specify the index of the previous owner (remember that we are dealing with queue families) and that of the new owner and we are set (except that we need to do it twice, once for release and once for acquisition, and we also need to add synchronization between these two steps, as discussed above).
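The releasing half of such a transfer might be recorded as in this sketch (the family indices, the access and stage masks, and the command buffer `cmd` are illustrative assumptions; not runnable without a device):

```c
#include <vulkan/vulkan.h>

/* Sketch: release ownership of a whole buffer from `src_family` to
 * `dst_family`. The matching acquire barrier (same structure, with the
 * access-mask roles swapped) must be recorded on a queue of
 * `dst_family`, with e.g. a semaphore ordering the two submissions. */
void record_release(VkCommandBuffer cmd, VkBuffer buffer,
                    uint32_t src_family, uint32_t dst_family) {
    VkBufferMemoryBarrier release = {
        .sType               = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER,
        .srcAccessMask       = VK_ACCESS_TRANSFER_WRITE_BIT,
        .dstAccessMask       = 0, /* ignored on the releasing side */
        .srcQueueFamilyIndex = src_family,
        .dstQueueFamilyIndex = dst_family,
        .buffer              = buffer,
        .offset              = 0,
        .size                = VK_WHOLE_SIZE,
    };
    vkCmdPipelineBarrier(cmd,
                         VK_PIPELINE_STAGE_TRANSFER_BIT,
                         VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT,
                         0, 0, NULL, 1, &release, 0, NULL);
}
```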
B.4.2. Image barriers
Just like buffer barriers, VkImageMemoryBarrier can be used for ownership transfers. However, this primitive has an additional trick up its sleeve, namely layout transitions.
When we want to change the current role of an image resource, we do a layout transition. To actually do a layout transition, we simply specify the original layout of the image as well as its new one. Note that layout transitions happen in local cache: we need to make sure that the local cache is in sync with the main memory before running the operation. Additionally, its effects are not automatically pushed to main memory. The layout transition occurs after the commands that the barrier waits on are done executing, and before the instructions waiting on the barrier run.
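A common layout transition, preparing a freshly created image to receive a copy, might be recorded as in this sketch (stage and access masks are illustrative assumptions for this particular transition; not runnable without a device):

```c
#include <vulkan/vulkan.h>

/* Sketch: transition a single-mip, single-layer color image from the
 * UNDEFINED layout to TRANSFER_DST_OPTIMAL. The IGNORED family indices
 * mean that no ownership transfer takes place here. */
void record_transition(VkCommandBuffer cmd, VkImage image) {
    VkImageMemoryBarrier barrier = {
        .sType               = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER,
        .srcAccessMask       = 0, /* previous contents are discarded */
        .dstAccessMask       = VK_ACCESS_TRANSFER_WRITE_BIT,
        .oldLayout           = VK_IMAGE_LAYOUT_UNDEFINED,
        .newLayout           = VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL,
        .srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED,
        .dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED,
        .image               = image,
        .subresourceRange    = {
            .aspectMask     = VK_IMAGE_ASPECT_COLOR_BIT,
            .baseMipLevel   = 0,
            .levelCount     = 1,
            .baseArrayLayer = 0,
            .layerCount     = 1,
        },
    };
    vkCmdPipelineBarrier(cmd,
                         VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT,
                         VK_PIPELINE_STAGE_TRANSFER_BIT,
                         0, 0, NULL, 0, NULL, 1, &barrier);
}
```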