Chapter 2: resources and transfers
This chapter is currently being written
In the previous chapter, we discussed the basics of Vulkan. We learned how to:
- see available devices
- create a connection to a device
- send commands to a device
- add synchronization between commands
Before turning to concrete examples of workflows such as rendering or computing, we need to introduce resources and transfers. Indeed, all the operations that we are interested in running on GPUs are about data (resources), and we need to exchange this data between the host and the GPU (transfers). In particular, we discuss how Vulkan represents memories as well as specific kinds of resources (the generic buffer type and the more specialized image type).
The concept of memory in Vulkan is not self-evident. Vulkan natively supports very different kinds of devices such as classical discrete GPUs, integrated graphics chipsets and mobile GPUs (which are typically based on tiled architectures). One of the main differences between these systems lies in how they handle memory access.
insert high-level overview
Integrated graphics chipsets
Integrated graphics chipsets are simple implementations of GPUs that share resources with the CPU. Although they are less powerful than equivalent discrete (i.e., non-integrated) GPUs, they are cheaper and less power hungry. Such devices typically use the CPU's memory directly.
Mobile GPUs and tiled architectures
Mobile GPUs are subject to energy efficiency constraints. Although modern high-end GPUs consume up to 450W of power, mobile phones (CPU + GPU) typically stay below 8W.
A common optimization found in such architectures is tile-based rendering, an approach that reduces the number of (costly) memory accesses when rendering graphics. Classical GPUs handle rendering for the entire image in one go. In contrast, tiled architectures break up the image into a set of small tiles (typically 16x16 pixels) that go through the pipeline individually; the results are only assembled into a whole image at the end. The data required to compute the results for a tile is so limited that it may fit entirely in the cache.
See this Arm documentation page or this blog post for more information.
1. Memories
1.1. Heaps and types
A Vulkan device exposes multiple memories with different characteristics. vkGetPhysicalDeviceMemoryProperties returns information about the memories offered by a physical device as a set of VkMemoryHeaps (the actual memories) and a set of VkMemoryTypes (behaviors that allocations can adopt). Although a memory type refers to a single heap, a heap may be referred to by several types.
The main information that we can extract from a memory heap is its size and whether it is physically local to the device (making it faster to access within the device; note that each device is required to include at least one such local heap, so even CPU-side emulations of devices will include such "local" memories). Memory types are more complex. They include a bitmask for flags of the form VK_MEMORY_PROPERTY_XXX_BIT that specify if the memory is:
- DEVICE_LOCAL: physically located on the device; this is required for:
  - LAZILY_ALLOCATED: only accessible from within the device
- HOST_VISIBLE: mappable from the CPU (precludes LAZILY_ALLOCATED); always comes with at least one of the following specializations:
  - HOST_COHERENT: mapped memory is kept in sync automatically
  - HOST_CACHED: mapped memory is cached on the host, making CPU-side access efficient, though the memory may fall out-of-sync
DEVICE_LOCAL memories are located on the GPU and are therefore fast to access from it. In contrast, memories that are not DEVICE_LOCAL are slower for the GPU to access: they represent mechanisms that interact with CPU-side memory, with implicit synchronization. LAZILY_ALLOCATED types are used for tiled architectures, where transient resources do not need to be allocated in full, as the memory allocated for managing one tile can be reused for different portions of the image (the system handles this automatically). HOST_COHERENT types are always safe to read from the CPU, although they may lead to unnecessary synchronization by the driver. HOST_CACHED types are typically more efficient but require manual synchronization. There may be types matching both HOST_COHERENT and HOST_CACHED. This is typically the case for types representing mechanisms for interacting with the CPU cache directly.
The meaning of different flag combinations varies subtly from vendor to vendor. This cool blog post by Adam Sawicki gives concrete information as to the situation on the ground.
Each device is guaranteed to include one type that is both HOST_VISIBLE and HOST_COHERENT, as well as one that is DEVICE_LOCAL (there may be a single type as long as it matches both these requirements).
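As a minimal sketch of how these flags are used in practice, the function below picks a memory type that has a given set of properties. To keep the example compilable without the Vulkan headers, it relies on stand-in definitions: MemoryType, MemoryProperties and the MEM_* constants are hypothetical names that mirror VkMemoryType, VkPhysicalDeviceMemoryProperties and the VK_MEMORY_PROPERTY_XXX_BIT flags. The allowed_type_bits parameter is also an assumption: it plays the role of the bitmask of compatible types that Vulkan reports for a resource.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-ins so that the sketch compiles without <vulkan/vulkan.h>.
 * The values match the corresponding VK_MEMORY_PROPERTY_*_BIT flags. */
enum {
    MEM_DEVICE_LOCAL     = 0x01,
    MEM_HOST_VISIBLE     = 0x02,
    MEM_HOST_COHERENT    = 0x04,
    MEM_HOST_CACHED      = 0x08,
    MEM_LAZILY_ALLOCATED = 0x10
};

#define MAX_MEMORY_TYPES 32 /* same value as VK_MAX_MEMORY_TYPES */

typedef struct {
    uint32_t property_flags; /* mirrors VkMemoryType::propertyFlags */
    uint32_t heap_index;     /* mirrors VkMemoryType::heapIndex */
} MemoryType;

typedef struct {
    uint32_t   type_count;
    MemoryType types[MAX_MEMORY_TYPES];
} MemoryProperties;

/* Returns the index of the first memory type that is allowed by
 * allowed_type_bits (bit i set means that type i is compatible) and that
 * has all the flags in required, or -1 if no such type exists. */
static int find_memory_type(const MemoryProperties *props,
                            uint32_t allowed_type_bits,
                            uint32_t required)
{
    assert(props->type_count <= MAX_MEMORY_TYPES);
    for (uint32_t i = 0; i < props->type_count; ++i) {
        int allowed = (int)((allowed_type_bits >> i) & 1u);
        int matches = (props->types[i].property_flags & required) == required;
        if (allowed && matches)
            return (int)i;
    }
    return -1;
}
```

With the real headers, the same loop would run over the memoryTypes array filled in by vkGetPhysicalDeviceMemoryProperties. Returning the first match is a simplification: a real application may want to rank candidates, for instance preferring types whose heap is device-local.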
Manually syncing HOST_VISIBLE memory that is not HOST_COHERENT
vkFlushMappedMemoryRanges makes writes performed by the CPU through a mapping visible to the device, whereas vkInvalidateMappedMemoryRanges makes writes performed by the device visible to the CPU. In effect, HOST_COHERENT memory behaves as if Vulkan called these functions implicitly (at the risk of paying for synchronization that is not actually needed).
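One practical detail when flushing or invalidating by hand: unless the whole allocation is targeted, the range passed to these functions must start at a multiple of the device's nonCoherentAtomSize limit, and its size must either be a multiple of that limit or reach the end of the allocation. The sketch below widens an arbitrary byte range so that it satisfies these rules. Range and aligned_sync_range are hypothetical names introduced for illustration; with the real API, the result would go into the offset and size fields of a VkMappedMemoryRange.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint64_t offset;
    uint64_t size;
} Range;

/* Widens [offset, offset + size) to boundaries that are multiples of atom
 * (the value of nonCoherentAtomSize), clamping the end to the size of the
 * allocation. */
static Range aligned_sync_range(uint64_t offset, uint64_t size,
                                uint64_t atom, uint64_t allocation_size)
{
    assert(atom > 0 && size > 0);
    assert(offset + size <= allocation_size);

    uint64_t begin = (offset / atom) * atom;                   /* round down */
    uint64_t end   = ((offset + size + atom - 1) / atom) * atom; /* round up */
    if (end > allocation_size)
        end = allocation_size; /* a range may stop at the allocation's end */

    Range result = { begin, end - begin };
    return result;
}
```

Widening the range is harmless for correctness (more data than strictly needed gets synchronized), which is why this sketch favors simplicity over precision.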
1.2. Allocating memory
We use vkAllocateMemory to reserve a memory chunk of a specific size through a specific type. When we are done using it, we deallocate the memory chunk using vkFreeMemory.
Vulkan memory allocations are slow and subject to limitations. Most drivers only support a limited number of simultaneous allocations (at least 4096 and rarely much more; see vkGetPhysicalDeviceProperties's limits field to see the concrete limit for a device). Even simple tasks may make use of more resources than there can be allocations. Consequently, a naive strategy allocating memory once per resource is not viable. Instead, we must sub-allocate: we only reserve memory in bulk and manually split it into different sections dedicated to individual resources.
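To make sub-allocation concrete, here is a minimal sketch of the simplest possible strategy, a linear (or "bump") allocator: it hands out successive, suitably aligned offsets inside one large allocation. Arena, arena_alloc and SUBALLOC_FAILED are hypothetical names, and the sketch only does the offset bookkeeping; it assumes that the large block itself was obtained once through vkAllocateMemory and that alignments are powers of two.

```c
#include <assert.h>
#include <stdint.h>

#define SUBALLOC_FAILED UINT64_MAX

typedef struct {
    uint64_t capacity; /* size in bytes of the one big allocation */
    uint64_t head;     /* offset of the first byte that is still free */
} Arena;

/* Rounds value up to the next multiple of alignment (a power of two). */
static uint64_t align_up(uint64_t value, uint64_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (value + alignment - 1) & ~(alignment - 1);
}

/* Reserves size bytes at an offset that is a multiple of alignment.
 * Returns that offset, or SUBALLOC_FAILED if the arena is too full
 * (in which case the arena is left unchanged). */
static uint64_t arena_alloc(Arena *arena, uint64_t size, uint64_t alignment)
{
    uint64_t offset = align_up(arena->head, alignment);
    if (offset > arena->capacity || size > arena->capacity - offset)
        return SUBALLOC_FAILED;
    arena->head = offset + size;
    return offset;
}
```

Each offset returned this way is meant to be used as the memory offset when binding a resource to the big allocation. A linear allocator can only free everything at once (by resetting head to 0), so it fits resources that share a lifetime; general-purpose allocators need free lists or similar structures.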
The default memory allocator used by Vulkan is simple and naive. We can import a third-party allocator such as Vulkan Memory Allocator to perform sub-allocation transparently. See this blog post by Kyle Halladay for a walkthrough of a custom allocator implementation (it starts with interesting general information about memory types in the wild).
1.3. Mapping memory
We use mapping to exchange data between the CPU and the GPU. Remember that HOST_VISIBLE memories can be written to directly from the CPU (the connection to the GPU is handled implicitly). vkMapMemory creates a mapping, returning a (classical CPU) pointer through its ppData argument for a certain slice of a memory. The CPU can then write to this memory and the changes can be observed from the GPU. vkUnmapMemory undoes this mapping.
Mapping only works for the typically slow HOST_VISIBLE memories. It brings data onto the CPU, but we need something more to move this data to more efficient memory for frequent access. This is what the transfer pipeline is about.
2. Resources
2.1. Buffers
Buffers are a resource type with a bland personality. They consist of linear arrays of data and may hold any sort of data. We create buffers using vkCreateBuffer. A buffer is characterized by its size, its intended usage and how accesses from different queue families are synchronized. For instance, to use a buffer as the source of a transfer command, we include VK_BUFFER_USAGE_TRANSFER_SRC_BIT in its usage flags (the buffer on the receiving end needs VK_BUFFER_USAGE_TRANSFER_DST_BIT instead). For synchronization, two options are available: either the buffer is accessible to a single queue family at a time (mode VK_SHARING_MODE_EXCLUSIVE) or it is accessible to a set of queue families (mode VK_SHARING_MODE_CONCURRENT). In the latter case, we specify exactly which queue families may access the object. Buffers with the exclusive sharing mode (i.e., those that are not shared) are very efficient: the device's driver can read and write their memory without additional checks.
A buffer object is not attached to specific memory at the time of its creation. Instead, we must call vkBindBufferMemory to initiate such a binding. vkGetBufferMemoryRequirements is used to get information about memory requirements for such bindings: size, alignment and compatible memory types. We must bind the buffer before using it, and there is no way to undo that binding short of destroying the buffer. In particular, it is illegal to bind an already bound buffer.
Sparse buffers
By default, we must bind resources contiguously to one memory object before their first use and this binding is final. Sparse buffers relax these restrictions: they can be bound non-contiguously to multiple memory objects, and we may update their bindings throughout their lifetime.
Not all devices support sparse binding. To check if a specific physical device does, we need to check the result of vkGetPhysicalDeviceFeatures. To create a sparse buffer, we pass the flag VK_BUFFER_CREATE_SPARSE_BINDING_BIT. Additionally, the VK_BUFFER_CREATE_SPARSE_RESIDENCY_BIT (the buffer may be used even though only part of it is bound) and VK_BUFFER_CREATE_SPARSE_ALIASED_BIT (the same memory may be bound multiple times by the same or different buffers) flags may be specified if vkGetPhysicalDeviceFeatures allows it.
vkQueueBindSparse creates or updates a binding. For more details and examples, including information about alignment or the handling of subresources, check this documentation entry.
2.2. Images
There are similarities between images and buffers. We create images with vkCreateImage and bind them to memory using vkBindImageMemory. When creating images, we specify their usage (with a more specific family of flags than before but it is the same idea) and their sharing (again, exclusive or concurrent). We also check for memory requirements like for buffers, using vkGetImageMemoryRequirements.
However, there are also differences. When creating an image, we do not specify its size in bytes directly. Indeed, the image type is quite complex and the driver may store the data in different manners, each with its own memory footprint. Instead, we describe the image through the following parameters:
- Type: dimensionality of the image (Vulkan supports 1D, 2D and 3D "images")
- Extent: actual dimensions of the image, e.g. 1920x1080
- Format: type of a pixel, e.g. 8 bits for red + 8 for green + 8 for blue (+ 8 for transparency level) or just 16 bits per pixel for a high-precision grayscale image
- Mipmap levels count: number of mipmapping levels
- Layer count: actual number of images (an image may be not just one but an array of images, referred to as layers)
- Sample count: number of samples per pixel (used for multisampling). You may find this field to be out of place. After all, the number of samples is only relevant internally to the renderer and we do not commonly handle images with multiple samples per pixel. However, Vulkan gives us access to the internals of the renderer, including stages where we forward images with multisampling information before resolving them.
- Tiling: immutable constraints regarding how the data is laid out in memory: linear (pixels are laid out in memory in row-major order, like a classical array) or optimized (the driver may do anything with the memory). In practice, we always use optimized (we do not even import images from the CPU as linear images but as buffers; linear is slow and comes with a bunch of restrictions)
- Initial layout: the layout is an additional setting pertaining to the organization of memory. Unlike tiling, it can change over time (we will see how at the end of this chapter). Layouts reflect the intended usage of an image at the current time. For instance, we would use VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL to indicate that an image is the source in a transfer (whatever this means for the hardware: maybe the data is shuffled around in memory, maybe the actual memory layout is left unchanged). Assuming that the tiling is set to optimized, the only sensible initial layout for an image is the undefined one, VK_IMAGE_LAYOUT_UNDEFINED (and you should always use optimized tiling)
Devices may not support all settings for images. vkGetPhysicalDeviceImageFormatProperties can be used to check whether a precise combination of type + format + tiling + usage (+ image create flags, which I did not mention previously as they are only used for certain advanced features) is supported by a physical device. If the combination is supported at all, additional information about the maximum possible dimensions of the image, mipmapping levels, layers and samples is returned, as well as an upper bound on the amount of memory that such an image is allowed to use on the device (the system may not report less than 2GiB).
Mipmapping and subresources
When we use mipmapping, we introduce a bunch of subimages attached to the main resource (like this or like this, depending on how we handle it). vkGetImageSubresourceLayout can be used to get information about these subimages.
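As a small worked example of the arithmetic behind mipmapping, the sketch below computes how many levels a complete mip chain contains (each level halves the dimensions until they reach 1) and the size of one dimension at a given level. Both function names are hypothetical; the formulas follow the usual convention of halving with rounding down, with a minimum of 1.

```c
#include <assert.h>
#include <stdint.h>

/* Number of levels in a complete mip chain: 1 for the base image, plus one
 * per halving of the largest dimension until it reaches 1. */
static uint32_t full_mip_chain_levels(uint32_t width, uint32_t height,
                                      uint32_t depth)
{
    assert(width > 0 && height > 0 && depth > 0);
    uint32_t largest = width;
    if (height > largest) largest = height;
    if (depth > largest)  largest = depth;

    uint32_t levels = 1;
    while (largest > 1) {
        largest /= 2;
        ++levels;
    }
    return levels;
}

/* Size of one dimension at a given mip level (level 0 is the base image). */
static uint32_t mip_dimension(uint32_t base, uint32_t level)
{
    assert(level < 32);
    uint32_t halved = base >> level;
    return halved > 0 ? halved : 1;
}
```

For a 1920x1080 image, this gives 11 levels: 1920x1080, 960x540, 480x270, and so on down to 1x1. The result is the largest value that makes sense for the mipmap levels count; an image may always use fewer levels.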
Sparse images
Just like there are sparse buffers, there are sparse images. These work in mostly the same way as sparse buffers. Subblocks of the image (and of its subresources) can be split between different bindings at a given level of granularity. See the available flags: more of the same (note that some of the flags on this page are not related to sparse bindings). vkGetImageSparseMemoryRequirements returns information about memory requirements for sparse images.
Again, vkQueueBindSparse is used for handling bindings. For more details and examples, including information about alignment requirements, check this documentation entry.
3. The transfer pipeline
Vulkan commands go through pipelines. The transfer pipeline is the simplest of them all, to the point where calling it a pipeline feels a bit wrong.
Transfer operations are fundamental, to the point where every queue that supports graphics or compute operations also supports them.
3.1. Buffer operations
vkCmdFillBuffer fills a region of a buffer with repeated copies of a 32-bit value. vkCmdUpdateBuffer updates a region of a buffer with data taken from host memory (this only works for small amounts of data, at most 65536 bytes). vkCmdCopyBuffer copies the contents of a buffer to another buffer.
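The size limit of vkCmdUpdateBuffer suggests a simple decision rule when uploading data. The sketch below encodes it, together with two further constraints of that command: the destination offset and the data size must both be multiples of 4. UploadPath and choose_upload_path are hypothetical names, and the fallback (copying from a HOST_VISIBLE staging buffer with vkCmdCopyBuffer) is one common approach rather than the only option.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_INLINE_UPDATE_SIZE 65536u /* limit of vkCmdUpdateBuffer, in bytes */

typedef enum {
    UPLOAD_INLINE_UPDATE, /* small enough for vkCmdUpdateBuffer */
    UPLOAD_STAGING_COPY   /* go through a staging buffer and vkCmdCopyBuffer */
} UploadPath;

/* Picks an upload path for data_size bytes written at dst_offset. */
static UploadPath choose_upload_path(uint64_t dst_offset, uint64_t data_size)
{
    assert(data_size > 0);
    int small_enough = data_size <= MAX_INLINE_UPDATE_SIZE;
    int aligned      = (dst_offset % 4 == 0) && (data_size % 4 == 0);
    return (small_enough && aligned) ? UPLOAD_INLINE_UPDATE
                                     : UPLOAD_STAGING_COPY;
}
```

Even when the inline path is allowed, it is not necessarily the better choice: the data is embedded in the command buffer, so this path is mostly meant for small, occasional updates.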
3.2. Image operations
vkCmdCopyImage copies the contents of an image to another image. vkCmdCopyBufferToImage and vkCmdCopyImageToBuffer are used to convert data between buffers and images.
- vkCmdBlitImage: copies regions of a source image into a destination image
- vkCmdClearColorImage
- vkCmdClearDepthStencilImage
- vkCmdResolveImage: resolves multisampling, leaving only one sample with an appropriate value per pixel. Takes a source image with multiple samples and a destination one with only one sample.
Regions
4. Specialized synchronization primitives for resources and layout transitions
- VkBufferMemoryBarrier (mostly useless)
- VkImageMemoryBarrier