Chapter 6: Modern Vulkan

Warning

This chapter is a work in progress (ETA October 2025)

The previous chapters in this guide mostly stuck to vanilla Vulkan 1.0, which was released in 2016. Many things have changed since then: manufacturers have thought up new features for graphics cards, and Vulkan users and developers have proposed changes to the API. These changes are made available through updates to the core specification for changes to the base API and features considered important enough, and through extensions for the rest.

Many of these changes aim at making the lives of developers using Vulkan easier, be it by building alternatives to existing constructs or by introducing higher-level interfaces around GPU concepts. This is a balancing act: we want neither a bloated specification nor a reduction in the amount of control that the API gives us over GPUs. New versions mostly add things, but some features got deprecated over time, meaning that we can still use them, but we should feel bad if we do (see this list of deprecated features).

A. Dynamic rendering

In the graphics chapter, we went over Vulkan's classical rendering pipeline. Although most of that chapter remains valid, an important component thereof is now deprecated: exit render passes, subpasses, and framebuffers; enter dynamic rendering. Dynamic rendering is a more flexible and less verbose interface for rendering, and it comes without any performance cost. In other words, it is plainly a superior interface, which explains the deprecation of render passes and associated constructs.

The main idea behind dynamic rendering is that we can reference rendering attachments directly instead of going through render passes. We ditch vkCmdBeginRenderPass/vkCmdEndRenderPass pairs in favor of vkCmdBeginRendering and vkCmdEndRendering. In addition to the command buffer, the former function takes a VkRenderingInfo structure, where we specify which attachments are available for the current rendering operations: a single depth and a single stencil attachment, and an arbitrary number of color attachments (input attachments are not mentioned, as they are handled entirely through descriptors). We also specify the dimensions of the actual render, a view mask (for use with multiview), and a layer count (for when multiview is not in use). Note that we used to provide similar information via VkRenderPassBeginInfo's framebuffer object. Finally, we provide rendering flags, which we use to suspend/resume rendering operations or to indicate that the draw calls are emitted from secondary command buffers.

We can have multiple draw operations in a single rendering block; all of them use the same depth and stencil attachments (also, although the depth and stencil attachments are considered as distinct attachments for future-proofing reasons, they should point to the same attachment in practice). Color attachments are also shared globally; we control which of them get written to by a rendering operation through the GLSL location keyword only.

We describe (non-input) attachments using VkRenderingAttachmentInfo objects, which specify an image view and the layout it will be in at rendering time, resolve information for multisampling (a resolve mode, a target image view, and a layout; resolving runs after rendering, and the resolve mode gives us a fair amount of control over how it runs), and a pair of load/store operations (plus a clear value for clearing the attachment on load, if this is the load operation we request). VkRenderingAttachmentInfo is pretty much the new VkAttachmentDescription. The main differences are that the image view is referenced directly and that layout transitions are not handled automatically.
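As a rough sketch (assuming Vulkan 1.3 core names; commandBuffer, colorView, depthView, and extent are hypothetical handles created elsewhere), a rendering block might look like this:

```c
// Describe the attachments directly, no render pass or framebuffer involved.
VkRenderingAttachmentInfo colorAttachment = {
    .sType = VK_STRUCTURE_TYPE_RENDERING_ATTACHMENT_INFO,
    .imageView = colorView,
    .imageLayout = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL,
    .loadOp = VK_ATTACHMENT_LOAD_OP_CLEAR,
    .storeOp = VK_ATTACHMENT_STORE_OP_STORE,
    .clearValue = {.color = {{0.0f, 0.0f, 0.0f, 1.0f}}},
};
VkRenderingAttachmentInfo depthAttachment = {
    .sType = VK_STRUCTURE_TYPE_RENDERING_ATTACHMENT_INFO,
    .imageView = depthView,
    .imageLayout = VK_IMAGE_LAYOUT_DEPTH_ATTACHMENT_OPTIMAL,
    .loadOp = VK_ATTACHMENT_LOAD_OP_CLEAR,
    .storeOp = VK_ATTACHMENT_STORE_OP_DONT_CARE,
    .clearValue = {.depthStencil = {1.0f, 0}},
};
VkRenderingInfo renderingInfo = {
    .sType = VK_STRUCTURE_TYPE_RENDERING_INFO,
    .renderArea = {.offset = {0, 0}, .extent = extent},
    .layerCount = 1, // no multiview here, so viewMask stays 0
    .colorAttachmentCount = 1,
    .pColorAttachments = &colorAttachment,
    .pDepthAttachment = &depthAttachment,
};
vkCmdBeginRendering(commandBuffer, &renderingInfo);
// ... bind pipeline and state, record draw calls ...
vkCmdEndRendering(commandBuffer);
```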

When we create pipelines, we use the pNext field of VkGraphicsPipelineCreateInfo to pass a VkPipelineRenderingCreateInfo (and we pass a null handle instead of a render pass object). This structure specifies the formats of the attachments (plus some additional information used only for multiview rendering). This is simpler than having to deal with render pass compatibility.
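A minimal sketch of this setup (colorFormat and depthFormat are hypothetical variables that must match the attachments used at draw time):

```c
// Declare attachment formats at pipeline creation time instead of a render pass.
VkPipelineRenderingCreateInfo renderingCreateInfo = {
    .sType = VK_STRUCTURE_TYPE_PIPELINE_RENDERING_CREATE_INFO,
    .colorAttachmentCount = 1,
    .pColorAttachmentFormats = &colorFormat,
    .depthAttachmentFormat = depthFormat,
};
VkGraphicsPipelineCreateInfo pipelineInfo = {
    .sType = VK_STRUCTURE_TYPE_GRAPHICS_PIPELINE_CREATE_INFO,
    .pNext = &renderingCreateInfo,
    .renderPass = VK_NULL_HANDLE, // no render pass with dynamic rendering
    /* ... shader stages and the rest of the state as usual ... */
};
```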

Since there are no more explicitly encoded dependencies, we have to handle synchronization and transitions ourselves through memory barriers. It used to be that render passes yielded better performance on tiled implementations, but the addition of the VK_IMAGE_LAYOUT_RENDERING_LOCAL_READ image layout brought parity with the render pass-based approach (we need to perform the transitions to this layout explicitly). This layout indicates that only pixel-local accesses are allowed.

Khronos provides two relevant samples: one about forward shading and another one about deferred shading/local reads. Note that in these samples, the features are presented as if they were extensions, although they have now been made part of Vulkan's core.

B. Beyond descriptor sets

Another pain point of Vulkan 1.0 is descriptor sets. We have all of these objects with different lifetimes around, and we are responsible for grouping them into sets based on their update frequency (and tracking the lifetimes of these descriptor set objects can be a pain). They are also subject to some limitations: we cannot update descriptors that are bound in a (not fully executed) command buffer, all descriptors must be valid (even when they are not used), there is a limit on descriptor counts that we can actually hit, etc. Furthermore, the entire descriptor model is quite cryptic: we do not handle memory directly, but instead deal with the arcane concept of descriptor pools.

In this section, we discuss three Vulkan extensions that were proposed to make descriptors more flexible and less magical. The first two extensions are now part of Vulkan's core, and the last one may get the same treatment at some point.

The extensions that we will discuss are the following:

- descriptor indexing, which enables bindless descriptor techniques;
- buffer device address, which lets us manipulate GPU virtual addresses directly;
- descriptor buffers, which replace descriptor sets and pools with plain buffers.

B.1. Descriptor indexing/bindless descriptors

What if we were to store all the required textures for a set of objects in a large array? Then, we could bind this array to a descriptor once, pass each object an appropriate index into this array, and use it to access the object's data from the shader itself. That way, we would not have to constantly bind and unbind texture data. We can actually already do something a bit like that, but the lack of flexibility of descriptor sets limits our options. The descriptor indexing extension, which is now a core part of Vulkan, was built to improve the situation.

Note that, unlike dynamic rendering, using bindless descriptors has a performance cost, though a low one: we just pay for the additional indirection. It is therefore not the default way of doing things, but something that we need to enable explicitly.

Khronos provides a sample showcasing this feature. In that sample, the descriptor ids are passed as per-vertex data (using the flat GLSL keyword to ensure that different indices do not get interpolated). If we do not need our data to vary per-vertex, we can send such ids through push constants instead.

Activation

To enable descriptor indexing, we pass a VkPhysicalDeviceDescriptorIndexingFeatures structure to vkCreateDevice, via VkDeviceCreateInfo's pNext pointer.
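A sketch of what this could look like (the particular features enabled here are illustrative choices; we should first confirm support via vkGetPhysicalDeviceFeatures2):

```c
// Enable a few descriptor indexing features at device creation time.
VkPhysicalDeviceDescriptorIndexingFeatures indexingFeatures = {
    .sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_DESCRIPTOR_INDEXING_FEATURES,
    .shaderSampledImageArrayNonUniformIndexing = VK_TRUE,
    .descriptorBindingSampledImageUpdateAfterBind = VK_TRUE,
    .descriptorBindingPartiallyBound = VK_TRUE,
    .runtimeDescriptorArray = VK_TRUE,
};
VkDeviceCreateInfo deviceInfo = {
    .sType = VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO,
    .pNext = &indexingFeatures, // chained into device creation
    /* ... queues, extensions, other features ... */
};
```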

Update-after-bind

With the descriptor indexing feature active, we can update bound descriptors (we can even update different descriptors from different threads). The GPU driver can only make weaker assumptions for optimizations, but the flexibility gains can easily make up for that. We must activate some flags for each descriptor set layout whose contents we want to update that way:

- VK_DESCRIPTOR_SET_LAYOUT_CREATE_UPDATE_AFTER_BIND_POOL_BIT, via VkDescriptorSetLayoutCreateInfo's flags field;
- VK_DESCRIPTOR_BINDING_UPDATE_AFTER_BIND_BIT (which guarantees that the implementation observes descriptor updates: the most recent version at submission time is used), via a VkDescriptorSetLayoutBindingFlagsCreateInfo structure that we provide through VkDescriptorSetLayoutCreateInfo's pNext field; we provide one flag entry per binding in our layout.

Some more related flags are available, such as VK_DESCRIPTOR_BINDING_PARTIALLY_BOUND_BIT (bound descriptors that are never accessed need not be valid) and VK_DESCRIPTOR_BINDING_VARIABLE_DESCRIPTOR_COUNT_BIT (the size of the last binding's descriptor array is chosen at allocation time).

Similarly, we must create the descriptor pool with the VK_DESCRIPTOR_POOL_CREATE_UPDATE_AFTER_BIND_BIT flag (via VkDescriptorPoolCreateInfo's flags field). Finally, we perform the actual update in the usual way, i.e., using vkUpdateDescriptorSets.
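Putting the pieces together, a sketch (binding is a hypothetical VkDescriptorSetLayoutBinding defined elsewhere):

```c
// A one-binding layout whose descriptors may be updated after binding.
VkDescriptorBindingFlags bindingFlags = VK_DESCRIPTOR_BINDING_UPDATE_AFTER_BIND_BIT;
VkDescriptorSetLayoutBindingFlagsCreateInfo bindingFlagsInfo = {
    .sType = VK_STRUCTURE_TYPE_DESCRIPTOR_SET_LAYOUT_BINDING_FLAGS_CREATE_INFO,
    .bindingCount = 1,
    .pBindingFlags = &bindingFlags, // one entry per binding in the layout
};
VkDescriptorSetLayoutCreateInfo layoutInfo = {
    .sType = VK_STRUCTURE_TYPE_DESCRIPTOR_SET_LAYOUT_CREATE_INFO,
    .pNext = &bindingFlagsInfo,
    .flags = VK_DESCRIPTOR_SET_LAYOUT_CREATE_UPDATE_AFTER_BIND_POOL_BIT,
    .bindingCount = 1,
    .pBindings = &binding,
};
// The pool that the set is allocated from needs a matching flag.
VkDescriptorPoolCreateInfo poolInfo = {
    .sType = VK_STRUCTURE_TYPE_DESCRIPTOR_POOL_CREATE_INFO,
    .flags = VK_DESCRIPTOR_POOL_CREATE_UPDATE_AFTER_BIND_BIT,
    /* ... maxSets and pool sizes ... */
};
```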

Non-uniform indexing

By default, we cannot index into descriptor arrays from shaders any way we want: we must either use a constant index or a so-called "dynamically uniform" one (and we need to ensure that the relevant *ArrayDynamicIndexing physical device features are supported for the latter). For the compute pipeline, dynamic uniformity means that all invocations in the same workgroup share the same value. For the graphics pipeline, the value should be the same for all threads spawned from the same draw command (even across multiple instances). Non-uniform indexing makes all indexing-related restrictions go away: we just have to mark non-dynamically uniform indices with the nonuniformEXT GLSL keyword. We need to activate this feature from GLSL first, which we do with #extension GL_EXT_nonuniform_qualifier : require.
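As an illustrative fragment shader (the names textures, textureId, and uv are made up for the example):

```glsl
#version 450
#extension GL_EXT_nonuniform_qualifier : require

layout(set = 0, binding = 0) uniform sampler2D textures[];
layout(location = 0) flat in uint textureId; // flat: indices must not be interpolated
layout(location = 1) in vec2 uv;
layout(location = 0) out vec4 outColor;

void main() {
    // textureId may differ across invocations of a single draw call,
    // so the index must be marked as non-dynamically uniform.
    outColor = texture(textures[nonuniformEXT(textureId)], uv);
}
```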

B.2. Buffer device address

What if we were able to manipulate GPU virtual addresses from our applications? Then, we could use pointers to read from buffers in Vulkan, with the usual pointer arithmetic tricks applying. In particular, we could use such a feature to build a buffer with all the data required by our shaders. Then, instead of binding different descriptor sets for different meshes, we could just use push constants to provide them with addresses pointing into the relevant portion of this buffer. The buffer itself never needs to be bound to a descriptor set. This is what buffer device addresses are about.

To enable this feature for a specific buffer, we must create it with the VK_BUFFER_USAGE_SHADER_DEVICE_ADDRESS_BIT flag (from VkBufferUsageFlagBits). Similarly, we must create the memory that we eventually bind this buffer to with a VkMemoryAllocateFlagsInfo that includes VK_MEMORY_ALLOCATE_DEVICE_ADDRESS_BIT. We provide this structure via VkMemoryAllocateInfo's pNext. With all of this done, we can query the address of our buffer through vkGetBufferDeviceAddress.
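A condensed sketch of these steps (device, buffer, and the elided allocation details are assumed to exist):

```c
// Create a buffer whose device address we will be allowed to query.
VkBufferCreateInfo bufferInfo = {
    .sType = VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO,
    .size = 65536,
    .usage = VK_BUFFER_USAGE_STORAGE_BUFFER_BIT |
             VK_BUFFER_USAGE_SHADER_DEVICE_ADDRESS_BIT,
};
// ... create the buffer, then allocate memory with the matching flag:
VkMemoryAllocateFlagsInfo allocFlags = {
    .sType = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_FLAGS_INFO,
    .flags = VK_MEMORY_ALLOCATE_DEVICE_ADDRESS_BIT,
};
VkMemoryAllocateInfo allocInfo = {
    .sType = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO,
    .pNext = &allocFlags,
    /* ... allocationSize and memoryTypeIndex from the buffer's requirements ... */
};
// ... allocate and bind the memory, then query the address:
VkBufferDeviceAddressInfo addressInfo = {
    .sType = VK_STRUCTURE_TYPE_BUFFER_DEVICE_ADDRESS_INFO,
    .buffer = buffer,
};
VkDeviceAddress address = vkGetBufferDeviceAddress(device, &addressInfo);
// 'address' can now be handed to shaders, e.g., through a push constant.
```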

To use buffer addresses within GLSL shaders, we need to activate a GLSL extension. We do this by adding a #extension GL_EXT_buffer_reference : require line at the beginning of the appropriate shader file. We then use buffer_reference and buffer_reference_align (the latter being optional, we only use it if we require aligned addresses), which gives us something like:

layout(std430, buffer_reference, buffer_reference_align = 8) readonly buffer Pointers { vec2 positions[]; };

The above example does not describe a buffer but the type of a pointer to a buffer (in particular, we do not provide bindings for such declarations). We use such pointers by referencing them through a more usual object, such as the push constant in the example below:

layout(std430, push_constant) uniform Registers { Pointers pointers; } registers;

We can use buffer references with push constants, uniform buffers and storage buffers alike, and we are responsible for ensuring that the shaders never read from addresses that are not part of addressable buffers. Also, we cannot store images or such resources inside buffers, so we cannot use this feature for all of our descriptors. Once again, Khronos provides a sample for this feature.

B.3. Descriptor buffers

The two extensions presented in the previous subsections punched holes through the classical descriptor set abstraction to enable some neat techniques. What if we were to go further and do away with descriptor sets and pools entirely? This is what descriptor buffers are about. With this extension, we can store descriptors inside normal buffers, and we are responsible for the memory management (though we keep using descriptor set layouts to describe interfaces). By going low-level, we gain some flexibility, though it comes at the price of more complex code.

Descriptor buffers are not part of Vulkan's core. Instead, we must activate this functionality through the VK_EXT_descriptor_buffer device extension (which is not universally supported). Note that this extension builds upon the notion of buffer device addresses described above. Also, as per usual, Khronos provides a sample illustrating this feature.

We create descriptor buffers just like we would create a normal buffer, except that we pass the VK_BUFFER_USAGE_RESOURCE_DESCRIPTOR_BUFFER_BIT_EXT flag to VkBufferCreateInfo. We then store a kind of lookup table for our actual resources in such buffers.

We use vkGetDescriptorSetLayoutSizeEXT to get information about the amount of memory needed to store all descriptors from a given descriptor set layout, and vkGetDescriptorSetLayoutBindingOffsetEXT to get the offset of a binding in that space. Finally, we obtain the data corresponding to a descriptor in a buffer using vkGetDescriptorEXT. This function takes a VkDescriptorDataEXT, which is a union of (mostly) VkDescriptorImageInfo/VkDescriptorAddressInfoEXT objects (one per descriptor kind), and it writes the descriptor at a given address (we have to do some pointer arithmetic to compute this one). With this last step done, we have a working descriptor buffer.

To bind our descriptor buffer to a command buffer, we use vkCmdBindDescriptorBuffersEXT. Then, to index into that buffer, we turn to vkCmdSetDescriptorBufferOffsetsEXT.

We must query and constantly remain mindful of the limits described in VkPhysicalDeviceDescriptorBufferPropertiesEXT (which we get through vkGetPhysicalDeviceProperties2, which behaves just like the deprecated vkGetPhysicalDeviceProperties but also returns information about extensions or new features through a pNext chain).

So, descriptor buffers are much more low-level than the previous extensions. What do we get in exchange for this tedium? As it turns out, they allow us to update descriptors directly from the GPU. This is nice in principle, but the use cases are limited in practice. For more detail on the topic, see this blog post by Khronos.

C. Improving the shaders/pipelines situation

Building a pipeline is costly: all shaders are compiled and optimized as much as possible for a certain rendering/computing operation. We should strive to compile our pipelines in advance to avoid performance hitches. However, pipelines lack modularity, and combinatorial explosions are all too common with this construct.

In this section, we discuss two extensions that are concerned with making pipelines more modular. The first extension (shader objects) is quite radical: it gets rid of the concept of pipelines altogether, and proposes a return to what is basically the OpenGL model. Instead, we build individual shader objects that we link (and cross-optimize) on the fly. On the other hand, graphics pipeline libraries allow us to split pipelines into four pieces, and to combine these pieces on the fly (again, with cross-optimization).

These functionalities are not (yet) part of Vulkan's core. Whether we should use them depends on our targets and priorities.

C.1. Shader objects

The VK_EXT_shader_object device extension makes it possible to specify shaders and state without pipeline objects. The shader objects extension is based on the observation that we do not need to do all of the compilation in one go. Instead, we can compile shaders separately and link/further optimize them based on context only as required (note that in practice, drivers may already optimize things a bit for us behind the scenes). This is closer to the way OpenGL and older APIs work, and it comes with some performance cost (low for linked shaders, higher for unlinked ones, as we will see). Is this performance cost an acceptable tradeoff for avoiding the pipeline combinatorial explosion problem? Your call.

To enable this extension, we must both enable the VK_EXT_SHADER_OBJECT_EXTENSION_NAME device extension (through VkDeviceCreateInfo's ppEnabledExtensionNames field) and pass a VkPhysicalDeviceShaderObjectFeaturesEXT structure with shaderObject set to VK_TRUE (through VkDeviceCreateInfo's pNext chain).

We create shader objects using vkCreateShadersEXT, which takes a set of VkShaderCreateInfoEXT structures as arguments. This structure specifies all there is to know about the shader: its code, its interface (descriptor set layouts, push constant ranges, specialization constants), and information for linking (in this context, linking mostly means "optimizing a shader based on its successors"). This function returns a set of VkShaderEXT handles.

To link shader objects, we just set the VK_SHADER_CREATE_LINK_STAGE_BIT_EXT flag in the VkShaderCreateInfoEXT objects representing a logical sequence of stages. We can define as many unlinked shader objects as we want in a call to vkCreateShadersEXT, but we can link at most one such sequence of stages per call; the restrictions on mixing linked and unlinked shader objects in a single call are complex, so using one call for each linked sequence of shaders and another one for all the unlinked ones seems to be the easiest way around them.
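As a sketch, creating a linked vertex/fragment pair could look as follows (the SPIR-V blobs, their sizes, and setLayout are hypothetical and assumed to exist; error handling is omitted):

```c
// Two stages created together and linked as one logical sequence.
VkShaderCreateInfoEXT shaderInfos[2] = {
    {
        .sType = VK_STRUCTURE_TYPE_SHADER_CREATE_INFO_EXT,
        .flags = VK_SHADER_CREATE_LINK_STAGE_BIT_EXT,
        .stage = VK_SHADER_STAGE_VERTEX_BIT,
        .nextStage = VK_SHADER_STAGE_FRAGMENT_BIT, // successor for optimization
        .codeType = VK_SHADER_CODE_TYPE_SPIRV_EXT,
        .codeSize = vertSpirvSize,
        .pCode = vertSpirv,
        .pName = "main",
        .setLayoutCount = 1,
        .pSetLayouts = &setLayout,
    },
    {
        .sType = VK_STRUCTURE_TYPE_SHADER_CREATE_INFO_EXT,
        .flags = VK_SHADER_CREATE_LINK_STAGE_BIT_EXT,
        .stage = VK_SHADER_STAGE_FRAGMENT_BIT,
        .codeType = VK_SHADER_CODE_TYPE_SPIRV_EXT,
        .codeSize = fragSpirvSize,
        .pCode = fragSpirv,
        .pName = "main",
        .setLayoutCount = 1,
        .pSetLayouts = &setLayout,
    },
};
VkShaderEXT shaders[2];
vkCreateShadersEXT(device, 2, shaderInfos, NULL, shaders);
```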

We bind shader objects via vkCmdBindShadersEXT. We can bind linked and unlinked objects alike, although we should not expect optimal performance for the latter. When using linked shader objects, we must make sure that we bind every shader present in the link sequence. Bound shaders are used by the following compute/draw calls.

In addition to the shader objects, we also need to provide the state information that was originally passed in the pipeline object. We now consider all of this state dynamic and we set it using the appropriate functions, as described in the spec. Launching an operation on the GPU before all the required state has been bound is an error.

Khronos provides a sample and a blog post for this extension. Also, note that shader objects do not (yet) support ray tracing.

C.2. Graphics pipeline libraries

In the previous section, we discussed a way of getting rid of pipeline objects entirely. That was maybe a bit harsh on them; after all, though they are annoyingly monolithic, they come with good performance once compiled, and we may not want to compromise on performance. Could we not keep pipelines but make them more modular? This would enable all kinds of reuse, which would help with the combinatorial explosion problem. This is what the VK_EXT_graphics_pipeline_library device extension is about.

After loading the extension (by passing VK_EXT_GRAPHICS_PIPELINE_LIBRARY_EXTENSION_NAME via VkDeviceCreateInfo's ppEnabledExtensionNames field), we can start defining independent pieces of pipeline objects. We cannot break up our pipelines in any way we want. Instead, there are four predefined parts (aka libraries; see the spec for more detail):

- the vertex input interface (vertex attributes and input assembly);
- the pre-rasterization shader stages (vertex, tessellation, and geometry shaders);
- the fragment shader stage;
- the fragment output interface (blending, multisampling, attachment interactions).

To create one or several such parts (creating several parts at once does not link them), we use vkCreateGraphicsPipelines the usual way, except that we only provide the information relevant to the part we want to create, which we name explicitly in a VkGraphicsPipelineLibraryCreateInfoEXT structure that we provide via VkGraphicsPipelineCreateInfo's pNext field. If we want to be able to optimize the result of the linking operation for our pipeline parts, we should ask Vulkan to keep additional information about all of them using the VK_PIPELINE_CREATE_RETAIN_LINK_TIME_OPTIMIZATION_INFO_BIT_EXT flag.

The graphics pipeline library extension deprecates vkCreateShaderModule. Instead of using this function, we should just pass our VkShaderModuleCreateInfo directly through VkPipelineShaderStageCreateInfo's pNext field.

To link all parts together, we use a VkPipelineLibraryCreateInfoKHR that we provide to VkGraphicsPipelineCreateInfo via its pNext field. We can (and usually should) provide the VK_PIPELINE_CREATE_LINK_TIME_OPTIMIZATION_BIT_EXT flag to ensure that Vulkan fully optimizes the resulting pipeline. A useful technique here is to use an unoptimized pipeline while waiting for the optimized version to be compiled in the background.
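A sketch of the linking step, assuming the four library handles (vertexInput, preRasterization, fragmentShader, fragmentOutput) and pipelineLayout were created beforehand:

```c
// Combine four previously created pipeline parts into one optimized pipeline.
VkPipeline parts[4] = {vertexInput, preRasterization,
                       fragmentShader, fragmentOutput};
VkPipelineLibraryCreateInfoKHR linkInfo = {
    .sType = VK_STRUCTURE_TYPE_PIPELINE_LIBRARY_CREATE_INFO_KHR,
    .libraryCount = 4,
    .pLibraries = parts,
};
VkGraphicsPipelineCreateInfo pipelineInfo = {
    .sType = VK_STRUCTURE_TYPE_GRAPHICS_PIPELINE_CREATE_INFO,
    .pNext = &linkInfo,
    .flags = VK_PIPELINE_CREATE_LINK_TIME_OPTIMIZATION_BIT_EXT,
    .layout = pipelineLayout,
};
VkPipeline pipeline;
vkCreateGraphicsPipelines(device, VK_NULL_HANDLE, 1, &pipelineInfo,
                          NULL, &pipeline);
```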

If two different pipeline parts access different sets, the compiler may end up doing funky descriptor set aliasing, as it does not have a global view anymore. For instance, if the vertex and the fragment shaders use distinct sets, the driver may only remember that the fragment shader uses one set and that the vertex shader uses one set as well. The fact that these sets are distinct can get lost along the way. We can use the VK_PIPELINE_LAYOUT_CREATE_INDEPENDENT_SETS_BIT_EXT flag for pipeline layouts to tell the compiler to be extra careful about this.

Once again, Khronos provides a sample and a blog post for this extension. Note that a similar technique is supported for ray tracing (as discussed here).

D. Synchronization

Modern Vulkan brings some quality of life features for synchronization. The new features are nice, but the changes are not that impactful for most engine designs. They are now part of Vulkan's core.

D.1. Timeline semaphores

Timeline semaphores are a generalization of semaphores (the GPU-GPU synchronization primitive) and of fences (the CPU-GPU synchronization primitive). They are an extension of semaphores, and we create them by passing a VkSemaphoreTypeCreateInfo structure with its semaphoreType field set to VK_SEMAPHORE_TYPE_TIMELINE via VkSemaphoreCreateInfo's pNext chain. Note that the type of timeline semaphores remains VkSemaphore.

Whereas plain semaphores were basically booleans, timeline semaphores carry a 64-bit unsigned integer payload. We can pick the initial value of this payload at creation time (through VkSemaphoreTypeCreateInfo's initialValue field). We are only ever allowed to increase this value, and we use specific values to represent specific states.
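Creation could look like this (a minimal sketch; device is assumed):

```c
// Create a timeline semaphore with an initial payload of 0.
VkSemaphoreTypeCreateInfo typeInfo = {
    .sType = VK_STRUCTURE_TYPE_SEMAPHORE_TYPE_CREATE_INFO,
    .semaphoreType = VK_SEMAPHORE_TYPE_TIMELINE,
    .initialValue = 0,
};
VkSemaphoreCreateInfo semaphoreInfo = {
    .sType = VK_STRUCTURE_TYPE_SEMAPHORE_CREATE_INFO,
    .pNext = &typeInfo,
};
VkSemaphore timeline; // still a plain VkSemaphore handle
vkCreateSemaphore(device, &semaphoreInfo, NULL, &timeline);
```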

We can interact with timeline semaphores from the GPU (using them as semaphores). For instance, we can set their value to something arbitrary after some task has completed by inserting a VkTimelineSemaphoreSubmitInfo structure into VkSubmitInfo's pNext chain. In this structure, we store a wait (resp. signal) value for each wait (resp. signal) semaphore that we pass to VkSubmitInfo (we must also store values for binary semaphores that way, though in practice we almost never mix timeline and binary semaphores, so this is not a real issue). A wait finishes when the semaphore's payload reaches or exceeds the specified value.

We can also interact with timeline semaphores from the CPU (using them as fences). We wait on them using vkWaitSemaphores (see VkSemaphoreWaitInfo), and we signal them using vkSignalSemaphore (see VkSemaphoreSignalInfo). We can also just look at the current value of a timeline semaphore using vkGetSemaphoreCounterValue.

There may be a device-dependent limit on the maximum value between the current value of a semaphore and that of any pending wait or signal operation, which we can read in the maxTimelineSemaphoreValueDifference field of the VkPhysicalDeviceTimelineSemaphoreProperties structure, which we can obtain via vkGetPhysicalDeviceProperties2.

Khronos provides both a sample and a blog post on this topic.

D.2. Synchronization 2

Synchronization 2 improves pipeline barriers, events, image layout transitions, and queue submissions. It does all of this through the introduction of the VkDependencyInfo structure, which centralizes all barrier information. It defines new commands that make use of this representation, notably vkCmdPipelineBarrier2, vkCmdSetEvent2, vkCmdWaitEvents2, and vkQueueSubmit2.

A VkDependencyInfo is a collection of VkMemoryBarrier2, VkBufferMemoryBarrier2, and VkImageMemoryBarrier2 structures. These look like the barriers we are familiar with, except that we also store stage information in them (as VkPipelineStageFlags2 objects, which themselves are like VkPipelineStageFlags, except that the stages are split differently; note that the top/bottom-of-pipe stages have been replaced by VK_PIPELINE_STAGE_2_NONE_KHR).
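For instance, here is a sketch of an image layout transition written against synchronization 2, making a freshly rendered color attachment readable from fragment shaders (image and commandBuffer are assumed to exist):

```c
// Transition a color attachment so it can be sampled by fragment shaders.
VkImageMemoryBarrier2 barrier = {
    .sType = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER_2,
    .srcStageMask = VK_PIPELINE_STAGE_2_COLOR_ATTACHMENT_OUTPUT_BIT,
    .srcAccessMask = VK_ACCESS_2_COLOR_ATTACHMENT_WRITE_BIT,
    .dstStageMask = VK_PIPELINE_STAGE_2_FRAGMENT_SHADER_BIT,
    .dstAccessMask = VK_ACCESS_2_SHADER_SAMPLED_READ_BIT,
    .oldLayout = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL,
    .newLayout = VK_IMAGE_LAYOUT_READ_ONLY_OPTIMAL,
    .image = image,
    .subresourceRange = {VK_IMAGE_ASPECT_COLOR_BIT, 0, 1, 0, 1},
};
VkDependencyInfo dependencyInfo = {
    .sType = VK_STRUCTURE_TYPE_DEPENDENCY_INFO,
    .imageMemoryBarrierCount = 1,
    .pImageMemoryBarriers = &barrier,
};
vkCmdPipelineBarrier2(commandBuffer, &dependencyInfo);
```

Note how the stage masks live inside the barrier itself, instead of being separate arguments to the barrier command.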

Synchronization 2 introduces vkQueueSubmit2, an alternative submission command that takes a VkSubmitInfo2 argument, which is the same as the VkSubmitInfo, save for its use of VkSemaphoreSubmitInfo structures (for wait and signal alike) and the presence of a flags field. VkSemaphoreSubmitInfo makes timeline semaphores management more natural, and it also defines a deviceIndex field for when we use device groups (with device groups being a core feature that we briefly discuss in the going further chapter — in short, it is about handling distinct physical devices as a single logical one, which is mostly useful for doing things à la NVLink; when we are not using this feature, i.e., almost always, we just leave it at 0).

If we use the now deprecated render passes, we should make use of the VkSubpassDependency2 structure (in practice, we use render passes only if we are targeting mobiles; then, we should actually stick to Vulkan 1.0, so yeah).

Finally, synchronization 2 brings in some new image layouts (VK_IMAGE_LAYOUT_ATTACHMENT_OPTIMAL_KHR and VK_IMAGE_LAYOUT_READ_ONLY_OPTIMAL_KHR). This is just a quality-of-life feature: before, we had to spell out, e.g., VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL in full; now, Vulkan can deduce from context that the attachment the transition applies to is a depth/stencil buffer.

The Vulkan guide contains a page with more information on the topic.

E. Mesh shading

The main idea behind mesh shading is that we can produce primitives from compute-like shaders instead of going through the traditional pre-rasterization steps (including the vertex, geometry, and tessellation shaders). This approach is more flexible than the traditional pipeline, as we can generate primitives however we see fit.

Mesh shading makes use of two kinds of shaders: task shaders and mesh shaders. Both are glorified compute shaders, and they cooperate to generate meshes in parallel within a workgroup. Only mesh shaders are strictly required, and it is from them that the primitives are actually generated. However, mesh shaders are subject to some limitations regarding how many primitives they can output. If we are rendering a very large/detailed object, we should break it down into meshlets first. Meshlets are submeshes small enough to fit within these output limits. Building good meshlets is a costly operation that should almost never be done live (instead, we should store pre-computed meshlet information in the game files). For instance, if we want to render an object built out of 1300 faces and our mesh shaders can output only 256 primitives each, then we should split that object into 6 meshlets.

- Task shader: additional outputs forwarded to the mesh shader (accessible to all created workgroups)
- Mesh shader: additional outputs
- Splitting things into meshlets: costly, should not be done live
- GPU-controlled LOD/tessellation

See the spec and the Khronos sample.

F. Misc

Indirect rendering

Subgroups tutorial

X. Additional resources