Chapter 6: modern Vulkan
In the previous chapters, we mostly discussed core Vulkan 1.0, which was released all the way back in 2016. Since then, Vulkan has gone through four major updates (as of 2026) that improved the developer experience and accommodated new features introduced by GPU vendors. Changes to the base API or new features considered important enough got merged into the core specification, and extensions were introduced to cover features that are either uncommon or still in a state of flux.
Many of these changes aim at making the lives of Vulkan users easier, be it by building alternatives to existing abstractions or by introducing higher-level interfaces around GPU concepts (which is a balancing act: we would not want to end up with a bloated specification, nor to reduce the amount of control over GPUs that Vulkan currently provides). New versions mostly add things, but some features get deprecated over time, meaning that we can still use them, but we should feel bad if we do (see this list of deprecated features).
A. Dynamic rendering
In the graphics chapter, we went over Vulkan's classical rendering pipeline. Although most of that chapter remains valid, some important components thereof are now deprecated: exit render passes, subpasses and framebuffers, enter dynamic rendering. Dynamic rendering is a more flexible and less verbose interface for rendering that comes without any performance cost. In other words, it is a plainly superior abstraction, which explains the deprecation of render passes and associated constructs. An update that actually simplifies things is a rare thing that we should cherish! However, mobile drivers are lagging behind as far as support for modern Vulkan goes, so dynamic rendering can only be recommended for desktop-only development as of 2026.
The main idea behind dynamic rendering is that we reference rendering attachments directly from the command buffer instead of declaring subpasses upfront. We ditch vkCmdBeginRenderPass/vkCmdEndRenderPass pairs in favor of the vkCmdBeginRendering and vkCmdEndRendering commands. When we open such a rendering context, we provide a VkRenderingInfo structure, which plays a role similar to that of VkRenderPassBeginInfo and its framebuffer object. It specifies:
- Which attachments are available to the rendering operations:
  - A single depth and a single stencil attachment. These are considered distinct for future-proofing reasons, though they should almost always point to the same resource in practice (note that some image formats like VK_FORMAT_D32_SFLOAT cover exclusively either depth or stencil info)
  - An arbitrary number of color attachments
  Note that we do not mention input attachments (these are handled differently)
- The render's dimensions
- A view mask (for use with the now core multiview; it provides a simple switch for disabling some of the views on the fly)
- A layer count (only used when the view mask is left to 0)
- The active rendering flags; we use them to suspend/resume rendering operations or to indicate that the draw calls are emitted from secondary command buffers
We describe (non-input) attachments using VkRenderingAttachmentInfo objects, which are pretty much the new VkAttachmentDescription. The main differences are that the image view is referenced directly and that layout transitions are not handled automatically. These objects are made of the following (see the sketch after this list):
- An image view
- The layout the image will be in at the time of the render
- Resolve information for multisampling (a resolve mode, a target image view and a layout; resolution happens after rendering, and the resolve mode gives us a fair amount of control over it)
- A pair of load/store operations (plus a color for clearing the attachment on load, used if we set the load operation to clear)
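As a minimal sketch, beginning and ending a rendering context with one color and one depth attachment could look as follows (commandBuffer, colorView, depthView and extent are assumed to exist, and the images are assumed to already be in the right layouts):

VkRenderingAttachmentInfo colorAttachment{};
colorAttachment.sType = VK_STRUCTURE_TYPE_RENDERING_ATTACHMENT_INFO;
colorAttachment.imageView = colorView;
colorAttachment.imageLayout = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL;
colorAttachment.loadOp = VK_ATTACHMENT_LOAD_OP_CLEAR;
colorAttachment.storeOp = VK_ATTACHMENT_STORE_OP_STORE;
colorAttachment.clearValue.color = {{0.0f, 0.0f, 0.0f, 1.0f}};

VkRenderingAttachmentInfo depthAttachment{};
depthAttachment.sType = VK_STRUCTURE_TYPE_RENDERING_ATTACHMENT_INFO;
depthAttachment.imageView = depthView;
depthAttachment.imageLayout = VK_IMAGE_LAYOUT_DEPTH_ATTACHMENT_OPTIMAL;
depthAttachment.loadOp = VK_ATTACHMENT_LOAD_OP_CLEAR;
depthAttachment.storeOp = VK_ATTACHMENT_STORE_OP_DONT_CARE;
depthAttachment.clearValue.depthStencil = {1.0f, 0};

VkRenderingInfo renderingInfo{};
renderingInfo.sType = VK_STRUCTURE_TYPE_RENDERING_INFO;
renderingInfo.renderArea = {{0, 0}, extent};   // the render's dimensions
renderingInfo.layerCount = 1;                  // view mask left at 0
renderingInfo.colorAttachmentCount = 1;
renderingInfo.pColorAttachments = &colorAttachment;
renderingInfo.pDepthAttachment = &depthAttachment;

vkCmdBeginRendering(commandBuffer, &renderingInfo);
// ... bind the pipeline, emit draw calls ...
vkCmdEndRendering(commandBuffer);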
All draw operations emitted from the same rendering context share their depth and stencil attachments. Color attachments are also shared globally: we control which of them gets written to by a rendering operation through the GLSL location keyword. The location that identifies a color attachment is its index in VkRenderingInfo's list of color attachments.
When creating pipelines, we pass a VkPipelineRenderingCreateInfo object through VkGraphicsPipelineCreateInfo's pNext field. This object specifies the format of the attachments (plus some additional information for multiview rendering). We do not have to deal with render pass compatibility anymore! Also, we should pass a null handle instead of the render pass object we used to provide.
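A sketch of the pipeline side of things (colorFormat and depthFormat are assumed to match the attachments we will render into):

VkPipelineRenderingCreateInfo renderingCreateInfo{};
renderingCreateInfo.sType = VK_STRUCTURE_TYPE_PIPELINE_RENDERING_CREATE_INFO;
renderingCreateInfo.colorAttachmentCount = 1;
renderingCreateInfo.pColorAttachmentFormats = &colorFormat;
renderingCreateInfo.depthAttachmentFormat = depthFormat;

VkGraphicsPipelineCreateInfo pipelineInfo{};
pipelineInfo.sType = VK_STRUCTURE_TYPE_GRAPHICS_PIPELINE_CREATE_INFO;
pipelineInfo.pNext = &renderingCreateInfo;
pipelineInfo.renderPass = VK_NULL_HANDLE;   // no render pass object with dynamic rendering
// ... shader stages, pipeline layout, and the rest of the usual state ...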
Since there are no more explicitly encoded dependencies, we have to handle synchronization and layout transitions ourselves through memory barriers.
It used to be that render passes yielded better performance on tiled implementations, but parity was finally achieved through the addition of the VK_IMAGE_LAYOUT_RENDERING_LOCAL_READ image layout in Vulkan 1.4 (see the proposal page for this feature). This layout indicates that only pixel-local accesses are allowed. Prior to its introduction, there was no way of using input attachments in a dynamic rendering context. We declare these attachments as part of our graphics pipeline's descriptor set layout in the usual way. To refer to a color attachment via an input attachment from GLSL, we set the input_attachment_index qualifier to the position of that color attachment within VkRenderingInfo's pColorAttachments array. For depth/stencil attachments, we declare the input attachment in the standard way, except that we drop the input_attachment_index qualifier altogether (!). There is a way of overriding these default indices, which can be useful when porting applications initially built using subpasses — see the proposal page for more information.
In the graphics chapter, we discussed how we could implement a deferred renderer using subpasses. Let's revisit this example in a dynamic rendering setting. First, we need to define the rendering resources. We need five images: four for the G-buffer, including one for storing the depth information, and one for the final render. We declare these in VkRenderingInfo. Then, for the first pass, we render into the G-buffer images only (explicitly for the color attachments, and the depth attachment is handled automatically). We emit as many draw calls as necessary to cover all visible objects. For the second pass, we load the G-buffer images as input attachments (three of them used to be color attachments, and the last one was the depth attachment). We render a quad that covers the whole screen. This is a neat trick that ensures that the fragment shader for this pass is triggered once per pixel (with a single draw call that does not use the depth attachment). To access the G-buffer contents as input attachments from the quad's fragment shader, we register them as such in the quad pipeline's descriptor set layout, and we bind appropriate descriptor sets. Between the two passes (which are not organized as subpasses), we synchronize things explicitly: we should not read from the inputs in the second pass before the first one is done writing to its color/depth outputs. For this purpose, we use a by-region memory barrier (VK_DEPENDENCY_BY_REGION_BIT), as sketched below. Of course, we should have the attachments corresponding to the G-buffer's contents in the VK_IMAGE_LAYOUT_RENDERING_LOCAL_READ layout.
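Here is a sketch of that barrier, expressed with the synchronization 2 interface covered later in this chapter; since it is recorded inside the rendering scope, we stick to a global memory barrier rather than per-image barriers:

VkMemoryBarrier2 barrier{};
barrier.sType = VK_STRUCTURE_TYPE_MEMORY_BARRIER_2;
barrier.srcStageMask = VK_PIPELINE_STAGE_2_COLOR_ATTACHMENT_OUTPUT_BIT
                     | VK_PIPELINE_STAGE_2_LATE_FRAGMENT_TESTS_BIT;
barrier.srcAccessMask = VK_ACCESS_2_COLOR_ATTACHMENT_WRITE_BIT
                      | VK_ACCESS_2_DEPTH_STENCIL_ATTACHMENT_WRITE_BIT;
barrier.dstStageMask = VK_PIPELINE_STAGE_2_FRAGMENT_SHADER_BIT;
barrier.dstAccessMask = VK_ACCESS_2_INPUT_ATTACHMENT_READ_BIT;

VkDependencyInfo dependencyInfo{};
dependencyInfo.sType = VK_STRUCTURE_TYPE_DEPENDENCY_INFO;
dependencyInfo.dependencyFlags = VK_DEPENDENCY_BY_REGION_BIT;
dependencyInfo.memoryBarrierCount = 1;
dependencyInfo.pMemoryBarriers = &barrier;

// Recorded between the G-buffer draws and the full-screen quad draw.
vkCmdPipelineBarrier2(commandBuffer, &dependencyInfo);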
Khronos provides two relevant samples: one about forward shading and another one about deferred shading/local reads (the code is quite enlightening). Note that in these samples, the features are presented as if they were part of extensions, although they are in Vulkan's core since version 1.4.
B. Beyond descriptor sets
Descriptor sets are another major pain point of Vulkan 1.0: we have to manage heterogeneous collections of objects, and we are responsible for grouping them into sets based on the update frequency of the bindings. Descriptor sets are subject to many limitations: we cannot update descriptors that are bound in a recorded (but not yet executed) command buffer, all descriptor bindings must be valid (even when they end up not being accessed), there is a maximum number of descriptors, etc. Furthermore, the entire descriptor model is quite cryptic: unlike everywhere else in Vulkan, we do not handle memory directly. What is the arcane concept of descriptor pools really hiding?
In this section, we discuss three Vulkan extensions that make descriptors more flexible and less magical. The first two extensions are now part of Vulkan's core, and the last one may get the same treatment at some point:
- Descriptor indexing relaxes constraints about descriptors: it enables updating bound descriptors, non-uniform array indexing, having invalid descriptors bound (so long as they are not used), and it introduces unbounded (variable size) descriptor arrays.
- Buffer device addresses enable accessing buffers directly through a raw address. This lets us build buffers that store arrays of per-object data. When rendering a certain object, we just give it the offset of its data into such an array via a push constant. This brings us even closer to truly bindless rendering.
- The two previous extensions punched holes into the descriptor set abstraction. The descriptor buffers extension gets rid of this abstraction completely. It makes us responsible for handling the storage of descriptors in buffer objects. This enables advanced techniques such as building descriptor buffers from the GPU itself. However, it comes at the cost of a lot of tedium — in most cases, we are better off without this extension.
B.1. Descriptor indexing/bindless descriptors
What if we were to store all textures required for a set of objects in a large array? Then, we could bind this array to a descriptor once, and only pass indices into this array to all objects. That way, we would not have to constantly bind and unbind the texture data. This is actually something that we could do in Vulkan 1.0, but the lack of flexibility of descriptor sets greatly limited our options. The descriptor indexing extension, which is now a core part of Vulkan, improves the situation. Bindless descriptors do have an impact on performance, but it is a small one (just the cost of the additional indirection that arises from the data being stored at some offset in the array).
Khronos provides a sample showcasing this feature. There, the descriptor indices are passed as per-vertex data (using the flat GLSL keyword to disable interpolation: each fragment inherits the exact data of one of its surrounding vertices, the provoking vertex). If we do not need the data to vary per vertex, we can send such indices through push constants instead.
Activation
To enable descriptor indexing, we pass a VkPhysicalDeviceDescriptorIndexingFeatures structure to vkCreateDevice, via VkDeviceCreateInfo's pNext pointer.
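A sketch of what this could look like (in a real application, we would first confirm support by querying the same structure through vkGetPhysicalDeviceFeatures2; only a few of the available feature switches are shown):

VkPhysicalDeviceDescriptorIndexingFeatures indexingFeatures{};
indexingFeatures.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_DESCRIPTOR_INDEXING_FEATURES;
indexingFeatures.descriptorBindingSampledImageUpdateAfterBind = VK_TRUE;
indexingFeatures.descriptorBindingPartiallyBound = VK_TRUE;
indexingFeatures.descriptorBindingVariableDescriptorCount = VK_TRUE;
indexingFeatures.runtimeDescriptorArray = VK_TRUE;
indexingFeatures.shaderSampledImageArrayNonUniformIndexing = VK_TRUE;

VkDeviceCreateInfo deviceCreateInfo{};
deviceCreateInfo.sType = VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO;
deviceCreateInfo.pNext = &indexingFeatures;
// ... queues, extensions, etc., then vkCreateDevice as usual ...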
Update-after-bind
With the descriptor indexing feature active, we can update bound descriptors (we can even update different ones from different threads). We must set some flags for each descriptor set layout whose contents we want to update that way: VK_DESCRIPTOR_SET_LAYOUT_CREATE_UPDATE_AFTER_BIND_POOL_BIT, via VkDescriptorSetLayoutCreateInfo's flags field, and VK_DESCRIPTOR_BINDING_UPDATE_AFTER_BIND_BIT (which guarantees that the implementation observes descriptor updates: the most recent version at submission time is used), via a VkDescriptorSetLayoutBindingFlagsCreateInfo structure that we provide through VkDescriptorSetLayoutCreateInfo's pNext field; this structure holds an array of flags, with one entry per binding in our layout. Some additional flags are available (see the sketch after this list):
- VK_DESCRIPTOR_BINDING_PARTIALLY_BOUND_BIT: allows for having invalid descriptors so long as they are not actively used.
- VK_DESCRIPTOR_BINDING_UPDATE_UNUSED_WHILE_PENDING_BIT: enables updates to descriptors that are not used by an active command buffer; by default, updates are only permitted before submission. Moreover, when the previous flag is also active, the property of being used by a shader is defined dynamically, as opposed to the static default definition.
- VK_DESCRIPTOR_BINDING_VARIABLE_DESCRIPTOR_COUNT_BIT: makes the size of the descriptor array variable; the actual size is only provided when a descriptor set is allocated against this layout (via VkDescriptorSetVariableDescriptorCountAllocateInfo), and the descriptorCount field is interpreted as a maximum — we can only use this for the last binding in a layout.
Similarly, we must create the descriptor pool with the VK_DESCRIPTOR_POOL_CREATE_UPDATE_AFTER_BIND_BIT flag (via VkDescriptorPoolCreateInfo's flags field). Finally, we perform the actual update in the usual way, i.e., using vkUpdateDescriptorSets. With this feature active, the GPU driver can only make weaker assumptions for optimizations, but the flexibility gains can easily make up for that.
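Here is a sketch of a layout holding a single large, update-after-bind, partially bound, variable-sized array of combined image samplers (textureCount is an assumed upper bound):

VkDescriptorSetLayoutBinding binding{};
binding.binding = 0;
binding.descriptorType = VK_DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER;
binding.descriptorCount = textureCount;           // interpreted as a maximum here
binding.stageFlags = VK_SHADER_STAGE_FRAGMENT_BIT;

VkDescriptorBindingFlags bindingFlags =
    VK_DESCRIPTOR_BINDING_UPDATE_AFTER_BIND_BIT |
    VK_DESCRIPTOR_BINDING_PARTIALLY_BOUND_BIT |
    VK_DESCRIPTOR_BINDING_VARIABLE_DESCRIPTOR_COUNT_BIT;

VkDescriptorSetLayoutBindingFlagsCreateInfo bindingFlagsInfo{};
bindingFlagsInfo.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_SET_LAYOUT_BINDING_FLAGS_CREATE_INFO;
bindingFlagsInfo.bindingCount = 1;                 // one flag entry per binding in the layout
bindingFlagsInfo.pBindingFlags = &bindingFlags;

VkDescriptorSetLayoutCreateInfo layoutInfo{};
layoutInfo.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_SET_LAYOUT_CREATE_INFO;
layoutInfo.pNext = &bindingFlagsInfo;
layoutInfo.flags = VK_DESCRIPTOR_SET_LAYOUT_CREATE_UPDATE_AFTER_BIND_POOL_BIT;
layoutInfo.bindingCount = 1;
layoutInfo.pBindings = &binding;

VkDescriptorSetLayout layout;
vkCreateDescriptorSetLayout(device, &layoutInfo, nullptr, &layout);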
Non-uniform indexing
Indexing into descriptor arrays from shaders is quite limited in Vulkan 1.0: only constant indexing is guaranteed to be supported. Assuming that a device supports the appropriate ArrayDynamicIndexing physical device features, it may also use "dynamically uniform" indexing, i.e.:
- In a compute context: the index may be a variable, but it must resolve to the same value for all invocations in the same workgroup.
- In a graphics context: the index may be a variable, but it must resolve to the same value for all threads spawned from the same draw command (yes, that is more limiting).
Non-uniform indexing makes all indexing-related restrictions go away: we just have to mark non-dynamically uniform indices with the nonuniformEXT GLSL keyword. This qualifier is defined in a GLSL extension, which we load via #extension GL_EXT_nonuniform_qualifier : require. We use it as arr[nonuniformEXT(idx)].
B.2. Buffer device addresses
What if we were able to manipulate GPU virtual addresses from our applications? Then, we could use GPU-side pointers to read from Vulkan buffers, with all the usual pointer arithmetic tricks applying. We could use such a feature to build a buffer with all the data required by all invocations of a given shader. Then, instead of binding and unbinding descriptor sets for each mesh that relies on this shader, we could simply forward the address of the relevant portion of the buffer via a push constant. Note that the buffer itself never needs to be bound to a descriptor set. This is what buffer device addresses are about (and Khronos once again provides a sample for this feature).
To enable this feature for a specific buffer, we must create it with the VK_BUFFER_USAGE_SHADER_DEVICE_ADDRESS_BIT flag (from VkBufferUsageFlagBits). Similarly, when we allocate memory that we eventually want to bind such a buffer to, we pass a VkMemoryAllocateFlagsInfo structure including VK_MEMORY_ALLOCATE_DEVICE_ADDRESS_BIT to VkMemoryAllocateInfo's pNext field. With all of this done, we can query the address of our buffer through vkGetBufferDeviceAddress (we can then do pointer arithmetic, so long as we remain in-bounds).
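A sketch of the buffer side of things (dataSize is assumed, the allocation/binding boilerplate is abbreviated, and the bufferDeviceAddress device feature is assumed to have been enabled at device creation):

VkBufferCreateInfo bufferInfo{};
bufferInfo.sType = VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO;
bufferInfo.size = dataSize;
bufferInfo.usage = VK_BUFFER_USAGE_STORAGE_BUFFER_BIT | VK_BUFFER_USAGE_SHADER_DEVICE_ADDRESS_BIT;
VkBuffer buffer;
vkCreateBuffer(device, &bufferInfo, nullptr, &buffer);

VkMemoryAllocateFlagsInfo allocFlags{};
allocFlags.sType = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_FLAGS_INFO;
allocFlags.flags = VK_MEMORY_ALLOCATE_DEVICE_ADDRESS_BIT;

VkMemoryAllocateInfo allocInfo{};
allocInfo.sType = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO;
allocInfo.pNext = &allocFlags;
// ... allocationSize and memoryTypeIndex as usual, then vkAllocateMemory and vkBindBufferMemory ...

VkBufferDeviceAddressInfo addressInfo{};
addressInfo.sType = VK_STRUCTURE_TYPE_BUFFER_DEVICE_ADDRESS_INFO;
addressInfo.buffer = buffer;
VkDeviceAddress address = vkGetBufferDeviceAddress(device, &addressInfo);
// address (plus an in-bounds offset) can now be sent to shaders, e.g. via a push constant.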
To use buffer addresses within GLSL shaders, we again need to activate a GLSL extension (#extension GL_EXT_buffer_reference : require). This extension introduces the buffer_reference and buffer_reference_align qualifiers (the latter being optional, we only use it if we require aligned addresses):
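// Illustrative declaration: the ObjectData name and its members are placeholders.
layout(buffer_reference, buffer_reference_align = 16, std430) buffer ObjectData {
    mat4 model;
    vec4 color;
};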
Note that the above does not describe a buffer but the type of a pointer to a buffer (in particular, we do not provide bindings for such declarations). We use such pointers by referencing them through other objects, such as the push constant in the example below:
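// Illustrative usage: the push constant holds the address of one ObjectData instance.
layout(push_constant) uniform PushConstants {
    ObjectData objectData;   // a 64-bit device address, not the data itself
} pc;

void main() {
    mat4 model = pc.objectData.model;   // dereferencing reads from the buffer
    // ...
}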
We can use buffer references with push constants, uniform buffers and storage buffers alike, and we are responsible for ensuring that the shaders never read from addresses that are not part of addressable buffers. Since we cannot store every resource type inside buffers (e.g., we cannot use them for images), we cannot use this feature for all of our descriptors.
B.3. Descriptor buffers
The two extensions above punched holes through the classical descriptor set abstraction. What if we were to go further and do away with descriptor sets and pools entirely? This is what the descriptor buffers extension is about: it enables storing descriptors inside normal buffers (though we keep using descriptor set layouts to describe shader interfaces), and puts us in charge of their memory management. This brings flexibility benefits, though at the price of more complex code.
Descriptor buffers are not part of Vulkan's core: to use this functionality, we must enable the VK_EXT_descriptor_buffer device extension (which is not universally supported). Note that this extension builds upon the previously defined notion of buffer device addresses. Also, as per usual, Khronos provides a sample illustrating this feature (as well as a blog post).
We create descriptor buffers just like we would create a normal device-addressable buffer, except that we pass the VK_BUFFER_USAGE_RESOURCE_DESCRIPTOR_BUFFER_BIT_EXT flag to VkBufferCreateInfo in addition to VK_BUFFER_USAGE_SHADER_DEVICE_ADDRESS_BIT (if we want to store samplers or combined image samplers in buffers, we also need the VK_BUFFER_USAGE_SAMPLER_DESCRIPTOR_BUFFER_BIT_EXT flag).
Descriptor buffers store descriptor data, but different devices encode this information in different ways, so we have to go through a little song and dance to write the data in a generic way. We use vkGetDescriptorSetLayoutSizeEXT to get the amount of memory required to store all descriptors from a given descriptor set layout, and vkGetDescriptorSetLayoutBindingOffsetEXT to get the offset of a binding in that space. Finally, we put the data corresponding to a descriptor in a buffer using vkGetDescriptorEXT. This function takes a VkDescriptorGetInfoEXT, which wraps a VkDescriptorDataEXT union of (mostly) VkDescriptorImageInfo/VkDescriptorAddressInfoEXT pointers (one member per descriptor kind); it writes the encoded descriptor at the address we provide as a last argument (pointer arithmetic comes in handy for computing this one: we use the previously discussed functions to figure out the exact address where the descriptor belongs).
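A sketch of that dance for a single uniform buffer descriptor (the EXT entry points are assumed to have been loaded, and layout, uniformBufferAddress, uniformDataSize, descriptorBufferProperties and mappedDescriptorBuffer — the host-mapped pointer to the descriptor buffer's memory — are assumed to exist):

VkDeviceSize layoutSize, bindingOffset;
vkGetDescriptorSetLayoutSizeEXT(device, layout, &layoutSize);             // total space for this layout
vkGetDescriptorSetLayoutBindingOffsetEXT(device, layout, 0, &bindingOffset);

VkDescriptorAddressInfoEXT addressInfo{};
addressInfo.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_ADDRESS_INFO_EXT;
addressInfo.address = uniformBufferAddress;   // obtained via vkGetBufferDeviceAddress
addressInfo.range = uniformDataSize;

VkDescriptorGetInfoEXT getInfo{};
getInfo.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_GET_INFO_EXT;
getInfo.type = VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER;
getInfo.data.pUniformBuffer = &addressInfo;

// Write the encoded descriptor at the right offset within the descriptor buffer.
vkGetDescriptorEXT(device, &getInfo,
                   descriptorBufferProperties.uniformBufferDescriptorSize,
                   static_cast<char*>(mappedDescriptorBuffer) + bindingOffset);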
We use vkCmdBindDescriptorBuffersEXT to bind descriptor buffers to a command buffer, and we turn to vkCmdSetDescriptorBufferOffsetsEXT to index into a bound buffer.
We must respect the limits described in VkPhysicalDeviceDescriptorBufferPropertiesEXT (we get this structure through vkGetPhysicalDeviceProperties2, which behaves just like the original vkGetPhysicalDeviceProperties but also returns information about extensions or newer features through its pNext chain).
So, descriptor buffers are much more low-level than the previous extensions. What do we get in exchange for this tedium? Well, it merely enables updating descriptors directly from the GPU. This is nice in principle, but the usecases are limited in practice. We are better off without descriptor buffers in most scenarios.
B.X. Additional resources
- A blog post about descriptor indexing by Chunk Stories
- A note by DethRaid about descriptor indexing
- A talk by Sean Barrett about virtual textures
- A video by Aurailus about sparse bindless texture arrays (in OpenGL; it also touches upon texture compression). The video is clear, but it contains a number of mistakes, as pointed out by a commenter; see this bonus page for a link and the whole picture
- A blog post by Faith Ekstrand about descriptors and the underlying hardware models they are meant to abstract away
C. Improving the shaders/pipelines situation
Building pipelines is costly, as it is at this stage that all shaders are compiled and optimized. Ideally, we should compile all our pipelines in advance to avoid performance hitches. This is not always possible in practice, but we still strive to minimize the amount of runtime work required. We are limited in that pursuit by the rigidity of pipelines, which leads to an absurd amount of duplicated work.
In this section, we discuss two extensions that are concerned with making pipelines more modular. The first extension (shader objects) is quite radical, in that it gets rid of the concept of pipelines altogether, and proposes a return to what is basically the OpenGL model: we build individual shader objects that we link (and cross-optimize) on the fly. The alternative (graphics pipeline libraries) allows us to split pipelines into four pieces, and to combine (and cross-optimize) these pieces on the fly. Neither of these extensions is part of Vulkan's core yet.
C.1. Shader objects
The VK_EXT_shader_object device extension enables specifying pipeline shaders and state without pipeline objects. It makes it possible to break compilation into two parts: we can precompile all shaders separately, and only link/further optimize the shaders based on context at runtime. This is closer to the way OpenGL and older APIs work, and it comes with some performance cost. That tradeoff is worthwhile in many situations. Khronos provides a sample and a blog post for this extension. Also, note that shader objects do not yet support ray tracing.
To use shader objects, we must both enable the VK_EXT_SHADER_OBJECT_EXTENSION_NAME device extension (through VkDeviceCreateInfo's ppEnabledExtensionNames field) and pass a VkPhysicalDeviceShaderObjectFeaturesEXT structure with shaderObject set to VK_TRUE (through its pNext chain).
We create shader objects using vkCreateShadersEXT, which takes a set of VkShaderCreateInfoEXT structures as arguments. This structure specifies all there is to know about the shader: its code, its interface (descriptor set layout, push constants ranges, specialization constants), which stages may follow the current one. This function returns a set of VkShaderEXT handles.
To link shader objects, we just set the VK_SHADER_CREATE_LINK_STAGE_BIT_EXT flag in the targeted VkShaderCreateInfoEXT objects of the vkCreateShadersEXT call. A single call to this function can link together a single sequence of stages (as in, the stages need to actually follow each other in the graphics pipeline). We can define as many unlinked shader objects as we want to in a single call. We can even create a mixture of linked and unlinked shader objects through the same call, although, since complex restrictions apply in this setting, using one call for linking each sequence of shaders and another one for all the unlinked ones is less of a headache.
We bind shader objects via vkCmdBindShadersEXT. We can bind linked and unlinked objects alike, although we should not expect optimal performance for the latter. When using a linked shader object, we must also bind every shader it was linked with. Bound shaders are used by any compute/draw calls recorded after the bind and before the next one.
In addition to the shader objects, we must provide any additional state information that was originally passed in the pipeline object. We now consider all of this state dynamic and we set it using the appropriate functions, as described in the spec. Launching an operation on the GPU before all the required state has been bound is an error.
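A sketch covering creation, linking, binding, and a couple of the required dynamic state calls (vertCode/fragCode are assumed to hold SPIR-V as std::vector<uint32_t>, and setLayout, viewport, scissor and vertexCount are assumed to exist):

VkShaderCreateInfoEXT infos[2]{};
infos[0].sType = VK_STRUCTURE_TYPE_SHADER_CREATE_INFO_EXT;
infos[0].flags = VK_SHADER_CREATE_LINK_STAGE_BIT_EXT;
infos[0].stage = VK_SHADER_STAGE_VERTEX_BIT;
infos[0].nextStage = VK_SHADER_STAGE_FRAGMENT_BIT;
infos[0].codeType = VK_SHADER_CODE_TYPE_SPIRV_EXT;
infos[0].pCode = vertCode.data();
infos[0].codeSize = vertCode.size() * sizeof(uint32_t);   // size in bytes
infos[0].pName = "main";
infos[0].setLayoutCount = 1;
infos[0].pSetLayouts = &setLayout;

infos[1] = infos[0];                          // same interface for the fragment stage
infos[1].stage = VK_SHADER_STAGE_FRAGMENT_BIT;
infos[1].nextStage = 0;
infos[1].pCode = fragCode.data();
infos[1].codeSize = fragCode.size() * sizeof(uint32_t);

VkShaderEXT shaders[2];
vkCreateShadersEXT(device, 2, infos, nullptr, shaders);

// At record time: bind the shaders, set the (now fully dynamic) state, and draw.
// (If the tessellation/geometry features are enabled, we would also bind VK_NULL_HANDLE for those stages.)
VkShaderStageFlagBits stages[2] = {VK_SHADER_STAGE_VERTEX_BIT, VK_SHADER_STAGE_FRAGMENT_BIT};
vkCmdBindShadersEXT(commandBuffer, 2, stages, shaders);
vkCmdSetViewportWithCount(commandBuffer, 1, &viewport);
vkCmdSetScissorWithCount(commandBuffer, 1, &scissor);
vkCmdSetPrimitiveTopology(commandBuffer, VK_PRIMITIVE_TOPOLOGY_TRIANGLE_LIST);
// ... plus the rest of the required dynamic state ...
vkCmdDraw(commandBuffer, vertexCount, 1, 0, 0);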
C.2. Graphics pipeline libraries
In the previous section, we discussed a way of getting rid of pipeline objects entirely. This was maybe a bit harsh on them; after all, though they are annoyingly monolithic, once compiled they come with better performance than what was possible in OpenGL, and we may not want to compromise on performance. Could we not keep pipelines but make them more modular? Reusing even portions of pipelines could help with the combinatorial explosion problem. This is what the VK_EXT_graphics_pipeline_library device extension is about, which Khronos describes in a sample and a blog post. Note that another extension extends this approach to ray tracing pipelines (as discussed here).
After loading the extension (by passing VK_EXT_GRAPHICS_PIPELINE_LIBRARY_EXTENSION_NAME to VkDeviceCreateInfo's ppEnabledExtensionNames field), we can start defining independent pieces of pipeline objects. We cannot break up our pipelines in any way we want. Instead, there are four predefined parts (aka libraries; see the spec for more detail):
- Vertex input interface: covers VkPipelineVertexInputStateCreateInfo and VkPipelineInputAssemblyStateCreateInfo. This part contains no shaders and is therefore fast to create.
- Pre-rasterization shaders: covers the vertex shader (plus tessellation and geometry, when they are used), as well as VkPipelineViewportStateCreateInfo, VkPipelineRasterizationStateCreateInfo, VkPipelineTessellationStateCreateInfo, and VkRenderPass (or VkPipelineRenderingCreateInfo when we use dynamic rendering). This is a lot of information, but we can get away with giving just the shader code and the pipeline layout when we use dynamic state.
- Fragment shader: covers the fragment shader, as well as VkPipelineDepthStencilStateCreateInfo and VkRenderPass (or a VkPipelineRenderingCreateInfo when we use dynamic rendering; actually, that structure's viewMask field is the only one we need to provide in this context).
- Fragment output interface: covers VkPipelineColorBlendStateCreateInfo, VkPipelineMultisampleStateCreateInfo, and VkRenderPass (that last one only when dynamic rendering is not used). This part contains no shaders and is therefore fast to create.
To build such pipeline parts/libraries, we use vkCreateGraphicsPipelines the usual way, except that we only provide the information relevant to the parts we are actually creating. We specify which ones these are explicitly, via a VkGraphicsPipelineLibraryCreateInfoEXT structure we store in VkGraphicsPipelineCreateInfo's pNext chain. Note that creating several parts in a single call does not link them together. If we want to be able to optimize the result of the linking operation for our pipeline parts later, we should ask Vulkan to keep additional information about all of them using the VK_PIPELINE_CREATE_RETAIN_LINK_TIME_OPTIMIZATION_INFO_BIT_EXT flag.
The graphics pipeline library extension also makes vkCreateShaderModule unnecessary: we can pass our VkShaderModuleCreateInfo directly through VkPipelineShaderStageCreateInfo's pNext field instead.
To link all parts together, we use a VkPipelineLibraryCreateInfoKHR that we pass via VkGraphicsPipelineCreateInfo's pNext chain. We can (and usually should) enable the VK_PIPELINE_CREATE_LINK_TIME_OPTIMIZATION_BIT_EXT flag to ensure that Vulkan fully optimizes the resulting pipeline. Using an unoptimized pipeline while waiting for the optimized version of it to be ready makes sense in some contexts.
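A sketch of creating one of the four parts (here, the fragment shader library) and then linking four previously created parts into a complete pipeline (the other three libraries are assumed to have been built the same way, and the shader stage/state details are abbreviated):

VkGraphicsPipelineLibraryCreateInfoEXT libraryInfo{};
libraryInfo.sType = VK_STRUCTURE_TYPE_GRAPHICS_PIPELINE_LIBRARY_CREATE_INFO_EXT;
libraryInfo.flags = VK_GRAPHICS_PIPELINE_LIBRARY_FRAGMENT_SHADER_BIT_EXT;

VkGraphicsPipelineCreateInfo fragmentLibraryCreateInfo{};
fragmentLibraryCreateInfo.sType = VK_STRUCTURE_TYPE_GRAPHICS_PIPELINE_CREATE_INFO;
fragmentLibraryCreateInfo.pNext = &libraryInfo;
fragmentLibraryCreateInfo.flags = VK_PIPELINE_CREATE_LIBRARY_BIT_KHR
                                | VK_PIPELINE_CREATE_RETAIN_LINK_TIME_OPTIMIZATION_INFO_BIT_EXT;
fragmentLibraryCreateInfo.layout = pipelineLayout;
// ... fragment shader stage and depth/stencil state ...
VkPipeline fragmentLibrary;
vkCreateGraphicsPipelines(device, VK_NULL_HANDLE, 1, &fragmentLibraryCreateInfo, nullptr, &fragmentLibrary);

// Linking the four parts into a complete, optimized pipeline.
VkPipeline parts[4] = {vertexInputLibrary, preRasterizationLibrary, fragmentLibrary, fragmentOutputLibrary};
VkPipelineLibraryCreateInfoKHR linkInfo{};
linkInfo.sType = VK_STRUCTURE_TYPE_PIPELINE_LIBRARY_CREATE_INFO_KHR;
linkInfo.libraryCount = 4;
linkInfo.pLibraries = parts;

VkGraphicsPipelineCreateInfo linkedCreateInfo{};
linkedCreateInfo.sType = VK_STRUCTURE_TYPE_GRAPHICS_PIPELINE_CREATE_INFO;
linkedCreateInfo.pNext = &linkInfo;
linkedCreateInfo.flags = VK_PIPELINE_CREATE_LINK_TIME_OPTIMIZATION_BIT_EXT;
linkedCreateInfo.layout = pipelineLayout;
VkPipeline pipeline;
vkCreateGraphicsPipelines(device, VK_NULL_HANDLE, 1, &linkedCreateInfo, nullptr, &pipeline);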
If two different pipeline parts access different sets, the compiler may end up doing funky descriptor set aliasing, as it does not have a global view. For instance, if the vertex and the fragment shader use distinct sets, the driver may only remember that the fragment shader uses one set and that the vertex shader only uses one set as well. Critically, the fact that these sets are distinct can get lost along the way. We can use the VK_PIPELINE_LAYOUT_CREATE_INDEPENDENT_SETS_BIT_EXT flag for pipeline layouts to tell the compiler to be extra careful about this.
D. Synchronization
Modern Vulkan brings some quality of life features for synchronization. The new features are nice, but the changes are not that impactful for most engine designs. Everything described in the first two subsections (timeline semaphores and synchronization 2) is now part of Vulkan's core.
D.1. Timeline semaphores
Timeline semaphores are a generalization of semaphores (the GPU-GPU synchronization primitive) and of fences (the CPU-GPU synchronization primitive). Khronos provides both a sample and a blog post on this topic.
To create a timeline semaphore, we pass a VkSemaphoreTypeCreateInfo structure with field semaphoreType set to VK_SEMAPHORE_TYPE_TIMELINE via VkSemaphoreCreateInfo's pNext chain. Note that the type of timeline semaphores remains VkSemaphore.
Whereas plain semaphores were basically booleans, timeline semaphores carry a 64-bit unsigned integer payload. We can pick the initial value of this payload at creation time (through VkSemaphoreTypeCreateInfo's initialValue field). We are only ever allowed to increase this value, and we use specific values to represent certain states, using an encoding of our own.
We can interact with timeline semaphores from the GPU, using them as semaphores. We can wait for them to reach some value before starting a task, and raise their value to a greater one once it has been completed. We do this by inserting a VkTimelineSemaphoreSubmitInfo structure in VkSubmitInfo's pNext chain. In this structure, we store a wait (respectively, signal) value for each wait (respectively, signal) semaphore that we pass to VkSubmitInfo (we must also store values for binary semaphores that way, though in practice we almost never mix timeline and binary semaphores; this is not a real issue). A wait finishes when the semaphore's payload becomes greater than or equal to the specified value (so we must make sure that some operation will eventually raise the payload at least that far).
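Creating one is short (a sketch):

VkSemaphoreTypeCreateInfo typeInfo{};
typeInfo.sType = VK_STRUCTURE_TYPE_SEMAPHORE_TYPE_CREATE_INFO;
typeInfo.semaphoreType = VK_SEMAPHORE_TYPE_TIMELINE;
typeInfo.initialValue = 0;   // starting payload, chosen by us

VkSemaphoreCreateInfo createInfo{};
createInfo.sType = VK_STRUCTURE_TYPE_SEMAPHORE_CREATE_INFO;
createInfo.pNext = &typeInfo;

VkSemaphore timeline;
vkCreateSemaphore(device, &createInfo, nullptr, &timeline);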
We can also interact with timeline semaphores from the CPU, using them as fences. We wait on them using vkWaitSemaphores (see VkSemaphoreWaitInfo), and we signal them using vkSignalSemaphore (see VkSemaphoreSignalInfo). We can also just read the current value of a timeline semaphore using vkGetSemaphoreCounterValue.
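A sketch combining both sides: the GPU signals the value 2 once the submitted work completes, and the CPU blocks until the payload gets there (queue, commandBuffer, and the timeline semaphore created above are assumed to exist):

uint64_t signalValue = 2;
VkTimelineSemaphoreSubmitInfo timelineInfo{};
timelineInfo.sType = VK_STRUCTURE_TYPE_TIMELINE_SEMAPHORE_SUBMIT_INFO;
timelineInfo.signalSemaphoreValueCount = 1;
timelineInfo.pSignalSemaphoreValues = &signalValue;

VkSubmitInfo submitInfo{};
submitInfo.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
submitInfo.pNext = &timelineInfo;
submitInfo.commandBufferCount = 1;
submitInfo.pCommandBuffers = &commandBuffer;
submitInfo.signalSemaphoreCount = 1;
submitInfo.pSignalSemaphores = &timeline;
vkQueueSubmit(queue, 1, &submitInfo, VK_NULL_HANDLE);

// CPU-side wait, much like a fence:
uint64_t waitValue = 2;
VkSemaphoreWaitInfo waitInfo{};
waitInfo.sType = VK_STRUCTURE_TYPE_SEMAPHORE_WAIT_INFO;
waitInfo.semaphoreCount = 1;
waitInfo.pSemaphores = &timeline;
waitInfo.pValues = &waitValue;
vkWaitSemaphores(device, &waitInfo, UINT64_MAX);   // timeout in nanoseconds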
There may be a device-dependent limit on the maximum value between the current value of a semaphore and that of any pending wait or signal operation. We can read this limit in the maxTimelineSemaphoreValueDifference field of the VkPhysicalDeviceTimelineSemaphoreProperties structure (obtained via vkGetPhysicalDeviceProperties2's pNext chain).
D.2. Synchronization 2
Synchronization 2 improves pipeline barriers, events, image layout transitions and queue submissions. It does all of this through the introduction of the VkDependencyInfo structure, which centralizes all barrier information. The Vulkan guide contains a page with more information on the topic. Synchronization 2 also introduces constructs that make use of this structure:
- vkCmdPipelineBarrier2 inserts memory dependencies
- vkCmdWaitEvents2/vkCmdSetEvent2 interact with events (we discussed events in chapter 1; they are basically split barriers)
A VkDependencyInfo is a collection of VkMemoryBarrier2, VkBufferMemoryBarrier2, and VkImageMemoryBarrier2 structures. These look like the barriers we are familiar with, except that we also store src/dst stage masks in them (as VkPipelineStageFlags2 values, which themselves are like VkPipelineStageFlags, except that the stages are split differently; note that the top/bottom-of-pipe stages have been replaced by VK_PIPELINE_STAGE_2_NONE_KHR; we used to provide the stage information via arguments of vkCmdPipelineBarrier).
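A sketch of a typical transition expressed this way (an image assumed to go from a transfer write to being sampled by a fragment shader):

VkImageMemoryBarrier2 imageBarrier{};
imageBarrier.sType = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER_2;
imageBarrier.srcStageMask = VK_PIPELINE_STAGE_2_COPY_BIT;            // the stages now live in the barrier itself
imageBarrier.srcAccessMask = VK_ACCESS_2_TRANSFER_WRITE_BIT;
imageBarrier.dstStageMask = VK_PIPELINE_STAGE_2_FRAGMENT_SHADER_BIT;
imageBarrier.dstAccessMask = VK_ACCESS_2_SHADER_SAMPLED_READ_BIT;
imageBarrier.oldLayout = VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL;
imageBarrier.newLayout = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL;
imageBarrier.image = image;
imageBarrier.subresourceRange = {VK_IMAGE_ASPECT_COLOR_BIT, 0, 1, 0, 1};

VkDependencyInfo dependencyInfo{};
dependencyInfo.sType = VK_STRUCTURE_TYPE_DEPENDENCY_INFO;
dependencyInfo.imageMemoryBarrierCount = 1;
dependencyInfo.pImageMemoryBarriers = &imageBarrier;

vkCmdPipelineBarrier2(commandBuffer, &dependencyInfo);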
Furthermore, synchronization 2 introduces vkQueueSubmit2, an alternative submission command that takes a VkSubmitInfo2 argument. This argument is defined just like VkSubmitInfo, save for its use of a pair of arrays of VkSemaphoreSubmitInfo for describing the wait and the signal operations, and the presence of a flags field. In addition to making timeline semaphores management more natural, VkSemaphoreSubmitInfo defines a deviceIndex field for when we use device groups (with device groups being a Vulkan 1.1 core feature that we briefly discuss in section H — in short, it is about handling distinct physical devices as a single logical one; when we are not using this feature, i.e., almost always, we just leave it at 0).
If we use the now deprecated render passes, we should make use of the VkSubpassDependency2 structure (in practice, we only ever use render passes when we target mobile devices, which we can't really expect to support anything newer than Vulkan 1.0 for now, which implies no synchronization 2, so yeah).
Finally, synchronization 2 brings in some new image layouts (VK_IMAGE_LAYOUT_ATTACHMENT_OPTIMAL_KHR and VK_IMAGE_LAYOUT_READ_ONLY_OPTIMAL_KHR). This is just a quality of life feature (before, we would have to spell out, e.g., VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL_KHR; Vulkan is now smart enough to deduce the exact transition to perform from the kind of attachment the layout is applied to).
D.3. Getting rid of image layouts
Most (if not all) modern GPUs use a single image layout in practice. For these devices, setting up barriers all over the place to leave room for layout transitions is an exercise in futility. The VK_KHR_unified_image_layouts device extension (released in 2025) enables using the VK_IMAGE_LAYOUT_GENERAL layout almost anywhere: only initialization and presentation still require layout transitions. Devices that implement this extension guarantee that this does not come with downsides for performance. Khronos has a blog post on the topic. Support for it is still limited, and some older devices are fundamentally incompatible with it (this is not just a matter of updating drivers).
E. Indirect rendering
Indirect rendering is a generalization of instancing where the meshes are allowed to differ, since the draw parameters are provided through a GPU-side buffer. It is of course not as efficient as bona fide instancing, but it still gives us a way of rendering multiple objects via a single draw call, which is a good thing as each draw call incurs some cost. The trick is to store the mesh data for all the objects in the same index/vertex buffers (the ones that are bound at the point where we emit the indirect draw call), and to provide the draw call arguments indirectly via a VkBuffer object (which we could generate from the GPU: this technique enables a form of GPU-directed rendering; it is possible to go even further). Basic indirect draws have been part of core Vulkan from the start, and the variants that also read the draw count from a buffer (e.g., vkCmdDrawIndirectCount) became core in Vulkan 1.2. We access this functionality through commands such as vkCmdDrawIndirect (as vkCmdDraw takes four uint32_t arguments, we fill the indirect buffer with groups of four uint32_t, i.e., VkDrawIndirectCommand structures; if we were using vkCmdDrawIndexedIndirect instead, it would be five). The mesh data itself is still stored in the buffers bound via vkCmdBindIndexBuffer/vkCmdBindVertexBuffers.
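A sketch of the CPU-filled variant, with one indexed draw per object (the objects array, the shared index/vertex buffers, and indirectBuffer — a VkBuffer created with VK_BUFFER_USAGE_INDIRECT_BUFFER_BIT — are assumed to exist; issuing several draws in one call also requires the multiDrawIndirect device feature):

std::vector<VkDrawIndexedIndirectCommand> commands(objectCount);
for (uint32_t i = 0; i < objectCount; ++i) {
    commands[i].indexCount = objects[i].indexCount;       // five uint32_t per draw
    commands[i].instanceCount = 1;
    commands[i].firstIndex = objects[i].firstIndex;       // offset into the shared index buffer
    commands[i].vertexOffset = objects[i].vertexOffset;   // offset into the shared vertex buffer
    commands[i].firstInstance = i;                        // handy for fetching per-object data in shaders
}
// ... upload `commands` into indirectBuffer ...

VkDeviceSize offset = 0;
vkCmdBindIndexBuffer(commandBuffer, indexBuffer, 0, VK_INDEX_TYPE_UINT32);
vkCmdBindVertexBuffers(commandBuffer, 0, 1, &vertexBuffer, &offset);
vkCmdDrawIndexedIndirect(commandBuffer, indirectBuffer, 0, objectCount,
                         sizeof(VkDrawIndexedIndirectCommand));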
Khronos has a sample related to this technique. There is also this great video by Aurailus, and this vkguide.dev page.
F. Subgroups
Subgroups are a variation on the shared variables we discussed in the compute chapter (remember that these are variables whose value is shared among all invocations in the same workgroup). Subgroups can be smaller than workgroups, but they support data sharing with much better performance (as they basically correspond to sets of invocations executing together on the same compute unit). Moreover, the subgroup mechanism can be used in all shader types instead of just compute ones. This is a feature that became core in version 1.1 of the Vulkan standard.
For instance, we can get the sum of all values of a GLSL variable in the invocations of the current subgroup through a simple operation. Alternatively, we may check if a condition is true for all invocations, or sum the values of a variable only for the invocations with an id smaller than something, or broadcast a value from a precise invocation to all other ones in the subgroup, or shuffle values, or pick the maximum one, or apply a 2D operation working on groups of 2x2 invocations, etc. This feature is almost entirely GLSL-side (the Vulkan API only changes to allow querying devices for subgroup support).
There are many things left to say on this topic, but this is where this section ends. You should turn to additional resources for more information. This blog post by Khronos is a great starting point. There is also this 2018 Vulkan Developer Day presentation by Daniel Koch. You may be interested in the maximal reconvergence extension, which makes the semantics of, well, reconvergence more intuitive. I defer the explanation of this extension (whose utility is not limited to subgroup operations) to the Khronos blog post released alongside it.
G. Mesh shading
G.1. Working principle
GPUs have evolved from processors for a fixed rendering pipeline based on predetermined functions (where only some parameters could be tweaked) to much more general and flexible devices integrating user-defined programs (shaders). Modern GPUs even support arbitrary parallel computations through compute shaders. In a sense, mesh shaders are a continuation of this evolution process. With the traditional pipeline, the rasterization of geometric primitives happens once all of the input assembly, vertex shading, tessellation, and geometry shading steps are over. Pre-rasterization steps typically rely on hardwired behavior. Long story short, the (quite complex) traditional pipeline is very efficient for typical workloads, but its rigidity can be quite limiting in specific contexts (the fixed input assembly and tessellation steps are especially likely to lead to avoidable bottlenecks). The main idea behind mesh shading is that we could skip the pre-rasterization portion of the pipeline and produce primitives from compute shaders instead. The VK_EXT_mesh_shader device extension introduces an alternative, more flexible pipeline for graphics.
Mesh shading is a power tool that we should use very sparingly: it makes things even more low-level than they are by default, without guaranteeing any increase in performance; in fact, for classical workloads, we should expect it to make things worse. Mesh shading makes sense in contexts where the bottleneck is due to the hardwired pre-rasterization steps. Typical usecases include loading very detailed geometry (as it allows for very efficient culling and Nanite-style dynamic level of detail shenanigans) or generating isosurfaces. Mesh shading can give us something with the behavior of a geometry shader but without its awful performance. The ray tracing pipeline (not covered by this guide) is distinct from the standard graphical pipeline, so there is no direct way of combining this technique with mesh shading. Furthermore, mesh shading has poor performance on tiling architectures. It is also hard to write a one-size-fits-all mesh shader; it is common to implement distinct versions of the same shader for distinct manufacturers.
Mesh shading makes use of two kinds of shaders: task and mesh shaders. Both of them are glorified compute shaders, and they cooperate to generate meshes in parallel within a workgroup. Only mesh shaders are strictly required, and it is from them that the primitives are actually generated. However, workgroups running mesh shaders are subject to some limitations regarding how many primitives they can output. This has two main consequences:
- If we are rendering a very large/detailed object, we should break it down into smaller submeshes, which we call meshlets. Building good meshlets is a costly operation that should almost never be done live. Instead, we should store pre-computed meshlet information in the game files. What is hard about generating meshlets is that we want each of them to be nice and local (an ideal meshlet is a set of primitives forming a circular patch, not a thin stripe); that is not a trivial problem, but there are good third-party tools out there (such as meshoptimizer). Having good meshlets makes culling more efficient.
- We have to schedule an appropriate number of workgroup runs for each mesh. For instance, if we want to render an object built out of 1300 faces and each workgroup can output only 128 primitives, then we should split that object into at least 11 meshlets and schedule 11 workgroup runs.
With techniques such as tessellation, the amount of geometric detail of an object varies dynamically. Unlike the more common technique of switching the model for a more detailed one depending on the distance, tessellation happens entirely on the GPU. If we want to emulate tessellation using mesh shaders, we have an issue, as we are still bound by the limits of mesh shaders: if the model becomes more detailed, then we need to split it into meshlets in a different way (since we are bound by the maximum number of primitives per meshlet). We discussed earlier how meshlets should be generated statically; it is actually possible to devise schemes to precompute information that can be used to efficiently produce dynamically optimized meshlets, as exemplified by Nanite. Furthermore, for some applications, efficient heuristics can produce good enough meshlets at a low cost without any form of precomputation. This is where the (optional) task shader comes into play: the role of this shader is to dynamically schedule runs of the mesh shader and provide them with arguments.
G.2. Mesh shading in practice
G.2.1. Enabling mesh shading
We enable the mesh shading extension for a device by passing VK_EXT_MESH_SHADER_EXTENSION_NAME through VkDeviceCreateInfo's ppEnabledExtensionNames. Moreover, we can check whether a device supports all features of this extension by calling vkGetPhysicalDeviceFeatures2 and passing a pointer to a VkPhysicalDeviceMeshShaderFeaturesEXT structure through VkPhysicalDeviceFeatures2's pNext chain (we especially care about its taskShader and meshShader fields). To enable the features we have selected, we pass that structure through the pNext chain of the VkDeviceCreateInfo we hand to vkCreateDevice. To query additional information about mesh shading support for a specific device (e.g., limits), we pass a pointer to a VkPhysicalDeviceMeshShaderPropertiesEXT structure through VkPhysicalDeviceProperties2's pNext chain, and we use this structure in a call to vkGetPhysicalDeviceProperties2.
G.2.2. Shaders
Both task and mesh shaders should include the #extension GL_EXT_mesh_shader: require directive. We also specify the dimensions of a workgroup (i.e., of the grid of invocations) in the same manner as for compute shaders, as compute shaders is what they actually are (so, something like layout(local_size_x = 2, local_size_y = 2, local_size_z = 1) in;). The typical limit for the size of a workgroup is 128.
Task shaders take no inputs (except if we count the builtin workgroup identifiers, which are precisely the same as those of compute shaders). A single invocation typically processes one meshlet (i.e., it decides whether it is to be rendered or dropped; of course, this is not really an option for complex renderers that generate meshlets on the fly). All invocations within a workgroup cooperate to emit an appropriate number of mesh tasks via EmitMeshTasksEXT(x, y, z);. This is a GLSL command whose parameters represent the number of mesh workgroups to generate (although this command appears in all task shader invocations, it is only ever evaluated in the first one; see here for details). As you can imagine, shared memory or subgroup operations can come in handy for synchronizing work across the different invocations of the task shader. Additionally, we may define data to be forwarded to the mesh tasks; this data is uniformly accessible to all created workgroups. We do this by declaring a variable of the form taskPayloadSharedEXT SharedData sharedData; (assume that we defined a structure type called SharedData prior to that point) globally in the task shader, and by assigning a value to it in the shader body. We can define at most one of those variables, and we should strive to keep payloads as compact as possible for performance reasons.
In mesh shaders, we set a maximum number of vertices and of primitives built out of these vertices that the workgroup may emit, as in layout(max_vertices = 128, max_primitives = 128) out; (we always reason in terms of workgroups since this is where the parallelism comes from, and we want parallelism). The actual number of vertices/primitives may vary dynamically, but it must always be within the limits we defined; we use SetMeshOutputsEXT(vertexCount, primCount); to communicate what we actually output from the workgroup. We also specify what kind of primitives our mesh shader produces with a statement such as layout(triangles) out; (the only alternatives being points or lines). A typical mesh shader invocation handles one or two primitives. We may output additional data for the fragment shader, as in layout(location = 0) out vec3 vertColor[];; this is an array with one value per vertex. To output such additional data on a per-primitive basis instead, we use perprimitiveEXT, as in perprimitiveEXT layout(location = 1) out vec3 primNormal[];. Mesh shaders take the payload output from the task shader as a read-only input. To access this data, we need to include a declaration similar to that found in the task shader; e.g., taskPayloadSharedEXT SharedData sharedData;.
GLSL defines some write-only variables for use by mesh shaders: we should write the vertices we create in gl_MeshVerticesEXT, write the triangles we create in gl_PrimitiveTriangleIndicesEXT (alternatively, gl_PrimitiveLineIndicesEXT or gl_PrimitivePointIndicesEXT; we give, e.g., triangles as triplets of indices into gl_MeshVerticesEXT), and optionally share some predefined per-primitive data for use by later stages (those that carry over from the traditional graphics pipeline) using gl_MeshPrimitivesEXT. All of these variables have an array type; check the spec for details (in particular, gl_MeshVerticesEXT and gl_MeshPrimitivesEXT are arrays of structs defined there). The main function of a simple mesh shader may look something like:
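// Illustrative only. Assumes layout(local_size_x = 64) in; and
// layout(triangles, max_vertices = 64, max_primitives = 62) out; along with the
// vertColor output declared above; meshletPositions/meshletColors are placeholder
// arrays coming, e.g., from a storage buffer or the task payload.
void main() {
    uint i = gl_LocalInvocationIndex;

    // Declare how much this workgroup actually emits (within the declared maxima).
    SetMeshOutputsEXT(64, 62);

    // One vertex per invocation.
    gl_MeshVerticesEXT[i].gl_Position = vec4(meshletPositions[i], 1.0);
    vertColor[i] = meshletColors[i];

    // One triangle per invocation, except for the last two invocations.
    if (i < 62) {
        gl_PrimitiveTriangleIndicesEXT[i] = uvec3(i, i + 1, i + 2);
    }
}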
To cull specific primitives, we can do something like gl_MeshPrimitivesEXT[i].gl_CullPrimitiveEXT = true (doc 1, doc 2) from a mesh shader. Arseny Kapoulkine has a nice blog post on the topic.
G.2.3. Building and using a mesh shading pipeline object
We define the configuration of a mesh shading pipeline using a standard graphics pipeline object, except that we set both pVertexInputState and pInputAssemblyState to null in VkGraphicsPipelineCreateInfo. Furthermore, we provide VkPipelineShaderStageCreateInfo objects corresponding to our task/mesh shaders. We use vkCreateGraphicsPipelines to create a mesh pipeline (yes, this is the same function as for the traditional graphics pipeline) and vkCmdBindPipeline to bind it.
We emit mesh shaders-based draw calls in a command buffer using vkCmdDrawMeshTasksEXT, which simply takes the workgroup grid's dimensions (for the task shader if present, otherwise for the mesh shader) as arguments (we may also use one of its indirect variants: vkCmdDrawMeshTasksIndirectEXT and vkCmdDrawMeshTasksIndirectCountEXT; the buffer should hold VkDrawMeshTasksIndirectCommandEXT objects).
G.X. Additional resources
Khronos provides two samples related to mesh shading: a very basic one (without the usual detailed README) and a more advanced one. They also have a blog post on the topic. Moreover, the related GLSL/OpenGL extension has its own specification (information about finer aspects of task/mesh shaders is sparse outside of this document). You may also want to take a look at the following materials:
- A very clear XDC 2022 presentation by Ricardo Garcia (the short format implies that it is not too detailed; still, it contains good practical advice).
- A nice blog post by Jglrxavpok.
- A Vulkanised 2023 presentation by Timur Kristóf about mesh shading in Vulkan (also see this blog post of his, though it predates the Vulkan extension he helped develop).
- A SIGGRAPH presentation by Brian Karis about Nanite, a component of Unreal Engine 5 that relies on mesh shading. Note that Nanite also handles rasterization in a custom manner: large enough triangles go through the classical hardware rasterizer, but they handle small triangles through a custom, compute shader-based rasterizer. They do this because the hardwired rasterizers of modern GPUs have bad performance for small/thin triangles (SimonDev has a great video on the topic). The results for both types of triangles are then assembled in another compute shader, where color information also gets computed (the presentation discusses some interesting tricks). A tangentially related blog post by Maister contains some additional information on implementing a Vulkan-based Nanite-like renderer; this series of posts by Jglrxavpok also looks very interesting, and so does this blog post by John Hable (whose entire blog is worth checking out). For justification as to why rendering pipelines are based on blocks of 2x2 pixels (which is the root of most hardware rasterizer limitations), see this blog post on derivatives.
H. Device groups
Device groups were introduced in an extension that has been part of core Vulkan since version 1.1. They enable using distinct physical GPUs as if they were one and the same. This is mostly useful for doing things à la NVLink. The feature is very niche, so we will not discuss it in further detail: just know that it exists.
X. Additional resources
Charles Giessen gave a presentation about modern renderers at Vulkanised 2025, where he discusses most of the techniques mentioned above. It can serve as a good recap of this chapter.