Chapter 6: modern Vulkan
In the previous chapters, we mostly discussed core Vulkan 1.0, which was released all the way back in 2016. Since then, Vulkan has gone through four major updates (as of 2026) that improved the developer experience and accommodated new features introduced by GPU vendors. Changes to the base API or new features considered important enough get merged into the core specification, while extensions are introduced to cover features that are either uncommon or still in a state of flux.
Many of these changes aim at making the lives of Vulkan users easier, be it by building alternatives to existing abstractions or by introducing higher-level interfaces around GPU concepts (which is a balancing act: we would not want to end up with a bloated specification, nor to reduce the amount of control over GPUs that Vulkan currently provides). New versions mostly add things, but some features get deprecated over time, meaning that we can still use them, but we should feel bad if we do (see this list of deprecated features).
A. Dynamic rendering
In the graphics chapter, we went over Vulkan's classical rendering pipeline. Although most of that chapter remains valid, some important components thereof are now deprecated: exit render passes, subpasses and framebuffers, enter dynamic rendering. Dynamic rendering is a more flexible and less verbose interface for rendering that comes without any performance cost. In other words, it is a plainly superior abstraction, which explains the deprecation of render passes and associated constructs. An update that actually simplifies things is a rare thing that we should cherish! However, mobile drivers are lagging behind as far as support for modern Vulkan goes, so dynamic rendering can only be recommended for desktop-only development as of 2026.
The main idea behind dynamic rendering is that we can reference rendering attachments directly from the command buffer instead of declaring subpasses upfront. We ditch vkCmdBeginRenderPass/vkCmdEndRenderPass pairs in favor of the vkCmdBeginRendering and vkCmdEndRendering commands. We must provide a VkRenderingInfo structure when we open such a rendering context. This structure plays a role similar to that of VkRenderPassBeginInfo's framebuffer object. It specifies:
- Which attachments are available to the rendering operations:
  - A single depth and a single stencil attachment. These are considered distinct for future-proofing reasons, though they should point to the same resource in practice
  - An arbitrary number of color attachments
  (We do not mention input attachments, as these are handled entirely through descriptors.)
- The render's dimensions
- A view mask (for use with multiview)
- A layer count (only used when the view mask is left at 0)
- The active rendering flags; we use them to suspend/resume rendering operations or to indicate that the draw calls are emitted from secondary command buffers
We can have multiple draw operations in a single rendering block, with the restriction that all of them share the same depth and stencil attachments. Color attachments are also shared globally: we control which of them gets written to by a rendering operation through the GLSL location keyword only. We describe (non-input) attachments using VkRenderingAttachmentInfo objects, which specify an image view and the layout it will be in at rendering time, resolve information for multisampling (a resolve mode, a target image view and a layout; resolving runs after rendering, and the resolve mode gives us a fair amount of control over how it runs), and a pair of load/store operations (plus a clear value for clearing the attachment on load, used if we set the load operation to clear). VkRenderingAttachmentInfo is pretty much the new VkAttachmentDescription. The main differences are that the image view is referenced directly and that layout transitions are not handled automatically.
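To make this concrete, here is a minimal sketch of opening a dynamic rendering block with one color and one depth attachment (colorView, depthView, width, height and commandBuffer are placeholder names):

```cpp
// Describe the color attachment: clear on load, keep the result.
VkRenderingAttachmentInfo colorAttachment{};
colorAttachment.sType = VK_STRUCTURE_TYPE_RENDERING_ATTACHMENT_INFO;
colorAttachment.imageView = colorView;
colorAttachment.imageLayout = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL;   // layout at rendering time
colorAttachment.loadOp = VK_ATTACHMENT_LOAD_OP_CLEAR;
colorAttachment.storeOp = VK_ATTACHMENT_STORE_OP_STORE;
colorAttachment.clearValue.color = {{0.0f, 0.0f, 0.0f, 1.0f}};

// Describe the depth attachment: clear on load, discard after rendering.
VkRenderingAttachmentInfo depthAttachment{};
depthAttachment.sType = VK_STRUCTURE_TYPE_RENDERING_ATTACHMENT_INFO;
depthAttachment.imageView = depthView;
depthAttachment.imageLayout = VK_IMAGE_LAYOUT_DEPTH_ATTACHMENT_OPTIMAL;
depthAttachment.loadOp = VK_ATTACHMENT_LOAD_OP_CLEAR;
depthAttachment.storeOp = VK_ATTACHMENT_STORE_OP_DONT_CARE;
depthAttachment.clearValue.depthStencil = {1.0f, 0};

// The rendering block itself: dimensions, attachments, no multiview.
VkRenderingInfo renderingInfo{};
renderingInfo.sType = VK_STRUCTURE_TYPE_RENDERING_INFO;
renderingInfo.renderArea = {{0, 0}, {width, height}};
renderingInfo.layerCount = 1;                        // view mask left at 0
renderingInfo.colorAttachmentCount = 1;
renderingInfo.pColorAttachments = &colorAttachment;
renderingInfo.pDepthAttachment = &depthAttachment;

vkCmdBeginRendering(commandBuffer, &renderingInfo);
// ... vkCmdBindPipeline, vkCmdDraw, etc. ...
vkCmdEndRendering(commandBuffer);
```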
When creating pipelines, we pass a VkPipelineRenderingCreateInfo object through VkGraphicsPipelineCreateInfo's pNext field. This object specifies the format of the attachments (plus some additional information for multiview rendering). We do not have to deal with render pass compatibility anymore! Also, we should pass a null handle instead of the render pass object we used to provide.
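A sketch of that chaining (colorFormat and depthFormat are placeholder VkFormat variables):

```cpp
// The pipeline only needs to know the attachment formats, not a render pass.
VkPipelineRenderingCreateInfo renderingCreateInfo{};
renderingCreateInfo.sType = VK_STRUCTURE_TYPE_PIPELINE_RENDERING_CREATE_INFO;
renderingCreateInfo.colorAttachmentCount = 1;
renderingCreateInfo.pColorAttachmentFormats = &colorFormat;
renderingCreateInfo.depthAttachmentFormat = depthFormat;

VkGraphicsPipelineCreateInfo pipelineInfo{};
pipelineInfo.sType = VK_STRUCTURE_TYPE_GRAPHICS_PIPELINE_CREATE_INFO;
pipelineInfo.pNext = &renderingCreateInfo;   // replaces the render pass information
pipelineInfo.renderPass = VK_NULL_HANDLE;    // no render pass object anymore
// ... shader stages and the rest of the pipeline state as usual ...
```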
Since there are no more explicitly encoded dependencies, we have to handle synchronization and transitions ourselves through memory barriers. It used to be that render passes yielded better performance on tiled implementations, but parity with the render pass-based approach was achieved through the addition of the VK_IMAGE_LAYOUT_RENDERING_LOCAL_READ image layout, which indicates that only pixel-local accesses are allowed (we need to perform the transitions to this layout explicitly, but this is less work than defining render passes).
Khronos provides two relevant samples: one about forward shading and another one about deferred shading/local reads. Note that in these samples, the features are presented as if they were part of extensions, although dynamic rendering has been part of Vulkan's core since version 1.3 (and local reads since version 1.4).
B. Beyond descriptor sets
Descriptor sets are another major pain point of Vulkan 1.0: we have to manage these heterogeneous collections of objects, and we are responsible for grouping them into sets based on the update frequency of the bindings. Descriptor sets are subject to many limitations: we cannot update descriptors while they are bound in a recorded (but not yet executed) command buffer, all bound descriptors must be valid (even when they end up not being accessed), there is a maximum number of descriptors, etc. Furthermore, the entire descriptor model is quite cryptic: unlike everywhere else in Vulkan, we do not handle memory directly. What is the arcane concept of descriptor pools really hiding?
In this section, we discuss three Vulkan extensions that make descriptors more flexible and less magical. The first two extensions are now part of Vulkan's core, and the last one may get the same treatment at some point:
- Descriptor indexing allows us to define unbounded arrays of descriptors, to update bound descriptors, and to do non-uniform array indexing; it also relaxes the requirement that unused descriptors be valid.
- Buffer device addresses enable accessing buffers directly through a raw address. We can use this technique to avoid having to bind one descriptor per resource.
- The two previous extensions punched holes into the descriptor set abstraction. The descriptor buffer extension gets rid of this abstraction entirely, and it makes us responsible for handling the storage of descriptors in buffer objects. This enables advanced techniques such as building descriptor buffers from the GPU itself. However, it comes at the cost of a lot of tedium — in most cases, we are better off without this extension.
B.1. Descriptor indexing/bindless descriptors
What if we were to store all the required textures for a set of objects in a large array? Then, we could bind this array to a descriptor once, and only pass indices into this array to all objects. That way, we would not have to constantly bind and unbind the texture data. This is actually something that we could do in Vulkan 1.0, but the lack of flexibility of descriptor sets limits our options. The descriptor indexing extension, which is now a core part of Vulkan, improves the situation. Bindless descriptors do have an impact on performance, however small: there is just the cost of that one additional indirection.
Khronos provides a sample showcasing this feature. There, the descriptor ids are passed as per-vertex data (using the flat GLSL keyword to disable interpolation: each fragment inherits the exact data of one of its surrounding vertices). If we do not need the data to vary per vertex, we can send such ids through push constants instead.
Activation
To enable descriptor indexing, we pass a VkPhysicalDeviceDescriptorIndexingFeatures structure to vkCreateDevice, via VkDeviceCreateInfo's pNext pointer.
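A sketch of that activation, with a few commonly used feature booleans set (which ones you need depends on your renderer; physicalDevice and device are placeholders, and queue setup is omitted):

```cpp
// Request the descriptor indexing features we intend to use.
VkPhysicalDeviceDescriptorIndexingFeatures indexingFeatures{};
indexingFeatures.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_DESCRIPTOR_INDEXING_FEATURES;
indexingFeatures.runtimeDescriptorArray = VK_TRUE;                       // unbounded arrays in shaders
indexingFeatures.descriptorBindingPartiallyBound = VK_TRUE;              // unused descriptors may be invalid
indexingFeatures.descriptorBindingSampledImageUpdateAfterBind = VK_TRUE; // update-after-bind for sampled images
indexingFeatures.shaderSampledImageArrayNonUniformIndexing = VK_TRUE;    // non-uniform indexing

VkDeviceCreateInfo deviceInfo{};
deviceInfo.sType = VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO;
deviceInfo.pNext = &indexingFeatures;   // chained into device creation
// ... queue create infos, enabled extensions, etc. ...
vkCreateDevice(physicalDevice, &deviceInfo, nullptr, &device);
```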
Update-after-bind
With the descriptor indexing feature active, we can update bound descriptors (we can even update different ones from different threads). We must activate some flags for each descriptor set layout whose contents we want to update that way: VK_DESCRIPTOR_SET_LAYOUT_CREATE_UPDATE_AFTER_BIND_POOL_BIT, via VkDescriptorSetLayoutCreateInfo's flags field, and VK_DESCRIPTOR_BINDING_UPDATE_AFTER_BIND_BIT (which guarantees that the implementation observes descriptor updates: the most recent version at submission time is used), via a VkDescriptorSetLayoutBindingFlagsCreateInfo structure that we provide through VkDescriptorSetLayoutCreateInfo's pNext field; this structure holds an array of binding flags, with one entry per binding in our layout. Some more related flags are available:
- A VK_DESCRIPTOR_BINDING_PARTIALLY_BOUND_BIT (to allow having invalid descriptors so long as they are not actively used)
- A VK_DESCRIPTOR_BINDING_UPDATE_UNUSED_WHILE_PENDING_BIT (to enable updates to descriptors that are not used by an active command buffer; by default, updates are only permitted before submission — when the previous flag is also active, the property of being used by a shader is defined dynamically, as opposed to the static default definition)
- A VK_DESCRIPTOR_BINDING_VARIABLE_DESCRIPTOR_COUNT_BIT (to make the number of descriptors in the binding variable; i.e., it will only be known when an actual descriptor set is allocated against this layout, and the descriptorCount field is interpreted as a maximum — we can only use this for the last binding in a layout)
Similarly, we must create the descriptor pool with the VK_DESCRIPTOR_POOL_CREATE_UPDATE_AFTER_BIND_BIT flag (via VkDescriptorPoolCreateInfo's flags field). Finally, we perform the actual update in the usual way, i.e., using vkUpdateDescriptorSets.
Note that with update-after-bind, the GPU driver can only make weaker assumptions for optimizations, but the flexibility gains can easily make up for that.
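A sketch of an update-after-bind layout holding a large, partially bound texture array (the 4096 count is an arbitrary upper bound; device and setLayout are placeholders):

```cpp
// One flags entry per binding in the layout (here, a single binding).
VkDescriptorBindingFlags bindingFlags =
    VK_DESCRIPTOR_BINDING_UPDATE_AFTER_BIND_BIT |
    VK_DESCRIPTOR_BINDING_PARTIALLY_BOUND_BIT;

VkDescriptorSetLayoutBindingFlagsCreateInfo bindingFlagsInfo{};
bindingFlagsInfo.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_SET_LAYOUT_BINDING_FLAGS_CREATE_INFO;
bindingFlagsInfo.bindingCount = 1;
bindingFlagsInfo.pBindingFlags = &bindingFlags;

VkDescriptorSetLayoutBinding binding{};
binding.binding = 0;
binding.descriptorType = VK_DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER;
binding.descriptorCount = 4096;                       // large bindless texture array
binding.stageFlags = VK_SHADER_STAGE_FRAGMENT_BIT;

VkDescriptorSetLayoutCreateInfo layoutInfo{};
layoutInfo.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_SET_LAYOUT_CREATE_INFO;
layoutInfo.pNext = &bindingFlagsInfo;
layoutInfo.flags = VK_DESCRIPTOR_SET_LAYOUT_CREATE_UPDATE_AFTER_BIND_POOL_BIT;
layoutInfo.bindingCount = 1;
layoutInfo.pBindings = &binding;
vkCreateDescriptorSetLayout(device, &layoutInfo, nullptr, &setLayout);

// The pool backing such sets needs the matching flag as well:
// poolInfo.flags |= VK_DESCRIPTOR_POOL_CREATE_UPDATE_AFTER_BIND_BIT;
```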
Non-uniform indexing
Indexing into descriptor arrays from shaders is quite limited in Vulkan 1.0: only constant indexing is guaranteed to be supported. Assuming that a device supports the appropriate ArrayDynamicIndexing physical device features, it may also use "dynamically uniform" indexing, i.e.:
- In a compute context: the index may be a variable, but it must resolve to the same value for all invocations in the same workgroup.
- In a graphics context: the index may be a variable, but it must resolve to the same value for all threads spawned from the same draw command (yes, that is more limiting).
Non-uniform indexing makes all indexing-related restrictions go away: we just have to mark non-dynamically uniform indices with the nonuniformEXT GLSL keyword. This qualifier is defined in a GLSL extension, which we load via #extension GL_EXT_nonuniform_qualifier : require.
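A minimal fragment shader sketch of this qualifier (the unbounded array also requires the runtimeDescriptorArray feature; names like materialId are placeholders):

```glsl
#version 460
#extension GL_EXT_nonuniform_qualifier : require

layout(set = 0, binding = 0) uniform sampler2D textures[]; // unbounded descriptor array
layout(location = 0) flat in uint materialId;              // per-vertex index, not interpolated
layout(location = 1) in vec2 uv;
layout(location = 0) out vec4 outColor;

void main() {
    // The index may differ between invocations, hence the nonuniformEXT qualifier.
    outColor = texture(textures[nonuniformEXT(materialId)], uv);
}
```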
B.2. Buffer device addresses
What if we were able to manipulate GPU virtual addresses from our applications? Then, we could use GPU-side pointers to read from Vulkan buffers, with all the usual pointer arithmetic tricks applying. We could use such a feature to build a buffer with all the data required by all invocations of a given shader. Then, instead of binding and unbinding descriptor sets for each mesh that relies on this shader, we could simply forward the address of the relevant portion of the buffer via a push constant. Note that the buffer itself never needs to be bound to a descriptor set. This is what buffer device addresses are about (and Khronos once again provides a sample for this feature).
To enable this feature for a specific buffer, we must create it with the VK_BUFFER_USAGE_SHADER_DEVICE_ADDRESS_BIT flag (from VkBufferUsageFlagBits). Similarly, when we allocate memory that we eventually want to bind such a buffer to, we pass a VkMemoryAllocateFlagsInfo structure including VK_MEMORY_ALLOCATE_DEVICE_ADDRESS_BIT to VkMemoryAllocateInfo's pNext field. With all of this done, we can query the address of our buffer through vkGetBufferDeviceAddress (we can then do pointer arithmetic, so long as we remain in-bounds).
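A sketch of the whole dance (the bufferDeviceAddress device feature must also have been enabled at device creation; size, memoryTypeIndex and the handles are placeholders):

```cpp
// Create a buffer that can be addressed from shaders.
VkBufferCreateInfo bufferInfo{};
bufferInfo.sType = VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO;
bufferInfo.size = size;
bufferInfo.usage = VK_BUFFER_USAGE_STORAGE_BUFFER_BIT | VK_BUFFER_USAGE_SHADER_DEVICE_ADDRESS_BIT;
vkCreateBuffer(device, &bufferInfo, nullptr, &buffer);

VkMemoryRequirements memoryRequirements;
vkGetBufferMemoryRequirements(device, buffer, &memoryRequirements);

// The backing memory needs the matching allocation flag.
VkMemoryAllocateFlagsInfo allocFlags{};
allocFlags.sType = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_FLAGS_INFO;
allocFlags.flags = VK_MEMORY_ALLOCATE_DEVICE_ADDRESS_BIT;

VkMemoryAllocateInfo allocInfo{};
allocInfo.sType = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO;
allocInfo.pNext = &allocFlags;
allocInfo.allocationSize = memoryRequirements.size;
allocInfo.memoryTypeIndex = memoryTypeIndex;
vkAllocateMemory(device, &allocInfo, nullptr, &memory);
vkBindBufferMemory(device, buffer, memory, 0);

// Query the GPU virtual address; in-bounds offsets can be added to it.
VkBufferDeviceAddressInfo addressInfo{};
addressInfo.sType = VK_STRUCTURE_TYPE_BUFFER_DEVICE_ADDRESS_INFO;
addressInfo.buffer = buffer;
VkDeviceAddress address = vkGetBufferDeviceAddress(device, &addressInfo);
// `address` can now be handed to shaders, e.g. via a push constant.
```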
To use buffer addresses within GLSL shaders, we again need to activate a GLSL extension (#extension GL_EXT_buffer_reference : require). This extension introduces the buffer_reference and buffer_reference_align qualifiers (the latter being optional, we only use it if we require aligned addresses):
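A minimal sketch of such a declaration (the Vertices type name and its member are placeholders):

```glsl
#extension GL_EXT_buffer_reference : require

// Declares a pointer type: a 64-bit device address interpreted as an array of vec4s.
layout(buffer_reference, std430, buffer_reference_align = 16) buffer Vertices {
    vec4 positions[];
};
```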
Note that the above does not describe a buffer but the type of a pointer to a buffer (in particular, we do not provide bindings for such declarations). We use such pointers by referencing them through other objects, such as the push constant in the example below:
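A minimal sketch of such a push constant (Registers and pc are placeholder names; Vertices is the pointer type declared above):

```glsl
layout(push_constant) uniform Registers {
    Vertices vertices;   // a 64-bit device address, set from the CPU via vkCmdPushConstants
} pc;

void main() {
    // Dereference the pointer like a regular buffer.
    vec4 position = pc.vertices.positions[gl_VertexIndex];
    // ...
}
```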
We can use buffer references with push constants, uniform buffers and storage buffers alike, and we are responsible for ensuring that the shaders never read from addresses that are not part of addressable buffers. We cannot store every resource type inside buffers (e.g., this is not possible for images), so we cannot use this feature for all of our descriptors.
B.3. Descriptor buffers
The two extensions above punched holes through the classical descriptor set abstraction. What if we were to go further and do away with descriptor sets and pools entirely? This is what the descriptor buffers extension is about: it enables storing descriptors inside normal buffers (though we keep using descriptor set layouts to describe shader interfaces), and puts us in charge of their memory management. This brings flexibility benefits, though at the price of more complex code.
Descriptor buffers are not part of Vulkan's core: to use this functionality, we must enable the VK_EXT_descriptor_buffer device extension (which is not universally supported). Note that this extension builds upon the previously defined notion of buffer device addresses. Also, as per usual, Khronos provides a sample illustrating this feature (as well as a blog post).
We create descriptor buffers just like we would create a normal device-addressable buffer, except that we pass the VK_BUFFER_USAGE_RESOURCE_DESCRIPTOR_BUFFER_BIT_EXT flag to VkBufferCreateInfo in addition to VK_BUFFER_USAGE_SHADER_DEVICE_ADDRESS_BIT (if we want to store samplers or combined image samplers in the buffer, we also need the VK_BUFFER_USAGE_SAMPLER_DESCRIPTOR_BUFFER_BIT_EXT flag).
Descriptor buffers store descriptor data, but different devices encode this information in different ways, so we have to go through a little song and dance to write the data in a generic way. We use vkGetDescriptorSetLayoutSizeEXT to get the amount of memory required to store all descriptors from a given descriptor set layout, and vkGetDescriptorSetLayoutBindingOffsetEXT to get the offset of a binding in that space. Finally, we obtain the encoded data corresponding to a descriptor using vkGetDescriptorEXT. This function takes a VkDescriptorGetInfoEXT structure whose data field is a VkDescriptorDataEXT, a union of (mostly) VkDescriptorImageInfo/VkDescriptorAddressInfoEXT pointers (one per descriptor kind); it writes the encoded descriptor at a given address (pointer arithmetic comes in handy for computing this one).
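A sketch of writing one uniform buffer descriptor into a host-mapped descriptor buffer (setLayout, uniformBufferAddress, uniformBufferRange, mappedDescriptorBuffer and descriptorBufferProperties are placeholders; the latter is a VkPhysicalDeviceDescriptorBufferPropertiesEXT we queried beforehand):

```cpp
VkDeviceSize layoutSize = 0, bindingOffset = 0;
vkGetDescriptorSetLayoutSizeEXT(device, setLayout, &layoutSize);                 // space needed for the whole layout
vkGetDescriptorSetLayoutBindingOffsetEXT(device, setLayout, 0, &bindingOffset);  // where binding 0 lives in that space

// Point at the actual uniform buffer through its device address.
VkDescriptorAddressInfoEXT addressInfo{};
addressInfo.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_ADDRESS_INFO_EXT;
addressInfo.address = uniformBufferAddress;   // from vkGetBufferDeviceAddress
addressInfo.range = uniformBufferRange;

VkDescriptorGetInfoEXT getInfo{};
getInfo.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_GET_INFO_EXT;
getInfo.type = VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER;
getInfo.data.pUniformBuffer = &addressInfo;   // the VkDescriptorDataEXT union

// Write the device-specific descriptor encoding directly into the mapped descriptor buffer.
vkGetDescriptorEXT(device, &getInfo,
                   descriptorBufferProperties.uniformBufferDescriptorSize,
                   static_cast<char*>(mappedDescriptorBuffer) + bindingOffset);
```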
We use vkCmdBindDescriptorBuffersEXT to bind descriptor buffers to a command buffer, and we turn to vkCmdSetDescriptorBufferOffsetsEXT to index into a bound buffer.
We must respect the limits described in VkPhysicalDeviceDescriptorBufferPropertiesEXT (we get this structure through vkGetPhysicalDeviceProperties2, which behaves just like the deprecated vkGetPhysicalDeviceProperties but also returns information about extensions or new features through its pNext chain).
So, descriptor buffers are much more low-level than the previous extensions. What do we get in exchange for this tedium? Well, it merely enables updating descriptors directly from the GPU. This is nice in principle, but the usecases are limited in practice. We are better off without descriptor buffers in most scenarios.
B.X. Additional resources
- A blog post about descriptor indexing by Chunk Stories
- A note by DethRaid about descriptor indexing
- A talk by Sean Barrett about virtual textures
- A video by Aurailus about sparse bindless texture arrays (in OpenGL, also touches upon texture compression). The video feels clear BUT it contains a bunch of mistakes, as pointed out by a commenter; see this bonus page for a link, and the whole picture
- A blog post by Faith Ekstrand about descriptors and the underlying hardware models they are meant to abstract away
C. Improving the shaders/pipelines situation
Building pipelines is costly, as it is at this stage that all shaders are compiled and optimized. Ideally, we should compile all our pipelines in advance to avoid performance hitches. This is not always possible in practice, but we still strive to minimize the amount of runtime work required. We are however limited in that pursuit by the rigidity of pipelines, which leads to an absurd amount of duplicated work.
In this section, we discuss two extensions that are concerned with making pipelines more modular. The first extension (shader objects) is quite radical, in that it gets rid of the concept of pipelines altogether, and proposes a return to what is basically the OpenGL model: we build individual shader objects that we link (and cross-optimize) on the fly. The alternative (graphics pipeline libraries) allows us to split pipelines into four pieces, and to combine (and cross-optimize) these pieces on the fly. Neither of these extensions is (yet) part of Vulkan's core.
C.1. Shader objects
The VK_EXT_shader_object device extension enables specifying pipeline shaders and state without pipeline objects. It makes it possible to break compilation into two parts: we can precompile all shaders separately, and only link/further optimize the shaders based on context at runtime. This is closer to the way OpenGL and older APIs work, and it comes with some performance cost. The tradeoff can be positive in many situations. Khronos provides a sample and a blog post for this extension. Also, note that shader objects do not (yet) support ray tracing.
To use shader objects, we must both enable the VK_EXT_SHADER_OBJECT_EXTENSION_NAME device extension (through VkDeviceCreateInfo's ppEnabledExtensionNames field) and pass a VkPhysicalDeviceShaderObjectFeaturesEXT structure with shaderObject set to VK_TRUE (through its pNext chain).
We create shader objects using vkCreateShadersEXT, which takes a set of VkShaderCreateInfoEXT structures as arguments. This structure specifies all there is to know about the shader: its code, its interface (descriptor set layouts, push constant ranges, specialization constants), and information for linking (in this context, linking mostly means "optimizing a shader based on its context"). This function returns a set of VkShaderEXT handles.
To link shader objects, we just set the VK_SHADER_CREATE_LINK_STAGE_BIT_EXT flag in the targeted VkShaderCreateInfoEXT objects of a vkCreateShadersEXT call. A single call to this function can link together a single sequence of stages (as in, the stages need to actually follow each other in the graphics pipeline). We can define as many unlinked shader objects as we want in a single call. We can even create a mixture of linked and unlinked shader objects through the same call, although, since complex restrictions apply in this setting, it is safer to use one call for each linked sequence of shaders and another one for all the unlinked ones.
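A sketch of creating a linked vertex/fragment pair (vertSpirv/fragSpirv are placeholder std::vector<uint32_t> holding SPIR-V, and setLayout describes the shared interface):

```cpp
VkShaderCreateInfoEXT infos[2]{};
for (VkShaderCreateInfoEXT& info : infos) {
    info.sType = VK_STRUCTURE_TYPE_SHADER_CREATE_INFO_EXT;
    info.flags = VK_SHADER_CREATE_LINK_STAGE_BIT_EXT;   // link the stages of this call together
    info.codeType = VK_SHADER_CODE_TYPE_SPIRV_EXT;
    info.pName = "main";
    info.setLayoutCount = 1;
    info.pSetLayouts = &setLayout;
}
infos[0].stage = VK_SHADER_STAGE_VERTEX_BIT;
infos[0].nextStage = VK_SHADER_STAGE_FRAGMENT_BIT;      // which stage may follow this one
infos[0].codeSize = vertSpirv.size() * sizeof(uint32_t);
infos[0].pCode = vertSpirv.data();
infos[1].stage = VK_SHADER_STAGE_FRAGMENT_BIT;
infos[1].codeSize = fragSpirv.size() * sizeof(uint32_t);
infos[1].pCode = fragSpirv.data();

VkShaderEXT shaders[2];
vkCreateShadersEXT(device, 2, infos, nullptr, shaders);
```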
We bind shader objects via vkCmdBindShadersEXT. We can bind linked and unlinked objects alike, although we should not expect optimal performance for the latter. When using a linked shader object, we must also bind each and every shader that it was linked to. Bound shaders are used by the following compute/draw calls.
In addition to the shader objects, we must provide any additional state information that was originally passed in the pipeline object. We now consider all of this state dynamic and we set it using the appropriate functions, as described in the spec. Launching an operation on the GPU before all the required state has been bound is an error.
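A sketch of binding the shader objects created above and setting a few representative pieces of dynamic state before drawing (assuming Vulkan 1.3 core names for the state setters; with older headers, the EXT-suffixed variants apply, and viewport/scissor are placeholders):

```cpp
// Bind the linked vertex/fragment shader objects.
VkShaderStageFlagBits stages[2] = {VK_SHADER_STAGE_VERTEX_BIT, VK_SHADER_STAGE_FRAGMENT_BIT};
vkCmdBindShadersEXT(commandBuffer, 2, stages, shaders);

// All pipeline state is now dynamic; only a few of the required setters are shown here.
vkCmdSetViewportWithCount(commandBuffer, 1, &viewport);
vkCmdSetScissorWithCount(commandBuffer, 1, &scissor);
vkCmdSetPrimitiveTopology(commandBuffer, VK_PRIMITIVE_TOPOLOGY_TRIANGLE_LIST);
vkCmdSetCullMode(commandBuffer, VK_CULL_MODE_BACK_BIT);
// ... every other piece of state listed in the spec must be set before drawing ...

vkCmdDraw(commandBuffer, vertexCount, 1, 0, 0);
```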
C.2. Graphics pipeline libraries
In the previous section, we discussed a way of getting rid of pipeline objects entirely. This was maybe a bit harsh on them; after all, though they are annoyingly monolithic, they come with good performance once compiled, and we may not want to compromise on performance. Could we not keep pipelines but make them more modular? This would enable all kinds of reuse, which would help with the combinatorial explosion problem. This is what the VK_EXT_graphics_pipeline_library device extension is about. Khronos provides a sample and a blog post for this extension. Note that another extension extends this approach to ray tracing pipelines (as discussed here).
After loading the extension (by passing VK_EXT_GRAPHICS_PIPELINE_LIBRARY_EXTENSION_NAME to VkDeviceCreateInfo's ppEnabledExtensionNames field), we can start defining independent pieces of pipeline objects. We cannot break up our pipelines in any way we want. Instead, there are four predefined parts (aka libraries; see the spec for more detail):
- Vertex input interface: covers VkPipelineVertexInputStateCreateInfo and VkPipelineInputAssemblyStateCreateInfo. This part contains no shaders and is therefore fast to create.
- Pre-rasterization shaders: covers the vertex shader (plus tessellation and geometry shaders, when they are used), as well as VkPipelineViewportStateCreateInfo, VkPipelineRasterizationStateCreateInfo, VkPipelineTessellationStateCreateInfo, and VkRenderPass (or a VkPipelineRenderingCreateInfo when we use dynamic rendering). This is a lot of information, but we can get away with giving just the shader code and the pipeline layout when we use dynamic state.
- Fragment shader: covers the fragment shader, as well as VkPipelineDepthStencilStateCreateInfo and VkRenderPass (or a VkPipelineRenderingCreateInfo when we use dynamic rendering; actually, that structure's viewMask field is the only one we need to provide in this context).
- Fragment output interface: covers VkPipelineColorBlendStateCreateInfo, VkPipelineMultisampleStateCreateInfo, and VkRenderPass (that last one only when dynamic rendering is not used). This part contains no shaders and is therefore fast to create.
To build such pipeline parts/libraries, we use vkCreateGraphicsPipelines the usual way, except that we only provide the information relevant to the parts we are actually creating. We specify which ones these are explicitly, via a VkGraphicsPipelineLibraryCreateInfoEXT structure we store in VkGraphicsPipelineCreateInfo's pNext chain. Note that creating several parts in a single call does not link them together. If we want to be able to optimize the result of the linking operation for our pipeline parts later, we should ask Vulkan to keep additional information about all of them using the VK_PIPELINE_CREATE_RETAIN_LINK_TIME_OPTIMIZATION_INFO_BIT_EXT flag.
The graphics pipeline library extension deprecates vkCreateShaderModule. We should just pass our VkShaderModuleCreateInfo directly through VkPipelineShaderStageCreateInfo's pNext field instead.
To link all parts together, we use a VkPipelineLibraryCreateInfoKHR that we pass via VkGraphicsPipelineCreateInfo's pNext chain. We can (and usually should) enable the VK_PIPELINE_CREATE_LINK_TIME_OPTIMIZATION_BIT_EXT flag to ensure that Vulkan fully optimizes the resulting pipeline. Using an unoptimized pipeline while waiting for the optimized version of it to be ready makes sense in some contexts.
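A sketch of that linking step, assuming the four parts were each built beforehand as libraries (i.e., with VK_PIPELINE_CREATE_LIBRARY_BIT_KHR and, ideally, the retain-link-time-optimization flag); the part handles and pipelineLayout are placeholders:

```cpp
VkPipeline parts[4] = {vertexInputPart, preRasterPart, fragmentShaderPart, fragmentOutputPart};

VkPipelineLibraryCreateInfoKHR libraryInfo{};
libraryInfo.sType = VK_STRUCTURE_TYPE_PIPELINE_LIBRARY_CREATE_INFO_KHR;
libraryInfo.libraryCount = 4;
libraryInfo.pLibraries = parts;

VkGraphicsPipelineCreateInfo linkInfo{};
linkInfo.sType = VK_STRUCTURE_TYPE_GRAPHICS_PIPELINE_CREATE_INFO;
linkInfo.pNext = &libraryInfo;
linkInfo.flags = VK_PIPELINE_CREATE_LINK_TIME_OPTIMIZATION_BIT_EXT;  // ask for a fully optimized pipeline
linkInfo.layout = pipelineLayout;

VkPipeline pipeline;
vkCreateGraphicsPipelines(device, VK_NULL_HANDLE, 1, &linkInfo, nullptr, &pipeline);
```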
If two different pipeline parts access different sets, the compiler may end up doing funky descriptor set aliasing, as it does not have a global view. For instance, if the vertex and the fragment shaders use distinct sets, the driver may only remember that the fragment shader uses one set and that the vertex shader only uses one set as well. Critically, the fact that these sets are distinct can get lost along the way. We can use the VK_PIPELINE_LAYOUT_CREATE_INDEPENDENT_SETS_BIT_EXT flag for pipeline layouts to tell the compiler to be extra careful about this.
D. Synchronization
Modern Vulkan brings some quality of life features for synchronization. The new features are nice, but the changes are not that impactful for most engine designs. They are now part of Vulkan's core.
D.1. Timeline semaphores
Timeline semaphores are a generalization of semaphores (the GPU-GPU synchronization primitive) and of fences (the CPU-GPU synchronization primitive). They are in fact implemented as an extension of vanilla Vulkan 1.0 semaphores. Khronos provides both a sample and a blog post on this topic.
To create a timeline semaphore, we pass a VkSemaphoreTypeCreateInfo structure with field semaphoreType set to VK_SEMAPHORE_TYPE_TIMELINE via VkSemaphoreCreateInfo's pNext chain. Note that the type of timeline semaphores remains VkSemaphore.
Whereas plain semaphores were basically booleans, timeline semaphores carry a 64-bit integer payload. We can pick the initial value of this payload at creation time (through VkSemaphoreTypeCreateInfo's initialValue field). We are only ever allowed to increase this value, and we use specific values to represent particular states, defining an encoding of our own.
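A sketch of creating a timeline semaphore whose payload starts at 0:

```cpp
VkSemaphoreTypeCreateInfo typeInfo{};
typeInfo.sType = VK_STRUCTURE_TYPE_SEMAPHORE_TYPE_CREATE_INFO;
typeInfo.semaphoreType = VK_SEMAPHORE_TYPE_TIMELINE;
typeInfo.initialValue = 0;

VkSemaphoreCreateInfo semaphoreInfo{};
semaphoreInfo.sType = VK_STRUCTURE_TYPE_SEMAPHORE_CREATE_INFO;
semaphoreInfo.pNext = &typeInfo;

VkSemaphore timeline;   // the handle type stays VkSemaphore
vkCreateSemaphore(device, &semaphoreInfo, nullptr, &timeline);
```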
We can interact with timeline semaphores from the GPU, using them as semaphores. For instance, we can have a queue submission set their value to something of our choosing once its work completes, by inserting a VkTimelineSemaphoreSubmitInfo structure in VkSubmitInfo's pNext chain. In this structure, we store a wait (respectively, signal) value for each wait (respectively, signal) semaphore that we pass to VkSubmitInfo (we must also store values for binary semaphores that way, though in practice we almost never mix timeline and binary semaphores; this is not a real issue). A wait finishes when the semaphore's value becomes greater than or equal to the specified value, and signal operations must keep the payload increasing.
We can also interact with timeline semaphores from the CPU, using them as fences. We wait on them using vkWaitSemaphores (see VkSemaphoreWaitInfo), and we signal them using vkSignalSemaphore (see VkSemaphoreSignalInfo). We can also just read the current value of a timeline semaphore using vkGetSemaphoreCounterValue.
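A sketch combining both sides: the submission signals the timeline with value 2 when its work completes, and the CPU waits for that value (the one-second timeout is arbitrary; queue, commandBuffer and timeline are placeholders):

```cpp
uint64_t signalValue = 2;
VkTimelineSemaphoreSubmitInfo timelineInfo{};
timelineInfo.sType = VK_STRUCTURE_TYPE_TIMELINE_SEMAPHORE_SUBMIT_INFO;
timelineInfo.signalSemaphoreValueCount = 1;
timelineInfo.pSignalSemaphoreValues = &signalValue;

VkSubmitInfo submitInfo{};
submitInfo.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
submitInfo.pNext = &timelineInfo;
submitInfo.commandBufferCount = 1;
submitInfo.pCommandBuffers = &commandBuffer;
submitInfo.signalSemaphoreCount = 1;
submitInfo.pSignalSemaphores = &timeline;
vkQueueSubmit(queue, 1, &submitInfo, VK_NULL_HANDLE);

// Block on the CPU until the GPU work above has completed.
VkSemaphoreWaitInfo waitInfo{};
waitInfo.sType = VK_STRUCTURE_TYPE_SEMAPHORE_WAIT_INFO;
waitInfo.semaphoreCount = 1;
waitInfo.pSemaphores = &timeline;
waitInfo.pValues = &signalValue;
vkWaitSemaphores(device, &waitInfo, 1000000000);   // timeout in nanoseconds
```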
There may be a device-dependent limit on the maximum difference between the current value of a semaphore and that of any pending wait or signal operation. We can read this limit in the maxTimelineSemaphoreValueDifference field of the VkPhysicalDeviceTimelineSemaphoreProperties structure (obtained via vkGetPhysicalDeviceProperties2's pNext chain).
D.2. Synchronization 2
Synchronization 2 improves pipeline barriers, events, image layout transitions and queue submissions. It does all of this through the introduction of the VkDependencyInfo structure, which centralizes all barrier information. The Vulkan guide contains a page with more information on the topic. Synchronization 2 also introduces constructs that make use of this structure:
- vkCmdPipelineBarrier2, which we use to insert a memory dependency;
- vkCmdWaitEvents2/vkCmdSetEvent2, which we use to interact with events (a feature we described in chapter 1; they are basically split barriers).
A VkDependencyInfo is a collection of VkMemoryBarrier2, VkBufferMemoryBarrier2, and VkImageMemoryBarrier2 structures. These look like the barriers we are familiar with, except that we also store stage information in them (as VkPipelineStageFlags2 masks, which are like VkPipelineStageFlags, except that the stages are split differently; note that the top/bottom-of-pipe stages have been replaced by VK_PIPELINE_STAGE_2_NONE_KHR; we used to provide the stage information via arguments of vkCmdPipelineBarrier).
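A sketch of an image layout transition expressed this way, moving `image` from transfer destination to shader-readable (assuming Vulkan 1.3 core names; with the extension alone, the KHR-suffixed equivalents apply):

```cpp
VkImageMemoryBarrier2 barrier{};
barrier.sType = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER_2;
barrier.srcStageMask = VK_PIPELINE_STAGE_2_COPY_BIT;           // stage info now lives in the barrier itself
barrier.srcAccessMask = VK_ACCESS_2_TRANSFER_WRITE_BIT;
barrier.dstStageMask = VK_PIPELINE_STAGE_2_FRAGMENT_SHADER_BIT;
barrier.dstAccessMask = VK_ACCESS_2_SHADER_SAMPLED_READ_BIT;
barrier.oldLayout = VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL;
barrier.newLayout = VK_IMAGE_LAYOUT_READ_ONLY_OPTIMAL;         // one of the new "deduced" layouts
barrier.image = image;
barrier.subresourceRange = {VK_IMAGE_ASPECT_COLOR_BIT, 0, 1, 0, 1};

VkDependencyInfo dependencyInfo{};
dependencyInfo.sType = VK_STRUCTURE_TYPE_DEPENDENCY_INFO;
dependencyInfo.imageMemoryBarrierCount = 1;
dependencyInfo.pImageMemoryBarriers = &barrier;

vkCmdPipelineBarrier2(commandBuffer, &dependencyInfo);
```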
Furthermore, synchronization 2 introduces vkQueueSubmit2, an alternative submission command that takes VkSubmitInfo2 arguments. This structure is defined much like VkSubmitInfo, save for its use of arrays of VkSemaphoreSubmitInfo structures for describing the wait and the signal operations, and the presence of a flags field. In addition to making timeline semaphore management more natural, VkSemaphoreSubmitInfo defines a deviceIndex field for when we use device groups (with device groups being a Vulkan 1.1 core feature that we briefly discuss in section H — in short, it is about handling distinct physical devices as a single logical one; when we are not using this feature, i.e., almost always, we just leave it at 0).
If we use the now deprecated render passes, we should make use of the VkSubpassDependency2 structure (in practice, we only ever use render passes when we target mobile devices, which we can't really expect to support anything newer than Vulkan 1.0 for now, which implies no synchronization 2, so yeah).
Finally, synchronization 2 brings in some new image layouts (VK_IMAGE_LAYOUT_ATTACHMENT_OPTIMAL_KHR and VK_IMAGE_LAYOUT_READ_ONLY_OPTIMAL_KHR). This is just a quality of life feature (before, we had to spell out, e.g., VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL; now, Vulkan just deduces from context that the attachment the transition is applied to is a depth/stencil buffer).
D.3. Getting rid of image layouts
Most (if not all) modern GPUs use a single image layout in practice. For these devices, setting up barriers all over the place to leave room for layout transitions is an exercise in futility. The VK_KHR_unified_image_layouts device extension (released in 2025) enables using the VK_IMAGE_LAYOUT_GENERAL layout almost anywhere: only initialization and presentation still require layout transitions. Devices that implement this extension guarantee that this does not come with downsides for performance. Khronos has a blog post about this extension. Support for it is still limited, and some older devices are fundamentally incompatible with it (so this is not just a matter of updating drivers).
E. Indirect rendering
Indirect rendering is a generalization of instancing where the meshes are allowed to differ (it is of course not as efficient as bona fide instancing, but it still gives us a way of rendering multiple objects via a single draw call, and each draw call incurs a cost). The trick is that the mesh data for all the objects is located in the same buffer (the one bound when we emit the indirect draw call), and that we provide the draw call arguments indirectly via a VkBuffer object (which we could generate from the GPU: this technique enables a form of GPU-driven rendering; it is possible to go even further). Basic indirect draws are part of core Vulkan 1.0, and the count-based variants became core in Vulkan 1.2. We access this functionality through commands such as vkCmdDrawIndirect (as vkCmdDraw takes four uint32_t arguments, we fill the indirect buffer with groups of four uint32_t, i.e., VkDrawIndirectCommand structures; if we were using vkCmdDrawIndexedIndirect instead, it would be five). The mesh data itself is still stored in the buffers bound via vkCmdBindIndexBuffer/vkCmdBindVertexBuffers.
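A sketch: fill a host-mapped indirect buffer with two draws and issue them in one call (mappedIndirectBuffer, indirectBuffer and vertexBuffer are placeholders; the indirect buffer must have been created with VK_BUFFER_USAGE_INDIRECT_BUFFER_BIT, and a drawCount greater than 1 requires the multiDrawIndirect feature):

```cpp
// Write the draw arguments: {vertexCount, instanceCount, firstVertex, firstInstance}.
auto* commands = static_cast<VkDrawIndirectCommand*>(mappedIndirectBuffer);
commands[0] = {36, 1, 0, 0};     // first object: 36 vertices starting at vertex 0
commands[1] = {240, 1, 36, 1};   // second object: 240 vertices starting at vertex 36

// The vertex data for all objects lives in the buffers bound beforehand.
VkDeviceSize offset = 0;
vkCmdBindVertexBuffers(commandBuffer, 0, 1, &vertexBuffer, &offset);
vkCmdDrawIndirect(commandBuffer, indirectBuffer, 0, 2, sizeof(VkDrawIndirectCommand));
```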
Khronos has a sample related to this technique. There is also this great video by Aurailus, and this vkguide.dev page.
F. Mesh shading
F.1. Working principle
GPUs have evolved from processors for a fixed rendering pipeline based on predetermined functions (where only some parameters could be tweaked) to much more general and flexible devices integrating user-defined programs (shaders). Modern GPUs even support arbitrary parallel computations through compute shaders, be they graphics-related or not. In a sense, mesh shaders are a continuation of this evolution process. With the traditional pipeline, the rasterization of geometric primitives happens once all of the input assembly, vertex shading, tessellation, and geometry shading steps are over. Pre-rasterization steps typically rely on hardwired behavior. Long story short, the (quite complex) traditional pipeline is very efficient for typical workloads, but its rigidity can be quite limiting in specific contexts (the fixed input assembly and tessellation steps are especially likely to lead to avoidable bottlenecks). The main idea behind mesh shading is that we could skip the pre-rasterization portion of the pipeline and produce primitives from compute shaders instead. The VK_EXT_mesh_shader device extension introduces an alternative, more flexible pipeline for graphics.
Mesh shading is a power tool that we should use very sparingly: it makes things even more low-level than they are by default, without guaranteeing any increase in performance; in fact, for classical workloads, we should expect it to make things worse. Mesh shading makes sense in contexts where the bottleneck is due to the hardwired nature of the pre-rasterization steps. Typical use cases include loading very detailed geometry (as it allows for very efficient culling and Nanite-style dynamic level of detail shenanigans) or generating isosurfaces. Mesh shading can give us something with the behavior of a geometry shader but without its awful performance. The ray tracing pipeline (not covered in this guide) is distinct from the mesh shading pipeline, so there is no direct way of combining the two. Furthermore, mesh shading has poor performance on tiling architectures. It is also hard to write a one-size-fits-all mesh shader; it is common to implement distinct versions of the same shader for distinct manufacturers.
Mesh shading makes use of two kinds of shaders: task and mesh shaders. Both of them are glorified compute shaders, and they cooperate to generate meshes in parallel within a workgroup. Only mesh shaders are strictly required, and it is from them that the primitives are actually generated. However, workgroups running mesh shaders are subject to some limitations regarding how many primitives they can output. This has two main consequences:
- If we are rendering a very large/detailed object, we should break it down into smaller submeshes, which we call meshlets. Building good meshlets is a costly operation that should almost never be done live. Instead, we should store pre-computed meshlet information in the game files. What is hard about generating meshlets is that we want each of them to be nice and local (an ideal meshlet is a set of primitives forming a circular patch, not a thin stripe); that is not a trivial problem, but there are good third-party tools out there (such as meshoptimizer). Having good meshlets makes culling more efficient.
- We have to schedule an appropriate number of workgroup runs for each mesh. For instance, if we want to render an object built out of 1300 faces and each workgroup can output only 128 primitives, then we should split that object into at least 11 meshlets and schedule 11 workgroup runs.
With techniques such as tessellation, the amount of geometric detail of an object varies dynamically. Unlike the more common technique of switching the model for a more detailed one depending on the distance, tessellation happens entirely on the GPU. If we want to emulate tessellation using mesh shaders, we have an issue, as we are still bound by the limits of mesh shaders: if the model becomes more detailed, then we need to split it into meshlets in a different way since none of them should use more primitives than what the device supports. We discussed earlier how meshlets should be generated statically; it is actually possible to devise schemes to precompute information that can be used to efficiently produce dynamically optimized meshlets, as exemplified by Nanite. Furthermore, for some applications, efficient heuristics can produce good enough meshlets at a low cost without any form of precomputation. This is where the (optional) task shader comes into play: the role of this shader is to dynamically schedule runs of the mesh shader and provide them with arguments.
F.2. Mesh shading in practice
F.2.1. Enabling mesh shading
We enable the mesh shading extension for a device by passing VK_EXT_MESH_SHADER_EXTENSION_NAME through VkDeviceCreateInfo's ppEnabledExtensionNames. Moreover, we can check whether a device supports all of its features by calling vkGetPhysicalDeviceFeatures2 and passing a pointer to a VkPhysicalDeviceMeshShaderFeaturesEXT structure through VkPhysicalDeviceFeatures2's pNext chain (we especially care about its taskShader and meshShader fields). To enable the chosen features, we pass a VkPhysicalDeviceFeatures2 (with the same chain) through the pNext chain of the VkDeviceCreateInfo structure we give to vkCreateDevice. To query additional information about mesh shading support for a specific device (e.g., limits), we pass a pointer to a VkPhysicalDeviceMeshShaderPropertiesEXT structure through VkPhysicalDeviceProperties2's pNext chain, and we use this structure in a call to vkGetPhysicalDeviceProperties2.
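A sketch of querying and then enabling task/mesh shader support (physicalDevice and device are placeholders; queue setup is omitted, and in a real application we would trim the enabled features down to what we actually use):

```cpp
VkPhysicalDeviceMeshShaderFeaturesEXT meshFeatures{};
meshFeatures.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_MESH_SHADER_FEATURES_EXT;

VkPhysicalDeviceFeatures2 features2{};
features2.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_FEATURES_2;
features2.pNext = &meshFeatures;
vkGetPhysicalDeviceFeatures2(physicalDevice, &features2);   // fills meshFeatures.taskShader/meshShader

if (meshFeatures.taskShader && meshFeatures.meshShader) {
    const char* extensions[] = {VK_EXT_MESH_SHADER_EXTENSION_NAME};
    VkDeviceCreateInfo deviceInfo{};
    deviceInfo.sType = VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO;
    deviceInfo.pNext = &features2;                           // re-used here to enable the queried features
    deviceInfo.enabledExtensionCount = 1;
    deviceInfo.ppEnabledExtensionNames = extensions;
    // ... queue create infos, then vkCreateDevice(physicalDevice, &deviceInfo, nullptr, &device) ...
}
```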
F.2.2. Shaders
Both mesh and task shaders should include the #extension GL_EXT_mesh_shader: require directive. We also specify the dimensions of a workgroup in the same manner as for compute shaders, as compute shaders is what they actually are (so, something like layout(local_size_x = 2, local_size_y = 2, local_size_z = 1) in;). The typical limit for the size of a workgroup is 128.
Task shaders take no inputs (besides the builtin workgroup identifiers that are precisely the same as those of compute shaders). A workgroup is a group of task shader invocations. A single invocation typically processes one meshlet (i.e., it decides whether it is to be rendered or dropped; of course, this is not really an option for complex renderers that generate meshlets on the fly). All invocations within a workgroup cooperate to emit an appropriate number of mesh tasks via EmitMeshTasksEXT(x, y, z);. This is a GLSL command whose parameters represent the number of mesh workgroups to generate (although this command appears in all task shader invocations, it is only ever evaluated in the first one; see here for details). Additionally, we may define data to be forwarded to the mesh tasks; this data is uniformly accessible to all created workgroups. We do this by declaring a variable of the form taskPayloadSharedEXT SharedData sharedData; (assuming that we defined a structure type called SharedData prior to that point) globally in the task shader, and by assigning a value to it in the shader's body. We can define at most one such variable, and we should strive to keep payloads as compact as possible for performance reasons.
In mesh shaders, we set a maximum number of vertices and of primitives built out of these vertices that the workgroup may emit, as in layout(max_vertices = 128, max_primitives = 128) out; (we always reason in terms of workgroups since this is where the parallelism comes from, and we want parallelism). The actual number of vertices/primitives may vary dynamically, but it must always be within the limits we defined; we use SetMeshOutputsEXT(vertexCount, primCount); to communicate what we actually output from the workgroup. We also specify what kind of primitives our mesh shader produces with a statement such as layout(triangles) out; (the only alternatives are points or lines). A typical mesh shader invocation handles one or two primitives. We may output additional data for the fragment shader, as in layout(location = 0) out vec3 vertColor[];; this is an array with one value per vertex. To output such additional data on a per-primitive basis instead, we use perprimitiveEXT, as in perprimitiveEXT layout(location = 1) out vec3 primNormal[];. Mesh shaders take the payload output from the task shader as a read-only input. To access this data, we need to include a declaration similar to that found in the task shader; e.g., taskPayloadSharedEXT SharedData sharedData;.
GLSL defines some write-only variables for use by mesh shaders: we should write the vertices we create in gl_MeshVerticesEXT, write the triangles we create in gl_PrimitiveTriangleIndicesEXT (alternatively, gl_PrimitiveLineIndicesEXT or gl_PrimitivePointIndicesEXT; we give, e.g., triangles as triplets of indices into gl_MeshVerticesEXT), and optionally share some predefined per-primitive data for use by later stages (those that carry over from the traditional graphics pipeline), using gl_MeshPrimitivesEXT. All of these variables have an array type; check the spec for details (in particular, gl_MeshVerticesEXT and gl_MeshPrimitivesEXT are arrays of structs defined there).
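Putting the pieces above together, here is a toy mesh shader sketch where each workgroup emits a single hard-coded triangle (no task shader or payload is involved):

```glsl
#version 460
#extension GL_EXT_mesh_shader : require

layout(local_size_x = 1) in;
layout(triangles, max_vertices = 3, max_primitives = 1) out;

layout(location = 0) out vec3 vertColor[];   // one value per emitted vertex

void main() {
    SetMeshOutputsEXT(3, 1);                 // this workgroup outputs 3 vertices, 1 primitive
    gl_MeshVerticesEXT[0].gl_Position = vec4(-0.5, -0.5, 0.0, 1.0);
    gl_MeshVerticesEXT[1].gl_Position = vec4( 0.5, -0.5, 0.0, 1.0);
    gl_MeshVerticesEXT[2].gl_Position = vec4( 0.0,  0.5, 0.0, 1.0);
    vertColor[0] = vec3(1.0, 0.0, 0.0);
    vertColor[1] = vec3(0.0, 1.0, 0.0);
    vertColor[2] = vec3(0.0, 0.0, 1.0);
    gl_PrimitiveTriangleIndicesEXT[0] = uvec3(0, 1, 2);   // indices into gl_MeshVerticesEXT
}
```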
To cull specific primitives, we can do something like gl_MeshPrimitivesEXT[i].gl_CullPrimitiveEXT = true (doc 1, doc 2) from a mesh shader. Arseny Kapoulkine has a nice blog post on the topic.
F.2.3. Building and using a mesh shading pipeline object
We define the configuration of a mesh shading pipeline using a standard graphics pipeline object, except that we set both pVertexInputState and pInputAssemblyState to null in VkGraphicsPipelineCreateInfo. Furthermore, we provide VkPipelineShaderStageCreateInfo objects corresponding to our task/mesh shaders. We use vkCreateGraphicsPipelines to create a mesh pipeline (yes, this is the same function as for the traditional graphics pipeline) and vkCmdBindPipeline to bind it.
We emit mesh shading-based draw calls in a command buffer using vkCmdDrawMeshTasksEXT, which simply takes the workgroup grid's dimensions (for the task shader if present, otherwise for the mesh shader) as arguments (we may also use one of its indirect variants, vkCmdDrawMeshTasksIndirectEXT and vkCmdDrawMeshTasksIndirectCountEXT, which read VkDrawMeshTasksIndirectCommandEXT structures from a buffer).
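A sketch of the corresponding draw call (meshPipeline and meshletCount are placeholders; here we launch one workgroup per meshlet along the X dimension):

```cpp
vkCmdBindPipeline(commandBuffer, VK_PIPELINE_BIND_POINT_GRAPHICS, meshPipeline);
vkCmdDrawMeshTasksEXT(commandBuffer, meshletCount, 1, 1);
```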
F.X. Additional resources
Khronos provides two samples related to mesh shading: a very basic one (without the usual detailed README) and a more advanced one. They also have a blog post on the topic. Moreover, the related GLSL/OpenGL extension has its own specification (information about finer aspects of task/mesh shaders is sparse outside of this document). You may also want to take a look at the following materials:
- A very clear XDC 2022 presentation by Ricardo Garcia (the short format implies that it is not too detailed; still, it contains good practical advice).
- A nice blog post by Jglrxavpok.
- A Vulkanised 2023 presentation by Timur Kristóf about mesh shading in Vulkan (also see this blog post of his, though it predates the Vulkan extension he helped develop).
- A SIGGRAPH presentation by Brian Karis about Nanite, a component of Unreal Engine 5 that relies on mesh shading. Note that Nanite also handles rasterization in a custom manner: large enough triangles go through the classical hardware rasterizer, but they handle small triangles through a custom, compute shader-based rasterizer. They do this because the hardwired rasterizers of modern GPUs have bad performance for small/thin triangles (SimonDev has a great video on the topic). The results for both types of triangles are then assembled in another compute shader, where color information also gets computed (they use some interesting tricks, as described in the presentation). A tangentially related blog post by Maister contains some additional information on implementing a Vulkan-based Nanite-like renderer; this series of posts by Jglrxavpok also looks very interesting, and so does this blog post by John Hable (whose entire blog is worth checking out). For justification as to why rendering pipelines are based on blocks of 2x2 pixels (which is the root of most hardware rasterizer limitations), see this blog post on derivatives.
G. Subgroups
Subgroups offer a variant of the shared variables we discussed in the compute chapter (remember that these are variables whose value is shared among all invocations in the same workgroup). Subgroups can be smaller than workgroups, but sharing data within them has much better performance (as they basically map to the SIMD lanes of a single compute unit). Moreover, the subgroup mechanism can be used in all shader types instead of just compute ones. This is a feature that became core in version 1.1 of the Vulkan standard.
For instance, we can get the sum of all values of a GLSL variable in the invocations of the current subgroup through a simple operation. Alternatively, we may check if a condition is true for all invocations, or do the sum only for the invocations with a lower id than the current one, or broadcast a value from a precise invocation to all other ones in the subgroup, or shuffle values, or pick the maximum value, or apply a 2D operation working on groups of 2x2 invocations, etc. This extension is almost only GLSL-side (the Vulkan API just changes to enable querying for subgroup support for devices).
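A small GLSL sketch of a few of these operations in a compute shader (the buffer layout and names are placeholders):

```glsl
#version 460
#extension GL_KHR_shader_subgroup_arithmetic : require
#extension GL_KHR_shader_subgroup_vote : require
#extension GL_KHR_shader_subgroup_ballot : require

layout(local_size_x = 64) in;
layout(std430, binding = 0) buffer Data { float values[]; };

void main() {
    float v = values[gl_GlobalInvocationID.x];
    float subgroupSum = subgroupAdd(v);             // sum over the whole subgroup
    float prefixSum   = subgroupExclusiveAdd(v);    // sum over invocations with a lower id
    bool  allPositive = subgroupAll(v > 0.0);       // is the condition true for every invocation?
    float fromFirst   = subgroupBroadcastFirst(v);  // value held by the first active invocation
    // ...
}
```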
There are many things left to say on this topic, but this is where this section ends. You should check additional resources for more information. This blog post by Khronos is a great starting point. There is also a 2018 Vulkan Developer Day presentation by Daniel Koch. You may be interested in the maximal reconvergence extension, which makes the semantics of, well, reconvergence more intuitive. I defer the explanation of this extension (whose utility is not limited to subgroup operations) to the Khronos blog post released alongside it.
H. Device groups
Device groups were introduced in an extension that has been part of core Vulkan since version 1.1. The feature enables using distinct physical GPUs as if they were one and the same. This is mostly useful for doing things à la NVLink. It is very niche, so we will not discuss it in further detail: just know that it exists.
X. Additional resources
Charles Giessen gave a presentation about modern renderers at Vulkanised 2025, where he discusses most of the techniques mentioned above. It can serve as a good recap of this chapter.