Chapter 6: modern Vulkan
In the previous chapters, we mostly discussed core Vulkan 1.0, which was released all the way back in 2016. Since then, Vulkan has gone through four major updates (as of 2026) that improved the developer experience and accommodated new features introduced by GPU vendors. Changes to the base API or new features considered important enough get merged into the core specification, while extensions are introduced to cover features that are either uncommon or still in a state of flux.
Many of these changes aim at making the lives of Vulkan users easier, be it by building alternatives to existing abstractions or by introducing higher-level interfaces around GPU concepts (which is a balancing act: we would not want to end up with a bloated specification, nor to reduce the amount of control over GPUs that Vulkan currently provides). New versions mostly add things, but some features get deprecated over time, meaning that we can still use them, but we should feel bad if we do (see this list of deprecated features).
A. Dynamic rendering
In the graphics chapter, we went over Vulkan's classical rendering pipeline. Although most of that chapter remains valid, some important components thereof are now deprecated: exit render passes, subpasses and framebuffers, enter dynamic rendering. Dynamic rendering is a more flexible and less verbose interface for rendering that comes without any performance cost. In other words, it is a plainly superior abstraction, which explains the deprecation of render passes and associated constructs. An update that actually simplifies things is a rare thing that we should cherish! However, mobile drivers are lagging behind as far as support for modern Vulkan goes, so dynamic rendering can only be recommended for desktop-only development as of 2026.
The main idea behind dynamic rendering is that we can reference rendering attachments directly from the command buffer instead of declaring subpasses upfront. We ditch vkCmdBeginRenderPass/vkCmdEndRenderPass pairs in favor of the vkCmdBeginRendering and vkCmdEndRendering commands. We must provide a VkRenderingInfo structure when we open such a rendering context. This structure plays a role similar to that of VkRenderPassBeginInfo's framebuffer object. It specifies:
- Which attachments are available to the rendering operations:
  - A single depth and a single stencil attachment. These are considered distinct for future-proofing reasons, though they should point to the same resource in practice
  - An arbitrary number of color attachments
  (We do not mention input attachments, as these are handled entirely through descriptors.)
- The render's dimensions
- A view mask (for use with multiview)
- A layer count (only used when the view mask is left at 0)
- The active rendering flags; we use them to suspend/resume rendering operations or to indicate that the draw calls are emitted from secondary command buffers
We can have multiple draw operations in a single rendering block, with the restriction that all of them share the same depth and stencil attachments. Color attachments are also shared globally: we control which of them gets written to by a rendering operation through the GLSL location keyword only. We describe (non-input) attachments using VkRenderingAttachmentInfo objects, which specify an image view and the layout it will be in at rendering time, resolve information for multisampling (a resolve mode, a target image view and a layout; resolving runs after rendering, and the resolve mode gives us a fair amount of control over how it runs), and a pair of load/store operations (plus a clear value for clearing the attachment on load, used if we set the load operation to clear). VkRenderingAttachmentInfo is pretty much the new VkAttachmentDescription. The main differences are that the image view is referenced directly and that layout transitions are not handled automatically.
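To make this concrete, here is a minimal sketch of opening a dynamic rendering block with one color and one depth attachment (colorView, depthView, width, height and commandBuffer are placeholder names):

```cpp
// Describe the color attachment: clear on load, keep the result.
VkRenderingAttachmentInfo colorAttachment{};
colorAttachment.sType = VK_STRUCTURE_TYPE_RENDERING_ATTACHMENT_INFO;
colorAttachment.imageView = colorView;
colorAttachment.imageLayout = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL;   // layout at rendering time
colorAttachment.loadOp = VK_ATTACHMENT_LOAD_OP_CLEAR;
colorAttachment.storeOp = VK_ATTACHMENT_STORE_OP_STORE;
colorAttachment.clearValue.color = {{0.0f, 0.0f, 0.0f, 1.0f}};

// Describe the depth attachment: clear on load, discard after rendering.
VkRenderingAttachmentInfo depthAttachment{};
depthAttachment.sType = VK_STRUCTURE_TYPE_RENDERING_ATTACHMENT_INFO;
depthAttachment.imageView = depthView;
depthAttachment.imageLayout = VK_IMAGE_LAYOUT_DEPTH_ATTACHMENT_OPTIMAL;
depthAttachment.loadOp = VK_ATTACHMENT_LOAD_OP_CLEAR;
depthAttachment.storeOp = VK_ATTACHMENT_STORE_OP_DONT_CARE;
depthAttachment.clearValue.depthStencil = {1.0f, 0};

// The rendering block itself: dimensions, attachments, no multiview.
VkRenderingInfo renderingInfo{};
renderingInfo.sType = VK_STRUCTURE_TYPE_RENDERING_INFO;
renderingInfo.renderArea = {{0, 0}, {width, height}};
renderingInfo.layerCount = 1;                        // view mask left at 0
renderingInfo.colorAttachmentCount = 1;
renderingInfo.pColorAttachments = &colorAttachment;
renderingInfo.pDepthAttachment = &depthAttachment;

vkCmdBeginRendering(commandBuffer, &renderingInfo);
// ... vkCmdBindPipeline, vkCmdDraw, etc. ...
vkCmdEndRendering(commandBuffer);
```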
When creating pipelines, we pass a VkPipelineRenderingCreateInfo object through VkGraphicsPipelineCreateInfo's pNext field. This object specifies the format of the attachments (plus some additional information for multiview rendering). We do not have to deal with render pass compatibility anymore! Also, we should pass a null handle instead of the render pass object we used to provide.
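A sketch of that chaining (colorFormat and depthFormat are placeholder VkFormat variables):

```cpp
// The pipeline only needs to know the attachment formats, not a render pass.
VkPipelineRenderingCreateInfo renderingCreateInfo{};
renderingCreateInfo.sType = VK_STRUCTURE_TYPE_PIPELINE_RENDERING_CREATE_INFO;
renderingCreateInfo.colorAttachmentCount = 1;
renderingCreateInfo.pColorAttachmentFormats = &colorFormat;
renderingCreateInfo.depthAttachmentFormat = depthFormat;

VkGraphicsPipelineCreateInfo pipelineInfo{};
pipelineInfo.sType = VK_STRUCTURE_TYPE_GRAPHICS_PIPELINE_CREATE_INFO;
pipelineInfo.pNext = &renderingCreateInfo;   // replaces the render pass information
pipelineInfo.renderPass = VK_NULL_HANDLE;    // no render pass object anymore
// ... shader stages and the rest of the pipeline state as usual ...
```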
Since there are no more explicitly encoded dependencies, we have to handle synchronization and transitions ourselves through memory barriers. It used to be that render passes yielded better performance on tiled implementations, but parity with the render pass-based approach was achieved through the addition of the VK_IMAGE_LAYOUT_RENDERING_LOCAL_READ image layout, which indicates that only pixel-local accesses are allowed (we need to perform the transitions to this layout explicitly, but this is less work than defining render passes).
Khronos provides two relevant samples: one about forward shading and another one about deferred shading/local reads. Note that in these samples, the features are presented as if they were part of extensions, although dynamic rendering has been part of Vulkan's core since version 1.3 (and local reads since version 1.4).
B. Beyond descriptor sets
Descriptor sets are another major pain point of Vulkan 1.0: we have to manage these heterogeneous collections of objects, and we are responsible for grouping them into sets based on the update frequency of the bindings. Descriptor sets are subject to many limitations: we cannot update descriptors while they are bound in a recorded (but not yet executed) command buffer, all bound descriptors must be valid (even when they end up not being accessed), there is a maximum number of descriptors, etc. Furthermore, the entire descriptor model is quite cryptic: unlike everywhere else in Vulkan, we do not handle memory directly. What is the arcane concept of descriptor pools really hiding?
In this section, we discuss three Vulkan extensions that make descriptors more flexible and less magical. The first two extensions are now part of Vulkan's core, and the last one may get the same treatment at some point:
- Descriptor indexing allows us to define unbounded arrays of descriptors, to update bound descriptors, and to do non-uniform array indexing; it also relaxes the requirement that unused descriptors be valid.
- Buffer device addresses enable accessing buffers directly through a raw address. We can use this technique to avoid having to bind one descriptor per resource.
- The two previous extensions punched holes into the descriptor set abstraction. The descriptor buffer extension gets rid of this abstraction entirely, and it makes us responsible for handling the storage of descriptors in buffer objects. This enables advanced techniques such as building descriptor buffers from the GPU itself. However, it comes at the cost of a lot of tedium — in most cases, we are better off without this extension.
B.1. Descriptor indexing/bindless descriptors
What if we were to store all the required textures for a set of objects in a large array? Then, we could bind this array to a descriptor once, and only pass indices into this array to all objects. That way, we would not have to constantly bind and unbind the texture data. This is actually something that we could do in Vulkan 1.0, but the lack of flexibility of descriptor sets limits our options. The descriptor indexing extension, which is now a core part of Vulkan, improves the situation. Bindless descriptors do have an impact on performance, however small: there is just the cost of that one additional indirection.
Khronos provides a sample showcasing this feature. There, the descriptor ids are passed as per-vertex data (using the flat GLSL keyword to disable interpolation: each fragment inherits the exact data of one of its surrounding vertices). If we do not need the data to vary per vertex, we can send such ids through push constants instead.
Activation
To enable descriptor indexing, we pass a VkPhysicalDeviceDescriptorIndexingFeatures structure to vkCreateDevice, via VkDeviceCreateInfo's pNext pointer.
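A sketch of that activation, with a few commonly used feature booleans set (which ones you need depends on your renderer; physicalDevice and device are placeholders, and queue setup is omitted):

```cpp
// Request the descriptor indexing features we intend to use.
VkPhysicalDeviceDescriptorIndexingFeatures indexingFeatures{};
indexingFeatures.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_DESCRIPTOR_INDEXING_FEATURES;
indexingFeatures.runtimeDescriptorArray = VK_TRUE;                       // unbounded arrays in shaders
indexingFeatures.descriptorBindingPartiallyBound = VK_TRUE;              // unused descriptors may be invalid
indexingFeatures.descriptorBindingSampledImageUpdateAfterBind = VK_TRUE; // update-after-bind for sampled images
indexingFeatures.shaderSampledImageArrayNonUniformIndexing = VK_TRUE;    // non-uniform indexing

VkDeviceCreateInfo deviceInfo{};
deviceInfo.sType = VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO;
deviceInfo.pNext = &indexingFeatures;   // chained into device creation
// ... queue create infos, enabled extensions, etc. ...
vkCreateDevice(physicalDevice, &deviceInfo, nullptr, &device);
```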
Update-after-bind
With the descriptor indexing feature active, we can update bound descriptors (we can even update different ones from different threads). We must activate some flags for each descriptor set layout whose contents we want to update that way: VK_DESCRIPTOR_SET_LAYOUT_CREATE_UPDATE_AFTER_BIND_POOL_BIT, via VkDescriptorSetLayoutCreateInfo's flags field, and VK_DESCRIPTOR_BINDING_UPDATE_AFTER_BIND_BIT (which guarantees that the implementation observes descriptor updates: the most recent version at submission time is used), via a VkDescriptorSetLayoutBindingFlagsCreateInfo structure that we provide through VkDescriptorSetLayoutCreateInfo's pNext field; this structure holds an array of binding flags, with one entry per binding in our layout. Some more related flags are available:
- A VK_DESCRIPTOR_BINDING_PARTIALLY_BOUND_BIT (to allow having invalid descriptors so long as they are not actively used)
- A VK_DESCRIPTOR_BINDING_UPDATE_UNUSED_WHILE_PENDING_BIT (to enable updates to descriptors that are not used by an active command buffer; by default, updates are only permitted before submission — when the previous flag is also active, the property of being used by a shader is defined dynamically, as opposed to the static default definition)
- A VK_DESCRIPTOR_BINDING_VARIABLE_DESCRIPTOR_COUNT_BIT (to make the number of descriptors in the binding variable; i.e., it will only be known when an actual descriptor set is allocated against this layout, and the descriptorCount field is interpreted as a maximum — we can only use this for the last binding in a layout)
Similarly, we must create the descriptor pool with the VK_DESCRIPTOR_POOL_CREATE_UPDATE_AFTER_BIND_BIT flag (via VkDescriptorPoolCreateInfo's flags field). Finally, we perform the actual update in the usual way, i.e., using vkUpdateDescriptorSets.
Note that with update-after-bind, the GPU driver can only make weaker assumptions for optimizations, but the flexibility gains can easily make up for that.
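A sketch of an update-after-bind layout holding a large, partially bound texture array (the 4096 count is an arbitrary upper bound; device and setLayout are placeholders):

```cpp
// One flags entry per binding in the layout (here, a single binding).
VkDescriptorBindingFlags bindingFlags =
    VK_DESCRIPTOR_BINDING_UPDATE_AFTER_BIND_BIT |
    VK_DESCRIPTOR_BINDING_PARTIALLY_BOUND_BIT;

VkDescriptorSetLayoutBindingFlagsCreateInfo bindingFlagsInfo{};
bindingFlagsInfo.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_SET_LAYOUT_BINDING_FLAGS_CREATE_INFO;
bindingFlagsInfo.bindingCount = 1;
bindingFlagsInfo.pBindingFlags = &bindingFlags;

VkDescriptorSetLayoutBinding binding{};
binding.binding = 0;
binding.descriptorType = VK_DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER;
binding.descriptorCount = 4096;                       // large bindless texture array
binding.stageFlags = VK_SHADER_STAGE_FRAGMENT_BIT;

VkDescriptorSetLayoutCreateInfo layoutInfo{};
layoutInfo.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_SET_LAYOUT_CREATE_INFO;
layoutInfo.pNext = &bindingFlagsInfo;
layoutInfo.flags = VK_DESCRIPTOR_SET_LAYOUT_CREATE_UPDATE_AFTER_BIND_POOL_BIT;
layoutInfo.bindingCount = 1;
layoutInfo.pBindings = &binding;
vkCreateDescriptorSetLayout(device, &layoutInfo, nullptr, &setLayout);

// The pool backing such sets needs the matching flag as well:
// poolInfo.flags |= VK_DESCRIPTOR_POOL_CREATE_UPDATE_AFTER_BIND_BIT;
```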
Non-uniform indexing
Indexing into descriptor arrays from shaders is quite limited in Vulkan 1.0: only constant indexing is guaranteed to be supported. Assuming that a device supports the appropriate ArrayDynamicIndexing physical device features, it may also use "dynamically uniform" indexing, i.e.:
- In a compute context: the index may be a variable, but it must resolve to the same value for all invocations in the same workgroup.
- In a graphics context: the index may be a variable, but it must resolve to the same value for all threads spawned from the same draw command (yes, that is more limiting).
Non-uniform indexing makes all indexing-related restrictions go away: we just have to mark non-dynamically uniform indices with the nonuniformEXT GLSL keyword. This qualifier is defined in a GLSL extension, which we load via #extension GL_EXT_nonuniform_qualifier : require.
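A minimal fragment shader sketch of this qualifier (the unbounded array also requires the runtimeDescriptorArray feature; names like materialId are placeholders):

```glsl
#version 460
#extension GL_EXT_nonuniform_qualifier : require

layout(set = 0, binding = 0) uniform sampler2D textures[]; // unbounded descriptor array
layout(location = 0) flat in uint materialId;              // per-vertex index, not interpolated
layout(location = 1) in vec2 uv;
layout(location = 0) out vec4 outColor;

void main() {
    // The index may differ between invocations, hence the nonuniformEXT qualifier.
    outColor = texture(textures[nonuniformEXT(materialId)], uv);
}
```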
B.2. Buffer device addresses
What if we were able to manipulate GPU virtual addresses from our applications? Then, we could use GPU-side pointers to read from Vulkan buffers, with all the usual pointer arithmetic tricks applying. We could use such a feature to build a buffer with all the data required by all invocations of a given shader. Then, instead of binding and unbinding descriptor sets for each mesh that relies on this shader, we could simply forward the address of the relevant portion of the buffer via a push constant. Note that the buffer itself never needs to be bound to a descriptor set. This is what buffer device addresses are about (and Khronos once again provides a sample for this feature).
To enable this feature for a specific buffer, we must create it with the VK_BUFFER_USAGE_SHADER_DEVICE_ADDRESS_BIT flag (from VkBufferUsageFlagBits). Similarly, when we allocate memory that we eventually want to bind such a buffer to, we pass a VkMemoryAllocateFlagsInfo structure including VK_MEMORY_ALLOCATE_DEVICE_ADDRESS_BIT to VkMemoryAllocateInfo's pNext field. With all of this done, we can query the address of our buffer through vkGetBufferDeviceAddress (we can then do pointer arithmetic, so long as we remain in-bounds).
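A sketch of the whole dance (the bufferDeviceAddress device feature must also have been enabled at device creation; size, memoryTypeIndex and the handles are placeholders):

```cpp
// Create a buffer that can be addressed from shaders.
VkBufferCreateInfo bufferInfo{};
bufferInfo.sType = VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO;
bufferInfo.size = size;
bufferInfo.usage = VK_BUFFER_USAGE_STORAGE_BUFFER_BIT | VK_BUFFER_USAGE_SHADER_DEVICE_ADDRESS_BIT;
vkCreateBuffer(device, &bufferInfo, nullptr, &buffer);

VkMemoryRequirements memoryRequirements;
vkGetBufferMemoryRequirements(device, buffer, &memoryRequirements);

// The backing memory needs the matching allocation flag.
VkMemoryAllocateFlagsInfo allocFlags{};
allocFlags.sType = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_FLAGS_INFO;
allocFlags.flags = VK_MEMORY_ALLOCATE_DEVICE_ADDRESS_BIT;

VkMemoryAllocateInfo allocInfo{};
allocInfo.sType = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO;
allocInfo.pNext = &allocFlags;
allocInfo.allocationSize = memoryRequirements.size;
allocInfo.memoryTypeIndex = memoryTypeIndex;
vkAllocateMemory(device, &allocInfo, nullptr, &memory);
vkBindBufferMemory(device, buffer, memory, 0);

// Query the GPU virtual address; in-bounds offsets can be added to it.
VkBufferDeviceAddressInfo addressInfo{};
addressInfo.sType = VK_STRUCTURE_TYPE_BUFFER_DEVICE_ADDRESS_INFO;
addressInfo.buffer = buffer;
VkDeviceAddress address = vkGetBufferDeviceAddress(device, &addressInfo);
// `address` can now be handed to shaders, e.g. via a push constant.
```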
To use buffer addresses within GLSL shaders, we again need to activate a GLSL extension (#extension GL_EXT_buffer_reference : require). This extension introduces the buffer_reference and buffer_reference_align qualifiers (the latter being optional, we only use it if we require aligned addresses):
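A minimal sketch of such a declaration (the Vertices type name and its member are placeholders):

```glsl
#extension GL_EXT_buffer_reference : require

// Declares a pointer type: a 64-bit device address interpreted as an array of vec4s.
layout(buffer_reference, std430, buffer_reference_align = 16) buffer Vertices {
    vec4 positions[];
};
```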
Note that the above does not describe a buffer but the type of a pointer to a buffer (in particular, we do not provide bindings for such declarations). We use such pointers by referencing them through other objects, such as the push constant in the example below:
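A minimal sketch of such a push constant (Registers and pc are placeholder names; Vertices is the pointer type declared above):

```glsl
layout(push_constant) uniform Registers {
    Vertices vertices;   // a 64-bit device address, set from the CPU via vkCmdPushConstants
} pc;

void main() {
    // Dereference the pointer like a regular buffer.
    vec4 position = pc.vertices.positions[gl_VertexIndex];
    // ...
}
```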
We can use buffer references with push constants, uniform buffers and storage buffers alike, and we are responsible for ensuring that the shaders never read from addresses that are not part of addressable buffers. We cannot store every resource type inside buffers (e.g., this is not possible for images), so we cannot use this feature for all of our descriptors.
B.3. Descriptor buffers
The two extensions above punched holes through the classical descriptor set abstraction. What if we were to go further and do away with descriptor sets and pools entirely? This is what the descriptor buffers extension is about: it enables storing descriptors inside normal buffers (though we keep using descriptor set layouts to describe shader interfaces), and puts us in charge of their memory management. This brings flexibility benefits, though at the price of more complex code.
Descriptor buffers are not part of Vulkan's core: to use this functionality, we must enable the VK_EXT_descriptor_buffer device extension (which is not universally supported). Note that this extension builds upon the previously defined notion of buffer device addresses. Also, as per usual, Khronos provides a sample illustrating this feature (as well as a blog post).
We create descriptor buffers just like we would create a normal device-addressable buffer, except that we pass the VK_BUFFER_USAGE_RESOURCE_DESCRIPTOR_BUFFER_BIT_EXT flag to VkBufferCreateInfo in addition to VK_BUFFER_USAGE_SHADER_DEVICE_ADDRESS_BIT (if we want to store samplers or combined image samplers in the buffer, we also need the VK_BUFFER_USAGE_SAMPLER_DESCRIPTOR_BUFFER_BIT_EXT flag).
Descriptor buffers store descriptor data, but different devices encode this information in different ways, so we have to go through a little song and dance to write the data in a generic way. We use vkGetDescriptorSetLayoutSizeEXT to get the amount of memory required to store all descriptors from a given descriptor set layout, and vkGetDescriptorSetLayoutBindingOffsetEXT to get the offset of a binding in that space. Finally, we obtain the encoded data corresponding to a descriptor using vkGetDescriptorEXT. This function takes a VkDescriptorGetInfoEXT structure whose data field is a VkDescriptorDataEXT, a union of (mostly) VkDescriptorImageInfo/VkDescriptorAddressInfoEXT pointers (one per descriptor kind); it writes the encoded descriptor at a given address (pointer arithmetic comes in handy for computing this one).
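A sketch of writing one uniform buffer descriptor into a host-mapped descriptor buffer (setLayout, uniformBufferAddress, uniformBufferRange, mappedDescriptorBuffer and descriptorBufferProperties are placeholders; the latter is a VkPhysicalDeviceDescriptorBufferPropertiesEXT we queried beforehand):

```cpp
VkDeviceSize layoutSize = 0, bindingOffset = 0;
vkGetDescriptorSetLayoutSizeEXT(device, setLayout, &layoutSize);                 // space needed for the whole layout
vkGetDescriptorSetLayoutBindingOffsetEXT(device, setLayout, 0, &bindingOffset);  // where binding 0 lives in that space

// Point at the actual uniform buffer through its device address.
VkDescriptorAddressInfoEXT addressInfo{};
addressInfo.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_ADDRESS_INFO_EXT;
addressInfo.address = uniformBufferAddress;   // from vkGetBufferDeviceAddress
addressInfo.range = uniformBufferRange;

VkDescriptorGetInfoEXT getInfo{};
getInfo.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_GET_INFO_EXT;
getInfo.type = VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER;
getInfo.data.pUniformBuffer = &addressInfo;   // the VkDescriptorDataEXT union

// Write the device-specific descriptor encoding directly into the mapped descriptor buffer.
vkGetDescriptorEXT(device, &getInfo,
                   descriptorBufferProperties.uniformBufferDescriptorSize,
                   static_cast<char*>(mappedDescriptorBuffer) + bindingOffset);
```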
We use vkCmdBindDescriptorBuffersEXT to bind descriptor buffers to a command buffer, and we turn to vkCmdSetDescriptorBufferOffsetsEXT to index into a bound buffer.
We must respect the limits described in VkPhysicalDeviceDescriptorBufferPropertiesEXT (we get this structure through vkGetPhysicalDeviceProperties2, which behaves just like the deprecated vkGetPhysicalDeviceProperties but also returns information about extensions or new features through its pNext chain).
So, descriptor buffers are much more low-level than the previous extensions. What do we get in exchange for this tedium? Well, it merely enables updating descriptors directly from the GPU. This is nice in principle, but the usecases are limited in practice. We are better off without descriptor buffers in most scenarios.
B.X. Additional resources
- A blog post about descriptor indexing by Chunk Stories
- A note by DethRaid about descriptor indexing
- A talk by Sean Barrett about virtual textures
- A video by Aurailus about sparse bindless texture arrays (in OpenGL, also touches upon texture compression). The video feels clear BUT it contains a bunch of mistakes, as pointed out by a commenter; see this bonus page for a link, and the whole picture
- A blog post by Faith Ekstrand about descriptors and the underlying hardware models they are meant to abstract away
C. Improving the shaders/pipelines situation
Building pipelines is costly, as it is at this stage that all shaders are compiled and optimized. Ideally, we should compile all our pipelines in advance to avoid performance hitches. This is not always possible in practice, but we still strive to minimize the amount of runtime work required. We are however limited in that pursuit by the rigidity of pipelines, which leads to an absurd amount of duplicated work.
In this section, we discuss two extensions that are concerned with making pipelines more modular. The first extension (shader objects) is quite radical, in that it gets rid of the concept of pipelines altogether, and proposes a return to what is basically the OpenGL model: we build individual shader objects that we link (and cross-optimize) on the fly. The alternative (graphics pipeline libraries) allows us to split pipelines into four pieces, and to combine (and cross-optimize) these pieces on the fly. Neither of these extensions is (yet) part of Vulkan's core.
C.1. Shader objects
The VK_EXT_shader_object device extension enables specifying pipeline shaders and state without pipeline objects. It makes it possible to break compilation into two parts: we can precompile all shaders separately, and only link/further optimize the shaders based on context at runtime. This is closer to the way OpenGL and older APIs work, and it comes with some performance cost. The tradeoff can be positive in many situations. Khronos provides a sample and a blog post for this extension. Also, note that shader objects do not (yet) support ray tracing.
To use shader objects, we must both enable the VK_EXT_SHADER_OBJECT_EXTENSION_NAME device extension (through VkDeviceCreateInfo's ppEnabledExtensionNames field) and pass a VkPhysicalDeviceShaderObjectFeaturesEXT structure with shaderObject set to VK_TRUE (through its pNext chain).
We create shader objects using vkCreateShadersEXT, which takes a set of VkShaderCreateInfoEXT structures as arguments. This structure specifies all there is to know about the shader: its code, its interface (descriptor set layouts, push constant ranges, specialization constants), and information for linking (in this context, linking mostly means "optimizing a shader based on its context"). This function returns a set of VkShaderEXT handles.
To link shader objects, we just set the VK_SHADER_CREATE_LINK_STAGE_BIT_EXT flag in the targeted VkShaderCreateInfoEXT objects of a vkCreateShadersEXT call. A single call to this function can link together a single sequence of stages (as in, the stages need to actually follow each other in the graphics pipeline). We can define as many unlinked shader objects as we want in a single call. We can even create a mixture of linked and unlinked shader objects through the same call, although, since complex restrictions apply in this setting, it is safer to use one call for each linked sequence of shaders and another one for all the unlinked ones.
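A sketch of creating a linked vertex/fragment pair (vertSpirv/fragSpirv are placeholder std::vector<uint32_t> holding SPIR-V, and setLayout describes the shared interface):

```cpp
VkShaderCreateInfoEXT infos[2]{};
for (VkShaderCreateInfoEXT& info : infos) {
    info.sType = VK_STRUCTURE_TYPE_SHADER_CREATE_INFO_EXT;
    info.flags = VK_SHADER_CREATE_LINK_STAGE_BIT_EXT;   // link the stages of this call together
    info.codeType = VK_SHADER_CODE_TYPE_SPIRV_EXT;
    info.pName = "main";
    info.setLayoutCount = 1;
    info.pSetLayouts = &setLayout;
}
infos[0].stage = VK_SHADER_STAGE_VERTEX_BIT;
infos[0].nextStage = VK_SHADER_STAGE_FRAGMENT_BIT;      // which stage may follow this one
infos[0].codeSize = vertSpirv.size() * sizeof(uint32_t);
infos[0].pCode = vertSpirv.data();
infos[1].stage = VK_SHADER_STAGE_FRAGMENT_BIT;
infos[1].codeSize = fragSpirv.size() * sizeof(uint32_t);
infos[1].pCode = fragSpirv.data();

VkShaderEXT shaders[2];
vkCreateShadersEXT(device, 2, infos, nullptr, shaders);
```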
We bind shader objects via vkCmdBindShadersEXT. We can bind linked and unlinked objects alike, although we should not expect optimal performance for the latter. When using a linked shader object, we must also bind each and every shader that it was linked to. Bound shaders are used by the following compute/draw calls.
In addition to the shader objects, we must provide any additional state information that was originally passed in the pipeline object. We now consider all of this state dynamic and we set it using the appropriate functions, as described in the spec. Launching an operation on the GPU before all the required state has been bound is an error.
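A sketch of binding the shader objects created above and setting a few representative pieces of dynamic state before drawing (assuming Vulkan 1.3 core names for the state setters; with older headers, the EXT-suffixed variants apply, and viewport/scissor are placeholders):

```cpp
// Bind the linked vertex/fragment shader objects.
VkShaderStageFlagBits stages[2] = {VK_SHADER_STAGE_VERTEX_BIT, VK_SHADER_STAGE_FRAGMENT_BIT};
vkCmdBindShadersEXT(commandBuffer, 2, stages, shaders);

// All pipeline state is now dynamic; only a few of the required setters are shown here.
vkCmdSetViewportWithCount(commandBuffer, 1, &viewport);
vkCmdSetScissorWithCount(commandBuffer, 1, &scissor);
vkCmdSetPrimitiveTopology(commandBuffer, VK_PRIMITIVE_TOPOLOGY_TRIANGLE_LIST);
vkCmdSetCullMode(commandBuffer, VK_CULL_MODE_BACK_BIT);
// ... every other piece of state listed in the spec must be set before drawing ...

vkCmdDraw(commandBuffer, vertexCount, 1, 0, 0);
```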
C.2. Graphics pipeline libraries
In the previous section, we discussed a way of getting rid of pipeline objects entirely. This was maybe a bit harsh on them; after all, though they are annoyingly monolithic, they come with good performance once compiled, and we may not want to compromise on performance. Could we not keep pipelines but make them more modular? This would enable all kinds of reuse, which would help with the combinatorial explosion problem. This is what the VK_EXT_graphics_pipeline_library device extension is about. Khronos provides a sample and a blog post for this extension. Note that another extension extends this approach to ray tracing pipelines (as discussed here).
After loading the extension (by passing VK_EXT_GRAPHICS_PIPELINE_LIBRARY_EXTENSION_NAME to VkDeviceCreateInfo's ppEnabledExtensionNames field), we can start defining independent pieces of pipeline objects. We cannot break up our pipelines in any way we want. Instead, there are four predefined parts (aka libraries; see the spec for more detail):
- Vertex input interface: covers VkPipelineVertexInputStateCreateInfo and VkPipelineInputAssemblyStateCreateInfo. This part contains no shaders and is therefore fast to create.
- Pre-rasterization shaders: covers the vertex shader (plus tessellation and geometry shaders, when they are used), as well as VkPipelineViewportStateCreateInfo, VkPipelineRasterizationStateCreateInfo, VkPipelineTessellationStateCreateInfo, and VkRenderPass (or a VkPipelineRenderingCreateInfo when we use dynamic rendering). This is a lot of information, but we can get away with giving just the shader code and the pipeline layout when we use dynamic state.
- Fragment shader: covers the fragment shader, as well as VkPipelineDepthStencilStateCreateInfo and VkRenderPass (or a VkPipelineRenderingCreateInfo when we use dynamic rendering; actually, that structure's viewMask field is the only one we need to provide in this context).
- Fragment output interface: covers VkPipelineColorBlendStateCreateInfo, VkPipelineMultisampleStateCreateInfo, and VkRenderPass (that last one only when dynamic rendering is not used). This part contains no shaders and is therefore fast to create.
To build such pipeline parts/libraries, we use vkCreateGraphicsPipelines the usual way, except that we only provide the information relevant to the parts we are actually creating. We specify which ones these are explicitly, via a VkGraphicsPipelineLibraryCreateInfoEXT structure we store in VkGraphicsPipelineCreateInfo's pNext chain. Note that creating several parts in a single call does not link them together. If we want to be able to optimize the result of the linking operation for our pipeline parts later, we should ask Vulkan to keep additional information about all of them using the VK_PIPELINE_CREATE_RETAIN_LINK_TIME_OPTIMIZATION_INFO_BIT_EXT flag.
The graphics pipeline library extension deprecates vkCreateShaderModule. We should just pass our VkShaderModuleCreateInfo directly through VkPipelineShaderStageCreateInfo's pNext field instead.
To link all parts together, we use a VkPipelineLibraryCreateInfoKHR that we pass via VkGraphicsPipelineCreateInfo's pNext chain. We can (and usually should) enable the VK_PIPELINE_CREATE_LINK_TIME_OPTIMIZATION_BIT_EXT flag to ensure that Vulkan fully optimizes the resulting pipeline. Using an unoptimized pipeline while waiting for the optimized version of it to be ready makes sense in some contexts.
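A sketch of that linking step, assuming the four parts were each built beforehand as libraries (i.e., with VK_PIPELINE_CREATE_LIBRARY_BIT_KHR and, ideally, the retain-link-time-optimization flag); the part handles and pipelineLayout are placeholders:

```cpp
VkPipeline parts[4] = {vertexInputPart, preRasterPart, fragmentShaderPart, fragmentOutputPart};

VkPipelineLibraryCreateInfoKHR libraryInfo{};
libraryInfo.sType = VK_STRUCTURE_TYPE_PIPELINE_LIBRARY_CREATE_INFO_KHR;
libraryInfo.libraryCount = 4;
libraryInfo.pLibraries = parts;

VkGraphicsPipelineCreateInfo linkInfo{};
linkInfo.sType = VK_STRUCTURE_TYPE_GRAPHICS_PIPELINE_CREATE_INFO;
linkInfo.pNext = &libraryInfo;
linkInfo.flags = VK_PIPELINE_CREATE_LINK_TIME_OPTIMIZATION_BIT_EXT;  // ask for a fully optimized pipeline
linkInfo.layout = pipelineLayout;

VkPipeline pipeline;
vkCreateGraphicsPipelines(device, VK_NULL_HANDLE, 1, &linkInfo, nullptr, &pipeline);
```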
If two different pipeline parts access different sets, the compiler may end up doing funky descriptor set aliasing, as it does not have a global view. For instance, if the vertex and the fragment shaders use distinct sets, the driver may only remember that the fragment shader uses one set and that the vertex shader only uses one set as well. Critically, the fact that these sets are distinct can get lost along the way. We can use the VK_PIPELINE_LAYOUT_CREATE_INDEPENDENT_SETS_BIT_EXT flag for pipeline layouts to tell the compiler to be extra careful about this.
D. Synchronization
Modern Vulkan brings some quality of life features for synchronization. The new features are nice, but the changes are not that impactful for most engine designs. They are now part of Vulkan's core.
D.1. Timeline semaphores
Timeline semaphores are a generalization of semaphores (the GPU-GPU synchronization primitive) and of fences (the CPU-GPU synchronization primitive). They are in fact implemented as an extension of vanilla Vulkan 1.0 semaphores. Khronos provides both a sample and a blog post on this topic.
To create a timeline semaphore, we pass a VkSemaphoreTypeCreateInfo structure with field semaphoreType set to VK_SEMAPHORE_TYPE_TIMELINE via VkSemaphoreCreateInfo's pNext chain. Note that the type of timeline semaphores remains VkSemaphore.
Whereas plain semaphores were basically booleans, timeline semaphores carry a 64-bit integer payload. We can pick the initial value of this payload at creation time (through VkSemaphoreTypeCreateInfo's initialValue field). We are only ever allowed to increase this value, and we use specific values to represent particular states, defining an encoding of our own.
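A sketch of creating a timeline semaphore whose payload starts at 0:

```cpp
VkSemaphoreTypeCreateInfo typeInfo{};
typeInfo.sType = VK_STRUCTURE_TYPE_SEMAPHORE_TYPE_CREATE_INFO;
typeInfo.semaphoreType = VK_SEMAPHORE_TYPE_TIMELINE;
typeInfo.initialValue = 0;

VkSemaphoreCreateInfo semaphoreInfo{};
semaphoreInfo.sType = VK_STRUCTURE_TYPE_SEMAPHORE_CREATE_INFO;
semaphoreInfo.pNext = &typeInfo;

VkSemaphore timeline;   // the handle type stays VkSemaphore
vkCreateSemaphore(device, &semaphoreInfo, nullptr, &timeline);
```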
We can interact with timeline semaphores from the GPU, using them as semaphores. For instance, we can have a queue submission set their value to something of our choosing once its work completes, by inserting a VkTimelineSemaphoreSubmitInfo structure in VkSubmitInfo's pNext chain. In this structure, we store a wait (respectively, signal) value for each wait (respectively, signal) semaphore that we pass to VkSubmitInfo (we must also store values for binary semaphores that way, though in practice we almost never mix timeline and binary semaphores; this is not a real issue). A wait finishes when the semaphore's value becomes greater than or equal to the specified value, and signal operations must keep the payload increasing.
We can also interact with timeline semaphores from the CPU, using them as fences. We wait on them using vkWaitSemaphores (see VkSemaphoreWaitInfo), and we signal them using vkSignalSemaphore (see VkSemaphoreSignalInfo). We can also just read the current value of a timeline semaphore using vkGetSemaphoreCounterValue.
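A sketch combining both sides: the submission signals the timeline with value 2 when its work completes, and the CPU waits for that value (the one-second timeout is arbitrary; queue, commandBuffer and timeline are placeholders):

```cpp
uint64_t signalValue = 2;
VkTimelineSemaphoreSubmitInfo timelineInfo{};
timelineInfo.sType = VK_STRUCTURE_TYPE_TIMELINE_SEMAPHORE_SUBMIT_INFO;
timelineInfo.signalSemaphoreValueCount = 1;
timelineInfo.pSignalSemaphoreValues = &signalValue;

VkSubmitInfo submitInfo{};
submitInfo.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
submitInfo.pNext = &timelineInfo;
submitInfo.commandBufferCount = 1;
submitInfo.pCommandBuffers = &commandBuffer;
submitInfo.signalSemaphoreCount = 1;
submitInfo.pSignalSemaphores = &timeline;
vkQueueSubmit(queue, 1, &submitInfo, VK_NULL_HANDLE);

// Block on the CPU until the GPU work above has completed.
VkSemaphoreWaitInfo waitInfo{};
waitInfo.sType = VK_STRUCTURE_TYPE_SEMAPHORE_WAIT_INFO;
waitInfo.semaphoreCount = 1;
waitInfo.pSemaphores = &timeline;
waitInfo.pValues = &signalValue;
vkWaitSemaphores(device, &waitInfo, 1000000000);   // timeout in nanoseconds
```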
There may be a device-dependent limit on the maximum difference between the current value of a semaphore and that of any pending wait or signal operation. We can read this limit in the maxTimelineSemaphoreValueDifference field of the VkPhysicalDeviceTimelineSemaphoreProperties structure (obtained via vkGetPhysicalDeviceProperties2's pNext chain).
D.2. Synchronization 2
Synchronization 2 improves pipeline barriers, events, image layout transitions and queue submissions. It does all of this through the introduction of the VkDependencyInfo structure, which centralizes all barrier information. The Vulkan guide contains a page with more information on the topic. Synchronization 2 also introduces constructs that make use of this structure:
- vkCmdPipelineBarrier2, which we use to insert a memory dependency;
- vkCmdWaitEvents2/vkCmdSetEvent2, which we use to interact with events (a feature we described in chapter 1; they are basically split barriers).
A VkDependencyInfo is a collection of VkMemoryBarrier2, VkBufferMemoryBarrier2, and VkImageMemoryBarrier2 structures. These look like the barriers we are familiar with, except that we also store stage information in them (as VkPipelineStageFlags2 masks, which are like VkPipelineStageFlags, except that the stages are split differently; note that the top/bottom-of-pipe stages have been replaced by VK_PIPELINE_STAGE_2_NONE_KHR; we used to provide the stage information via arguments of vkCmdPipelineBarrier).
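A sketch of an image layout transition expressed this way, moving `image` from transfer destination to shader-readable (assuming Vulkan 1.3 core names; with the extension alone, the KHR-suffixed equivalents apply):

```cpp
VkImageMemoryBarrier2 barrier{};
barrier.sType = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER_2;
barrier.srcStageMask = VK_PIPELINE_STAGE_2_COPY_BIT;           // stage info now lives in the barrier itself
barrier.srcAccessMask = VK_ACCESS_2_TRANSFER_WRITE_BIT;
barrier.dstStageMask = VK_PIPELINE_STAGE_2_FRAGMENT_SHADER_BIT;
barrier.dstAccessMask = VK_ACCESS_2_SHADER_SAMPLED_READ_BIT;
barrier.oldLayout = VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL;
barrier.newLayout = VK_IMAGE_LAYOUT_READ_ONLY_OPTIMAL;         // one of the new "deduced" layouts
barrier.image = image;
barrier.subresourceRange = {VK_IMAGE_ASPECT_COLOR_BIT, 0, 1, 0, 1};

VkDependencyInfo dependencyInfo{};
dependencyInfo.sType = VK_STRUCTURE_TYPE_DEPENDENCY_INFO;
dependencyInfo.imageMemoryBarrierCount = 1;
dependencyInfo.pImageMemoryBarriers = &barrier;

vkCmdPipelineBarrier2(commandBuffer, &dependencyInfo);
```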
Furthermore, synchronization 2 introduces vkQueueSubmit2, an alternative submission command that takes VkSubmitInfo2 arguments. This structure is defined much like VkSubmitInfo, save for its use of arrays of VkSemaphoreSubmitInfo structures for describing the wait and the signal operations, and the presence of a flags field. In addition to making timeline semaphore management more natural, VkSemaphoreSubmitInfo defines a deviceIndex field for when we use device groups (with device groups being a Vulkan 1.1 core feature that we briefly discuss in section H — in short, it is about handling distinct physical devices as a single logical one; when we are not using this feature, i.e., almost always, we just leave it at 0).
If we use the now deprecated render passes, we should make use of the VkSubpassDependency2 structure (in practice, we only ever use render passes when we target mobile devices, which we can't really expect to support anything newer than Vulkan 1.0 for now, which implies no synchronization 2, so yeah).
Finally, synchronization 2 brings in some new image layouts (VK_IMAGE_LAYOUT_ATTACHMENT_OPTIMAL_KHR and VK_IMAGE_LAYOUT_READ_ONLY_OPTIMAL_KHR). This is just a quality of life feature (before, we had to spell out, e.g., VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL; now, Vulkan just deduces from context that the attachment the transition is applied to is a depth/stencil buffer).
D.3. Getting rid of image layouts
Most (if not all) modern GPUs use a single image layout in practice. For these devices, setting up barriers all over the place to leave room for layout transitions is an exercise in futility. The VK_KHR_unified_image_layouts device extension (released in 2025) enables using the VK_IMAGE_LAYOUT_GENERAL layout almost anywhere: only initialization and presentation still require layout transitions. Devices that implement this extension guarantee that this does not come with downsides for performance. Khronos has a blog post about this extension. Support for it is still limited, and some older devices are fundamentally incompatible with it (so this is not just a matter of updating drivers).
E. Indirect rendering
Indirect rendering is a generalization of instancing where the meshes are allowed to differ (it is of course not as efficient as bona fide instancing, but it still gives us a way of rendering multiple objects via a single draw call, and each draw call incurs a cost). The trick is that the mesh data for all the objects is located in the same buffer (the one bound when we emit the indirect draw call), and that we provide the draw call arguments indirectly via a VkBuffer object (which we could generate from the GPU: this technique enables a form of GPU-driven rendering; it is possible to go even further). Basic indirect draws are part of core Vulkan 1.0, and the count-based variants became core in Vulkan 1.2. We access this functionality through commands such as vkCmdDrawIndirect (as vkCmdDraw takes four uint32_t arguments, we fill the indirect buffer with groups of four uint32_t, i.e., VkDrawIndirectCommand structures; if we were using vkCmdDrawIndexedIndirect instead, it would be five). The mesh data itself is still stored in the buffers bound via vkCmdBindIndexBuffer/vkCmdBindVertexBuffers.
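A sketch: fill a host-mapped indirect buffer with two draws and issue them in one call (mappedIndirectBuffer, indirectBuffer and vertexBuffer are placeholders; the indirect buffer must have been created with VK_BUFFER_USAGE_INDIRECT_BUFFER_BIT, and a drawCount greater than 1 requires the multiDrawIndirect feature):

```cpp
// Write the draw arguments: {vertexCount, instanceCount, firstVertex, firstInstance}.
auto* commands = static_cast<VkDrawIndirectCommand*>(mappedIndirectBuffer);
commands[0] = {36, 1, 0, 0};     // first object: 36 vertices starting at vertex 0
commands[1] = {240, 1, 36, 1};   // second object: 240 vertices starting at vertex 36

// The vertex data for all objects lives in the buffers bound beforehand.
VkDeviceSize offset = 0;
vkCmdBindVertexBuffers(commandBuffer, 0, 1, &vertexBuffer, &offset);
vkCmdDrawIndirect(commandBuffer, indirectBuffer, 0, 2, sizeof(VkDrawIndirectCommand));
```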
Khronos has a sample related to this technique. There is also this great video by Aurailus, and this vkguide.dev page.
F. Mesh shading
F.1. Working principle
GPUs have evolved from processors for a fixed rendering pipeline based on predetermined functions (where only some parameters could be tweaked) to much more general and flexible devices integrating user-defined programs (shaders). Modern GPUs even support arbitrary parallel computations through compute shaders, be they graphics-related or not. In a sense, mesh shaders are a continuation of this evolution process. With the traditional pipeline, the rasterization of geometric primitives happens once all of the input assembly, vertex shading, tessellation, and geometry shading steps are over. Pre-rasterization steps typically rely on hardwired behavior. Long story short, the (quite complex) traditional pipeline is very efficient for typical workloads, but its rigidity can be quite limiting in specific contexts (the fixed input assembly and tessellation steps are especially likely to lead to avoidable bottlenecks). The main idea behind mesh shading is that we could skip the pre-rasterization portion of the pipeline and produce primitives from compute shaders instead. The VK_EXT_mesh_shader device extension introduces an alternative, more flexible pipeline for graphics.
Mesh shading is a power tool that we should use very sparingly: it makes things even more low-level than they are by default, without guaranteeing any increase in performance; in fact, for classical workloads, we should expect it to make things worse. Mesh shading makes sense in contexts where the bottleneck is due to the hardwired nature of the pre-rasterization steps. Typical use cases include loading very detailed geometry (as it allows for very efficient culling and Nanite-style dynamic level of detail shenanigans) or generating isosurfaces. Mesh shading can give us something with the behavior of a geometry shader but without its awful performance. The ray tracing pipeline (not covered in this guide) is distinct from the mesh shading pipeline, so there is no direct way of combining the two. Furthermore, mesh shading has poor performance on tiling architectures. It is also hard to write a one-size-fits-all mesh shader; it is common to implement distinct versions of the same shader for distinct manufacturers.
Mesh shading makes use of two kinds of shaders: task and mesh shaders. Both of them are glorified compute shaders, and they cooperate to generate meshes in parallel within a workgroup. Only mesh shaders are strictly required, and it is from them that the primitives are actually generated. However, workgroups running mesh shaders are subject to some limitations regarding how many primitives they can output. This has two main consequences:
- If we are rendering a very large/detailed object, we should break it down into smaller submeshes, which we call meshlets. Building good meshlets is a costly operation that should almost never be done live. Instead, we should store pre-computed meshlet information in the game files. What is hard about generating meshlets is that we want each of them to be nice and local (an ideal meshlet is a set of primitives forming a circular patch, not a thin stripe); that is not a trivial problem, but there are good third-party tools out there (such as meshoptimizer). Having good meshlets makes culling more efficient.
- We have to schedule an appropriate number of workgroup runs for each mesh. For instance, if we want to render an object built out of 1300 faces and each workgroup can output only 128 primitives, then we should split that object into at least 11 meshlets and schedule 11 workgroup runs.
With techniques such as tessellation, the amount of geometric detail of an object varies dynamically. Unlike the more common technique of switching the model for a more detailed one depending on the distance, tessellation happens entirely on the GPU. If we want to emulate tessellation using mesh shaders, we have an issue, as we are still bound by the limits of mesh shaders: if the model becomes more detailed, then we need to split it into meshlets in a different way since none of them should use more primitives than what the device supports. We discussed earlier how meshlets should be generated statically; it is actually possible to devise schemes to precompute information that can be used to efficiently produce dynamically optimized meshlets, as exemplified by Nanite. Furthermore, for some applications, efficient heuristics can produce good enough meshlets at a low cost without any form of precomputation. This is where the (optional) task shader comes into play: the role of this shader is to dynamically schedule runs of the mesh shader and provide them with arguments.
F.2. Mesh shading in practice
F.2.1. Enabling mesh shading
We enable the mesh shading extension for a device by passing VK_EXT_MESH_SHADER_EXTENSION_NAME through VkDeviceCreateInfo's ppEnabledExtensionNames. Moreover, we can check whether a device supports all of its features by calling vkGetPhysicalDeviceFeatures2 and passing a pointer to a VkPhysicalDeviceMeshShaderFeaturesEXT structure through VkPhysicalDeviceFeatures2's pNext chain (we especially care about its taskShader and meshShader fields). To enable the chosen features, we pass a VkPhysicalDeviceFeatures2 (with the same chain) through the pNext chain of the VkDeviceCreateInfo structure we give to vkCreateDevice. To query additional information about mesh shading support for a specific device (e.g., limits), we pass a pointer to a VkPhysicalDeviceMeshShaderPropertiesEXT structure through VkPhysicalDeviceProperties2's pNext chain, and we use this structure in a call to vkGetPhysicalDeviceProperties2.
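A sketch of querying and then enabling task/mesh shader support (physicalDevice and device are placeholders; queue setup is omitted, and in a real application we would trim the enabled features down to what we actually use):

```cpp
VkPhysicalDeviceMeshShaderFeaturesEXT meshFeatures{};
meshFeatures.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_MESH_SHADER_FEATURES_EXT;

VkPhysicalDeviceFeatures2 features2{};
features2.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_FEATURES_2;
features2.pNext = &meshFeatures;
vkGetPhysicalDeviceFeatures2(physicalDevice, &features2);   // fills meshFeatures.taskShader/meshShader

if (meshFeatures.taskShader && meshFeatures.meshShader) {
    const char* extensions[] = {VK_EXT_MESH_SHADER_EXTENSION_NAME};
    VkDeviceCreateInfo deviceInfo{};
    deviceInfo.sType = VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO;
    deviceInfo.pNext = &features2;                           // re-used here to enable the queried features
    deviceInfo.enabledExtensionCount = 1;
    deviceInfo.ppEnabledExtensionNames = extensions;
    // ... queue create infos, then vkCreateDevice(physicalDevice, &deviceInfo, nullptr, &device) ...
}
```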
F.2.2. Shaders
Both mesh and task shaders should include the #extension GL_EXT_mesh_shader: require directive. We also specify the dimensions of a workgroup in the same manner as for compute shaders, as compute shaders is what they actually are (so, something like layout(local_size_x = 2, local_size_y = 2, local_size_z = 1) in;). The typical limit for the size of a workgroup is 128.
Task shaders take no inputs (besides the builtin workgroup identifiers that are precisely the same as those of compute shaders). A workgroup is a group of task shader invocations. A single invocation typically processes one meshlet (i.e., it decides whether it is to be rendered or dropped; of course, this is not really an option for complex renderers that generate meshlets on the fly). All invocations within a workgroup cooperate to emit an appropriate number of mesh tasks via EmitMeshTasksEXT(x, y, z);. This is a GLSL command whose parameters represent the number of mesh workgroups to generate (although this command appears in all task shader invocations, it is only ever evaluated in the first one; see here for details). Additionally, we may define data to be forwarded to the mesh tasks; this data is uniformly accessible to all created workgroups. We do this by declaring a variable of the form taskPayloadSharedEXT SharedData sharedData; (assuming that we defined a structure type called SharedData prior to that point) globally in the task shader, and by assigning a value to it in the shader's body. We can define at most one such variable, and we should strive to keep payloads as compact as possible for performance reasons.
In mesh shaders, we set a maximum number of vertices and of primitives built out of these vertices that the workgroup may emit, as in layout(max_vertices = 128, max_primitives = 128) out; (we always reason in terms of workgroups since this is where the parallelism comes from, and we want parallelism). The actual number of vertices/primitives may vary dynamically, but it must always be within the limits we defined; we use SetMeshOutputsEXT(vertexCount, primCount); to communicate what we actually output from the workgroup. We also specify what kind of primitives our mesh shader produces with a statement such as layout(triangles) out; (the only alternatives are points or lines). A typical mesh shader invocation handles one or two primitives. We may output additional data for the fragment shader, as in layout(location = 0) out vec3 vertColor[];; this is an array with one value per vertex. To output such additional data on a per-primitive basis instead, we use perprimitiveEXT, as in perprimitiveEXT layout(location = 1) out vec3 primNormal[];. Mesh shaders take the payload output from the task shader as a read-only input. To access this data, we need to include a declaration similar to that found in the task shader; e.g., taskPayloadSharedEXT SharedData sharedData;.
GLSL defines some write-only variables for use by mesh shaders: we should write the vertices we create in gl_MeshVerticesEXT, write the triangles we create in gl_PrimitiveTriangleIndicesEXT (alternatively, gl_PrimitiveLineIndicesEXT or gl_PrimitivePointIndicesEXT; we give, e.g., triangles as triplets of indices into gl_MeshVerticesEXT), and optionally share some predefined per-primitive data for use by later stages (those that carry over from the traditional graphics pipeline), using gl_MeshPrimitivesEXT. All of these variables have an array type; check the spec for details (in particular, gl_MeshVerticesEXT and gl_MeshPrimitivesEXT are arrays of structs defined there).
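Putting the pieces above together, here is a toy mesh shader sketch where each workgroup emits a single hard-coded triangle (no task shader or payload is involved):

```glsl
#version 460
#extension GL_EXT_mesh_shader : require

layout(local_size_x = 1) in;
layout(triangles, max_vertices = 3, max_primitives = 1) out;

layout(location = 0) out vec3 vertColor[];   // one value per emitted vertex

void main() {
    SetMeshOutputsEXT(3, 1);                 // this workgroup outputs 3 vertices, 1 primitive
    gl_MeshVerticesEXT[0].gl_Position = vec4(-0.5, -0.5, 0.0, 1.0);
    gl_MeshVerticesEXT[1].gl_Position = vec4( 0.5, -0.5, 0.0, 1.0);
    gl_MeshVerticesEXT[2].gl_Position = vec4( 0.0,  0.5, 0.0, 1.0);
    vertColor[0] = vec3(1.0, 0.0, 0.0);
    vertColor[1] = vec3(0.0, 1.0, 0.0);
    vertColor[2] = vec3(0.0, 0.0, 1.0);
    gl_PrimitiveTriangleIndicesEXT[0] = uvec3(0, 1, 2);   // indices into gl_MeshVerticesEXT
}
```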
To cull specific primitives, we can do something like gl_MeshPrimitivesEXT[i].gl_CullPrimitiveEXT = true (doc 1, doc 2) from a mesh shader. Arseny Kapoulkine has a nice blog post on the topic.
F.2.3. Building and using a mesh shading pipeline object
We define the configuration of a mesh shading pipeline using a standard graphics pipeline object, except that we set both pVertexInputState and pInputAssemblyState to null in VkGraphicsPipelineCreateInfo. Furthermore, we provide VkPipelineShaderStageCreateInfo objects corresponding to our task/mesh shaders. We use vkCreateGraphicsPipelines to create a mesh pipeline (yes, this is the same function as for the traditional graphics pipeline) and vkCmdBindPipeline to bind it.
We emit mesh shading-based draw calls in a command buffer using vkCmdDrawMeshTasksEXT, which simply takes the workgroup grid's dimensions (for the task shader if present, otherwise for the mesh shader) as arguments (we may also use one of its indirect variants, vkCmdDrawMeshTasksIndirectEXT and vkCmdDrawMeshTasksIndirectCountEXT, which read VkDrawMeshTasksIndirectCommandEXT structures from a buffer).
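A sketch of the corresponding draw call (meshPipeline and meshletCount are placeholders; here we launch one workgroup per meshlet along the X dimension):

```cpp
vkCmdBindPipeline(commandBuffer, VK_PIPELINE_BIND_POINT_GRAPHICS, meshPipeline);
vkCmdDrawMeshTasksEXT(commandBuffer, meshletCount, 1, 1);
```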
F.X. Additional resources
Khronos provides two samples related to mesh shading: a very basic one (without the usual detailed README) and a more advanced one. They also have a blog post on the topic. Moreover, the related GLSL/OpenGL extension has its own specification (information about finer aspects of task/mesh shaders is sparse outside of this document). You may also want to take a look at the following materials:
- A very clear XDC 2022 presentation by Ricardo Garcia (the short format implies that it is not too detailed; still, it contains good practical advice).
- A nice blog post by Jglrxavpok.
- A Vulkanised 2023 presentation by Timur Kristóf about mesh shading in Vulkan (also see this blog post of his, though it predates the Vulkan extension he helped develop).
- A SIGGRAPH presentation by Brian Karis about Nanite, a component of Unreal Engine 5 that relies on mesh shading. Note that Nanite also handles rasterization in a custom manner: large enough triangles go through the classical hardware rasterizer, but they handle small triangles through a custom, compute shader-based rasterizer. They do this because the hardwired rasterizers of modern GPUs have bad performance for small/thin triangles (SimonDev has a great video on the topic). The results for both types of triangles are then assembled in another compute shader, where color information also gets computed (they use some interesting tricks, as described in the presentation). A tangentially related blog post by Maister contains some additional information on implementing a Vulkan-based Nanite-like renderer; this series of posts by Jglrxavpok also looks very interesting, and so does this blog post by John Hable (whose entire blog is worth checking out). For justification as to why rendering pipelines are based on blocks of 2x2 pixels (which is the root of most hardware rasterizer limitations), see this blog post on derivatives.
G. Subgroups
Subgroups offer a variant of the shared variables we discussed in the compute chapter (remember that these are variables whose value is shared among all invocations in the same workgroup). Subgroups can be smaller than workgroups, but sharing data within them has much better performance (as they basically map to the SIMD lanes of a single compute unit). Moreover, the subgroup mechanism can be used in all shader types instead of just compute ones. This is a feature that became core in version 1.1 of the Vulkan standard.
For instance, we can get the sum of all values of a GLSL variable in the invocations of the current subgroup through a simple operation. Alternatively, we may check if a condition is true for all invocations, or do the sum only for the invocations with a lower id than the current one, or broadcast a value from a precise invocation to all other ones in the subgroup, or shuffle values, or pick the maximum value, or apply a 2D operation working on groups of 2x2 invocations, etc. This extension is almost only GLSL-side (the Vulkan API just changes to enable querying for subgroup support for devices).
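A small GLSL sketch of a few of these operations in a compute shader (the buffer layout and names are placeholders):

```glsl
#version 460
#extension GL_KHR_shader_subgroup_arithmetic : require
#extension GL_KHR_shader_subgroup_vote : require
#extension GL_KHR_shader_subgroup_ballot : require

layout(local_size_x = 64) in;
layout(std430, binding = 0) buffer Data { float values[]; };

void main() {
    float v = values[gl_GlobalInvocationID.x];
    float subgroupSum = subgroupAdd(v);             // sum over the whole subgroup
    float prefixSum   = subgroupExclusiveAdd(v);    // sum over invocations with a lower id
    bool  allPositive = subgroupAll(v > 0.0);       // is the condition true for every invocation?
    float fromFirst   = subgroupBroadcastFirst(v);  // value held by the first active invocation
    // ...
}
```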
There are many things left to say on this topic, but this is where this section ends. You should check additional resources for more information. This blog post by Khronos is a great starting point. There is also a 2018 Vulkan Developer Day presentation by Daniel Koch. You may be interested in the maximal reconvergence extension, which makes the semantics of, well, reconvergence more intuitive. I defer the explanation of this extension (whose utility is not limited to subgroup operations) to the Khronos blog post released alongside it.
H. Device groups
Device groups were introduced in an extension that has been part of core Vulkan since version 1.1. The feature enables using distinct physical GPUs as if they were one and the same. This is mostly useful for doing things à la NVLink. It is very niche, so we will not discuss it in further detail: just know that it exists.
X. Additional resources
Charles Giessen gave a presentation about modern renderers at Vulkanised 2025, where he discusses most of the techniques mentioned above. It can serve as a good recap of this chapter.