Chapter 6: modern Vulkan

In the previous chapters, we mostly discussed core Vulkan 1.0, which was released all the way back in 2016. Since then, Vulkan went through four major updates (as of 2025) that improved the developer experience and accommodated the new features that recent GPUs implement. Changes to the base API or new features considered important enough are merged into the core specification, whereas the rest is provided through extensions.

Many of these changes aim at making the lives of developers using Vulkan easier, be it by building alternatives to existing constructs or by introducing higher level interfaces around GPU concepts (which is a balancing act: we would not want a bloated specification, nor would we want to reduce the amount of control that Vulkan gives developers over GPUs). New versions mostly add things, but some features got deprecated over time, meaning that we can still use them, but we should feel bad if we do (see this list of deprecated features).

A. Dynamic rendering

In the graphics chapter, we went over Vulkan's classical rendering pipeline. Although most of that chapter remains valid, some important components thereof are now deprecated: exit render passes, subpasses and framebuffers, enter dynamic rendering. Dynamic rendering is a more flexible and less verbose interface for rendering that comes without any performance cost. In other words, it is plainly a superior abstraction, which explains the deprecation of render passes and associated constructs. Having an update that simplifies things is a rare thing that we should cherish. However, mobile drivers are lagging behind as far as support for modern Vulkan goes, so dynamic rendering cannot be recommended for cross-platform development as of 2025.

The main idea behind dynamic rendering is that we can reference rendering attachments directly instead of declaring subpasses upfront. We ditch vkCmdBeginRenderPass/vkCmdEndRenderPass pairs in favor of the vkCmdBeginRendering and vkCmdEndRendering commands. We must provide a VkRenderingInfo structure when we open such a rendering context, which replaces VkRenderPassBeginInfo's framebuffer object. The new structure defines the render area, the number of layers, the view mask (for multiview), and the color, depth and stencil attachments.

We can have multiple draw operations in a single rendering block, with the restriction that all of them share the same depth and stencil attachments. Color attachments are also shared globally: we control which of them gets written to by a rendering operation through the GLSL location keyword only. We describe (non-input) attachments using VkRenderingAttachmentInfo objects, which specify an image view and the layout it will be in at rendering time, resolve information for multisampling (a resolve mode, a target image view and a layout; resolving runs after rendering, and the resolve mode gives us a fair amount of control over how it runs), and a pair of load/store operations (plus a clear value for clearing the attachment on load, if this is the load operation we request). VkRenderingAttachmentInfo is pretty much the new VkAttachmentDescription. The main differences are that the image view is referenced directly and that layout transitions are not handled automatically.
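
A minimal sketch of opening and closing such a rendering block, assuming a commandBuffer in the recording state, a colorView image view and an extent covering the render area (all hypothetical names):

VkRenderingAttachmentInfo colorAttachment{};
colorAttachment.sType = VK_STRUCTURE_TYPE_RENDERING_ATTACHMENT_INFO;
colorAttachment.imageView = colorView;
colorAttachment.imageLayout = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL;
colorAttachment.loadOp = VK_ATTACHMENT_LOAD_OP_CLEAR;
colorAttachment.storeOp = VK_ATTACHMENT_STORE_OP_STORE;
colorAttachment.clearValue.color = {{0.0f, 0.0f, 0.0f, 1.0f}};

VkRenderingInfo renderingInfo{};
renderingInfo.sType = VK_STRUCTURE_TYPE_RENDERING_INFO;
renderingInfo.renderArea = {{0, 0}, extent};
renderingInfo.layerCount = 1;
renderingInfo.colorAttachmentCount = 1;
renderingInfo.pColorAttachments = &colorAttachment;
// pDepthAttachment/pStencilAttachment would be set here if needed.

vkCmdBeginRendering(commandBuffer, &renderingInfo);
// ... bind pipeline and resources, record draw calls ...
vkCmdEndRendering(commandBuffer);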

When creating pipelines, we pass a VkPipelineRenderingCreateInfo object through VkGraphicsPipelineCreateInfo's pNext field. This object specifies the format of the attachments (plus some additional information for multiview rendering). We do not have to deal with render pass compatibility anymore! Also, we should pass a null handle instead of the render pass object we used to provide.
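
For instance, the chained structure could be filled as follows (the formats below are placeholders standing for whatever our real attachments use):

VkFormat colorFormat = VK_FORMAT_B8G8R8A8_SRGB; // e.g., the swapchain format

VkPipelineRenderingCreateInfo renderingCreateInfo{};
renderingCreateInfo.sType = VK_STRUCTURE_TYPE_PIPELINE_RENDERING_CREATE_INFO;
renderingCreateInfo.colorAttachmentCount = 1;
renderingCreateInfo.pColorAttachmentFormats = &colorFormat;
renderingCreateInfo.depthAttachmentFormat = VK_FORMAT_D32_SFLOAT;

VkGraphicsPipelineCreateInfo pipelineInfo{};
pipelineInfo.sType = VK_STRUCTURE_TYPE_GRAPHICS_PIPELINE_CREATE_INFO;
pipelineInfo.pNext = &renderingCreateInfo;
pipelineInfo.renderPass = VK_NULL_HANDLE; // no render pass with dynamic rendering
// ... the remaining fields are filled as usual ...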

Since there are no more explicitly encoded dependencies, we have to handle synchronization and transitions ourselves through memory barriers. It used to be that render passes yielded better performance on tiled implementations, but parity with the render pass-based approach was achieved through the addition of the VK_IMAGE_LAYOUT_RENDERING_LOCAL_READ image layout, which indicates that only pixel-local accesses are allowed (we need to perform the transitions to this layout explicitly, but this is less work than defining render passes).

Khronos provides two relevant samples: one about forward shading and another one about deferred shading/local reads. Note that in these samples, the features are presented as if they were part of extensions, although they have been made part of Vulkan's core in version 1.4.

B. Beyond descriptor sets

Descriptor sets are another major pain point of Vulkan 1.0: we have to manage these heterogeneous collections of objects, and we are responsible for grouping them into sets based on the update frequency of the bindings. Descriptor sets are subject to many limitations: we cannot update descriptors that are bound in a (not yet executed) command buffer, all descriptors must be bound to valid resources (even when they end up not being accessed), there is a maximum number of descriptors, etc. Furthermore, the entire descriptor model is quite cryptic: unlike everywhere else in Vulkan, we do not handle memory directly. What is the arcane concept of descriptor pools really hiding?

In this section, we discuss three Vulkan extensions that make descriptors more flexible and less magical: descriptor indexing, buffer device addresses, and descriptor buffers. The first two extensions are now part of Vulkan's core, and the last one may get the same treatment at some point.

B.1. Descriptor indexing/bindless descriptors

What if we were to store all the required textures for a set of objects in a large array? Then, we could bind this array to a descriptor once, and only pass indices into this array to all objects. That way, we would not have to constantly bind and unbind the texture data. This is actually something that we could do in Vulkan 1.0, but the lack of flexibility of descriptor sets limits our options. The descriptor indexing extension, which is now a core part of Vulkan, improves the situation. Bindless descriptors do have an impact on performance, however small: there is just the cost of that one additional indirection.

Khronos provides a sample showcasing this feature. There, the descriptor ids are passed as per-vertex data (using the flat GLSL keyword to disable interpolation: each fragment inherits the exact data of one of its surrounding vertices). If we do not need the data to vary per-vertex, we can send such ids through push constants instead.

Activation

To enable descriptor indexing, we pass a VkPhysicalDeviceDescriptorIndexingFeatures structure to vkCreateDevice, via VkDeviceCreateInfo's pNext pointer.
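
For instance, a minimal device-creation sketch enabling a few commonly used indexing-related features could look as follows (which fields we actually need depends on what our shaders do):

VkPhysicalDeviceDescriptorIndexingFeatures indexingFeatures{};
indexingFeatures.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_DESCRIPTOR_INDEXING_FEATURES;
indexingFeatures.runtimeDescriptorArray = VK_TRUE;
indexingFeatures.shaderSampledImageArrayNonUniformIndexing = VK_TRUE;
indexingFeatures.descriptorBindingSampledImageUpdateAfterBind = VK_TRUE;
indexingFeatures.descriptorBindingPartiallyBound = VK_TRUE;

VkDeviceCreateInfo deviceInfo{};
deviceInfo.sType = VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO;
deviceInfo.pNext = &indexingFeatures;
// ... queue and extension information as usual ...
vkCreateDevice(physicalDevice, &deviceInfo, nullptr, &device);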

Update-after-bind

With the descriptor indexing feature active, we can update bound descriptors (we can even update different ones from different threads). We must activate some flags for each descriptor set layout whose contents we want to update that way: VK_DESCRIPTOR_SET_LAYOUT_CREATE_UPDATE_AFTER_BIND_POOL_BIT, via VkDescriptorSetLayoutCreateInfo's flags field, and VK_DESCRIPTOR_BINDING_UPDATE_AFTER_BIND_BIT (which guarantees that the implementation observes descriptor updates: the most recent version at submission time is used), via a VkDescriptorSetLayoutBindingFlagsCreateInfo structure that we provide through VkDescriptorSetLayoutCreateInfo's pNext field; this structure holds an array of binding flags, with one entry per binding in our layout. Some more related flags are available in VkDescriptorBindingFlagBits (e.g., VK_DESCRIPTOR_BINDING_PARTIALLY_BOUND_BIT, which allows bound descriptors to be left invalid as long as they are not accessed).

Similarly, we must create the descriptor pool with the VK_DESCRIPTOR_POOL_CREATE_UPDATE_AFTER_BIND_BIT flag (via VkDescriptorPoolCreateInfo's flags field). Finally, we perform the actual update in the usual way, i.e., using vkUpdateDescriptorSets.
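
As a rough sketch (binding is a VkDescriptorSetLayoutBinding and the pool sizes are assumed to be defined elsewhere), the update-after-bind flags fit together like this:

// One flag per binding in the layout; here, a single binding.
VkDescriptorBindingFlags bindingFlags = VK_DESCRIPTOR_BINDING_UPDATE_AFTER_BIND_BIT;

VkDescriptorSetLayoutBindingFlagsCreateInfo bindingFlagsInfo{};
bindingFlagsInfo.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_SET_LAYOUT_BINDING_FLAGS_CREATE_INFO;
bindingFlagsInfo.bindingCount = 1;
bindingFlagsInfo.pBindingFlags = &bindingFlags;

VkDescriptorSetLayoutCreateInfo layoutInfo{};
layoutInfo.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_SET_LAYOUT_CREATE_INFO;
layoutInfo.pNext = &bindingFlagsInfo;
layoutInfo.flags = VK_DESCRIPTOR_SET_LAYOUT_CREATE_UPDATE_AFTER_BIND_POOL_BIT;
layoutInfo.bindingCount = 1;
layoutInfo.pBindings = &binding;

VkDescriptorPoolCreateInfo poolInfo{};
poolInfo.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_POOL_CREATE_INFO;
poolInfo.flags = VK_DESCRIPTOR_POOL_CREATE_UPDATE_AFTER_BIND_BIT;
// ... maxSets and pool sizes as usual ...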

The GPU driver can only make weaker assumptions for optimizations, but the flexibility gains can easily make up for that.

Non-uniform indexing

Indexing into descriptor arrays from shaders is quite limited in Vulkan 1.0: only constant indexing is guaranteed to be supported. Assuming that a device supports the appropriate ArrayDynamicIndexing physical device features, it may also use "dynamically uniform" indexing, i.e., indexing with values that are guaranteed to be the same for every invocation within the draw call (for instance, an index read from a uniform buffer or a push constant).

Non-uniform indexing makes all indexing-related restrictions go away: we just have to mark non-dynamically uniform indices with the nonuniformEXT GLSL keyword. This qualifier is defined in a GLSL extension, which we load via #extension GL_EXT_nonuniform_qualifier : require.

B.2. Buffer device addresses

What if we were able to manipulate GPU virtual addresses from our applications? Then, we could use GPU-side pointers to read from Vulkan buffers, with all the usual pointer arithmetic tricks applying. We could use such a feature to build a buffer with all the data required by all invocations of a given shader. Then, instead of binding and unbinding descriptor sets for each mesh that relies on this shader, we could simply forward the address of the relevant portion of the buffer via a push constant. Note that the buffer itself never needs to be bound to a descriptor set. This is what buffer device addresses are about (and Khronos once again provides a sample for this feature).

To enable this feature for a specific buffer, we must create it with the VK_BUFFER_USAGE_SHADER_DEVICE_ADDRESS_BIT flag (from VkBufferUsageFlagBits). Similarly, when we allocate memory that we eventually want to bind such a buffer to, we pass a VkMemoryAllocateFlagsInfo structure including VK_MEMORY_ALLOCATE_DEVICE_ADDRESS_BIT to VkMemoryAllocateInfo's pNext field. With all of this done, we can query the address of our buffer through vkGetBufferDeviceAddress (we can then do pointer arithmetic, so long as we remain in-bounds).
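
A sketch of the whole dance for a storage buffer (buffer sizes, memory type selection and error handling are elided):

VkBufferCreateInfo bufferInfo{};
bufferInfo.sType = VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO;
bufferInfo.size = 65536;
bufferInfo.usage = VK_BUFFER_USAGE_STORAGE_BUFFER_BIT | VK_BUFFER_USAGE_SHADER_DEVICE_ADDRESS_BIT;
// ... vkCreateBuffer, then allocate the memory it will be bound to:

VkMemoryAllocateFlagsInfo allocFlags{};
allocFlags.sType = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_FLAGS_INFO;
allocFlags.flags = VK_MEMORY_ALLOCATE_DEVICE_ADDRESS_BIT;

VkMemoryAllocateInfo allocInfo{};
allocInfo.sType = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO;
allocInfo.pNext = &allocFlags;
// ... allocationSize and memoryTypeIndex as usual, then vkAllocateMemory and vkBindBufferMemory ...

VkBufferDeviceAddressInfo addressInfo{};
addressInfo.sType = VK_STRUCTURE_TYPE_BUFFER_DEVICE_ADDRESS_INFO;
addressInfo.buffer = buffer;
VkDeviceAddress address = vkGetBufferDeviceAddress(device, &addressInfo);
// 'address' can now be forwarded to shaders, e.g., through a push constant.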

To use buffer addresses within GLSL shaders, we again need to activate a GLSL extension (#extension GL_EXT_buffer_reference : require). This extension introduces the buffer_reference and buffer_reference_align qualifiers (the latter being optional, we only use it if we require aligned addresses):

layout(std430, buffer_reference, buffer_reference_align = 8) readonly buffer Pointers { vec2 positions[]; };

Note that the above does not describe a buffer but the type of a pointer to a buffer (in particular, we do not provide bindings for such declarations). We use such pointers by referencing them through other objects, such as the push constant in the example below:

layout(std430, push_constant) uniform Registers { Pointers pointers; } registers;

We can use buffer references with push constants, uniform buffers and storage buffers alike, and we are responsible for ensuring that the shaders never read from addresses that are not part of addressable buffers. We cannot store every resource type inside buffers (e.g., this is not possible for images), so we cannot use this feature for all of our descriptors.

B.3. Descriptor buffers

The two extensions above punched holes through the classical descriptor set abstraction. What if we were to go further and do away with descriptor sets and pools entirely? This is what the descriptor buffers extension is about: it enables storing descriptors inside normal buffers (though we keep using descriptor set layouts to describe shader interfaces), and puts us in charge of their memory management. This brings flexibility benefits, though at the price of more complex code.

Descriptor buffers are not part of Vulkan's core: to use this functionality, we must enable the VK_EXT_descriptor_buffer device extension (which is not universally supported). Note that this extension builds upon the previously defined notion of buffer device addresses. Also, as per usual, Khronos provides a sample illustrating this feature (as well as a blog post).

We create descriptor buffers just like we would create a normal device-addressable buffer, except that we pass the VK_BUFFER_USAGE_RESOURCE_DESCRIPTOR_BUFFER_BIT_EXT flag to VkBufferCreateInfo in addition to VK_BUFFER_USAGE_SHADER_DEVICE_ADDRESS_BIT (buffers holding sampler descriptors need the VK_BUFFER_USAGE_SAMPLER_DESCRIPTOR_BUFFER_BIT_EXT flag instead, and combined image samplers require both flags).

Descriptor buffers store descriptor data, but different devices encode this information in different ways, so we have to go through a little song and dance to write the data in a generic way. We use vkGetDescriptorSetLayoutSizeEXT to get the amount of memory required to store all descriptors from a given descriptor set layout, and vkGetDescriptorSetLayoutBindingOffsetEXT to get the offset of a binding in that space. Finally, we obtain the data corresponding to a descriptor using vkGetDescriptorEXT. This function takes a VkDescriptorGetInfoEXT, which wraps a VkDescriptorDataEXT union of (mostly) VkDescriptorImageInfo/VkDescriptorAddressInfoEXT pointers (one member per descriptor kind); it writes the encoded descriptor at a given host address (pointer arithmetic comes in handy for computing this one).
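
A sketch of writing a single uniform buffer descriptor into a (host mapped) descriptor buffer; setLayout, uniformBufferAddress, UniformData, mappedDescriptorBuffer and descriptorBufferProperties are all placeholders:

VkDeviceSize layoutSize = 0, bindingOffset = 0;
vkGetDescriptorSetLayoutSizeEXT(device, setLayout, &layoutSize);
vkGetDescriptorSetLayoutBindingOffsetEXT(device, setLayout, /*binding*/ 0, &bindingOffset);

VkDescriptorAddressInfoEXT bufferAddressInfo{};
bufferAddressInfo.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_ADDRESS_INFO_EXT;
bufferAddressInfo.address = uniformBufferAddress; // obtained via vkGetBufferDeviceAddress
bufferAddressInfo.range = sizeof(UniformData);

VkDescriptorGetInfoEXT getInfo{};
getInfo.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_GET_INFO_EXT;
getInfo.type = VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER;
getInfo.data.pUniformBuffer = &bufferAddressInfo;

// descriptorBufferProperties is a VkPhysicalDeviceDescriptorBufferPropertiesEXT.
vkGetDescriptorEXT(device, &getInfo,
                   descriptorBufferProperties.uniformBufferDescriptorSize,
                   static_cast<char*>(mappedDescriptorBuffer) + bindingOffset);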

We use vkCmdBindDescriptorBuffersEXT to bind descriptor buffers to a command buffer, and we turn to vkCmdSetDescriptorBufferOffsetsEXT to index into a bound buffer.

We must respect the limits described in VkPhysicalDeviceDescriptorBufferPropertiesEXT (we get this structure through vkGetPhysicalDeviceProperties2, which behaves just like the original vkGetPhysicalDeviceProperties but also returns information about extensions or new features through its pNext chain).

So, descriptor buffers are much more low-level than the previous extensions. What do we get in exchange for this tedium? Well, it merely enables updating descriptors directly from the GPU. This is nice in principle, but the use cases are limited in practice. We are better off without descriptor buffers in most scenarios.

B.X. Additional resources

C. Improving the shaders/pipelines situation

Building pipelines is costly, as it is at this stage that all shaders are compiled and optimized. Ideally, we should compile all our pipelines in advance to avoid performance hitches. This is not always possible in practice, but we still strive to minimize the amount of runtime work required. We are however limited in that pursuit by the rigidity of pipelines, which leads to an absurd amount of duplicated work.

In this section, we discuss two extensions that are concerned with making pipelines more modular. The first extension (shader objects) is quite radical, in that it gets rid of the concept of pipelines altogether, and proposes a return to what is basically the OpenGL model: we build individual shader objects that we link (and cross-optimize) on the fly. The alternative (graphics pipeline libraries) allows us to split pipelines into four pieces, and to combine (and cross-optimize) these pieces on the fly. Neither of these extensions is (yet) part of Vulkan's core.

C.1. Shader objects

The VK_EXT_shader_object device extension makes it possible to specify shaders and state without pipeline objects. The shader objects extension is based on the observation that we do not need to do all of the compilation in one go. Instead, we can compile shaders separately and link/further optimize them based on context only as required (note that, in practice, drivers may already optimize things a bit for us behind the scenes). This is closer to the way OpenGL and older APIs work, and it comes with some performance cost (low for linked shaders, higher for unlinked ones, as we will see). Is this performance cost an acceptable tradeoff for avoiding the pipeline combinatorial explosion problem? Your call.

To enable this extension, we must both enable the VK_EXT_SHADER_OBJECT_EXTENSION_NAME device extension (through VkDeviceCreateInfo's ppEnabledExtensionNames field) and pass a VkPhysicalDeviceShaderObjectFeaturesEXT structure with shaderObject set to VK_TRUE (through VkDeviceCreateInfo's pNext chain).

We create shader objects using vkCreateShadersEXT, which takes a set of VkShaderCreateInfoEXT structures as arguments. This structure specifies all there is to know about the shader: its code, its interface (descriptor set layout, push constants ranges, specialization constants), and information for linking (in this context, linking mostly means "optimizing a shader based on its successors"). This function returns a set of VkShaderEXT handles.

To link shader objects, we just set the VK_SHADER_CREATE_LINK_STAGE_BIT_EXT flag in the VkShaderCreateInfoEXT objects representing a logical sequence of stages. We can define as many unlinked shader objects as we want in a call to vkCreateShadersEXT, but we can link at most one such sequence of stages; there are complex restrictions to the function when creating a mix of linked and unlinked shader objects in a single call — using one call for linking each sequence of shaders and another one for all the unlinked ones seems to be the easiest way around.
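
A sketch of creating a linked vertex/fragment pair; the SPIR-V blobs, their sizes and the setLayout describing the shader interface are placeholders:

VkShaderCreateInfoEXT stages[2]{};
// Vertex stage, linked to the fragment stage that follows.
stages[0].sType = VK_STRUCTURE_TYPE_SHADER_CREATE_INFO_EXT;
stages[0].flags = VK_SHADER_CREATE_LINK_STAGE_BIT_EXT;
stages[0].stage = VK_SHADER_STAGE_VERTEX_BIT;
stages[0].nextStage = VK_SHADER_STAGE_FRAGMENT_BIT;
stages[0].codeType = VK_SHADER_CODE_TYPE_SPIRV_EXT;
stages[0].codeSize = vertexSpirvSize;
stages[0].pCode = vertexSpirv;
stages[0].pName = "main";
stages[0].setLayoutCount = 1;
stages[0].pSetLayouts = &setLayout;
// Fragment stage (same interface, no further stage to link to).
stages[1] = stages[0];
stages[1].stage = VK_SHADER_STAGE_FRAGMENT_BIT;
stages[1].nextStage = 0;
stages[1].codeSize = fragmentSpirvSize;
stages[1].pCode = fragmentSpirv;

VkShaderEXT shaders[2];
vkCreateShadersEXT(device, 2, stages, nullptr, shaders);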

We bind shader objects via vkCmdBindShadersEXT. We can bind linked and unlinked objects alike, although we should not expect optimal performance for the latter. When using linked shader objects, we must make sure that we bind every one present in the link sequence. Bound shaders are used by the following compute/draw calls.

In addition to the shader objects, we also need to provide the state information that was originally passed in the pipeline object. We now consider all of this state dynamic and we set it using the appropriate functions, as described in the spec. Launching an operation on the GPU before all the required state has been bound is an error.
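
Recording could then look roughly as follows; the viewport, scissor and command buffer are placeholders, and the list of state setters below is far from exhaustive:

// Bind the linked vertex/fragment shaders created above.
VkShaderStageFlagBits stageBits[2] = { VK_SHADER_STAGE_VERTEX_BIT, VK_SHADER_STAGE_FRAGMENT_BIT };
vkCmdBindShadersEXT(commandBuffer, 2, stageBits, shaders);

// All pipeline state is now dynamic; a few representative calls:
vkCmdSetViewportWithCount(commandBuffer, 1, &viewport);
vkCmdSetScissorWithCount(commandBuffer, 1, &scissor);
vkCmdSetPrimitiveTopology(commandBuffer, VK_PRIMITIVE_TOPOLOGY_TRIANGLE_LIST);
vkCmdSetCullMode(commandBuffer, VK_CULL_MODE_BACK_BIT);
vkCmdSetDepthTestEnable(commandBuffer, VK_FALSE);
// ... many more state setters are required before drawing; see the spec ...
vkCmdDraw(commandBuffer, 3, 1, 0, 0);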

Khronos provides a sample and a blog post for this extension. Also, note that shader objects do not (yet) support ray tracing.

C.2. Graphics pipeline libraries

In the previous section, we discussed a way of getting rid of pipeline objects entirely. This was maybe a bit harsh on them; after all, though they are annoyingly monolithic, they come with good performance once compiled, and we may not want to compromise on performance. Could we not keep pipelines but make them more modular? This would enable all kinds of reuse, which would help with the combinatorial explosion problem. This is what the VK_EXT_graphics_pipeline_library device extension is about.

After loading the extension (by passing VK_EXT_GRAPHICS_PIPELINE_LIBRARY_EXTENSION_NAME via VkDeviceCreateInfo's ppEnabledExtensionNames field), we can start defining independent pieces of pipeline objects. We cannot break up our pipelines in any way we want. Instead, there are four predefined parts (aka libraries; see the spec for more detail): the vertex input interface, the pre-rasterization shaders, the fragment shader, and the fragment output interface.

To create one or several such parts (creating several parts at once does not link them), we use vkCreateGraphicsPipelines the usual way, except that we only provide the information relevant to the part we want to create, which we name explicitly in a VkGraphicsPipelineLibraryCreateInfoEXT structure that we provide via VkGraphicsPipelineCreateInfo's pNext field. If we want to be able to optimize the result of the linking operation for our pipeline parts, we should ask Vulkan to keep additional information about all of them using the VK_PIPELINE_CREATE_RETAIN_LINK_TIME_OPTIMIZATION_INFO_BIT_EXT flag.
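
As a sketch, creating the fragment shader part could look like this (fragmentStageInfo and pipelineLayout are assumed to exist; note that pipeline parts are also marked with VK_PIPELINE_CREATE_LIBRARY_BIT_KHR, from the VK_KHR_pipeline_library extension that this one builds upon):

VkGraphicsPipelineLibraryCreateInfoEXT libraryInfo{};
libraryInfo.sType = VK_STRUCTURE_TYPE_GRAPHICS_PIPELINE_LIBRARY_CREATE_INFO_EXT;
libraryInfo.flags = VK_GRAPHICS_PIPELINE_LIBRARY_FRAGMENT_SHADER_BIT_EXT;

VkGraphicsPipelineCreateInfo partInfo{};
partInfo.sType = VK_STRUCTURE_TYPE_GRAPHICS_PIPELINE_CREATE_INFO;
partInfo.pNext = &libraryInfo;
partInfo.flags = VK_PIPELINE_CREATE_LIBRARY_BIT_KHR
               | VK_PIPELINE_CREATE_RETAIN_LINK_TIME_OPTIMIZATION_INFO_BIT_EXT;
// Only the fragment shader stage and the state it needs are provided here.
partInfo.stageCount = 1;
partInfo.pStages = &fragmentStageInfo;
partInfo.layout = pipelineLayout;

VkPipeline fragmentLibrary;
vkCreateGraphicsPipelines(device, VK_NULL_HANDLE, 1, &partInfo, nullptr, &fragmentLibrary);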

The graphics pipeline library extension deprecates vkCreateShaderModule. Instead of using this function, we should just pass our VkShaderModuleCreateInfo directly through VkPipelineShaderStageCreateInfo's pNext field.

To link all parts together, we use a VkPipelineLibraryCreateInfoKHR structure that we provide to VkGraphicsPipelineCreateInfo via its pNext field. We can (and usually should) provide the VK_PIPELINE_CREATE_LINK_TIME_OPTIMIZATION_BIT_EXT flag to ensure that Vulkan fully optimizes the resulting pipeline. A technique that we could use here is to use an unoptimized pipeline while waiting for the optimized version to get compiled in the background.
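
Linking the four parts into a usable pipeline could then look as follows (the four library handles are assumed to have been created as above):

VkPipeline parts[4] = { vertexInputLibrary, preRasterizationLibrary,
                        fragmentLibrary, fragmentOutputLibrary };

VkPipelineLibraryCreateInfoKHR linkInfo{};
linkInfo.sType = VK_STRUCTURE_TYPE_PIPELINE_LIBRARY_CREATE_INFO_KHR;
linkInfo.libraryCount = 4;
linkInfo.pLibraries = parts;

VkGraphicsPipelineCreateInfo linkedInfo{};
linkedInfo.sType = VK_STRUCTURE_TYPE_GRAPHICS_PIPELINE_CREATE_INFO;
linkedInfo.pNext = &linkInfo;
linkedInfo.flags = VK_PIPELINE_CREATE_LINK_TIME_OPTIMIZATION_BIT_EXT;
linkedInfo.layout = pipelineLayout;

VkPipeline pipeline;
vkCreateGraphicsPipelines(device, VK_NULL_HANDLE, 1, &linkedInfo, nullptr, &pipeline);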

If two different pipeline parts access different sets, the compiler may end up doing funky descriptor set aliasing, as it no longer has a global view. For instance, if the vertex and the fragment shader use distinct sets, the driver may only remember that the fragment shader uses one set and that the vertex shader only uses one set as well. The fact that these sets are distinct can get lost along the way. We can use the VK_PIPELINE_LAYOUT_CREATE_INDEPENDENT_SETS_BIT_EXT flag for pipeline layouts to tell the compiler to be extra careful about this.

Once again, Khronos provides a sample and a blog post for this extension. Note that a similar technique is supported for ray tracing (as discussed here).

D. Synchronization

Modern Vulkan brings some quality of life features for synchronization. The new features are nice, but the changes are not that impactful for most engine designs. They are now part of Vulkan's core.

D.1. Timeline semaphores

Timeline semaphores are a generalization of semaphores (the GPU-GPU synchronization primitive) and of fences (the CPU-GPU synchronization primitive). They are an extension of semaphores, and we create them by passing a VkSemaphoreTypeCreateInfo structure with field semaphoreType set to VK_SEMAPHORE_TYPE_TIMELINE via VkSemaphoreCreateInfo's pNext chain. Note that the type of timeline semaphores remains VkSemaphore.

Whereas plain semaphores were basically booleans, timeline semaphores carry a 64-bit unsigned integer payload. We can pick the initial value of this payload at creation time (through VkSemaphoreTypeCreateInfo's initialValue field). We are only ever allowed to increase this value, and we use specific values to represent certain states.
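
Creating one is just a matter of chaining the type structure (a minimal sketch):

VkSemaphoreTypeCreateInfo typeInfo{};
typeInfo.sType = VK_STRUCTURE_TYPE_SEMAPHORE_TYPE_CREATE_INFO;
typeInfo.semaphoreType = VK_SEMAPHORE_TYPE_TIMELINE;
typeInfo.initialValue = 0;

VkSemaphoreCreateInfo semaphoreInfo{};
semaphoreInfo.sType = VK_STRUCTURE_TYPE_SEMAPHORE_CREATE_INFO;
semaphoreInfo.pNext = &typeInfo;

VkSemaphore timeline;
vkCreateSemaphore(device, &semaphoreInfo, nullptr, &timeline);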

We can interact with timeline semaphores from the GPU (using them as semaphores). For instance, we can have their value set once some submitted work completes by chaining a VkTimelineSemaphoreSubmitInfo structure to VkSubmitInfo's pNext field. In this structure, we store a wait (resp. signal) value for each wait (resp. signal) semaphore that we pass to VkSubmitInfo (we also must store values for binary semaphores that way, though in practice we almost never mix timeline and binary semaphores, so this is not a real issue). A wait finishes when the semaphore's payload reaches (or exceeds) the specified value, and a signal sets the payload to the specified value (which must be greater than the current one).

We can also interact with timeline semaphores from the CPU (using them as fences). We wait on them using vkWaitSemaphores (see VkSemaphoreWaitInfo), and we signal them using vkSignalSemaphore (see VkSemaphoreSignalInfo). We can also just look at the current value of a timeline semaphore using vkGetSemaphoreCounterValue.
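
A sketch of these host-side operations, with timeline being the semaphore created above and the values being arbitrary:

// Block until the timeline reaches (at least) 42, e.g., "frame 42 is done".
uint64_t waitValue = 42;
VkSemaphoreWaitInfo waitInfo{};
waitInfo.sType = VK_STRUCTURE_TYPE_SEMAPHORE_WAIT_INFO;
waitInfo.semaphoreCount = 1;
waitInfo.pSemaphores = &timeline;
waitInfo.pValues = &waitValue;
vkWaitSemaphores(device, &waitInfo, UINT64_MAX);

// Signal a (strictly greater) value from the CPU.
VkSemaphoreSignalInfo signalInfo{};
signalInfo.sType = VK_STRUCTURE_TYPE_SEMAPHORE_SIGNAL_INFO;
signalInfo.semaphore = timeline;
signalInfo.value = 43;
vkSignalSemaphore(device, &signalInfo);

// Or simply peek at the current value.
uint64_t currentValue;
vkGetSemaphoreCounterValue(device, timeline, &currentValue);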

There may be a device-dependent limit on the maximum value between the current value of a semaphore and that of any pending wait or signal operation, which we can read in the maxTimelineSemaphoreValueDifference field of the VkPhysicalDeviceTimelineSemaphoreProperties structure, which we can obtain via vkGetPhysicalDeviceProperties2.

Khronos provides both a sample and a blog post on this topic.

D.2. Synchronization 2

Synchronization 2 improves pipeline barriers, events, image layout transitions and queue submissions. It does all of this through the introduction of the VkDependencyInfo structure, which centralizes all barrier information. It defines new commands that make use of this representation, notably vkCmdPipelineBarrier2, vkCmdSetEvent2/vkCmdWaitEvents2, and vkQueueSubmit2 (discussed below).

A VkDependencyInfo is a collection of VkMemoryBarrier2, VkBufferMemoryBarrier2, and VkImageMemoryBarrier2 structures. These look like the barriers we are familiar with, except that we also store stage information in them (as VkPipelineStageFlags2 objects, which themselves are like VkPipelineStageFlags, except that the stages are split differently; note that the top/bottom-of-pipe stages have been replaced by VK_PIPELINE_STAGE_2_NONE_KHR).
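
For instance, a barrier transitioning an image from a color attachment to a shader-readable texture could be expressed as follows (the image handle and command buffer are placeholders):

VkImageMemoryBarrier2 barrier{};
barrier.sType = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER_2;
barrier.srcStageMask = VK_PIPELINE_STAGE_2_COLOR_ATTACHMENT_OUTPUT_BIT;
barrier.srcAccessMask = VK_ACCESS_2_COLOR_ATTACHMENT_WRITE_BIT;
barrier.dstStageMask = VK_PIPELINE_STAGE_2_FRAGMENT_SHADER_BIT;
barrier.dstAccessMask = VK_ACCESS_2_SHADER_SAMPLED_READ_BIT;
barrier.oldLayout = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL;
barrier.newLayout = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL;
barrier.image = image;
barrier.subresourceRange = { VK_IMAGE_ASPECT_COLOR_BIT, 0, 1, 0, 1 };

VkDependencyInfo dependencyInfo{};
dependencyInfo.sType = VK_STRUCTURE_TYPE_DEPENDENCY_INFO;
dependencyInfo.imageMemoryBarrierCount = 1;
dependencyInfo.pImageMemoryBarriers = &barrier;

vkCmdPipelineBarrier2(commandBuffer, &dependencyInfo);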

Synchronization 2 introduces vkQueueSubmit2, an alternative submission command that takes a VkSubmitInfo2 argument, which is the same as the VkSubmitInfo, save for its use of VkSemaphoreSubmitInfo structures (for wait and signal alike) and the presence of a flags field. VkSemaphoreSubmitInfo makes timeline semaphores management more natural, and it also defines a deviceIndex field for when we use device groups (with device groups being a core feature that we briefly discuss in section H — in short, it is about handling distinct physical devices as a single logical one; when we are not using this feature, i.e., almost always, we just leave it at 0).

If we use the now deprecated render passes, we should make use of the VkSubpassDependency2 structure (in practice, we use render passes only if we are targeting mobiles; then, we should actually stick to Vulkan 1.0, so yeah).

Finally, synchronization 2 brings in some new image layouts (VK_IMAGE_LAYOUT_ATTACHMENT_OPTIMAL_KHR and VK_IMAGE_LAYOUT_READ_ONLY_OPTIMAL_KHR). This is just a quality of life feature (before, we would have to spell out, e.g., VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL in full; now, Vulkan can just deduce from the context that the attachment the transition is applied to is a depth/stencil buffer).

The Vulkan guide contains a page with more information on the topic.

E. Indirect rendering

Indirect rendering is a generalization of instancing where the meshes can differ (it is of course not as efficient as bona fide instancing, but it still gives us a way of rendering multiple objects via a single draw call). The trick is that the mesh data lives in a single shared buffer (the one bound when we emit the indirect draw call), and that we provide the draw call arguments indirectly via a VkBuffer object (which we could generate from the GPU: this technique enables a form of GPU-directed rendering; it is possible to go even further). Basic indirect draws have been part of core Vulkan since 1.0, and the count-based variants (such as vkCmdDrawIndirectCount) joined the core in version 1.2. We can access this functionality through commands such as vkCmdDrawIndirect. As vkCmdDraw takes four uint32_t arguments, we should fill our buffer with groups of four such values (had we used vkCmdDrawIndexedIndirect instead, it would be five).
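
As a sketch (indirectBuffer and drawCount are placeholders; the records could have been written by the CPU or by a compute shader):

// Vulkan defines a struct matching vkCmdDraw's four arguments:
// VkDrawIndirectCommand { vertexCount, instanceCount, firstVertex, firstInstance }.
// indirectBuffer holds drawCount tightly packed records of that type.
vkCmdDrawIndirect(commandBuffer, indirectBuffer, /*offset*/ 0,
                  drawCount, sizeof(VkDrawIndirectCommand));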

Khronos has a sample related to this technique. There is also this great video by Aurailus, and this vkguide.dev page.

F. Mesh shading

F.1. Working principle

GPUs have evolved from processors for a fixed rendering pipeline based on predetermined functions (where only some parameters can be tweaked) to much more general and flexible devices integrating user-defined programs (shaders). Modern GPUs even support arbitrary parallel computations through compute shaders, be they graphics related or not. In a sense, mesh shaders are a continuation of this evolution process. With the traditional pipeline, rasterization of geometric primitives happens right after the input assembly, vertex shading, tessellation, and geometry shading steps. Pre-rasterization steps typically make use of specialized hardware facilities. Long story short, the (quite complex) traditional pipeline is very efficient for typical workloads, but its rigidity can be quite limiting in specific contexts (the fixed input assembly and tessellation steps are especially likely to lead to avoidable bottlenecks). The main idea behind mesh shading is that we could skip the pre-rasterization portion of the pipeline and produce primitives from compute shaders instead. In Vulkan, this idea is implemented by the VK_EXT_mesh_shader device extension.

So, when should we use mesh shading? In all likelihood, very sparingly. Mesh shading makes things even lower-level than they are by default and does not guarantee any increase in performance; in fact, for classical workloads, we should expect it to make things worse. Mesh shading makes sense in contexts where we hit a bottleneck that we cannot eliminate satisfactorily using standard optimization techniques. Typical use cases include loading very detailed geometry (as it allows for very efficient culling and Nanite-style dynamic level of detail shenanigans) or generating isosurfaces. Mesh shading can give us something with the behavior of a geometry shader but without the awful performance. The ray tracing pipeline (not covered in this guide) is distinct from the mesh shading pipeline (and, for that matter, from the classical one), so there is no direct way of combining the two. Furthermore, mesh shading has poor performance on tiling architectures. It is also hard to write a one-size-fits-all mesh shader; it is common to implement a distinct version of the same shader for each manufacturer.

Mesh shading makes use of two kinds of shaders: task and mesh shaders. Both of them are glorified compute shaders, and they cooperate to generate meshes in parallel within a workgroup. Only mesh shaders are strictly required, and it is from them that the primitives are actually generated. However, workgroups running mesh shaders are subject to some limitations regarding how many primitives they can output. This has two main consequences: first, meshes have to be split into small chunks (commonly called meshlets) that fit within these limits, which is typically done statically, ahead of time; second, we need some way of dispatching enough mesh shader workgroups to cover all the meshlets of a mesh.

With techniques such as tessellation, the amount of geometric detail of an object varies dynamically. Unlike the more common technique of switching the model for a more detailed one depending on the distance, tessellation happens entirely on the GPU. If we want to do tessellation using mesh shaders, we have an issue, as we are still bound by the limits of mesh shaders: if the model becomes more detailed, then we need to split its meshlets differently, since none of them should use more primitives than what the device supports. We discussed earlier how meshlets should be generated statically; it is actually possible to devise schemes to precompute information that can be used to efficiently produce dynamically optimized meshlets, as exemplified by Nanite. Furthermore, for some applications, efficient heuristics can produce good enough meshlets at a low cost without any form of precomputation. This is where the (optional) task shader comes into play: the role of this shader is to dynamically schedule runs of the mesh shader and provide them with arguments.

F.2. Mesh shading in practice

F.2.1. Enabling mesh shading

We enable the mesh shading extension for a device by passing VK_EXT_MESH_SHADER_EXTENSION_NAME through VkDeviceCreateInfo's ppEnabledExtensionNames field.

Moreover, we can check whether a device supports these shaders by calling vkGetPhysicalDeviceFeatures2 and passing a pointer to a VkPhysicalDeviceMeshShaderFeaturesEXT structure through VkPhysicalDeviceFeatures2's pNext field (we care about its taskShader and meshShader fields). To enable the feature, we pass a VkPhysicalDeviceFeatures2 with the same chain through the pNext field of the VkDeviceCreateInfo that we give to vkCreateDevice. To get additional information about the mesh shading feature for a specific device (e.g., limits), we pass a pointer to a VkPhysicalDeviceMeshShaderPropertiesEXT structure through VkPhysicalDeviceProperties2's pNext field, and call vkGetPhysicalDeviceProperties2 with it.
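
A sketch of the query-then-enable dance (the physical device handle and the rest of the device creation code are assumed to exist):

// Query support.
VkPhysicalDeviceMeshShaderFeaturesEXT meshFeatures{};
meshFeatures.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_MESH_SHADER_FEATURES_EXT;
VkPhysicalDeviceFeatures2 features2{};
features2.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_FEATURES_2;
features2.pNext = &meshFeatures;
vkGetPhysicalDeviceFeatures2(physicalDevice, &features2);

if (meshFeatures.meshShader && meshFeatures.taskShader) {
    // Enable the feature by chaining the same structures into device creation.
    VkDeviceCreateInfo deviceInfo{};
    deviceInfo.sType = VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO;
    deviceInfo.pNext = &features2;
    // ... queues, VK_EXT_MESH_SHADER_EXTENSION_NAME among ppEnabledExtensionNames, etc. ...
    vkCreateDevice(physicalDevice, &deviceInfo, nullptr, &device);
}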

F.2.2. Shaders

Both mesh and task shaders should include the #extension GL_EXT_mesh_shader: require directive. We also specify the dimensions of a workgroup in the same manner as for compute shaders, as this is what they are (so, something like layout(local_size_x = 2, local_size_y = 2, local_size_z = 1) in;). The typical limit for the size of a workgroup is 128.

Task shaders take no inputs (besides the builtin workgroup identifiers, which are precisely the same as with compute shaders). An invocation of a task shader typically processes one meshlet. All invocations within a workgroup cooperate to emit an appropriate number of mesh tasks with EmitMeshTasksEXT(x, y, z);. This is a GLSL command whose parameters represent the number of mesh workgroups to generate (although this command appears in every task shader invocation, only the effect of the call from the first invocation is considered; see here for details). Additionally, we may define data to be forwarded to the mesh tasks; this data is uniformly accessible to all created workgroups. We do this by declaring a variable of the form taskPayloadSharedEXT SharedData sharedData; (assuming that we defined a structure type called SharedData earlier) globally in the task shader, and by assigning a value to it in the shader's body. We can define at most one such variable, and we should strive to keep payloads as compact as possible for performance reasons.

In mesh shaders, we define a maximum number of vertices, and a maximum number of primitives built out of these vertices that the workgroup may emit, as in layout(max_vertices = 128, max_primitives = 128) out; (indeed, these limits are not defined per shader invocation; remember that different invocations belonging to the same workgroup cooperate to generate the data: we want parallelism). In the actual code, we use SetMeshOutputsEXT(vertexCount, primCount); to communicate the actual number of vertices and primitives that we output from the workgroup. We also specify what kind of primitives we are using with a statement such as layout(triangles) out; (the only possible alternatives are points or lines). A typical mesh shader invocation handles one or two primitives. We may output additional data for the fragment shader, e.g., layout(location = 0) out vec3 vertColor[]; (note that we output an array of values, with one value per vertex). To output data on a per-primitive basis instead of a per-vertex basis, we can use perprimitiveEXT to indicate that an output is meant as a per-primitive one, as in perprimitiveEXT layout(location = 1) out vec3 primNormal[];. Additionally, mesh shaders take the task payload data as input. To access the (read only) payload data, we need to include a declaration similar to that found in the task shader; e.g., taskPayloadSharedEXT SharedData sharedData;.

GLSL defines some write-only variables for use by mesh shaders: we should write the vertices we create in gl_MeshVerticesEXT, write indices of triplets of vertices to gl_PrimitiveTriangleIndicesEXT to form triangles (alternatively, gl_PrimitiveLineIndicesEXT or gl_PrimitivePointIndicesEXT), and optionally share some predefined per-primitive data using gl_MeshPrimitivesEXT. All of these variables have an array type; check the spec for details (in particular, gl_MeshVerticesEXT and gl_MeshPrimitivesEXT are arrays of structs defined there).

To apply some form of culling on a per-primitive basis, we can set gl_MeshPrimitivesEXT[i].gl_CullPrimitiveEXT (doc) to an appropriate value from inside a mesh shader. Arseny Kapoulkine has a nice blog post on the topic.

F.2.3. Building and using a mesh shading pipeline object

We set up mesh shading by creating a standard graphics pipeline object, except that we set pVertexInputState and pInputAssemblyState to null in VkGraphicsPipelineCreateInfo.

We emit mesh shading-based draw calls in a command buffer using vkCmdDrawMeshTasksEXT, which simply takes the workgroup grid's dimensions (for the task shader if present, otherwise for the mesh shader) as arguments (we may also use one of its indirect variants, vkCmdDrawMeshTasksIndirectEXT and vkCmdDrawMeshTasksIndirectCountEXT; these read VkDrawMeshTasksIndirectCommandEXT structures from a buffer).

F.X. Additional resources

Khronos provides two samples related to mesh shading: a very basic one and a more advanced one. They also have a blog post on the topic. The related GLSL/OpenGL extension also has a specification (information about its finer aspects is sparse outside of this document). You may also want to take a look at the following materials:

G. Subgroups

Subgroups are a feature that became core in version 1.1 of the standard. They are basically a variant of the shared variable we discussed in the compute chapter (these are variables whose value is shared among all invocations in the same workgroup). Subgroups can be smaller than workgroups, but they support shared memory with much better performance (as they basically represent individual compute units). Also, the subgroup mechanism can be used in all shader types instead of just compute ones.

For instance, you can get the sum of all values of a variable in the invocations of the current subgroup through a simple GLSL operation (or check if a condition is true for all invocations, or do the sum only for the invocations with a lower id than the current one, or broadcast a value from a precise invocation to all other ones in the subgroup, or shuffle values, or pick the maximum value, or apply a 2D operation working on groups of 2x2 invocations, or whatnot). All changes are GLSL side (besides querying for subgroup support for devices).
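
Host-side, checking what a device offers boils down to a properties query (a sketch):

VkPhysicalDeviceSubgroupProperties subgroupProperties{};
subgroupProperties.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_SUBGROUP_PROPERTIES;

VkPhysicalDeviceProperties2 properties2{};
properties2.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_PROPERTIES_2;
properties2.pNext = &subgroupProperties;
vkGetPhysicalDeviceProperties2(physicalDevice, &properties2);

// subgroupProperties.subgroupSize: invocations per subgroup (e.g., 32 or 64);
// subgroupProperties.supportedStages and supportedOperations: where and what we may use.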

This section does not go into too much detail on this topic. You should check additional resources for more information. Khronos provides a blog post on the topic, which is a great starting point. There is also a 2018 Vulkan Developer Day presentation by Daniel Koch. You may also be interested in the maximal reconvergence extension, which makes the semantics of, well, reconvergence more intuitive. I defer the explanation of this extension (whose utility is not limited to subgroup operations) to the original Khronos blog post.

H. Device groups

Device groups were introduced in an extension which has been part of core Vulkan since version 1.1. The feature enables using distinct physical GPUs as if they were one and the same. This is mostly useful for doing things à la NVLink. This extension is very niche, so we won't discuss it in further detail: just know that it exists.

X. Additional resources

Charles Giessen gave a presentation about modern renderers at Vulkanised 2025, where he discusses most of the techniques mentioned above. It can serve as a good recap of this chapter.