Bonus content: virtual textures
These are my personal notes on the sparse bindless texture arrays video by Aurailus. This video looks great BUT it contains a bunch of errors, and it ended up leading me somewhat astray. The mistakes are pointed out in a very thorough, practical, information-dense comment thread started by @fabiangiesen306. I copied the comment chain below since I cannot link directly to a YouTube comment (nor do I trust it to remain online longer than this page). I keep a link to the video because I still find it useful on two counts:
- As a very good explainer of virtual textures, once you factor in the corrections
- As a cautionary tale: for many of these techniques, documentation is quite sparse, and testing on small samples can be misleading
Texture management techniques:
- Texture arrays: Bind only once instead of once per texture.
- Virtual textures: Not all textures are actually backed by physical memory. The data is only loaded when it is required. Some form of indirection is required for mapping virtual resources to their location in physical memory.
- Sparse textures: Textures can be bound to memory non-contiguously. For instance, we could divide textures (including mipmaps) into tiles and update a single tile at a time. Some form of indirection is also required.
- Sparse virtual textures (aka megatextures): These combine the two ideas above. We split our textures (including mipmaps) into tiles (whose size matches that of a page of physical memory; this is the sparse part), and we load such tiles into physical memory only when they are actually required (this is the virtual part). János Turánszki has a cool video showcasing what the result looks like in practice. A minimal sketch of the indirection involved follows this list.
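To make the indirection concrete, here is a rough CPU-side sketch of the lookup that a virtual-texturing shader performs: mapping a UV into the huge virtual texture onto a UV into the much smaller physical page cache that actually lives in GPU memory. All names, sizes and the page-table layout are illustrative, not taken from any particular implementation, and the single-mip, no-fallback version is deliberately simplified.

```cpp
#include <cstdint>
#include <cmath>

struct PageEntry {
    // Tile coordinates inside the physical cache atlas (in tile units),
    // plus the mip level the tile was loaded from. One mip level only here,
    // for simplicity; real page tables cover the whole mip chain.
    uint16_t cacheX, cacheY, mip, valid;
};

struct Float2 { float x, y; };

// Given a UV into the *virtual* texture (in [0,1)), return a UV into the
// *physical* cache texture. In a real renderer this runs in the fragment
// shader; this is the same math written on the CPU for clarity.
Float2 VirtualToPhysicalUV(Float2 virtualUV,
                           const PageEntry* pageTable,   // pageCountX * pageCountY entries
                           int pageCountX, int pageCountY,
                           int cacheTilesX, int cacheTilesY)
{
    // Which virtual page does this UV fall into?
    int pageX = (int)std::floor(virtualUV.x * pageCountX);
    int pageY = (int)std::floor(virtualUV.y * pageCountY);
    const PageEntry& e = pageTable[pageY * pageCountX + pageX];
    // A real implementation falls back to a coarser, already-resident mip
    // when !e.valid; the fallback is omitted here.

    // Position inside the page, in [0,1).
    float fracX = virtualUV.x * pageCountX - pageX;
    float fracY = virtualUV.y * pageCountY - pageY;

    // Remap into the cache atlas: pick the tile, then offset within it.
    Float2 physical;
    physical.x = (e.cacheX + fracX) / cacheTilesX;
    physical.y = (e.cacheY + fracY) / cacheTilesY;
    return physical;
}
```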
The performance of hardware sparse textures is poor for security reasons: every update to the GPU's virtual → physical mapping has to be vetted by the driver and the OS to check that you only ever access memory that is yours. I feel like those checks should be avoidable in most cases: if no other private computations are running on the GPU at the time, then why lock everything up? Oh, right, some desktop environments use the GPU. Still, I feel like always limiting performance is a bit silly.
Comment thread
I work on Unreal Engine on numerous things including most of the texture processing/streaming stack, some notes:
- The low texture bind point limits in GL implementations are really just a GL-ism. For example, on Windows, going back as far as D3D10 (that was 2006!), D3D required support for 128 texture bind points _per shader stage_. So you could have 128 textures for vertex shaders, another 128 for your fragment shaders, all in the same batch. GPU HW has supported way higher limits than GL usually exposes for decades.
- That said, even though the actual limit is usually way higher than you can access with unextended GL, bind point limits are per-batch. You can change the bound textures for every draw call if you want, and it is in fact completely normal for games to change active texture bindings many thousands of times per frame. This is true for GL as well.
- For a powers-of-2 size progression, if you don't want mip map filtering (and when you're using pixel art, you probably don't), you could just create a texture array at, say, 512x512 with mip maps (which makes it ballpark 33% larger in memory than without), and then mip map 0 of it is a 512x512 texture array, mip map 1 is a 256x256 texture array, and so forth. You can use texture sampling instructions in the shader that specify the mip map explicitly, and this lets you collapse a whole stack of texture sizes into a single bind point if you want to avoid changing bindings frequently. (A sketch of this follows the list.)
- That said, most games don't use texture arrays much. The most common path is still separate 2D textures and changing all texture bindings whenever the material changes between draw calls.
- Likewise with sparse textures. You need to distinguish between the technique and the actual hardware feature. The actual HW feature (what GL calls "sparse textures") exists but is rarely actually used by games, due to some fundamental problems with how it works. For example, Unreal Engine doesn't use it at all and instead just does it manually in the shader. See e.g. Sean Barrett's "Virtual Textures (aka Megatextures)" video on YouTube, a re-recording of a GDC talk he gave in 2008. 2008 was before the dedicated HW feature existed, and most games and engines still use the old all-in-the-shader method. The issue is that the HW method involves manipulating the actual mapping of GPU virtual to physical addresses used by the GPU, which is security-sensitive since there might be data in GPU memory from a different process that you're not allowed to see. (Imagine e.g. a user having a browser tab with their email, a messaging app or their bank app open.) So whenever you change the actual GPU virtual → physical mapping, any changes you make need to be vetted by the driver and the OS to ensure you're not trying to gain back-door access to memory you have no business looking at. That makes the "proper" sparse texture mapping update path relatively restricted and slow. The "old school" technique works purely with a manual indirection through your own textures, none of which requires any privileged access or special validation; you're allowed to poke around in your own texture memory, that's fine. So even though it's more work in the shader, updating the mapping can be done from the GPU (the official sparse texture mapping update is CPU-only, since the OS needs to check it) and is much more efficient, and this turns out to be a more important factor in most cases than the higher cost of texture sampling in the shader. (A sketch of the hardware commitment API follows the list.)
- Bindless, likewise. This has been around for a while and some games now routinely use it (there's a sketch of the handle API after this list), but most games don't. For example, as of this writing (early Jun 2024), bindless support in Unreal Engine is being worked on but hasn't actually shipped. Currently, most games really just have a bunch of textures around and change bindings more or less every draw call. (These days through somewhat more efficient methods than GL's bind points, but it's still explicit binding.)
- Texture compression is not really transparent generally. I think some drivers (especially on PC) support creating a texture in a compressed format and then passing in RGB(A) texture data, but this is a convenience thing, not the normal way to use compressed textures; it doesn't tend to exist on mobile targets at all, and tends to give pretty poor quality. The normal approach is to compress textures to one of the supported formats off-line using a texture compressor library (these are much slower but higher quality than on-the-fly compression) and then pass that in (see the upload sketch after this list).
- DXT3 and DXT5 are the same for the color (RGB) portion and differ only in how they handle A. In practice, ~nobody uses DXT3, and there is not much of a reason to. DXT5 gives better quality pretty much always.
- One very important thing to keep in mind is how textures are stored in GPU memory. In short, really small textures tend to be incredibly inefficient. It's not rare for texturing HW to require textures (and often individual mip maps) in GPU memory to be a multiple of some relatively large size, often 4kb or even 64kb. This is not a big deal if you're dealing with high-res textures. A 4096x4096 DXT5 texture with mip maps is around 21.3MiB. Rounding that up (even if it happens at the individual mip map level) to multiples of 64k does not change things much. If you're doing pixel art with 16x16 and 32x32 textures though, then even with uncompressed RGBA8, a 32x32 pixel texture is 4kb of payload, and if your GPU requires 64k alignment internally, that means your GPU memory usage is 15/16ths = ~94% wasted space (the arithmetic is spelled out in a small sketch after this list). This is one of the bigger reasons why atlasing and related techniques are so important for pixel art, font rendering and other cases where you have many small images. It's worth noting that texture arrays do not necessarily make this any better. It depends on the GPU, but if you have something like an array of 32x32 images, it's totally possible (and not even uncommon) for the GPU to internally require memory for 64x64 pixels or more per array element. Basically, current GPUs are not built for really small texture sizes, and if that's most of what you use, you need to keep an eye on how much memory that ends up consuming because it can be surprisingly horrible.
- Texture streaming. Both for sparse/virtual and regular textures, you have these high-res assets on disk (one of the main reasons why modern AAA games are so big, just tons of high-res textures in the package) but just because you're using a texture doesn't mean you have to have it fully resident all the time. In something like UE (and many other game engines), if you load a map or world chunk, that'll include a very low-res version of the texture - the last few mip levels. For UE, typically, those textures are around 64x64 pixels with mip maps. That part of the texture is always resident as long as the part of the world referencing it is (it's no use making it much smaller than 64x64 due to the texture storage size limitations mentioned earlier). But the actual size these textures are built from is generally much larger, these days usually at least 4096x4096 pixels and often much more. So we have the larger mip levels on disk: there's a 128x128 mip, a 256x256 mip, and you might go all the way up to 8192x8192 or whatever on disk, or you might have the highest resolutions only in the source art (just in case, for archival purposes or in case there's a remaster in a few years) and have the largest size stored on-disk for a texture be 2048x2048 pixels or similar. Anyway, those larger mips will get loaded only once you get close enough to something in the world using those textures to actually see them. That threshold is usually generous (we like to start loading the high-res mips a bit early because it takes a while to load them!) but still, even though high-res versions of most everything tend to exist on disk somewhere, most textures at any given time only have a subset of their mip maps loaded because you're nowhere near close enough to be able to see the full thing. Pure mip map-based streaming has the problem that if you do end up needing those larger mips, e.g. a 4096x4096 or 8192x8192 mip in a close-up of an important character model or whatever, that's still a lot of image to load at once and a lot of VRAM to need from one frame to the next. Overall, games are using sparse/virtual textures much more for that kind of thing. These can be partially resident and dice up large textures (and their mip levels) into something much more manageable like 128x128 pixel units. So even if you have your close-up hero shot of some character or NPC, then you need a high-res version of the face resident, and the highest-res version of the "body" texture up to their shoulders or so, but their back isn't visible at all and neither are most of their arms, legs or torso, so the extra-creamy 8192x8192 textures for those can stay mainly on disk instead of having to be resident in your precious GPU memory pool that instant. Texture streaming is the biggest single reason we can make this stuff fit. Most 3D games released in the past 15 years lean pretty heavily on it.
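A few of the points above are easier to see in code; the sketches below are rough GL-flavoured C++ illustrations I put together (not code from the thread), with made-up names and sizes and no error handling. First, the "single mipped texture array standing in for many texture sizes" trick: one array allocated at the largest size, with each smaller logical texture living at the matching mip level and sampled with an explicit LOD. Assumes a GL 4.x context with function pointers loaded by something like glad.

```cpp
// One 512x512 texture array with a full mip chain; a texture of size
// (512 >> N) lives in mip level N of some layer, and the shader samples it
// with an explicit LOD, e.g. textureLod(uAtlas, vec3(uv, layer), float(N)).
GLuint MakeMippedArray(GLsizei layers)
{
    GLuint tex = 0;
    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_2D_ARRAY, tex);
    // 10 levels: 512, 256, 128, ..., 1.
    glTexStorage3D(GL_TEXTURE_2D_ARRAY, 10, GL_RGBA8, 512, 512, layers);
    // Example: upload a 256x256 image into layer 3 by writing mip level 1.
    // glTexSubImage3D(GL_TEXTURE_2D_ARRAY, 1, 0, 0, 3, 256, 256, 1,
    //                 GL_RGBA, GL_UNSIGNED_BYTE, pixels);
    // Nearest filtering within a mip, nearest mip selection, so the
    // explicitly requested level is the one that gets read.
    glTexParameteri(GL_TEXTURE_2D_ARRAY, GL_TEXTURE_MIN_FILTER, GL_NEAREST_MIPMAP_NEAREST);
    glTexParameteri(GL_TEXTURE_2D_ARRAY, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
    return tex;
}
```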
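Next, the hardware feature itself (ARB_sparse_texture), for contrast with the manual indirection: reserve address space up front, then commit physical memory one page at a time. Sizes are illustrative (and bounded by GL_MAX_SPARSE_TEXTURE_SIZE_ARB in practice); the commitment call is the CPU-side, driver-validated step described in the comment.

```cpp
GLuint MakeSparseTexture()
{
    // Page sizes are GPU-dependent; query the default (index 0) page size.
    GLint pageX = 0, pageY = 0;
    glGetInternalformativ(GL_TEXTURE_2D, GL_RGBA8,
                          GL_VIRTUAL_PAGE_SIZE_X_ARB, 1, &pageX);
    glGetInternalformativ(GL_TEXTURE_2D, GL_RGBA8,
                          GL_VIRTUAL_PAGE_SIZE_Y_ARB, 1, &pageY);

    GLuint tex = 0;
    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_2D, tex);
    // Must be flagged sparse before storage is allocated.
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_SPARSE_ARB, GL_TRUE);
    // Reserve 16K x 16K of *address space* only; nothing is committed yet.
    glTexStorage2D(GL_TEXTURE_2D, 1, GL_RGBA8, 16384, 16384);

    // Commit physical memory for a single page at the origin of mip 0.
    // This goes through the driver/OS validation discussed above, which is
    // why it is CPU-only and comparatively slow.
    glTexPageCommitmentARB(GL_TEXTURE_2D, 0, 0, 0, 0,
                           pageX, pageY, 1, GL_TRUE);
    return tex;
}
```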
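The bindless path (ARB_bindless_texture) in its smallest form: turn a texture into a 64-bit handle, make it resident, and hand the handle to the shader through a buffer instead of a bind point. Here `tex`, `handleBuffer` and `slot` are assumed to already exist, and the GLSL side needs `#extension GL_ARB_bindless_texture : require`.

```cpp
// Convert a regular texture object into a bindless handle.
GLuint64 handle = glGetTextureHandleARB(tex);
glMakeTextureHandleResidentARB(handle);   // must stay resident while shaders use it

// Store the handle where the shader can read it, e.g. an SSBO holding an
// array the shader declares as sampler2D (or uvec2 it casts).
glBindBuffer(GL_SHADER_STORAGE_BUFFER, handleBuffer);
glBufferSubData(GL_SHADER_STORAGE_BUFFER, slot * sizeof(GLuint64),
                sizeof(GLuint64), &handle);
```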
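Offline-compressed data is handed to GL as finished blocks; the compressor itself (a DXT/BC encoder run at asset-build time) is out of scope here. DXT5/BC3 uses 16 bytes per 4x4 block.

```cpp
void UploadDXT5Mip(GLuint tex, GLint level, GLsizei w, GLsizei h,
                   const void* blocks)
{
    // Block-compressed size: one 16-byte block per 4x4 pixel tile.
    GLsizei size = ((w + 3) / 4) * ((h + 3) / 4) * 16;
    glBindTexture(GL_TEXTURE_2D, tex);
    glCompressedTexImage2D(GL_TEXTURE_2D, level,
                           GL_COMPRESSED_RGBA_S3TC_DXT5_EXT,
                           w, h, 0, size, blocks);
}
```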
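And the small-texture waste arithmetic, just to confirm the ~94% figure; the 64 KiB placement unit is an assumption about a hypothetical GPU, as the comment says the real granularity varies.

```cpp
#include <cstddef>
#include <cstdio>

int main()
{
    const std::size_t alignment = 64 * 1024;    // assumed placement granularity
    const std::size_t payload   = 32 * 32 * 4;  // 32x32 RGBA8 = 4 KiB
    const std::size_t allocated =
        ((payload + alignment - 1) / alignment) * alignment;
    std::printf("payload %zu B, allocated %zu B, wasted %.1f%%\n",
                payload, allocated,
                100.0 * double(allocated - payload) / double(allocated));
    // -> payload 4096 B, allocated 65536 B, wasted 93.8%
}
```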
Oh wow, thank you for the thorough response. It's super cool to hear from someone experienced in doing these things the AAA way. Looking back, I can see I made a lot of mistakes in this video, some of which have been pointed out already. In particular, I think I really failed to emphasize the GL-ness of all of these problems, since I had no experience with any other frameworks at the time I put this video together. I've since moved my project entirely over to Vulkan, and it's such a different beast. I'm really happy with the change.
For your point on mip-mapping, you might have been misled by the way that I reused some textures in the section where I showed the different textures in the texture arrays. That was honestly just because I didn't have enough assets to work with. I do use proper texture mip-mapping on all of my uploaded textures.
I'm aware that a lot of games do swap textures normally; I'm mostly coming at this from the perspective of a voxel game developer, in particular a voxel engine developer. I need to be able to support any combination of any blocks in any chunk in the game, which could possibly entail supporting one mesh with every single registered texture rendering at once. Obviously that's a huge pain, but it's also a great use-case for texture arrays and bindless textures, so I'm glad that they exist. I'm not sure I see what you're saying with bindless textures not being used commonly. I know some games may not need it, but from my understanding, a lot of games use it, or similar techniques with large descriptor arrays in Vulkan. Perhaps large engines are better at optimizing and organizing texture atlases, but in the indie dev scene, the people I've talked to have been pretty unanimous that moving towards bindless techniques is the way to go.
In retrospect, I'm quite disappointed at the way I represented sparse textures. I got most of the knowledge for this video from the 2014 GDC AZDO talk, where a lot of these features were new and looked at very optimistically by the developers. I put a bit of work into implementing them, and while they seemed to function, I didn't put due diligence into testing their performance to see if they were actually better in the modern day. It's a shame that information on the "right" way to do things in graphics programming is so fragmented. When looking for ways to improve my engine, that talk was the first resource I was told to learn from by several people, which is odd given a lot of the techniques they discuss in it turned out to be quite slow and unwieldy in the modern day. The security angle for sparse textures is something I hadn't considered. I found out after making this video that sparse texture access was quite slow, but I didn't know why. Very interesting!
In a similar vein to the AZDO thing, comments on this video were how I realized that the page on texture compression on the OpenGL wiki hasn't been updated in almost a decade. That was a fun discovery after I uploaded my video 🥲. Honestly, the state of OGL documentation and a huge amount of work I had to go through because OGL refused to report errors to me were what finally pushed me over the edge, and I've now reimplemented the entire render pipeline of my engine in Vulkan. I'll be making a video on that process soon.
I had no idea that textures were aligned to such large memory regions in VRAM! Does this alignment apply to individual textures in texture arrays as well, or does it treat a whole texture array as a single block? Depending on what it does, I may need to rearchitect a few things :) Where does one find out these low-level details of how graphics cards lay out memory and execute the render pipeline? I'd love to get a more thorough understanding of the hardware I'm using, but it's very hard to find modern, comprehensive information on anything in this area.
I've witnessed texture streaming in games before (mostly when they're doing it poorly, although I'm sure well-implemented games do it so well that it's invisible). I don't think that's an issue I'll need to handle in my engine, as I'm mostly optimizing for low-ish resolution assets, but it's a cool thing to discuss regardless.
Thank you for taking the time to write all this out! I really enjoyed reading and replying to it!
Re: bindless textures not being used commonly, in the AAA space people are slowly starting to use them but mostly only within the last 2-3 years or so. There's a ton of inertia, and with bindless especially the debugging experience on most platforms is still significantly worse than with explicit binding. This is a significant hurdle for big projects because there are so many unexpected interactions in complex projects and you often need to debug code that was originally written 5 years ago by someone who's no longer at the company. Good tooling support is just absolutely essential to get anything to work at all, and deploying anything that makes it harder is an uphill battle. Explicit bind points have been around since forever and AAA games virtually never start from scratch, they mostly either use a licensed engine or start from an earlier version of an in-house engine. In both cases there's a lot of shaders and custom tooling around that all need to be touched and updated to move to bindless.
I can't emphasize enough just how big the scale of that problem is. As of this writing, a change has just landed in the shader build pipeline for Fortnite that makes it so the ~764000 (not a typo) PC "shader model 6" shaders (that's the generation that even supports bindless, there's a whole separate set of shaders for older and integrated GPUs that can't really do it) in Fortnite are now merely 7GB instead of the 18.6GB they were earlier (in the compressed form they use on disk, that's 4GB and originally 11.6GB, respectively). Fortnite is a bit of an outlier and has been shipping continuously since 2017 (mostly adding to it, much less often removing things) and it's not like these shaders are all hand-written (it's a combination of ubershaders and an extensively artist-customizable material system), but the sheer quantity is a problem and it all has to keep working. Even testing that you didn't break anything important after a change takes a long time. I recently spent around an afternoon writing a fix for the way textures were packaged on the Nintendo Switch to save a few tens of megabytes of memory (always tight on that target) and reduce CPU/GPU load for texture streaming. It then took around a week of iteration to find a shippable way to put the "enable" toggle for the new way of doing things into the engine config in a way that actually worked correctly in all possible texture build paths (several of which don't have access to the normal engine config facilities for complicated reasons) and was going to work for other Unreal Engine licensees too. And then another ~4 weeks of testing to make sure it actually worked right in-game in all cases it needed to work in.
"Perhaps large engines are better at optimizing and organizing texture atlases" - not really, no. UI things (especially icons and fonts) are usually atlased but for the textures used for your main rendering viewport it's up to the artists. If you have a character model, environment art or prop, there's usually some manual (or at least adapted from a semi-automated UV unwrap) atlasing there but across meshes, there's really usually nothing. These things usually all have different shaders anyway so it's usually separate batches regardless. (With deferred shading, which has been common for over a decade at this point, you do have opportunities to do most of the complicated material shading across mesh boundaries if you want to.)
"I had no idea that textures were aligned to such large memory regions in VRAM? Does this alignment apply to individual textures in texture arrays as well, or does it treat a whole texture array as a single block?" Vulkan and D3D12 do expose at least some of the memory sizes and alignments, so you can see some of it, if you know where to look. The details of texture layout vary greatly between GPUs, there's not even any general agreement on what order things are stored in memory :) - e.g. for a texture array with mip maps, do you store a mip chain for element 0, then a mip chain for element 1 and so forth, do you store mip 0 for all elements, then mip 1 for all elements, or something even funkier? I've seen variants of all three. But usually, the base address of the texture is aligned, the base of each mip level also needs to be aligned (sometimes to very large units like 4k or even 64k, but usually at least to some multiple of 256 or 512 bytes), and yes, the array elements usually also need to be aligned. The GPUs with large alignment requirements generally have some way to relax them for the smallest few mip levels, but usually the last few mips (say 8x8 downwards) really are mostly alignment padding on most GPUs. If you're on game consoles, you tend to get access to a bunch of low-level hardware docs so you know what's going on, as a PC or mobile dev you're unfortunately mostly stuck with the crumbs the APIs, vendor-specific debugging tools and the occasional support thread with an IHV give you.