A trip through the Graphics Pipeline 2011: Index
July 9, 2011
Welcome.
This is the index page for a series of blog posts I’m currently writing about the
D3D/OpenGL graphics pipelines as actually implemented by GPUs. A lot of
this is well known among graphics programmers, and there’s tons of papers on
various bits and pieces of it, but one bit I’ve been annoyed with is that while
there’s both broad overviews and very detailed information on individual
components, there’s not much in between, and what little there is is mostly out
of date.
This series is intended for graphics programmers who know a modern 3D API
(at least OpenGL 2.0+ or D3D9+) well and want to know how it all looks under
the hood. It’s not a description of the graphics pipeline for novices; if you
haven’t used a 3D API, most if not all of this will be completely useless to you.
I’m also assuming a working understanding of contemporary hardware design
– you should at the very least know what registers, FIFOs, caches and
pipelines are, and understand how they work. Finally, you need a working
understanding of at least basic parallel programming mechanisms. A GPU is a
massively parallel computer; there’s no way around it.
Some readers have commented that this is a really low-level description of the
graphics pipeline and GPUs; well, it all depends on where you’re standing.
GPU architects would call this a high-level description of a GPU. Not quite as
high-level as the multicolored flowcharts you tend to see on hardware review
sites whenever a new GPU generation arrives; but, to be honest, that kind of
reporting tends to have a very low information density, even when it’s done
well. Ultimately, it’s not meant to explain how anything actually works – it’s just
technology porn that’s trying to show off shiny new gizmos. Well, I try to be a
bit more substantial here, which unfortunately means fewer colors and fewer
benchmark results, but instead lots and lots of text, a few mono-colored
diagrams and even some (shudder) equations. If that’s okay with you, then
here’s the index:
• Part 1: Introduction; the Software stack.
• Part 2: GPU memory architecture and the Command Processor.
• Part 3: 3D pipeline overview, vertex processing.
• Part 4: Texture samplers.
• Part 5: Primitive Assembly, Clip/Cull, Projection, and Viewport transform.
• Part 6: (Triangle) rasterization and setup.
• Part 7: Z/Stencil processing, 3 different ways.
• Part 8: Pixel processing – “fork phase”.
• Part 9: Pixel processing – “join phase”.
• Part 10: Geometry Shaders.
• Part 11: Stream-Out.
• Part 12: Tessellation.
• Part 13: Compute Shaders.
A trip through the Graphics Pipeline 2011, part 1
July 1, 2011
It’s been a while since I posted something here, and I figured I might use this
spot to explain some general points about graphics hardware and software as
of 2011; you can find functional descriptions of what the graphics stack in your
PC does, but usually not the “how” or “why”; I’ll try to fill in the blanks without
getting too specific about any particular piece of hardware. I’m going to be
mostly talking about DX11-class hardware running D3D9/10/11 on Windows,
because that happens to be the (PC) stack I’m most familiar with – not that the
API details etc. will matter much past this first part; once we’re actually on the
GPU it’s all native commands.
The application
This is your code. These are also your bugs. Really. Yes, the API runtime and
the driver have bugs, but this is not one of them. Now go fix it already.
The API runtime
You make your resource creation / state setting / draw calls to the API. The
API runtime keeps track of the current state your app has set, validates
parameters and does other error and consistency checking, manages
user-visible resources, may or may not validate shader code and shader
linkage (or at least D3D does; in OpenGL this is handled at the driver level),
maybe batches work some more, and then hands it all over to the graphics
driver – more precisely, the user-mode driver.
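To make the division of labor concrete, here is a minimal sketch of a runtime that tracks state, validates parameters and then forwards the work to the driver. Everything in it is made up for illustration: `ToyRuntime`, `FakeUserModeDriver` and the state names are hypothetical and do not correspond to any real D3D or OpenGL interface.

```python
class FakeUserModeDriver:
    """Hypothetical stand-in for the UMD; it only records what it is handed."""

    def __init__(self):
        self.received = []

    def draw(self, state, vertex_count):
        # Snapshot the state so later changes by the app don't alter the record.
        self.received.append((dict(state), vertex_count))


class ToyRuntime:
    """Toy API runtime: tracks state, validates parameters, forwards to the driver."""

    VALID_STATES = {"shader", "texture0", "blend"}  # assumed, not a real state list

    def __init__(self, driver):
        self.driver = driver
        self.state = {}

    def set_state(self, name, value):
        if name not in self.VALID_STATES:
            raise ValueError(f"unknown state: {name}")
        self.state[name] = value

    def draw(self, vertex_count):
        # Error and consistency checking happens before the driver sees anything.
        if vertex_count <= 0:
            raise ValueError("vertex_count must be positive")
        if "shader" not in self.state:
            raise RuntimeError("no shader bound")
        self.driver.draw(self.state, vertex_count)
```

The point of the sketch is only the ordering: by the time the driver gets a call, the runtime has already rejected the obviously broken ones.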
The user-mode graphics driver (or UMD)
This is where most of the “magic” on the CPU side happens. If your app
crashes because of some API call you did, it will usually be in here :). It’s called
“nvd3dum.dll” (NVidia) or “atiumd*.dll” (AMD). As the name suggests, this is
user-mode code; it’s running in the same context and address space as your
app (and the API runtime) and has no elevated privileges whatsoever. It
implements a lower-level API (the DDI) that is called by D3D; this API is fairly
similar to the one you’re seeing on the surface, but a bit more explicit about
things like memory management and such.
This module is where things like shader compilation happen. D3D passes a
pre-validated shader token stream to the UMD – i.e. it’s already checked that
the code is valid in the sense of being syntactically correct and obeying D3D
constraints (using the right types, not using more textures/samplers than
available, not exceeding the number of available constant buffers, stuff like
that). This is compiled from HLSL code and usually has quite a number of
high-level optimizations (various loop optimizations, dead-code elimination,
constant propagation, predicating ifs etc.) applied to it – this is good news
since it means the driver benefits from all these relatively costly optimizations
that have been performed at compile time. However, it also has a bunch of
lower-level optimizations (such as register allocation and loop unrolling)
applied that drivers would rather do themselves; long story short, this usually
just gets immediately turned into an intermediate representation (IR) and then
compiled some more; shader hardware is close enough to D3D bytecode that
compilation doesn’t need to work wonders to give good results (and the HLSL
compiler having done some of the high-yield and high-cost optimizations
already definitely helps), but there’s still lots of low-level details (such as HW
resource limits and scheduling constraints) that D3D neither knows nor cares
about, so this is not a trivial process.
And of course, if your app is a well-known game, programmers at NV/AMD
have probably looked at your shaders and written hand-optimized replacements
for their hardware – though they’d better produce the same results lest there be
a scandal :). These shaders get detected and substituted by the UMD too.
You’re welcome.
More fun: Some of the API state may actually end up being compiled into the
shader – to give an example, relatively exotic (or at least infrequently used)
features such as texture borders are probably not implemented in the texture
sampler, but emulated with extra code in the shader (or just not supported at
all). This means that there’s sometimes multiple versions of the same shader
floating around, for different combinations of API states.
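One way to picture the "multiple versions of the same shader" situation is a cache keyed by both the shader and the relevant bits of API state. The sketch below is an illustration under stated assumptions, not how any particular driver does it: `ShaderVariantCache`, `toy_compile` and the `"border"` state flag are all hypothetical names.

```python
class ShaderVariantCache:
    """Caches one compiled variant per (shader, state combination) pair."""

    def __init__(self, compile_fn):
        self._compile = compile_fn
        self._variants = {}

    def get(self, shader_id, state_key):
        # state_key must be hashable, e.g. a frozenset of state flags.
        key = (shader_id, state_key)
        if key not in self._variants:
            self._variants[key] = self._compile(shader_id, state_key)
        return self._variants[key]

    def __len__(self):
        return len(self._variants)


def toy_compile(shader_id, state_key):
    """Hypothetical compiler: appends emulation code when texture borders are used."""
    suffix = "+border_emulation" if "border" in state_key else ""
    return f"{shader_id}{suffix}"


cache = ShaderVariantCache(toy_compile)
plain = cache.get("ps_water", frozenset())
bordered = cache.get("ps_water", frozenset({"border"}))
```

Here the same source shader ends up as two distinct compiled variants, and asking again for either combination hits the cache instead of recompiling.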
Incidentally, this is also the reason why you’ll often see a delay the first time
you use a new shader or resource; a lot of the creation/compilation work is
deferred by the driver and only executed when it’s actually necessary (you
wouldn’t believe how much unused crap some apps create!). Graphics
programmers know the other side of the story – if you want to make sure
something is actually created (as opposed to just having memory reserved),
you need to issue a dummy draw call that uses it to “warm it up”. Ugly and
annoying, but this has been the case since I first started using 3D hardware in
1999 – meaning, it’s pretty much a fact of life by this point, so get used to it. :)
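The deferred-creation pattern and the "warm it up" workaround can be sketched in a few lines. This is a toy model only: `LazyResource`, `warm_up` and the `build_fn` callback are hypothetical names, and a real warm-up is a dummy draw call rather than a direct method call.

```python
class LazyResource:
    """Resource whose expensive creation is deferred until first use."""

    def __init__(self, name, build_fn):
        self.name = name
        self._build = build_fn
        self._native = None
        self._realized = False

    @property
    def is_realized(self):
        return self._realized

    def use(self):
        # First use pays the creation/compilation cost; later uses are cheap.
        if not self._realized:
            self._native = self._build(self.name)
            self._realized = True
        return self._native


def warm_up(resources):
    """Touch every resource once so the cost is paid up front, not mid-frame."""
    for resource in resources:
        resource.use()
```

Calling `warm_up` during loading moves the hitch to a place where nobody notices it, which is the whole purpose of the dummy draw.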
Anyway, moving on. The UMD also gets to deal with fun stuff like all the D3D9
“legacy” shader versions and the fixed function pipeline – yes, all of that will get
faithfully passed through by D3D. The 3.0 shader profile ain’t that bad (it’s
quite reasonable in fact), but 2.0 is crufty and the various 1.x shader versions
are seriously whack – remember 1.3 pixel shaders? Or, for that matter, the
fixed-function vertex pipeline with vertex lighting and such? Yeah, support for
all that’s still there in D3D and the guts of every modern graphics driver, though
of course they just translate it to newer shader versions by now (and have
been doing so for quite some time).
Then there’s things like memory management. The UMD will get things like
texture creation commands and need to provide space for them. Actually, the
UMD just suballocates some larger memory blocks it gets from the KMD
(kernel-mode driver); actually mapping and unmapping pages (and managing
which part of video memory the UMD can see, and conversely which parts of
system memory the GPU may access) is a kernel-mode privilege and can’t be
done by the UMD.
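Suballocation from a larger block can be illustrated with a simple bump allocator. This is a minimal sketch under assumptions of my own: `KernelBlock` and `SubAllocator` are hypothetical names, the 256-byte default alignment is arbitrary, the block base is assumed to be suitably aligned already, and real drivers use considerably more elaborate schemes (including freeing and reuse).

```python
class KernelBlock:
    """Hypothetical large block handed out by the KMD: a base address and a size."""

    def __init__(self, base, size):
        self.base = base
        self.size = size


class SubAllocator:
    """Bump allocator that carves aligned pieces out of one KernelBlock."""

    def __init__(self, block):
        self.block = block
        self.offset = 0  # first unused byte, relative to block.base

    def alloc(self, size, alignment=256):
        # Round the current offset up to the next multiple of the alignment.
        start = (self.offset + alignment - 1) // alignment * alignment
        if start + size > self.block.size:
            return None  # out of space; a real UMD would ask the KMD for more
        self.offset = start + size
        return self.block.base + start
```

Note that nothing here maps or unmaps pages; the sketch only hands out addresses inside memory that someone with kernel privileges has already set up.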
But the UMD can do things like swizzling textures (unless the GPU can do this
in hardware, usually using 2D blitting units rather than the real 3D pipeline) and
schedule transfers between system memory and (mapped) video memory and
the like. Most importantly, it can also write command buffers (or “DMA buffers”
– I’ll be using these two names interchangeably) once the KMD has allocated
them and handed them over. A command buffer contains, well, commands :).
All your state-changing and drawing operations will be converted by the UMD
into commands that the hardware understands. As will a lot of things you don’t
trigger manually – such as uploading textures and shaders to video memory.
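As a rough picture of what "converting operations into commands" means, here is a toy command buffer that packs state changes and draws into a flat byte stream. The packet layout (three little-endian 32-bit words) and the opcodes are invented for this sketch; actual hardware command formats are vendor-specific and look nothing like this in detail.

```python
import struct

# Made-up opcodes for illustration only.
OP_SET_STATE = 1
OP_DRAW = 2
PACKET_FORMAT = "<III"  # opcode, operand a, operand b
PACKET_SIZE = struct.calcsize(PACKET_FORMAT)


class CommandBuffer:
    """Fixed-capacity buffer that the (toy) UMD fills with encoded commands."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = bytearray()

    def _emit(self, opcode, a, b):
        if len(self.data) + PACKET_SIZE > self.capacity:
            # A real driver would submit this buffer and start a fresh one.
            raise BufferError("command buffer full")
        self.data += struct.pack(PACKET_FORMAT, opcode, a, b)

    def set_state(self, state_id, value):
        self._emit(OP_SET_STATE, state_id, value)

    def draw(self, first_vertex, vertex_count):
        self._emit(OP_DRAW, first_vertex, vertex_count)


def decode(data):
    """Decode the byte stream back into (opcode, a, b) tuples, as a consumer would."""
    return [struct.unpack_from(PACKET_FORMAT, data, offset)
            for offset in range(0, len(data), PACKET_SIZE)]
```

The useful takeaway is that after this step there are no API objects left, just a stream of packets that some consumer reads in order.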
In general, drivers will try to put as much of the actual processing into the UMD
as possible; the UMD is user-mode code, so anything that runs in it doesn’t
need any costly kernel-mode transitions, it can freely allocate memory, farm
work out to multiple threads, and so on – it’s just a regular DLL (even though
it’s loaded by the API, not directly by your app). This has advantages for driver
development too – if the UMD crashes, the app crashes with it, but not the
whole system; it can just be replaced while the system is running (it’s just a
DLL!); it can be debugged with a regular debugger; and so on. So it’s not only
efficient, it’s also convenient.
But there’s a big elephant in the room that I haven’t mentioned yet.
Did I say “user-mode driver”? I meant “user-mode drivers”.
As said, the UMD is just a DLL. Okay, one that happens to have the blessing of
D3D and a direct pipe to the KMD, but it’s still a regular DLL, and it runs in the
address space of its calling process.
But we’re using multi-tasking OSes nowadays. In fact, we have been for some
time.
This “GPU” thing I keep talking about? That’s a shared resource. There’s only
one that drives your main display (even if you use SLI/Crossfire). Yet we have
multiple apps that try to access it (and pretend they’re the only ones doing it).
This doesn’t just work automatically; back in The Olden Days, the solution was
to only give 3D to one app at a time, and while that app was active, all others
wouldn’t have access. But that doesn’t really cut it if you’re trying to have your
windowing system use the GPU for rendering. Which is why you need some
component that arbitrates access to the GPU and allocates time-slices and
such.
Enter the scheduler.
This is a system component – note the “the” is somewhat misleading; I’m
talking about the graphics scheduler here, not the CPU or IO schedulers. This
does exactly what you think it does – it arbitrates access to the 3D pipeline by
time-slicing it between different apps that want to use it. A context switch
incurs, at the very least, some state switching on the GPU (which generates
extra commands for the command buffer) and possibly also swapping some
resources in and out of video memory. And of course only one process gets to
actually submit commands to the 3D pipe at any given time.
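Time-slicing the 3D pipe between apps can be sketched as a round-robin loop over per-app queues of command buffers. This is a deliberately naive model: `run_round_robin` and the `slice_size` parameter are hypothetical, and a real graphics scheduler also deals with priorities, preemption and memory residency, none of which appear here.

```python
from collections import deque


def run_round_robin(queues, slice_size):
    """Interleave submissions from several apps, one time slice at a time.

    queues: dict mapping app name -> list of pending command buffers.
    Returns a timeline of ("switch", app) and ("submit", app, buffer) events.
    """
    pending = deque((app, deque(bufs)) for app, bufs in queues.items() if bufs)
    timeline = []
    current = None
    while pending:
        app, bufs = pending.popleft()
        if app != current:
            # A context switch costs something: state has to be re-established.
            timeline.append(("switch", app))
            current = app
        for _ in range(min(slice_size, len(bufs))):
            timeline.append(("submit", app, bufs.popleft()))
        if bufs:
            pending.append((app, bufs))  # not finished; go to the back of the line
    return timeline


timeline = run_round_robin({"game": ["g0", "g1", "g2"], "desktop": ["d0"]}, slice_size=2)
```

Even in this toy version, only one app is submitting at any point in the timeline, and every change of owner shows up as an explicit switch event.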
You’ll often find console programmers complaining about the fairly high-level,
hands-off nature of PC 3D APIs, and the performance cost this incurs. But the
thing is that 3D APIs/drivers on PC really have a more complex problem to
solve than console games – they really do need to keep track of the full current
state for example, since someone may pull the metaphorical rug from under
them at any moment! They also work around broken apps and try to fix
performance problems behind their backs; this is a rather annoying practice
that no-one’s happy with, certainly including the driver authors themselves, but
the fact is that the business perspective wins here; people expect stuff that
runs to continue running (and doing so smoothly). You just won’t win any
friends by yelling “BUT IT’S WRONG!” at the app and then sulking and going
through an ultra-slow path.
Anyway, on with the pipeline. Next stop: Kernel mode!
The kernel-mode driver (KMD)
This is the part that actually deals with the hardware. There may be multiple
UMD instances running at any one time, but there’s only ever one KMD, and if
that crashes, then boom you’re dead – used to be “blue screen” dead, but by
now Windows actually knows how to kill a crashed driver and reload it
(progress!). As long as it happens to be just a crash and not some kernel
memory corruption at least – if that happens, all bets are off.
The KMD deals with all the things that are just there once. There’s only one
GPU memory, even though there’s multiple apps fighting over it. Someone
needs to call the shots and actually allocate (and map) physical memory.
Similarly, someone must initialize the GPU at startup, set display modes (and
get mode information from displays), manage the hardware mouse cursor (yes,
there’s HW handling for this, and yes, you really only get one! :), program the
HW watchdog timer so the GPU gets reset if it stays unresponsive for a certain
time, respond to interrupts, and so on. This is what the KMD does.
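The watchdog idea is simple enough to model in software: if no progress has been reported for longer than some timeout, trigger a reset. The sketch below is only an analogy for the hardware timer; `Watchdog`, `heartbeat`, `poll` and the timeout value are all assumptions made for illustration, and time is passed in explicitly to keep the example deterministic.

```python
class Watchdog:
    """Software model of a watchdog: resets the device if it stops making progress."""

    def __init__(self, timeout, reset_fn):
        self.timeout = timeout
        self.reset_fn = reset_fn
        self.last_progress = 0.0

    def heartbeat(self, now):
        # Called whenever the (toy) GPU reports that it is still making progress.
        self.last_progress = now

    def poll(self, now):
        # Returns True if the timeout expired and a reset was triggered.
        if now - self.last_progress > self.timeout:
            self.reset_fn()
            self.last_progress = now  # start counting again after the reset
            return True
        return False
```

A usage pattern would be to call `heartbeat` on every completed batch of work and `poll` from a periodic timer.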
There’s also this whole content protection/DRM bit about setting up a
protected/DRM’ed path between a video player and the GPU so the actual
precious decoded video pixels aren’t visible to any dirty user-mode code that
might do awful forbidden things like dump them to disk (…whatever). The KMD
has some involvement in that too.