Understanding The
Linux Virtual Memory Manager
Mel Gorman
6th December 2003
Preface
Linux is developed with a stronger practical emphasis than a theoretical one. When
new algorithms or changes to existing implementations are suggested, it is common
to request code to match the argument. Many of the algorithms used in the Virtual
Memory (VM) system were designed by theorists but the implementations have now
diverged from the theory considerably. In part, Linux does follow the traditional
development cycle of design to implementation but it is more common for changes
to be made in reaction to how the system behaved in the “real-world” and intuitive
decisions by developers.
This means that the VM performs well in practice but there is very little VM
specific documentation available except for a few incomplete overviews in a small
number of websites, except the web site containing an earlier draft of this book
of course! This has lead to the situation where the VM is fully understood only
by a small number of core developers. New developers looking for information on
how it functions are generally told to read the source and little or no information is
available on the theoretical basis for the implementation. This requires that even a
casual observer invest a large amount of time to read the code and study the field
of Memory Management.
This book, gives a detailed tour of the Linux VM as implemented in 2.4.22
and gives a solid introduction of what to expect in 2.6. As well as discussing the
implementation, the theory it is is based on will also be introduced. This is not
intended to be a memory management theory book but it is often much simpler to
understand why the VM is implemented in a particular fashion if the underlying
basis is known in advance.
To complement the description, the appendix includes a detailed code comment-
ary on a significant percentage of the VM. This should drastically reduce the amount
of time a developer or researcher needs to invest in understanding what is happening
inside the Linux VM. As VM implementations tend to follow similar code patterns
even between major versions. This means that with a solid understanding of the 2.4
VM, the later 2.5 development VMs and the final 2.6 release will be decipherable in
a number of weeks.
The Intended Audience
Anyone interested in how the VM, a core kernel subsystem, works will find answers
to many of their questions in this book. The VM, more than any other subsystem,
i
Preface
ii
affects the overall performance of the operating system. It is also one of the most
poorly understood and badly documented subsystem in Linux, partially because
there is, quite literally, so much of it. It is very difficult to isolate and understand
individual parts of the code without first having a strong conceptual model of the
whole VM, so this book intends to give a detailed description of what to expect
without before going to the source.
This material should be of prime interest to new developers interested in adapting
the VM to their needs and to readers who simply would like to know how the VM
works. It also will benefit other subsystem developers who want to get the most from
the VM when they interact with it and operating systems researchers looking for
details on how memory management is implemented in a modern operating system.
For others, who are just curious to learn more about a subsystem that is the focus of
so much discussion, they will find an easy to read description of the VM functionality
that covers all the details without the need to plough through source code.
However, it is assumed that the reader has read at least one general operating
system book or one general Linux kernel orientated book and has a general know-
ledge of C before tackling this book. While every effort is made to make the material
approachable, some prior knowledge of general operating systems is assumed.
Book Overview
In chapter 1, we go into detail on how the source code may be managed and de-
ciphered. Three tools will be introduced that are used for the analysis, easy browsing
and management of code. The main tools are the Linux Cross Referencing (LXR)
tool which allows source code to be browsed as a web page and CodeViz for gener-
ating call graphs which was developed while researching this book. The last tool,
PatchSet is for managing kernels and the application of patches. Applying patches
manually can be time consuming and the use of version control software such as
CVS (http://www.cvshome.org/ ) or BitKeeper (http://www.bitmover.com) are not
always an option. With this tool, a simple specification file determines what source
to use, what patches to apply and what kernel configuration to use.
In the subsequent chapters, each part of the Linux VM implementation will be
discussed in detail, such as how memory is described in an architecture independent
manner, how processes manage their memory, how the specific allocators work and
so on. Each will refer to the papers that describe closest the behaviour of Linux
as well as covering in depth the implementation, the functions used and their call
graphs so the reader will have a clear view of how the code is structured. At the
end of each chapter, there will be a “What’s New” section which introduces what to
expect in the 2.6 VM.
The appendices are a code commentary of a significant percentage of the VM. It
gives a line by line description of some of the more complex aspects of the VM. The
style of the VM tends to be reasonably consistent, even between major releases of
the kernel so an in-depth understanding of the 2.4 VM will be an invaluable aid to
understanding the 2.6 kernel when it is released.
Preface
What’s New in 2.6
iii
At the time of writing, 2.6.0-test4 has just been released so 2.6.0-final is due
“any month now” which means December 2003 or early 2004. Fortunately the 2.6
VM, in most ways, is still quite recognisable in comparison to 2.4. However, there
is some new material and concepts in 2.6 and it would be pity to ignore them so
to address this, hence the “What’s New in 2.6” sections. To some extent, these
sections presume you have read the rest of the book so only glance at them during
the first reading.
If you decide to start reading 2.5 and 2.6 VM code, the basic
description of what to expect from the “Whats New” sections should greatly aid
your understanding.
It is important to note that the sections are based on the
2.6.0-test4 kernel which should not change change significantly before 2.6. As
they are still subject to change though, you should still treat the “What’s New”
sections as guidelines rather than definite facts.
Companion CD
A companion CD is included with this book and it is highly recommended the
reader become familiar with it, especially as you progress more through the book
and are using the code commentary. It is recommended that the CD is used with a
GNU/Linux system but it is not required.
The text of the book is constained the CD in HTML, PDF and HTML formats
so the reader can perform basic text searches if the index does not have the desired
information. If you are reading the first edition of the book, you may notice small
differences between the CD version and the paper version due to printing deadlines
but they will be minor.
Almost all the tools used to research the book’s material are contained on the
CD. Each of the tools may be installed on virtually any GNU/Linux installation
and references are included to available documentation and the project home sites
so you can check for further updates.
With many GNU/Linux installations, there is the additional bonus of being able
to run a web server directly from the CD which has been tested with Red Hat 7.3
and Debian Woody but should work with any distribution. The small web site it
provides at http://localhost:10080 provides a number of useful features;
• A searchable index for functions that have a code commentary available. If a
function is searched for that does not have a commentary, the browser will be
automatically redirected to LXR;
• A web browsable copy of the Linux 2.4.22 source. This allows code to be
browsed and identifiers to be searched for;
• A live version of CodeViz, the tool used to generate call-graphs for the book, is
available. If you feel that the book’s graphs are lacking some detail you want,
generate them yourself;
Preface
iv
• The VM Regress, CodeViz and patchset packages which are discussed in
Chapter 1 are available in /cdrom/software. gcc-3.0.4 is also provided as it
is required for building CodeViz.
Mount the CD on /cdrom as followed;
root@joshua:/$ mount /dev/cdrom /cdrom -o exec
The webserver is Apache 1.3.27 (http://www.apache.org/ ) and has been built
and configured to run with it’s root as /cdrom/. If your distribution normally uses
another directory, you will need to use this one instead. To start it, run the script
/cdrom/start_server. If there are no errors, the output should look like:
mel@joshua:~$ /cdrom/start_server
Starting CodeViz Server: done
done
Starting Apache Server:
The URL to access is http://localhost:10080/
If the server starts successfully, point your browser to http://localhost:10080
to avail of the CDs web services. To shutdown the server, run the script
/cdrom/stop_server and the CD may then be unmounted.
Typographic Conventions
The conventions used in this document are simple. New concepts that are introduced
as well as URLs are in italicised font. Binaries and package names are are in bold.
Structures, field names, compile time defines and variables are in a constant-width
font. At times when talking about a field in a structure, both the structure and field
name will be included like page→list for example. Filenames are in a constant-
width font but include files have angle brackets around them like
and may be found in the include/ directory of the kernel source.
Acknowledgments
The compilation of this book was not a trivial task. This book was researched and
developed in the open and it would be remiss of me not to mention some of the
people who helped me at various intervals. If there is anyone I missed, I apologise
now.
First, I would like to thank John O’Gorman who tragically passed away while
the material for this book was being researched. It was his experience and guidance
that largely inspired the format and quality of this book.
Secondly, I would like to thank Mark L. Taub from Prentice Hall PTR for giving
me the opportunity to publish this book. It has being a rewarding experience and it
Preface
v
made trawling through all the code worthwhile. Massive thanks go to my reviewers
who provided clear and detailed feedback long after I thought I had finished writing.
Finally, on the publishers front, I would like to thank Bruce Perens for allowing me to
publish under the Bruce Peren’s Open Book Series (http://www.perens.com/Books).
With the technical research, a number of people provided invaluable insight.
Abhishek Nayani, was a source of encouragement and enthusiasm early in the re-
search. Ingo Oeser kindly provided invaluable assistance early on with a detailed
explanation on how data is copied from userspace to kernel space including some
valuable historical context. He also kindly offered to help me if I felt I ever got
lost in the twisty maze of kernel code. Scott Kaplan made numerous corrections to
a number of systems from non-contiguous memory allocation, to page replacement
policy. Jonathon Corbet provided the most detailed account of the history of the
kernel development with the kernel page he writes for Linux Weekly News. Zack
Brown, the chief behind Kernel Traffic, is the sole reason I did not drown in kernel
related mail. IBM, as part of the Equinox Project, provided an xSeries 350 which
was invaluable for running my own test kernels on machines larger than what I pre-
viously had access to. Finally, Patrick Healy was crucial to ensuring that this book
was consistent and approachable to people who are familiar, but not experts, on
Linux or memory management.
A number of people helped with smaller technical issues and general inconsisten-
cies where material was not covered in sufficient depth. They are Muli Ben-Yehuda,
Parag Sharma, Matthew Dobson, Roger Luethi, Brian Lowe and Scott Crosby. All of
them sent corrections and queries on differnet parts of the document which ensured
too much prior knowledge was assumed.
Carl Spalletta sent a number of queries and corrections to every aspect of the
book in its earlier online form. Steve Greenland sent a large number of grammar
corrections. Philipp Marek went above and beyond being helpful sending over 90
separate corrections and queries on various aspects. Long after I thought I was
finished, Aris Sotiropoulos sent a large number of small corrections and suggestions.
The last person, whose name I cannot remember but is an editor for a magazine
sent me over 140 corrections against an early version to the document. You know
who you are, thanks.
Eleven people sent a few corrections, though small, were still missed by several
of my own checks. They are Marek Januszewski, Amit Shah, Adrian Stanciu, Andy
Isaacson, Jean Francois Martinez, Glen Kaukola, Wolfgang Oertl, Michael Babcock,
Kirk True, Chuck Luciano and David Wilson.
On the development of VM Regress, there were nine people who helped me keep
it together. Danny Faught and Paul Larson both sent me a number of bug reports
and helped ensure it worked with a variety of different kernels. Cliff White, from
the OSDL labs ensured that VM Regress would have a wider application than my
own test box. Dave Olien, also associated with the OSDL labs was responsible
for updating VM Regress to work with 2.5.64 and later kernels. Albert Cahalan
sent all the information I needed to make it function against later proc utilities.
Finally, Andrew Morton, Rik van Riel and Scott Kaplan all provided insight on
what direction the tool should be developed to be both valid and useful.
Preface
vi
The last long list are people who sent me encouragement and thanks at various
intervals. They are Martin Bligh, Paul Rolland, Mohamed Ghouse, Samuel Chess-
man, Ersin Er, Mark Hoy, Michael Martin, Martin Gallwey, Ravi Parimi, Daniel
Codt, Adnan Shafi, Xiong Quanren, Dave Airlie, Der Herr Hofrat, Ida Hallgren,
Manu Anand, Eugene Teo, Diego Calleja and Ed Cashin. Thanks, the encourage-
ment was heartening.
In conclusion, I would like to thank a few people without whom, I would not
have completed this.
I would like to thank my parents who kept me going long
after I should have been earning enough money to support myself. I would like to
thank my girlfriend Karen, who patiently listened to rants, tech babble, angsting
over the book and made sure I was the person with the best toys. Kudos to friends
who dragged me away from the computer periodically and kept me relatively sane,
including Daren who is cooking me dinner as I write this. Finally, I would like to
thank the thousands of hackers that have contributed to GNU, the Linux kernel
and other Free Software projects over the years who without I would not have an
excellent system to write about. It was an inspiration to me to see such dedication
when I first started programming on my own PC 6 years ago after finally figuring
out that Linux was not an application for Windows used for reading email.
Contents
List of Figures
List of Tables
xiii
xvi
1 Introduction
1
2
1.1 Getting Started . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1.2 Managing the Source . . . . . . . . . . . . . . . . . . . . . . . . . . .
4
1.3 Browsing the Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.4 Reading the Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.5 Submitting Patches . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2 Describing Physical Memory
14
2.1 Nodes
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2 Zones . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3 Zone Initialisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.4 Pages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.5 High Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.6 What’s New In 2.6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3 Page Table Management
32
3.1 Describing the Page Directory . . . . . . . . . . . . . . . . . . . . . . 33
3.2 Describing a Page Table Entry . . . . . . . . . . . . . . . . . . . . . . 35
3.3 Using Page Table Entries . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.4 Translating and Setting Page Table Entries . . . . . . . . . . . . . . . 38
3.5 Allocating and Freeing Page Tables . . . . . . . . . . . . . . . . . . . 38
3.6 Kernel Page Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.7 Mapping addresses to a struct page . . . . . . . . . . . . . . . . . . 41
3.8 Translation Lookaside Buffer (TLB) . . . . . . . . . . . . . . . . . . . 42
3.9 Level 1 CPU Cache Management
. . . . . . . . . . . . . . . . . . . . 43
3.10 What’s New In 2.6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4 Process Address Space
52
4.1 Linear Address Space . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.2 Managing the Address Space . . . . . . . . . . . . . . . . . . . . . . . 54
4.3 Process Address Space Descriptor . . . . . . . . . . . . . . . . . . . . 54
vii