Table of Contents
Preface
The Audience for This Book
Organization of the Material
Level of Description
Overview of the Book
Background Information
Conventions in This Book
How to Contact Us
Safari® Enabled
Acknowledgments
Introduction
Linux Versus Other Unix-Like Kernels
Hardware Dependency
Linux Versions
Basic Operating System Concepts
Multiuser Systems
Users and Groups
Processes
Kernel Architecture
An Overview of the Unix Filesystem
Files
Hard and Soft Links
File Types
File Descriptor and Inode
Access Rights and File Mode
File-Handling System Calls
Opening a file
Accessing an opened file
Closing a file
Renaming and deleting a file
An Overview of Unix Kernels
The Process/Kernel Model
Process Implementation
Reentrant Kernels
Process Address Space
Synchronization and Critical Regions
Kernel preemption disabling
Interrupt disabling
Semaphores
Spin locks
Avoiding deadlocks
Signals and Interprocess Communication
Process Management
Zombie processes
Process groups and login sessions
Memory Management
Virtual memory
Random access memory usage
Kernel Memory Allocator
Process virtual address space handling
Caching
Device Drivers
Memory Addressing
Memory Addresses
Segmentation in Hardware
Segment Selectors and Segmentation Registers
Segment Descriptors
Fast Access to Segment Descriptors
Segmentation Unit
Segmentation in Linux
The Linux GDT
The Linux LDTs
Paging in Hardware
Regular Paging
Extended Paging
Hardware Protection Scheme
An Example of Regular Paging
The Physical Address Extension (PAE) Paging Mechanism
Paging for 64-bit Architectures
Hardware Cache
Translation Lookaside Buffers (TLB)
Paging in Linux
The Linear Address Fields
Page Table Handling
Physical Memory Layout
Process Page Tables
Kernel Page Tables
Provisional kernel Page Tables
Final kernel Page Table when RAM size is less than 896MB
Final kernel Page Table when RAM size is between 896MB and 4096MB
Final kernel Page Table when RAM size is more than 4096MB
Fix-Mapped Linear Addresses
Handling the Hardware Cache and the TLB
Handling the hardware cache
Handling the TLB
Processes
Processes, Lightweight Processes, andThreads
Process Descriptor
Process State
Identifying a Process
Process descriptors handling
Identifying the current process
Doubly linked lists
The process list
The lists of TASK_RUNNING processes
Relationships Among Processes
The pidhash table and chained lists
How Processes Are Organized
Wait queues
Handling wait queues
Process Resource Limits
Process Switch
Hardware Context
Task State Segment
The thread field
Performing the Process Switch
The switch_to macro
The __switch_to() function
Saving and Loading the FPU, MMX, and XMM Registers
Saving the FPU registers
Loading the FPU registers
Using the FPU, MMX, and SSE/SSE2 units in Kernel Mode
Creating Processes
The clone(), fork(), and vfork() System Calls
The do_fork() function
The copy_process() function
Kernel Threads
Creating a kernel thread
Process 0
Process 1
Other kernel threads
Destroying Processes
Process Termination
The do_group_exit() function
The do_exit() function
Process Removal
Interrupts and Exceptions
The Role of Interrupt Signals
Interrupts and Exceptions
IRQs and Interrupts
The Advanced Programmable Interrupt Controller (APIC)
Exceptions
Interrupt Descriptor Table
Hardware Handling of Interrupts and Exceptions
Nested Execution of Exception and Interrupt Handlers
Initializing the Interrupt Descriptor Table
Interrupt, Trap, and System Gates
Preliminary Initialization of the IDT
Exception Handling
Saving the Registers for the Exception Handler
Entering and Leaving the Exception Handler
Interrupt Handling
I/O Interrupt Handling
Interrupt vectors
IRQ data structures
IRQ distribution in multiprocessor systems
Multiple Kernel Mode stacks
Saving the registers for the interrupt handler
The do_IRQ() function
The __do_IRQ() function
Reviving a lost interrupt
Interrupt service routines
Dynamic allocation of IRQ lines
Interprocessor Interrupt Handling
Softirqs and Tasklets
Softirqs
Data structures used for softirqs
Handling softirqs
The do_softirq() function
The __do_softirq() function
The ksoftirqd kernel threads
Tasklets
Work Queues
Work queue data structures
Work queue functions
The predefined work queue
Returning from Interrupts and Exceptions
The entry points
Resuming a kernel control path
Checking for kernel preemption
Resuming a User Mode program
Checking for rescheduling
Handling pending signals, virtual-8086 mode, and single stepping
Kernel Synchronization
How the Kernel Services Requests
Kernel Preemption
When Synchronization Is Necessary
When Synchronization Is Not Necessary
Synchronization Primitives
Per-CPU Variables
Atomic Operations
Optimization and Memory Barriers
Spin Locks
The spin_lock macro with kernel preemption
The spin_lock macro without kernel preemption
The spin_unlock macro
Read/Write Spin Locks
Getting and releasing a lock for reading
Getting and releasing a lock for writing
Seqlocks
Read-Copy Update (RCU)
Semaphores
Getting and releasing semaphores
Read/Write Semaphores
Completions
Local Interrupt Disabling
Disabling and Enabling Deferrable Functions
Synchronizing Accesses to Kernel Data Structures
Choosing Among Spin Locks, Semaphores, and Interrupt Disabling
Protecting a data structure accessed by exceptions
Protecting a data structure accessed by interrupts
Protecting a data structure accessed by deferrable functions
Protecting a data structure accessed by exceptions and interrupts
Protecting a data structure accessed by exceptions and deferrable functions
Protecting a data structure accessed by interrupts and deferrable functions
Protecting a data structure accessed by exceptions, interrupts, and deferrable functions
Examples of Race Condition Prevention
Reference Counters
The Big Kernel Lock
Memory Descriptor Read/Write Semaphore
Slab Cache List Semaphore
Inode Semaphore
Timing Measurements
Clock and Timer Circuits
Real Time Clock (RTC)
Time Stamp Counter (TSC)
Programmable Interval Timer (PIT)
CPU Local Timer
High Precision Event Timer (HPET)
ACPI Power Management Timer
The Linux Timekeeping Architecture
Data Structures of the Timekeeping Architecture
The timer object
The jiffies variable
The xtime variable
Timekeeping Architecture in Uniprocessor Systems
Initialization phase
The timer interrupt handler
Timekeeping Architecture in Multiprocessor Systems
Initialization phase
The global timer interrupt handler
The local timer interrupt handler
Updating the Time and Date
Updating System Statistics
Updating Local CPU Statistics
Keeping Track of System Load
Profiling the Kernel Code
Checking the NMI Watchdogs
Software Timers and Delay Functions
Dynamic Timers
Dynamic timers and race conditions
Data structures for dynamic timers
Dynamic timer handling
An Application of Dynamic Timers: the nanosleep() System Call
Delay Functions
System Calls Related to Timing Measurements
The time() and gettimeofday() System Calls
The adjtimex() System Call
The setitimer() and alarm() System Calls
System Calls for POSIX Timers
Process Scheduling
Scheduling Policy
Process Preemption
How Long Must a Quantum Last?
The Scheduling Algorithm
Scheduling of Conventional Processes
Base time quantum
Dynamic priority and average sleep time
Active and expired processes
Scheduling of Real-Time Processes
Data Structures Used by the Scheduler
The runqueue Data Structure
Process Descriptor
Functions Used by the Scheduler
The scheduler_tick() Function
Updating the time slice of a real-time process
Updating the time slice of a conventional process
The try_to_wake_up() Function
The recalc_task_prio() Function
The schedule() Function
Direct invocation
Lazy invocation
Actions performed by schedule() before a process switch
Actions performed by schedule() to make the process switch
Actions performed by schedule() after a process switch
Runqueue Balancing in Multiprocessor Systems
Scheduling Domains
The rebalance_tick() Function
The load_balance() Function
The move_tasks() Function
System Calls Related to Scheduling
The nice() System Call
The getpriority() and setpriority() System Calls
The sched_getaffinity() and sched_setaffinity() System Calls
System Calls Related to Real-Time Processes
The sched_getscheduler() and sched_setscheduler() system calls
The sched_getparam() and sched_setparam() system calls
The sched_yield() system call
The sched_get_priority_min() and sched_get_priority_max() system calls
The sched_rr_get_interval() system call
Memory Management
Page Frame Management
Page Descriptors
Non-Uniform Memory Access (NUMA)
Memory Zones
The Pool of Reserved Page Frames
The Zoned Page Frame Allocator
Requesting and releasing page frames
Kernel Mappings of High-Memory Page Frames
Permanent kernel mappings
Temporary kernel mappings
The Buddy System Algorithm
Data structures
Allocating a block
Freeing a block
The Per-CPU Page Frame Cache
Allocating page frames through the per-CPU page frame caches
Releasing page frames to the per-CPU page frame caches
The Zone Allocator
Releasing a group of page frames
Memory Area Management
The Slab Allocator
Cache Descriptor
Slab Descriptor
General and Specific Caches
Interfacing the Slab Allocator with the Zoned Page Frame Allocator
Allocating a Slab to a Cache
Releasing a Slab from a Cache
Object Descriptor
Aligning Objects in Memory
Slab Coloring
Local Caches of Free Slab Objects
Allocating a Slab Object
Freeing a Slab Object
General Purpose Objects
Memory Pools
Noncontiguous Memory Area Management
Linear Addresses of Noncontiguous Memory Areas
Descriptors of Noncontiguous Memory Areas
Allocating a Noncontiguous Memory Area
Releasing a Noncontiguous Memory Area
Process Address Space
The Process’s Address Space
The Memory Descriptor
Memory Descriptor of Kernel Threads
Memory Regions
Memory Region Data Structures
Memory Region Access Rights
Memory Region Handling
Finding the closest region to a given address: find_vma()
Finding a region that overlaps a given interval: find_vma_intersection()
Finding a free interval: get_unmapped_area()
Inserting a region in the memory descriptor list: insert_vm_struct()
Allocating a Linear Address Interval
Releasing a Linear Address Interval
The do_munmap() function
The split_vma() function
The unmap_region() function
Page Fault Exception Handler
Handling a Faulty Address Outside the Address Space
Handling a Faulty Address Inside the Address Space
Demand Paging
Copy On Write
Handling Noncontiguous Memory Area Accesses
Creating and Deleting a Process Address Space
Creating a Process Address Space
Deleting a Process Address Space
Managing the Heap
System Calls
POSIX APIs and System Calls
System Call Handler and Service Routines
Entering and Exiting a System Call
Issuing a System Call via the int$0x80 Instruction
The system_call() function
Exiting from the system call
Issuing a System Call via the sysenter Instruction
The sysenter instruction
The vsyscall page
Entering the system call
Exiting from the system call
The sysexit instruction
The SYSENTER_RETURN code
Parameter Passing
Verifying the Parameters
Accessing the Process Address Space
Dynamic Address Checking: The Fix-up Code
The Exception Tables
Generating the Exception Tables and the Fixup Code
Kernel Wrapper Routines
Signals
The Role of Signals
Actions Performed upon Delivering a Signal
POSIX Signals and Multithreaded Applications
Data Structures Associated with Signals
The signal descriptor and the signal handler descriptor
The sigaction data structure
The pending signal queues
Operations on Signal Data Structures
Generating a Signal
The specific_send_sig_info() Function
The send_signal() Function
The group_send_sig_info() Function
Delivering a Signal
Executing the Default Action for the Signal
Catching the Signal
Setting up the frame
Evaluating the signal flags
Starting the signal handler
Terminating the signal handler
Reexecution of System Calls
Restarting a system call interrupted by a non-caught signal
Restarting a system call for a caught signal
System Calls Related to Signal Handling
The kill() System Call
The tkill() and tgkill() System Calls
Changing a Signal Action
Examining the Pending Blocked Signals
Modifying the Set of Blocked Signals
Suspending the Process
System Calls for Real-Time Signals
The Virtual Filesystem
The Role of the Virtual Filesystem (VFS)
The Common File Model
System Calls Handled by the VFS
VFS Data Structures
Superblock Objects
Inode Objects
File Objects
dentry Objects
The dentry Cache
Files Associated with a Process
Filesystem Types
Special Filesystems
Filesystem Type Registration
Filesystem Handling
Namespaces
Filesystem Mounting
Mounting a Generic Filesystem
The do_kern_mount() function
Allocating a superblock object
Mounting the Root Filesystem
Phase 1: Mounting the rootfs filesystem
Phase 2: Mounting the real root filesystem
Unmounting a Filesystem
Pathname Lookup
Standard Pathname Lookup
Parent Pathname Lookup
Lookup of Symbolic Links
Implementations of VFS System Calls
The open() System Call
The read() and write() System Calls
The close() System Call
File Locking
Linux File Locking
File-Locking Data Structures
FL_FLOCK Locks
FL_POSIX Locks
I/O Architecture and Device Drivers
I/O Architecture
I/O Ports
Accessing I/O ports
I/O Interfaces
Custom I/O interfaces
General-purpose I/O interfaces
Device Controllers
The Device Driver Model
The sysfs Filesystem
Kobjects
Kobjects, ksets, and subsystems
Registering kobjects, ksets, and subsystems
Components of the Device Driver Model
Devices
Drivers
Buses
Classes
Device Files
User Mode Handling of Device Files
Dynamic device number assignment
Dynamic device file creation
VFS Handling of Device Files
Device Drivers
Device Driver Registration
Device Driver Initialization
Monitoring I/O Operations
Polling mode
Interrupt mode
Accessing the I/O Shared Memory
Direct Memory Access (DMA)
Synchronous and asynchronous DMA
Helper functions for DMA transfers
Bus addresses
Cache coherency
Helper functions for coherent DMA mappings
Helper functions for streaming DMA mappings
Levels of Kernel Support
Character Device Drivers
Assigning Device Numbers
The register_chrdev_region() and alloc_chrdev_region() functions
The register_chrdev() function
Accessing a Character Device Driver
Buffering Strategies for Character Devices
Block Device Drivers
Block Devices Handling
Sectors
Blocks
Segments
The Generic Block Layer
The Bio Structure
Representing Disks and Disk Partitions
Submitting a Request
The I/O Scheduler
Request Queue Descriptors
Request Descriptors
Managing the allocation of request descriptors
Avoiding request queue congestion
Activating the Block Device Driver
I/O Scheduling Algorithms
The “Noop” elevator
The “CFQ” elevator
The “Deadline” elevator
The “Anticipatory” elevator
Issuing a Request to the I/O Scheduler
The blk_queue_bounce() function
Block Device Drivers
Block Devices
Accessing a block device
Device Driver Registration and Initialization
Defining a custom driver descriptor
Initializing the custom descriptor
Initializing the gendisk descriptor
Initializing the table of block device methods
Allocating and initializing a request queue
Setting up the interrupt handler
Registering the disk
The Strategy Routine
The Interrupt Handler
Opening a Block Device File
The Page Cache
The Page Cache
The address_space Object
The Radix Tree
Page Cache Handling Functions
Finding a page
Adding a page
Removing a page
Updating a page
The Tags of the Radix Tree
Storing Blocks in the Page Cache
Block Buffers and Buffer Heads
Managing the Buffer Heads
Buffer Pages
Allocating Block Device Buffer Pages
Releasing Block Device Buffer Pages
Searching Blocks in the Page Cache
The __find_get_block() function
The __getblk() function
The __bread() function
Submitting Buffer Heads to the Generic Block Layer
The submit_bh() function
The ll_rw_block() function
Writing Dirty Pages to Disk
The pdflush Kernel Threads
Looking for Dirty Pages To Be Flushed
Retrieving Old Dirty Pages
The sync(), fsync(), and fdatasync() System Calls
The sync() System Call
The fsync() and fdatasync() System Calls
Accessing Files
Reading and Writing a File
Reading from a File
The readpage method for regular files
The readpage method for block device files
Read-Ahead of Files
The page_cache_readahead() function
The handle_ra_miss() function
Writing to a File
The prepare_write and commit_write methods for regular files
The prepare_write and commit_write methods for block device files
Writing Dirty Pages to Disk
Memory Mapping
Memory Mapping Data Structures
Creating a Memory Mapping
Destroying a Memory Mapping
Demand Paging for Memory Mapping
Flushing Dirty Memory Mapping Pages to Disk
Non-Linear Memory Mappings
Direct I/O Transfers
Asynchronous I/O
Asynchronous I/O in Linux 2.6
The asynchronous I/O context
Submitting the asynchronous I/O operations
Page Frame Reclaiming
The Page Frame Reclaiming Algorithm
Selecting a Target Page
Design of the PFRA
Reverse Mapping
Reverse Mapping for Anonymous Pages
The try_to_unmap_anon() function
The try_to_unmap_one() function
Reverse Mapping for Mapped Pages
The priority search tree
The try_to_unmap_file() function
Implementing the PFRA
The Least Recently Used (LRU) Lists
Moving pages across the LRU lists
The mark_page_accessed() function
The page_referenced() function
The refill_inactive_zone() function
Low On Memory Reclaiming
The free_more_memory() function
The try_to_free_pages() function
The shrink_caches() function
The shrink_zone() function
The shrink_cache() function
The shrink_list() function
The pageout() function
Reclaiming Pages of Shrinkable Disk Caches
Reclaiming page frames from the dentry cache
Reclaiming page frames from the inode cache
Periodic Reclaiming
The kswapd kernel threads
The cache_reap() function
The Out of Memory Killer
The Swap Token
Swapping
Swap Area
Creating and activating a swap area
How to distribute pages in the swap areas
Swap Area Descriptor
Swapped-Out Page Identifier
Activating and Deactivating a Swap Area
The sys_swapon() service routine
The sys_swapoff() service routine
The try_to_unuse() function
Allocating and Releasing a Page Slot
The scan_swap_map() function
The get_swap_page() function
The swap_free() function
The Swap Cache
Swap cache implementation
Swap cache helper functions
Swapping Out Pages
Inserting the page frame in the swap cache
Updating the Page Table entries
Writing the page into the swap area
Removing the page frame from the swap cache
Swapping in Pages
The do_swap_page() function
The read_swap_cache_async() function
The Ext2 and Ext3 Filesystems
General Characteristics of Ext2
Ext2 Disk Data Structures
Superblock
Group Descriptor and Bitmap
Inode Table
Extended Attributes of an Inode
Access Control Lists
How Various File Types Use Disk Blocks
Regular file
Directory
Symbolic link
Device file, pipe, and socket
Ext2 Memory Data Structures
The Ext2 Superblock Object
The Ext2 inode Object
Creating the Ext2 Filesystem
Ext2 Methods
Ext2 Superblock Operations
Ext2 inode Operations
Ext2 File Operations
Managing Ext2 Disk Space
Creating inodes
Deleting inodes
Data Blocks Addressing
File Holes
Allocating a Data Block
Releasing a Data Block
The Ext3 Filesystem
Journaling Filesystems
The Ext3 Journaling Filesystem
The Journaling Block Device Layer
Log records
Atomic operation handles
Transactions
How Journaling Works
Process Communication
Pipes
Using a Pipe
Pipe Data Structures
The pipefs special filesystem
Creating and Destroying a Pipe
Reading from a Pipe
Writing into a Pipe
FIFOs
Creating and Opening a FIFO
System V IPC
Using an IPC Resource
The ipc() System Call
IPC Semaphores
Undoable semaphore operations
The queue of pending requests
IPC Messages
IPC Shared Memory
Swapping out pages of IPC shared memory regions
Demand paging for IPC shared memory regions
POSIX Message Queues
Program Execution
Executable Files
Process Credentials and Capabilities
Process capabilities
The Linux Security Modules framework
Command-Line Arguments and Shell Environment
Libraries
Program Segments and Process Memory Regions
Flexible memory region layout
Execution Tracing
Executable Formats
Execution Domains
The exec Functions
System Startup
Prehistoric Age: the BIOS
Ancient Age: the Boot Loader
Booting Linux from a Disk
Middle Ages: the setup() Function
Renaissance: the startup_32() Functions
Modern Age: the start_kernel() Function
Modules
To Be (a Module) or Not to Be?
Module Licenses
Module Implementation
Module Usage Counters
Exporting Symbols
Module Dependency
Linking and Unlinking Modules
Linking Modules on Demand
The modprobe Program
The request_module() Function
Bibliography
Books on Unix Kernels
Books on the Linux Kernel
Books on PC Architecture and Technical Manuals on Intel Microprocessors
Other Online Documentation Sources
Research Papers Related to Linux Development
Source Code Index
Index