HTTP: The Definitive Guide
Preface
Running Example: Joe's Hardware Store
Chapter-by-Chapter Guide
Typographic Conventions
Comments and Questions
Acknowledgments
Part I: HTTP: The Web's Foundation
Chapter 1. Overview of HTTP
1.1 HTTP: The Internet's Multimedia Courier
1.2 Web Clients and Servers
Figure 1-1. Web clients and servers
1.3 Resources
Figure 1-2. A web resource is anything that provides web content
1.3.1 Media Types
Figure 1-3. MIME types are sent back with the data content
1.3.2 URIs
Figure 1-4. URLs specify protocol, server, and local resource
1.3.3 URLs
Table 1-1. Example URLs
1.3.4 URNs
1.4 Transactions
Figure 1-5. HTTP transactions consist of request and response messages
1.4.1 Methods
Table 1-2. Some common HTTP methods
1.4.2 Status Codes
Table 1-3. Some common HTTP status codes
1.4.3 Web Pages Can Consist of Multiple Objects
Figure 1-6. Composite web pages require separate HTTP transactions for each embedded resource
1.5 Messages
Figure 1-7. HTTP messages have a simple, line-oriented text structure
1.5.1 Simple Message Example
Figure 1-8. Example GET transaction for http://www.joes-hardware.com/tools.html
1.6 Connections
1.6.1 TCP/IP
Figure 1-9. HTTP network protocol stack
1.6.2 Connections, IP Addresses, and Port Numbers
Figure 1-10. Basic browser connection process
1.6.3 A Real Example Using Telnet
Example 1-1. An HTTP transaction using telnet
1.7 Protocol Versions
1.8 Architectural Components of the Web
1.8.1 Proxies
Figure 1-11. Proxies relay traffic between client and server
1.8.2 Caches
Figure 1-12. Caching proxies keep local copies of popular documents to improve performance
1.8.3 Gateways
Figure 1-13. HTTP/FTP gateway
1.8.4 Tunnels
Figure 1-14. Tunnels forward data across non-HTTP networks (HTTP/SSL tunnel shown)
1.8.5 Agents
Figure 1-15. Automated search engine "spiders" are agents, fetching web pages around the world
1.9 The End of the Beginning
1.10 For More Information
1.10.1 HTTP Protocol Information
1.10.2 Historical Perspective
1.10.3 Other World Wide Web Information
Chapter 2. URLs and Resources
2.1 Navigating the Internet's Resources
Figure 2-1. How URLs relate to browser, machine, server, and location on the server's filesystem
2.1.1 The Dark Days Before URLs
2.2 URL Syntax
Table 2-1. General URL components
2.2.1 Schemes: What Protocol to Use
2.2.2 Hosts and Ports
2.2.3 Usernames and Passwords
2.2.4 Paths
2.2.5 Parameters
2.2.6 Query Strings
Figure 2-2. The URL query component is sent along to the gateway application
2.2.7 Fragments
Figure 2-3. The URL fragment is used only by the client, because the server deals with entire objects
2.3 URL Shortcuts
2.3.1 Relative URLs
Example 2-1. HTML snippet with relative URLs
Figure 2-4. Using a base URL
2.3.1.1 Base URLs
2.3.1.2 Resolving relative references
Figure 2-5. Converting relative to absolute URLs
2.3.2 Expandomatic URLs
2.4 Shady Characters
2.4.1 The URL Character Set
2.4.2 Encoding Mechanisms
Table 2-2. Some encoded character examples
2.4.3 Character Restrictions
Table 2-3. Reserved and restricted characters
2.4.4 A Bit More
2.5 A Sea of Schemes
Table 2-4. Common scheme formats
2.6 The Future
Figure 2-6. PURLs use a resource locator server to name the current location of a resource
2.6.1 If Not Now, When?
2.7 For More Information
Chapter 3. HTTP Messages
3.1 The Flow of Messages
3.1.1 Messages Commute Inbound to the Origin Server
Figure 3-1. Messages travel inbound to the origin server and outbound back to the client
3.1.2 Messages Flow Downstream
Figure 3-2. All messages flow downstream
3.2 The Parts of a Message
Figure 3-3. Three parts of an HTTP message
3.2.1 Message Syntax
Figure 3-4. An HTTP transaction has request and response messages
Figure 3-5. Example request and response messages
3.2.2 Start Lines
3.2.2.1 Request line
3.2.2.2 Response line
3.2.2.3 Methods
Table 3-1. Common HTTP methods
3.2.2.4 Status codes
Table 3-2. Status code classes
Table 3-3. Common status codes
3.2.2.5 Reason phrases
3.2.2.6 Version numbers
3.2.3 Headers
3.2.3.1 Header classifications
Table 3-4. Common header examples
3.2.3.2 Header continuation lines
3.2.4 Entity Bodies
3.2.5 Version 0.9 Messages
Figure 3-6. HTTP/0.9 transaction
3.3 Methods
3.3.1 Safe Methods
3.3.2 GET
Figure 3-7. GET example
3.3.3 HEAD
Figure 3-8. HEAD example
3.3.4 PUT
Figure 3-9. PUT example
3.3.5 POST
Figure 3-10. POST example
3.3.6 TRACE
Figure 3-11. TRACE example
3.3.7 OPTIONS
Figure 3-12. OPTIONS example
3.3.8 DELETE
Figure 3-13. DELETE example
3.3.9 Extension Methods
Table 3-5. Example web publishing extension methods
3.4 Status Codes
3.4.1 100-199: Informational Status Codes
Table 3-6. Informational status codes and reason phrases
3.4.1.1 Clients and 100 Continue
3.4.1.2 Servers and 100 Continue
3.4.1.3 Proxies and 100 Continue
3.4.2 200-299: Success Status Codes
Table 3-7. Success status codes and reason phrases
3.4.3 300-399: Redirection Status Codes
Figure 3-14. Redirected request to new location
Figure 3-15. Request redirected to use local copy
Table 3-8. Redirection status codes and reason phrases
3.4.4 400-499: Client Error Status Codes
Table 3-9. Client error status codes and reason phrases
3.4.5 500-599: Server Error Status Codes
Table 3-10. Server error status codes and reason phrases
3.5 Headers
3.5.1 General Headers
Table 3-11. General informational headers
3.5.1.1 General caching headers
Table 3-12. General caching headers
3.5.2 Request Headers
Table 3-13. Request informational headers
3.5.2.1 Accept headers
Table 3-14. Accept headers
3.5.2.2 Conditional request headers
Table 3-15. Conditional request headers
3.5.2.3 Request security headers
Table 3-16. Request security headers
3.5.2.4 Proxy request headers
Table 3-17. Proxy request headers
3.5.3 Response Headers
Table 3-18. Response informational headers
3.5.3.1 Negotiation headers
Table 3-19. Negotiation headers
3.5.3.2 Response security headers
Table 3-20. Response security headers
3.5.4 Entity Headers
Table 3-21. Entity informational headers
3.5.4.1 Content headers
Table 3-22. Content headers
3.5.4.2 Entity caching headers
Table 3-23. Entity caching headers
3.6 For More Information
Chapter 4. Connection Management
4.1 TCP Connections
Figure 4-1. Web browsers talk to web servers over TCP connections
4.1.1 TCP Reliable Data Pipes
Figure 4-2. TCP carries HTTP data in order, and without corruption
4.1.2 TCP Streams Are Segmented and Shipped by IP Packets
Figure 4-3. HTTP and HTTPS network protocol stacks
Figure 4-4. IP packets carry TCP segments, which carry chunks of the TCP data stream
4.1.3 Keeping TCP Connections Straight
Table 4-1. TCP connection values
Figure 4-5. Four distinct TCP connections
4.1.4 Programming with TCP Sockets
Table 4-2. Common socket interface functions for programming TCP connections
Figure 4-6. How TCP clients and servers communicate using the TCP sockets interface
4.2 TCP Performance Considerations
4.2.1 HTTP Transaction Delays
Figure 4-7. Timeline of a serial HTTP transaction
4.2.2 Performance Focus Areas
4.2.3 TCP Connection Handshake Delays
Figure 4-8. TCP requires two packet transfers to set up the connection before it can send data
4.2.4 Delayed Acknowledgments
4.2.5 TCP Slow Start
4.2.6 Nagle's Algorithm and TCP_NODELAY
4.2.7 TIME_WAIT Accumulation and Port Exhaustion
4.3 HTTP Connection Handling
4.3.1 The Oft-Misunderstood Connection Header
Figure 4-9. The Connection header allows the sender to specify connection-specific options
4.3.2 Serial Transaction Delays
Figure 4-10. Four transactions (serial)
4.4 Parallel Connections
Figure 4-11. Each component of a page involves a separate HTTP transaction
4.4.1 Parallel Connections May Make Pages Load Faster
Figure 4-12. Four transactions (parallel)
4.4.2 Parallel Connections Are Not Always Faster
4.4.3 Parallel Connections May "Feel" Faster
4.5 Persistent Connections
4.5.1 Persistent Versus Parallel Connections
4.5.2 HTTP/1.0+ Keep-Alive Connections
Figure 4-13. Four transactions (serial versus persistent)
4.5.3 Keep-Alive Operation
Figure 4-14. HTTP/1.0 keep-alive transaction header handshake
4.5.4 Keep-Alive Options
4.5.5 Keep-Alive Connection Restrictions and Rules
4.5.6 Keep-Alive and Dumb Proxies
4.5.6.1 The Connection header and blind relays
Figure 4-15. Keep-alive doesn't interoperate with proxies that don't support Connection headers
4.5.6.2 Proxies and hop-by-hop headers
4.5.7 The Proxy-Connection Hack
Figure 4-16. Proxy-Connection header fixes single blind relay
Figure 4-17. Proxy-Connection still fails for deeper hierarchies of proxies
4.5.8 HTTP/1.1 Persistent Connections
4.5.9 Persistent Connection Restrictions and Rules
4.6 Pipelined Connections
Figure 4-18. Four transactions (pipelined connections)
4.7 The Mysteries of Connection Close
4.7.1 "At Will" Disconnection
4.7.2 Content-Length and Truncation
4.7.3 Connection Close Tolerance, Retries, and Idempotency
4.7.4 Graceful Connection Close
Figure 4-19. TCP connections are bidirectional
4.7.4.1 Full and half closes
Figure 4-20. Full and half close
4.7.4.2 TCP close and reset errors
Figure 4-21. Data arriving at closed connection generates "connection reset by peer" error
4.7.4.3 Graceful close
4.8 For More Information
4.8.1 HTTP Connections
4.8.2 HTTP Performance Issues
4.8.3 TCP/IP
Part II: HTTP Architecture
Chapter 5. Web Servers
5.1 Web Servers Come in All Shapes and Sizes
5.1.1 Web Server Implementations
5.1.2 General-Purpose Software Web Servers
Figure 5-1. Web server market share as estimated by Netcraft's automated survey
5.1.3 Web Server Appliances
5.1.4 Embedded Web Servers
5.2 A Minimal Perl Web Server
Example 5-1. type-o-serve: a minimal Perl web server
Figure 5-2. The type-o-serve utility lets you type in server responses to send back to clients
5.3 What Real Web Servers Do
Figure 5-3. Steps of a basic web server request
5.4 Step 1: Accepting Client Connections
5.4.1 Handling New Connections
5.4.2 Client Hostname Identification
Example 5-2. Configuring Apache to look up hostnames for HTML and CGI resources
5.4.3 Determining the Client User Through ident
Figure 5-4. Using the ident protocol to determine HTTP client username
5.5 Step 2: Receiving Request Messages
Figure 5-5. Reading a request message from a connection
5.5.1 Internal Representations of Messages
Figure 5-6. Parsing a request message into a convenient internal representation
5.5.2 Connection Input/Output Processing Architectures
Figure 5-7. Web server input/output architectures
5.6 Step 3: Processing Requests
5.7 Step 4: Mapping and Accessing Resources
5.7.1 Docroots
Figure 5-8. Mapping request URI to local web server resource
5.7.1.1 Virtually hosted docroots
Figure 5-9. Different docroots for virtually hosted requests
Example 5-3. Apache web server virtual host docroot configuration
5.7.1.2 User home directory docroots
Figure 5-10. Different docroots for different users
5.7.2 Directory Listings
5.7.3 Dynamic Content Resource Mapping
Figure 5-11. A web server can serve static resources as well as dynamic resources
5.7.4 Server-Side Includes (SSI)
5.7.5 Access Controls
5.8 Step 5: Building Responses
5.8.1 Response Entities
5.8.2 MIME Typing
Figure 5-12. A web server uses MIME types file to set outgoing Content-Type of resources
5.8.3 Redirection
5.9 Step 6: Sending Responses
5.10 Step 7: Logging
5.11 For More Information
Chapter 6. Proxies
6.1 Web Intermediaries
Figure 6-1. A proxy must be both a server and a client
6.1.1 Private and Shared Proxies
6.1.2 Proxies Versus Gateways
Figure 6-2. Proxies speak the same protocol; gateways tie together different protocols
6.2 Why Use Proxies?
Figure 6-3. Proxy application example: child-safe Internet filter
Figure 6-4. Proxy application example: centralized document access control
Figure 6-5. Proxy application example: security firewall
Figure 6-6. Proxy application example: web cache
Figure 6-7. Proxy application example: surrogate (in a server accelerator deployment)
Figure 6-8. Proxy application example: content routing
Figure 6-9. Proxy application example: content transcoder
Figure 6-10. Proxy application example: anonymizer
6.3 Where Do Proxies Go?
6.3.1 Proxy Server Deployment
Figure 6-11. Proxies can be deployed many ways, depending on their intended use
6.3.2 Proxy Hierarchies
Figure 6-12. Three-level proxy hierarchy
6.3.2.1 Proxy hierarchy content routing
Figure 6-13. Proxy hierarchies can be dynamic, changing for each request
6.3.3 How Proxies Get Traffic
Figure 6-14. There are many techniques to direct web requests to proxies
6.4 Client Proxy Settings
6.4.1 Client Proxy Configuration: Manual
6.4.2 Client Proxy Configuration: PAC Files
Table 6-1. Proxy auto-configuration script return values
Example 6-1. Example proxy auto-configuration file
6.4.3 Client Proxy Configuration: WPAD
6.5 Tricky Things About Proxy Requests
6.5.1 Proxy URIs Differ from Server URIs
Figure 6-15. Intercepting proxies will get server requests
6.5.2 The Same Problem with Virtual Hosting
6.5.3 Intercepting Proxies Get Partial URIs
6.5.4 Proxies Can Handle Both Proxy and Server Requests
6.5.5 In-Flight URI Modification
6.5.6 URI Client Auto-Expansion and Hostname Resolution
6.5.7 URI Resolution Without a Proxy
Figure 6-16. Browser auto-expands partial hostnames when no explicit proxy is present
6.5.8 URI Resolution with an Explicit Proxy
Figure 6-17. Browser does not auto-expand partial hostnames when there is an explicit proxy
6.5.9 URI Resolution with an Intercepting Proxy
Figure 6-18. Browser doesn't detect dead server IP addresses when using intercepting proxies
6.6 Tracing Messages
Figure 6-19. Access proxies and CDN proxies create two-level proxy hierarchies
6.6.1 The Via Header
Figure 6-20. Via header example
6.6.1.1 Via syntax
6.6.1.2 Via request and response paths
Figure 6-21. The response Via is usually the reverse of the request Via
6.6.1.3 Via and gateways
Figure 6-22. HTTP/FTP gateway generates Via headers, logging the received protocol (FTP)
6.6.1.4 The Server and Via headers
6.6.1.5 Privacy and security implications of Via
6.6.2 The TRACE Method
Figure 6-23. TRACE response reflects back the received request message
6.6.2.1 Max-Forwards
Figure 6-24. You can limit the forwarding hop count with the Max-Forwards header field
6.7 Proxy Authentication
Figure 6-25. Proxies can implement authentication to control access to content
6.8 Proxy Interoperation
6.8.1 Handling Unsupported Headers and Methods
6.8.2 OPTIONS: Discovering Optional Feature Support
Figure 6-26. Using OPTIONS to find a server's supported methods
6.8.3 The Allow Header
6.9 For More Information
Chapter 7. Caching
7.1 Redundant Data Transfers
7.2 Bandwidth Bottlenecks
Figure 7-1. Limited wide area bandwidth creates a bottleneck that caches can improve
Table 7-1. Bandwidth-imposed transfer time delays, idealized (time in seconds)
7.3 Flash Crowds
Figure 7-2. Flash crowds can overload web servers
7.4 Distance Delays
Figure 7-3. Speed of light can cause significant delays, even with parallel, keep-alive connections
7.5 Hits and Misses
Figure 7-4. Cache hits, misses, and revalidations
7.5.1 Revalidations
Figure 7-5. Successful revalidations are faster than cache misses; failed revalidations are nearly identical to misses
Figure 7-6. HTTP uses If-Modified-Since header for revalidation
7.5.2 Hit Rate
7.5.3 Byte Hit Rate
7.5.4 Distinguishing Hits and Misses
7.6 Cache Topologies
Figure 7-7. Public and private caches
7.6.1 Private Caches
7.6.2 Public Proxy Caches
Figure 7-8. Shared, public caches can decrease network traffic
7.6.3 Proxy Cache Hierarchies
Figure 7-9. Accessing documents in a two-level cache hierarchy
7.6.4 Cache Meshes, Content Routing, and Peering
Figure 7-10. Sibling caches
7.7 Cache Processing Steps
Figure 7-11. Processing a fresh cache hit
7.7.1 Step 1: Receiving
7.7.2 Step 2: Parsing
7.7.3 Step 3: Lookup
7.7.4 Step 4: Freshness Check
7.7.5 Step 5: Response Creation
7.7.6 Step 6: Sending
7.7.7 Step 7: Logging
7.7.8 Cache Processing Flowchart
Figure 7-12. Cache GET request flowchart
7.8 Keeping Copies Fresh
7.8.1 Document Expiration
Figure 7-13. Expires and Cache Control headers
7.8.2 Expiration Dates and Ages
Table 7-2. Expiration response headers
7.8.3 Server Revalidation
7.8.4 Revalidation with Conditional Methods
Table 7-3. Two conditional headers used in cache revalidation
7.8.5 If-Modified-Since: Date Revalidation
Figure 7-14. If-Modified-Since revalidations return 304 if unchanged or 200 with new body if changed
7.8.6 If-None-Match: Entity Tag Revalidation
Figure 7-15. If-None-Match revalidates because entity tag still matches
7.8.7 Weak and Strong Validators
7.8.8 When to Use Entity Tags and Last-Modified Dates
7.9 Controlling Cachability
7.9.1 No-Cache and No-Store Headers
7.9.2 Max-Age Response Headers
7.9.3 Expires Response Headers
7.9.4 Must-Revalidate Response Headers
7.9.5 Heuristic Expiration
Figure 7-16. Computing a freshness period using the LM-Factor algorithm
7.9.6 Client Freshness Constraints
Table 7-4. Cache-Control request directives
7.9.7 Cautions
7.10 Setting Cache Controls
7.10.1 Controlling HTTP Headers with Apache
7.10.2 Controlling HTML Caching Through HTTP-EQUIV
Figure 7-17. HTTP-EQUIV tags cause problems, because most software ignores them
7.11 Detailed Algorithms
7.11.1 Age and Freshness Lifetime
7.11.2 Age Computation
Example 7-1. HTTP/1.1 age-calculation algorithm calculates the overall age of a cached document
7.11.2.1 Apparent age is based on the Date header
7.11.2.2 Hop-by-hop age calculations
7.11.2.3 Compensating for network delays
7.11.3 Complete Age-Calculation Algorithm
Figure 7-18. The age of a cached document includes resident time in the network and cache
7.11.4 Freshness Lifetime Computation
7.11.5 Complete Server-Freshness Algorithm
Example 7-2. Server freshness constraint calculation
Example 7-3. Client freshness constraint calculation
7.12 Caches and Advertising
7.12.1 The Advertiser's Dilemma
7.12.2 The Publisher's Response
7.12.3 Log Migration
7.12.4 Hit Metering and Usage Limiting
7.13 For More Information
Chapter 8. Integration Points: Gateways, Tunnels, and Relays
8.1 Gateways
Figure 8-1. Gateway magic
Figure 8-2. Three web gateway examples
8.1.1 Client-Side and Server-Side Gateways
8.2 Protocol Gateways
Figure 8-3. Configuring an HTTP/FTP gateway
Figure 8-4. Browsers can configure particular protocols to use particular gateways
8.2.1 HTTP/*: Server-Side Web Gateways
Figure 8-5. The HTTP/FTP gateway translates HTTP request into FTP requests
8.2.2 HTTP/HTTPS: Server-Side Security Gateways
Figure 8-6. Inbound HTTP/HTTPS security gateway
8.2.3 HTTPS/HTTP: Client-Side Security Accelerator Gateways
Figure 8-7. HTTPS/HTTP security accelerator gateway
8.3 Resource Gateways
Figure 8-8. An application server connects HTTP clients to arbitrary backend applications
Figure 8-9. Server gateway application mechanics
8.3.1 Common Gateway Interface (CGI)
8.3.2 Server Extension APIs
8.4 Application Interfaces and Web Services
8.5 Tunnels
8.5.1 Establishing HTTP Tunnels with CONNECT
Figure 8-10. Using CONNECT to establish an SSL tunnel
8.5.1.1 CONNECT requests
8.5.1.2 CONNECT responses
8.5.2 Data Tunneling, Timing, and Connection Management
8.5.3 SSL Tunneling
Figure 8-11. Tunnels let non-HTTP traffic flow through HTTP connections
Figure 8-12. Direct SSL connection vs. tunnelled SSL connection
8.5.4 SSL Tunneling Versus HTTP/HTTPS Gateways
8.5.5 Tunnel Authentication
Figure 8-13. Gateways can proxy-authenticate a client before it's allowed to use a tunnel
8.5.6 Tunnel Security Considerations
8.6 Relays
Figure 8-14. Simple blind relays can hang if they are single-tasking and don't support the Connection header
8.7 For More Information
Chapter 9. Web Robots
9.1 Crawlers and Crawling
9.1.1 Where to Start: The "Root Set"
Figure 9-1. A root set is needed to reach all pages
9.1.2 Extracting Links and Normalizing Relative Links
9.1.3 Cycle Avoidance
Figure 9-2. Crawling over a web of hyperlinks
9.1.4 Loops and Dups
9.1.5 Trails of Breadcrumbs
9.1.6 Aliases and Robot Cycles
Table 9-1. Different URLs that alias to the same documents
9.1.7 Canonicalizing URLs
9.1.8 Filesystem Link Cycles
Figure 9-3. Symbolic link cycles
9.1.9 Dynamic Virtual Web Spaces
Figure 9-4. Malicious dynamic web space example
9.1.10 Avoiding Loops and Dups
9.2 Robotic HTTP
9.2.1 Identifying Request Headers
9.2.2 Virtual Hosting
Figure 9-5. Example of virtual docroots causing trouble if no Host header is sent with the request
9.2.3 Conditional Requests
9.2.4 Response Handling
9.2.4.1 Status codes
9.2.4.2 Entities
9.2.5 User-Agent Targeting
9.3 Misbehaving Robots
9.4 Excluding Robots
Figure 9-6. Fetching robots.txt and verifying accessibility before crawling the target file
9.4.1 The Robots Exclusion Standard
Table 9-2. Robots Exclusion Standard versions
9.4.2 Web Sites and robots.txt Files
9.4.2.1 Fetching robots.txt
9.4.2.2 Response codes
9.4.3 robots.txt File Format
9.4.3.1 The User-Agent line
9.4.3.2 The Disallow and Allow lines
9.4.3.3 Disallow/Allow prefix matching
Table 9-3. Robots.txt path matching examples
9.4.4 Other robots.txt Wisdom
9.4.5 Caching and Expiration of robots.txt
9.4.6 Robot Exclusion Perl Code
Table 9-4. Robot accessibility to the Mary's Antiques web site
9.4.7 HTML Robot-Control META Tags
9.4.7.1 Robot META directives
9.4.7.2 Search engine META tags
Table 9-5. Additional META tag directives
9.5 Robot Etiquette
Table 9-6. Guidelines for web robot operators
9.6 Search Engines
9.6.1 Think Big
9.6.2 Modern Search Engine Architecture
Figure 9-7. A production search engine contains cooperating crawlers and query gateways
9.6.3 Full-Text Index
Figure 9-8. Three documents and a full-text index
9.6.4 Posting the Query
Figure 9-9. Example search query request
9.6.5 Sorting and Presenting the Results
9.6.6 Spoofing
9.7 For More Information
Chapter 10. HTTP-NG
10.1 HTTP's Growing Pains
10.2 HTTP-NG Activity
10.3 Modularize and Enhance
Figure 10-1. HTTP-NG separates functions into layers
10.4 Distributed Objects
10.5 Layer 1: Messaging
10.6 Layer 2: Remote Invocation
10.7 Layer 3: Web Application
10.8 WebMUX
Figure 10-2. WebMUX can multiplex multiple messages over a single connection
10.9 Binary Wire Protocol
10.10 Current Status
10.11 For More Information
Part III: Identification, Authorization, and Security
Chapter 11. Client Identification and Cookies
11.1 The Personal Touch
11.2 HTTP Headers
Table 11-1. HTTP headers carry clues about users
11.3 Client IP Address
Figure 11-1. Proxies can add extension headers to pass along the original client IP address
11.4 User Login
Figure 11-2. Registering username using HTTP authentication headers
11.5 Fat URLs
11.6 Cookies
11.6.1 Types of Cookies
11.6.2 How Cookies Work
Figure 11-3. Slapping a cookie onto a user
11.6.3 Cookie Jar: Client-Side State
11.6.3.1 Netscape Navigator cookies
11.6.3.2 Microsoft Internet Explorer cookies
Figure 11-4. Internet Explorer cookies are stored in individual text files in the cache directory
11.6.4 Different Cookies for Different Sites
11.6.4.1 Cookie Domain attribute
11.6.4.2 Cookie Path attribute
11.6.5 Cookie Ingredients
Table 11-2. Cookie specifications
11.6.6 Version 0 (Netscape) Cookies
11.6.6.1 Version 0 Set-Cookie header
Table 11-3. Version 0 (Netscape) Set-Cookie attributes
11.6.6.2 Version 0 Cookie header
11.6.7 Version 1 (RFC 2965) Cookies
11.6.7.1 Version 1 Set-Cookie2 header
Table 11-4. Version 1 (RFC 2965) Set-Cookie2 attributes
11.6.7.2 Version 1 Cookie header
11.6.7.3 Version 1 Cookie2 header and version negotiation
11.6.8 Cookies and Session Tracking
Figure 11-5. The Amazon.com web site uses session cookies to track users
11.6.9 Cookies and Caching
11.6.10 Cookies, Security, and Privacy
11.7 For More Information
Chapter 12. Basic Authentication
12.1 Authentication
12.1.1 HTTP's Challenge/Response Authentication Framework
Figure 12-1. Simplified challenge/response authentication
12.1.2 Authentication Protocols and Headers
Table 12-1. Four phases of authentication
Figure 12-2. Basic authentication example
12.1.3 Security Realms
Figure 12-3. Security realms in a web server
12.2 Basic Authentication
12.2.1 Basic Authentication Example
Table 12-2. Basic authentication headers
12.2.2 Base-64 Username/Password Encoding
Figure 12-4. Generating a basic Authorization header from username and password
12.2.3 Proxy Authentication
Table 12-3. Web server versus proxy authentication
12.3 The Security Flaws of Basic Authentication
12.4 For More Information
Chapter 13. Digest Authentication
13.1 The Improvements of Digest Authentication
13.1.1 Using Digests to Keep Passwords Secret
Figure 13-1. Using digests for password-obscured authentication
13.1.2 One-Way Digests
Table 13-1. MD5 digest examples
13.1.3 Using Nonces to Prevent Replays
13.1.4 The Digest Authentication Handshake
Figure 13-2. Digest authentication handshake
Figure 13-3. Basic versus digest authentication syntax
13.2 Digest Calculations
13.2.1 Digest Algorithm Input Data
13.2.2 The Algorithms H(d) and KD(s,d)
13.2.3 The Security-Related Data (A1)
Table 13-2. Definitions for A1 by algorithm
13.2.4 The Message-Related Data (A2)
Table 13-3. Definitions for A2 by algorithm (request digests)
13.2.5 Overall Digest Algorithm
Table 13-4. Old and new digest algorithms
Table 13-5. Unfolded digest algorithm cheat sheet
13.2.6 Digest Authentication Session
13.2.7 Preemptive Authorization
Figure 13-4. Preemptive authorization reduces message count
13.2.7.1 Next nonce pregeneration
13.2.7.2 Limited nonce reuse
13.2.7.3 Synchronized nonce generation
13.2.8 Nonce Selection
13.2.9 Symmetric Authentication
Table 13-6. Definitions for A2 by algorithm (request digests)
Table 13-7. Definitions for A2 by algorithm (response digests)
13.3 Quality of Protection Enhancements
13.3.1 Message Integrity Protection
13.3.2 Digest Authentication Headers
Table 13-8. HTTP authentication headers
13.4 Practical Considerations
13.4.1 Multiple Challenges
13.4.2 Error Handling
13.4.3 Protection Spaces
13.4.4 Rewriting URIs
13.4.5 Caches
13.5 Security Considerations
13.5.1 Header Tampering
13.5.2 Replay Attacks
13.5.3 Multiple Authentication Mechanisms
13.5.4 Dictionary Attacks
13.5.5 Hostile Proxies and Man-in-the-Middle Attacks
13.5.6 Chosen Plaintext Attacks
13.5.7 Storing Passwords
13.6 For More Information
Chapter 14. Secure HTTP
14.1 Making HTTP Safe
14.1.1 HTTPS
Figure 14-1. Browsing secure web sites
Figure 14-2. HTTPS is HTTP layered over a security layer, layered over TCP
14.2 Digital Cryptography
14.2.1 The Art and Science of Secret Coding
14.2.2 Ciphers
Figure 14-3. Plaintext and ciphertext
Figure 14-4. Rotate-by-3 cipher example
14.2.3 Cipher Machines
14.2.4 Keyed Ciphers
Figure 14-5. The rotate-by-N cipher, using different keys
14.2.5 Digital Ciphers
Figure 14-6. Plaintext is encoded with encoding key e, and decoded using decoding key d
14.3 Symmetric-Key Cryptography
Figure 14-7. Symmetric-key cryptography algorithms use the same key for encoding and decoding
14.3.1 Key Length and Enumeration Attacks
Table 14-1. Longer keys take more effort to crack (1995 data, from "Applied Cryptography")
14.3.2 Establishing Shared Keys
14.4 Public-Key Cryptography
Figure 14-8. Public-key cryptography is asymmetric, using different keys for encoding and decoding
Figure 14-9. Public-key cryptography assigns a single, public encoding key to each host
14.4.1 RSA
14.4.2 Hybrid Cryptosystems and Session Keys
14.5 Digital Signatures
14.5.1 Signatures Are Cryptographic Checksums
Figure 14-10. Unencrypted digital signature
14.6 Digital Certificates
14.6.1 The Guts of a Certificate
Figure 14-11. Typical digital signature format
14.6.2 X.509 v3 Certificates
Table 14-2. X.509 certificate fields
14.6.3 Using Certificates to Authenticate Servers
Figure 14-12. Verifying that a signature is real
14.7 HTTPS: The Details
14.7.1 HTTPS Overview
Figure 14-13. HTTP transport-level security
14.7.2 HTTPS Schemes
Figure 14-14. HTTP and HTTPS port numbers
14.7.3 Secure Transport Setup
Figure 14-15. HTTP and HTTPS transactions
14.7.4 SSL Handshake
Figure 14-16. SSL handshake (simplified)
14.7.5 Server Certificates
Figure 14-17. HTTPS certificates are X.509 certificates with site information
14.7.6 Site Certificate Validation
14.7.7 Virtual Hosting and Certificates
Figure 14-18. Certificate name mismatches bring up certificate error dialog boxes
14.8 A Real HTTPS Client
14.8.1 OpenSSL
14.8.2 A Simple HTTPS Client
14.8.3 Executing Our Simple OpenSSL Client
14.9 Tunneling Secure Traffic Through Proxies
Figure 14-19. Corporate firewall proxy
Figure 14-20. Proxy can't proxy an encrypted request
14.10 For More Information
14.10.1 HTTP Security
14.10.2 SSL and TLS
14.10.3 Public-Key Infrastructure
14.10.4 Digital Cryptography
Part IV: Entities, Encodings, and Internationalization
Chapter 15. Entities and Encodings
15.1 Messages Are Crates, Entities Are Cargo
Figure 15-1. Message entity is made up of entity headers and entity body
15.1.1 Entity Bodies
Figure 15-2. Hex dumps of real message content (raw message content follows blank CRLF)
15.2 Content-Length: The Entity's Size
15.2.1 Detecting Truncation
15.2.2 Incorrect Content-Length
15.2.3 Content-Length and Persistent Connections
15.2.4 Content Encoding
15.2.5 Rules for Determining Entity Body Length
15.3 Entity Digests
15.4 Media Type and Charset
Table 15-1. Common media types
15.4.1 Character Encodings for Text Media
15.4.2 Multipart Media Types
15.4.3 Multipart Form Submissions
15.4.4 Multipart Range Responses
15.5 Content Encoding
15.5.1 The Content-Encoding Process
Figure 15-3. Content-encoding example
15.5.2 Content-Encoding Types
Table 15-2. Content-encoding tokens
15.5.3 Accept-Encoding Headers
Figure 15-4. Content encoding
15.6 Transfer Encoding and Chunked Encoding
Figure 15-5. Content encodings versus transfer encodings
15.6.1 Safe Transport
15.6.2 Transfer-Encoding Headers
15.6.3 Chunked Encoding
15.6.3.1 Chunking and persistent connections
Figure 15-6. Anatomy of a chunked message
15.6.3.2 Trailers in chunked messages
15.6.4 Combining Content and Transfer Encodings
Figure 15-7. Combining content encoding with transfer encoding
15.6.5 Transfer-Encoding Rules
15.7 Time-Varying Instances
Figure 15-8. Instances are "snapshots" of a resource in time
15.8 Validators and Freshness
15.8.1 Freshness
Table 15-3. Cache-Control header directives
15.8.2 Conditionals and Validators
Table 15-4. Conditional request types
15.9 Range Requests
Figure 15-9. Entity range request example
15.10 Delta Encoding
Figure 15-10. Mechanics of delta-encoding
Table 15-5. Delta-encoding headers
15.10.1 Instance Manipulations, Delta Generators, and Delta Appliers
Table 15-6. IANA registered types of instance manipulations
15.11 For More Information
Chapter 16. Internationalization
16.1 HTTP Support for International Content
16.2 Character Sets and HTTP
16.3 Multilingual Character Encoding Primer
16.4 Language Tags and HTTP
16.5 Internationalized URIs
16.6 Other Considerations
16.7 For More Information
Chapter 17. Content Negotiation and Transcoding
17.1 Content-Negotiation Techniques
17.2 Client-Driven Negotiation
17.3 Server-Driven Negotiation
17.4 Transparent Negotiation
17.5 Transcoding
17.6 Next Steps
17.7 For More Information
Part V: Content Publishing and Distribution
Chapter 18. Web Hosting
18.1 Hosting Services
18.1.1 A Simple Example: Dedicated Hosting
Figure 18-1. Outsourced dedicated hosting
18.2 Virtual Hosting
Figure 18-2. Outsourced virtual hosting
18.2.1 Virtual Server Request Lacks Host Information
Figure 18-3. HTTP/1.0 server requests don't contain hostname information
18.2.2 Making Virtual Hosting Work
18.2.2.1 Virtual hosting by URL path
18.2.2.2 Virtual hosting by port number
18.2.2.3 Virtual hosting by IP address
Figure 18-4. Virtual IP hosting
18.2.2.4 Virtual hosting by Host header
Figure 18-5. Host headers distinguish virtual host requests
18.2.3 HTTP/1.1 Host Headers
18.2.3.1 Syntax and usage
18.2.3.2 Missing Host headers
18.2.3.3 Interpreting Host headers
18.2.3.4 Host headers and proxies
18.3 Making Web Sites Reliable
18.3.1 Mirrored Server Farms
Figure 18-6. Mirrored server farm
Figure 18-7. Dispersed mirrored servers
18.3.2 Content Distribution Networks
18.3.3 Surrogate Caches in CDNs
18.3.4 Proxy Caches in CDNs
Figure 18-8. Client requests intercepted by a switch and sent to a proxy
18.4 Making Web Sites Fast
18.5 For More Information
Chapter 19. Publishing Systems
19.1 FrontPage Server Extensions for Publishing Support
19.1.1 FrontPage Server Extensions
Figure 19-1. FrontPage publishing architecture
19.1.2 FrontPage Vocabulary
19.1.3 The FrontPage RPC Protocol
Figure 19-2. Initial request
19.1.3.1 Request
19.1.3.2 Response
19.1.4 FrontPage Security Model
19.2 WebDAV and Collaborative Authoring
19.2.1 WebDAV Methods
19.2.2 WebDAV and XML
19.2.3 WebDAV Headers
19.2.4 WebDAV Locking and Overwrite Prevention
Figure 19-3. Lost update problem
19.2.5 The LOCK Method
19.2.5.1 The opaquelocktoken scheme
19.2.5.2 The XML element
19.2.5.3 Lock refreshes and the Timeout header
19.2.6 The UNLOCK Method
Table 19-1. Status codes for LOCK and UNLOCK methods
19.2.7 Properties and META Data
19.2.8 The PROPFIND Method
19.2.9 The PROPPATCH Method
Table 19-2. Status codes for PROPFIND and PROPPATCH methods
19.2.10 Collections and Namespace Management
19.2.11 The MKCOL Method
19.2.12 The DELETE Method
19.2.13 The COPY and MOVE Methods
19.2.13.1 Overwrite header effect
19.2.13.2 COPY/MOVE of properties
19.2.13.3 Locked resources and COPY/MOVE
Table 19-3. Status codes for the MKCOL, DELETE, COPY, and MOVE methods
19.2.14 Enhanced HTTP/1.1 Methods
19.2.14.1 The PUT method
19.2.14.2 The OPTIONS method
19.2.15 Version Management in WebDAV
19.2.16 Future of WebDAV
19.3 For More Information
Chapter 20. Redirection and Load Balancing
20.1 Why Redirect?
20.2 Where to Redirect
20.3 Overview of Redirection Protocols
Table 20-1. General redirection methods
Table 20-2. Proxy and cache redirection techniques
20.4 General Redirection Methods
20.4.1 HTTP Redirection
Figure 20-1. HTTP redirection
20.4.2 DNS Redirection
Figure 20-2. DNS-based redirection
20.4.2.1 DNS round robin
Example 20-1. IP addresses for www.cnn.com
20.4.2.2 Multiple addresses and round-robin address rotation
Example 20-2. Rotating DNS address lists
20.4.2.3 DNS round robin for load balancing
Figure 20-3. DNS round robin load balances across servers in a server farm
20.4.2.4 The impact of DNS caching
20.4.2.5 Other DNS-based redirection algorithms
Figure 20-4. DNS request involving authoritative server
20.4.3 Anycast Addressing
Figure 20-5. Distributed anycast addressing
20.4.4 IP MAC Forwarding
Figure 20-6. Layer-2 switch sending client requests to a gateway
Figure 20-7. MAC forwarding using a layer-4 switch
20.4.5 IP Address Forwarding
Figure 20-8. A switch doing IP forwarding to a caching proxy or mirrored web server
Figure 20-9. Full NAT of a TCP/IP datagram
20.4.6 Network Element Control Protocol
20.4.6.1 Messages
Table 20-3. NECP messages
20.5 Proxy Redirection Methods
20.5.1 Explicit Browser Configuration
20.5.2 Proxy Auto-configuration
Figure 20-10. Proxy auto-configuration
20.5.3 Web Proxy Autodiscovery Protocol
20.5.3.1 PAC file autodiscovery
Figure 20-11. WPAD determines the PAC URL, which determines the proxy server
20.5.3.2 WPAD algorithm
20.5.3.3 CURL discovery using DHCP
20.5.3.4 DNS A record lookup
20.5.3.5 Retrieving the PAC file
20.5.3.6 When to execute WPAD
20.5.3.7 WPAD spoofing
20.5.3.8 Timeouts
20.5.3.9 Administrator considerations
20.6 Cache Redirection Methods
20.6.1 WCCP Redirection
20.6.1.1 How WCCP redirection works
20.6.1.2 WCCP2 messages
Table 20-4. WCCP2 messages
20.6.1.3 Message components
Table 20-5. WCCP2 message components
20.6.1.4 Service groups
20.6.1.5 GRE packet encapsulation
Figure 20-12. How a WCCP router changes an HTTP packet's destination IP address
20.6.1.6 WCCP load balancing
20.7 Internet Cache Protocol
20.8 Cache Array Routing Protocol
Figure 20-13. ICP queries
Figure 20-14. CARP redirection
20.9 Hyper Text Caching Protocol
Figure 20-15. HTCP message format
Table 20-6. HTCP data components
Table 20-7. HTCP opcodes
20.9.1 HTCP Authentication
Table 20-8. HTCP authentication components
20.9.2 Setting Caching Policies
Table 20-9. List of Cache headers for modifying caching policies
20.10 For More Information
Chapter 21. Logging and Usage Tracking
21.1 What to Log?
21.2 Log Formats
21.2.1 Common Log Format
Table 21-1. Common Log Format fields
Example 21-1. Common Log Format
21.2.2 Combined Log Format
Table 21-2. Additional Combined Log Format fields
Example 21-2. Combined Log Format
21.2.3 Netscape Extended Log Format
Table 21-3. Additional Netscape Extended Log Format fields
Example 21-3. Netscape Extended Log Format
21.2.4 Netscape Extended 2 Log Format
Table 21-4. Additional Netscape Extended 2 Log Format fields
Example 21-4. Netscape Extended 2 Log Format
Table 21-5. Netscape route codes
Table 21-6. Netscape finish status codes
Table 21-7. Netscape cache codes
21.2.5 Squid Proxy Log Format
Table 21-8. Squid Log Format fields
Example 21-5. Squid Log Format
Table 21-9. Squid result codes
21.3 Hit Metering
21.3.1 Overview
21.3.2 The Meter Header
Table 21-10. Hit Metering directives
Figure 21-1. Hit Metering example
21.4 A Word on Privacy
21.5 For More Information
Part VI: Appendixes
Appendix A. URI Schemes
Table A-1. URI schemes from the W3C registry
Appendix B. HTTP Status Codes
B.1 Status Code Classifications
Table B-1. Status code classifications
B.2 Status Codes
Table B-2. Status codes
Appendix C. HTTP Header Reference
Appendix D. MIME Types
D.1 Background
D.2 MIME Type Structure
D.2.1 Discrete Types
D.2.2 Composite Types
D.2.3 Multipart Types
D.2.4 Syntax
Table D-1. Common primary MIME types
D.3 MIME Type IANA Registration
D.3.1 Registration Trees
Table D-2. Four MIME media type registration trees
D.3.2 Registration Process
D.3.3 Registration Rules
D.3.4 Registration Template
Example D-1. IANA MIME registration email template
D.3.5 MIME Media Type Registry
D.4 MIME Type Tables
D.4.1 application/*
Table D-3. "Application" MIME types
D.4.2 audio/*
Table D-4. "Audio" MIME types
D.4.3 chemical/*
Table D-5. "Chemical" MIME types
D.4.4 image/*
Table D-6. "Image" MIME types
D.4.5 message/*
Table D-7. "Message" MIME types
D.4.6 model/*
Table D-8. "Model" MIME types
D.4.7 multipart/*
Table D-9. "Multipart" MIME types
D.4.8 text/*
Table D-10. "Text" MIME types
D.4.9 video/*
Table D-11. "Video" MIME types
D.4.10 Experimental Types
Table D-12. Extension MIME types
Appendix E. Base-64 Encoding
E.1 Base-64 Encoding Makes Binary Data Safe
E.2 Eight Bits to Six Bits
Table E-1. Base-64 alphabet
Figure E-1. Base-64 encoding example
E.3 Base-64 Padding
Table E-2. Base-64 padding examples
E.4 Perl Implementation
E.5 For More Information
Appendix F. Digest Authentication
F.1 Digest WWW-Authenticate Directives
Table F-1. Digest WWW-Authenticate header directives (from RFC 2617)
F.2 Digest Authorization Directives
Table F-2. Digest Authorization header directives (from RFC 2617)
F.3 Digest Authentication-Info Directives
Table F-3. Digest Authentication-Info header directives (from RFC 2617)
F.4 Reference Code
F.4.1 File "digcalc.h"
F.4.2 File "digcalc.c"
F.4.3 File "digtest.c"
Appendix G. Language Tags
G.1 First Subtag Rules
G.2 Second Subtag Rules
G.3 IANA-Registered Language Tags
Table G-1. Language tags
G.4 ISO 639 Language Codes
Table G-2. ISO 639 and 639-2 language codes
G.5 ISO 3166 Country Codes
Table G-3. ISO 3166 country codes
G.6 Language Administrative Organizations
Appendix H. MIME Charset Registry
H.1 MIME Charset Registry
H.2 Preferred MIME Names
H.3 Registered Charsets
Table H-1. IANA MIME charset tags
HTTP: The Definitive Guide

Copyright © 2002 O'Reilly & Associates, Inc. All rights reserved. Printed in the United States of America.

Published by O'Reilly & Associates, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O'Reilly & Associates books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http:// ). For more information contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Nutshell Handbook, the Nutshell Handbook logo, and the O'Reilly logo are registered trademarks of O'Reilly & Associates, Inc. Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and O'Reilly & Associates, Inc. was aware of a trademark claim, the designations have been printed in caps or initial caps. The association between the image of a thirteen-lined ground squirrel and the topic of HTTP is a trademark of O'Reilly & Associates, Inc.

While every precaution has been taken in the preparation of this book, the publisher and the authors assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.

Preface

The Hypertext Transfer Protocol (HTTP) is the protocol programs use to communicate over the World Wide Web. There are many applications of HTTP, but HTTP is most famous for two-way conversation between web browsers and web servers.

HTTP began as a simple protocol, so you might think there really isn't that much to say about it. And yet here you stand, with a two-pound book in your hands. If you're wondering how we could have written 650 pages on HTTP, take a look at the Table of Contents. This book isn't just an HTTP header reference manual; it's a veritable bible of web architecture.
In this book, we try to tease apart HTTP's interrelated and often misunderstood rules, and we offer you a series of topic-based chapters that explain all the aspects of HTTP. Throughout the book, we are careful to explain the "why" of HTTP, not just the "how." And to save you time chasing references, we explain many of the critical non-HTTP technologies that are required to make HTTP applications work. You can find the alphabetical header reference (which forms the basis of most conventional HTTP texts) in a conveniently organized appendix. We hope this conceptual design makes it easy for you to work with HTTP. This book is written for anyone who wants to understand HTTP and the underlying architecture of the Web. Software and hardware engineers can use this book as a coherent reference for HTTP and related web technologies. Systems architects and network administrators can use this book to better understand how to design, deploy, and manage complicated web architectures. Performance engineers and analysts can benefit from the sections on caching and performance optimization. Marketing and consulting professionals will be able to use the conceptual orientation to better understand the landscape of web technologies.
This book illustrates common misconceptions, advises on "tricks of the trade," provides convenient reference material, and serves as a readable introduction to dry and confusing standards specifications. In a single book, we detail the essential and interrelated technologies that make the Web work.

This book is the result of a tremendous amount of work by many people who share an enthusiasm for Internet technologies. We hope you find it useful.

Running Example: Joe's Hardware Store

Many of our chapters include a running example of a hypothetical online hardware and home-improvement store called "Joe's Hardware" to demonstrate technology concepts. We have set up a real web site for the store (http://www.joes-hardware.com) for you to test some of the examples in the book. We will maintain this web site while this book remains in print.

Chapter-by-Chapter Guide

This book contains 21 chapters, divided into 5 logical parts (each with a technology theme), and 8 useful appendixes containing reference data and surveys of related technologies: Part I, Part II, Part III, Part IV, Part V, and Part VI.

Part I describes the core technology of HTTP, the foundation of the Web, in four chapters:

• Chapter 1 is a rapid-paced overview of HTTP.
• Chapter 2 details the formats of uniform resource locators (URLs) and the various types of resources that URLs name across the Internet. It also outlines the evolution to uniform resource names (URNs).
• Chapter 3 details how HTTP messages transport web content.
• Chapter 4 explains the commonly misunderstood and poorly documented rules and behavior for managing HTTP connections.

Part II highlights the HTTP server, proxy, cache, gateway, and robot applications that are the architectural building blocks of web systems. (Web browsers are another building block, of course, but browsers already were covered thoroughly in Part I of the book.) Part II contains the following six chapters:

• Chapter 5 gives an overview of web server architectures.
• Chapter 6 explores HTTP proxy servers, which are intermediary servers that act as platforms for HTTP services and controls.
• Chapter 7 delves into the science of web caches, devices that improve performance and reduce traffic by making local copies of popular documents.
• Chapter 8 explains gateways and application servers that allow HTTP to work with software that speaks different protocols, including Secure Sockets Layer (SSL) encrypted protocols.
• Chapter 9 describes the various types of clients that pervade the Web, including the ubiquitous browsers, robots and spiders, and search engines.
• Chapter 10 talks about HTTP developments still in the works: the HTTP-NG protocol.

Part III presents a suite of techniques and technologies to track identity, enforce security, and control access to content. It contains the following four chapters:

• Chapter 11 talks about techniques to identify users so that content can be personalized to the user audience.
• Chapter 12 highlights the basic mechanisms to verify user identity. The chapter also examines how HTTP authentication interfaces with databases.
• Chapter 13 explains digest authentication, a complex proposed enhancement to HTTP that provides significantly enhanced security.
• Chapter 14 is a detailed overview of Internet cryptography, digital certificates, and SSL.

Part IV focuses on the bodies of HTTP messages (which contain the actual web content) and on the web standards that describe and manipulate content stored in the message bodies. Part IV contains three chapters:

• Chapter 15 describes the structure of HTTP content.
• Chapter 16 surveys the web standards that allow users around the globe to exchange content in different languages and character sets.
• Chapter 17 explains mechanisms for negotiating acceptable content.

Part V discusses the technology for publishing and disseminating web content. It contains four chapters:

• Chapter 18 discusses the ways people deploy servers in modern web hosting environments and HTTP support for virtual web hosting.
• Chapter 19 discusses the technologies for creating web content and installing it onto web servers.
• Chapter 20 surveys the tools and techniques for distributing incoming web traffic among a collection of servers.
• Chapter 21 covers log formats and common questions.
Part VI contains helpful reference appendixes and tutorials in related technologies:

• Appendix A summarizes the protocols supported through uniform resource identifier (URI) schemes.
• Appendix B conveniently lists the HTTP response codes.
• Appendix C provides a reference list of HTTP header fields.
• Appendix D provides an extensive list of MIME types and explains how MIME types are registered.
• Appendix E explains base-64 encoding, used by HTTP authentication.
• Appendix F gives details on how to implement various authentication schemes in HTTP.
• Appendix G defines language tag values for HTTP language headers.
• Appendix H provides a detailed list of character encodings, used for HTTP internationalization support.

Each chapter contains many examples and pointers to additional reference material.

Typographic Conventions

In this book, we use the following typographic conventions:

Italic
Used for URLs, C functions, command names, MIME types, new terms where they are defined, and emphasis

Constant width
Used for computer output, code, and any literal text

Constant width bold
Used for user input

Comments and Questions

Please address comments and questions concerning this book to the publisher:

O'Reilly & Associates, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
(800) 998-9938 (in the United States or Canada)
(707) 829-0515 (international/local)
(707) 829-0104 (fax)

There is a web page for this book, which lists errata, examples, or any additional information. You can access this page at:

http://www.oreilly.com/catalog/httptdg/

To comment or ask technical questions about this book, send email to:

bookquestions@oreilly.com

For more information about books, conferences, Resource Centers, and the O'Reilly Network, see the O'Reilly web site at:

http://www.oreilly.com

Acknowledgments

This book is the labor of many. The five authors would like to hold up a few people in thanks for their significant contributions to this project.

To start, we'd like to thank Linda Mui, our editor at O'Reilly. Linda first met with David and Brian way back in 1996, and she refined and steered several concepts into the book you hold today. Linda also helped keep our wandering gang of first-time book authors moving in a coherent direction and on a progressing (if not rapid) timeline. Most of all, Linda gave us the chance to create this book. We're very grateful.

We'd also like to thank several tremendously bright, knowledgeable, and kind souls who devoted noteworthy energy to reviewing, commenting on, and correcting drafts of this book. These include Tony Bourke, Sean Burke, Mike Chowla, Shernaz Daver, Fred Douglis, Paula Ferguson, Vikas Jha, Yves Lafon, Peter Mattis, Chuck Neerdaels, Luis Tavera, Duane Wessels, Dave Wu, and Marco Zagha. Their viewpoints and suggestions have improved the book tremendously.

Rob Romano from O'Reilly created most of the amazing artwork you'll find in this book. The book contains an unusually large number of detailed illustrations that make subtle concepts very clear. Many of these illustrations were painstakingly created and revised numerous times. If a picture is worth a thousand words, Rob added hundreds of pages of value to this book.

Brian would like to personally thank all of the authors for their dedication to this project.
A tremendous amount of time was invested by the authors in a challenge to make the first detailed but accessible treatment of HTTP. Weddings, childbirths, killer work projects, startup companies, and graduate schools intervened, but the authors held together to bring this project to a successful completion. We believe the result is worthy of everyone's hard work and, most importantly, that it provides a valuable service. Brian also would like to thank the employees of Inktomi for their enthusiasm and support and for their deep insights about the use of HTTP in real-world applications. Also, thanks to the fine folks at Cajun-shop.com for allowing us to use their site for some of the examples in this book.

David would like to thank his family, particularly his mother and grandfather for their ongoing support. He'd like to thank those that have put up with his erratic schedule over the years writing the book. He'd also like to thank Slurp, Orctomi, and Norma for everything they've done, and his fellow authors for all their hard work. Finally, he would like to thank Brian for roping him into yet another adventure.

Marjorie would like to thank her husband, Alan Liu, for technical insight, familial support and understanding. Marjorie thanks her fellow authors for many insights and inspirations. She is grateful for the experience of working together on this book.

Sailu would like to thank David and Brian for the opportunity to work on this book, and Chuck Neerdaels for introducing him to HTTP.

Anshu would like to thank his wife, Rashi, and his parents for their patience, support, and encouragement during the long years spent writing this book.

Finally, the authors collectively thank the famous and nameless Internet pioneers, whose research, development, and evangelism over the past four decades contributed so much to our scientific, social, and economic community. Without these labors, there would be no subject for this book.

Part I: HTTP: The Web's Foundation

This section is an introduction to the HTTP protocol. The next four chapters describe the core technology of HTTP, the foundation of the Web:

• Chapter 1 is a rapid-paced overview of HTTP.
• Chapter 2 details the formats of URLs and the various types of resources that URLs name across the Internet. We also outline the evolution to URNs.
• Chapter 3 details the HTTP messages that transport web content.
• Chapter 4 discusses the commonly misunderstood and poorly documented rules and behavior for managing TCP connections by HTTP.

Chapter 1. Overview of HTTP

The world's web browsers, servers, and related web applications all talk to each other through HTTP, the Hypertext Transfer Protocol. HTTP is the common language of the modern global Internet.

This chapter is a concise overview of HTTP. You'll see how web applications use HTTP to communicate, and you'll get a rough idea of how HTTP does its job.
In particular, we talk about:

• How web clients and servers communicate
• Where resources (web content) come from
• How web transactions work
• The format of the messages used for HTTP communication
• The underlying TCP network transport
• The different variations of the HTTP protocol
• Some of the many HTTP architectural components installed around the Internet

We've got a lot of ground to cover, so let's get started on our tour of HTTP.

1.1 HTTP: The Internet's Multimedia Courier

Billions of JPEG images, HTML pages, text files, MPEG movies, WAV audio files, Java applets, and more cruise through the Internet each and every day. HTTP moves the bulk of this information quickly, conveniently, and reliably from web servers all around the world to web browsers on people's desktops.

Because HTTP uses reliable data-transmission protocols, it guarantees that your data will not be damaged or scrambled in transit, even when it comes from the other side of the globe. This is good for you as a user, because you can access information without worrying about its integrity. Reliable transmission is also good for you as an Internet application developer, because you don't have to worry about HTTP communications being destroyed, duplicated, or distorted in transit. You can focus on programming the distinguishing details of your application, without worrying about the flaws and foibles of the Internet.

Let's look more closely at how HTTP transports the Web's traffic.

1.2 Web Clients and Servers

Web content lives on web servers. Web servers speak the HTTP protocol, so they are often called HTTP servers. These HTTP servers store the Internet's data and provide the data when it is requested by HTTP clients. The clients send HTTP requests to servers, and servers return the requested data in HTTP responses, as sketched in Figure 1-1. Together, HTTP clients and HTTP servers make up the basic components of the World Wide Web.

Figure 1-1. Web clients and servers

You probably use HTTP clients every day. The most common client is a web browser, such as Microsoft Internet Explorer or Netscape Navigator. Web browsers request HTTP objects from servers and display the objects on your screen.

When you browse to a page, such as "http://www.oreilly.com/index.html," your browser sends an HTTP request to the server www.oreilly.com (see Figure 1-1). The server tries to find the desired object (in this case, "/index.html") and, if successful, sends the object to the client in an HTTP response, along with the type of the object, the length of the object, and other information.
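
The request/response exchange just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular browser is implemented: the helper names build_get_request and parse_response are hypothetical, the request uses an HTTP/1.1-style Host header, and the response is a hand-written canned message rather than real output from www.oreilly.com, so the sketch runs without network access.

```python
from urllib.parse import urlsplit


def build_get_request(url: str) -> bytes:
    """Build the text a client might send to fetch `url` (hypothetical helper)."""
    parts = urlsplit(url)
    path = parts.path or "/"          # an empty path means the server root
    lines = [
        f"GET {path} HTTP/1.1",       # request line: method, object, version
        f"Host: {parts.hostname}",    # which server the request is meant for
        "Connection: close",
        "",                           # blank line ends the headers
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def parse_response(raw: bytes):
    """Split a raw response into status code, reason, headers, and body."""
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    _version, code, reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return int(code), reason, headers, body


# Assumed, hand-written response: it carries the object's type and length
# alongside the object itself, as the text describes.
CANNED_RESPONSE = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html\r\n"
    b"Content-Length: 19\r\n"
    b"\r\n"
    b"<html>Hello</html>\n"
)

if __name__ == "__main__":
    print(build_get_request("http://www.oreilly.com/index.html").decode("ascii"))
    code, reason, headers, body = parse_response(CANNED_RESPONSE)
    print(code, reason, headers["content-type"], headers["content-length"])
```

The client names the object ("/index.html") in the request line, and the server's reply carries the object together with its type (Content-Type) and length (Content-Length).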
1.3 Resources

Web servers host web resources. A web resource is the source of web content. The simplest kind of web resource is a static file on the web server's filesystem. These files can contain anything: they might be text files, HTML files, Microsoft Word files, Adobe Acrobat files, JPEG image files, AVI movie files, or any other format you can think of.

However, resources don't have to be static files. Resources can also be software programs that generate content on demand. These dynamic content resources can generate content based on your identity, on what information you've requested, or on the time of day. They can show you a live image from a camera, or let you trade stocks, search real estate databases, or buy gifts from online stores (see Figure 1-2).

Figure 1-2. A web resource is anything that provides web content

In summary, a resource is any kind of content source. A file containing your company's sales forecast spreadsheet is a resource. A web gateway to scan your local public library's shelves is a resource. An Internet search engine is a resource.

1.3.1 Media Types

Because the Internet hosts many thousands of different data types, HTTP carefully tags each object being transported through the Web with a data format label called a MIME type. MIME (Multipurpose Internet Mail Extensions) was originally designed to solve problems encountered in moving messages between different electronic mail systems. MIME worked so well for email that HTTP adopted it to describe and label its own multimedia content.

Web servers attach a MIME type to all HTTP object data (see Figure 1-3). When a web browser gets an object back from a server, it looks at the associated MIME type to see if it knows how to handle the object. Most browsers can handle hundreds of popular object types: displaying image files, parsing