HTTP_权威指南.pdf

Published: 2022-05-30 | Uploaded by: admin | Category: Manuals | File size: 5.24M | Format: pdf

Page 1 of 596

HTTP: The Definitive Guide

Preface

Running Example: Joe's Hardware Store

Chapter-by-Chapter Guide

Typographic Conventions

Comments and Questions

Acknowledgments

Part I: HTTP: The Web's Foundation

Chapter 1. Overview of HTTP

1.1 HTTP: The Internet's Multimedia Courier

1.2 Web Clients and Servers

Figure 1-1. Web clients and servers

1.3 Resources

Figure 1-2. A web resource is anything that provides web content

1.3.1 Media Types

Figure 1-3. MIME types are sent back with the data content

1.3.2 URIs

Figure 1-4. URLs specify protocol, server, and local resource

1.3.3 URLs

Table 1-1. Example URLs

1.3.4 URNs

1.4 Transactions

Figure 1-5. HTTP transactions consist of request and response messages

1.4.1 Methods

Table 1-2. Some common HTTP methods

1.4.2 Status Codes

Table 1-3. Some common HTTP status codes

1.4.3 Web Pages Can Consist of Multiple Objects

Figure 1-6. Composite web pages require separate HTTP transactions for each embedded resource

1.5 Messages

Figure 1-7. HTTP messages have a simple, line-oriented text structure

1.5.1 Simple Message Example

Figure 1-8. Example GET transaction for http://www.joes-hardware.com/tools.html

1.6 Connections

1.6.1 TCP/IP

Figure 1-9. HTTP network protocol stack

1.6.2 Connections, IP Addresses, and Port Numbers

Figure 1-10. Basic browser connection process

1.6.3 A Real Example Using Telnet

Example 1-1. An HTTP transaction using telnet

1.7 Protocol Versions

1.8 Architectural Components of the Web

1.8.1 Proxies

Figure 1-11. Proxies relay traffic between client and server

1.8.2 Caches

Figure 1-12. Caching proxies keep local copies of popular documents to improve performance

1.8.3 Gateways

Figure 1-13. HTTP/FTP gateway

1.8.4 Tunnels

Figure 1-14. Tunnels forward data across non-HTTP networks (HTTP/SSL tunnel shown)

1.8.5 Agents

Figure 1-15. Automated search engine "spiders" are agents, fetching web pages around the world

1.9 The End of the Beginning

1.10 For More Information

1.10.1 HTTP Protocol Information

1.10.2 Historical Perspective

1.10.3 Other World Wide Web Information

Chapter 2. URLs and Resources

2.1 Navigating the Internet's Resources

Figure 2-1. How URLs relate to browser, machine, server, and location on the server's filesystem

2.1.1 The Dark Days Before URLs

2.2 URL Syntax

Table 2-1. General URL components

2.2.1 Schemes: What Protocol to Use

2.2.2 Hosts and Ports

2.2.3 Usernames and Passwords

2.2.4 Paths

2.2.5 Parameters

2.2.6 Query Strings

Figure 2-2. The URL query component is sent along to the gateway application

2.2.7 Fragments

Figure 2-3. The URL fragment is used only by the client, because the server deals with entire objects

2.3 URL Shortcuts

2.3.1 Relative URLs

Example 2-1. HTML snippet with relative URLs

Figure 2-4. Using a base URL

2.3.1.1 Base URLs

2.3.1.2 Resolving relative references

Figure 2-5. Converting relative to absolute URLs

2.3.2 Expandomatic URLs

2.4 Shady Characters

2.4.1 The URL Character Set

2.4.2 Encoding Mechanisms

Table 2-2. Some encoded character examples

2.4.3 Character Restrictions

Table 2-3. Reserved and restricted characters

2.4.4 A Bit More

2.5 A Sea of Schemes

Table 2-4. Common scheme formats

2.6 The Future

Figure 2-6. PURLs use a resource locator server to name the current location of a resource

2.6.1 If Not Now, When?

2.7 For More Information

Chapter 3. HTTP Messages

3.1 The Flow of Messages

3.1.1 Messages Commute Inbound to the Origin Server

Figure 3-1. Messages travel inbound to the origin server and outbound back to the client

3.1.2 Messages Flow Downstream

Figure 3-2. All messages flow downstream

3.2 The Parts of a Message

Figure 3-3. Three parts of an HTTP message

3.2.1 Message Syntax

Figure 3-4. An HTTP transaction has request and response messages

Figure 3-5. Example request and response messages

3.2.2 Start Lines

3.2.2.1 Request line

3.2.2.2 Response line

3.2.2.3 Methods

Table 3-1. Common HTTP methods

3.2.2.4 Status codes

Table 3-2. Status code classes

Table 3-3. Common status codes

3.2.2.5 Reason phrases

3.2.2.6 Version numbers

3.2.3 Headers

3.2.3.1 Header classifications

Table 3-4. Common header examples

3.2.3.2 Header continuation lines

3.2.4 Entity Bodies

3.2.5 Version 0.9 Messages

Figure 3-6. HTTP/0.9 transaction

3.3 Methods

3.3.1 Safe Methods

3.3.2 GET

Figure 3-7. GET example

3.3.3 HEAD

Figure 3-8. HEAD example

3.3.4 PUT

Figure 3-9. PUT example

3.3.5 POST

Figure 3-10. POST example

3.3.6 TRACE

Figure 3-11. TRACE example

3.3.7 OPTIONS

Figure 3-12. OPTIONS example

3.3.8 DELETE

Figure 3-13. DELETE example

3.3.9 Extension Methods

Table 3-5. Example web publishing extension methods

3.4 Status Codes

3.4.1 100-199: Informational Status Codes

Table 3-6. Informational status codes and reason phrases

3.4.1.1 Clients and 100 Continue

3.4.1.2 Servers and 100 Continue

3.4.1.3 Proxies and 100 Continue

3.4.2 200-299: Success Status Codes

Table 3-7. Success status codes and reason phrases

3.4.3 300-399: Redirection Status Codes

Figure 3-14. Redirected request to new location

Figure 3-15. Request redirected to use local copy

Table 3-8. Redirection status codes and reason phrases

3.4.4 400-499: Client Error Status Codes

Table 3-9. Client error status codes and reason phrases

3.4.5 500-599: Server Error Status Codes

Table 3-10. Server error status codes and reason phrases

3.5 Headers

3.5.1 General Headers

Table 3-11. General informational headers

3.5.1.1 General caching headers

Table 3-12. General caching headers

3.5.2 Request Headers

Table 3-13. Request informational headers

3.5.2.1 Accept headers

Table 3-14. Accept headers

3.5.2.2 Conditional request headers

Table 3-15. Conditional request headers

3.5.2.3 Request security headers

Table 3-16. Request security headers

3.5.2.4 Proxy request headers

Table 3-17. Proxy request headers

3.5.3 Response Headers

Table 3-18. Response informational headers

3.5.3.1 Negotiation headers

Table 3-19. Negotiation headers

3.5.3.2 Response security headers

Table 3-20. Response security headers

3.5.4 Entity Headers

Table 3-21. Entity informational headers

3.5.4.1 Content headers

Table 3-22. Content headers

3.5.4.2 Entity caching headers

Table 3-23. Entity caching headers

3.6 For More Information

Chapter 4. Connection Management

4.1 TCP Connections

Figure 4-1. Web browsers talk to web servers over TCP connections

4.1.1 TCP Reliable Data Pipes

Figure 4-2. TCP carries HTTP data in order, and without corruption

4.1.2 TCP Streams Are Segmented and Shipped by IP Packets

Figure 4-3. HTTP and HTTPS network protocol stacks

Figure 4-4. IP packets carry TCP segments, which carry chunks of the TCP data stream

4.1.3 Keeping TCP Connections Straight

Table 4-1. TCP connection values

Figure 4-5. Four distinct TCP connections

4.1.4 Programming with TCP Sockets

Table 4-2. Common socket interface functions for programming TCP connections

Figure 4-6. How TCP clients and servers communicate using the TCP sockets interface

4.2 TCP Performance Considerations

4.2.1 HTTP Transaction Delays

Figure 4-7. Timeline of a serial HTTP transaction

4.2.2 Performance Focus Areas

4.2.3 TCP Connection Handshake Delays

Figure 4-8. TCP requires two packet transfers to set up the connection before it can send data

4.2.4 Delayed Acknowledgments

4.2.5 TCP Slow Start

4.2.6 Nagle's Algorithm and TCP_NODELAY

4.2.7 TIME_WAIT Accumulation and Port Exhaustion

4.3 HTTP Connection Handling

4.3.1 The Oft-Misunderstood Connection Header

Figure 4-9. The Connection header allows the sender to specify connection-specific options

4.3.2 Serial Transaction Delays

Figure 4-10. Four transactions (serial)

4.4 Parallel Connections

Figure 4-11. Each component of a page involves a separate HTTP transaction

4.4.1 Parallel Connections May Make Pages Load Faster

Figure 4-12. Four transactions (parallel)

4.4.2 Parallel Connections Are Not Always Faster

4.4.3 Parallel Connections May "Feel" Faster

4.5 Persistent Connections

4.5.1 Persistent Versus Parallel Connections

4.5.2 HTTP/1.0+ Keep-Alive Connections

Figure 4-13. Four transactions (serial versus persistent)

4.5.3 Keep-Alive Operation

Figure 4-14. HTTP/1.0 keep-alive transaction header handshake

4.5.4 Keep-Alive Options

4.5.5 Keep-Alive Connection Restrictions and Rules

4.5.6 Keep-Alive and Dumb Proxies

4.5.6.1 The Connection header and blind relays

Figure 4-15. Keep-alive doesn't interoperate with proxies that don't support Connection headers

4.5.6.2 Proxies and hop-by-hop headers

4.5.7 The Proxy-Connection Hack

Figure 4-16. Proxy-Connection header fixes single blind relay

Figure 4-17. Proxy-Connection still fails for deeper hierarchies of proxies

4.5.8 HTTP/1.1 Persistent Connections

4.5.9 Persistent Connection Restrictions and Rules

4.6 Pipelined Connections

Figure 4-18. Four transactions (pipelined connections)

4.7 The Mysteries of Connection Close

4.7.1 "At Will" Disconnection

4.7.2 Content-Length and Truncation

4.7.3 Connection Close Tolerance, Retries, and Idempotency

4.7.4 Graceful Connection Close

Figure 4-19. TCP connections are bidirectional

4.7.4.1 Full and half closes

Figure 4-20. Full and half close

4.7.4.2 TCP close and reset errors

Figure 4-21. Data arriving at closed connection generates "connection reset by peer" error

4.7.4.3 Graceful close

4.8 For More Information

4.8.1 HTTP Connections

4.8.2 HTTP Performance Issues

4.8.3 TCP/IP

Part II: HTTP Architecture

Chapter 5. Web Servers

5.1 Web Servers Come in All Shapes and Sizes

5.1.1 Web Server Implementations

5.1.2 General-Purpose Software Web Servers

Figure 5-1. Web server market share as estimated by Netcraft's automated survey

5.1.3 Web Server Appliances

5.1.4 Embedded Web Servers

5.2 A Minimal Perl Web Server

Example 5-1. type-o-serve: a minimal Perl web server

Figure 5-2. The type-o-serve utility lets you type in server responses to send back to clients

5.3 What Real Web Servers Do

Figure 5-3. Steps of a basic web server request

5.4 Step 1: Accepting Client Connections

5.4.1 Handling New Connections

5.4.2 Client Hostname Identification

Example 5-2. Configuring Apache to look up hostnames for HTML and CGI resources

5.4.3 Determining the Client User Through ident

Figure 5-4. Using the ident protocol to determine HTTP client username

5.5 Step 2: Receiving Request Messages

Figure 5-5. Reading a request message from a connection

5.5.1 Internal Representations of Messages

Figure 5-6. Parsing a request message into a convenient internal representation

5.5.2 Connection Input/Output Processing Architectures

Figure 5-7. Web server input/output architectures

5.6 Step 3: Processing Requests

5.7 Step 4: Mapping and Accessing Resources

5.7.1 Docroots

Figure 5-8. Mapping request URI to local web server resource

5.7.1.1 Virtually hosted docroots

Figure 5-9. Different docroots for virtually hosted requests

Example 5-3. Apache web server virtual host docroot configuration

5.7.1.2 User home directory docroots

Figure 5-10. Different docroots for different users

5.7.2 Directory Listings

5.7.3 Dynamic Content Resource Mapping

Figure 5-11. A web server can serve static resources as well as dynamic resources

5.7.4 Server-Side Includes (SSI)

5.7.5 Access Controls

5.8 Step 5: Building Responses

5.8.1 Response Entities

5.8.2 MIME Typing

Figure 5-12. A web server uses MIME types file to set outgoing Content-Type of resources

5.8.3 Redirection

5.9 Step 6: Sending Responses

5.10 Step 7: Logging

5.11 For More Information

Chapter 6. Proxies

6.1 Web Intermediaries

Figure 6-1. A proxy must be both a server and a client

6.1.1 Private and Shared Proxies

6.1.2 Proxies Versus Gateways

Figure 6-2. Proxies speak the same protocol; gateways tie together different protocols

6.2 Why Use Proxies?

Figure 6-3. Proxy application example: child-safe Internet filter

Figure 6-4. Proxy application example: centralized document access control

Figure 6-5. Proxy application example: security firewall

Figure 6-6. Proxy application example: web cache

Figure 6-7. Proxy application example: surrogate (in a server accelerator deployment)

Figure 6-8. Proxy application example: content routing

Figure 6-9. Proxy application example: content transcoder

Figure 6-10. Proxy application example: anonymizer

6.3 Where Do Proxies Go?

6.3.1 Proxy Server Deployment

Figure 6-11. Proxies can be deployed many ways, depending on their intended use

6.3.2 Proxy Hierarchies

Figure 6-12. Three-level proxy hierarchy

6.3.2.1 Proxy hierarchy content routing

Figure 6-13. Proxy hierarchies can be dynamic, changing for each request

6.3.3 How Proxies Get Traffic

Figure 6-14. There are many techniques to direct web requests to proxies

6.4 Client Proxy Settings

6.4.1 Client Proxy Configuration: Manual

6.4.2 Client Proxy Configuration: PAC Files

Table 6-1. Proxy auto-configuration script return values

Example 6-1. Example proxy auto-configuration file

6.4.3 Client Proxy Configuration: WPAD

6.5 Tricky Things About Proxy Requests

6.5.1 Proxy URIs Differ from Server URIs

Figure 6-15. Intercepting proxies will get server requests

6.5.2 The Same Problem with Virtual Hosting

6.5.3 Intercepting Proxies Get Partial URIs

6.5.4 Proxies Can Handle Both Proxy and Server Requests

6.5.5 In-Flight URI Modification

6.5.6 URI Client Auto-Expansion and Hostname Resolution

6.5.7 URI Resolution Without a Proxy

Figure 6-16. Browser auto-expands partial hostnames when no explicit proxy is present

6.5.8 URI Resolution with an Explicit Proxy

Figure 6-17. Browser does not auto-expand partial hostnames when there is an explicit proxy

6.5.9 URI Resolution with an Intercepting Proxy

Figure 6-18. Browser doesn't detect dead server IP addresses when using intercepting proxies

6.6 Tracing Messages

Figure 6-19. Access proxies and CDN proxies create two-level proxy hierarchies

6.6.1 The Via Header

Figure 6-20. Via header example

6.6.1.1 Via syntax

6.6.1.2 Via request and response paths

Figure 6-21. The response Via is usually the reverse of the request Via

6.6.1.3 Via and gateways

Figure 6-22. HTTP/FTP gateway generates Via headers, logging the received protocol (FTP)

6.6.1.4 The Server and Via headers

6.6.1.5 Privacy and security implications of Via

6.6.2 The TRACE Method

Figure 6-23. TRACE response reflects back the received request message

6.6.2.1 Max-Forwards

Figure 6-24. You can limit the forwarding hop count with the Max-Forwards header field

6.7 Proxy Authentication

Figure 6-25. Proxies can implement authentication to control access to content

6.8 Proxy Interoperation

6.8.1 Handling Unsupported Headers and Methods

6.8.2 OPTIONS: Discovering Optional Feature Support

Figure 6-26. Using OPTIONS to find a server's supported methods

6.8.3 The Allow Header

6.9 For More Information

Chapter 7. Caching

7.1 Redundant Data Transfers

7.2 Bandwidth Bottlenecks

Figure 7-1. Limited wide area bandwidth creates a bottleneck that caches can improve

Table 7-1. Bandwidth-imposed transfer time delays, idealized (time in seconds)

7.3 Flash Crowds

Figure 7-2. Flash crowds can overload web servers

7.4 Distance Delays

Figure 7-3. Speed of light can cause significant delays, even with parallel, keep-alive connections

7.5 Hits and Misses

Figure 7-4. Cache hits, misses, and revalidations

7.5.1 Revalidations

Figure 7-5. Successful revalidations are faster than cache misses; failed revalidations are nearly identical to misses

Figure 7-6. HTTP uses If-Modified-Since header for revalidation

7.5.2 Hit Rate

7.5.3 Byte Hit Rate

7.5.4 Distinguishing Hits and Misses

7.6 Cache Topologies

Figure 7-7. Public and private caches

7.6.1 Private Caches

7.6.2 Public Proxy Caches

Figure 7-8. Shared, public caches can decrease network traffic

7.6.3 Proxy Cache Hierarchies

Figure 7-9. Accessing documents in a two-level cache hierarchy

7.6.4 Cache Meshes, Content Routing, and Peering

Figure 7-10. Sibling caches

7.7 Cache Processing Steps

Figure 7-11. Processing a fresh cache hit

7.7.1 Step 1: Receiving

7.7.2 Step 2: Parsing

7.7.3 Step 3: Lookup

7.7.4 Step 4: Freshness Check

7.7.5 Step 5: Response Creation

7.7.6 Step 6: Sending

7.7.7 Step 7: Logging

7.7.8 Cache Processing Flowchart

Figure 7-12. Cache GET request flowchart

7.8 Keeping Copies Fresh

7.8.1 Document Expiration

Figure 7-13. Expires and Cache Control headers

7.8.2 Expiration Dates and Ages

Table 7-2. Expiration response headers

7.8.3 Server Revalidation

7.8.4 Revalidation with Conditional Methods

Table 7-3. Two conditional headers used in cache revalidation

7.8.5 If-Modified-Since: Date Revalidation

Figure 7-14. If-Modified-Since revalidations return 304 if unchanged or 200 with new body if changed

7.8.6 If-None-Match: Entity Tag Revalidation

Figure 7-15. If-None-Match revalidates because entity tag still matches

7.8.7 Weak and Strong Validators

7.8.8 When to Use Entity Tags and Last-Modified Dates

7.9 Controlling Cachability

7.9.1 No-Cache and No-Store Headers

7.9.2 Max-Age Response Headers

7.9.3 Expires Response Headers

7.9.4 Must-Revalidate Response Headers

7.9.5 Heuristic Expiration

Figure 7-16. Computing a freshness period using the LM-Factor algorithm

7.9.6 Client Freshness Constraints

Table 7-4. Cache-Control request directives

7.9.7 Cautions

7.10 Setting Cache Controls

7.10.1 Controlling HTTP Headers with Apache

7.10.2 Controlling HTML Caching Through HTTP-EQUIV

Figure 7-17. HTTP-EQUIV tags cause problems, because most software ignores them

7.11 Detailed Algorithms

7.11.1 Age and Freshness Lifetime

7.11.2 Age Computation

Example 7-1. HTTP/1.1 age-calculation algorithm calculates the overall age of a cached document

7.11.2.1 Apparent age is based on the Date header

7.11.2.2 Hop-by-hop age calculations

7.11.2.3 Compensating for network delays

7.11.3 Complete Age-Calculation Algorithm

Figure 7-18. The age of a cached document includes resident time in the network and cache

7.11.4 Freshness Lifetime Computation

7.11.5 Complete Server-Freshness Algorithm

Example 7-2. Server freshness constraint calculation

Example 7-3. Client freshness constraint calculation

7.12 Caches and Advertising

7.12.1 The Advertiser's Dilemma

7.12.2 The Publisher's Response

7.12.3 Log Migration

7.12.4 Hit Metering and Usage Limiting

7.13 For More Information

Chapter 8. Integration Points: Gateways, Tunnels, and Relays

8.1 Gateways

Figure 8-1. Gateway magic

Figure 8-2. Three web gateway examples

8.1.1 Client-Side and Server-Side Gateways

8.2 Protocol Gateways

Figure 8-3. Configuring an HTTP/FTP gateway

Figure 8-4. Browsers can configure particular protocols to use particular gateways

8.2.1 HTTP/*: Server-Side Web Gateways

Figure 8-5. The HTTP/FTP gateway translates HTTP request into FTP requests

8.2.2 HTTP/HTTPS: Server-Side Security Gateways

Figure 8-6. Inbound HTTP/HTTPS security gateway

8.2.3 HTTPS/HTTP: Client-Side Security Accelerator Gateways

Figure 8-7. HTTPS/HTTP security accelerator gateway

8.3 Resource Gateways

Figure 8-8. An application server connects HTTP clients to arbitrary backend applications

Figure 8-9. Server gateway application mechanics

8.3.1 Common Gateway Interface (CGI)

8.3.2 Server Extension APIs

8.4 Application Interfaces and Web Services

8.5 Tunnels

8.5.1 Establishing HTTP Tunnels with CONNECT

Figure 8-10. Using CONNECT to establish an SSL tunnel

8.5.1.1 CONNECT requests

8.5.1.2 CONNECT responses

8.5.2 Data Tunneling, Timing, and Connection Management

8.5.3 SSL Tunneling

Figure 8-11. Tunnels let non-HTTP traffic flow through HTTP connections

Figure 8-12. Direct SSL connection vs. tunnelled SSL connection

8.5.4 SSL Tunneling Versus HTTP/HTTPS Gateways

8.5.5 Tunnel Authentication

Figure 8-13. Gateways can proxy-authenticate a client before it's allowed to use a tunnel

8.5.6 Tunnel Security Considerations

8.6 Relays

Figure 8-14. Simple blind relays can hang if they are single-tasking and don't support the Connection header

8.7 For More Information

Chapter 9. Web Robots

9.1 Crawlers and Crawling

9.1.1 Where to Start: The "Root Set"

Figure 9-1. A root set is needed to reach all pages

9.1.2 Extracting Links and Normalizing Relative Links

9.1.3 Cycle Avoidance

Figure 9-2. Crawling over a web of hyperlinks

9.1.4 Loops and Dups

9.1.5 Trails of Breadcrumbs

9.1.6 Aliases and Robot Cycles

Table 9-1. Different URLs that alias to the same documents

9.1.7 Canonicalizing URLs

9.1.8 Filesystem Link Cycles

Figure 9-3. Symbolic link cycles

9.1.9 Dynamic Virtual Web Spaces

Figure 9-4. Malicious dynamic web space example

9.1.10 Avoiding Loops and Dups

9.2 Robotic HTTP

9.2.1 Identifying Request Headers

9.2.2 Virtual Hosting

Figure 9-5. Example of virtual docroots causing trouble if no Host header is sent with the request

9.2.3 Conditional Requests

9.2.4 Response Handling

9.2.4.1 Status codes

9.2.4.2 Entities

9.2.5 User-Agent Targeting

9.3 Misbehaving Robots

9.4 Excluding Robots

Figure 9-6. Fetching robots.txt and verifying accessibility before crawling the target file

9.4.1 The Robots Exclusion Standard

Table 9-2. Robots Exclusion Standard versions

9.4.2 Web Sites and robots.txt Files

9.4.2.1 Fetching robots.txt

9.4.2.2 Response codes

9.4.3 robots.txt File Format

9.4.3.1 The User-Agent line

9.4.3.2 The Disallow and Allow lines

9.4.3.3 Disallow/Allow prefix matching

Table 9-3. Robots.txt path matching examples

9.4.4 Other robots.txt Wisdom

9.4.5 Caching and Expiration of robots.txt

9.4.6 Robot Exclusion Perl Code

Table 9-4. Robot accessibility to the Mary's Antiques web site

9.4.7 HTML Robot-Control META Tags

9.4.7.1 Robot META directives

9.4.7.2 Search engine META tags

Table 9-5. Additional META tag directives

9.5 Robot Etiquette

Table 9-6. Guidelines for web robot operators

9.6 Search Engines

9.6.1 Think Big

9.6.2 Modern Search Engine Architecture

Figure 9-7. A production search engine contains cooperating crawlers and query gateways

9.6.3 Full-Text Index

Figure 9-8. Three documents and a full-text index

9.6.4 Posting the Query

Figure 9-9. Example search query request

9.6.5 Sorting and Presenting the Results

9.6.6 Spoofing

9.7 For More Information

Chapter 10. HTTP-NG

10.1 HTTP's Growing Pains

10.2 HTTP-NG Activity

10.3 Modularize and Enhance

Figure 10-1. HTTP-NG separates functions into layers

10.4 Distributed Objects

10.5 Layer 1: Messaging

10.6 Layer 2: Remote Invocation

10.7 Layer 3: Web Application

10.8 WebMUX

Figure 10-2. WebMUX can multiplex multiple messages over a single connection

10.9 Binary Wire Protocol

10.10 Current Status

10.11 For More Information

Part III: Identification, Authorization, and Security

Chapter 11. Client Identification and Cookies

11.1 The Personal Touch

11.2 HTTP Headers

Table 11-1. HTTP headers carry clues about users

11.3 Client IP Address

Figure 11-1. Proxies can add extension headers to pass along the original client IP address

11.4 User Login

Figure 11-2. Registering username using HTTP authentication headers

11.5 Fat URLs

11.6 Cookies

11.6.1 Types of Cookies

11.6.2 How Cookies Work

Figure 11-3. Slapping a cookie onto a user

11.6.3 Cookie Jar: Client-Side State

11.6.3.1 Netscape Navigator cookies

11.6.3.2 Microsoft Internet Explorer cookies

Figure 11-4. Internet Explorer cookies are stored in individual text files in the cache directory

11.6.4 Different Cookies for Different Sites

11.6.4.1 Cookie Domain attribute

11.6.4.2 Cookie Path attribute

11.6.5 Cookie Ingredients

Table 11-2. Cookie specifications

11.6.6 Version 0 (Netscape) Cookies

11.6.6.1 Version 0 Set-Cookie header

Table 11-3. Version 0 (Netscape) Set-Cookie attributes

11.6.6.2 Version 0 Cookie header

11.6.7 Version 1 (RFC 2965) Cookies

11.6.7.1 Version 1 Set-Cookie2 header

Table 11-4. Version 1 (RFC 2965) Set-Cookie2 attributes

11.6.7.2 Version 1 Cookie header

11.6.7.3 Version 1 Cookie2 header and version negotiation

11.6.8 Cookies and Session Tracking

Figure 11-5. The Amazon.com web site uses session cookies to track users

11.6.9 Cookies and Caching

11.6.10 Cookies, Security, and Privacy

11.7 For More Information

Chapter 12. Basic Authentication

12.1 Authentication

12.1.1 HTTP's Challenge/Response Authentication Framework

Figure 12-1. Simplified challenge/response authentication

12.1.2 Authentication Protocols and Headers

Table 12-1. Four phases of authentication

Figure 12-2. Basic authentication example

12.1.3 Security Realms

Figure 12-3. Security realms in a web server

12.2 Basic Authentication

12.2.1 Basic Authentication Example

Table 12-2. Basic authentication headers

12.2.2 Base-64 Username/Password Encoding

Figure 12-4. Generating a basic Authorization header from username and password

12.2.3 Proxy Authentication

Table 12-3. Web server versus proxy authentication

12.3 The Security Flaws of Basic Authentication

12.4 For More Information

Chapter 13. Digest Authentication

13.1 The Improvements of Digest Authentication

13.1.1 Using Digests to Keep Passwords Secret

Figure 13-1. Using digests for password-obscured authentication

13.1.2 One-Way Digests

Table 13-1. MD5 digest examples

13.1.3 Using Nonces to Prevent Replays

13.1.4 The Digest Authentication Handshake

Figure 13-2. Digest authentication handshake

Figure 13-3. Basic versus digest authentication syntax

13.2 Digest Calculations

13.2.1 Digest Algorithm Input Data

13.2.2 The Algorithms H(d) and KD(s,d)

13.2.3 The Security-Related Data (A1)

Table 13-2. Definitions for A1 by algorithm

13.2.4 The Message-Related Data (A2)

Table 13-3. Definitions for A2 by algorithm (request digests)

13.2.5 Overall Digest Algorithm

Table 13-4. Old and new digest algorithms

Table 13-5. Unfolded digest algorithm cheat sheet

13.2.6 Digest Authentication Session

13.2.7 Preemptive Authorization

Figure 13-4. Preemptive authorization reduces message count

13.2.7.1 Next nonce pregeneration

13.2.7.2 Limited nonce reuse

13.2.7.3 Synchronized nonce generation

13.2.8 Nonce Selection

13.2.9 Symmetric Authentication

Table 13-6. Definitions for A2 by algorithm (request digests)

Table 13-7. Definitions for A2 by algorithm (response digests)

13.3 Quality of Protection Enhancements

13.3.1 Message Integrity Protection

13.3.2 Digest Authentication Headers

Table 13-8. HTTP authentication headers

13.4 Practical Considerations

13.4.1 Multiple Challenges

13.4.2 Error Handling

13.4.3 Protection Spaces

13.4.4 Rewriting URIs

13.4.5 Caches

13.5 Security Considerations

13.5.1 Header Tampering

13.5.2 Replay Attacks

13.5.3 Multiple Authentication Mechanisms

13.5.4 Dictionary Attacks

13.5.5 Hostile Proxies and Man-in-the-Middle Attacks

13.5.6 Chosen Plaintext Attacks

13.5.7 Storing Passwords

13.6 For More Information

Chapter 14. Secure HTTP

14.1 Making HTTP Safe

14.1.1 HTTPS

Figure 14-1. Browsing secure web sites

Figure 14-2. HTTPS is HTTP layered over a security layer, layered over TCP

14.2 Digital Cryptography

14.2.1 The Art and Science of Secret Coding

14.2.2 Ciphers

Figure 14-3. Plaintext and ciphertext

Figure 14-4. Rotate-by-3 cipher example

14.2.3 Cipher Machines

14.2.4 Keyed Ciphers

Figure 14-5. The rotate-by-N cipher, using different keys

14.2.5 Digital Ciphers

Figure 14-6. Plaintext is encoded with encoding key e, and decoded using decoding key d

14.3 Symmetric-Key Cryptography

Figure 14-7. Symmetric-key cryptography algorithms use the same key for encoding and decoding

14.3.1 Key Length and Enumeration Attacks

Table 14-1. Longer keys take more effort to crack (1995 data, from "Applied Cryptography")

14.3.2 Establishing Shared Keys

14.4 Public-Key Cryptography

Figure 14-8. Public-key cryptography is asymmetric, using different keys for encoding and decoding

Figure 14-9. Public-key cryptography assigns a single, public encoding key to each host

14.4.1 RSA

14.4.2 Hybrid Cryptosystems and Session Keys

14.5 Digital Signatures

14.5.1 Signatures Are Cryptographic Checksums

Figure 14-10. Unencrypted digital signature

14.6 Digital Certificates

14.6.1 The Guts of a Certificate

Figure 14-11. Typical digital signature format

14.6.2 X.509 v3 Certificates

Table 14-2. X.509 certificate fields

14.6.3 Using Certificates to Authenticate Servers

Figure 14-12. Verifying that a signature is real

14.7 HTTPS: The Details

14.7.1 HTTPS Overview

Figure 14-13. HTTP transport-level security

14.7.2 HTTPS Schemes

Figure 14-14. HTTP and HTTPS port numbers

14.7.3 Secure Transport Setup

Figure 14-15. HTTP and HTTPS transactions

14.7.4 SSL Handshake

Figure 14-16. SSL handshake (simplified)

14.7.5 Server Certificates

Figure 14-17. HTTPS certificates are X.509 certificates with site information

14.7.6 Site Certificate Validation

14.7.7 Virtual Hosting and Certificates

Figure 14-18. Certificate name mismatches bring up certificate error dialog boxes

14.8 A Real HTTPS Client

14.8.1 OpenSSL

14.8.2 A Simple HTTPS Client

14.8.3 Executing Our Simple OpenSSL Client

14.9 Tunneling Secure Traffic Through Proxies

Figure 14-19. Corporate firewall proxy

Figure 14-20. Proxy can't proxy an encrypted request

14.10 For More Information

14.10.1 HTTP Security

14.10.2 SSL and TLS

14.10.3 Public-Key Infrastructure

14.10.4 Digital Cryptography

Part IV: Entities, Encodings, and Internationalization

Chapter 15. Entities and Encodings

15.1 Messages Are Crates, Entities Are Cargo

Figure 15-1. Message entity is made up of entity headers and entity body

15.1.1 Entity Bodies

Figure 15-2. Hex dumps of real message content (raw message content follows blank CRLF)

15.2 Content-Length: The Entity's Size

15.2.1 Detecting Truncation

15.2.2 Incorrect Content-Length

15.2.3 Content-Length and Persistent Connections

15.2.4 Content Encoding

15.2.5 Rules for Determining Entity Body Length

15.3 Entity Digests

15.4 Media Type and Charset

Table 15-1. Common media types

15.4.1 Character Encodings for Text Media

15.4.2 Multipart Media Types

15.4.3 Multipart Form Submissions

15.4.4 Multipart Range Responses

15.5 Content Encoding

15.5.1 The Content-Encoding Process

Figure 15-3. Content-encoding example

15.5.2 Content-Encoding Types

Table 15-2. Content-encoding tokens

15.5.3 Accept-Encoding Headers

Figure 15-4. Content encoding

15.6 Transfer Encoding and Chunked Encoding

Figure 15-5. Content encodings versus transfer encodings

15.6.1 Safe Transport

15.6.2 Transfer-Encoding Headers

15.6.3 Chunked Encoding

15.6.3.1 Chunking and persistent connections

Figure 15-6. Anatomy of a chunked message

15.6.3.2 Trailers in chunked messages

15.6.4 Combining Content and Transfer Encodings

Figure 15-7. Combining content encoding with transfer encoding

15.6.5 Transfer-Encoding Rules

15.7 Time-Varying Instances

Figure 15-8. Instances are "snapshots" of a resource in time

15.8 Validators and Freshness

15.8.1 Freshness

Table 15-3. Cache-Control header directives

15.8.2 Conditionals and Validators

Table 15-4. Conditional request types

15.9 Range Requests

Figure 15-9. Entity range request example

15.10 Delta Encoding

Figure 15-10. Mechanics of delta-encoding

Table 15-5. Delta-encoding headers

15.10.1 Instance Manipulations, Delta Generators, and Delta Appliers

Table 15-6. IANA registered types of instance manipulations

15.11 For More Information

Chapter 16. Internationalization

16.1 HTTP Support for International Content

16.2 Character Sets and HTTP

16.3 Multilingual Character Encoding Primer

16.4 Language Tags and HTTP

16.5 Internationalized URIs

16.6 Other Considerations

16.7 For More Information

Chapter 17. Content Negotiation and Transcoding

17.1 Content-Negotiation Techniques

17.2 Client-Driven Negotiation

17.3 Server-Driven Negotiation

17.4 Transparent Negotiation

17.5 Transcoding

17.6 Next Steps

17.7 For More Information

Part V: Content Publishing and Distribution

Chapter 18. Web Hosting

18.1 Hosting Services

18.1.1 A Simple Example: Dedicated Hosting

Figure 18-1. Outsourced dedicated hosting

18.2 Virtual Hosting

Figure 18-2. Outsourced virtual hosting

18.2.1 Virtual Server Request Lacks Host Information

Figure 18-3. HTTP/1.0 server requests don't contain hostname information

18.2.2 Making Virtual Hosting Work

18.2.2.1 Virtual hosting by URL path

18.2.2.2 Virtual hosting by port number

18.2.2.3 Virtual hosting by IP address

Figure 18-4. Virtual IP hosting

18.2.2.4 Virtual hosting by Host header

Figure 18-5. Host headers distinguish virtual host requests

18.2.3 HTTP/1.1 Host Headers

18.2.3.1 Syntax and usage

18.2.3.2 Missing Host headers

18.2.3.3 Interpreting Host headers

18.2.3.4 Host headers and proxies

18.3 Making Web Sites Reliable

18.3.1 Mirrored Server Farms

Figure 18-6. Mirrored server farm

Figure 18-7. Dispersed mirrored servers

18.3.2 Content Distribution Networks

18.3.3 Surrogate Caches in CDNs

18.3.4 Proxy Caches in CDNs

Figure 18-8. Client requests intercepted by a switch and sent to a proxy

18.4 Making Web Sites Fast

18.5 For More Information

Chapter 19. Publishing Systems

19.1 FrontPage Server Extensions for Publishing Support

19.1.1 FrontPage Server Extensions

Figure 19-1. FrontPage publishing architecture

19.1.2 FrontPage Vocabulary

19.1.3 The FrontPage RPC Protocol

Figure 19-2. Initial request

19.1.3.1 Request

19.1.3.2 Response

19.1.4 FrontPage Security Model

19.2 WebDAV and Collaborative Authoring

19.2.1 WebDAV Methods

19.2.2 WebDAV and XML

19.2.3 WebDAV Headers

19.2.4 WebDAV Locking and Overwrite Prevention

Figure 19-3. Lost update problem

19.2.5 The LOCK Method

19.2.5.1 The opaquelocktoken scheme

19.2.5.2 The XML element

19.2.5.3 Lock refreshes and the Timeout header

19.2.6 The UNLOCK Method

Table 19-1. Status codes for LOCK and UNLOCK methods

19.2.7 Properties and META Data

19.2.8 The PROPFIND Method

19.2.9 The PROPPATCH Method

Table 19-2. Status codes for PROPFIND and PROPPATCH methods

19.2.10 Collections and Namespace Management

19.2.11 The MKCOL Method

19.2.12 The DELETE Method

19.2.13 The COPY and MOVE Methods

19.2.13.1 Overwrite header effect

19.2.13.2 COPY/MOVE of properties

19.2.13.3 Locked resources and COPY/MOVE

Table 19-3. Status codes for the MKCOL, DELETE, COPY, and MOVE methods

19.2.14 Enhanced HTTP/1.1 Methods

19.2.14.1 The PUT method

19.2.14.2 The OPTIONS method

19.2.15 Version Management in WebDAV

19.2.16 Future of WebDAV

19.3 For More Information

Chapter 20. Redirection and Load Balancing

20.1 Why Redirect?

20.2 Where to Redirect

20.3 Overview of Redirection Protocols

Table 20-1. General redirection methods

Table 20-2. Proxy and cache redirection techniques

20.4 General Redirection Methods

20.4.1 HTTP Redirection

Figure 20-1. HTTP redirection

20.4.2 DNS Redirection

Figure 20-2. DNS-based redirection

20.4.2.1 DNS round robin

Example 20-1. IP addresses for www.cnn.com

20.4.2.2 Multiple addresses and round-robin address rotation

Example 20-2. Rotating DNS address lists

20.4.2.3 DNS round robin for load balancing

Figure 20-3. DNS round robin load balances across servers in a server farm

20.4.2.4 The impact of DNS caching

20.4.2.5 Other DNS-based redirection algorithms

Figure 20-4. DNS request involving authoritative server

20.4.3 Anycast Addressing

Figure 20-5. Distributed anycast addressing

20.4.4 IP MAC Forwarding

Figure 20-6. Layer-2 switch sending client requests to a gateway

Figure 20-7. MAC forwarding using a layer-4 switch

20.4.5 IP Address Forwarding

Figure 20-8. A switch doing IP forwarding to a caching proxy or mirrored web server

Figure 20-9. Full NAT of a TCP/IP datagram

20.4.6 Network Element Control Protocol

20.4.6.1 Messages

Table 20-3. NECP messages

20.5 Proxy Redirection Methods

20.5.1 Explicit Browser Configuration

20.5.2 Proxy Auto-configuration

Figure 20-10. Proxy auto-configuration

20.5.3 Web Proxy Autodiscovery Protocol

20.5.3.1 PAC file autodiscovery

Figure 20-11. WPAD determines the PAC URL, which determines the proxy server

20.5.3.2 WPAD algorithm

20.5.3.3 CURL discovery using DHCP

20.5.3.4 DNS A record lookup

20.5.3.5 Retrieving the PAC file

20.5.3.6 When to execute WPAD

20.5.3.7 WPAD spoofing

20.5.3.8 Timeouts

20.5.3.9 Administrator considerations

20.6 Cache Redirection Methods

20.6.1 WCCP Redirection

20.6.1.1 How WCCP redirection works

20.6.1.2 WCCP2 messages

Table 20-4. WCCP2 messages

20.6.1.3 Message components

Table 20-5. WCCP2 message components

20.6.1.4 Service groups

20.6.1.5 GRE packet encapsulation

Figure 20-12. How a WCCP router changes an HTTP packet's destination IP address

20.6.1.6 WCCP load balancing

20.7 Internet Cache Protocol

20.8 Cache Array Routing Protocol

Figure 20-13. ICP queries

Figure 20-14. CARP redirection

20.9 Hyper Text Caching Protocol

Figure 20-15. HTCP message format

Table 20-6. HTCP data components

Table 20-7. HTCP opcodes

20.9.1 HTCP Authentication

Table 20-8. HTCP authentication components

20.9.2 Setting Caching Policies

Table 20-9. List of Cache headers for modifying caching policies

20.10 For More Information

Chapter 21. Logging and Usage Tracking

21.1 What to Log?

21.2 Log Formats

21.2.1 Common Log Format

Table 21-1. Common Log Format fields

Example 21-1. Common Log Format

21.2.2 Combined Log Format

Table 21-2. Additional Combined Log Format fields

Example 21-2. Combined Log Format

21.2.3 Netscape Extended Log Format

Table 21-3. Additional Netscape Extended Log Format fields

Example 21-3. Netscape Extended Log Format

21.2.4 Netscape Extended 2 Log Format

Table 21-4. Additional Netscape Extended 2 Log Format fields

Example 21-4. Netscape Extended 2 Log Format

Table 21-5. Netscape route codes

Table 21-6. Netscape finish status codes

Table 21-7. Netscape cache codes

21.2.5 Squid Proxy Log Format

Table 21-8. Squid Log Format fields

Example 21-5. Squid Log Format

Table 21-9. Squid result codes

21.3 Hit Metering

21.3.1 Overview

21.3.2 The Meter Header

Table 21-10. Hit Metering directives

Figure 21-1. Hit Metering example

21.4 A Word on Privacy

21.5 For More Information

Part VI: Appendixes

Appendix A. URI Schemes

Table A-1. URI schemes from the W3C registry

Appendix B. HTTP Status Codes

B.1 Status Code Classifications

Table B-1. Status code classifications

B.2 Status Codes

Table B-2. Status codes

Appendix C. HTTP Header Reference

Appendix D. MIME Types

D.1 Background

D.2 MIME Type Structure

D.2.1 Discrete Types

D.2.2 Composite Types

D.2.3 Multipart Types

D.2.4 Syntax

Table D-1. Common primary MIME types

D.3 MIME Type IANA Registration

D.3.1 Registration Trees

Table D-2. Four MIME media type registration trees

D.3.2 Registration Process

D.3.3 Registration Rules

D.3.4 Registration Template

Example D-1. IANA MIME registration email template

D.3.5 MIME Media Type Registry

D.4 MIME Type Tables

D.4.1 application/*

Table D-3. "Application" MIME types

D.4.2 audio/*

Table D-4. "Audio" MIME types

D.4.3 chemical/*

Table D-5. "Chemical" MIME types

D.4.4 image/*

Table D-6. "Image" MIME types

D.4.5 message/*

Table D-7. "Message" MIME types

D.4.6 model/*

Table D-8. "Model" MIME types

D.4.7 multipart/*

Table D-9. "Multipart" MIME types

D.4.8 text/*

Table D-10. "Text" MIME types

D.4.9 video/*

Table D-11. "Video" MIME types

D.4.10 Experimental Types

Table D-12. Extension MIME types

Appendix E. Base-64 Encoding

E.1 Base-64 Encoding Makes Binary Data Safe

E.2 Eight Bits to Six Bits

Table E-1. Base-64 alphabet

Figure E-1. Base-64 encoding example

E.3 Base-64 Padding

Table E-2. Base-64 padding examples

E.4 Perl Implementation

E.5 For More Information

Appendix F. Digest Authentication

F.1 Digest WWW-Authenticate Directives

Table F-1. Digest WWW-Authenticate header directives (from RFC 2617)

F.2 Digest Authorization Directives

Table F-2. Digest Authorization header directives (from RFC 2617)

F.3 Digest Authentication-Info Directives

Table F-3. Digest Authentication-Info header directives (from RFC 2617)

F.4 Reference Code

F.4.1 File "digcalc.h"

F.4.2 File "digcalc.c"

F.4.3 File "digtest.c"

Appendix G. Language Tags

G.1 First Subtag Rules

G.2 Second Subtag Rules

G.3 IANA-Registered Language Tags

Table G-1. Language tags

G.4 ISO 639 Language Codes

Table G-2. ISO 639 and 639-2 language codes

G.5 ISO 3166 Country Codes

Table G-3. ISO 3166 country codes

G.6 Language Administrative Organizations

Appendix H. MIME Charset Registry

H.1 MIME Charset Registry

H.2 Preferred MIME Names

H.3 Registered Charsets

Table H-1. IANA MIME charset tags

HTTP: The Definitive Guide

Copyright © 2002 O'Reilly & Associates, Inc. All rights reserved. Printed in the United States of America.

Published by O'Reilly & Associates, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O'Reilly & Associates books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http:// ). For more information contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Nutshell Handbook, the Nutshell Handbook logo, and the O'Reilly logo are registered trademarks of O'Reilly & Associates, Inc. Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and O'Reilly & Associates, Inc. was aware of a trademark claim, the designations have been printed in caps or initial caps. The association between the image of a thirteen-lined ground squirrel and the topic of HTTP is a trademark of O'Reilly & Associates, Inc.

While every precaution has been taken in the preparation of this book, the publisher and the authors assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.

Preface

The Hypertext Transfer Protocol (HTTP) is the protocol programs use to communicate over the World Wide Web. There are many applications of HTTP, but HTTP is most famous for two-way conversation between web browsers and web servers.

HTTP began as a simple protocol, so you might think there really isn't that much to say about it. And yet here you stand, with a two-pound book in your hands. If you're wondering how we could have written 650 pages on HTTP, take a look at the Table of Contents. This book isn't just an HTTP header reference manual; it's a veritable bible of web architecture.
In this book, we try to tease apart HTTP's interrelated and often misunderstood rules, and we offer you a series of topic-based chapters that explain all the aspects of HTTP. Throughout the book, we are careful to explain the "why" of HTTP, not just the "how." And to save you time chasing references, we explain many of the critical non-HTTP technologies that are required to make HTTP applications work. You can find the alphabetical header reference (which forms the basis of most conventional HTTP texts) in a conveniently organized appendix. We hope this conceptual design makes it easy for you to work with HTTP.

This book is written for anyone who wants to understand HTTP and the underlying architecture of the Web. Software and hardware engineers can use this book as a coherent reference for HTTP and related web technologies. Systems architects and network administrators can use this book to better understand how to design, deploy, and manage complicated web architectures. Performance engineers and analysts can benefit from the sections on caching and performance optimization. Marketing and consulting professionals will be able to use the conceptual orientation to better understand the landscape of web technologies.

This book illustrates common misconceptions, advises on "tricks of the trade," provides convenient reference material, and serves as a readable introduction to dry and confusing standards specifications. In a single book, we detail the essential and interrelated technologies that make the Web work.

This book is the result of a tremendous amount of work by many people who share an enthusiasm for Internet technologies. We hope you find it useful.

Running Example: Joe's Hardware Store

Many of our chapters include a running example of a hypothetical online hardware and home-improvement store called "Joe's Hardware" to demonstrate technology concepts. We have set up a real web site for the store (http://www.joes-hardware.com) for you to test some of the examples in the book. We will maintain this web site while this book remains in print.

Chapter-by-Chapter Guide

This book contains 21 chapters, divided into 5 logical parts (each with a technology theme), and 8 useful appendixes containing reference data and surveys of related technologies:

Part I, Part II, Part III, Part IV, Part V, Part VI

Part I describes the core technology of HTTP, the foundation of the Web, in four chapters:

• Chapter 1 is a rapid-paced overview of HTTP.
• Chapter 2 details the formats of uniform resource locators (URLs) and the various types of resources that URLs name across the Internet. It also outlines the evolution to uniform resource names (URNs).
• Chapter 3 details how HTTP messages transport web content.
• Chapter 4 explains the commonly misunderstood and poorly documented rules and behavior for managing HTTP connections.

Part II highlights the HTTP server, proxy, cache, gateway, and robot applications that are the architectural building blocks of web systems. (Web browsers are another building block, of course, but browsers already were covered thoroughly in Part I of the book.) Part II contains the following six chapters:

• Chapter 5 gives an overview of web server architectures.
• Chapter 6 explores HTTP proxy servers, which are intermediary servers that act as platforms for HTTP services and controls.
• Chapter 7 delves into the science of web caches: devices that improve performance and reduce traffic by making local copies of popular documents.
• Chapter 8 explains gateways and application servers that allow HTTP to work with software that speaks different protocols, including Secure Sockets Layer (SSL) encrypted protocols.
• Chapter 9 describes the various types of clients that pervade the Web, including the ubiquitous browsers, robots and spiders, and search engines.
• Chapter 10 talks about HTTP developments still in the works: the HTTP-NG protocol.

Part III presents a suite of techniques and technologies to track identity, enforce security, and control access to content. It contains the following four chapters:

• Chapter 11 talks about techniques to identify users so that content can be personalized to the user audience.
• Chapter 12 highlights the basic mechanisms to verify user identity. The chapter also examines how HTTP authentication interfaces with databases.
• Chapter 13 explains digest authentication, a complex proposed enhancement to HTTP that provides significantly enhanced security.
• Chapter 14 is a detailed overview of Internet cryptography, digital certificates, and SSL.

Part IV focuses on the bodies of HTTP messages (which contain the actual web content) and on the web standards that describe and manipulate content stored in the message bodies. Part IV contains three chapters:

• Chapter 15 describes the structure of HTTP content.
• Chapter 16 surveys the web standards that allow users around the globe to exchange content in different languages and character sets.
• Chapter 17 explains mechanisms for negotiating acceptable content.

Part V discusses the technology for publishing and disseminating web content. It contains four chapters:

• Chapter 18 discusses the ways people deploy servers in modern web hosting environments and HTTP support for virtual web hosting.
• Chapter 19 discusses the technologies for creating web content and installing it onto web servers.
• Chapter 20 surveys the tools and techniques for distributing incoming web traffic among a collection of servers.
• Chapter 21 covers log formats and common questions.

Part VI contains helpful reference appendixes and tutorials in related technologies:

• Appendix A summarizes the protocols supported through uniform resource identifier (URI) schemes.
• Appendix B conveniently lists the HTTP response codes.
• Appendix C provides a reference list of HTTP header fields.
• Appendix D provides an extensive list of MIME types and explains how MIME types are registered.
• Appendix E explains base-64 encoding, used by HTTP authentication.
• Appendix F gives details on how to implement various authentication schemes in HTTP.
• Appendix G defines language tag values for HTTP language headers.
• Appendix H provides a detailed list of character encodings, used for HTTP internationalization support.

Each chapter contains many examples and pointers to additional reference material.

Typographic Conventions

In this book, we use the following typographic conventions:

Italic: Used for URLs, C functions, command names, MIME types, new terms where they are defined, and emphasis
Constant width: Used for computer output, code, and any literal text
Constant width bold: Used for user input

Comments and Questions

Please address comments and questions concerning this book to the publisher:

O'Reilly & Associates, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
(800) 998-9938 (in the United States or Canada)
(707) 829-0515 (international/local)

(707) 829-0104 (fax)

There is a web page for this book, which lists errata, examples, or any additional information. You can access this page at:

http://www.oreilly.com/catalog/httptdg/

To comment or ask technical questions about this book, send email to:

bookquestions@oreilly.com

For more information about books, conferences, Resource Centers, and the O'Reilly Network, see the O'Reilly web site at:

http://www.oreilly.com

Acknowledgments

This book is the labor of many. The five authors would like to hold up a few people in thanks for their significant contributions to this project.

To start, we'd like to thank Linda Mui, our editor at O'Reilly. Linda first met with David and Brian way back in 1996, and she refined and steered several concepts into the book you hold today. Linda also helped keep our wandering gang of first-time book authors moving in a coherent direction and on a progressing (if not rapid) timeline. Most of all, Linda gave us the chance to create this book. We're very grateful.

We'd also like to thank several tremendously bright, knowledgeable, and kind souls who devoted noteworthy energy to reviewing, commenting on, and correcting drafts of this book. These include Tony Bourke, Sean Burke, Mike Chowla, Shernaz Daver, Fred Douglis, Paula Ferguson, Vikas Jha, Yves Lafon, Peter Mattis, Chuck Neerdaels, Luis Tavera, Duane Wessels, Dave Wu, and Marco Zagha. Their viewpoints and suggestions have improved the book tremendously.

Rob Romano from O'Reilly created most of the amazing artwork you'll find in this book. The book contains an unusually large number of detailed illustrations that make subtle concepts very clear. Many of these illustrations were painstakingly created and revised numerous times. If a picture is worth a thousand words, Rob added hundreds of pages of value to this book.

Brian would like to personally thank all of the authors for their dedication to this project. A tremendous amount of time was invested by the authors in a challenge to make the first detailed but accessible treatment of HTTP. Weddings, childbirths, killer work projects, startup companies, and graduate schools intervened, but the authors held together to bring this project to a successful completion. We believe the result is worthy of everyone's hard work and, most importantly, that it provides a valuable service. Brian also would like to thank the employees of Inktomi for their enthusiasm and support and for their deep insights about the use of HTTP in real-world applications. Also, thanks to the fine folks at Cajun-shop.com for allowing us to use their site for some of the examples in this book.

David would like to thank his family, particularly his mother and grandfather for their ongoing support. He'd like to thank those that have put up with his erratic schedule over the years writing the book. He'd also like to thank Slurp, Orctomi, and Norma for everything they've done, and his fellow authors for all their hard work. Finally, he would like to thank Brian for roping him into yet another adventure.

Marjorie would like to thank her husband, Alan Liu, for technical insight, familial support and understanding. Marjorie thanks her fellow authors for many insights and inspirations. She is grateful for the experience of working together on this book.

Sailu would like to thank David and Brian for the opportunity to work on this book, and Chuck Neerdaels for introducing him to HTTP.

Anshu would like to thank his wife, Rashi, and his parents for their patience, support, and encouragement during the long years spent writing this book.

Finally, the authors collectively thank the famous and nameless Internet pioneers, whose research, development, and evangelism over the past four decades contributed so much to our scientific, social, and economic community. Without these labors, there would be no subject for this book.

Part I: HTTP: The Web's Foundation

This section is an introduction to the HTTP protocol. The next four chapters describe the core technology of HTTP, the foundation of the Web:

• Chapter 1 is a rapid-paced overview of HTTP.
• Chapter 2 details the formats of URLs and the various types of resources that URLs name across the Internet. We also outline the evolution to URNs.
• Chapter 3 details the HTTP messages that transport web content.
• Chapter 4 discusses the commonly misunderstood and poorly documented rules and behavior for managing TCP connections by HTTP.

Chapter 1. Overview of HTTP

The world's web browsers, servers, and related web applications all talk to each other through HTTP, the Hypertext Transfer Protocol. HTTP is the common language of the modern global Internet.

This chapter is a concise overview of HTTP. You'll see how web applications use HTTP to communicate, and you'll get a rough idea of how HTTP does its job.
In particular, we talk about:

• How web clients and servers communicate
• Where resources (web content) come from
• How web transactions work
• The format of the messages used for HTTP communication
• The underlying TCP network transport
• The different variations of the HTTP protocol
• Some of the many HTTP architectural components installed around the Internet

We've got a lot of ground to cover, so let's get started on our tour of HTTP.

1.1 HTTP: The Internet's Multimedia Courier

Billions of JPEG images, HTML pages, text files, MPEG movies, WAV audio files, Java applets, and more cruise through the Internet each and every day. HTTP moves the bulk of this information quickly, conveniently, and reliably from web servers all around the world to web browsers on people's desktops.

Because HTTP uses reliable data-transmission protocols, it guarantees that your data will not be damaged or scrambled in transit, even when it comes from the other side of the globe. This is good for you as a user, because you can access information without worrying about its integrity. Reliable transmission is also good for you as an Internet application developer, because you don't have to worry about HTTP communications being destroyed, duplicated, or distorted in transit. You can focus on programming the distinguishing details of your application, without worrying about the flaws and foibles of the Internet.

Let's look more closely at how HTTP transports the Web's traffic.

1.2 Web Clients and Servers

Web content lives on web servers. Web servers speak the HTTP protocol, so they are often called HTTP servers. These HTTP servers store the Internet's data and provide the data when it is requested by HTTP clients. The clients send HTTP requests to servers, and servers return the requested data in HTTP responses, as sketched in Figure 1-1. Together, HTTP clients and HTTP servers make up the basic components of the World Wide Web.

Figure 1-1. Web clients and servers

You probably use HTTP clients every day. The most common client is a web browser, such as Microsoft Internet Explorer or Netscape Navigator. Web browsers request HTTP objects from servers and display the objects on your screen.
When you browse to a page, such as "http://www.oreilly.com/index.html," your browser sends an HTTP request to the server www.oreilly.com (see Figure 1-1). The server tries to find the desired object (in this case, "/index.html") and, if successful, sends the object to the client in an HTTP response, along with the type of the object, the length of the object, and other information.
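The exchange described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the book's own code: the helper names build_get_request and parse_response are hypothetical, the request uses HTTP/1.1 with a Host header, and SAMPLE_RESPONSE is a made-up reply standing in for what a server might send back. No network connection is opened.

```python
from urllib.parse import urlsplit


def build_get_request(url: str) -> bytes:
    """Build a minimal GET request for the given URL (hypothetical helper)."""
    parts = urlsplit(url)
    path = parts.path or "/"
    lines = [
        f"GET {path} HTTP/1.1",    # request line: method, local resource, version
        f"Host: {parts.hostname}",  # names the server the client wants
        "Connection: close",
        "",                         # blank line ends the header block
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def parse_response(raw: bytes):
    """Split a raw response into (status code, headers dict, body bytes)."""
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    _version, code, _reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return int(code), headers, body


# Made-up response for illustration only; a real server's headers will differ.
SAMPLE_RESPONSE = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html\r\n"
    b"Content-Length: 15\r\n"
    b"\r\n"
    b"<html>Hi</html>"
)

if __name__ == "__main__":
    print(build_get_request("http://www.oreilly.com/index.html").decode("ascii"))
    code, headers, body = parse_response(SAMPLE_RESPONSE)
    print(code, headers["content-type"], headers["content-length"], len(body))
```

The Content-Type and Content-Length headers in the sample correspond to "the type of the object" and "the length of the object" that the server sends along with the object itself.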

1.3 Resources

Web servers host web resources. A web resource is the source of web content. The simplest kind of web resource is a static file on the web server's filesystem. These files can contain anything: they might be text files, HTML files, Microsoft Word files, Adobe Acrobat files, JPEG image files, AVI movie files, or any other format you can think of.

However, resources don't have to be static files. Resources can also be software programs that generate content on demand. These dynamic content resources can generate content based on your identity, on what information you've requested, or on the time of day. They can show you a live image from a camera, or let you trade stocks, search real estate databases, or buy gifts from online stores (see Figure 1-2).

Figure 1-2. A web resource is anything that provides web content

In summary, a resource is any kind of content source. A file containing your company's sales forecast spreadsheet is a resource. A web gateway to scan your local public library's shelves is a resource. An Internet search engine is a resource.

1.3.1 Media Types

Because the Internet hosts many thousands of different data types, HTTP carefully tags each object being transported through the Web with a data format label called a MIME type. MIME (Multipurpose Internet Mail Extensions) was originally designed to solve problems encountered in moving messages between different electronic mail systems. MIME worked so well for email that HTTP adopted it to describe and label its own multimedia content.

Web servers attach a MIME type to all HTTP object data (see Figure 1-3). When a web browser gets an object back from a server, it looks at the associated MIME type to see if it knows how to handle the object. Most browsers can handle hundreds of popular object types: displaying image files, parsing
