Learning Scrapy (2016)

Cover
Table of Contents
About the Author
Preface
1. Introducing Scrapy
Hello Scrapy
More reasons to love Scrapy
About this book: aim and usage
The importance of mastering automated data scraping
Developing robust, quality applications, and providing realistic schedules
Developing quality minimum viable products quickly
Scraping gives you scale; Google couldn't use forms
Discovering and integrating into your ecosystem
Being a good citizen in a world full of spiders
What Scrapy is not
Summary
2. Understanding HTML and XPath
HTML, the DOM tree representation, and the XPath
The URL
The HTML document
The tree representation
What you see on the screen
Selecting HTML elements with XPath
Useful XPath expressions
Using Chrome to get XPath expressions
Examples of common tasks
Anticipating changes
Summary
3. Basic Crawling
Installing Scrapy
MacOS
Windows
Linux
Ubuntu or Debian Linux
Red Hat or CentOS Linux
From the latest source
Upgrading Scrapy
Vagrant: this book's official way to run examples
UR2IM – the fundamental scraping process
The URL
The request and the response
The Items
A Scrapy project
Defining items
Writing spiders
Populating an item
Saving to files
Cleaning up – item loaders and housekeeping fields
Creating contracts
Extracting more URLs
Two-direction crawling with a spider
Two-direction crawling with a CrawlSpider
Summary
4. From Scrapy to a Mobile App
Choosing a mobile application framework
Creating a database and a collection
Populating the database with Scrapy
Creating a mobile application
Creating a database access service
Setting up the user interface
Mapping data to the User Interface
Mappings between database fields and User Interface controls
Testing, sharing, and exporting your mobile app
Summary
5. Quick Spider Recipes
A spider that logs in
A spider that uses JSON APIs and AJAX pages
Passing arguments between responses
A 30-times faster property spider
A spider that crawls based on an Excel file
Summary
6. Deploying to Scrapinghub
Signing up, signing in, and starting a project
Deploying our spiders and scheduling runs
Accessing our items
Scheduling recurring crawls
Summary
7. Configuration and Management
Using Scrapy settings
Essential settings
Analysis
Logging
Stats
Telnet
Example 1 – using telnet
Performance
Stopping crawls early
HTTP caching and working offline
Example 2 – working offline by using the cache
Crawling style
Feeds
Downloading media
Other media
Example 3 – downloading images
Amazon Web Services
Using proxies and crawlers
Example 4 – using proxies and Crawlera's clever proxy
Further settings
Project-related settings
Extending Scrapy settings
Fine-tuning downloading
Autothrottle extension settings
MemoryUsage extension settings
Logging and debugging
Summary
8. Programming Scrapy
Scrapy is a Twisted application
Deferreds and deferred chains
Understanding Twisted and nonblocking I/O – a Python tale
Overview of Scrapy architecture
Example 1 – a very simple pipeline
Signals
Example 2 – an extension that measures throughput and latencies
Extending beyond middlewares
Summary
9. Pipeline Recipes
Using REST APIs
Using treq
A pipeline that writes to Elasticsearch
A pipeline that geocodes using the Google Geocoding API
Enabling geoindexing on Elasticsearch
Interfacing databases with standard Python clients
A pipeline that writes to MySQL
Interfacing services using Twisted-specific clients
A pipeline that reads/writes to Redis
Interfacing CPU-intensive, blocking, or legacy functionality
A pipeline that performs CPU-intensive or blocking operations
A pipeline that uses binaries or scripts
Summary
10. Understanding Scrapy's Performance
Scrapy's engine – an intuitive approach
Cascading queuing systems
Identifying the bottleneck
Scrapy's performance model
Getting component utilization using telnet
Our benchmark system
The standard performance model
Solving performance problems
Case #1 – saturated CPU
Case #2 – blocking code
Case #3 – "garbage" on the downloader
Case #4 – overflow due to many or large responses
Case #5 – overflow due to limited/excessive item concurrency
Case #6 – the downloader doesn't have enough to do
Troubleshooting flow
Summary
11. Distributed Crawling with Scrapyd and Real-Time Analytics
How does the title of a property affect the price?
Scrapyd
Overview of our distributed system
Changes to our spider and middleware
Sharded-index crawling
Batching crawl URLs
Getting start URLs from settings
Deploy your project to scrapyd servers
Creating our custom monitoring command
Calculating the shift with Apache Spark streaming
Running a distributed crawl
System performance
The key take-away
Summary
Appendix A. Installing and troubleshooting prerequisite software
Installing prerequisites
The system
Installation in a nutshell
Installing on Linux
Installing on Windows or Mac
Install Vagrant
How to access the terminal
Install VirtualBox and Git
Ensure that VirtualBox supports 64-bit images
Enable ssh client for Windows
Download this book's code and set up the system
System setup and operations FAQ
What do I download and how much time does it take?
What should I do if Vagrant freezes?
How do I shut down/resume the VM quickly?
How do I fully reset the VM?
How do I resize the virtual machine?
How do I resolve any port conflicts?
On Linux using Docker natively
On Windows or Mac using a VM
How do I make it work behind a corporate proxy?
How do I connect with the Docker provider VM?
How much CPU/memory does each server use?
How can I see the size of Docker container images?
How can I reset the system if Vagrant doesn't respond?
There's a problem I can't work around, what can I do?
Index