Learning Scrapy (2016)

Cover
Table of Contents
About the Author
Preface
1. Introducing Scrapy
Hello Scrapy
More reasons to love Scrapy
About this book: aim and usage
The importance of mastering automated data scraping
Developing robust, quality applications, and providing realistic schedules
Developing quality minimum viable products quickly
Scraping gives you scale; Google couldn't use forms
Discovering and integrating into your ecosystem
Being a good citizen in a world full of spiders
What Scrapy is not
Summary
2. Understanding HTML and XPath
HTML, the DOM tree representation, and the XPath
The URL
The HTML document
The tree representation
What you see on the screen
Selecting HTML elements with XPath
Useful XPath expressions
Using Chrome to get XPath expressions
Examples of common tasks
Anticipating changes
Summary
3. Basic Crawling
Installing Scrapy
MacOS
Windows
Linux
Ubuntu or Debian Linux
Red Hat or CentOS Linux
From the latest source
Upgrading Scrapy
Vagrant: this book's official way to run examples
UR2IM – the fundamental scraping process
The URL
The request and the response
The Items
A Scrapy project
Defining items
Writing spiders
Populating an item
Saving to files
Cleaning up – item loaders and housekeeping fields
Creating contracts
Extracting more URLs
Two-direction crawling with a spider
Two-direction crawling with a CrawlSpider
Summary
4. From Scrapy to a Mobile App
Choosing a mobile application framework
Creating a database and a collection
Populating the database with Scrapy
Creating a mobile application
Creating a database access service
Setting up the user interface
Mapping data to the User Interface
Mappings between database fields and User Interface controls
Testing, sharing, and exporting your mobile app
Summary
5. Quick Spider Recipes
A spider that logs in
A spider that uses JSON APIs and AJAX pages
Passing arguments between responses
A 30-times faster property spider
A spider that crawls based on an Excel file
Summary
6. Deploying to Scrapinghub
Signing up, signing in, and starting a project
Deploying our spiders and scheduling runs
Accessing our items
Scheduling recurring crawls
Summary
7. Configuration and Management
Using Scrapy settings
Essential settings
Analysis
Logging
Stats
Telnet
Example 1 – using telnet
Performance
Stopping crawls early
HTTP caching and working offline
Example 2 – working offline by using the cache
Crawling style
Feeds
Downloading media
Other media
Example 3 – downloading images
Amazon Web Services
Using proxies and crawlers
Example 4 – using proxies and Crawlera's clever proxy
Further settings
Project-related settings
Extending Scrapy settings
Fine-tuning downloading
Autothrottle extension settings
MemoryUsage extension settings
Logging and debugging
Summary
8. Programming Scrapy
Scrapy is a Twisted application
Deferreds and deferred chains
Understanding Twisted and nonblocking I/O – a Python tale
Overview of Scrapy architecture
Example 1 – a very simple pipeline
Signals
Example 2 – an extension that measures throughput and latencies
Extending beyond middlewares
Summary
9. Pipeline Recipes
Using REST APIs
Using treq
A pipeline that writes to Elasticsearch
A pipeline that geocodes using the Google Geocoding API
Enabling geoindexing on Elasticsearch
Interfacing databases with standard Python clients
A pipeline that writes to MySQL
Interfacing services using Twisted-specific clients
A pipeline that reads/writes to Redis
Interfacing CPU-intensive, blocking, or legacy functionality
A pipeline that performs CPU-intensive or blocking operations
A pipeline that uses binaries or scripts
Summary
10. Understanding Scrapy's Performance
Scrapy's engine – an intuitive approach
Cascading queuing systems
Identifying the bottleneck
Scrapy's performance model
Getting component utilization using telnet
Our benchmark system
The standard performance model
Solving performance problems
Case #1 – saturated CPU
Case #2 – blocking code
Case #3 – "garbage" on the downloader
Case #4 – overflow due to many or large responses
Case #5 – overflow due to limited/excessive item concurrency
Case #6 – the downloader doesn't have enough to do
Troubleshooting flow
Summary
11. Distributed Crawling with Scrapyd and Real-Time Analytics
How does the title of a property affect the price?
Scrapyd
Overview of our distributed system
Changes to our spider and middleware
Sharded-index crawling
Batching crawl URLs
Getting start URLs from settings
Deploy your project to scrapyd servers
Creating our custom monitoring command
Calculating the shift with Apache Spark streaming
Running a distributed crawl
System performance
The key take-away
Summary
Appendix A. Installing and troubleshooting prerequisite software
Installing prerequisites
The system
Installation in a nutshell
Installing on Linux
Installing on Windows or Mac
Install Vagrant
How to access the terminal
Install VirtualBox and Git
Ensure that VirtualBox supports 64-bit images
Enable ssh client for Windows
Download this book's code and set up the system
System setup and operations FAQ
What do I download and how much time does it take?
What should I do if Vagrant freezes?
How do I shut down/resume the VM quickly?
How do I fully reset the VM?
How do I resize the virtual machine?
How do I resolve any port conflicts?
On Linux using Docker natively
On Windows or Mac using a VM
How do I make it work behind a corporate proxy?
How do I connect with the Docker provider VM?
How much CPU/memory does each server use?
How can I see the size of Docker container images?
How can I reset the system if Vagrant doesn't respond?
There's a problem I can't work around, what can I do?
Index