Packt-Web.Scraping.with.Python.Richard Lawson.pdf

Cover
Copyright
Credits
About the Author
About the Reviewers
www.PacktPub.com
Table of Contents
Preface
Chapter 1: Introduction to Web Scraping
When is web scraping useful?
Is web scraping legal?
Background research
Checking robots.txt
Examining the Sitemap
Estimating the size of a website
Identifying the technology used by a website
Finding the owner of a website
Crawling your first website
Downloading a web page
Retrying downloads
Setting a user agent
Sitemap crawler
ID iteration crawler
Link crawler
Advanced features
Summary
Chapter 2: Scraping the Data
Analyzing a web page
Three approaches to scrape a web page
Regular expressions
Beautiful Soup
Lxml
CSS selectors
Comparing performance
Scraping results
Overview
Adding a scrape callback to the link crawler
Summary
Chapter 3: Caching Downloads
Adding cache support to the link crawler
Disk cache
Implementation
Testing the cache
Saving disk space
Expiring stale data
Drawbacks
Database cache
What is NoSQL?
Installing MongoDB
Overview of MongoDB
MongoDB cache implementation
Compression
Testing the cache
Summary
Chapter 4: Concurrent Downloading
One million web pages
Parsing the Alexa list
Sequential crawler
Threaded crawler
How threads and processes work
Implementation
Cross-process crawler
Performance
Summary
Chapter 5: Dynamic Content
An example dynamic web page
Reverse engineering a dynamic web page
Edge cases
Rendering a dynamic web page
PyQt or PySide
Executing JavaScript
Website interaction with WebKit
Waiting for results
The Render class
Selenium
Summary
Chapter 6: Interacting with Forms
The Login form
Loading cookies from the web browser
Extending the login script to update content
Automating forms with the Mechanize module
Summary
Chapter 7: Solving CAPTCHA
Registering an account
Loading the CAPTCHA image
Optical Character Recognition
Further improvements
Solving complex CAPTCHAs
Using a CAPTCHA solving service
Getting started with 9kw
9kw CAPTCHA API
Integrating with registration
Summary
Chapter 8: Scrapy
Installation
Starting a project
Defining a model
Creating a spider
Tuning settings
Testing the spider
Scraping with the shell command
Checking results
Interrupting and resuming a crawl
Visual scraping with Portia
Installation
Annotation
Tuning a spider
Checking results
Automated scraping with Scrapely
Summary
Chapter 9: Overview
Google search engine
Facebook
The website
The API
Gap
BMW
Summary
Index
Web Scraping with Python
Scrape data from any website with the power of Python
Richard Lawson
BIRMINGHAM - MUMBAI
Web Scraping with Python
Copyright © 2015 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: October 2015
Production reference: 1231015
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78216-436-4
www.packtpub.com
Credits
Author: Richard Lawson
Reviewers: Martin Burch, Christopher Davis, William Sankey, Ayush Tiwari
Acquisition Editor: Rebecca Youé
Content Development Editor: Akashdeep Kundu
Technical Editors: Novina Kewalramani, Shruti Rawool
Copy Editor: Sonia Cheema
Project Coordinator: Milton Dsouza
Proofreader: Safis Editing
Indexer: Mariammal Chettiar
Production Coordinator: Nilesh R. Mohite
Cover Work: Nilesh R. Mohite
About the Author
Richard Lawson is from Australia and studied Computer Science at the University of Melbourne. Since graduating, he has built a business specializing in web scraping while traveling the world, working remotely from over 50 countries. He is a fluent Esperanto speaker, conversational in Mandarin and Korean, and active in contributing to and translating open source software. He is currently undertaking postgraduate studies at Oxford University and in his spare time enjoys developing autonomous drones.
I would like to thank Professor Timothy Baldwin for introducing me to this exciting field and Tharavy Douc for hosting me in Paris while I wrote this book.
About the Reviewers
Martin Burch is a data journalist based in New York City, where he makes interactive graphics for The Wall Street Journal. He holds a master of arts in journalism from the City University of New York's Graduate School of Journalism, and has a baccalaureate from New Mexico State University, where he studied journalism and information systems.
I would like to thank my wife, Lisa, who encouraged me to assist with this book; my uncle, Michael, who has always patiently answered my programming questions; and my father, Richard, who inspired my love of journalism and writing.
William Sankey is a data professional and hobbyist developer who lives in College Park, Maryland. He graduated in 2012 from Johns Hopkins University with a master's degree in public policy and specializes in quantitative analysis. He is currently a health services researcher at L&M Policy Research, LLC, working on projects for the Centers for Medicare and Medicaid Services (CMS). The scope of these projects ranges from evaluating Accountable Care Organizations to monitoring the Inpatient Psychiatric Facility Prospective Payment System.
I would like to thank my devoted wife, Julia, and rambunctious puppy, Ruby, for all their love and support.
Ayush Tiwari is a Python developer and undergraduate at IIT Roorkee. He has been working at Information Management Group, IIT Roorkee, since 2013, and has been actively working in the web development field. Reviewing this book has been a great experience for him. He did his part not only as a reviewer, but also as an avid learner of web scraping. He recommends this book to all Python enthusiasts so that they can enjoy the benefits of scraping.
He is enthusiastic about Python web scraping and has worked on projects such as live sports feeds, as well as a generalized Python e-commerce web scraper (at Miranj). He has also been handling a placement portal with the help of a Django app to assist the placement process at IIT Roorkee. Besides backend development, he loves to work on computational Python and data analysis using Python libraries such as NumPy and SciPy, and is currently working in the CFD research field. You can visit his projects on GitHub. His username is tiwariayush.
He loves trekking through Himalayan valleys and participates in several treks every year, adding this to his list of interests, besides playing the guitar. Among his accomplishments, he is a part of the internationally acclaimed Super 30 group and has also been a rank holder in it. When he was in high school, he also qualified for the International Mathematical Olympiad.
I have been provided a lot of help by my family members (my sister, Aditi, my parents, and Anand sir), my friends at VI and IMG, and my professors. I would like to thank all of them for the support they have given me. Last but not least, kudos to the respected author and the Packt Publishing team for publishing these fantastic tech books. I commend all the hard work involved in producing their books.
www.PacktPub.com
Support files, eBooks, discount offers, and more
For support files and downloads related to your book, please visit www.PacktPub.com.
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and, as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at service@packtpub.com for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.
https://www2.packtpub.com/books/subscription/packtlib
Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.
Why subscribe?
• Fully searchable across every book published by Packt
• Copy and paste, print, and bookmark content
• On demand and accessible via a web browser
Free access for Packt account holders
If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view 9 entirely free books. Simply use your login credentials for immediate access.