Packt-Web.Scraping.with.Python.Richard Lawson.pdf

Published: 2022-06-15 Posted by: admin Category: Manuals File size: 5.24M File format: pdf

Preview: pages 1 to 8 of 174

Cover

Credits

About the Author

About the Reviewers

www.PacktPub.com

Table of Contents

Preface

Chapter 1: Introduction to Web Scraping

When is web scraping useful?

Is web scraping legal?

Background research

Checking robots.txt

Examining the Sitemap

Estimating the size of a website

Identifying the technology used by a website

Finding the owner of a website

Crawling your first website

Downloading a web page

Retrying downloads

Setting a user agent

Sitemap crawler

ID iteration crawler

Link crawler

Advanced features

Summary

Chapter 2: Scraping the Data

Analyzing a web page

Three approaches to scrape a web page

Regular expressions

Beautiful Soup

Lxml

CSS selectors

Comparing performance

Scraping results

Overview

Adding a scrape callback to the link crawler

Summary

Chapter 3: Caching Downloads

Adding cache support to the link crawler

Disk cache

Implementation

Testing the cache

Saving disk space

Expiring stale data

Drawbacks

Database cache

What is NoSQL?

Installing MongoDB

Overview of MongoDB

MongoDB cache implementation

Compression

Testing the cache

Summary

Chapter 4: Concurrent Downloading

One million web pages

Parsing the Alexa list

Sequential crawler

Threaded crawler

How threads and processes work

Implementation

Cross-process crawler

Performance

Summary

Chapter 5: Dynamic Content

An example dynamic web page

Reverse engineering a dynamic web page

Edge cases

Rendering a dynamic web page

PyQt or PySide

Executing JavaScript

Website interaction with WebKit

Waiting for results

The Render class

Selenium

Summary

Chapter 6: Interacting with Forms

The Login form

Loading cookies from the web browser

Extending the login script to update content

Automating forms with the Mechanize module

Summary

Chapter 7: Solving CAPTCHA

Registering an account

Loading the CAPTCHA image

Optical Character Recognition

Further improvements

Solving complex CAPTCHAs

Using a CAPTCHA solving service

Getting started with 9kw

9kw CAPTCHA API

Integrating with registration

Summary

Chapter 8: Scrapy

Installation

Starting a project

Defining a model

Creating a spider

Tuning settings

Testing the spider

Scraping with the shell command

Checking results

Interrupting and resuming a crawl

Visual scraping with Portia

Installation

Annotation

Tuning a spider

Checking results

Automated scraping with Scrapely

Summary

Chapter 9: Overview

Google search engine

Facebook

The website

The API

Gap

BMW

Summary

Index

Web Scraping with Python

Scrape data from any website with the power of Python

Richard Lawson

BIRMINGHAM - MUMBAI

Web Scraping with Python

Copyright © 2015 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: October 2015

Production reference: 1231015

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.

ISBN 978-1-78216-436-4

www.packtpub.com

Credits

Author: Richard Lawson

Reviewers: Martin Burch, Christopher Davis, William Sankey, Ayush Tiwari

Acquisition Editor: Rebecca Youé

Content Development Editor: Akashdeep Kundu

Technical Editors: Novina Kewalramani, Shruti Rawool

Copy Editor: Sonia Cheema

Project Coordinator: Milton Dsouza

Proofreader: Safis Editing

Indexer: Mariammal Chettiar

Production Coordinator: Nilesh R. Mohite

Cover Work: Nilesh R. Mohite

About the Author

Richard Lawson is from Australia and studied Computer Science at the University of Melbourne. Since graduating, he has built a business specializing in web scraping while traveling the world, working remotely from over 50 countries. He is a fluent Esperanto speaker, conversational in Mandarin and Korean, and active in contributing to and translating open source software. He is currently undertaking postgraduate studies at Oxford University and in his spare time enjoys developing autonomous drones.

I would like to thank Professor Timothy Baldwin for introducing me to this exciting field and Tharavy Douc for hosting me in Paris while I wrote this book.

About the Reviewers

Martin Burch is a data journalist based in New York City, where he makes interactive graphics for The Wall Street Journal. He holds a master of arts in journalism from the City University of New York's Graduate School of Journalism, and has a baccalaureate from New Mexico State University, where he studied journalism and information systems.

I would like to thank my wife, Lisa, who encouraged me to assist with this book; my uncle, Michael, who has always patiently answered my programming questions; and my father, Richard, who inspired my love of journalism and writing.

William Sankey is a data professional and hobbyist developer who lives in College Park, Maryland. He graduated in 2012 from Johns Hopkins University with a master's degree in public policy and specializes in quantitative analysis. He is currently a health services researcher at L&M Policy Research, LLC, working on projects for the Centers for Medicare and Medicaid Services (CMS). The scope of these projects ranges from evaluating Accountable Care Organizations to monitoring the Inpatient Psychiatric Facility Prospective Payment System.

I would like to thank my devoted wife, Julia, and rambunctious puppy, Ruby, for all their love and support.

Ayush Tiwari is a Python developer and undergraduate at IIT Roorkee. He has been working at Information Management Group, IIT Roorkee, since 2013, and has been actively working in the web development field. Reviewing this book has been a great experience for him. He did his part not only as a reviewer, but also as an avid learner of web scraping. He recommends this book to all Python enthusiasts so that they can enjoy the benefits of scraping.

He is enthusiastic about Python web scraping and has worked on projects such as live sports feeds, as well as a generalized Python e-commerce web scraper (at Miranj). He has also been handling a placement portal with the help of a Django app to assist the placement process at IIT Roorkee.

Besides backend development, he loves to work on computational Python and data analysis using Python libraries such as NumPy and SciPy, and is currently working in the CFD research field. You can visit his projects on GitHub. His username is tiwariayush.

He loves trekking through Himalayan valleys and participates in several treks every year, adding this to his list of interests, besides playing the guitar. Among his accomplishments, he is a part of the internationally acclaimed Super 30 group and has also been a rank holder in it. When he was in high school, he also qualified for the International Mathematical Olympiad.

I have been provided a lot of help by my family members (my sister, Aditi, my parents, and Anand sir), my friends at VI and IMG, and my professors. I would like to thank all of them for the support they have given me.

Last but not least, kudos to the respected author and the Packt Publishing team for publishing these fantastic tech books. I commend all the hard work involved in producing their books.

www.PacktPub.com

Support files, eBooks, discount offers, and more

For support files and downloads related to your book, please visit www.PacktPub.com.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at service@packtpub.com for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

https://www2.packtpub.com/books/subscription/packtlib

Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.

Why subscribe?

• Fully searchable across every book published by Packt
• Copy and paste, print, and bookmark content
• On demand and accessible via a web browser

Free access for Packt account holders

If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view 9 entirely free books. Simply use your login credentials for immediate access.
