
Big Data with Apache Spark and Python

Cover
Contents
1. Getting Started with Spark
2. Spark Basics and Spark Examples
3. Advanced Examples of Spark Programs
4. Running Spark on a Cluster
5. SparkSQL, DataFrames, and DataSets
6. Other Spark Technologies and Libraries
7. Where to Go From Here? – Learning More About Spark and Data Science
Contents
Chapter 1: Getting Started with Spark
    Getting set up - installing Python, a JDK, and Spark and its dependencies
    Installing the MovieLens movie rating dataset
    Run your first Spark program - the ratings histogram example
    Summary
Chapter 2: Spark Basics and Spark Examples
    What is Spark?
    The Resilient Distributed Dataset (RDD)
    Ratings histogram walk-through
    Key/value RDDs and the average friends by age example
    Running the average friends by age example
    Filtering RDDs and the minimum temperature by location example
    Running the minimum temperature example and modifying it for maximums
    Running the maximum temperature by location example
    Counting word occurrences using flatmap()
    Improving the word-count script with regular expressions
    Sorting the word count results
    Find the total amount spent by customer
    Check your results and sort them by the total amount spent
    Check your sorted implementation and results against mine
    Summary
Chapter 3: Advanced Examples of Spark Programs
    Finding the most popular movie
    Using broadcast variables to display movie names instead of ID numbers
    Finding the most popular superhero in a social graph
    Running the script - discover who the most popular superhero is
    Superhero degrees of separation - introducing the breadth-first search algorithm
    Accumulators and implementing BFS in Spark
    Superhero degrees of separation - review the code and run it
    Item-based collaborative filtering in Spark, cache(), and persist()
    Running the similar-movies script using Spark's cluster manager
    Improving the quality of the similar movies example
    Summary
Chapter 4: Running Spark on a Cluster
    Introducing Elastic MapReduce
    Setting up our Amazon Web Services / Elastic MapReduce account and PuTTY
    Partitioning
    Creating similar movies from one million ratings - part 1
    Creating similar movies from one million ratings - part 2
    Creating similar movies from one million ratings - part 3
    Troubleshooting Spark on a cluster
    More troubleshooting and managing dependencies
    Summary
Chapter 5: SparkSQL, DataFrames, and DataSets
    Introducing SparkSQL
    Executing SQL commands and SQL-style functions on a DataFrame
    Using DataFrames instead of RDDs
    Summary
Chapter 6: Other Spark Technologies and Libraries
    Introducing MLlib
    Using MLlib to produce movie recommendations
    Analyzing the ALS recommendations results
    Using DataFrames with MLlib
    Spark Streaming and GraphX
    Summary
Chapter 7: Where to Go From Here? - Learning More About Spark and Data Science
Chapter 1. Getting Started with Spark

Spark is one of the hottest technologies in big data analysis right now, and with good reason. If you work for, or hope to work for, a company that has massive amounts of data to analyze, Spark offers a very fast and very easy way to spread that analysis out across an entire cluster of computers. This is a very valuable skill to have right now.

My approach in this book is to start with some simple examples and work our way up to more complex ones. We'll have some fun along the way too. We will use movie ratings data and play around with similar movies and movie recommendations. I also found a social network of superheroes, if you can believe it; we can use this data to do things such as figure out who's the most popular superhero in the fictional superhero universe. Have you heard of the Kevin Bacon number, where everyone in Hollywood is supposedly connected to Kevin Bacon to some degree? We can do the same thing with our superhero data and figure out the degrees of separation between any two superheroes in their fictional universe too. So, we'll have some fun along the way, use some real examples, and turn them into Spark problems. Using Apache Spark is easier than you might think and, with all the exercises and activities in this book, you'll get plenty of practice as we go along. I'll guide you through every line of code and every concept you need along the way. So let's get started and learn Apache Spark.
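Before we set anything up, here is the core idea behind that first ratings histogram program in miniature. This is a plain-Python sketch of my own, not the Spark version we'll build later: it counts how many times each star rating appears in a few made-up lines laid out like the MovieLens u.data file (tab-separated userID, movieID, rating, timestamp). The sample rows here are invented for illustration; the real dataset is the one we install in this chapter.

```python
from collections import Counter

# Tab-separated sample lines in the MovieLens u.data layout:
# userID \t movieID \t rating \t timestamp (these rows are made up).
sample_lines = [
    "196\t242\t3\t881250949",
    "186\t302\t3\t891717742",
    "22\t377\t1\t878887116",
    "244\t51\t2\t880606923",
]

# Pull out the rating field (index 2) from each line and count how many
# times each rating value occurs.
ratings = (line.split("\t")[2] for line in sample_lines)
histogram = Counter(ratings)

for rating, count in sorted(histogram.items()):
    print(rating, count)
```

The Spark version of this script does the same extract-and-count, just distributed across a cluster instead of running on a single list in memory.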
Getting set up - installing Python, a JDK, and Spark and its dependencies

Let's get you started. There is a lot of software we need to set up. Running Spark on Windows involves a lot of moving pieces, so make sure you follow along carefully, or else you'll have some trouble. I'll try to walk you through it as easily as I can. Now, this chapter is written for Windows users. This doesn't mean that you're out of luck if you're on Mac or Linux though. If you open up the download package for the book or go to this URL, http://media.sundog-soft.com/spark-python-install.pdf, you will find written instructions on getting everything set up on Windows, macOS, and Linux. So, you can read through the chapter here, and I will call out the things that are specific to Windows, so you'll find it useful on other platforms as well; either refer to that spark-python-install.pdf file or just follow the instructions here on Windows. Let's dive in and get it done.

Installing Enthought Canopy

This book uses Python as its programming language, so the first thing you need is a Python development environment installed on your PC. If you don't have one already, just open up a web browser and head on to https://www.enthought.com/, and we'll install Enthought Canopy:
Enthought Canopy is just my development environment of choice; if you have a different one already that's probably okay. As long as it's Python 3 or a newer environment, you should be covered, but if you need to install a new Python environment or you just want to minimize confusion, I'd recommend that you install Canopy. So, head up to the big friendly download Canopy button here and select your operating system and architecture:
For me, the operating system is going to be Windows (64-bit). Make sure you choose Python 3.5 or a newer version of the package. I can't guarantee the scripts in this book will work with Python 2.7; they are built for Python 3, so select Python 3.5 for your OS and download the installer:
There's nothing special about it; it's just your standard installer for Windows, or whatever platform you're on. We'll just accept the defaults, go through it, and allow it to become our default Python environment. Then, when we launch it for the first time, it will spend a couple of minutes setting itself up along with all the Python packages that we need. You might want to read the license agreement before you accept it; that's up to you. We'll go ahead, start the installation, and let it run. Once the Canopy installer has finished, we should have a nice little Enthought Canopy icon sitting on our desktop. Now, if you're on Windows, I want you to right-click on the Enthought Canopy icon, go to Properties and then to Compatibility (this is on Windows 10), and make sure Run this program as an administrator is checked:
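Once everything is installed, a quick sanity check can save some troubleshooting later. This is an optional sketch of my own, not part of the book's setup steps: it checks the Python version and prints the JAVA_HOME and SPARK_HOME environment variables that a typical Spark installation relies on (those variable names are an assumption about your setup; adjust them if your installation uses different ones).

```python
import os
import sys

# The book's scripts target Python 3, so warn on anything older than 3.5.
print("Python version:", ".".join(map(str, sys.version_info[:3])))
if sys.version_info < (3, 5):
    print("Warning: this book's scripts are built for Python 3.5+")

# os.environ.get returns the fallback string when a variable is not set,
# so this never raises even on a machine without Spark installed.
for var in ("JAVA_HOME", "SPARK_HOME"):
    print(var, "=", os.environ.get(var, "(not set)"))
```

If either variable prints "(not set)", revisit the installation instructions before trying to run a Spark script.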