Apache_Spark_Graph_Processing.pdf

Cover
Copyright
Credits
Foreword
About the Author
About the Reviewer
www.PacktPub.com
Table of Contents
Preface
Chapter 1: Getting Started with Spark and GraphX
Downloading and installing Spark 1.4.1
Experimenting with the Spark shell
Getting started with GraphX
Building a tiny social network
Loading the data
The property graph
Transforming RDDs to VertexRDD and EdgeRDD
Introducing graph operations
Building and submitting a standalone application
Writing and configuring a Spark program
Building the program with the Scala Build Tool
Deploying and running with spark-submit
Summary
Chapter 2: Building and Exploring Graphs
Network datasets
The communication network
Flavor networks
Social ego networks
Graph builders
The Graph factory method
edgeListFile
fromEdges
fromEdgeTuples
Building graphs
Building directed graphs
Building a bipartite graph
Building a weighted social ego network
Computing the degrees of the network nodes
In-degree and out-degree of the Enron email network
Degrees in the bipartite food network
Degree histogram of the social ego networks
Summary
Chapter 3: Graph Analysis and Visualization
Network datasets
The graph visualization
Installing the GraphStream and BreezeViz libraries
Visualizing the graph data
Plotting the degree distribution
The analysis of network connectedness
Finding the connected components
Counting triangles and computing clustering coefficients
The network centrality and PageRank
How PageRank works
Ranking web pages
Scala Build Tool revisited
Organizing build definitions
Managing library dependencies
A preview of the steps
Running tasks with SBT commands
Summary
Chapter 4: Transforming and Shaping Up Graphs to Your Needs
Transforming the vertex and edge attributes
mapVertices
mapEdges
mapTriplets
Modifying graph structures
The reverse operator
The subgraph operator
The mask operator
The groupEdges operator
Joining graph datasets
joinVertices
outerJoinVertices
Example – Hollywood movie graph
Data operations on VertexRDD and EdgeRDD
Mapping VertexRDD and EdgeRDD
Filtering VertexRDDs
Joining VertexRDDs
Joining EdgeRDDs
Reversing edge directions
Collecting neighboring information
Example – from food network to flavor pairing
Summary
Chapter 5: Creating Custom Graph Aggregation Operators
NCAA College Basketball datasets
The aggregateMessages operator
EdgeContext
Abstracting out the aggregation
Keeping things DRY
Coach wants more numbers
Calculating average points per game
Defense stats – D matters as in direction
Joining average stats into a graph
Performance optimization
The MapReduceTriplets operator
Summary
Chapter 6: Iterative Graph-Parallel Processing with Pregel
The Pregel computational model
Example – iterating towards the social equality
The Pregel API in GraphX
Community detection through label propagation
The Pregel implementation of PageRank
Summary
Chapter 7: Learning Graph Structures
Community clustering in graphs
Spectral clustering
Power iteration clustering
Applications – music fan community detection
Step 1 – load the data into a Spark graph property
Step 2 – extract the features of nodes
Step 3 – define a similarity measure between two nodes
Step 4 – create an affinity matrix
Step 5 – run k-means clustering on the affinity matrix
Exercise – collaborative clustering through playlists
Summary
References
Chapter 2, Building and Exploring Graphs
Chapter 3, Graph Analysis and Visualization
Chapter 7, Learning Graph Structures
Index
Apache Spark Graph Processing
Build, process, and analyze large-scale graphs with Spark
Rindra Ramamonjison
BIRMINGHAM - MUMBAI
Apache Spark Graph Processing
Copyright © 2015 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: September 2015
Production reference: 1040915

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.

ISBN 978-1-78439-180-5
www.packtpub.com
Credits
Author: Rindra Ramamonjison
Reviewer: Thomas W. Dinsmore, Ryan Mccune, Francoise Provencher
Commissioning Editor: Amit Ghodke
Acquisition Editor: Larissa Pinto
Content Development Editor: Dharmesh Parmar
Technical Editor: Prajakta Mhatre
Copy Editor: Yesha Gangani
Project Coordinator: Nikhil Nair
Proofreader: Safis Editing
Indexer: Tejal Soni
Production Coordinator: Aparna Bhagat
Cover Work: Aparna Bhagat
Foreword

Apache Spark is one of the most compelling technologies in the big data space, and for good reason. It allows data scientists and data engineers alike to work in their language of choice (Java, Scala, Python, SQL, and R as of this writing) to make sense of their data. As Reynold Xin noted, Apache Spark is the Swiss Army Knife of big data analytics tools. It allows you to use one tool to do many things, from real-time streaming to advanced analytics. And in no small part, the versatility and power of GraphX have helped propel Spark forward.

Apache Spark Graph Processing follows Rindra's journey into solving complex analytics problems. As a PhD graduate in electrical engineering from the University of British Columbia, he focused on applying learning and optimization algorithms to achieve energy-efficient wireless networks. As he dove further into these problems, he realized the ease with which he could solve graph-processing problems by using Apache Spark GraphX. With a tutorial style and hands-on projects with interesting datasets, this book is a reflection of his path from getting started with Apache Spark GraphX to iterative graph-parallel processing to learning graph structures.

This book is a great jump-start into GraphX, a practical guide for large-scale graph processing, and a testament to the author's enthusiasm for the Spark community (and the community as a whole).

Denny Lee
Technology Evangelist, Databricks
Advisor, WearHacks
About the Author

Rindra Ramamonjison is a fourth-year PhD student of electrical engineering at the University of British Columbia, Vancouver. He received his master's degree from Tokyo Institute of Technology. He has played various roles in many engineering companies in the telecom and finance industries. His primary research interests are machine learning, optimization, graph processing, and statistical signal processing. Rindra is also the co-organizer of the Vancouver Spark Meetup.
About the Reviewer

Thomas W. Dinsmore is a consultant and author with more than 30 years of service to enterprises around the world. He is an expert in business analytics, and has working experience with the leading analytic tools, languages, and databases. In his practice, Thomas helps organizations streamline analytics for improved performance and time to value. Previously, Thomas served with The Boston Consulting Group, IBM, PriceWaterhouseCoopers and SAS, as well as several startups. Thomas coauthored Modern Analytics Methodologies and Advanced Analytics Methodologies, published in 2014 by FT Press. He is currently under contract to publish a book on disruptive technologies in business analytics, scheduled for publication in Q2 2016.

I would like to thank the entire editorial and production team at Packt Publishing, who work tirelessly to bring quality books to the public.