Apache_Spark_Graph_Processing.pdf

Posted: 2022-06-14 | Posted by: admin | Category: Manuals | Size: 1.89M | Format: pdf | Pages: 148

Cover

Credits

Foreword

About the Author

About the Reviewer

www.PacktPub.com

Table of Contents

Preface

Chapter 1: Getting Started with Spark and GraphX

Downloading and installing Spark 1.4.1

Experimenting with the Spark shell

Getting started with GraphX

Building a tiny social network

Loading the data

The property graph

Transforming RDDs to VertexRDD and EdgeRDD

Introducing graph operations

Building and submitting a standalone application

Writing and configuring a Spark program

Building the program with the Scala Build Tool

Deploying and running with spark-submit

Summary

Chapter 2: Building and Exploring Graphs

Network datasets

The communication network

Flavor networks

Social ego networks

Graph builders

The Graph factory method

edgeListFile

fromEdges

fromEdgeTuples

Building graphs

Building directed graphs

Building a bipartite graph

Building a weighted social ego network

Computing the degrees of the network nodes

In-degree and out-degree of the Enron email network

Degrees in the bipartite food network

Degree histogram of the social ego networks

Summary

Chapter 3: Graph Analysis and Visualization

Network datasets

The graph visualization

Installing the GraphStream and BreezeViz libraries

Visualizing the graph data

Plotting the degree distribution

The analysis of network connectedness

Finding the connected components

Counting triangles and computing clustering coefficients

The network centrality and PageRank

How PageRank works

Ranking web pages

Scala Build Tool revisited

Organizing build definitions

Managing library dependencies

A preview of the steps

Running tasks with SBT commands

Summary

Chapter 4: Transforming and Shaping Up Graphs to Your Needs

Transforming the vertex and edge attributes

mapVertices

mapEdges

mapTriplets

Modifying graph structures

The reverse operator

The subgraph operator

The mask operator

The groupEdges operator

Joining graph datasets

joinVertices

outerJoinVertices

Example – Hollywood movie graph

Data operations on VertexRDD and EdgeRDD

Mapping VertexRDD and EdgeRDD

Filtering VertexRDDs

Joining VertexRDDs

Joining EdgeRDDs

Reversing edge directions

Collecting neighboring information

Example – from food network to flavor pairing

Summary

Chapter 5: Creating Custom Graph Aggregation Operators

NCAA College Basketball datasets

The aggregateMessages operator

EdgeContext

Abstracting out the aggregation

Keeping things DRY

Coach wants more numbers

Calculating average points per game

Defense stats – D matters as in direction

Joining average stats into a graph

Performance optimization

The MapReduceTriplets operator

Summary

Chapter 6: Iterative Graph-Parallel Processing with Pregel

The Pregel computational model

Example – iterating towards the social equality

The Pregel API in GraphX

Community detection through label propagation

The Pregel implementation of PageRank

Summary

Chapter 7: Learning Graph Structures

Community clustering in graphs

Spectral clustering

Power iteration clustering

Applications – music fan community detection

Step 1 – load the data into a Spark graph property

Step 2 – extract the features of nodes

Step 3 – define a similarity measure between two nodes

Step 4 – create an affinity matrix

Step 5 – run k-means clustering on the affinity matrix

Exercise – collaborative clustering through playlists

Summary

References

Chapter 2, Building and Exploring Graphs

Chapter 3, Graph Analysis and Visualization

Chapter 7, Learning Graph Structures

Index

Apache Spark Graph Processing

Build, process, and analyze large-scale graphs with Spark

Rindra Ramamonjison

BIRMINGHAM - MUMBAI

Apache Spark Graph Processing

Copyright © 2015 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: September 2015

Production reference: 1040915

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.

ISBN 978-1-78439-180-5

www.packtpub.com

Credits

Author: Rindra Ramamonjison

Reviewers: Thomas W. Dinsmore, Ryan Mccune, Francoise Provencher

Commissioning Editor: Amit Ghodke

Acquisition Editor: Larissa Pinto

Content Development Editor: Dharmesh Parmar

Technical Editor: Prajakta Mhatre

Copy Editor: Yesha Gangani

Project Coordinator: Nikhil Nair

Proofreader: Safis Editing

Indexer: Tejal Soni

Production Coordinator: Aparna Bhagat

Cover Work: Aparna Bhagat

Foreword

Apache Spark is one of the most compelling technologies in the big data space, and for good reason. It allows data scientists and data engineers alike to work in their language of choice (Java, Scala, Python, SQL, and R as of this writing) to make sense of their data. As Reynold Xin noted, Apache Spark is the Swiss Army Knife of big data analytics tools. It allows you to use one tool to do many things, from real-time streaming to advanced analytics. In no small part, the versatility and power of GraphX have helped propel Spark forward.

Apache Spark Graph Processing follows Rindra's journey into solving complex analytics problems. As a PhD graduate in electrical engineering from the University of British Columbia, he focused on applying learning and optimization algorithms to achieve energy-efficient wireless networks. As he dove further into these problems, he realized the ease with which he could solve graph-processing problems by using Apache Spark GraphX. With a tutorial style and hands-on projects with interesting datasets, this book is a reflection of his path from getting started with Apache Spark GraphX to iterative graph-parallel processing to learning graph structures.

This book is a great jump-start into GraphX, a practical guide for large-scale graph processing, and a testament to the author's enthusiasm for the Spark community (and the community as a whole).

Denny Lee
Technology Evangelist, Databricks
Advisor, WearHacks

About the Author

Rindra Ramamonjison is a fourth-year PhD student of electrical engineering at the University of British Columbia, Vancouver. He received his master's degree from Tokyo Institute of Technology. He has played various roles in many engineering companies in the telecom and finance industries. His primary research interests are machine learning, optimization, graph processing, and statistical signal processing. Rindra is also the co-organizer of the Vancouver Spark Meetup.

About the Reviewer

Thomas W. Dinsmore is a consultant and author with more than 30 years of service to enterprises around the world. He is an expert in business analytics and has working experience with the leading analytic tools, languages, and databases. In his practice, Thomas helps organizations streamline analytics for improved performance and time to value. Previously, Thomas served with The Boston Consulting Group, IBM, PricewaterhouseCoopers, and SAS, as well as several startups. Thomas coauthored Modern Analytics Methodologies and Advanced Analytics Methodologies, published in 2014 by FT Press. He is currently under contract to publish a book on disruptive technologies in business analytics, scheduled for publication in Q2 2016.

I would like to thank the entire editorial and production team at Packt Publishing, who work tirelessly to bring quality books to the public.
