Mahout tutorial.pdf

About this Tutorial

Apache Mahout is an open source project that is primarily used for producing scalable machine learning algorithms. This brief tutorial provides a quick introduction to Apache Mahout and explains how it can be applied to make recommendations and organize documents into more usable clusters.

Audience

This tutorial has been prepared for professionals aspiring to learn the basics of Mahout and develop applications involving machine learning techniques such as recommendation, classification, and clustering.

Prerequisites

Before you proceed with this tutorial, we assume that you have prior exposure to Core Java, Hadoop, and any of the Linux operating system flavors.

Copyright & Disclaimer

© Copyright 2015 by Tutorials Point (I) Pvt. Ltd.

All the content and graphics published in this e-book are the property of Tutorials Point (I) Pvt. Ltd. The user of this e-book is prohibited from reusing, retaining, copying, distributing, or republishing any contents or a part of the contents of this e-book in any manner without written consent of the publisher.

We strive to update the contents of our website and tutorials as timely and as precisely as possible; however, the contents may contain inaccuracies or errors. Tutorials Point (I) Pvt. Ltd. provides no guarantee regarding the accuracy, timeliness, or completeness of our website or its contents, including this tutorial. If you discover any errors on our website or in this tutorial, please notify us at contact@tutorialspoint.com
Table of Contents

About this Tutorial
  Audience
  Prerequisites
  Copyright & Disclaimer
  Table of Contents

1. INTRODUCTION
  What is Apache Mahout?
  Features of Mahout
  Applications of Mahout

2. MACHINE LEARNING
  What is Machine Learning?
  Supervised Learning
  Unsupervised Learning
  Recommendation
  Classification
  Clustering

3. ENVIRONMENT
  Pre-Installation Setup
  Installing Java
  Downloading Hadoop
  Installing Hadoop
  core-site.xml
  hdfs-site.xml
  yarn-site.xml
  mapred-site.xml
  Verifying Hadoop Installation
  Downloading Mahout
  Maven Repository

4. RECOMMENDATION
  Recommendation
  Mahout Recommender Engine
  Building a Recommender using Mahout

5. CLUSTERING
  Applications of Clustering
  Procedure of Clustering
  Clustering Algorithms

6. CLASSIFICATION
  What is Classification?
  How Classification Works
  Applications of Classification
  Naive Bayes Classifier
  Procedure of Classification
1. INTRODUCTION

We are living in a day and age where information is available in abundance. Information overload has reached such heights that it is sometimes difficult to manage even our own small mailboxes! Imagine the volume of data and records that popular websites (the likes of Facebook, Twitter, and YouTube) have to collect and manage on a daily basis. It is not uncommon even for lesser-known websites to receive huge amounts of information in bulk.

Normally we fall back on data mining algorithms to analyze bulk data, identify trends, and draw conclusions. However, no data mining algorithm can process very large datasets and deliver results quickly unless the computational tasks are run on multiple machines distributed over the cloud.

We now have frameworks that allow us to break down a computation task into multiple segments and run each segment on a different machine. Mahout is such a data mining framework; it normally runs on top of the Hadoop infrastructure to manage huge volumes of data.

What is Apache Mahout?

A mahout is a person who drives an elephant as its master. The name comes from the project's close association with Apache Hadoop, which uses an elephant as its logo.

Hadoop is an open-source framework from Apache that allows you to store and process big data in a distributed environment across clusters of computers using simple programming models.

Apache Mahout is an open source project that is primarily used for creating scalable machine learning algorithms. It implements popular machine learning techniques such as:

● Recommendation
● Classification
● Clustering

Apache Mahout started as a sub-project of Apache's Lucene in 2008. In 2010, Mahout became a top-level project of Apache.

Features of Mahout

The main features of Apache Mahout are listed below.

● The algorithms of Mahout are written on top of Hadoop, so it works well in a distributed environment. Mahout uses the Apache Hadoop library to scale effectively in the cloud.
● Mahout offers the coder a ready-to-use framework for doing data mining tasks on large volumes of data.
● Mahout lets applications analyze large sets of data effectively and quickly.
● Includes several MapReduce-enabled clustering implementations such as k-means, fuzzy k-means, Canopy, Dirichlet, and Mean-Shift.
● Supports Distributed Naive Bayes and Complementary Naive Bayes classification implementations.
● Comes with distributed fitness function capabilities for evolutionary programming.
● Includes matrix and vector libraries.

Applications of Mahout

● Companies such as Adobe, Facebook, LinkedIn, Foursquare, Twitter, and Yahoo use Mahout internally.
● Foursquare helps you find places, food, and entertainment available in a particular area. It uses the recommender engine of Mahout.
● Twitter uses Mahout for user interest modelling.
● Yahoo! uses Mahout for pattern mining.
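
The feature list above names k-means among Mahout's clustering implementations. As a rough illustration of what that algorithm does, the sketch below runs the basic assign-and-update loop on one-dimensional points in plain Java. It is a single-machine teaching sketch, not Mahout's MapReduce implementation; the class name KMeans1D, the method cluster, and the sample points are all made up for this example.

```java
import java.util.Arrays;

// Illustrative only: a tiny single-machine k-means, not the Mahout API.
final class KMeans1D {

    /** Repeats the assign-then-update step for a fixed number of iterations. */
    static double[] cluster(double[] points, double[] initialCentroids, int iterations) {
        double[] centroids = initialCentroids.clone();
        for (int it = 0; it < iterations; it++) {
            double[] sum = new double[centroids.length];
            int[] count = new int[centroids.length];

            // Assignment step: attach each point to its nearest centroid.
            for (double p : points) {
                int nearest = 0;
                for (int c = 1; c < centroids.length; c++) {
                    if (Math.abs(p - centroids[c]) < Math.abs(p - centroids[nearest])) {
                        nearest = c;
                    }
                }
                sum[nearest] += p;
                count[nearest]++;
            }

            // Update step: move each centroid to the mean of its points.
            for (int c = 0; c < centroids.length; c++) {
                if (count[c] > 0) {
                    centroids[c] = sum[c] / count[c];
                }
            }
        }
        return centroids;
    }

    public static void main(String[] args) {
        // Hypothetical data: two obvious groups, around 2 and around 11.
        double[] points = {1.0, 2.0, 3.0, 10.0, 11.0, 12.0};
        double[] start = {0.0, 5.0};
        System.out.println(Arrays.toString(cluster(points, start, 10)));
    }
}
```

With the sample data, main prints [2.0, 11.0], the means of the two groups. A distributed implementation would split the assignment step across machines and combine the partial sums, which is the kind of work Hadoop handles for Mahout.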
2. MACHINE LEARNING

Apache Mahout is a highly scalable machine learning library that enables developers to use optimized algorithms. Mahout implements popular machine learning techniques such as recommendation, classification, and clustering. Therefore, it is prudent to have a brief section on machine learning before we move further.

What is Machine Learning?

Machine learning is a branch of science that deals with programming systems in such a way that they automatically learn and improve with experience. Here, learning means recognizing and understanding the input data and making wise decisions based on the supplied data.

It is very difficult to hand-code a decision for every possible input. To tackle this problem, learning algorithms are developed. These algorithms build knowledge from specific data and past experience using the principles of statistics, probability theory, logic, combinatorial optimization, search, reinforcement learning, and control theory.

The developed algorithms form the basis of various applications such as:

● Vision processing
● Language processing
● Forecasting (e.g., stock market trends)
● Pattern recognition
● Games
● Data mining
● Expert systems
● Robotics

Machine learning is a vast area, and it is beyond the scope of this tutorial to cover all its features. There are several ways to implement machine learning techniques; however, the most commonly used ones are supervised and unsupervised learning.

Supervised Learning

Supervised learning deals with learning a function from available training data. A supervised learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples. Common examples of supervised learning include:

● classifying e-mails as spam,
● labeling webpages based on their content, and
● voice recognition.

There are many supervised learning algorithms, such as neural networks, Support Vector Machines (SVMs), and Naive Bayes classifiers. Mahout implements the Naive Bayes classifier.

Unsupervised Learning

Unsupervised learning makes sense of unlabeled data without any labeled dataset to train on. It is an extremely powerful tool for analyzing available data and looking for patterns and trends. It is most commonly used for clustering similar input into logical groups. Common approaches to unsupervised learning include:

● k-means,
● self-organizing maps, and
● hierarchical clustering.

Recommendation

Recommendation is a popular technique that provides closely matched suggestions based on user information such as previous purchases, clicks, and ratings.

● Amazon uses this technique to display a list of recommended items that you might be interested in, drawing information from your past actions. Recommender engines work behind Amazon to capture user behavior and recommend selected items based on your earlier actions.
● Facebook uses the recommender technique to identify and recommend the "People You May Know" list.
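
To make the idea of recommending from past ratings more concrete, here is a minimal sketch of user-based recommendation in plain Java. It assumes cosine similarity between users and a similarity-weighted sum of ratings as the score, which is one common choice among several. It is not Mahout's recommender engine, which this tutorial covers in chapter 4; the class TinyRecommender, its methods, and the sample ratings are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a toy user-based recommender, not the Mahout API.
final class TinyRecommender {

    /** Cosine similarity between two users' ratings (unrated items count as 0). */
    static double similarity(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (double v : b.values()) {
            normB += v * v;
        }
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }

    /** Ranks items the user has not rated by similarity-weighted ratings of other users. */
    static List<String> recommend(Map<String, Map<String, Double>> ratings, String user, int howMany) {
        Map<String, Double> own = ratings.getOrDefault(user, Map.of());
        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, Map<String, Double>> other : ratings.entrySet()) {
            if (other.getKey().equals(user)) {
                continue;
            }
            double sim = similarity(own, other.getValue());
            for (Map.Entry<String, Double> r : other.getValue().entrySet()) {
                if (!own.containsKey(r.getKey())) {
                    scores.merge(r.getKey(), sim * r.getValue(), Double::sum);
                }
            }
        }
        List<String> items = new ArrayList<>(scores.keySet());
        items.sort(Comparator.comparingDouble((String i) -> scores.get(i)).reversed());
        return items.subList(0, Math.min(howMany, items.size()));
    }

    public static void main(String[] args) {
        // Hypothetical ratings on a 1-5 scale.
        Map<String, Map<String, Double>> ratings = new HashMap<>();
        ratings.put("alice", Map.of("book", 5.0, "lamp", 4.0));
        ratings.put("bob", Map.of("book", 5.0, "lamp", 4.0, "desk", 5.0, "sofa", 1.0));
        ratings.put("carol", Map.of("sofa", 5.0, "rug", 4.0));
        System.out.println(recommend(ratings, "alice", 2));
    }
}
```

With these sample ratings, main prints [desk, sofa]. Bob rates the same items as Alice in the same way, so his high rating for the desk dominates, while Carol shares no rated items with Alice and contributes nothing to the scores.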