Hands-On Exercises: Cloudera Custom Training

Contents

General Notes
Hands-On Exercise: Query Hadoop Data with Apache Impala
Hands-On Exercise: Access HDFS with the Command Line and Hue
Hands-On Exercise: Run a YARN Job
Hands-On Exercise: Explore RDDs Using the Spark Shell
Hands-On Exercise: Process Data Files with Apache Spark
Hands-On Exercise: Use Pair RDDs to Join Two Datasets
Hands-On Exercise: Write and Run an Apache Spark Application
Hands-On Exercise: Configure an Apache Spark Application
Hands-On Exercise: View Jobs and Stages in the Spark Application UI
Hands-On Exercise: Persist an RDD
Hands-On Exercise: Implement an Iterative Algorithm with Apache Spark
Hands-On Exercise: Use Apache Spark SQL for ETL
Hands-On Exercise: Write an Apache Spark Streaming Application
Hands-On Exercise: Process Multiple Batches with Apache Spark Streaming
Hands-On Exercise: Process Apache Kafka Messages with Apache Spark Streaming
Appendix A: Enabling Jupyter Notebook for PySpark
Appendix B: Managing Services on the Course Virtual Machine
General Notes

Cloudera's training courses use a Virtual Machine (VM) running the CentOS Linux distribution. This VM has CDH installed in pseudo-distributed mode. Pseudo-distributed mode is a method of running Hadoop whereby all Hadoop daemons run on the same machine. It is, essentially, a cluster consisting of a single machine. It works just like a larger Hadoop cluster; the only key difference is that the HDFS block replication factor is set to 1, since there is only a single DataNode available.

Points to Note while Working in the Virtual Machine

• The VM is set to log in as the user training automatically. If you log out, you can log back in as the user training with the password training. If you need it, the root password is also training. You may be prompted for this if, for example, you want to change the keyboard layout. In general, you should not need this password, since the training user has unlimited sudo privileges.

• In some command-line steps in the exercises, you will see lines like this:

$ hdfs dfs -put united_states_census_data_2010 \
/user/training/example

The dollar sign ($) at the beginning of each line indicates the Linux shell prompt. The actual prompt includes additional information (for example, [training@localhost training_materials]$), but this is omitted from these instructions for brevity.

The backslash (\) at the end of the first line signifies that the command is not complete and continues on the next line. You can enter the command exactly as shown (on two lines), or you can enter it on a single line. If you do the latter, do not type the backslash.
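For example, here is the same command entered on a single line:

$ hdfs dfs -put united_states_census_data_2010 /user/training/example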
• Although most students are comfortable using UNIX text editors like vi or emacs, some might prefer a graphical text editor. To invoke the graphical editor from the command line, type gedit followed by the path of the file you wish to edit. Appending & to the command allows you to type additional commands while the editor is still open. Here is an example of how to edit a file named myfile.txt:

$ gedit myfile.txt &

Points to Note during the Exercises

Directories

• The main directory for the exercises is ~/training_materials/devsh/exercises. Each directory under that one corresponds to an exercise or set of exercises—this is referred to in the instructions as "the exercise directory." Any scripts or files required for the exercise (other than data) are in the exercise directory.

• Within each exercise directory you may find the following subdirectories:

• solution—This contains the solution code for each exercise.

• stubs—A few of the exercises depend on provided starter files containing skeleton code.

• Maven project directories—For exercises in which you must write Scala classes, you have been provided with preconfigured Maven project directories. Within these projects are two packages: stubs, where you will do your work using the starter skeleton classes, and solution, containing the solution class.

• Data files used in the exercises are in ~/training_materials/data. Usually you will upload the files to HDFS before working with them.

• The VM defines a few environment variables that are used in place of longer paths in the instructions. Because each variable is automatically replaced with its corresponding value when you run commands in the terminal, commands are easier and faster to enter.

• The two environment variables for this course are $DEVSH and $DEVDATA. Under $DEVSH you can find exercises, examples, and scripts.

• You can always use the echo command if you would like to see the value of an environment variable:

$ echo $DEVSH
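Based on the directory paths mentioned above, these variables presumably point to locations under ~/training_materials. The values shown below are an assumption for illustration, not output captured from the course VM:

$ echo $DEVSH
/home/training/training_materials/devsh
$ echo $DEVDATA
/home/training/training_materials/data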
Step-by-Step Instructions

As the exercises progress and you gain more familiarity with the tools and environment, we provide fewer step-by-step instructions; as in the real world, we merely give you a requirement and it is up to you to solve the problem! You should feel free to refer to the hints or solutions provided, ask your instructor for assistance, or consult with your fellow students.

Bonus Exercises

There are additional challenges for some of the hands-on exercises. If you finish the main exercise, please attempt the additional steps.

Catch-Up Script

If you are unable to complete an exercise, we have provided a script to catch you up automatically. Each exercise has instructions for running the catch-up script, where applicable. The script will prompt for which exercise you are starting; it will set up all the required data as if you had completed all the previous exercises.

Warning: If you run the catch-up script, you may lose your work. (For example, all data will be deleted from HDFS.)

$ $DEVSH/scripts/catchup.sh
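If you have work in HDFS that you want to keep, one option is to copy it to the local filesystem before running the script. A minimal sketch, using a hypothetical /user/training/mywork directory:

# Copy an HDFS directory to the local home directory before catch-up deletes it
$ hdfs dfs -get /user/training/mywork ~/mywork_backup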
Hands-On Exercise: Query Hadoop Data with Apache Impala

Files and Data Used in This Exercise:
Impala/Hive table: device

In this exercise, you will use the Hue Impala Query Editor to explore data in a Hadoop cluster.

This exercise is intended to let you begin to familiarize yourself with the course Virtual Machine as well as with Hue. You will also briefly explore the Impala Query Editor.

Using the Hue Impala Query Editor

1. Start Firefox on the VM using the shortcut provided on the main menu panel at the top of the screen.

2. Click the Hue bookmark, or visit http://localhost:8888/.

3. Because this is the first time anyone has logged in to Hue on this server, you will be prompted to create a new user account. Enter username training and password training, and then click Create account. (If prompted, you may click Remember.)

• Note: When you first log in to Hue, you may see a misconfiguration warning. This is because, to save space, not all of the services Hue depends on are installed on the course VM. You can disregard the message.

4. Open the Impala query editor from the Query Editors menu.
5. In the left panel's Assist tab, select the device table. This displays the table's column definitions.

• Note: There are four columns defined. If you do not see all four, try resizing the Firefox window.

6. Hover your pointer over the device table to reveal the associated control icons, then click the Preview Sample data icon. A popup with the first several rows of data in the table will appear. When you are done viewing the data, click the X in the upper right corner of the popup to close the preview.

7. In the query text box in the main panel, enter a SQL query like the one below:

SELECT * FROM device WHERE name LIKE 'Me%';

8. Click the Execute button to execute the command.

9. To see the results, view the Results tab below the query area.

10. Optional: If you have extra time, continue exploring the Impala Query Editor on your own. For instance, try selecting other tabs after viewing the results, or try some of the example queries sketched after this exercise.

This is the end of the exercise.
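For the optional exploration in step 10, here are a few queries you might try. They are illustrative suggestions rather than part of the official exercise; the only schema details given above are the device table and its name column:

-- Show the table's column definitions (the same information as the Assist tab)
DESCRIBE device;

-- Count the rows in the table
SELECT COUNT(*) FROM device;

-- A variation on the step 7 query: rows whose name does not start with 'Me'
SELECT * FROM device WHERE name NOT LIKE 'Me%';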
Hands-On Exercise: Access HDFS with the Command Line and Hue

Files and Data Used in This Exercise:
Data files (local): $DEVDATA/kb/*
$DEVDATA/base_stations.tsv

In this exercise, you will practice working with HDFS, the Hadoop Distributed File System. You will use the HDFS command line tool and the Hue File Browser web-based interface to manipulate files in HDFS.

Exploring the HDFS Command Line Interface

The simplest way to interact with HDFS is by using the hdfs command. To execute filesystem commands within HDFS, use the hdfs dfs command.

1. Open a terminal window by double-clicking the Terminal icon on the desktop.

2. Enter:

$ hdfs dfs -ls /

This shows you the contents of the root directory in HDFS. There will be multiple entries, one of which is /user. Each user has a "home" directory under this directory, named after their username; your username in this course is training, therefore your home directory is /user/training.

3. Try viewing the contents of the /user directory by running:

$ hdfs dfs -ls /user

You will see your home directory in the directory listing.
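The remaining steps of this exercise are not included in this excerpt. As a sketch of the kind of operations its data files suggest (uploading local files into HDFS and inspecting them), the following commands would work on the course VM; the staging directory name is illustrative, not from the exercise:

# Create a directory in HDFS and upload one of the exercise's data files
$ hdfs dfs -mkdir /user/training/staging
$ hdfs dfs -put $DEVDATA/base_stations.tsv /user/training/staging/

# Confirm the upload and view the first few lines of the file
$ hdfs dfs -ls /user/training/staging
$ hdfs dfs -cat /user/training/staging/base_stations.tsv | head -n 5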