Spark Standalone Mode with HDFS Integration: Configuration Checklist
OS: CentOS 7, 64-bit
Hardware: CPU i5-2410M (2 cores / 4 threads)
Prerequisites:
1. Create a Linux user z1.
2. Configure a static IP address for each node.
3. Add hostname-to-IP mappings in /etc/hosts on every node:
/etc/hosts
192.168.255.123  os1
192.168.255.121  os2
192.168.255.122  os3
4. Install JDK 1.8.0_181.
5. Install Scala 2.12.4.
6. Configure PATH and CLASSPATH by adding:
export JAVA_HOME=/home/z1/java/jdk1.8.0_181
export HADOOP_HOME=/home/z1/hadoop/hadoop-2.8.5
export SCALA_HOME=/home/z1/scala/scala-2.12.4
export SPARK_HOME=/home/z1/hadoop/spark/spark-2.3.1-bin-hadoop2.7
export PATH=$SCALA_HOME/bin:$SPARK_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$JAVA_HOME/bin:$PATH
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
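To make these settings persistent, one option (a sketch, assuming user z1's login shell is bash and the exports live in ~/.bashrc) is:
# Append the export lines above to ~/.bashrc, then reload it
vi ~/.bashrc
source ~/.bashrc
# Verify the JDK and Scala installs are on the PATH
java -version
scala -version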
一、Configure Hadoop
1. Download the matching tar archive and extract it to the installation directory.
2. Edit the files under $HADOOP_HOME/etc/hadoop:
Edit hadoop-env.sh and set:
export JAVA_HOME=/home/z1/java/jdk1.8.0_181
Edit slaves to add the worker hosts:
os2
os3
3. Edit core-site.xml and add:
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://os1:9000/</value>
</property>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/z1/hadoop/hadoop-2.8.5/tmp</value>
</property>
4. Edit hdfs-site.xml and add:
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>
5. Edit mapred-site.xml and add:
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
6. Edit yarn-site.xml and add:
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>os1</value>
</property>
7. Format the NameNode: hdfs namenode -format
8. Start HDFS: start-dfs.sh
9. Web UI: http://192.168.255.123:50070/ (the NameNode host's IP)
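A quick sanity check after start-dfs.sh (a sketch; the directory name is only an example):
# os1 should show NameNode and SecondaryNameNode; os2/os3 should show DataNode
jps
# Confirm HDFS is writable and both DataNodes are live
hdfs dfs -mkdir -p /user/z1
hdfs dfs -ls /user
hdfs dfsadmin -report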
二、Configure Spark
1. Download the matching tar archive and extract it to the installation directory.
2. Configure spark-env.sh and slaves under $SPARK_HOME/conf/.
Edit spark-env.sh and add:
export JAVA_HOME=/home/z1/java/jdk1.8.0_181
# Bind the master IP or hostname
export SPARK_MASTER_HOST=os1
# Bind the local IP
export SPARK_LOCAL_IP=192.168.255.123
# Number of logical cores per worker
export SPARK_WORKER_CORES=1
# Worker memory
# Sizing rule: worker memory = executor memory + max(384 MB, 0.1 * worker memory)
export SPARK_WORKER_MEMORY=896m
export HADOOP_CONF_DIR=/home/z1/hadoop/hadoop-2.8.5/etc/hadoop
# Local scratch directory for spilling from memory to disk
export SPARK_LOCAL_DIRS=/home/z1/hadoop/spark/spark-2.3.1-bin-hadoop2.7/scratch
Edit slaves to add the worker hosts:
os2
os3
Once both files are in place on every node, the cluster can be started as sketched below.
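One way to bring up the standalone cluster from os1, using the stock scripts in $SPARK_HOME/sbin (a sketch; assumes spark-env.sh and slaves have been copied to every node):
# Starts the Master on this host and a Worker on each host listed in conf/slaves
$SPARK_HOME/sbin/start-all.sh
# Or start the pieces separately:
# $SPARK_HOME/sbin/start-master.sh
# $SPARK_HOME/sbin/start-slaves.sh
jps   # expect Master on os1, Worker on os2/os3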
3. Submit a job:
spark-submit --master spark://192.168.255.123:7077 --class indata1.InData1 /home/z1/indata1.jar
(--master takes the master address or hostname; --class is the application's entry class; the final argument is the jar's local path on the submitting host.)
4. Web UI: http://192.168.255.123:8080/ (the Master's address)
5. Submission help: spark-submit --help
Usage: spark-submit [options] <app jar | python file | R file> [app arguments]
Usage: spark-submit --kill [submission ID] --master [spark://...]
Usage: spark-submit --status [submission ID] --master [spark://...]
Usage: spark-submit run-example [options] example-class [example args]
Options:
--master MASTER_URL
spark://host:port, mesos://host:port, yarn,
k8s://https://host:port, or local (Default: local[*]).
--deploy-mode DEPLOY_MODE
Whether to launch the driver program locally ("client") or
on one of the worker machines inside the cluster ("cluster")
(Default: client).
--class CLASS_NAME
Your application's main class (for Java / Scala apps).
--name NAME
A name of your application.
--jars JARS
Comma-separated list of jars to include on the driver
and executor classpaths.
--packages
Comma-separated list of maven coordinates of jars to include
on the driver and executor classpaths. Will search the local
maven repo, then maven central and any additional remote
repositories given by --repositories. The format for the
coordinates should be groupId:artifactId:version.
--exclude-packages
Comma-separated list of groupId:artifactId, to exclude while
resolving the dependencies provided in --packages to avoid
dependency conflicts.
--repositories
Comma-separated list of additional remote repositories to
search for the maven coordinates given with --packages.
--py-files PY_FILES
Comma-separated list of .zip, .egg, or .py files to place
on the PYTHONPATH for Python apps.
--files FILES
Comma-separated list of files to be placed in the working
directory of each executor. File paths of these files
in executors can be accessed via SparkFiles.get(fileName).
--conf PROP=VALUE
Arbitrary Spark configuration property.
--properties-file FILE
Path to a file from which to load extra properties. If not
specified, this will look for conf/spark-defaults.conf.
--driver-memory MEM
Memory for driver (e.g. 1000M, 2G) (Default: 1024M).
--driver-java-options
Extra Java options to pass to the driver.
--driver-library-path
Extra library path entries to pass to the driver.
--driver-class-path
Extra class path entries to pass to the driver. Note that
jars added with --jars are automatically included in the
classpath.
--executor-memory MEM
Memory per executor (e.g. 1000M, 2G) (Default: 1G).
--proxy-user NAME
User to impersonate when submitting the application.
This argument does not work with --principal / --keytab.
--help, -h
Show this help message and exit.
--verbose, -v
Print additional debug output.
--version,
Print the version of current Spark.
Cluster deploy mode only:
--driver-cores NUM
Number of cores used by the driver, only in cluster mode
(Default: 1).
Spark standalone or Mesos with cluster deploy mode only:
--supervise
If given, restarts the driver on failure.
--kill SUBMISSION_ID
If given, kills the driver specified.
--status SUBMISSION_ID
If given, requests the status of the driver specified.
Spark standalone and Mesos only:
--total-executor-cores NUM
Total cores for all executors.
Spark standalone and YARN only:
--executor-cores NUM
Number of cores per executor. (Default: 1 in YARN mode,
or all available cores on the worker in standalone mode)
YARN-only:
--queue QUEUE_NAME
The YARN queue to submit to (Default: "default").
--num-executors NUM
Number of executors to launch (Default: 2).
If dynamic allocation is enabled, the initial number of
executors will be at least NUM.
--archives ARCHIVES
Comma separated list of archives to be extracted into the
working directory of each executor.
--principal PRINCIPAL
Principal to be used to login to KDC, while running on
secure HDFS.
--keytab KEYTAB
The full path to the file that contains the keytab for the
principal specified above. This keytab will be copied to
the node running the Application Master via the Secure
Distributed Cache, for renewing the login tickets and the
delegation tokens periodically.
Tuning notes:
1. Set executor-memory before running to help prevent out-of-memory failures.
Sizing rule: worker memory = maximum executor memory + max(384 MB, 0.1 * worker memory)
2. Set worker cores and executor cores appropriately. Setting worker cores too low wastes CPU capacity; if executor cores is left unset, a job defaults to all of the worker's cores, which slows the other jobs when several run at once.
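As a sketch of both points, a submission with explicit resource limits might look like this (values are illustrative and match the 896m worker example: 512m executor + 384m overhead, 1 core per worker):
spark-submit --master spark://os1:7077 \
  --class indata1.InData1 \
  --executor-memory 512m \
  --total-executor-cores 2 \
  /home/z1/indata1.jar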