Spark Environment Setup


Table of Contents

1. Assign a Static IP to the VMware Virtual Machine
2. Environment Parameters
3. Environment Setup
3.1 Install JDK 1.8
3.2 Install Hadoop (CDH 5.7.0)
3.3 Install MySQL
3.4 Install Hive
3.5 Install Scala 2.12.8
3.6 Install Maven 3.5.4
3.7 Compile Spark from Source
3.8 Set Up Spark in Local Mode
3.9 Set Up Spark in Standalone Mode

1. Assign a Static IP to the VMware Virtual Machine

Check the subnet IP, subnet mask, and gateway under Edit --> Virtual Network Editor --> NAT Settings.

Modify the virtual machine's network interface configuration:

    sudo vim /etc/sysconfig/network-scripts/ifcfg-ens33

Set BOOTPROTO=static, and append the IPADDR, NETMASK, GATEWAY, and DNS1 settings at the end of the file:

IPADDR=192.168.48.143
NETMASK=255.255.255.0
GATEWAY=192.168.48.2
DNS1=8.8.8.8

Restart the network service so the configuration takes effect:

    sudo service network restart
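A quick way to confirm the new address took effect and that the gateway and DNS server are reachable (a minimal check, assuming the ens33 interface name used above):

ip addr show ens33    # should list inet 192.168.48.143/24
ping -c 3 8.8.8.8     # should get replies if the gateway and DNS IP work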

2. Environment Parameters

Linux version: CentOS 7.2
JDK version: 1.8
Hadoop version: hadoop-2.6.0-cdh5.7.0
Scala version: 2.12.8
Spark version: spark-2.4.3 (the latest Spark release, 2.4.3, depends on Scala 2.12 and requires Maven 3.5.4 or higher)
IDE: IDEA
CDH downloads: http://archive.cloudera.com/cdh5/cdh/5/
Project directory layout:
# Login user: zcx
# Directories created under ~:
# app              installation directory for all software
# data             test data
# lib              jars produced during development
# software         software installation packages
# source           framework source code
# maven_repository local maven repository
# shell            run scripts
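The whole layout can be created up front (a small convenience sketch using exactly the directory names listed above):

mkdir -p ~/app ~/data ~/lib ~/software ~/source ~/maven_repository ~/shell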

3. Environment Setup

3.1 Install JDK 1.8

# Server login user: zcx
# software path: ~/software
# Upload the JDK tarball -- jdk-8u191-linux-x64.tar.gz
[zcx@zoucaoxin software]$ rz -y
# Extract to ~/app
[zcx@zoucaoxin software]$ tar -zxvf jdk-8u191-linux-x64.tar.gz -C ~/app/
# Add environment variables
[zcx@zoucaoxin app]$ vim ~/.bash_profile
# Append the following
export JAVA_HOME=/home/zcx/app/jdk1.8.0_191
export JRE_HOME=/home/zcx/app/jdk1.8.0_191/jre
export CLASSPATH=.:$JAVA_HOME/lib:$JRE_HOME/lib:$CLASSPATH
export PATH=$JAVA_HOME/bin:$JRE_HOME/bin:$PATH
# Reload the profile so the new variables take effect
source ~/.bash_profile
# Verify the installation
java -version
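Beyond java -version, it is worth confirming that the shell resolves the new JDK rather than a system one (a quick check; the expected paths follow the profile above):

echo $JAVA_HOME    # expect /home/zcx/app/jdk1.8.0_191
which java         # expect /home/zcx/app/jdk1.8.0_191/bin/java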

3.2 Install Hadoop (CDH 5.7.0)

# Server login user: zcx
# Download address: http://archive.cloudera.com/cdh5/cdh/5/
# software path: ~/software

# Configure hosts
[zcx@zoucaoxin ~]$ sudo vim /etc/hosts
# On an Aliyun server, use the server's internal IP for this node and external IPs for the other nodes; the node name is the hostname
192.168.48.143 zoucaoxin

# Configure passwordless SSH login
[zcx@zoucaoxin ~]$ ssh-keygen -t rsa    # press Enter at every prompt
# ls -la shows the hidden .ssh directory under ~
# cd .ssh/ to enter it
# Copy id_rsa.pub from ~/.ssh into the well-known authorized_keys file
[zcx@zoucaoxin ~]$ cp ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys
# cat authorized_keys to inspect the file
# Verify the configuration:
# the first login asks you to confirm the host; after exit, later logins need no prompt
[zcx@zoucaoxin ~]$ ssh localhost

# Upload hadoop -- hadoop-2.6.0-cdh5.7.0.tar.gz
[zcx@zoucaoxin software]$ rz -y
# Extract to ~/app
[zcx@zoucaoxin software]$ tar -zxvf hadoop-2.6.0-cdh5.7.0.tar.gz -C ~/app/
# Add environment variables
[zcx@zoucaoxin app]$ vim ~/.bash_profile
# Append the following
export HADOOP_HOME=/home/zcx/app/hadoop-2.6.0-cdh5.7.0
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
# Reload the profile
source ~/.bash_profile

# Configure hadoop-env.sh
[zcx@zoucaoxin hadoop-2.6.0-cdh5.7.0]$ vim etc/hadoop/hadoop-env.sh
# Point JAVA_HOME at the installed JDK
export JAVA_HOME=/home/zcx/app/jdk1.8.0_191

# Configure core-site.xml
[zcx@zoucaoxin hadoop-2.6.0-cdh5.7.0]$ vim etc/hadoop/core-site.xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://zoucaoxin:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/zcx/app/tmp</value>
  </property>
</configuration>

# Configure hdfs-site.xml
[zcx@zoucaoxin hadoop-2.6.0-cdh5.7.0]$ vim etc/hadoop/hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

# Start HDFS
# Format the filesystem -- run this only the first time, never repeat it
# (the hadoop command lives in hadoop's bin directory)
[zcx@zoucaoxin hadoop-2.6.0-cdh5.7.0]$ hadoop namenode -format
# Start the NameNode and DataNode processes
# (start-dfs.sh lives in hadoop's sbin directory)
[zcx@zoucaoxin hadoop-2.6.0-cdh5.7.0]$ start-dfs.sh
# Verify the startup
[zcx@zoucaoxin hadoop-2.6.0-cdh5.7.0]$ jps
# DataNode, SecondaryNameNode, and NameNode should all be listed
# Open the NameNode web UI (on an Aliyun server, open port 50070 first):
# http://192.168.48.143:50070
# If the page will not load, check whether the firewall is disabled

# Configure YARN
# Edit mapred-site.xml under hadoop-2.6.0-cdh5.7.0/etc/hadoop
[zcx@zoucaoxin hadoop-2.6.0-cdh5.7.0]$ cd etc/hadoop
[zcx@zoucaoxin hadoop]$ cp mapred-site.xml.template mapred-site.xml
[zcx@zoucaoxin hadoop]$ vim mapred-site.xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
# Edit yarn-site.xml under hadoop-2.6.0-cdh5.7.0/etc/hadoop
[zcx@zoucaoxin hadoop]$ vim yarn-site.xml
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>

# Start YARN (on an Aliyun server, open port 8088 first)
[zcx@zoucaoxin hadoop]$ start-yarn.sh
# Verify the startup
[zcx@zoucaoxin hadoop]$ jps
# ResourceManager and NodeManager should be listed
# Web UI: http://192.168.48.143:8088/cluster
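With HDFS and YARN running, a short smoke test confirms that writes and reads work end to end (a minimal sketch; the file name and HDFS path are arbitrary):

echo "hello hadoop" > ~/data/hello.txt
hdfs dfs -mkdir -p /user/zcx
hdfs dfs -put ~/data/hello.txt /user/zcx/
hdfs dfs -cat /user/zcx/hello.txt    # should print: hello hadoop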

3.3 Install MySQL

# Download mysql
[zcx@zoucaoxin software]$ wget http://repo.mysql.com/mysql-community-release-el7-5.noarch.rpm
[zcx@zoucaoxin ~]$ sudo rpm -ivh mysql-community-release-el7-5.noarch.rpm
[zcx@zoucaoxin ~]$ sudo yum update
[zcx@zoucaoxin ~]$ sudo yum install mysql mysql-server
# Start mysql
[zcx@zoucaoxin ~]$ sudo systemctl start mysqld
# Set the root password
[zcx@zoucaoxin ~]$ mysqladmin -u root password "zoucaoxin"
# Connect to mysql as root
[zcx@zoucaoxin software]$ mysql -u root -p
Enter password:
# Enable remote access; "zoucaoxin" is the mysql password set above
mysql> grant all privileges on *.* to root@'%' identified by "zoucaoxin";
# You can now connect to the VM's (or remote server's) mysql with a tool such as Navicat
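Before moving on, it is worth confirming the remote-access grant is in place (a quick verification query against the standard mysql.user table):

mysql -u root -p -e "select host, user from mysql.user;"
# expect a row with host '%' and user 'root'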

3.4 Install Hive

# Download address: http://archive.cloudera.com/cdh5/cdh/5/
# Download to ~/software, then extract to ~/app
# Configure environment variables
export HIVE_HOME=/home/zcx/app/hive-1.1.0-cdh5.7.0
export PATH=$HIVE_HOME/bin:$PATH
# Set HADOOP_HOME in hive-env.sh
[zcx@zoucaoxin hive-1.1.0-cdh5.7.0]$ cd conf/
[zcx@zoucaoxin conf]$ cp hive-env.sh.template hive-env.sh
[zcx@zoucaoxin conf]$ vim hive-env.sh
HADOOP_HOME=/home/zcx/app/hadoop-2.6.0-cdh5.7.0
# Configure the mysql metastore connection by creating hive-site.xml
# pwd: /home/zcx/app/hive-1.1.0-cdh5.7.0/conf
[zcx@zoucaoxin conf]$ vim hive-site.xml
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/sparksql?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <!-- mysql username -->
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
  </property>
  <!-- mysql password -->
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>zoucaoxin</value>
  </property>
</configuration>
# Copy the mysql driver jar mysql-connector-java-5.1.27.jar into hive's lib directory
[zcx@zoucaoxin software]$ cd ~/app/hive-1.1.0-cdh5.7.0/lib/
[zcx@zoucaoxin lib]$ cp ~/software/mysql-connector-java-5.1.27.jar .
# Start hive
[zcx@zoucaoxin ~]$ hive
# If hive fails to start, check that hadoop started successfully and that all three HDFS processes are running
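A one-liner exercises both HDFS and the MySQL metastore at once (a minimal smoke test; test_db is an arbitrary name):

hive -e "create database if not exists test_db; show databases;"
# afterwards, the sparksql metastore database should also appear in mysql:
mysql -u root -p -e "show databases;"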

3.5 Install Scala 2.12.8

# Download scala-2.12.8.tgz from the official site
# Install Scala
# Upload scala-2.12.8.tgz to ~/software
[zcx@zoucaoxin software]$ rz -y
# Extract to ~/app
[zcx@zoucaoxin software]$ tar -zxvf scala-2.12.8.tgz -C ~/app/
# Add environment variables
[zcx@zoucaoxin app]$ vim ~/.bash_profile
# Append the following
export SCALA_HOME=/home/zcx/app/scala-2.12.8
export PATH=$SCALA_HOME/bin:$PATH
# Reload the profile
[zcx@zoucaoxin app]$ source ~/.bash_profile
# Verify: running scala should drop you into the scala REPL
[zcx@zoucaoxin app]$ scala
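The installation can also be checked non-interactively with scala's -e flag (a one-line sketch):

scala -e 'println("Scala " + util.Properties.versionNumberString)'    # expect: Scala 2.12.8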

3.6 Install Maven 3.5.4

# Download to ~/software, then extract to ~/app
# Configure environment variables
[zcx@zoucaoxin app]$ vim ~/.bash_profile
export MAVEN_HOME=/home/zcx/app/apache-maven-3.5.4
export PATH=$MAVEN_HOME/bin:$PATH
# Reload the profile
[zcx@zoucaoxin app]$ source ~/.bash_profile
# Verify the configuration
mvn -v
# Create the maven_repository directory under ~
mkdir maven_repository
# Point the local repository at it in conf/settings.xml
<localRepository>/home/zcx/maven_repository</localRepository>
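To confirm Maven actually picked up the new local repository, the maven-help-plugin can print the effective setting (the grep is only for readability):

mvn help:evaluate -Dexpression=settings.localRepository | grep maven_repository
# expect: /home/zcx/maven_repository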

3.7 Compile Spark from Source

# Download address: https://archive.apache.org/dist/spark/
# Note: per the Spark website, the latest release 2.4.3 depends on Scala 2.12 and requires Maven 3.5.4 or higher
# Download spark-2.4.3.tgz to ~/source and extract it there
[zcx@zoucaoxin source]$ tar -zxvf spark-2.4.3.tgz
# Build instructions: https://spark.apache.org/docs/latest/building-spark.html#apache-maven
# Edit pom.xml in the extracted spark-2.4.3 directory; inside <repositories>, after
# <repository>
#   <id>central</id>
#   <!-- This should be at top, it makes maven try the central repo first and then others and hence faster dep resolution -->
#   <name>Maven Repository</name>
#   <url>https://repo.maven.apache.org/maven2</url>
#   ……
# </repository>
# add the cloudera repository:
<repository>
  <id>cloudera</id>
  <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
</repository>

# Build option 1: ./build/mvn
./build/mvn -Pyarn -Phive -Phive-thriftserver -Phadoop-2.6 -Dhadoop.version=2.6.0-cdh5.7.0 -DskipTests clean package

# Build option 2: ./dev/make-distribution.sh
# Flags:
# --name: name suffix of the resulting Spark package
# --tgz: compress the result as a tgz
# -Psparkr: build Spark with R support
# -Phadoop-2.6: build against the hadoop-2.6 profile; the available profiles are listed in pom.xml at the source root
# -Phive and -Phive-thriftserver: build Spark with Hive support
# -Pmesos: build Spark with support for running on Mesos
# -Pyarn: build Spark with support for running on YARN
[zcx@zoucaoxin spark-2.4.3]$ ./dev/make-distribution.sh --name 2.6.0-cdh5.7.0 --tgz -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver -Dhadoop.version=2.6.0-cdh5.7.0
# When the build finishes, the packaged distribution appears in the source root: spark-2.4.3-bin-2.6.0-cdh5.7.0.tgz
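The build is memory-hungry. The Spark build documentation recommends raising Maven's heap before invoking plain mvn; build/mvn and make-distribution.sh appear to set this themselves when it is unset:

export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"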

3.8 Set Up Spark in Local Mode

# Extract the freshly built spark package to ~/app
[zcx@zoucaoxin spark-2.4.3]$ tar -zxvf spark-2.4.3-bin-2.6.0-cdh5.7.0.tgz -C ~/app/
# The *.cmd files in bin are Windows launchers and can all be removed
[zcx@zoucaoxin bin]$ rm -rf *.cmd
# Configure the SPARK_HOME environment variable
[zcx@zoucaoxin bin]$ vim ~/.bash_profile
export SPARK_HOME=/home/zcx/app/spark-2.4.3-bin-2.6.0-cdh5.7.0
export PATH=$SPARK_HOME/bin:$PATH
[zcx@zoucaoxin bin]$ source ~/.bash_profile
# Start local mode
# --master: the mode spark starts in
# [2]: run with two worker threads
[zcx@zoucaoxin spark-2.4.3-bin-2.6.0-cdh5.7.0]$ spark-shell --master local[2]
# Web UI: http://192.168.48.143:4040/
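To run a complete job rather than an interactive shell, the bundled examples are convenient (run-example defaults to local mode; SparkPi's argument is the number of partitions):

[zcx@zoucaoxin ~]$ run-example SparkPi 10
# look for a line like "Pi is roughly 3.14..." in the output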

3.9 Set Up Spark in Standalone Mode

# Reference: https://spark.apache.org/docs/latest/spark-standalone.html
# Edit spark-env.sh in the conf directory
# pwd: /home/zcx/app/spark-2.4.3-bin-2.6.0-cdh5.7.0/conf
[zcx@zoucaoxin conf]$ cp spark-env.sh.template spark-env.sh
[zcx@zoucaoxin conf]$ vim spark-env.sh
SPARK_MASTER_HOST=zoucaoxin
SPARK_WORKER_CORES=2
SPARK_WORKER_MEMORY=2g
SPARK_WORKER_INSTANCES=1   # number of workers to start
# Edit the slaves file in the conf directory
[zcx@zoucaoxin conf]$ vim slaves
zoucaoxin
# Edit spark-config.sh in the sbin directory and add the JAVA_HOME path
[zcx@zoucaoxin sbin]$ vim spark-config.sh
export JAVA_HOME=/home/zcx/app/jdk1.8.0_191
# Start the processes from the sbin directory
[zcx@zoucaoxin sbin]$ ./start-all.sh
starting org.apache.spark.deploy.master.Master, logging to /home/zcx/app/spark-2.4.3-bin-2.6.0-cdh5.7.0/logs/spark-zcx-org.apache.spark.deploy.master.Master-1-zoucaoxin.out
zoucaoxin: starting org.apache.spark.deploy.worker.Worker, logging to /home/zcx/app/spark-2.4.3-bin-2.6.0-cdh5.7.0/logs/spark-zcx-org.apache.spark.deploy.worker.Worker-1-zoucaoxin.out

# Inspect the master log
[zcx@zoucaoxin spark-2.4.3-bin-2.6.0-cdh5.7.0]$ cat /home/zcx/app/spark-2.4.3-bin-2.6.0-cdh5.7.0/logs/spark-zcx-org.apache.spark.deploy.master.Master-1-zoucaoxin.out
Spark Command: /home/zcx/app/jdk1.8.0_191/bin/java -cp /home/zcx/app/spark-2.4.3-bin-2.6.0-cdh5.7.0/conf/:/home/zcx/app/spark-2.4.3-bin-2.6.0-cdh5.7.0/jars/* -Xmx1g org.apache.spark.deploy.master.Master --host zoucaoxin --port 7077 --webui-port 8080
========================================
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
19/05/15 10:49:43 INFO Master: Started daemon with process name: 23060@zoucaoxin
19/05/15 10:49:43 INFO SignalUtils: Registered signal handler for TERM
19/05/15 10:49:43 INFO SignalUtils: Registered signal handler for HUP
19/05/15 10:49:43 INFO SignalUtils: Registered signal handler for INT
19/05/15 10:49:44 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
19/05/15 10:49:44 INFO SecurityManager: Changing view acls to: zcx
19/05/15 10:49:44 INFO SecurityManager: Changing modify acls to: zcx
19/05/15 10:49:44 INFO SecurityManager: Changing view acls groups to:
19/05/15 10:49:44 INFO SecurityManager: Changing modify acls groups to:
19/05/15 10:49:44 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(zcx); groups with view permissions: Set(); users with modify permissions: Set(zcx); groups with modify permissions: Set()
19/05/15 10:49:45 INFO Utils: Successfully started service 'sparkMaster' on port 7077.
19/05/15 10:49:45 INFO Master: Starting Spark master at spark://zoucaoxin:7077
19/05/15 10:49:45 INFO Master: Running Spark version 2.4.3
19/05/15 10:49:46 INFO Utils: Successfully started service 'MasterUI' on port 8080.
19/05/15 10:49:46 INFO MasterWebUI: Bound MasterWebUI to 0.0.0.0, and started at http://zoucaoxin:8080
19/05/15 10:49:46 INFO Master: I have been elected leader! New state: ALIVE
19/05/15 10:49:51 INFO Master: Registering worker 192.168.48.143:38593 with 2 cores, 2.0 GB RAM
# The master binds to port 7077; the master UI is on port 8080
# Web UI: http://192.168.48.143:8080/

# Inspect the worker log
[zcx@zoucaoxin spark-2.4.3-bin-2.6.0-cdh5.7.0]$ cat /home/zcx/app/spark-2.4.3-bin-2.6.0-cdh5.7.0/logs/spark-zcx-org.apache.spark.deploy.worker.Worker-1-zoucaoxin.out
Spark Command: /home/zcx/app/jdk1.8.0_191/bin/java -cp /home/zcx/app/spark-2.4.3-bin-2.6.0-cdh5.7.0/conf/:/home/zcx/app/spark-2.4.3-bin-2.6.0-cdh5.7.0/jars/* -Xmx1g org.apache.spark.deploy.worker.Worker --webui-port 8081 spark://zoucaoxin:7077
========================================
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
19/05/15 10:49:47 INFO Worker: Started daemon with process name: 23139@zoucaoxin
19/05/15 10:49:47 INFO SignalUtils: Registered signal handler for TERM
19/05/15 10:49:48 INFO SignalUtils: Registered signal handler for HUP
19/05/15 10:49:48 INFO SignalUtils: Registered signal handler for INT
19/05/15 10:49:48 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
19/05/15 10:49:48 INFO SecurityManager: Changing view acls to: zcx
19/05/15 10:49:48 INFO SecurityManager: Changing modify acls to: zcx
19/05/15 10:49:48 INFO SecurityManager: Changing view acls groups to:
19/05/15 10:49:48 INFO SecurityManager: Changing modify acls groups to:
19/05/15 10:49:48 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(zcx); groups with view permissions: Set(); users with modify permissions: Set(zcx); groups with modify permissions: Set()
19/05/15 10:49:49 INFO Utils: Successfully started service 'sparkWorker' on port 38593.
19/05/15 10:49:50 INFO Worker: Starting Spark worker 192.168.48.143:38593 with 2 cores, 2.0 GB RAM
19/05/15 10:49:50 INFO Worker: Running Spark version 2.4.3
19/05/15 10:49:50 INFO Worker: Spark home: /home/zcx/app/spark-2.4.3-bin-2.6.0-cdh5.7.0
19/05/15 10:49:50 INFO Utils: Successfully started service 'WorkerUI' on port 8081.
19/05/15 10:49:50 INFO WorkerWebUI: Bound WorkerWebUI to 0.0.0.0, and started at http://zoucaoxin:8081
19/05/15 10:49:50 INFO Worker: Connecting to master zoucaoxin:7077...
19/05/15 10:49:50 INFO TransportClientFactory: Successfully created connection to zoucaoxin/192.168.48.143:7077 after 120 ms (0 ms spent in bootstraps)
19/05/15 10:49:51 INFO Worker: Successfully registered with master spark://zoucaoxin:7077
# The worker UI is on port 8081: http://192.168.48.143:8081/

# Stop the spark processes
[zcx@zoucaoxin sbin]$ ./stop-all.sh

# Connect a shell to the cluster
# ./bin/spark-shell --master spark://IP:PORT
# To run an application on the Spark cluster, simply pass the spark://IP:PORT URL of the master to the SparkContext constructor.
[zcx@zoucaoxin bin]$ spark-shell --master spark://zoucaoxin:7077
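A non-interactive job can be submitted to the cluster the same way (a sketch; the examples jar name depends on the Scala version the build used, so a glob is safer than hard-coding it):

[zcx@zoucaoxin ~]$ spark-submit --master spark://zoucaoxin:7077 \
  --class org.apache.spark.examples.SparkPi \
  $SPARK_HOME/examples/jars/spark-examples_*.jar 100
# the finished application should then appear under "Completed Applications" at http://192.168.48.143:8080/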