Main contents:
1. Install CentOS 7 in VMware Workstation Pro, install and configure Hadoop, and deploy it in both pseudo-distributed and fully distributed modes
2. Run the bundled MapReduce example program WordCount
3. Write a program that uses Hadoop's Java API to perform simple file-system reads and writes
4. Write a program that uses Hadoop's Java API to launch the bundled MapReduce example WordCount
5. Install and configure HBase, and write a program that uses the HBase Java API to perform simple CRUD operations
6. Install Redis and MongoDB and learn their basic operations
7. Install and configure Hive, MySQL, and the MySQL JDBC driver, and write HiveSQL statements to perform simple CRUD operations
VMware Workstation Pro: 12.5.7 (the version I used)
CentOS: 7.6 (the latest release at the time; an older image can also be brought up to date with yum -y update)
JDK: 8u211-linux-x64
Hadoop: 2.7.3
HBase: 1.4.9 (the current stable release)
Hive: 2.3.4
MySQL: 5.7.25
Scala: 2.12.8
Spark: 2.4.2
Eclipse: latest release
You can follow a standard guide to install CentOS 7; I skip those steps here. Just note that the virtual machine's network adapter must be set to "Custom" and then "VMnet8 (NAT mode)".
Configure a static IP
Click [NAT Settings] and note down the gateway IP. Then log in to CentOS as root and edit the NIC configuration file (note: ifcfg-ens33 is the NIC name; adjust it to match your machine):
vi /etc/sysconfig/network-scripts/ifcfg-ens33
Change BOOTPROTO=DHCP to BOOTPROTO=static and ONBOOT=no to ONBOOT=yes, then append the following at the end:
IPADDR — the static IP you choose (any address inside your subnet that is not the gateway)
NETMASK — the subnet mask
GATEWAY — the default gateway
Save and quit, then restart the network service:
systemctl restart network
You can bring the system up to the latest version:
yum -y update
Install some common tools:
yum -y install net-tools wget vim
Turn off the firewall:
firewall-cmd --state          # show the firewall status: running / not running
systemctl stop firewalld      # stop the firewall for this session; it starts again on reboot
systemctl disable firewalld   # disable the firewall service permanently
Connect to the machine with XShell (makes copying and pasting commands easier) and upload the installation packages with XFtp (I suggest putting them under /usr/local for easier management; create the folder if it does not exist).
JDK download: https://www.oracle.com/technetwork/java/javase/downloads/index.html
Then install it. Software on CentOS is generally installed in one of the following two ways.
# Installing an RPM package
rpm -ivh jdk-8u212-linux-x64.rpm
# Installing a tar.gz (binary) package
# Note: not every tar.gz is a binary package; some are source packages that you must compile yourself, so check before installing
tar -zxvf jdk-8u212-linux-x64.tar.gz
Configure the environment variables:
vim /etc/profile
Append at the end:
export JAVA_HOME=/usr/java/jdk1.8.0_212
export PATH=$JAVA_HOME/bin:$PATH
Make it take effect:
source /etc/profile
The environment variables for the other software below are configured the same way.
Hadoop download: https://archive.apache.org/dist/hadoop/common/hadoop-2.7.3/
After downloading, you can install it by following https://blog.csdn.net/fyihdg/article/details/82317512 ; be careful and thorough with these steps. Once Hadoop is installed, you can configure and deploy it in pseudo-distributed and fully distributed modes; see https://blog.csdn.net/l_15156024189/article/details/81810553 , which is what I followed. That guide uses jdk-8u144, while I use jdk1.8.0_212, so adjust the paths accordingly. Note that you must first set JAVA_HOME manually in hadoop-env.sh:
# The java implementation to use.
export JAVA_HOME=/usr/java/jdk1.8.0_212
Also, in the pseudo-distributed configuration, do not use hdfs://localhost:9000 in core-site.xml; use hdfs://<hostname>:9000 instead, and add a line "<your IP> <hostname>" at the end of the system hosts file (/etc/hosts). If you want to change the hostname, edit:
vim /etc/hostname
Run WordCount
hdfs dfs -put in.txt /adir
This uploads in.txt from the local current directory to the /adir directory in HDFS. Then run:
hadoop jar hadoop-mapreduce-examples-2.7.3.jar wordcount /adir/in.txt output/
Open http://192.168.180.6:50070 and check the word-frequency results in /user/root/output/part-r-00000.
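For task 3 in the overview (simple file-system reads and writes through Hadoop's Java API), here is a minimal sketch. It assumes the Hadoop 2.7.3 client libraries are on the classpath and that fs.defaultFS points at the NameNode used above (hdfs://bigdata001:9000); the class name HdfsReadWrite and the path /adir/demo.txt are only illustrative.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Adjust to your own NameNode address (the fs.defaultFS value from core-site.xml)
        conf.set("fs.defaultFS", "hdfs://bigdata001:9000");

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/adir/demo.txt");   // example path, change as needed

            // Write: create (or overwrite) a file and write one line into it
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
            }

            // Read: open the same file and print it line by line
            try (FSDataInputStream in = fs.open(file);
                 BufferedReader reader = new BufferedReader(
                         new InputStreamReader(in, StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }
}

Compile it against the Hadoop jars (the command hadoop classpath prints them) and run it with hadoop jar or plain java; it writes one line into HDFS and reads it back.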
HBase download: https://mirrors.tuna.tsinghua.edu.cn/apache/hbase/stable/hbase-1.4.9-bin.tar.gz
HBase documentation: https://hbase.apache.org/book.html
For the concrete steps I followed, see: https://www.2cto.com/net/201805/750759.html
Note: you also need to uncomment export HBASE_MANAGES_ZK=true (this line lives in hbase-env.sh).
When configuring hbase-site.xml, hbase.rootdir must point at the HDFS instance, not the local file system (in practice, use the fs.defaultFS value from Hadoop's core-site.xml plus a /hbase path). You do not need to configure hbase.zookeeper.property.dataDir, hbase.zookeeper.quorum, or hbase.unsafe.stream.capability.enforce. Also follow section 2.3 "Pseudo-Distributed Local Install" of the HBase book to set it up as pseudo-distributed. The final result looks like this:
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://bigdata001:9000/hbase</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
</configuration>
(A small Java CRUD sketch against this HBase setup follows after the Redis and MongoDB notes below.)
Redis:
Download and installation (the install instructions are near the bottom of the page; this one you compile yourself on the machine): https://redis.io/download
For deployment details, see https://www.cnblogs.com/it-cen/p/4295984.html , which is quite thorough.
This is Redis after startup (screenshot not reproduced here).
MongoDB:
Documentation (download and install instructions are all in there): https://docs.mongodb.com/manual/tutorial/install-mongodb-on-red-hat/
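As mentioned above, here is a sketch for task 5 (simple CRUD through the HBase Java API) against the pseudo-distributed HBase configured earlier. It is only a sketch under my setup's assumptions: the hbase-client 1.4.x dependency is on the classpath; the table test, column family cf, and row/column names are made-up examples; and the explicit ZooKeeper quorum setting is only needed if hbase-site.xml is not on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseCrudDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Only needed if hbase-site.xml is not on the classpath
        conf.set("hbase.zookeeper.quorum", "bigdata001");

        try (Connection connection = ConnectionFactory.createConnection(conf);
             Admin admin = connection.getAdmin()) {

            TableName tableName = TableName.valueOf("test");      // example table name
            if (!admin.tableExists(tableName)) {
                HTableDescriptor desc = new HTableDescriptor(tableName);
                desc.addFamily(new HColumnDescriptor("cf"));      // example column family
                admin.createTable(desc);
            }

            try (Table table = connection.getTable(tableName)) {
                // Create / update: put a cell
                Put put = new Put(Bytes.toBytes("row1"));
                put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("greeting"), Bytes.toBytes("hello hbase"));
                table.put(put);

                // Read: get the cell back
                Result result = table.get(new Get(Bytes.toBytes("row1")));
                byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("greeting"));
                System.out.println("read back: " + Bytes.toString(value));

                // Delete: remove the row
                table.delete(new Delete(Bytes.toBytes("row1")));
            }
        }
    }
}

Note that a Put to an existing cell overwrites it, so the same call covers both create and update.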
Hive:
Download: https://mirrors.tuna.tsinghua.edu.cn/apache/hive/hive-2.3.4/apache-hive-2.3.4-bin.tar.gz
MySQL:
Download (this is the MySQL Yum repository package): https://dev.mysql.com/get/mysql80-community-release-el7-2.noarch.rpm
Installation and configuration: https://dev.mysql.com/doc/mysql-yum-repo-quick-guide/en/
MySQL JDBC driver download: https://dev.mysql.com/downloads/connector/j/
When downloading, choose "Platform Independent" and extract the jar file from the archive.
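Before wiring MySQL into Hive, you could run a small JDBC test like the one below to confirm that the server and the driver work. It is only a sketch under assumed settings: the extracted mysql-connector-java jar is on the classpath, MySQL listens on 127.0.0.1:3306, and user root with password 123456 is used (the same credentials that appear in hive-site.xml further down); adjust to your own setup.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class MySqlSmokeTest {
    public static void main(String[] args) throws Exception {
        // Assumed connection settings for illustration; use your own host, user, and password
        String url = "jdbc:mysql://127.0.0.1:3306/?characterEncoding=UTF-8&serverTimezone=GMT%2B8";
        try (Connection conn = DriverManager.getConnection(url, "root", "123456");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT VERSION()")) {
            while (rs.next()) {
                System.out.println("MySQL version: " + rs.getString(1));
            }
        }
    }
}

Compile it and run with java -cp .:mysql-connector-java-8.0.15.jar MySqlSmokeTest; it should print the server version if everything is wired up.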
For installing Hive you can follow: https://www.cnblogs.com/dxxblog/p/8193967.html
Once it is installed, configure Hive as follows.
First create two folders named intmp and tmp under the Hive home directory:
mkdir /usr/local/hive/apache-hive-2.3.4/intmp
mkdir /usr/local/hive/apache-hive-2.3.4/tmp
① Configure hive-env.sh:
cd /usr/local/hive/apache-hive-2.3.4/conf
cp hive-env.sh.template hive-env.sh
# Set HADOOP_HOME to point to a specific hadoop install directory
HADOOP_HOME=/usr/local/hadoop/hadoop-2.7.3
# Hive Configuration Directory can be controlled by:
export HIVE_CONF_DIR=/usr/local/hive/apache-hive-2.3.4/conf
② Configure hive-site.xml:
cp hive-default.xml.template hive-site.xml
Find the following parameters and set them as shown:
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <!-- note: the & in the URL must be written as &amp; inside the XML file -->
  <value>jdbc:mysql://127.0.0.1:3306/hive?characterEncoding=UTF-8&amp;serverTimezone=GMT+8</value>
  <description>
    JDBC connect string for a JDBC metastore.
    To use SSL to encrypt/authenticate the connection, provide database-specific SSL flag in the connection URL.
    For example, jdbc:postgresql://myhost/db?ssl=true for postgres database.
  </description>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.cj.jdbc.Driver</value>
  <description>Driver class name for a JDBC metastore</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>root</value>
  <description>Username to use against metastore database</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>123456</value>
  <description>password to use against metastore database</description>
</property>
Finally, find the following parameters and configure them like this:
<property>
  <name>hive.exec.local.scratchdir</name>
  <value>/usr/local/hive/apache-hive-2.3.4/tmp/${user.name}</value>
  <description>Local scratch space for Hive jobs</description>
</property>
<property>
  <name>hive.downloaded.resources.dir</name>
  <value>/usr/local/hive/apache-hive-2.3.4/intmp/${hive.session.id}_resources</value>
  <description>Temporary local directory for added resources in the remote file system.</description>
</property>
<property>
  <name>hive.querylog.location</name>
  <value>/usr/local/hive/apache-hive-2.3.4/intmp/${system:user.name}</value>
  <description>Location of Hive run time structured log file</description>
</property>
<property>
  <name>hive.server2.logging.operation.log.location</name>
  <value>/usr/local/hive/apache-hive-2.3.4/intmp/${system:user.name}/operation_logs</value>
  <description>Top level directory where operation logs are stored if logging functionality is enabled</description>
</property>
<property>
  <name>hive.server2.thrift.bind.host</name>
  <value>bigdata</value>
  <description>Bind host on which to run the HiveServer2 Thrift service.</description>
</property>
You also need to add the following property near the top of the file:
<property>
  <name>system:java.io.tmpdir</name>
  <value>/usr/local/hive/apache-hive-2.3.4/intmp</value>
  <description/>
</property>
When this is done, copy the MySQL JDBC driver mysql-connector-java-8.0.15.jar into the /usr/local/hive/apache-hive-2.3.4/lib folder.
Finally, run the following command as the official guide instructs:
schematool -dbType mysql -initSchema
This completes the configuration of the Metastore backed by an external MySQL database server. The console output is as follows:
[root@bigdata001 apache-hive-2.3.4]# schematool -dbType mysql -initSchema
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/hive/apache-hive-2.3.4/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/hadoop/hadoop-2.7.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Metastore connection URL:        jdbc:mysql://127.0.0.1:3306/hive?characterEncoding=UTF-8&serverTimezone=UTC
Metastore Connection Driver :    com.mysql.cj.jdbc.Driver
Metastore connection User:       root
Starting metastore schema initialization to 2.3.0
Initialization script hive-schema-2.3.0.mysql.sql
Initialization script completed
schemaTool completed
Writing HiveQL statements for basic operations on databases, tables, and views:
Run hiveserver2 and beeline (according to the official documentation, the Hive CLI is deprecated in favor of HiveServer2's own Beeline). Before that, you also need to add the following to Hadoop's core-site.xml:
<property>
  <name>hadoop.proxyuser.root.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.root.groups</name>
  <value>*</value>
</property>
Then restart Hadoop and, once it is back up, start hiveserver2:
hiveserver2
Open a new terminal and start beeline (the first time it may report Permission Denied; just close it and run it again):
beeline -u jdbc:hive2://bigdata001:10000 -n root
If this succeeds, you will be dropped into beeline:
[root@bigdata001 ~]# beeline -u jdbc:hive2://bigdata001:10000
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/hive/apache-hive-2.3.4/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/hadoop/hadoop-2.7.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Connecting to jdbc:hive2://bigdata001:10000
Connected to: Apache Hive (version 2.3.4)
Driver: Hive JDBC (version 2.3.4)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 2.3.4 by Apache Hive
0: jdbc:hive2://bigdata001:10000>
Then you can create a database:
0: jdbc:hive2://bigdata:10000> CREATE DATABASE userdb;
No rows affected (0.177 seconds)
0: jdbc:hive2://bigdata:10000> SHOW DATABASES;
+----------------+
| database_name  |
+----------------+
| default        |
| userdb         |
+----------------+
2 rows selected (0.106 seconds)
Create a table:
0: jdbc:hive2://bigdata:10000> USE userdb;
No rows affected (0.119 seconds)
0: jdbc:hive2://bigdata:10000> CREATE TABLE dokx (foo INT, bar STRING);
No rows affected (0.182 seconds)
0: jdbc:hive2://bigdata:10000> SHOW TABLES;
+-----------+
| tab_name  |
+-----------+
| dokx      |
+-----------+
1 row selected (0.132 seconds)
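Besides beeline, the same HiveQL can also be issued from Java through the Hive JDBC driver. The sketch below is only illustrative under assumed settings: the hive-jdbc jars are on the classpath, HiveServer2 listens on bigdata001:10000 as configured above, the user root connects without a password, and the table name dokx2 is made up.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcDemo {
    public static void main(String[] args) throws Exception {
        // Explicitly load the HiveServer2 JDBC driver (harmless if it is already auto-registered)
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Endpoint and database from the beeline session above; adjust to your own cluster
        String url = "jdbc:hive2://bigdata001:10000/userdb";
        try (Connection conn = DriverManager.getConnection(url, "root", "");
             Statement stmt = conn.createStatement()) {

            stmt.execute("CREATE TABLE IF NOT EXISTS dokx2 (foo INT, bar STRING)");

            try (ResultSet rs = stmt.executeQuery("SHOW TABLES")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1));
                }
            }

            stmt.execute("DROP TABLE IF EXISTS dokx2");
        }
    }
}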
Writing HiveQL statements to implement a WordCount program.
Reference: https://blog.csdn.net/mylittlered/article/details/42148863
First upload the file to be counted to HDFS:
[root@bigdata001 ~]# vim a.txt
[root@bigdata001 ~]# hdfs dfs -mkdir /input
[root@bigdata001 ~]# hdfs dfs -put a.txt /input
[root@bigdata001 ~]# hdfs dfs -ls /input
Found 1 items
-rw-r--r--   1 root supergroup      20799 2019-05-10 22:24 /input/a.txt
Open beeline and create an internal table called words:
0: jdbc:hive2://bigdata001:10000> create table words(line string);
No rows affected (0.192 seconds)
Load the file into it:
0: jdbc:hive2://bigdata001:10000> load data inpath '/input/a.txt' overwrite into table words;
No rows affected (0.557 seconds)
Run the WordCount query and save the result into a new table wordcount:
0: jdbc:hive2://bigdata001:10000> create table wordcount as select word, count(1) as count from (select explode(split(line,' ')) as word from words) w group by word order by word;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
No rows affected (43.34 seconds)
View the result:
0: jdbc:hive2://bigdata:10000> select * from wordcount;
+---------------------------------------------+------------------+
|               wordcount.word                | wordcount.count  |
+---------------------------------------------+------------------+
|                                             | 1136             |
| "bing"                                      | 2                |
| "nihao"                                     | 1                |
| "hello"                                     | 1                |
| "Derivative"                                | 1                |
| "License"                                   | 1                |
| "Work"                                      | 1                |
| "You"                                       | 1                |
| "Your")                                     | 1                |
| "world"                                     | 1                |
Scala
First install Scala. Download: https://downloads.lightbend.com/scala/2.12.8/scala-2.12.8.rpm
Spark
Then download and install Spark. Download: https://www.apache.org/dyn/closer.lua/spark/spark-2.4.2/spark-2.4.2-bin-hadoop2.7.tgz
Configure Spark in pseudo-distributed mode: go into the conf directory of the Spark installation (here /usr/local/spark/spark-2.4.2-bin-hadoop2.7/conf) and copy the configuration template:
cp spark-env.sh.template spark-env.sh
Edit spark-env.sh:
export JAVA_HOME=/usr/java/jdk1.8.0_212
export SCALA_HOME=/usr/share/scala
export HADOOP_HOME=/usr/local/hadoop/hadoop-2.7.3
export HADOOP_CONF_DIR=/usr/local/hadoop/hadoop-2.7.3/etc/hadoop
export SPARK_MASTER_HOST=bigdata001
export SPARK_MASTER_PORT=7077
export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native
Then set the SPARK_HOME environment variable and add its bin directory to PATH; also set LD_LIBRARY_PATH to avoid Hadoop native-library problems:
export JAVA_HOME=/usr/java/jdk1.8.0_212
export HADOOP_HOME=/usr/local/hadoop/hadoop-2.7.3
export HBASE_HOME=/usr/local/hbase/hbase-1.4.9
export HIVE_HOME=/usr/local/hive/apache-hive-2.3.4-bin
export SPARK_HOME=/usr/local/spark/spark-2.4.2-bin-hadoop2.7
export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native
export PATH=$JAVA_HOME/bin:$HADOOP_HOME/sbin:$HADOOP_HOME/bin:$HBASE_HOME/bin:$HIVE_HOME/bin:$SPARK_HOME/bin:$PATH
Save the file and run source /etc/profile to apply the changes.
After that, run start-dfs.sh and start-yarn.sh to start Hadoop. Finally, go into Spark's sbin directory and run start-all.sh to start Spark:
./start-all.sh
Check the processes with jps:
[root@bigdata001 sbin]# jps
10005 NameNode
10151 DataNode
10535 ResourceManager
12455 Jps
10345 SecondaryNameNode
12282 Worker
10653 NodeManager
12191 Master
Start the Spark shell:
[root@bigdata001 spark-2.4.2-bin-hadoop2.7]# ./bin/spark-shell
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://bigdata:4040
Spark context available as 'sc' (master = local[*], app id = local-1557543612970).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.2
      /_/

Using Scala version 2.12.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_211)
Type in expressions to have them evaluated.
Type :help for more information.

scala>
Reference: https://blog.csdn.net/bluejoe2000/article/details/41556979
It uses both Java and Scala and is quite detailed; worth referring to.
To build jars on CentOS you first need to install sbt. Download: https://sbt.bintray.com/rpm/sbt-1.2.8.rpm
After installing, run the sbt command once; it starts downloading dependencies, which is very slow. A workaround: https://blog.csdn.net/wawa8899/article/details/74276515
Counting words over a local file in the Spark shell gives the following:

scala> val textFile = sc.textFile("file:///usr/local/spark/mycode/wordcount/word.txt")
textFile: org.apache.spark.rdd.RDD[String] = file:///usr/local/spark/mycode/wordcount/word.txt MapPartitionsRDD[8] at textFile at <console>:24

scala> val wordCount = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
wordCount: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[11] at reduceByKey at <console>:25

scala> wordCount.collect()
res4: Array[(String, Int)] = Array((under,9), (Unless,3), (Contributions),1), (offer,1), (NON-INFRINGEMENT,,1), (agree,1), (its,3), (event,1), (intentionally,2), (Grant,2), (have,2), (include,3), (responsibility,,1), (writing,1), (MERCHANTABILITY,,1), (Contribution,3), (express,2), ("Your"),1), ((i),1), (However,,1), (files;,1), (been,2), (This,1), (stating,1), (conditions.,1), (non-exclusive,,2), (appropriateness,1), (marked,1), (risks,1), (any,28), (IS",2), (filed.,1), (Sections,1), (fee,1), (losses),,1), (out,1), (contract,1), (from,,1), (4.,1), (names,,1), ...

Writing a standalone WordCount program on CentOS and submitting it with spark-submit:

[root@bigdata001 scala-2.12]# spark-submit --class "WordCount" /usr/local/spark/mycode/wordcount/target/scala-2.12/simple-project_2.12-1.0.jar
...
(under,9)
(Contributor,8)
(owner,4)
(executed,1)
(For,3)
(Unless,3)
(Contributions),1)
(modifications,,3)
(reproduce,,1)
(The,2)
(offer,1)
(NON-INFRINGEMENT,,1)
(agree,1)
(legal,1)
(its,3)
(event,1)
(informational,1)
((50%),1)
((or,3)
("Contributor",1)
(document.,1)
(work.,1)
(intentionally,2)
(Grant,2)
(have,2)
...

Writing a Java version of the WordCount program and running it. The code:
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

import java.util.Arrays;

public class JavaWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("Spark WordCount written by java!");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Specify the HDFS path of the file to be counted here
        JavaRDD<String> textFile = sc.textFile("hdfs:///dit/a.txt");

        JavaPairRDD<String, Integer> counts = textFile
                .flatMap(s -> Arrays.asList(s.split(" ")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey((a, b) -> a + b);

        // Specify where to store the output here
        counts.saveAsTextFile("hdfs:///dit/result");

        sc.close();
    }
}
pom.xml:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>wang.oo0oo</groupId>
    <artifactId>sparktest</artifactId>
    <version>1.0</version>

    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.12</artifactId>
            <version>2.4.3</version>
        </dependency>
    </dependencies>

    <build>
        <pluginManagement>
            <plugins>
                <plugin>
                    <artifactId>maven-assembly-plugin</artifactId>
                    <configuration>
                        <appendAssemblyId>false</appendAssemblyId>
                        <descriptorRefs>
                            <descriptorRef>jar-with-dependencies</descriptorRef>
                        </descriptorRefs>
                        <archive>
                            <manifest>
                                <mainClass>JavaWordCount</mainClass>
                            </manifest>
                        </archive>
                    </configuration>
                    <executions>
                        <execution>
                            <id>make-assembly</id>
                            <phase>package</phase>
                            <goals>
                                <goal>assembly</goal>
                            </goals>
                        </execution>
                    </executions>
                </plugin>
                <plugin>
                    <groupId>org.apache.maven.plugins</groupId>
                    <artifactId>maven-compiler-plugin</artifactId>
                    <configuration>
                        <source>8</source>
                        <target>8</target>
                    </configuration>
                </plugin>
            </plugins>
        </pluginManagement>
    </build>
</project>
After packaging, upload the jar to CentOS and run:
./bin/spark-submit --class JavaWordCount --master spark://bigdata001:7077 /usr/local/spark/mycode/sparktest-1.0.jar
View the output in HDFS.
Note: if the run reports that HDFS is in safe mode, you can leave safe mode with:
hadoop dfsadmin -safemode leave
The Spark context Web UI is available on port 4040.