3.3 Hands-on Practice

Follow the detailed configuration steps in Section 3.1.2 and in Chapter 2. Once deployment is complete, you can work through the experiments below (Hadoop 2.6 and Hive 1.2.1 are used by default).

Practice 1: Hive tables

1) Download the file "02-上机实验/visits_data.txt" and inspect the data.

[root@slave2 opt]# head -n 5 visits_data.txt
BUCKLEY      SUMMER     10/12/2010 14:48    10/12/2010 14:45    WH
CLOONEY      GEORGE     10/12/2010 14:47    10/12/2010 14:45    WH
PRENDERGAST  JOHN       10/12/2010 14:48    10/12/2010 14:45    WH
LANIER       JAZMIN                         10/13/2010 13:00    WH    BILL SIGNING/
MAYNARD      ELIZABETH  10/13/2010 12:34    10/13/2010 13:00    WH    BILL SIGNING/

visits_data.txt contains six columns, separated by '\t': last name, first name, arrival time, scheduled time, meeting location, and comment.
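Before defining a table over the file, it can be worth confirming that every record really has the expected number of tab-separated fields. A minimal sanity-check sketch (this command is our assumption, not part of the book's procedure):

# hypothetical check (not in the book): print the distribution of field counts per line
awk -F'\t' '{print NF}' visits_data.txt | sort -n | uniq -c

A single dominant field count indicates a consistent layout; stray counts point at malformed records (rows such as LANIER's, which lack an arrival time, may carry an empty field rather than a missing one).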
2) Download "02-上机实验/visits.hive" and view it.

[root@slave2 opt]# cat visits.hive
create table people_visits (
    last_name string,
    first_name string,
    arrival_time string,
    scheduled_time string,
    meeting_location string,
    info_comment string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

This is the DDL for creating a new table in Hive; running it creates the people_visits table.
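If you need to re-run the script, for instance after a failed attempt, the DDL can be guarded so that it is idempotent. A minimal sketch of such a variant (the guards are our addition; they are not in the book's visits.hive):

-- hypothetical re-runnable variant of visits.hive (not the book's version)
DROP TABLE IF EXISTS people_visits;    -- discard any earlier copy of the table
CREATE TABLE IF NOT EXISTS people_visits (
    last_name string,
    first_name string,
    arrival_time string,
    scheduled_time string,
    meeting_location string,
    info_comment string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

Note that DROP TABLE on a managed table also deletes its HDFS data, so only use the guard when starting over from scratch is acceptable.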
3) Use the hive command to create the people_visits table.

[root@slave2 bin]# ./hive -f /opt/visits.hive
Logging initialized using configuration in jar:file:/opt/apache-hive-1.2.1-bin/lib/hive-common-1.2.1.jar!/hive-log4j.properties
OK
Time taken: 2.391 seconds

4) Use the hive shell to look at the table just generated.

[root@slave2 ~]# hive
Logging initialized using configuration in jar:file:/opt/apache-hive-1.2.1-bin/lib/hive-common-1.2.1.jar!/hive-log4j.properties
hive> show tables;
OK
people_visits
Time taken: 1.344 seconds, Fetched: 1 row(s)
hive> describe people_visits;
OK
last_name           string
first_name          string
arrival_time        string
scheduled_time      string
meeting_location    string
info_comment        string
Time taken: 0.338 seconds, Fetched: 6 row(s)

Here you can see the newly created table and its schema.

5) Load the data.
① Query the table to check its contents.

hive> select * from people_visits limit 10;
OK
Time taken: 0.863 seconds

The table is still empty.

② Use the hadoop fs command to copy visits_data.txt into the table's directory on HDFS, /user/hive/warehouse/people_visits.

[root@slave2 opt]# hadoop fs -put visits_data.txt /user/hive/warehouse/people_visits
[root@slave2 opt]# hadoop fs -ls /user/hive/warehouse/people_visits
-rw-r--r--   3 root supergroup     989239 2015-08-17 10:30 /user/hive/warehouse/people_visits/visits_data.txt

③ Query the data again.

hive> select * from people_visits limit 5;
OK
BUCKLEY      SUMMER     10/12/2010 14:48    10/12/2010 14:45    WH
CLOONEY      GEORGE     10/12/2010 14:47    10/12/2010 14:45    WH
PRENDERGAST  JOHN       10/12/2010 14:48    10/12/2010 14:45    WH
LANIER       JAZMIN                         10/13/2010 13:00    WH    BILL SIGNING/
MAYNARD      ELIZABETH  10/13/2010 12:34    10/13/2010 13:00    WH    BILL SIGNING/
Time taken: 0.155 seconds, Fetched: 5 row(s)

The data is now visible in the table.

6) Run a query that uses MapReduce.

hive> select count(*) from people_visits;
Query ID = root_20150817103724_d20ca51d-06ca-4efb-be59-6f66aec97489
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1439775378077_0003, Tracking URL = http://node101:8088/proxy/application_1439775378077_0003/
Kill Command = /opt/hadoop-2.6.0/bin/hadoop job -kill job_1439775378077_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2015-08-17 10:37:33,759 Stage-1 map = 0%,  reduce = 0%
2015-08-17 10:37:41,432 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.11 sec
2015-08-17 10:37:48,932 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 4.57 sec
MapReduce Total cumulative CPU time: 4 seconds 570 msec
Ended Job = job_1439775378077_0003
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 4.57 sec   HDFS Read: 996387 HDFS Write: 6 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 570 msec
OK
17977
Time taken: 25.92 seconds, Fetched: 1 row(s)

The count(*) query is executed as a MapReduce job and returns the total number of rows in the table.

7) Drop the people_visits table.

hive> drop table people_visits;
OK
Time taken: 1.355 seconds
hive> dfs -ls /user/hive/warehouse/people_visits;
ls: '/user/hive/warehouse/people_visits': No such file or directory
Command failed with exit code = 1
Query returned non-zero code: 1, cause: null

Once the table is dropped, its data in HDFS is deleted along with it: people_visits was a managed (internal) table.
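In step 5 the data file was pushed into the warehouse directory with hadoop fs -put; the same result can be achieved from inside the hive shell with a LOAD DATA statement. A minimal sketch, assuming visits_data.txt still sits under /opt on the client machine (the statement is our alternative, not the book's):

-- hypothetical alternative to 'hadoop fs -put' in step 5 (run in the hive shell)
LOAD DATA LOCAL INPATH '/opt/visits_data.txt' INTO TABLE people_visits;

With LOCAL, Hive copies the file from the client's local file system into the table's warehouse directory; without LOCAL, the path is taken to be an HDFS path and the file is moved rather than copied, as Practice 2 below demonstrates.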
Practice 2: Hive external tables

1) Copy "02-上机实验/names.txt" to the /opt directory of the client machine, then upload it to HDFS.

[root@slave2 ~]# hadoop fs -put /opt/names.txt /user/root/names.txt
[root@slave2 ~]# hadoop fs -ls /user/root/names.txt
-rw-r--r--   3 root supergroup         78 2015-08-17 11:11 /user/root/names.txt

2) Create the /user/root/hivedemo directory on HDFS.

[root@slave2 ~]# hadoop fs -mkdir /user/root/hivedemo

3) Create a Hive external table whose data location is /user/root/hivedemo.

hive> create external table names(id int, name string)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    > LOCATION '/user/root/hivedemo';
OK
Time taken: 0.206 seconds

4) Load the data into the external table names.

hive> load data inpath '/user/root/names.txt' into table names;
Loading data to table default.names
Table default.names stats: [numFiles=0, numRows=0, totalSize=0, rawDataSize=0]
OK
Time taken: 0.451 seconds

5) Query the data in the table.

hive> select * from names;
OK
0    Rich
1    Barry
2    George
3    Ulf
4    Danielle
5    Tom
6    manish
7    Brian
8    Mark
Time taken: 0.102 seconds, Fetched: 9 row(s)
hive> dfs -ls hivedemo;
Found 1 items
-rwxr-xr-x   3 root supergroup         78 2015-08-17 11:11 hivedemo/names.txt
hive> dfs -ls /user/hive/warehouse;

The table contains the data, and the data is stored in the specified /user/root/hivedemo directory rather than in the default /user/hive/warehouse.

6) Drop the table.

hive> drop table names;
OK
Time taken: 0.136 seconds
hive> show tables;
OK
Time taken: 0.049 seconds
hive> dfs -ls hivedemo;
Found 1 items
-rwxr-xr-x   3 root supergroup         78 2015-08-17 11:11 hivedemo/names.txt

Although the table has been dropped, its data in HDFS has not been deleted: this is the defining behavior of an external table.
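Since dropping an external table leaves its data files in place, they have to be removed explicitly once you are finished with them. A minimal cleanup sketch using the hive shell's dfs pass-through already seen above (this step is our addition, not part of the book's procedure):

-- hypothetical cleanup (run in the hive shell); removes the external table's directory
dfs -rm -r /user/root/hivedemo;

The equivalent from the OS shell is hadoop fs -rm -r /user/root/hivedemo.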