3.3 Hands-on Practice

Follow the detailed configuration steps in Section 3.1.2 and in Chapter 2. Once deployment is complete, you can work through the experiments below (Hadoop 2.6 and Hive 1.2.1 are used by default).

Practice 1: Hive tables

1) Download the file "02-上机实验/visits_data.txt" and inspect the data.

[root@slave2 opt]# head -n 5 visits_data.txt
BUCKLEY      SUMMER     10/12/2010 14:48    10/12/2010 14:45    WH
CLOONEY      GEORGE     10/12/2010 14:47    10/12/2010 14:45    WH
PRENDERGAST  JOHN       10/12/2010 14:48    10/12/2010 14:45    WH
LANIER       JAZMIN                         10/13/2010 13:00    WH    BILL SIGNING/
MAYNARD      ELIZABETH  10/13/2010 12:34    10/13/2010 13:00    WH    BILL SIGNING/

visits_data.txt contains six columns, separated by '\t': last name, first name, arrival time, scheduled time, meeting location, and comment.
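Before defining a table over the file, it can be worth confirming that every record really has the expected number of tab-separated fields. A minimal sanity-check sketch (this command is our assumption, not part of the book's procedure):

# hypothetical check (not in the book): print the distribution of field counts per line
awk -F'\t' '{print NF}' visits_data.txt | sort -n | uniq -c

A single dominant field count indicates a consistent layout; stray counts point at malformed records (rows such as LANIER's, which lack an arrival time, may carry an empty field rather than a missing one).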
2) Download "02-上机实验/visits.hive" and view it.

[root@slave2 opt]# cat visits.hive
create table people_visits (
    last_name string,
    first_name string,
    arrival_time string,
    scheduled_time string,
    meeting_location string,
    info_comment string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

This is the DDL for creating a new table in Hive; running it creates the people_visits table.
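If you need to re-run the script, for instance after a failed attempt, the DDL can be guarded so that it is idempotent. A minimal sketch of such a variant (the guards are our addition; they are not in the book's visits.hive):

-- hypothetical re-runnable variant of visits.hive (not the book's version)
DROP TABLE IF EXISTS people_visits;    -- discard any earlier copy of the table
CREATE TABLE IF NOT EXISTS people_visits (
    last_name string,
    first_name string,
    arrival_time string,
    scheduled_time string,
    meeting_location string,
    info_comment string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

Note that DROP TABLE on a managed table also deletes its HDFS data, so only use the guard when starting over from scratch is acceptable.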
3) Use the hive command to create the people_visits table.

[root@slave2 bin]# ./hive -f /opt/visits.hive
Logging initialized using configuration in jar:file:/opt/apache-hive-1.2.1-bin/lib/hive-common-1.2.1.jar!/hive-log4j.properties
OK
Time taken: 2.391 seconds

4) Use the hive shell to look at the table just generated.

[root@slave2 ~]# hive
Logging initialized using configuration in jar:file:/opt/apache-hive-1.2.1-bin/lib/hive-common-1.2.1.jar!/hive-log4j.properties
hive> show tables;
OK
people_visits
Time taken: 1.344 seconds, Fetched: 1 row(s)
hive> describe people_visits;
OK
last_name           string
first_name          string
arrival_time        string
scheduled_time      string
meeting_location    string
info_comment        string
Time taken: 0.338 seconds, Fetched: 6 row(s)

Here you can see the newly created table and its schema.

5) Load the data.
① Query the table to check its contents.

hive> select * from people_visits limit 10;
OK
Time taken: 0.863 seconds

The table is still empty.

② Use the hadoop fs command to copy visits_data.txt into the table's directory on HDFS, /user/hive/warehouse/people_visits.

[root@slave2 opt]# hadoop fs -put visits_data.txt /user/hive/warehouse/people_visits
[root@slave2 opt]# hadoop fs -ls /user/hive/warehouse/people_visits
-rw-r--r--   3 root supergroup     989239 2015-08-17 10:30 /user/hive/warehouse/people_visits/visits_data.txt

③ Query the data again.

hive> select * from people_visits limit 5;
OK
BUCKLEY      SUMMER     10/12/2010 14:48    10/12/2010 14:45    WH
CLOONEY      GEORGE     10/12/2010 14:47    10/12/2010 14:45    WH
PRENDERGAST  JOHN       10/12/2010 14:48    10/12/2010 14:45    WH
LANIER       JAZMIN                         10/13/2010 13:00    WH    BILL SIGNING/
MAYNARD      ELIZABETH  10/13/2010 12:34    10/13/2010 13:00    WH    BILL SIGNING/
Time taken: 0.155 seconds, Fetched: 5 row(s)

The data is now visible in the table.

6) Run a query that uses MapReduce.

hive> select count(*) from people_visits;
Query ID = root_20150817103724_d20ca51d-06ca-4efb-be59-6f66aec97489
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1439775378077_0003, Tracking URL = http://node101:8088/proxy/application_1439775378077_0003/
Kill Command = /opt/hadoop-2.6.0/bin/hadoop job -kill job_1439775378077_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2015-08-17 10:37:33,759 Stage-1 map = 0%,  reduce = 0%
2015-08-17 10:37:41,432 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.11 sec
2015-08-17 10:37:48,932 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 4.57 sec
MapReduce Total cumulative CPU time: 4 seconds 570 msec
Ended Job = job_1439775378077_0003
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 4.57 sec   HDFS Read: 996387 HDFS Write: 6 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 570 msec
OK
17977
Time taken: 25.92 seconds, Fetched: 1 row(s)

The count(*) query is executed as a MapReduce job and returns the total number of rows in the table.

7) Drop the people_visits table.

hive> drop table people_visits;
OK
Time taken: 1.355 seconds
hive> dfs -ls /user/hive/warehouse/people_visits;
ls: '/user/hive/warehouse/people_visits': No such file or directory
Command failed with exit code = 1
Query returned non-zero code: 1, cause: null

Once the table is dropped, its data in HDFS is deleted along with it: people_visits was a managed (internal) table.
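In step 5 the data file was pushed into the warehouse directory with hadoop fs -put; the same result can be achieved from inside the hive shell with a LOAD DATA statement. A minimal sketch, assuming visits_data.txt still sits under /opt on the client machine (the statement is our alternative, not the book's):

-- hypothetical alternative to 'hadoop fs -put' in step 5 (run in the hive shell)
LOAD DATA LOCAL INPATH '/opt/visits_data.txt' INTO TABLE people_visits;

With LOCAL, Hive copies the file from the client's local file system into the table's warehouse directory; without LOCAL, the path is taken to be an HDFS path and the file is moved rather than copied, as Practice 2 below demonstrates.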
Practice 2: Hive external tables

1) Copy "02-上机实验/names.txt" to the /opt directory of the client machine, then upload it to HDFS.

[root@slave2 ~]# hadoop fs -put /opt/names.txt /user/root/names.txt
[root@slave2 ~]# hadoop fs -ls /user/root/names.txt
-rw-r--r--   3 root supergroup         78 2015-08-17 11:11 /user/root/names.txt

2) Create the /user/root/hivedemo directory on HDFS.

[root@slave2 ~]# hadoop fs -mkdir /user/root/hivedemo

3) Create a Hive external table whose data location is /user/root/hivedemo.

hive> create external table names(id int, name string)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    > LOCATION '/user/root/hivedemo';
OK
Time taken: 0.206 seconds

4) Load the data into the external table names.

hive> load data inpath '/user/root/names.txt' into table names;
Loading data to table default.names
Table default.names stats: [numFiles=0, numRows=0, totalSize=0, rawDataSize=0]
OK
Time taken: 0.451 seconds

5) Query the data in the table.

hive> select * from names;
OK
0    Rich
1    Barry
2    George
3    Ulf
4    Danielle
5    Tom
6    manish
7    Brian
8    Mark
Time taken: 0.102 seconds, Fetched: 9 row(s)
hive> dfs -ls hivedemo;
Found 1 items
-rwxr-xr-x   3 root supergroup         78 2015-08-17 11:11 hivedemo/names.txt
hive> dfs -ls /user/hive/warehouse;

The table contains the data, and the data is stored in the specified /user/root/hivedemo directory rather than in the default /user/hive/warehouse.

6) Drop the table.

hive> drop table names;
OK
Time taken: 0.136 seconds
hive> show tables;
OK
Time taken: 0.049 seconds
hive> dfs -ls hivedemo;
Found 1 items
-rwxr-xr-x   3 root supergroup         78 2015-08-17 11:11 hivedemo/names.txt

Although the table has been dropped, its data in HDFS has not been deleted: this is the defining behavior of an external table.
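Since dropping an external table leaves its data files in place, they have to be removed explicitly once you are finished with them. A minimal cleanup sketch using the hive shell's dfs pass-through already seen above (this step is our addition, not part of the book's procedure):

-- hypothetical cleanup (run in the hive shell); removes the external table's directory
dfs -rm -r /user/root/hivedemo;

The equivalent from the OS shell is hadoop fs -rm -r /user/root/hivedemo.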