hive练习案例

xiaoxiao2023-10-12 149

本地模式 set hive.exec.mode.local.auto=true;

创建的表：

create table if not exists employee( name string comment 'employee name', salary float comment 'employee salary', subordinates array<string> comment 'names of subordinates', deductions map<string,float> comment 'keys are deductions values are percentages', address struct<street:string, city:string, state:string, zip:int> comment 'home address' ) comment 'description of the table' location '/user/hive/warehouse/mydb.db/employee' tblproperties ('creator'='yoona','date'='20160719');

//注意：此处location放在tblproperties之前 //comment 为注释代码 //TBLPROPERTIES允许开发者定义一些自己的键值对信息。可以对TBLPROPERTIES进行查看和修改（部分可修改）。在TBLPROPERTIES中有一些预定义信息，比如last_modified_user和last_modified_time，其他的一些预定义信息包括：

单分区 create table day_table(id int,content string) partitioned by(dt string) 查询分区： select day_table * from day where day_table.dt='2018-07-087' 分区就是为了缩小查询范围

多分区 create table day_table(id int,content string) partitioned by(dt string,hour string); 分区的修改：增加分区 alter table 表名 add partition(dt=‘2018-0808’) 删除分区： alter table 表名 drop partition(dt=‘2018-0808’) 分桶：字段： cluster by 、sort by、distribute by 指定开启分桶 set hive.enforce.bucketing = true; set mapreduce.job.reduces=4; 创建分桶表： create table stu_buck(Sno int,Sname string,Sex string,Sage int,Sdept string) clustered by(Sno) sorted by(Sno DESC) into 4 buckets row format delimited fields terminated by ‘,’;

准备数据： 95001,李勇,男,20,CS 95002,刘晨,女,19,IS 95003,王敏,女,22,MA 95004,张立,男,19,IS 95005,刘刚,男,18,MA 95006,孙庆,男,23,CS 95007,易思玲,女,19,MA 95008,李娜,女,18,CS 95009,梦圆圆,女,18,MA 95010,孔小涛,男,19,CS 95011,包小柏,男,18,MA 95012,孙花,女,20,CS 95013,冯伟,男,21,CS 95014,王小丽,女,19,CS 95015,王君,男,18,MA 95016,钱国,男,21,MA 95017,王风娟,女,18,IS 95018,王一,女,19,IS 95019,邢小丽,女,19,IS 95020,赵钱,男,21,IS 95021,周二,男,17,MA 95022,郑明,男,20,MA 注： 1分桶表的个数：用户定义HSQL语句设置的reduceTask的个数决定，分桶表已经决定就不能修改，如果要改变分桶，需要重新插入分桶数据 2表的分区的个数：用户自定义的，可以由程序自定义生成，也可以动态增长数据表现： 1分桶表中每个分桶数据可以有多个key值 2分区表中每个分区只有一个key 插入数据： select * from student cluster by(Sno) sort by(Sage); 报错,cluster 和 sort 不能共存所以，用下面的： insert overwrite table t07 select * from t07_tmp cluster by(sno); select * from stu_buck tablesample (bucket 1 out of 2 on sno); 解析：查询第1桶、第（1+2）桶的数据，其中sno是之前按照sno进行分桶。这里的“2”的位置，必须是分桶个数的整数倍或者因子内外部表创建内部表：创建外部表： CREATE external TABLE test1(id INT, content STRING) ROW FORMAT delimited FIELDS TERMINATED BY ',' location '/stu'; 区别：删除内部表时，会删除表和相关的元数据；删除外部表时，会删除相关的元数据，但是那个目录不会删除

修改表：

1.2．修改表

增加分区： ALTER TABLE table_name ADD PARTITION (dt='20170101') location '/user/hadoop/warehouse/table_name/dt=20170101'; //一次添加一个分区 ALTER TABLE table_name ADD PARTITION (dt='2008-08-08', country='us') location '/path/to/us/part080808' PARTITION (dt='2008-08-09', country='us') location '/path/to/us/part080809'; //一次添加多个分区删除分区 ALTER TABLE table_name DROP IF EXISTS PARTITION (dt='2008-08-08'); ALTER TABLE table_name DROP IF EXISTS PARTITION (dt='2008-08-08', country='us'); 修改分区 ALTER TABLE table_name PARTITION (dt='2008-08-08') RENAME TO PARTITION (dt='20080808'); 添加列 ALTER TABLE table_name ADD|REPLACE COLUMNS (col_name STRING); 注：ADD 是代表新增一个字段，新增字段位置在所有列后面(partition 列前) REPLACE 则是表示替换表中所有字段。修改列 test_change (a int, b int, c int); ALTER TABLE test_change CHANGE a a1 INT; 修改 a 字段名表重命名 ALTER TABLE table_name RENAME TO new_table_name

显示命令

show tables; 显示当前数据库所有表 show databases |schemas; 显示所有数据库 show partitions table_name; 显示表分区信息，不是分区表执行报错 show functions; 显示当前版本 hive 支持的所有方法 desc extended table_name; 查看表信息 desc formatted table_name; 查看表信息（格式化美观） describe database database_name; 查看数据库相关信息

like字段： create table t_t9 like t_t8; load字段 Load 在将数据加载到表中时， Hive 不会进行任何转换。加载操作是将数据文件移动到与 Hive表对应的位置的纯复制/移动操作。语法结构 LOAD DATA [LOCAL] INPATH ‘filepath’ [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 …)]

说明： 1、 filepath 相对路径，例如：project/data1 绝对路径，例如：/user/hive/project/data1 完整 URI，例如：hdfs://namenode:9000/user/hive/project/data1 filepath 可以引用一个文件（在这种情况下，Hive 将文件移动到表中），或者它可以是一个目录（在这种情况下， Hive 将把该目录中的所有文件移动到表中）。 2、 LOCAL 如果指定了 LOCAL， load 命令将在本地文件系统中查找文件路径。 load 命令会将 filepath 中的文件复制到目标文件系统中。目标文件系统由表的位置属性决定。被复制的数据文件移动到表的数据对应的位置。如果没有指定 LOCAL 关键字，如果 filepath 指向的是一个完整的 URI，hive 会直接使用这个 URI。否则：如果没有指定 schema 或者 authority，Hive 会使用在 hadoop 配置文件中定义的 schema 和 authority，fs.default.name 指定了 Namenode 的 URI。 3、 OVERWRITE 如果使用了 OVERWRITE 关键字，则目标表（或者分区）中的内容会被删除，然后再将 filepath 指向的文件/目录中的内容添加到表/分区中。如果目标表（分区）已经有一个文件，并且文件名和 filepath 中的文件名冲突，那么现有的文件会被新文件所替代

动态分区插入

set hive.exec.dynamic.partition=true; #是否开启动态分区功能，默认false关闭。 set hive.exec.dynamic.partition.mode=nonstrict; #动态分区的模式，默认strict，表示必须指定至少一个分区为静态分区，nonstrict模式表示允许所有的分区字段都可以使用动态分区。

需求：将dynamic_partition_table中的数据按照时间(day)，插入到目标表d_p_t的相应分区中。

原始表：

create table dynamic_partition_table(day string,ip string)row format delimited fields terminated by ","; load data local inpath '/root/hivedata/dynamic_partition_table.txt' into table dynamic_partition_table; dynamic_partition_table.txt数据为： 2015-05-10,ip1 2015-05-10,ip2 2015-06-14,ip3 2015-06-14,ip4 2015-06-15,ip1 2015-06-15,ip2 目标表： create table d_p_t(ip string) partitioned by (month string,day string); 动态插入： insert overwrite table d_p_t partition (month,day) select ip,substr(day,1,7) as month,day from dynamic_partition_table;

1: jdbc:hive2://node1:10000> select * from d_p_t; ±----------±-------------±------------±-+ | d_p_t.ip | d_p_t.month | d_p_t.day | ±----------±-------------±------------±-+ | ip1 | 2015-05 | 2015-05-10 | | ip2 | 2015-05 | 2015-05-10 | | ip3 | 2015-06 | 2015-06-14 | | ip4 | 2015-06 | 2015-06-14 | | ip1 | 2015-06 | 2015-06-15 | | ip2 | 2015-06 | 2015-06-15 | ±----------±-------------±------------±-+

查询结果导出到文件系统

3、将查询结果保存到指定的文件目录（可以是本地，也可以是hdfs） insert overwrite local directory ‘/home/hadoop/test’ select * from t_p;

insert overwrite directory ‘/aaa/test’ select * from t_p;

Select 基本的 Select 操作语法结构 SELECT [ALL | DISTINCT] select_expr, select_expr, … FROM table_reference JOIN table_other ON expr [WHERE where_condition] [GROUP BY col_list [HAVING condition]] [CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY | ORDER BY col_list] ] [LIMIT number]

说明： 1、order by 会对输入做全局排序，因此只有一个 reducer，会导致当输入规模较大时，需要较长的计算时间。

2、sort by 不是全局排序，其在数据进入 reducer 前完成排序。因此，如果用 sort by 进行排序，并且设置 mapred.reduce.tasks>1，则 sort by 只保证每个 reducer 的输出有序，不保证全局有序。

3、 distribute by(字段)根据指定字段将数据分到不同的 reducer，分发算法是 hash 散列。

4、Cluster by(字段) 除了具有 Distribute by 的功能外，还会对该字段进行排序。如果 distribute 和 sort 的字段是同一个时，此时，cluster by = distribute by + sort by

1分桶下载：

insert overwrite local directory ‘/root/hivedata/aaa’ select * from t07_tmp cluster by(Sno);

select * from student order by sage asc 2分桶排序 insert overwrite local directory ‘/root/aaa888’ select * from student DISTRIBUTE by Sno sort by sage 3全局排序： select * from student order by sage asc

多表关联：

create table a(id int,name string) row format delimited fields terminated by ‘,’;

create table b(id int,name string) row format delimited fields terminated by ‘,’;

关于hive中的各种join

准备数据 1,a 2,b 3,c 4,d 7,y 8,u

2,bb 3,cc 7,yy 9,pp

导入数据 load data local inpath ‘/root/hivedata/a.txt’ into table a; load data local inpath ‘/root/hivedata/b.txt’ into table b;

select * from a cross join b;

cross join（##慎用）返回两个表的笛卡尔积结果，不需要指定关联键。 ±------±--------±------±--------±-+ | a.id | a.name | b.id | b.name | ±------±--------±------±--------±-+ | 1 | a | 2 | bb | | 1 | a | 3 | cc | | 1 | a | 7 | yy | | 1 | a | 9 | pp | | 2 | b | 2 | bb | | 2 | b | 3 | cc | | 2 | b | 7 | yy | | 2 | b | 9 | pp | | 3 | c | 2 | bb | | 3 | c | 3 | cc | | 3 | c | 7 | yy | | 3 | c | 9 | pp | | 4 | d | 2 | bb | | 4 | d | 3 | cc | | 4 | d | 7 | yy | | 4 | d | 9 | pp | | 7 | y | 2 | bb | | 7 | y | 3 | cc | | 7 | y | 7 | yy | | 7 | y | 9 | pp | | 8 | u | 2 | bb | | 8 | u | 3 | cc | | 8 | u | 7 | yy | | 8 | u | 9 | pp | ±------±--------±------±--------±-+

内连接： select * from a inner join b on a.id=b.id; ±------±--------±------±--------±-+ | a.id | a.name | b.id | b.name | ±------±--------±------±--------±-+ | 2 | b | 2 | bb | | 3 | c | 3 | cc | | 7 | y | 7 | yy | ±------±--------±------±--------±-+

左连接： select * from a left join b on a.id=b.id; ±------±--------±------±--------±-+ | a.id | a.name | b.id | b.name | ±------±--------±------±--------±-+ | 1 | a | NULL | NULL | | 2 | b | 2 | bb | | 3 | c | 3 | cc | | 4 | d | NULL | NULL | | 7 | y | 7 | yy | | 8 | u | NULL | NULL | ±------±--------±------±--------±-+

右连接： select * from a right join b on a.id=b.id; ±------±--------±------±--------±-+ | a.id | a.name | b.id | b.name | ±------±--------±------±--------±-+ | 2 | b | 2 | bb | | 3 | c | 3 | cc | | 7 | y | 7 | yy | | NULL | NULL | 9 | pp | ±------±--------±------±--------±-+

全连接： select * from a full outer join b on a.id=b.id; ±------±--------±------±--------±-+ | a.id | a.name | b.id | b.name | ±------±--------±------±--------±-+ | 1 | a | NULL | NULL | | 2 | b | 2 | bb | | 3 | c | 3 | cc | | 4 | d | NULL | NULL | | 7 | y | 7 | yy | | 8 | u | NULL | NULL | | NULL | NULL | 9 | pp | ±------±--------±------±--------±-+

**hive中的特别join select * from a left semi join b on a.id = b.id; ±------±--------±-+ | a.id | a.name | ±------±--------±-+ | 2 | b | | 3 | c | | 7 | y | ±------±--------±-+ 相当于 select a.id,a.name from a where a.id in (select b.id from b); 在hive中效率极低

select a.id,a.name from a join b on (a.id = b.id);

其他： 1参数的配置 2运算符和dual表 create table dual(id string); load data local inpath ‘/root/hivedata/dual.log’ into table dual; select 1 from dual where 1=1; //相等显示 | _c0 | ±-----±-+ | 1 | ±-----±-+ select 1 from dual where 1=2; //后面不等于，就不显示 ±-----±-+ | _c0 | ±-----±-+ ±-----±-+ select 1 from dual where ‘key’ like ‘foot%’; ±-----±-+ | _c0 | ±-----±-+ ±-----±-+ 加法： select 1+9 from dual where 1=1; ±-----±-+ | _c0 | ±-----±-+ | 10 | ±-----±-+ 除法： select 40/5 from dual where 1=1; ±-----±-+ | _c0 | ±-----±-+ | 8.0 | ±-----±-+ 截取函数： select substring(‘1234’,2,3) from dual; ±-----±-+ | _c0 | ±-----±-+ | 234 | ±-----±-+

自定义分割符 drop table t_bi_reg; create table t_bi_reg(id string,name string) row format serde ‘org.apache.hadoop.hive.serde2.RegexSerDe’ with serdeproperties( ‘input.regex’=’(.)\|\|(.)’, ‘output.format.string’=’%1 $s$ s’ ) stored as textfile; hive>load data local inpath ‘/root/hivedata/bi.dat’ into table t_bi_reg; hive>select * from t_bi_reg;

±-------------±---------------±-+ | t_bi_reg.id | t_bi_reg.name | ±-------------±---------------±-+ | zhangsan | 23 | | lisi | 45 | ±-------------±---------------±-+

联级求和： create table t_access_times(username string,month string,salary int) row format delimited fields terminated by ‘,’;

load data local inpath ‘/root/hivedata/t_access_times.dat’ into table t_access_times;

A,2015-01,5 A,2015-01,15 B,2015-01,5 A,2015-01,8 B,2015-01,25 A,2015-01,5 A,2015-02,4 A,2015-02,6 B,2015-02,10 B,2015-02,5

select * from t_access_times; ±-------------------------±----------------------±-----------------------±-+ | t_access_times.username | t_access_times.month | t_access_times.salary | ±-------------------------±----------------------±-----------------------±-+ | A | 2015-01 | 5 | | A | 2015-01 | 15 | | B | 2015-01 | 5 | | A | 2015-01 | 8 | | B | 2015-01 | 25 | | A | 2015-01 | 5 | | A | 2015-02 | 4 | | A | 2015-02 | 6 | | B | 2015-02 | 10 | | B | 2015-02 | 5 | ±-------------------------±----------------------±-----------------------±-+

1、第一步，先求个用户的月总金额 select username,month,sum(salary) as salary from t_access_times group by username,month tmp ±----------±---------±--------±-+ | username | month | salary | ±----------±---------±--------±-+ | A | 2015-01 | 33 | | A | 2015-02 | 10 | | B | 2015-01 | 30 | | B | 2015-02 | 15 | ±----------±---------±--------±-+

2、第二步，将月总金额表自己连接自己连接 select A.,B. FROM (select username,month,sum(salary) as salary from t_access_times group by username,month) A inner join (select username,month,sum(salary) as salary from t_access_times group by username,month) B on A.username=B.username where B.month <= A.month; ±------------±---------±----------±------------±---------±----------±-+ | a.username | a.month | a.salary | b.username | b.month | b.salary | ±------------±---------±----------±------------±---------±----------±-+ | A | 2015-01 | 33 | A | 2015-01 | 33 | | A | 2015-01 | 33 | A | 2015-02 | 10 | | A | 2015-02 | 10 | A | 2015-01 | 33 | | A | 2015-02 | 10 | A | 2015-02 | 10 | | B | 2015-01 | 30 | B | 2015-01 | 30 | | B | 2015-01 | 30 | B | 2015-02 | 15 | | B | 2015-02 | 15 | B | 2015-01 | 30 | | B | 2015-02 | 15 | B | 2015-02 | 15 | ±------------±---------±----------±------------±---------±----------±-+

注：此处有坑：注意空格 bug：Error: Error while compiling statement: FAILED: ParseException line 2:0 cannot recognize input near ‘select’ ‘username’ ‘,’ in join source (state=42000,code=40000)

3、第三步，从上一步的结果中进行分组查询，分组的字段是a.username a.month 求月累计值：将b.month <= a.month的所有b.salary求和即可 select A.username,A.month,max(A.salary) as salary,sum(B.salary) as accumulate from (select username,month,sum(salary) as salary from t_access_times group by username,month) A inner join (select username,month,sum(salary) as salary from t_access_times group by username,month) B on A.username=B.username where B.month <= A.month group by A.username,A.month order by A.username,A.month;

±------------±---------±--------±------------±-+ | a.username | a.month | salary | accumulate | ±------------±---------±--------±------------±-+ | A | 2015-01 | 33 | 33 | | A | 2015-02 | 10 | 43 | | B | 2015-01 | 30 | 30 | | B | 2015-02 | 15 | 45 | ±------------±---------±--------±------------±-+

最新回复(0)