[Hive] Optimization Strategies

xiaoxiao 2025-07-28

Hive converts most table operations into MapReduce (MR) jobs. To make OLAP (online analytical processing) more efficient, Hive itself provides a number of optimization strategies.

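1. explain [show the execution plan]

The explain command shows how Hive will execute a statement, so you can see at a glance whether the query is likely to be slow and whether it uses an index.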
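As a slightly fuller illustration of reading a plan, the sketch below runs EXPLAIN on an aggregate query. The table `sales` and its columns `amount` and `dt` are hypothetical names, not from the original post. The output lists the stage dependencies and, for each stage, the operator tree (TableScan, Filter, Group By and so on). That is where you can check how many MR stages a query needs and whether a filter was applied early.

```sql
-- Hypothetical table and columns, used only for illustration.
EXPLAIN
SELECT dt, SUM(amount)
FROM sales
WHERE dt = '2025-07-01'
GROUP BY dt;

-- EXPLAIN EXTENDED prints additional detail about each operator.
EXPLAIN EXTENDED
SELECT SUM(amount) FROM sales;
```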

explain select sum(...) from table_name;

2. Dynamic partition tuning

SET hive.exec.dynamic.partition.mode=strict; -- default is strict
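In strict mode, Hive requires at least one static partition column in a dynamic-partition insert; nonstrict mode allows every partition column to be dynamic. The following is a minimal sketch, assuming a hypothetical source table `logs_raw` and a hypothetical target table `logs` partitioned by `country` and `dt`.

```sql
-- Hypothetical tables: logs_raw (source) and logs (partitioned by country, dt).
SET hive.exec.dynamic.partition=true;

-- Accepted under strict mode: country is static, dt is dynamic.
INSERT OVERWRITE TABLE logs PARTITION (country='CN', dt)
SELECT user_id, url, dt
FROM logs_raw
WHERE country = 'CN';

-- All partition columns dynamic: this needs nonstrict mode.
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE logs PARTITION (country, dt)
SELECT user_id, url, country, dt
FROM logs_raw;
```

The dynamic partition columns go last in the SELECT list, in the same order as in the PARTITION clause.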

3. Bucket tables
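The original post gives only the heading for this item. As an illustrative sketch (the table `users_bucketed`, its columns and the bucket count are all assumptions), a bucketed table hashes a column into a fixed number of files, which can help with sampling and with joins on that column.

```sql
-- Hypothetical table; 32 buckets is an arbitrary choice for illustration.
CREATE TABLE users_bucketed (
  user_id BIGINT,
  name    STRING
)
CLUSTERED BY (user_id) INTO 32 BUCKETS
STORED AS ORC;

-- Read one bucket out of 32 instead of scanning the whole table.
SELECT * FROM users_bucketed TABLESAMPLE (BUCKET 1 OUT OF 32 ON user_id);
```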

4. Indexes

5. File format optimization: TEXTFILE, SEQUENCEFILE, RCFILE [splittable], ORC [an enhanced RCFILE], and PARQUET
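The format is chosen per table with STORED AS. The sketch below assumes a hypothetical text-format table `events_text` and rewrites its data into a hypothetical ORC table `events_orc`; the column names are also illustrative.

```sql
-- Hypothetical tables and columns, for illustration only.
CREATE TABLE events_orc (
  event_id BIGINT,
  payload  STRING
)
STORED AS ORC;

-- Convert existing text data by rewriting it into the ORC table.
INSERT OVERWRITE TABLE events_orc
SELECT event_id, payload FROM events_text;
```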

6. Compression

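SET hive.exec.compress.intermediate=true; -- allow compression of intermediate MR data; default is false
SET hive.intermediate.compression.codec=org.apache.hadoop.io.compress.SnappyCodec; -- codec for intermediate MR data
SET hive.exec.compress.output=true; -- allow compression of MR output data; default is false
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec; -- codec for MR output data; this is a Hadoop setting (mapreduce.map.output.compress.codec, by contrast, applies to map output)

7. Local mode: process all tasks on a single machine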

Suitable for small data sets.

SET hive.exec.mode.local.auto=true; -- default is false
SET mapreduce.framework.name=local;

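A job runs in local mode only if it meets these conditions:
  - the total input size of the job is less than hive.exec.mode.local.auto.inputbytes.max (default 134217728, i.e. 128 MB)
  - the number of map tasks is less than hive.exec.mode.local.auto.input.files.max (default 4)
  - the number of reduce tasks is 1 or 0

8. JVM reuse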

SET mapreduce.job.jvm.numtasks=5; -- number of tasks each JVM may run; default is 1 (a new JVM per task); -1 means no limit

9. Parallel execution

If the jobs of a query do not depend on one another, they can run in parallel.
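One case where jobs are independent is a UNION ALL whose branches aggregate different tables, since neither branch needs the other's result. The sketch below assumes two hypothetical tables, `orders_2024` and `orders_2025`.

```sql
SET hive.exec.parallel=true;

-- Hypothetical tables. The two aggregations read different tables and do not
-- depend on each other, so Hive may schedule their stages concurrently.
SELECT '2024' AS yr, COUNT(*) AS cnt FROM orders_2024
UNION ALL
SELECT '2025' AS yr, COUNT(*) AS cnt FROM orders_2025;
```

Whether the stages actually overlap also depends on free cluster capacity.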

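SET hive.exec.parallel=true; -- default is false
SET hive.exec.parallel.thread.number=16; -- default is 8; the number of jobs that may run in parallel

10. Limit tuning: enable it to avoid a full table scan by using the sampling mechanism

select * from ... limit 1,2
SET hive.limit.optimize.enable=true; -- default is false

11. JOIN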

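  - Dynamic mapjoin: enabled by default (true).
  - Use the hint /*+ streamtable(table_name) */ to name the table that should be streamed.
  - In a join query, order the tables so that their sizes increase from left to right.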

SET hive.auto.convert.join=true; -- default is true
SET hive.mapjoin.smalltable.filesize=600000000; -- default is 25000000; the mapjoin threshold: if the small table is below this size, a common join [reduce join] is converted to a mapjoin
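The two join techniques can be sketched as follows. The tables `big_orders` (large) and `dim_city` (small) and their columns are hypothetical names chosen for illustration.

```sql
-- Hypothetical tables: big_orders (large) and dim_city (small).

-- Ask Hive to stream big_orders through the join, whatever its position.
SELECT /*+ STREAMTABLE(o) */ o.order_id, c.city_name
FROM big_orders o
JOIN dim_city c ON o.city_id = c.city_id;

-- With automatic conversion enabled, no hint is needed: if dim_city is below
-- hive.mapjoin.smalltable.filesize, Hive can load it into memory on the
-- mappers and skip the reduce phase.
SET hive.auto.convert.join=true;
SELECT o.order_id, c.city_name
FROM big_orders o
JOIN dim_city c ON o.city_id = c.city_id;
```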

For reference, see the MR implementation of mapjoin.

12. Strict mode

Enable strict mode:

SET hive.mapred.mode=strict; -- deprecated
SET hive.strict.checks.large.query=true;

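This setting disallows:
  1. ORDER BY without a LIMIT
  2. querying a partitioned table without specifying a partition
Note: the check does not consider data size; it looks only at the query pattern.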
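To make the two checks concrete, here is a sketch against a hypothetical table `page_views` partitioned by `dt` (table and column names are assumptions). The statements expected to be rejected are left commented out.

```sql
SET hive.strict.checks.large.query=true;

-- Hypothetical table page_views, partitioned by dt.

-- Expected to be rejected: ORDER BY without LIMIT.
-- SELECT user_id, ts FROM page_views WHERE dt = '2025-07-01' ORDER BY ts;

-- Expected to be rejected: partitioned table queried without a partition filter.
-- SELECT COUNT(*) FROM page_views;

-- Accepted: partition filter plus ORDER BY with LIMIT.
SELECT user_id, ts
FROM page_views
WHERE dt = '2025-07-01'
ORDER BY ts
LIMIT 100;
```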

SET hive.strict.checks.type.safety=true;

Strict type safety disallows the following operations:
  1. comparisons between bigint and string
  2. comparisons between bigint and double
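A sketch of what this means in practice, assuming a hypothetical table `accounts` with a BIGINT column `account_id`. The comparison expected to be rejected is commented out, and the alternatives make the types explicit.

```sql
SET hive.strict.checks.type.safety=true;

-- Hypothetical table accounts with account_id BIGINT.

-- Expected to be rejected: compares a BIGINT column with a STRING literal.
-- SELECT * FROM accounts WHERE account_id = '1001';

-- Accepted: both sides are integer types.
SELECT * FROM accounts WHERE account_id = 1001;

-- Accepted: an explicit cast states which comparison is intended.
SELECT * FROM accounts WHERE CAST(account_id AS STRING) = '1001';
```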

SET hive.strict.checks.cartesian.product=true;

This property disallows Cartesian products.

13. Tuning the number of mappers and reducers

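SET hive.exec.reducers.bytes.per.reducer=256000000; -- bytes handled per reduce task, about 256 MB
SET hive.exec.reducers.max=1009; -- maximum number of reduce tasks; when mapred.reduce.tasks is negative, Hive uses this value as the upper bound when it estimates the reducer count

14. Speculative execution [Hadoop]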

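Speculative execution lets several instances of the same map or reduce task run concurrently; the result of the first attempt to finish is used.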

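SET mapreduce.map.speculative=true; -- default is true
SET mapreduce.reduce.speculative=true; -- default is true

15. Multi-group-by optimization

SET hive.multigroupby.singlereducer=true; -- default is true; if several GROUP BYs share a common key, they can be compiled into a single MR job

16. Virtual columns

SET hive.exec.rowoffset=true; -- default is false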
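Hive exposes INPUT__FILE__NAME and BLOCK__OFFSET__INSIDE__FILE as virtual columns, and enabling hive.exec.rowoffset additionally makes ROW__OFFSET__INSIDE__BLOCK available. A minimal sketch, reusing the placeholder `table_name` from earlier in the post:

```sql
SET hive.exec.rowoffset=true;

-- table_name is a placeholder; substitute any existing table.
SELECT INPUT__FILE__NAME,
       BLOCK__OFFSET__INSIDE__FILE,
       ROW__OFFSET__INSIDE__BLOCK
FROM table_name
LIMIT 10;
```

These columns are mostly useful for tracing a suspicious row back to the file and position it came from.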
