0%

优化

优化

既然hive是写sql来生成MR任务,那sql写的好坏严重影响了生成的MR任务,那么如何优化呢?就像优化sql一下,有个Explain执行计划,hive也是有Explain执行计划的

Explain执行计划

1
EXPLAIN [EXTENDED | DEPENDENCY | AUTHORIZATION] query
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
hive (study_hive)> explain select max(sal),deptno from emp group by deptno; 
OK
Explain
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1

STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: emp
Statistics: Num rows: 1 Data size: 6570 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: sal (type: double), deptno (type: int)
outputColumnNames: sal, deptno
Statistics: Num rows: 1 Data size: 6570 Basic stats: COMPLETE Column stats: NONE
Group By Operator
aggregations: max(sal)
keys: deptno (type: int)
mode: hash
outputColumnNames: _col0, _col1
Statistics: Num rows: 1 Data size: 6570 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: int)
sort order: +
Map-reduce partition columns: _col0 (type: int)
Statistics: Num rows: 1 Data size: 6570 Basic stats: COMPLETE Column stats: NONE
value expressions: _col1 (type: double)
Execution mode: vectorized
Reduce Operator Tree:
Group By Operator
aggregations: max(VALUE._col0)
keys: KEY._col0 (type: int)
mode: mergepartial
outputColumnNames: _col0, _col1
Statistics: Num rows: 1 Data size: 6570 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: _col1 (type: double), _col0 (type: int)
outputColumnNames: _col0, _col1
Statistics: Num rows: 1 Data size: 6570 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 1 Data size: 6570 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink

Time taken: 0.183 seconds, Fetched: 53 row(s)

Fetch抓取

对于某些查询是不需要使用MR任务的,只需要去文件中读取就可以了,如

1
select * from emp;

可以设置

1
2
3
4
5
6
-- 默认为more
-- 有三种选项
-- none 不使用Fetch抓取
-- minimal SELECT STAR, FILTER on partition columns, LIMIT only
-- more SELECT, FILTER, LIMIT only (support TABLESAMPLE and virtual columns)
set hive.fetch.task.conversion=more;

本地模式

数据量小的话可以使用本地模式在单台机器上处理所有任务,没有必要在集群上进行处理,可以减少执行时间。(但这一般只能处理个人的小文件,对于公司的大数据来说,很少有可以使用本地模式的)

1
2
3
4
5
6
-- 开启本地模式(默认false)
set hive.exec.mode.local.auto=true;
-- 最大输入量,当输入数据量小于这个值时使用local本地模式,默认128M
set hive.exec.mode.local.auto.inputbytes.max=134217728;
-- 最大输入文件数,输入文件小于这个值时使用local本地模式,默认4
set hive.exec.mode.local.auto.input.files.max=4;