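Importing data into Hive

Once a table has been created in Hive, how do you get data into it? There are several ways.

Using HDFS

Hive stores its table data in HDFS, so you can upload a file straight into the table's directory with the hdfs command, and the table will then contain that data.

Create the table

-- Use \t as the field delimiter here instead of Hive's default delimiter (\001, Ctrl-A)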
create table if not exists dept(
    deptno int,
    dname string
)
row format delimited fields terminated by '\t';

Run desc formatted dept to find the table's storage location in HDFS, which here is hdfs://localhost:9000/user/hive/warehouse/study_hive.db/dept

Then upload the data file into that directory with hdfs

hdfs dfs -put ./dept.txt /user/hive/warehouse/study_hive.db/dept

Query the data

hive (study_hive)> select * from dept;
OK
dept.deptno    dept.dname
1    财务
2    IT

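Key point

Because the file was uploaded directly through HDFS rather than inserted through Hive, the Hive metastore has no record of the data now in the table. Here is what the metastore shows:

-- Look up the table ID
select TBL_ID from TBLS where TBL_NAME = 'dept';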
+--------+
| TBL_ID |
+--------+
|      3 |
+--------+

-- Query the table's parameters by table ID
select * from TABLE_PARAMS where TBL_ID = 3;
+--------+-----------------------+-----------------------------------------------------------------+
| TBL_ID | PARAM_KEY             | PARAM_VALUE                                                     |
+--------+-----------------------+-----------------------------------------------------------------+
|      3 | COLUMN_STATS_ACCURATE | {"BASIC_STATS":"true","COLUMN_STATS":{"ame":"true","o":"true"}} |
|      3 | bucketing_version     | 2                                                               |
|      3 | last_modified_by      | zhanghe                                                         |
|      3 | last_modified_time    | 1618134450                                                      |
|      3 | numFiles              | 0                                                               |
|      3 | numRows               | 0                                                               |
|      3 | rawDataSize           | 0                                                               |
|      3 | totalSize             | 0                                                               |
|      3 | transient_lastDdlTime | 1618134450                                                      |
+--------+-----------------------+-----------------------------------------------------------------+

As the output shows, numFiles (the number of files) and numRows (the number of rows) are both 0, because Hive does not know how much data the table holds.
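If the statistics need to reflect files that were placed in the table directory by hand, Hive can be asked to recompute them. The following is a minimal sketch, assuming the dept table above and a Hive version that supports analyze table. It is an optional aside: the metadata shown in the rest of this walkthrough assumes it was not run.

```sql
-- Scan the table's data and refresh its basic statistics
-- (numFiles, numRows, rawDataSize, totalSize) in the metastore
analyze table dept compute statistics;

-- Show the table's metadata; the refreshed values appear
-- under "Table Parameters" in the output
describe formatted dept;
```

The values reported depend on the files actually present in the table directory when the statement runs.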

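Loading data with Hive's load data

-- load data [local] inpath 'path to the data' [overwrite] into table student [partition (partcol1=val1,…)];
-- local: load the data from the local filesystem into the Hive table; without it, the path refers to HDFS
-- inpath: the path of the data to load
-- overwrite: replace the data already in the table; without it, the data is appended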
load data local inpath '/Users/zhanghe/Desktop/user/myself/hive_data/dept1.txt' into table dept;

Check how the data has changed

select * from dept;
OK
dept.deptno    dept.dname
1    财务
2    IT
1001    产品
1002    测试

Metadata changes

select * from TABLE_PARAMS where TBL_ID = 3;
+--------+-----------------------+-------------+
| TBL_ID | PARAM_KEY             | PARAM_VALUE |
+--------+-----------------------+-------------+
|      3 | bucketing_version     | 2           |
|      3 | last_modified_by      | zhanghe     |
|      3 | last_modified_time    | 1618134450  |
|      3 | numFiles              | 2           |
|      3 | numRows               | 0           |
|      3 | rawDataSize           | 0           |
|      3 | totalSize             | 44          |
|      3 | transient_lastDdlTime | 1618135089  |
+--------+-----------------------+-------------+

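numFiles has now changed, and it includes the file uploaded earlier through HDFS, giving 2, while numRows is still 0. The COLUMN_STATS_ACCURATE entry has also disappeared. So when load data brings in a file, Hive knows how many files the table has but not how many rows they contain.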

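Why is the file uploaded earlier through HDFS counted as well? A likely explanation is that Hive derives numFiles and totalSize by listing the files in the table's directory rather than by tracking each load, so every file found there is counted.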
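The syntax comment above also covers the form without local and with overwrite, which the walkthrough does not run. As a sketch only, assuming a file has already been uploaded to a hypothetical HDFS path /tmp/hive_data/dept2.txt:

```sql
-- No "local": the path is resolved on HDFS, and Hive moves the file
-- into the table's directory rather than copying it.
-- "overwrite": the data already in the table is replaced.
-- /tmp/hive_data/dept2.txt is a hypothetical path used for illustration.
load data inpath '/tmp/hive_data/dept2.txt' overwrite into table dept;
```

Because the file is moved, it is no longer at the source path after the statement succeeds, whereas a load with local leaves the local file in place.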

Inserting data with Hive's insert

-- into appends to the existing data
-- overwrite replaces the existing data
insert into table dept values(201,'人事'),(202,'公关');
-- or
insert overwrite table dept values(201,'人事'),(202,'公关');
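
Both forms also accept a query in place of a values list, which is handy when the rows already live in another table. A minimal sketch, assuming a hypothetical staging table named dept_staging that is not part of the original walkthrough:

```sql
-- dept_staging is a hypothetical table created with the same
-- columns and row format as dept
create table if not exists dept_staging like dept;

-- Append every row of the staging table to dept
insert into table dept
select deptno, dname from dept_staging;

-- Or replace the contents of dept with the query result
insert overwrite table dept
select deptno, dname from dept_staging;
```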