Hive DML

DML: statements that operate on table contents

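load: moves or copies a data file into the storage location of a Hive table

load data [local] inpath 'file_path' [overwrite] into table table_name [partition (partcol1=val1, partcol2=val2,...)]

Note: load only moves or copies files and does not transform them, so it cannot be used to populate a bucketed table, and a plain text file cannot be loaded into a table stored as sequencefile. Use insert ... select in those cases.

local: the path refers to a file on the local filesystem. overwrite: replaces the existing contents of the target table or partition.

Loading a local file copies it into the table's location; loading an HDFS file moves it there.

Example: load data local inpath '/root/student/c.txt' into table student partition (country="China");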
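The copy-versus-move difference is easiest to see side by side. The sketch below assumes a hypothetical student table with columns id and name, partitioned by country, and a hypothetical HDFS path /tmp/student/uk.txt; only the first load statement comes from the example above.

```sql
-- Hypothetical schema, for illustration only.
create table if not exists student (id int, name string)
partitioned by (country string)
row format delimited fields terminated by ',';

-- Local file: copied into the partition directory.
-- /root/student/c.txt is still present on the local disk afterwards.
load data local inpath '/root/student/c.txt'
into table student partition (country='China');

-- HDFS file: moved into the partition directory.
-- The file no longer exists at /tmp/student/uk.txt afterwards.
-- overwrite first removes whatever the UK partition already held.
load data inpath '/tmp/student/uk.txt'
overwrite into table student partition (country='UK');
```

Because load does not parse or validate the file, a file whose delimiter does not match the table definition loads without error and only shows up as NULL columns at query time.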

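insert: inserts data into a Hive table, or exports table data

Note: overwrite can be replaced with into. overwrite replaces the existing data (the old data is deleted first, then the new data is written); into appends regardless of what is already there, so rows that already exist are inserted again as duplicates.

1.1. Insert a single row (rarely useful, since Hive is meant for processing large volumes of data):

insert into table_name values(xx,yy,zz);

1.2. Insert from a query (query table_name1 and insert the result into table_name):

insert overwrite table table_name [partition (part_col1=val1, part_col2=val2,...)] select_statement from from_statement;

Example, basic insert: insert overwrite table student select * from student1;

Partitioned insert: insert overwrite table student partition (country="China") select * from student1 where country="China";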
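A sketch of two ways to write a partitioned insert, assuming (hypothetically) that student has data columns id and name plus the partition column country, and that student1 holds all three as ordinary columns. The first statement fixes the partition value in the statement itself; the second uses Hive's dynamic partitioning, where the partition value is taken from the last selected column.

```sql
-- Static partition: the value is given in the partition clause,
-- so the select list contains only the non-partition columns.
insert overwrite table student partition (country='China')
select id, name from student1 where country = 'China';

-- Dynamic partition: Hive creates one partition per distinct country.
-- nonstrict mode is needed when no partition column is given statically.
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

insert overwrite table student partition (country)
select id, name, country from student1;
```

With the static form, selecting the partition column as well (for example via select *) yields one column more than the target expects, which is worth checking against the actual schema of the source table.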

1.3. Multi-insert (one scan of the source feeds several inserts):

from from_statement

insert overwrite table table_name1 [partition (part_col1=val1, part_col2=val2,...)] select_statement1

[insert overwrite table table_name2 [partition (part_col3=val3, part_col4=val4,...)] select_statement2 ...];

Example: from student1

insert overwrite table student partition (country="China") select * where country="China"

insert overwrite table student partition (country="UK") select * where country="UK";

2.1. Export

insert overwrite [local] directory dir_path1 select ... from ... ;

2.2. Multi-export

from ... insert overwrite [local] directory dir_path1 select ... [insert overwrite [local] directory dir_path2 select ...];

Note: with local, the target is a directory on the local filesystem: /root/app/.... Without local, it is an HDFS directory: hdfs://ip:9000/user/hive/...
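The export forms can be sketched as follows. All directory paths and the id and name columns are hypothetical; the row format clause on a directory insert is available in Hive 0.11 and later, and without it the output uses Hive's default field delimiter.

```sql
-- Export to the local filesystem as comma-separated text.
-- Existing contents of /tmp/student_export are replaced.
insert overwrite local directory '/tmp/student_export'
row format delimited fields terminated by ','
select id, name from student where country = 'China';

-- Multi-export: student is scanned once and written to two HDFS directories.
from student
insert overwrite directory '/tmp/export/china'
  select id, name where country = 'China'
insert overwrite directory '/tmp/export/uk'
  select id, name where country = 'UK';
```

Since overwrite clears the target directory before writing, pointing an export at a directory that holds unrelated files will delete them.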

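select

select [all | distinct] select_expr, select_expr, ... from table_reference [where where_condition]

[group by col_list [having having_condition]]

[ {when the distribute and sort columns are the same: cluster by = distribute by + sort by}

[cluster by col_list] {distributes rows across reducers, then sorts within each reducer}

[distribute by col_list] [sort by | order by col_list] {distribute by: hash-partitions rows across reducers; sort by: sorts within each reducer; order by: global sort}

]

[limit limit_number]

Example: (set mapred.reduce.tasks=4) select * from student cluster by id;

Result: 4 8 1 5 2 6 3 7 (sorted within each reducer's partition, not globally)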

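Notes:

1. order by sorts the entire input globally, so it runs with a single reducer; on large inputs this can take a long time.

2. sort by is not a global sort: rows are sorted before they reach each reducer. If mapred.reduce.tasks > 1, sort by only guarantees that each reducer's output is sorted, not that the overall result is.

3. distribute by (columns) sends rows to different reducers according to the given columns, using a hash of those columns.

4. Setting the number of reducers (partitions): set mapred.reduce.tasks=3;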
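The four notes above can be tried out with a handful of queries. This is a minimal sketch against the student table used earlier, assuming it has id and country columns; the reducer count of 4 is arbitrary.

```sql
set mapred.reduce.tasks=4;

-- cluster by id is shorthand for distributing and sorting on the same column,
-- so these two queries are equivalent.
select * from student cluster by id;
select * from student distribute by id sort by id;

-- Different columns: rows with the same country reach the same reducer,
-- and each reducer's output is sorted by id in descending order.
select * from student distribute by country sort by id desc;

-- order by produces a total order and is executed by a single reducer,
-- regardless of mapred.reduce.tasks.
select * from student order by id;
```

The distribute by plus sort by form is the one to reach for when the sort column differs from the distribution column or a descending sort is needed.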

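Sampling a bucketed table:

select * from table_name tablesample(bucket x out of y on id)

Explanation: start at bucket x and take every y-th bucket after it (x, x+y, ...).

Let z be the number of buckets. y must be a multiple or a factor of z; it sets the stride between sampled buckets.

With 32 buckets: y=16 samples 32/16 = 2 buckets of data; y=64 samples 32/64 = 1/2 of a bucket.

x is the bucket to start from. With x=3, y=16, z=32, the sample consists of buckets 3 and 19 (3+16).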
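The arithmetic above maps onto queries like the following. The table student_bucketed, its columns, and the source table student1 are hypothetical; the table is populated with insert ... select because load does not bucket data. On Hive versions before 2.0, hive.enforce.bucketing may also need to be set to true before the insert.

```sql
-- Hypothetical bucketed table with z = 32 buckets on id.
create table if not exists student_bucketed (id int, name string)
clustered by (id) into 32 buckets;

insert overwrite table student_bucketed
select id, name from student1;

-- y = 16: 32/16 = 2 buckets are read, buckets 3 and 19.
select * from student_bucketed tablesample(bucket 3 out of 16 on id);

-- y = 32: exactly one bucket is read, bucket 3.
select * from student_bucketed tablesample(bucket 3 out of 32 on id);

-- y = 64: 32/64 = 1/2, so half of one bucket is read.
select * from student_bucketed tablesample(bucket 3 out of 64 on id);
```

Sampling on the same column the table is clustered by lets Hive read only the selected bucket files rather than scanning the whole table.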
