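Steps for installing MySQL and Hive

What is Hive

Hive is a data warehouse tool built on Hadoop and intended for offline (batch) workloads. It maps structured data files to database tables and provides SQL-like query capability.

Its interface uses SQL-like syntax, which enables rapid development, avoids hand-writing MapReduce programs, lowers the learning cost for developers, and makes functionality easy to extend.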

I. Startup methods

HDFS must be started first, because Hive's data is ultimately stored in HDFS.

1. Method 1:

[root@hdp-2 lib]# hive

A few basic parameters make Hive more convenient to use, for example:

// Show the current database in the prompt:

hive>set hive.cli.print.current.db=true;

// Print column names along with query results:

hive>set hive.cli.print.header=true;

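These settings apply only to the current session, however, and are lost once the Hive session is restarted. The fix:

In the current user's home directory (~) on Linux, create a .hiverc file and write the parameters into it: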

vi .hiverc

set hive.cli.print.header=true;
set hive.cli.print.current.db=true;
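
The same file can also be created without opening an editor. A minimal sketch, assuming a POSIX-style shell; the function name write_hiverc and its optional target-path argument are illustrative and not part of Hive:

```shell
# Write the two CLI settings shown above into a .hiverc file.
# write_hiverc is a hypothetical helper; the target defaults to ~/.hiverc.
write_hiverc() {
  local target="${1:-$HOME/.hiverc}"
  cat > "$target" <<'EOF'
set hive.cli.print.header=true;
set hive.cli.print.current.db=true;
EOF
}
```

Calling write_hiverc with no argument overwrites ~/.hiverc, so back up any existing file first.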

2. Method 2:

Start hiveserver2 on the hdp-1 server. (1) In the foreground; the terminal stays on this screen:

[root@hdp-1 conf]# hiveserver2

Start hiveserver2: (2) with log output printed to the console:

[root@hdp20-04 hive-1.2.1]# bin/hiveserver2 -hiveconf hive.root.logger=DEBUG,console

Start hiveserver2: (3) in the background:

hiveserver2 1>/dev/null 2>&1 &
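
A background start returns immediately, before the server is ready to accept connections. A hedged sketch of a readiness check, assuming bash is installed (for its /dev/tcp redirection) along with the coreutils timeout command; wait_for_port is a hypothetical helper, and 10000 is the port used in the connection URLs in this tutorial:

```shell
# Poll until a TCP port accepts connections, or give up after N tries.
# wait_for_port is an illustrative helper, not a Hive command.
wait_for_port() {
  local host="$1" port="$2" tries="${3:-30}"
  local i=0
  while [ "$i" -lt "$tries" ]; do
    # bash opens a TCP connection when redirecting to /dev/tcp/HOST/PORT
    if timeout 1 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  return 1
}

# Example: wait_for_port hdp-1 10000 && echo "hiveserver2 is accepting connections"
```

An open port only shows that the process is listening; it does not prove that queries will succeed.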

From another machine, connect to hdp-1 as a client. beeline is only a client; the operations actually run on the machine hosting hiveserver2.

Option 1

[root@hdp-4 apps]# beeline 
Beeline version 1.2.1 by Apache Hive
beeline> !connect jdbc:hive2://hdp-1:10000
Connecting to jdbc:hive2://hdp-1:10000
Enter username for jdbc:hive2://hdp-1:10000: root
Enter password for jdbc:hive2://hdp-1:10000: 
Connected to: Apache Hive (version 1.2.1)
Driver: Hive JDBC (version 1.2.1)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://hdp-1:10000>

Option 2

[root@hdp-2 bin]# beeline -u jdbc:hive2://hdp-1:10000 -n root
Connecting to jdbc:hive2://hdp-1:10000
Connected to: Apache Hive (version 1.2.1)
Driver: Hive JDBC (version 1.2.1)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 1.2.1 by Apache Hive
0: jdbc:hive2://hdp-1:10000>

In addition, Hive can run a given HQL statement as a one-off command:

[root@hdp20-04 ~]#  hive -e "insert into table t_dest select * from t_src;"

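Because of this mechanism, Hive can execute large numbers of SQL statements from a script or a file.

(1) Script: put the commands in a shell script and run it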

#!/bin/bash
hive -e "select * from db_order.t_order"
hive -e "select * from default.t_user"
hql="create table default.t_bash as select * from db_order.t_order"
hive -e "$hql"

(2) File

vi x.hql

select * from db_order.t_order;
select count(1) from db_order.t_user;

Run the file: hive -f /root/x.hql
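
Because hive -e writes result rows to standard output, a script can also capture a query result in a variable. A small sketch, assuming the hive CLI's -S (silent) flag to suppress progress messages; hql_result and the HIVE_BIN override are hypothetical names introduced only for illustration:

```shell
# Run one HQL statement in silent mode and print only its result rows.
# HIVE_BIN is an assumed override hook (not a Hive feature) so the function
# can be exercised with a stub when no cluster is available.
hql_result() {
  "${HIVE_BIN:-hive}" -S -e "$1"
}

# Example: store a row count in a shell variable
# user_count="$(hql_result 'select count(1) from db_order.t_user')"
```

If hive.cli.print.header is enabled, the captured output may start with a header line, which the script would then have to skip.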

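II. Working with Hive

Hive has a default database named default. No directory named default appears in the distributed file system; its tables are stored directly under the warehouse directory.

Create a database: create database db_order;

Once the database is created, a database directory is generated in HDFS: hdfs://hdp20-01:9000/user/hive/warehouse/db_order.db

1. Basic table creation statements: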

use school;
create table teacher(id string,name string,gender string);
desc teacher;  -- show the table structure

With a table created this way, Hive assumes that the field delimiter in the table's data files is ^A (Ctrl-A, the byte \001).
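
The ^A delimiter is a single non-printing byte (octal 001), which is awkward to type or to see in an editor. A small sketch for producing and inspecting such data with standard tools; both helper names are hypothetical:

```shell
# Build one row whose fields are separated by the ^A byte (octal 001).
make_ctrl_a_row() {
  local IFS
  IFS="$(printf '\001')"
  # "$*" joins the arguments with the first character of IFS
  printf '%s\n' "$*"
}

# Make ^A-delimited text readable by turning the delimiter into commas.
show_ctrl_a() {
  tr '\001' ','
}

make_ctrl_a_row 1 lucas male | show_ctrl_a
# prints 1,lucas,male
```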

To specify the field delimiter explicitly:

create table teacher(id string,name string,gender string)
row format delimited
fields terminated by ',';

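Result:

If a data file follows the table's delimiter format, it can be uploaded from the local filesystem into the table's directory in HDFS, and Hive will expose its contents as table rows automatically.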

vi testdata.dat
1,lucas,male
2,nimoo,female
3,jack,male
# upload to HDFS
hadoop fs -put testdata.dat /user/hive/warehouse/school.db/teacher

Creating tables with LIKE and CTAS

(1) Copy the table structure only (create table ... like):

create table t_user_2 like t_user;

(2) Copy the structure together with the data (create table ... as select, CTAS):

create table t_access_user 
as
select ip,url from t_access;

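2. Internal tables and external tables

(1) A table created with the external keyword is an external table; otherwise it is an internal (managed) table.

(2) Hive manages the data of an internal table itself. For an external table Hive manages only the metadata, and the data files simply stay where they are in HDFS.

(3) Internal tables are stored under /user/hive/warehouse in HDFS; an external table can point to any location.

(4) Dropping an internal table deletes both the metadata and the stored data. Dropping an external table deletes only the metadata; the files on HDFS are not removed.

(5) Metadata holds the basic information about tables, such as table names, columns, partitions, and storage locations, and plays a role similar to the namespace in the Hadoop architecture. Ordinary data is the actual row content of the tables.
Hive stores its metadata in Derby by default, but in most deployments it is kept in MySQL. Ordinary data is stored in HDFS.

(6) Create an external table: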

create external table t_access(ip string,url string,access_time string)
row format delimited
fields terminated by ','
location '/access/log';

Converting between internal and external tables:

Internal -> external
alter table tblName set tblproperties('EXTERNAL'='TRUE');
External -> internal
alter table tblName set tblproperties('EXTERNAL'='FALSE');
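
To confirm that a conversion took effect, the table type can be read back from desc formatted, which reports MANAGED_TABLE for an internal table and EXTERNAL_TABLE for an external one. A rough sketch; table_type is a hypothetical helper, and the exact spacing of the output line may vary between Hive versions:

```shell
# Read `desc formatted <table>` output on stdin and print the table type.
# table_type is an illustrative helper, not a Hive command.
table_type() {
  awk '/^Table Type:/ { print $NF }'
}

# Example: hive -S -e "desc formatted tblName" | table_type
```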

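3. Partitioned tables

A partitioned table is created by adding a partitioned by clause to the create table statement. A partitioned table can have one or more partition columns, and a separate directory is created for each partition to hold that partition's data.

In essence, a partitioned table creates partition subdirectories for the data files under the table directory, so that at query time the MR program can process only the data in the relevant partition subdirectories, which reduces the amount of data read.

1. Creating a partitioned table (dynamic, static)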

create table t_access(ip string,url string,access_time string)
partitioned by(dt string)
row format delimited
fields terminated by ',';

Note: a partition column must not be a column that already exists in the table.

Load data into the partitioned table:

load data local inpath '/root/access.log.2017-08-04.log' into table t_access partition(dt='20170804');
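
With one log file per day, the load statement is often generated rather than typed by hand. A sketch that assumes the file naming pattern seen above (access.log.YYYY-MM-DD.log under /root) and a partition value equal to the date without dashes; gen_load_stmt is a hypothetical helper that only prints the statement:

```shell
# Print a load statement for one day's log file.
# The partition value dt is the date with the dashes removed.
gen_load_stmt() {
  local day="$1"                       # e.g. 2017-08-04
  local dt
  dt="$(printf '%s' "$day" | sed 's/-//g')"
  printf "load data local inpath '/root/access.log.%s.log' into table t_access partition(dt='%s');\n" "$day" "$dt"
}

# Example: gen_load_stmt 2017-08-04 > /root/load_access.hql
#          hive -f /root/load_access.hql
```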

Query data by partition:

select count(*) from t_access where dt='20170804';
// In essence, the partition column is used like an ordinary table column, so a where clause can select the partition

Multiple partition columns:

Create the table:

create table t_partition(id int,name string,age int)
partitioned by(department string,sex string,howold int)
row format delimited fields terminated by ',';

Load data:

load data local inpath '/root/p1.dat' into table t_partition partition(department='xiangsheng',sex='male',howold=20);
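
Each partition column adds one directory level named column=value, in the order the columns are declared. A sketch of how the directory for a partition can be derived, assuming the table was created in the default database under the standard warehouse location; partition_path is a hypothetical helper:

```shell
# Print the HDFS directory for a partition: <table dir>/<k1=v1>/<k2=v2>/...
partition_path() {
  local path="$1"
  shift
  local spec
  for spec in "$@"; do
    path="$path/$spec"
  done
  printf '%s\n' "$path"
}

partition_path /user/hive/warehouse/t_partition department=xiangsheng sex=male howold=20
# prints /user/hive/warehouse/t_partition/department=xiangsheng/sex=male/howold=20
```

Listing that directory with hadoop fs -ls should show the loaded data file.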

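4. Importing and exporting data

(1) Importing data files into Hive

load data local inpath: when the statement is submitted through hiveserver2 (for example from beeline), the local path refers to the machine where hiveserver2 is running.

Option 1: put the file into the table directory manually with hdfs commands.

Option 2: in Hive's interactive shell, use a Hive command to load local data into the table directory (the path is resolved on the server)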
hive>load data local inpath '/root/order.data.2' into table t_order;

Option 3: use a Hive command to load a data file that is already in HDFS into the table directory
hive>load data inpath '/access.log.2017-08-06.log' into table t_access partition(dt='20170806');

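Note the difference between loading a local file and loading an HDFS file:
a local file is copied into the table directory, whereas an HDFS file is moved.

(2) Exporting data from Hive tables

1. Export data from a Hive table to files in HDFS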
insert overwrite directory '/root/access-data'
row format delimited fields terminated by ','
select * from t_access;


2. Export data from a Hive table to files on the local disk
insert overwrite local directory '/root/access-data'
row format delimited fields terminated by ','
select * from t_access limit 100000;
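
The export writes one or more result files into the target directory rather than a single named file (typically with names such as 000000_0; that naming is an assumption based on common Hive behaviour). A minimal sketch for combining them into one file; merge_export is a hypothetical helper, and the output file must live outside the export directory:

```shell
# Concatenate every regular, non-hidden file in an export directory
# into a single output file.
merge_export() {
  local dir="$1" out="$2"
  : > "$out"                 # create or truncate the output file
  local f
  for f in "$dir"/*; do
    [ -f "$f" ] && cat "$f" >> "$out"
  done
  return 0
}

# Example: merge_export /root/access-data /root/access-data.csv
```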

辛聪明