本博文的主要内容有:
.hive的常用语法
.内部表
.外部表
.内部表,被drop掉,会发生什么?
.外部表,被drop掉,会发生什么?
.内部表和外部表的,保存的路径在哪?
.用于创建一些临时表存储中间结果
.用于向临时表中追加中间结果数据
.分区表(分为,分区内部表和分区外部表)
.hive的结构和原理
.hive的原理和架构设计
hive的使用
对于hive的使用,在hadoop集群里,先启动hadoop集群,再启动mysql服务,然后,再hive即可。
1、在hadoop安装目录下,sbin/start-all.sh。
2、在任何路径下,执行service mysql start (CentOS版本)、sudo /etc/init.d/mysql start (Ubuntu版本)
3、在hive安装目录下的bin下,./hive
对于hive的使用,在spark集群里,先启动hadoop集群,再启动spark集群,再启动mysql服务,然后,再hive即可。
1、在hadoop安装目录下,sbin/start-all.sh。
2、在spark安装目录下,sbin/start-all.sh
3、在任何路径下,执行service mysql start (CentOS版本)、sudo /etc/init.d/mysql start (Ubuntu版本)
3、在hive安装目录下的bin下,./hive
[hadoop@weekend110 bin]$ pwd
/home/hadoop/app/hive-0.12.0/bin[hadoop@weekend110 bin]$ mysql -uhive -hweekend110 -phiveWelcome to the MySQL monitor. Commands end with ; or \g.Your MySQL connection id is 110Server version: 5.1.73 Source distributionCopyright (c) 2000, 2013, Oracle and/or its affiliates. All rights reserved.
Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respectiveowners.Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
mysql> SHOW DATABASES;
+--------------------+| Database |+--------------------+| information_schema || hive || mysql || test |+--------------------+4 rows in set (0.00 sec)mysql> quit;
Bye[hadoop@weekend110 bin]$
[hadoop@weekend110 bin]$ pwd
/home/hadoop/app/hive-0.12.0/bin[hadoop@weekend110 bin]$ ./hive16/10/10 22:36:25 INFO Configuration.deprecation: mapred.input.dir.recursive is deprecated. Instead, use mapreduce.input.fileinputformat.input.dir.recursive16/10/10 22:36:25 INFO Configuration.deprecation: mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize16/10/10 22:36:25 INFO Configuration.deprecation: mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize16/10/10 22:36:25 INFO Configuration.deprecation: mapred.min.split.size.per.rack is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.rack16/10/10 22:36:25 INFO Configuration.deprecation: mapred.min.split.size.per.node is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.node16/10/10 22:36:25 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces16/10/10 22:36:25 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculativeLogging initialized using configuration in jar:file:/home/hadoop/app/hive-0.12.0/lib/hive-common-0.12.0.jar!/hive-log4j.properties
SLF4J: Class path contains multiple SLF4J bindings.SLF4J: Found binding in [jar:file:/home/hadoop/app/hadoop-2.4.1/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]SLF4J: Found binding in [jar:file:/home/hadoop/app/hive-0.12.0/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]hive> SHOW DATABASES;OKdefaulthiveTime taken: 12.226 seconds, Fetched: 2 row(s)hive> quit;[hadoop@weekend110 bin]$
总结,mysql比hive,多出了自己本身mysql而已。
如
CREATE TABLE page_view(
viewTime INT,userid BIGINT,page_url STRING,referrer_url STRING,ip STRING COMMENT 'IP Address of the User')COMMENT 'This is the page view table'PARTITIONED BY(dt STRING, country STRING)ROW FORMAT DELIMITEDFIELDS TERMINATED BY '\001'STORED AS SEQUENCEFILE; TEXTFILE
原因解释如下:
0000101 iphone6pluse 64G 6888
0000102 xiaominote 64G 2388
CREATE TABLE t_order(id int,name string,rongliang string,price double)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
;
现在,我们来开始玩玩
[hadoop@weekend110 bin]$ pwd
/home/hadoop/app/hive-0.12.0/bin
[hadoop@weekend110 bin]$ ./hive
16/10/10 10:16:38 INFO Configuration.deprecation: mapred.input.dir.recursive is deprecated. Instead, use mapreduce.input.fileinputformat.input.dir.recursive
16/10/10 10:16:38 INFO Configuration.deprecation: mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
16/10/10 10:16:38 INFO Configuration.deprecation: mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
16/10/10 10:16:38 INFO Configuration.deprecation: mapred.min.split.size.per.rack is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.rack
16/10/10 10:16:38 INFO Configuration.deprecation: mapred.min.split.size.per.node is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.node
16/10/10 10:16:38 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
16/10/10 10:16:38 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
Logging initialized using configuration in jar:file:/home/hadoop/app/hive-0.12.0/lib/hive-common-0.12.0.jar!/hive-log4j.properties
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hadoop/app/hadoop-2.4.1/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hadoop/app/hive-0.12.0/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
hive>
遇到,如下问题
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
参考,
先Esc,再Shift,再 . + /
<property>
<name>hive.metastore.schema.verification</name>
<value>true</value>
<description>
Enforce metastore schema version consistency.
True: Verify that version information stored in metastore matches with one from Hive jars. Also disable automatic
schema migration attempt. Users are required to manully migrate schema after Hive upgrade which ensures
proper metastore schema migration. (Default)
False: Warn if the version information stored in metastore doesn't match with one from in Hive jars.
</description>
</property>
改为
<property>
<name>hive.metastore.schema.verification</name>
<value>false</value>
<description>
Enforce metastore schema version consistency.
True: Verify that version information stored in metastore matches with one from Hive jars. Also disable automatic
schema migration attempt. Users are required to manully migrate schema after Hive upgrade which ensures
proper metastore schema migration. (Default)
False: Warn if the version information stored in metastore doesn't match with one from in Hive jars.
</description>
</property>
很多人这样写
CREATE TABLE t_order(
id int,name string,rongliang string,price double)ROW FORMAT DELIMITEDFIELDS TERMINATED BY '\t';
hive> CREATE TABLE t_order(id int,name string,rongliang string,price double)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY '\t'
> ;
OK
Time taken: 28.755 seconds
hive>
测试连接下,
正式连接
成功!
这里呢,我推荐一款新的软件,作为入门。
之类呢,再可以玩更高级的,见
前提得要开启hive
注意:第一步里,输入后,不要点击“确定”。直接切换到“常规。”
关于,第二步。看下你的hive安装目录下的hive-site.xml,你的user和password。若你配置的是root,则第二步里用root用户。
配置完第二步,之后,再最后点击“确定。”
通过show databases;可以查看数据库。默认database只有default。
hive> CREATE DATABASE hive; //创建hive数据库,只是这个数据库的名称,命名为hive而已。
OK
Time taken: 1.856 seconds
hive> SHOW DATABASES;
OK
default
hive
Time taken: 0.16 seconds, Fetched: 2 row(s)
hive> use hive; //使用hive数据库
OK
Time taken: 0.276 seconds
很多人这样写法
CREATE TABLE t_order(
id int,name string,rongliang string,price double)ROW FORMAT DELIMITEDFIELDS TERMINATED BY '\t';
hive> CREATE TABLE t_order(id int,name string,rongliang string,price double)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY '\t'
> ;
OK
Time taken: 0.713 seconds
hive> SHOW TABLES;
OK
t_order
Time taken: 0.099 seconds, Fetched: 1 row(s)
hive>
对应着,
TBLS,其实就是TABLES,记录的是表名等。
好的,现在,来导入数据。
新建
[hadoop@weekend110 ~]$ ls
app c.txt flowArea.jar jdk1.7.0_65 wc.jar
a.txt data flow.jar jdk-7u65-linux-i586.tar.gz words.log
blk_1073741856 download flowSort.jar Link to eclipse workspace
blk_1073741857 eclipse HTTP_20130313143750.dat qingshu.txt
b.txt eclipse-jee-luna-SR2-linux-gtk-x86_64.tar.gz ii.jar report.evt
[hadoop@weekend110 ~]$ mkdir hiveTestData
[hadoop@weekend110 ~]$ cd hiveTestData/
[hadoop@weekend110 hiveTestData]$ ls
[hadoop@weekend110 hiveTestData]$ vim XXX.data
0000101 iphone6pluse 64G 6888
0000102 xiaominote 64G 2388
0000103 iphone5s 64G 6888
0000104 mi4 64G 2388
0000105 mi3 64G 6388
0000106 meizu 64G 2388
0000107 huawei 64G 6888
0000108 zhongxing 64G 6888
本地文件的路径是在,
[hadoop@weekend110 hiveTestData]$ pwd
/home/hadoop/hiveTestData
[hadoop@weekend110 hiveTestData]$ ls
XXX.data
[hadoop@weekend110 hiveTestData]$
[hadoop@weekend110 bin]$ ./hive
16/10/10 17:23:09 INFO Configuration.deprecation: mapred.input.dir.recursive is deprecated. Instead, use mapreduce.input.fileinputformat.input.dir.recursive
16/10/10 17:23:09 INFO Configuration.deprecation: mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
16/10/10 17:23:09 INFO Configuration.deprecation: mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
16/10/10 17:23:09 INFO Configuration.deprecation: mapred.min.split.size.per.rack is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.rack
16/10/10 17:23:09 INFO Configuration.deprecation: mapred.min.split.size.per.node is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.node
16/10/10 17:23:09 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
16/10/10 17:23:09 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
Logging initialized using configuration in jar:file:/home/hadoop/app/hive-0.12.0/lib/hive-common-0.12.0.jar!/hive-log4j.properties
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hadoop/app/hadoop-2.4.1/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hadoop/app/hive-0.12.0/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
hive> SHOW DATABASES;
OK
default
hive
Time taken: 15.031 seconds, Fetched: 2 row(s)
hive> use hive;
OK
Time taken: 0.109 seconds
hive> LOAD DATA LOCAL INPATH '/home/hadoop/hiveTestData/XXX.data' INTO TABLE t_order;
Copying data from file:/home/hadoop/hiveTestData/XXX.data
Copying file: file:/home/hadoop/hiveTestData/XXX.data
Failed with exception File /tmp/hive-hadoop/hive_2016-10-10_17-24-21_574_6921522331212372447-1/-ext-10000/XXX.data could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and no node(s) are excluded in this operation.
at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1441)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2702)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:584)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:440)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.CopyTask
hive>
(最后在这里,找到了)
错误是:
Failed with exception File /tmp/hive-hadoop/hive_2016-10-10_17-54-30_887_2531771020597467111-1/-ext-10000/XXX.data could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and no node(s) are excluded in this operation.
解决方法:
出现此类报错主要原因是datanode存在问题,要么硬盘容量不够,要么datanode服务器down了。检查datanode,重启Hadoop即可解决。
我的这里,有错误,还没解决!
这里,t_order_wk,对应,我的t_order而已。只是表名不一样
哇,多么的明了啊!
这一句命令,比起MapReduce语句来,多么的牛! 其实,hive本来就是用mapreduce写的,只是作为数据仓库,为了方便。
hive的常用语法
已经看到了hive的使用,很方便,把SQL语句,翻译成mapreduce语句。
由此可见,xxx.data是向hive中表,加载进文件,也即,这文件是用LOAD进入。(从linux本地 -> hive中数据库)
yyy.data是向hive中表,加载进文件,也即,这文件是用hadoop fs –put进入。(从hdfs里 -> hive中数据库)
无论,是哪种途径,只要文件放进了/user/hive/warehouse/t_order_wk里,则,都可以读取到。
则,LOAD DATA LOCAL INPATH ,这文件是在,本地,即Linux里。从本地里导入
LOAD DATA INPATH,这文件是在,hdfs里。从hdfs里导入
那么,由此可见,若这DATA,即这文件,是在hdfs里,如uuu.data,则如“剪切”。
会有一个问题。如是业务系统产生的,我们业务或经常要读,路径是写好的,把文件移动了,
会干扰业务系统的进行。为此,解决这个问题,则,表为EXTERNAL。这是它的好处。
//external
CREATE EXTERNAL TABLE tab_ip_ext(id int, name string,
ip STRING,
country STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/external/hive';
为此,我们现在,去建立一个ETTERNAL表,与它jjj.data,关联起来。Soga,终于懂了。
内部表,被drop掉,会发生什么?
以及,内部表t_order_wk里的那些文件(xxx.data、yyy.data、zzz.data、jjj.data)都被drop掉了。
外部表,被drop掉,会发生什么?
是自定义的,hive_ext,在/下
内部表和外部表的,保存的路径在哪?
用于创建一些临时表存储中间结果
CTAS,即CREATE AS的意思
// CTAS 用于创建一些临时表存储中间结果
CREATE TABLE tab_ip_ctas
AS
SELECT id new_id, name new_name, ip new_ip,country new_country
FROM tab_ip_ext
SORT BY new_id;
用于向临时表中追加中间结果数据
//insert from select 用于向临时表中追加中间结果数据
create table tab_ip_like like tab_ip;
insert overwrite table tab_ip_like
select * from tab_ip;
这里,没演示
分区表
//PARTITION
create table tab_ip_part(
id int,name string,ip string,country string)partitioned by (part_flag string)row format delimited fields terminated by ',';
LOAD DATA LOCAL INPATH '/home/hadoop/ip.txt' OVERWRITE INTO TABLE tab_ip_part PARTITION(part_flag='part1');
LOAD DATA LOCAL INPATH '/home/hadoop/ip_part2.txt' OVERWRITE INTO TABLE tab_ip_part PARTITION(part_flag='part2');
select * from tab_ip_part;
select * from tab_ip_part where part_flag='part2';
select count(*) from tab_ip_part where part_flag='part2';
alter table tab_ip change id id_alter string;
ALTER TABLE tab_cts ADD PARTITION (partCol = 'dt') location '/external/hive/dt';
show partitions tab_ip_part;
每个月生成的订单记录,对订单进行统计,哪些商品的最热门,哪些商品的销售最大,哪些商品点击率最大,哪些商品的关联最高。
如果,对订单整个分析很大,为提高效率,在建立表时,就分区。
则,多了一个选择,你也可以对全部来,也可以对某个分区来。则按分区来。
//PARTITION
create table tab_ip_part(id int,name string,ip string,country string) partitioned by (part_flag string)row format delimited fields terminated by ',';load data local inpath '/home/hadoop/ip.txt' overwrite into table tab_ip_partpartition(part_flag='part1');load data local inpath '/home/hadoop/ip_part2.txt' overwrite into table tab_ip_partpartition(part_flag='part2');select * from tab_ip_part;
select * from tab_ip_part where part_flag='part2';
select count(*) from tab_ip_part where part_flag='part2'; alter table tab_ip change id id_alter string;ALTER TABLE tab_cts ADD PARTITION (partCol = 'dt') location '/external/hive/dt';show partitions tab_ip_part;
这里,不多演示赘述了。
hive的结构和原理
hive的原理和架构设计