Hive
Run script file
$ hive -f script.hql
Run non-interactive query
$ hive -e "<query>"
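For example, a one-off row count against a hypothetical table db.logs:
$ hive -e "select count(*) from db.logs;"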
Metadata
show tables;
describe extended <table>;
describe formatted <table>;
show partitions <table>;
Create database
create database <db>
Use database
use <db>
Create Hive managed table
create table <db>.<table> (<col> <type>, ...)
Create externally managed table
The data is managed by the filesystem, not by Hive.
Dropping the table removes only the metadata; the data itself is kept.
Multiple table definitions/tools can use the same data in parallel.
create external table <db>.<table> (...) location '/path/to/dir'
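A minimal sketch, assuming data files already sit under a hypothetical HDFS directory /data/sales:
create external table sales.orders (id int, amount double)
location '/data/sales';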
Create table with partitions
Partitions are extra physical folders that split the data along logical boundaries.
When querying, the partition keys appear as regular columns.
create table <table>(<col> <type>,...) partitioned by (<partition> <type>,...);
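For example, partitioning logs by date (names illustrative); dt is stored as one folder per value but can be filtered like a column:
create table logs (msg string) partitioned by (dt string);
select msg from logs where dt = '2020-01-01';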
Create table with buckets
Buckets enable more efficient queries and joins.
Buckets act on columns by hashing the value, similarly to a hashmap.
Set the
set hive.enforce.bucketing=true
property so that inserts create the number of buckets declared for the table.
create table <table>(<col> <type>,...) clustered by (<col>) [sorted by (<col> <asc/desc>)] into <x> buckets;
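A sketch, assuming a hypothetical users table bucketed on id:
set hive.enforce.bucketing=true;
create table users (id int, name string)
clustered by (id) sorted by (id asc) into 8 buckets;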
Create table with different text delimiters (for TEXTFILE format)
create table <table>(<col> <type>,...)
row format delimited
fields terminated by <delimiter>
collection items terminated by <item_del>
map keys terminated by <key_del>
lines terminated by <line_del>
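For instance, a comma-separated layout (the delimiters here are just one common choice):
create table csv_tbl (id int, tags array<string>)
row format delimited
fields terminated by ','
collection items terminated by '|'
lines terminated by '\n';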
Create table with a different storage format
Use stored as to change the storage file format. The default is textfile. Options:
textfile - default
rcfile / orcfile / orc
avro
sequencefile
parquet
create table <table>(<col> <type>,...) stored as <file_type>
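For example (parquet picked arbitrarily, names illustrative):
create table events (id int, name string) stored as parquet;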
Create external table
Data is stored outside of Hive. Specify the location folder at creation time (it doesn't have to exist yet) so the data is not copied from HDFS; otherwise Hive creates a folder in its normal space on HDFS.
create external table <table>(<col> <type>,...) LOCATION '/user/.../external_table';
Create table as select
The table gets its schema from the select query and is populated in one go (CTAS, create table as select).
create table <table> [stored as <format>] as select <col1,...> from <table_source>;
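A sketch with illustrative names:
create table big_orders stored as orc as
select id, amount from orders where amount > 100;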
Create table with the same schema
create table <tbl> like <source_tbl>
Create transactional table
set hive.enforce.bucketing = true;
create table <table>(...) clustered by (<col>) into <x> buckets stored as orc tblproperties ("transactional"="true");
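For example (names and bucket count illustrative):
set hive.enforce.bucketing=true;
create table tx_orders (id int, amount double)
clustered by (id) into 4 buckets
stored as orc tblproperties ("transactional"="true");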
Drop table
Drop schema (and data if managed by Hive)
drop table <table>;
Truncate table
Drops the data and keeps the schema. Works only for Hive-managed (internal) tables
truncate table <table>
Load data from local FS
load data local inpath '<file protocol>://<filepath>' [overwrite] into table <table> [partition (<part/col>=<val>,...)]
Load data from HDFS
load data inpath '/path/to/file' [overwrite] into table <table> [partition (<part/col>=<val>,...)]
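For example (paths and names illustrative):
load data local inpath '/tmp/orders.csv' into table orders;
load data inpath '/user/me/orders.csv' overwrite into table orders partition (dt='2020-01-01');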
Load/Insert into partition dynamically
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert/load ...
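A sketch (names illustrative); the partition column dt is filled from the last column of the select:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert into table logs partition (dt)
select msg, dt from staging_logs;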
Insert data values
insert [overwrite | into] table <table> [partition (part1=value1, part2=value2)] values (col_val1, col_val2, ...)
Insert data as select
insert [overwrite | into] table <table> [partition (part1, ...)] <select_statement> from <from_statement>
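For example, a static-partition values insert and an insert-select (names illustrative, assuming an orders table with a dt partition):
insert into table orders partition (dt='2020-01-01') values (1, 19.99);
insert into table orders_archive select id, amount from orders;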
Insert compressed data
Files compressed with gzip or bzip2 can be loaded as-is; Hive detects the compression and decompresses them automatically when reading the table
Update
Update requires transactional behaviour for a table
stored as orc
has buckets
transactional property
The bucketing column itself cannot be updated
UPDATE tablename SET column = value [, column = value ...] [WHERE expression]
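For example, against the transactional tx_orders table sketched above (id is the bucket column, so only amount is changed):
update tx_orders set amount = 0.0 where id = 42;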
Delete data
delete from <tbl> [where ...]
Alter table add column
alter table <table> add columns (<col> <type>)
Alter table rename
alter table <old_table> rename to <new_table_name>
Alter table drop column
To drop a column, re-declare the schema with only the columns you want to keep
alter table <table> replace columns(col1 type1, col2 type2,...)
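For instance, keeping only id and name drops every other column from the schema (the underlying data files are not rewritten):
alter table users replace columns (id int, name string);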
Select
SELECT [ALL | DISTINCT]
select_expr1, [select_expr2, ...]
FROM tbl_name
[WHERE where_expr]
[GROUP BY col1, col2,…]
[LIMIT n];
Select and sort globally
select <> from <table> order by <col> ASC/DESC [, <col2> ...]
Select and sort per reducer
select <> from <table> distribute by <col_x> sort by <col_x> ASC/DESC [, <col_y> ...]
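For example (names illustrative):
select id, amount from orders order by amount desc;         -- one globally sorted result
select id, amount from orders distribute by id sort by id;  -- sorted within each reducer only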
Joins
Inner join
Join keyword
select t1.*, t2.* from t1 join t2 on (t1.col_x = t2.col_x [AND ...]) [join ... on ...]
SQL way
select t1.*, t2.* from t1, t2 where t1.col_x = t2.col_x
Outer join
Includes rows from the left table that have no match in the right table (the right-side columns are set to null), or the reverse for RIGHT; FULL keeps unmatched rows from both sides
select t1.*, t2.* from t1 {LEFT|RIGHT|FULL} OUTER join t2 on (t1.col_x = t2.col_x [AND ...]) [join ... on ...]
Cross join
select t1.*, t2.* from t1 cross join t2;
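For example, a left outer join assuming hypothetical orders and customers tables:
select o.id, c.name
from orders o left outer join customers c on (o.cust_id = c.id);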
Change execution engine
set hive.execution.engine=tez|mr|...
Optimizations
Vectorization
Vectorized query execution streamlines operations by processing a block of 1024 rows at a time.
set hive.vectorized.execution.enabled = true;
Debug
Explain
Detail the execution plan
explain
<query>
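For example:
explain select count(*) from orders;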