Hive
Run script file
$ hive -f script.hql
Run non-interactive query
$ hive -e "<query>"
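For example, a one-off row count against a hypothetical table db.logs:
$ hive -e "select count(*) from db.logs;"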
Metadata
show tables;
describe extended <table>;
describe formatted <table>;
show partitions <table>;
Create database
create database <db>
Use database
use <db>
Create Hive managed table
create table <db>.<table> (<col> <type>, ...)
Create externally managed table
The data is managed by the filesystem, not by Hive.
Dropping the table removes only the metadata; the data itself is kept.
Multiple table definitions/tools can use the same data in parallel.
create external table <db>.<table> (...) location '/path/to/dir'
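A minimal sketch, assuming data files already sit under a hypothetical HDFS directory /data/sales:
create external table sales.orders (id int, amount double)
location '/data/sales';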
Create table with partitions
Partitions are extra physical folders that split the data along logical boundaries.
When querying, the partition keys appear as regular columns.
create table <table>(<col> <type>,...) partitioned by (<partition> <type>,...);
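For example, partitioning logs by date (names illustrative); dt is stored as one folder per value but can be filtered like a column:
create table logs (msg string) partitioned by (dt string);
select msg from logs where dt = '2020-01-01';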
Create table with buckets
Buckets enable more efficient queries and joins.
Buckets act on columns by hashing the value, similarly to a hashmap.
Set the
set hive.enforce.bucketing=true
property so that inserts create the number of buckets declared for the table.
create table <table>(<col> <type>,...) clustered by (<col>) [sorted by (<col> <asc/desc>)] into <x> buckets;
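A sketch, assuming a hypothetical users table bucketed on id:
set hive.enforce.bucketing=true;
create table users (id int, name string)
clustered by (id) sorted by (id asc) into 8 buckets;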
Create table with different text delimiters (for TEXTFILE format)
create table <table>(<col> <type>,...)
row format delimited
fields terminated by <delimiter>
collection items terminated by <item_del>
map keys terminated by <key_del>
lines terminated by <line_del>
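For instance, a comma-separated layout (the delimiters here are just one common choice):
create table csv_tbl (id int, tags array<string>)
row format delimited
fields terminated by ','
collection items terminated by '|'
lines terminated by '\n';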
Create table with a different storage format
Use stored as to change the storage file format. The default is textfile. Options:
textfile - default
rcfile / orcfile / orc
avro
sequencefile
parquet
create table <table>(<col> <type>,...) stored as <file_type>
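For example (parquet picked arbitrarily, names illustrative):
create table events (id int, name string) stored as parquet;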
Create external table
Data is stored outside of Hive. Specify the location folder at creation time (it doesn't have to exist yet) so the data is not copied from HDFS; otherwise Hive creates a folder in its normal space on HDFS.
create external table <table>(<col> <type>,...) LOCATION '/user/.../external_table';
Create table as select
The table gets its schema from the select query and is populated in one go (CTAS, create table as select).
create table <table> [stored as <format>] as select <col1,...> from <table_source>;
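A sketch with illustrative names:
create table big_orders stored as orc as
select id, amount from orders where amount > 100;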
Create table with the same schema
create table <tbl> like <source_tbl>
Create transactional table
set hive.enforce.bucketing = true;
create table <table>(...) clustered by (<col>) into <x> buckets stored as orc tblproperties ("transactional"="true");
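For example (names and bucket count illustrative):
set hive.enforce.bucketing=true;
create table tx_orders (id int, amount double)
clustered by (id) into 4 buckets
stored as orc tblproperties ("transactional"="true");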
Drop table
Drop schema (and data if managed by Hive)
drop table <table>;
Truncate table
Drops the data and keeps the schema. Works only for Hive-managed (internal) tables
truncate table <table>
Load data from local FS
load data local inpath '<file protocol>://<filepath>' [overwrite] into table <table> [partition (<part/col>=<val>,...)]
Load data from HDFS
load data inpath '/path/to/file' [overwrite] into table <table> [partition (<part/col>=<val>,...)]
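For example (paths and names illustrative):
load data local inpath '/tmp/orders.csv' into table orders;
load data inpath '/user/me/orders.csv' overwrite into table orders partition (dt='2020-01-01');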
Load/Insert into partition dynamically
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert/load ...
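A sketch (names illustrative); the partition column dt is filled from the last column of the select:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert into table logs partition (dt)
select msg, dt from staging_logs;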
Insert data values
insert [overwrite | into] table <table> [partition (part1=value1, part2=value2)] values (col_val1, col_val2, ...)
Insert data as select
insert [overwrite | into] table <table> [partition (part1, ...)] <select_statement> from <from_statement>
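For example, a static-partition values insert and an insert-select (names illustrative, assuming an orders table with a dt partition):
insert into table orders partition (dt='2020-01-01') values (1, 19.99);
insert into table orders_archive select id, amount from orders;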
Insert compressed data
Files compressed with gzip or bzip2 can be loaded as-is; Hive detects the compression and decompresses them automatically when reading the table
Update
Update requires transactional behaviour for a table
stored as orc
has buckets
transactional property
The bucketing column itself cannot be updated
UPDATE tablename SET column = value [, column = value ...] [WHERE expression]
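For example, against the transactional tx_orders table sketched above (id is the bucket column, so only amount is changed):
update tx_orders set amount = 0.0 where id = 42;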
Delete data
delete from <tbl> [where ...]
Alter table add column
alter table <table> add columns (<col> <type>)
Alter table rename
alter table <old_table> rename to <new_table_name>
Alter table drop column
To drop a column, re-declare the schema with only the columns you want to keep
alter table <table> replace columns(col1 type1, col2 type2,...)
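For instance, keeping only id and name drops every other column from the schema (the underlying data files are not rewritten):
alter table users replace columns (id int, name string);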
Select
SELECT [ALL | DISTINCT]
select_expr1, [select_expr2, ...]
FROM tbl_name
[WHERE where_expr]
[GROUP BY col1, col2,…]
[LIMIT n];
Select and sort globally
select <> from <table> order by <col> ASC/DESC [, <col2> ...]
Select and sort per reducer
select <> from <table> distribute by <col_x> sort by <col_x> ASC/DESC [, <col_y> ...]
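For example (names illustrative):
select id, amount from orders order by amount desc;         -- one globally sorted result
select id, amount from orders distribute by id sort by id;  -- sorted within each reducer only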
Joins
Inner join
Join keyword
select t1.*, t2.* from t1 join t2 on (t1.col_x = t2.col_x [AND ...]) [join ... on ...]
SQL way
select t1.*, t2.* from t1, t2 where t1.col_x = t2.col_x
Outer join
Includes rows from the left table that have no match in the right table (the right-side columns are set to null), or the reverse for RIGHT; FULL keeps unmatched rows from both sides
select t1.*, t2.* from t1 {LEFT|RIGHT|FULL} OUTER join t2 on (t1.col_x = t2.col_x [AND ...]) [join ... on ...]
Cross join
select t1.*, t2.* from t1 cross join t2;
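For example, a left outer join assuming hypothetical orders and customers tables:
select o.id, c.name
from orders o left outer join customers c on (o.cust_id = c.id);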
Change execution engine
set hive.execution.engine=tez|mr|...
Optimizations
Vectorization
Vectorized query execution streamlines operations by processing a block of 1024 rows at a time.
set hive.vectorized.execution.enabled = true;
Debug
Explain
Detail the execution plan
explain
<query>
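For example:
explain select count(*) from orders;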