# Hive

## Run script file

```
$ hive -f script.hql
```

## Run non-interactive query

```
$ hive -e "<query>"
```
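
For example, a one-off count against a hypothetical `logs` table:

```
$ hive -e "select count(*) from logs;"
```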

## Metadata

```
show tables;

describe extended <table>;

describe formatted <table>;

show partitions <table>;
```

## Create database

```
create database <db>
```

## Use database

```
use <db>
```

## Create Hive managed table

```
create table <db>.<table> (<col> <type>, ...)
```

## Create externally managed table

* The data is managed by the filesystem, not by Hive.
* Dropping the table removes only the metadata, not the data.
* Multiple table definitions/tools can use the same data in parallel.

```
create external table <db>.<table> (...) location '/path/to/dir'
```
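
A sketch, assuming a hypothetical HDFS directory `/data/raw/events` that already holds comma-delimited files:

```
create external table staging.events (
    id int,
    ts string,
    payload string
)
row format delimited fields terminated by ','
location '/data/raw/events';
```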

## Create table with partitions

* Partitions are extra physical folders that split the data along logical boundaries.
* When querying the data, partition columns appear like regular columns.

```
create table <table>(<col> <type>,...) partitioned by (<partition> <type>,...);
```
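
For example, a hypothetical `sales` table partitioned by year and month; the partition columns can then be filtered like ordinary columns:

```
create table sales (id int, amount double)
partitioned by (year int, month int);

-- partition columns behave like regular columns in queries
select sum(amount) from sales where year = 2023 and month = 6;
```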

## Create table with buckets

* Buckets enable more efficient queries and joins.
* Buckets hash a column's value to decide placement, similarly to a hashmap.
* Set the `hive.enforce.bucketing=true` property so that inserts populate the buckets declared on the table (Hive 2.x enforces this by default).

```
create table <table>(<col> <type>,...) clustered by (<col>) [sorted by (<col> <asc/desc>)] into <x> buckets;
```
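
A sketch, bucketing a hypothetical `users` table on its id:

```
set hive.enforce.bucketing = true;

create table users (id int, name string)
clustered by (id) sorted by (id asc) into 32 buckets;
```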

## Create table with different text delimiters (for TEXTFILE format)

```
create table <table>(<col> <type>,...)
row format delimited
    fields terminated by <delimiter>
    collection items terminated by <item_del>
    map keys terminated by <key_del>
    lines terminated by <line_del>;
```
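
For example, a hypothetical `contacts` table with a collection and a map column:

```
create table contacts (
    name string,
    phones array<string>,
    props map<string, string>
)
row format delimited
    fields terminated by ','
    collection items terminated by '|'
    map keys terminated by ':'
    lines terminated by '\n';
```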

## Create table with a different storage format

Use `stored as` to change the storage file format. The default is a plain text file. Options:

* textfile - default
* rcfile / orcfile / orc
* avro
* sequencefile
* parquet

```
create table <table>(<col> <type>,...) stored as <file_type>
```
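
For example, an ORC-backed table:

```
create table metrics (k string, v double) stored as orc;
```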

## Create external table

Data is stored externally to Hive. The location folder needs to be specified at creation time (it doesn't have to exist yet) so that no data is copied; otherwise Hive creates a folder in its normal warehouse space on HDFS.

```
create external table <table>(<col> <type>,...) LOCATION '/user/.../external_table';
```

## Create table as select

The table's schema is derived from the select query, and the table is populated in one go (CTAS: create table as select).

```
create table <table> [stored as <format>] as select <col1,...> from <table_source>;
```
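
A sketch, assuming a hypothetical `orders` source table:

```
create table top_customers stored as orc as
select customer_id, sum(amount) as total
from orders
group by customer_id;
```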

## Create table with the same schema

```
create table <tbl> like <source_tbl>
```

## Create transactional table

```
set hive.enforce.bucketing = true;
create table <table>(...) clustered by (<col>) into <x> buckets stored as orc tblproperties ("transactional"="true");
```
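
A concrete sketch with a hypothetical `accounts` table. Depending on the Hive version, ACID operations also require `hive.support.concurrency=true` and the `DbTxnManager` transaction manager:

```
create table accounts (id int, balance double)
clustered by (id) into 8 buckets
stored as orc tblproperties ("transactional"="true");
```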

## Drop table

Drops the schema (and the data, if the table is managed by Hive)

```
drop table <table>;
```

## Truncate table

Drops the data and keeps the schema. Works only for managed (internal) tables

```
truncate table <table>
```

## Load data from local FS

```
load data local inpath '<file protocol>://<filepath>' [overwrite] into table <table> [partition (<part/col>=<val>,...)]
```
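
For example, loading a hypothetical local CSV file into one partition of the `sales` table:

```
load data local inpath '/home/user/sales_2023_06.csv'
overwrite into table sales partition (year=2023, month=6);
```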

## Load data from HDFS

```
load data inpath '/path/to/file' [overwrite] into table <table> [partition (<part/col>=<val>,...)]
```

## Load/Insert into partition dynamically

```
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

insert/load ...
```
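
A sketch, assuming a hypothetical `staging_sales` source; the dynamic partition values must come last in the select list:

```
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

insert into table sales partition (year, month)
select id, amount, year, month from staging_sales;
```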

## Insert data values

```
insert [overwrite | into] table <table> [partition (part1=value1, part2=value2)] values (col_val1, col_val2, ...)
```
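
For example, inserting two rows into a static partition of the hypothetical `sales` table:

```
insert into table sales partition (year=2023, month=6)
values (1, 9.99), (2, 19.99);
```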

## Insert data as select

```
insert [overwrite | into] table <table> [partition (part1, ...)] select <cols> from <source_table>
```
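
For example, overwriting one static partition from the hypothetical staging table:

```
insert overwrite table sales partition (year=2023, month=6)
select id, amount from staging_sales
where year = 2023 and month = 6;
```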

## Insert compressed data

Files compressed with gzip or bzip2 can be loaded/inserted as-is; Hive decompresses them automatically.

## Update

* Update requires transactional behaviour on the table:
  * stored as orc
  * has buckets
  * has the transactional table property
* The bucketing column cannot be updated

```
UPDATE tablename SET column = value [, column = value ...] [WHERE expression]
```
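
For example, against the transactional `accounts` table sketched above:

```
update accounts set balance = balance + 100 where id = 42;
```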

## Delete data

```
delete from <tbl> [where ...]
```

## Alter table add column

```
alter table <table> add columns (<col> <type>)
```

## Alter table rename

```
alter table <old_table> rename to <new_table_name>
```

## Alter table drop column

There is no direct drop; to drop a column, re-write the schema with only the columns that you want to keep

```
alter table <table> replace columns(col1 type1, col2 type2,...)
```
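
For example, keeping only `id` and `name` on the hypothetical `users` table, effectively dropping every other column:

```
alter table users replace columns (id int, name string);
```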

## Select

```
SELECT [ALL | DISTINCT]
    select_expr1 [, select_expr2, ...]
FROM tbl_name
[WHERE where_expr]
[GROUP BY col1, col2, ...]
[LIMIT n];
```
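
For example, against the hypothetical `sales` table:

```
select year, count(*) as orders
from sales
where amount > 0
group by year
limit 10;
```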

### Select and sort globally

```
select <cols> from <table> order by <col> ASC/DESC [, <col2> ...]
```

### Select and sort per reducer

```
select <cols> from <table> distribute by <col_x> sort by <col_x> ASC/DESC [, <col_y> ...]
```
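
A sketch contrasting the two on the hypothetical `sales` table:

```
-- single global order (one reducer performs the final sort)
select * from sales order by amount desc;

-- rows with the same year go to the same reducer and are sorted within it
select * from sales distribute by year sort by amount desc;
```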

## Joins

### Inner join

Using the `join` keyword

```
select t1.*, t2.* from t1 join t2 on (t1.col_x = t2.col_x [AND ...]) [join ... on ...]
```

The classic SQL way (implicit inner join)

```
select t1.*, t2.* from t1, t2 where t1.col_x = t2.col_x
```

### Outer join

A left outer join keeps rows from the left table that have no match in the right table, filling the right-side columns with null; right and full outer joins behave correspondingly.

```
select t1.*, t2.* from t1 {LEFT|RIGHT|FULL} OUTER join t2 on (t1.col_x = t2.col_x [AND ...]) [join ... on ...]
```
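
For example, keeping all orders even when the customer is missing (hypothetical tables):

```
select o.id, c.name
from orders o
left outer join customers c on (o.customer_id = c.id);
```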

### Cross join

```
select t1.*, t2.* from t1 cross join t2;
```

## Change execution engine

```
set hive.execution.engine=tez|mr|...
```

## Optimizations

### Vectorization

Vectorized query execution streamlines operations by processing blocks of 1024 rows at a time instead of one row at a time.

```
set hive.vectorized.execution.enabled = true;
```

## Debug

### Explain

Details the execution plan of the query

```
explain
<query>
```
