Hive
Run script file
Run non-interactiv query
Metadata
Create database
Use database
Create Hive managed table
Create externally managed table
The data is managed by the filesystem, and not by Hive.
Issuing a table drop will not destroy the data
Can have multiple table definitions/tools use the same data in parallel
Create table with partitions
Partitions are extra physical folders to split the data more logically.
While querying the data, they will appear as columns
Create table with buckets
Buckets enable more efficient queries + joining
Buckets act on columns by hashing the value, similarly to a hashmap
Need to
set hive.enforce.bucketing=true
property, to create the needed buckets at table declaration
Create table with different text delimiters (for TEXTFILE format)
Create table with a different storage format
Stored as to change the storage file type. Default is text file. Options:
textfile - default
rcfile / orcfile / orc
avro
sequencefile
parquet
Create external table
Data is stored externally to hive, and location folder needs to be specified at creation time (doesn't have to exist) to not copy from HDFS, or it will create a folder in normal Hive space on HDFS
Create table as select
Table will have the schema according to the select query and also be populated in 1 go (CTS create table as select statement)
Create table with the same schema
Create transactional table
Drop table
Drop schema (and data if managed by Hive)
Truncate table
Drop data and keep schema. Works for internal tables
Load data from local FS
Load data from HDFS
Load/Insert into partition dynamicaly
Insert data values
Insert data as select
Insert compressed data
Data saved as gzip or bzip2 is automatically decompressed when inserting into the table
Update
Update requires transactional behaviour for a table
stored as orc
has buckets
transactional property
Doesn't work to update the bucket col
Delete data
Alter table add column
Alter table rename
Alter table drop column
To drop a column you need to re-write the schema with the columns that you want to keep
Select
Select and sort globally
Select and sort per reducer
Joins
Inner join
Join keyword
SQL way
Outer join
Includes rows from the left table which don't have a match with the right table, and the values are set to null, or the reverse
Cross join
Change execution engine
Optimizations
vectorization
Vectorized query execution streamlines operations by processing a block of 1024 rows at a time.
Debug
Explain
Detail the execution plan
Last updated
Was this helpful?