Pig
Your general-purpose analytics tool for Hadoop
Launch CLI
pig
Change execution engine
At CLI launch
pig -x <mode>
In script
set exectype tez;
mode: local, mapreduce, tez, tez_local
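For example, to run a script on the Tez engine, or on Tez in a single local JVM (the script name is hypothetical):
pig -x tez wordcount.pig
pig -x tez_local wordcount.pig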
Execute Pig script
pig script.pig
Load data from HDFS
<var> = load 'path/to/file';
Load data from HDFS with schema
<var> = load 'path/to/file' as (field1:chararray, field2:int, ...);
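A minimal sketch, assuming a hypothetical tab-delimited file /data/users.txt:
users = load '/data/users.txt' using PigStorage('\t') as (name:chararray, age:int, city:chararray);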
Load data from Hive
Start Pig with HCatalog support:
pig -useHCatalog
var = load '<db>.<table>' using org.apache.hive.hcatalog.pig.HCatLoader();
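For example, assuming a hypothetical web_logs table in the default database:
logs = load 'default.web_logs' using org.apache.hive.hcatalog.pig.HCatLoader();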
Group data
<grouped_var> = group <var> by <col>;
<grouped_var> = group <var> all;
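Sketch, reusing the hypothetical users relation from above:
by_city = group users by city;   -- one row per city, holding a bag of matching users
everyone = group users all;      -- a single group containing every row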
Transform schema
<var> = foreach <raw_var> generate <col1>, <col2> ...;
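For example, projecting columns and counting grouped rows (relations are the hypothetical ones above):
names_only = foreach users generate name, age;
city_counts = foreach by_city generate group as city, COUNT(users) as n;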
Filter data
<var> = filter <raw_var> by <condition>;
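A minimal sketch on the hypothetical users relation loaded above:
adults = filter users by age >= 18;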
Order
<var> = order <raw_var> by <col> [asc|desc], <col> ...;
Limit the number of rows
<var> = limit <raw_var> <n>;
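Sketch, ordering the hypothetical adults relation and keeping the first 10 rows:
oldest_first = order adults by age desc;
top10 = limit oldest_first 10;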
Split
split <var> into <good_var> if <query>, <bad_var> otherwise;
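Sketch on the hypothetical users relation:
split users into adults if age >= 18, minors otherwise;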
Remove duplicates
<var> = distinct <raw_var>;
Inner join
<varC> = join <varA> by <col1>, <varB> by <col2>;
Right/Left/Full Outer join
<varC> = join <varA> by <col1> right|left|full [outer], <varB> by <col2>;
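Sketch, assuming a second hypothetical file /data/orders.txt:
orders = load '/data/orders.txt' as (user:chararray, amount:int);
j_inner = join users by name, orders by user;
j_left = join users by name left outer, orders by user;   -- keeps users with no orders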
Cross join
<varC> = cross <varA>, <varB>;
Join options
join <varA> by <col> left, <varB> by <col> using 'replicated' | 'skewed' | 'merge';
Replicated join doesn't work on Tez; use MapReduce mode.
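For example, a replicated (map-side) join; the second relation must be small enough to fit in memory (the file is hypothetical):
cities = load '/data/cities.txt' as (city:chararray, country:chararray);
j = join users by city, cities by city using 'replicated';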
Dump data on console
dump <var>;
Store data into HDFS
store <var> into 'path/to/file' [using PigStorage(',')];
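For example, writing the hypothetical adults relation as comma-delimited files:
store adults into '/output/adults' using PigStorage(',');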
Store data into Hive
store <var> into '<db>.<table>' using org.apache.hive.hcatalog.pig.HCatStorer();
Specify the number of reducers
Append parallel to any operator that triggers a reduce phase: group, distinct, order, join.
<reduce operator> parallel <n>;
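Sketch, on the hypothetical users relation:
by_city = group users by city parallel 10;   -- run this group's reduce phase with 10 reducers
set default_parallel 10;                     -- or set a script-wide default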
Debug
explain <alias>
describe <alias>
illustrate <alias>
set debug on
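For example, on the relations sketched above:
describe users;         -- prints the schema of users
explain city_counts;    -- shows the logical, physical and execution plans
illustrate city_counts; -- runs a small sample and shows the data at each step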
Register a UDF Jar
Register the jar (e.g. piggybank.jar):
register <jar>
Use an UDF
Call it like any other function; it may require the fully qualified class name.
Create an alias for function
define <alias> <udf/macro>
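A sketch using a Piggybank string UDF (jar path and class assumed; check your Piggybank version):
register piggybank.jar;
define Upper org.apache.pig.piggybank.evaluation.string.UPPER();
upper_names = foreach users generate Upper(name);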
Import a macro/UDF from another script
import 'path/to/script.pig';
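Sketch: a macro defined in a separate file and imported (the file name is hypothetical):
-- macros.pig
define count_by(rel, col) returns out {
    grp = group $rel by $col;
    $out = foreach grp generate group, COUNT($rel);
};
-- main script
import 'macros.pig';
city_counts2 = count_by(users, city);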