Pig
Your general-purpose analytics pig tool
Launch CLI
pig
Change execution engine
At CLI launch
pig -x <mode>
In script
set exectype=tez;
mode: local, mapreduce, tez, tez_local
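As a quick sketch (the script name is hypothetical), the same engine switch from the CLI:

```shell
pig -x tez script.pig      # run on Tez
pig -x local script.pig    # run locally while developing on small data
```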
Execute Pig script
pig script.pig
Load data from HDFS
<var> = load 'path/to/file';
Load data from HDFS with schema
<var> = load 'path/to/file' as (field1:chararray, field2:int, ...);
Load data from Hive
Start Pig with HCatalog
$ pig -useHCatalog
<var> = load '<db>.<table>' using org.apache.hive.hcatalog.pig.HCatLoader();
Group data
<grouped_var> = group <var> by <col>;
<grouped_var> = group <var> all;
Transform schema
<var> = foreach <raw_var> generate <col1>, <col2> ...;
Filter data
<var> = filter <raw_var> by <filter_query>;
Order
<var> = order <var> by <col> asc/desc, <col> ...;
Limit the number of rows
<var> = limit <var> <size>;
Split
split <var> into <good_var> if <query>, <bad_var> otherwise;
Remove duplicates
<var> = distinct <var>;
Inner join
<varC> = join <varA> by <col1>, <varB> by <col2>;
Right/Left/Full Outer join
<varC> = join <varA> by <col1> right/left/full (outer), <varB> by <col2>;
Cross join
<varC> = cross <varA>, <varB>;
Join options
join ... left, ... using 'replicated' | 'skewed' | 'merge'
Note: 'replicated' doesn't work on Tez; use mapreduce mode.
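A minimal sketch of a replicated (map-side) join, with hypothetical relation and file names; the second relation must be small enough to fit in memory:

```pig
-- clicks is large; users is a small dimension table held in memory,
-- so the join runs map-side with no reduce phase
users  = load 'users.tsv'  as (uid:int, name:chararray);
clicks = load 'clicks.tsv' as (uid:int, url:chararray);
joined = join clicks by uid, users by uid using 'replicated';
```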
Dump data on console
dump <var>;
Store data into HDFS
store <var> into '<path/to/file>' (using PigStorage(','));
Store data into Hive
store <var> into 'table' using org.apache.hive.hcatalog.pig.HCatStorer();
Specify the number of reducers
Add 'parallel <nb>' to any reduce-phase operator: group, distinct, order, join.
<reducer ops> parallel <nb>
Debug
explain <query>
describe <query>
illustrate <query>
set debug on
Register a UDF Jar
Register the jar (e.g. PiggyBank.jar).
register <jar>;
Use a UDF
Use it like any other function; it may require the fully qualified package name.
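A short sketch using the PiggyBank UPPER UDF (the jar path is a hypothetical local location; adjust it to your installation):

```pig
-- register the jar, then call the UDF by its fully qualified name
register /usr/lib/pig/piggybank.jar;
lines  = load 'names.txt' as (name:chararray);
result = foreach lines generate
         org.apache.pig.piggybank.evaluation.string.UPPER(name);
```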
Create an alias for a function
define <alias> <udf/macro>;
Import a macro/UDF from another script
import '<path/to/script>';
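Putting the pieces together, a minimal end-to-end sketch (file names and fields are hypothetical): load, filter, aggregate, order, store.

```pig
-- total bytes served per IP for successful requests
logs    = load '/data/access_log' as (ip:chararray, status:int, bytes:long);
ok      = filter logs by status == 200;
by_ip   = group ok by ip;
traffic = foreach by_ip generate group as ip, SUM(ok.bytes) as total_bytes;
sorted  = order traffic by total_bytes desc;
store sorted into '/data/traffic_by_ip' using PigStorage('\t');
```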