Pig

Your general-purpose Hadoop analytics tool

Launch CLI

pig

Change execution engine

At CLI launch

pig -x <mode>

In script

set exectype tez;

mode: local, mapreduce, tez, tez_local

Execute Pig script

pig script.pig

Load data from HDFS

<var> = load 'path/to/file';

Load data from HDFS with schema

<var> = load 'path/to/file' as (field1:chararray, field2:int, ...);
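For instance, loading a hypothetical comma-delimited file (path, delimiter, and field names are illustrative):

```
-- PigStorage(',') overrides the default tab delimiter
users = load '/data/users.csv' using PigStorage(',')
        as (name:chararray, age:int, city:chararray);
```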

Load data from Hive

  • Start Pig with HCatalog support

$ pig -useHCatalog

<var> = load '<db>.<table>' using org.apache.hive.hcatalog.pig.HCatLoader();

Group data

<grouped_var> = group <var> by <col>; / <grouped_var> = group <var> all;
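A sketch of both forms, assuming a hypothetical relation users with fields (name, age, city):

```
-- group by a column, then aggregate per group
by_city = group users by city;
counts  = foreach by_city generate group as city, COUNT(users) as n;

-- group everything into a single bag, e.g. for a global count
all_rows = group users all;
total    = foreach all_rows generate COUNT(users);
```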

Transform schema

<var> = foreach <raw_var> generate <col1>, <col2> ...;

Filter data

<var> = filter <raw_var> by <filter_query>;

Order

<var> = order <raw_var> by <col> (asc/desc), <col> ...;

Limit the number of rows

<var> = limit <raw_var> <size>;

Split

split <var> into <good_var> if <query>, <bad_var> otherwise;
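For example, routing rows of a hypothetical users relation by a predicate (names are illustrative):

```
-- rows matching the condition go to adults, all others to minors
split users into adults if age >= 18, minors otherwise;
```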

Remove duplicates

<var> = distinct <raw_var>;

Inner join

<varC> = join <varA> by <col1>, <varB> by <col2>;

Right/Left/Full Outer join

<varC> = join <varA> by <col1> right/left/full (outer), <varB> by <col2>;
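A minimal left-outer sketch, assuming hypothetical users and orders relations:

```
-- keeps every row of users; order fields are null where no match exists
joined = join users by name left outer, orders by customer;
```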

Cross join

<varC> = cross <varA>, <varB>;

Join options

join <varA> by <col1> left, <varB> by <col2> using 'replicated' | 'skewed' | 'merge';

replicated doesn't work on tez; use mapreduce mode

Dump data on console

dump <var>;

Store data into HDFS

store <var> into '<path/to/file>' ( using PigStorage(',') );
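For example, writing a hypothetical relation as comma-delimited files (path is illustrative):

```
-- the output path is a directory; the job fails if it already exists
store adults into '/output/adults' using PigStorage(',');
```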

Store data into Hive

store <var> into '<db>.<table>' using org.apache.hive.hcatalog.pig.HCatStorer();

Specify the number of reducers

Append 'parallel' to any reduce-side operator: group, distinct, order, join

<reducer op> parallel <nb>;
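For instance, on a hypothetical group (relation and count are illustrative):

```
-- request 10 reduce tasks for this grouping step
by_city = group users by city parallel 10;
```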

Debug

explain <query>

describe <query>

illustrate <query>

set debug on

Register a UDF Jar

Register the jar (e.g. piggybank.jar).

register <jar>;

Use a UDF

Call it like any other function; it may require the fully qualified package name

Create an alias for function

define <alias> <udf/macro>;
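A sketch combining register and define with a PiggyBank string UDF (relation and field names are illustrative):

```
register piggybank.jar;
-- shorten the fully qualified class name to an alias
define Reverse org.apache.pig.piggybank.evaluation.string.Reverse();
reversed = foreach users generate Reverse(name);
```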

Import a macro/UDF from another script

import '<path/to/script.pig>';
