Pig

Your general-purpose analytics tool

Launch CLI

pig

Change execution engine

At CLI launch

pig -x <mode>

In script

set exectype=tez;

mode: local, mapreduce, tez, tez_local

Execute Pig script

pig script.pig

Load data from HDFS

<var> = load 'path/to/file';

Load data from HDFS with schema
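A minimal sketch; the field names and types are illustrative:

```pig
-- comma-delimited file; schema declared with 'as'
users = load 'path/to/file' using PigStorage(',')
        as (id:int, name:chararray, age:int);
```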

Load data from Hive

  • Start Pig with HCatalog enabled: pig -useHCatalog
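With HCatalog enabled, a Hive table can then be loaded via HCatLoader (the table name here is illustrative):

```pig
-- requires launching with: pig -useHCatalog
users = load 'default.users' using org.apache.hive.hcatalog.pig.HCatLoader();
```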

Group data
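A minimal sketch, assuming a relation users with an age field:

```pig
-- produces one row per distinct age, with a bag of matching rows
by_age = group users by age;
```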

Transform schema
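Schemas are reshaped with foreach ... generate; the relation and field names here are illustrative:

```pig
-- project a subset of fields and derive a new one
names = foreach users generate id, UPPER(name) as name_upper;
```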

Filter data
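A minimal sketch, assuming a relation users with an age field:

```pig
adults = filter users by age >= 18;
```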

Order
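For example (relation and field names are illustrative):

```pig
sorted = order users by age desc;
```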

Limit the number of rows
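For example, keeping the first 10 rows of a hypothetical users relation:

```pig
first10 = limit users 10;
```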

Split
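split routes each row into one or more relations based on conditions; names below are illustrative:

```pig
split users into minors if age < 18, adults if age >= 18;
```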

Remove duplicates
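For example:

```pig
unique_users = distinct users;
```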

Inner join
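A minimal sketch, assuming hypothetical users and orders relations:

```pig
joined = join users by id, orders by user_id;
```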

Right/Left/Full Outer join
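Same syntax as an inner join, with the join type between the two relations (names are illustrative):

```pig
joined = join users by id left outer, orders by user_id;
-- likewise: right outer, full outer
```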

Cross join
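cross produces the Cartesian product of two relations, so use it sparingly on large inputs:

```pig
pairs = cross users, products;
```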

Join options

'replicated' doesn't work on Tez; use MapReduce mode.
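A join strategy is selected with the using clause; relation names below are illustrative:

```pig
-- 'replicated': the second (small) relation is loaded into memory
joined = join users by id, orders by user_id using 'replicated';
-- other strategies: 'skewed', 'merge'
```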

Dump data on console
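For example:

```pig
dump users;
```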

Store data into HDFS
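A minimal sketch; the output path is illustrative:

```pig
store users into 'path/to/output' using PigStorage(',');
```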

Store data into Hive
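With HCatalog enabled, HCatStorer writes into an existing Hive table (the table name here is illustrative):

```pig
store users into 'default.users_out' using org.apache.hive.hcatalog.pig.HCatStorer();
```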

Specify the number of reducers

Add 'parallel <n>' to any operator that triggers a reduce phase: group, distinct, order, join
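For example, forcing 10 reducers on a group (relation and field names are illustrative):

```pig
by_age = group users by age parallel 10;
```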

Debug
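The usual debugging operators, applied to a hypothetical users relation:

```pig
describe users;    -- print the schema
explain users;     -- show the logical/physical/execution plans
illustrate users;  -- trace sample rows through the script
```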

Register a UDF Jar

Register the JAR (e.g. piggybank.jar).
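For example:

```pig
register 'piggybank.jar';
```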

Use a UDF

Call it like any other function; it may require the fully qualified package name.
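For example, using Piggybank's UPPER with its full package name (relation and field names are illustrative):

```pig
upper_names = foreach users generate
    org.apache.pig.piggybank.evaluation.string.UPPER(name);
```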

Create an alias for function
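define binds a short alias to a fully qualified function name (relation and field names are illustrative):

```pig
define UPPER org.apache.pig.piggybank.evaluation.string.UPPER();
upper_names = foreach users generate UPPER(name);
```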

Import a macro/UDF from another script
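A minimal sketch; the path is illustrative:

```pig
import 'path/to/my_macros.pig';
```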
