Apache Pig – Bag vs Tuple usage in Java UDFs


Bags and tuples are two of the three complex data types in Apache Pig (the third is the map). They most often appear when the GROUP operator is used in a Pig script. A bag is written with curly brackets “{}” and a tuple with parentheses “()”.

They are called complex because they can be nested to any depth. A bag can contain many tuples, and a tuple can contain several simple values, other tuples, or bags (which is how a bag ends up inside another bag), and so on.

By definition, a tuple is an ordered sequence of fields, and each field can hold a different data type (int, string, etc.). A tuple is written as values inside parentheses, separated by commas. For example, (value1, 1000, 07/03/2015) is a tuple with 3 fields.

A bag is an unordered collection of tuples. The tuples in a bag are not required to share a schema, so they can differ in the number and types of their fields. A bag is written as tuples inside curly brackets, for example “{(tuple1a, tuple1b),(tuple2),(tuple3)}”.

Pig Script Example

This Pig script sends a bag of tuples to a Java UDF. The data loaded into alias A is comma-delimited. Alias B groups it by “field1”. In alias C, the bag of all “field2” values for each group is passed to a Java UDF for processing.

In the Java UDF, a bag of tuples is processed this way…
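As a minimal sketch of such a UDF, assuming the bag arrives as the first field of the input tuple and each inner tuple carries a single value: the class name JoinBagValues and the "|" separator are hypothetical, chosen only for illustration, and the processing step (joining the values into one string) stands in for whatever the real UDF does.

```java
import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;

// Hypothetical UDF: joins the first field of every tuple in a bag with '|'.
public class JoinBagValues extends EvalFunc<String> {

    @Override
    public String exec(Tuple input) throws IOException {
        // Pig hands the UDF arguments to exec() wrapped in a tuple.
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }

        // Assumption: the first argument is the bag produced by the grouping.
        DataBag bag = (DataBag) input.get(0);

        StringBuilder joined = new StringBuilder();
        // A DataBag is Iterable<Tuple>; bags are unordered, so the
        // iteration order should not be relied on.
        for (Tuple t : bag) {
            if (t == null || t.size() == 0 || t.get(0) == null) {
                continue; // skip empty tuples and null fields
            }
            if (joined.length() > 0) {
                joined.append('|');
            }
            joined.append(t.get(0).toString());
        }
        return joined.toString();
    }
}
```

From a script, a UDF like this would be applied in a FOREACH over the grouped alias once its jar has been made available with REGISTER.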

Bag vs Tuple usage in Java UDFs

Bag in Java UDFs

Create new Empty Bag

DataBag emptyBag = BagFactory.getInstance().newDefaultBag();

Create a Bag of Tuples

// value1 and value2 must already be Tuple objects
List<Tuple> listOfTuples = new ArrayList<Tuple>();
listOfTuples.add(value1);
listOfTuples.add(value2);
DataBag bagOfTuples = BagFactory.getInstance().newDefaultBag(listOfTuples);
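The snippets above leave out the imports and the creation of the tuples themselves. A self-contained sketch might look like the following; the class name BagDemo and the helper bagOf are hypothetical, and each value is wrapped in a one-field tuple as an assumption about the data shape.

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

// Hypothetical demo class: builds a bag of one-field tuples.
public class BagDemo {

    // Wraps each value in its own tuple and collects the tuples in a bag.
    public static DataBag bagOf(String... values) {
        TupleFactory tupleFactory = TupleFactory.getInstance();
        List<Tuple> tuples = new ArrayList<Tuple>();
        for (String value : values) {
            Tuple t = tupleFactory.newTuple(); // empty tuple
            t.append(value);                   // now a one-field tuple
            tuples.add(t);
        }
        return BagFactory.getInstance().newDefaultBag(tuples);
    }

    public static void main(String[] args) {
        DataBag bag = bagOf("value1", "value2");

        // A bag can also grow after it has been created.
        Tuple extra = TupleFactory.getInstance().newTuple();
        extra.append("value3");
        bag.add(extra);

        System.out.println(bag.size() + " tuples in bag: " + bag);
    }
}
```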

Tuples in Java UDFs

Create a new Empty Tuple

Tuple tuple = TupleFactory.getInstance().newTuple();

Create a tuple with 2 values

Tuple tuple = TupleFactory.getInstance().newTuple();
tuple.append("value1");
tuple.append("value2");
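Reading the fields back is the other half of working with tuples. The sketch below builds a tuple from a list in one call and then walks its fields by position; the class name TupleDemo and the helper pairOf are hypothetical, and the sample values are made up for illustration.

```java
import java.util.Arrays;

import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

// Hypothetical demo class: creates a two-field tuple and reads it back.
public class TupleDemo {

    // Builds a tuple from a list instead of calling append() per field.
    public static Tuple pairOf(Object first, Object second) {
        return TupleFactory.getInstance().newTuple(Arrays.asList(first, second));
    }

    public static void main(String[] args) throws ExecException {
        // Fields in the same tuple may have different types.
        Tuple tuple = pairOf("value1", 1000);

        // get(i) returns Object and throws ExecException for an invalid index.
        for (int i = 0; i < tuple.size(); i++) {
            Object field = tuple.get(i);
            System.out.println(i + ": " + field
                    + " (" + field.getClass().getSimpleName() + ")");
        }
    }
}
```

Because get returns Object, a UDF usually casts each field to the type it expects, or checks the type first when the schema is not guaranteed.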


Here is a list of Java UDFs