Java UDF to Sort a Bag of Tuples

java pig udf

This UDF sorts a bag of tuples. You can order the data in a Pig script, but bags do not guarantee order, so a more reliable way is to create a User Defined Function (UDF) that does the sorting for you. To sort the tuples alphabetically, the UDF uses Collections from java.util. It is very easy to use and makes the Pig script a little cleaner.

Example: {(Item A),(Item D),(Item C),(Item B)} is converted to {(Item A),(Item B),(Item C),(Item D)}
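The sorting step itself can be sketched in plain Java. This is only an illustration, not the UDF's actual source: it has no Pig dependency, a String stands in for a one-field tuple, and the names BagSortSketch and sortItems are made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for the UDF's sort step; a String plays the role of a one-field tuple.
class BagSortSketch {

    // Copies the unordered items into an ArrayList and sorts them in natural (alphabetical) order.
    static List<String> sortItems(Iterable<String> bag) {
        List<String> items = new ArrayList<>();
        for (String item : bag) {
            if (item != null) { // skip nulls, which natural ordering cannot compare
                items.add(item);
            }
        }
        Collections.sort(items);
        return items;
    }

    public static void main(String[] args) {
        List<String> bag = Arrays.asList("Item A", "Item D", "Item C", "Item B");
        System.out.println(sortItems(bag)); // prints [Item A, Item B, Item C, Item D]
    }
}
```

In a real UDF the same loop would run over the tuples of the input DataBag, and the sorted list would be copied into a new bag before returning it.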

You can also find some unit tests for this UDF to give you an idea of what it is doing.

Also, this UDF keeps duplicate values. You can use a Set instead of an ArrayList if you want to remove duplicates; a sorted set such as TreeSet also keeps the values in order.
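If you do want to drop duplicates, one possible approach is a TreeSet, which keeps its elements sorted and ignores repeated values. The sketch below is plain Java with no Pig dependency; a String again stands in for a one-field tuple, and the names BagDedupSketch and sortDistinct are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical variant of the sort step that also removes duplicates.
class BagDedupSketch {

    // Collects the items into a TreeSet, which sorts them and keeps one copy of each value.
    static List<String> sortDistinct(Iterable<String> bag) {
        TreeSet<String> distinct = new TreeSet<>();
        for (String item : bag) {
            if (item != null) { // a TreeSet with natural ordering rejects null
                distinct.add(item);
            }
        }
        return new ArrayList<>(distinct);
    }

    public static void main(String[] args) {
        List<String> bag = Arrays.asList("Item B", "Item A", "Item B", "Item C");
        System.out.println(sortDistinct(bag)); // prints [Item A, Item B, Item C]
    }
}
```

A HashSet would also remove duplicates, but it does not keep any order, so the result would still need to be sorted afterwards.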

UDF to Sort a Bag of Tuples in Pig

Here is another way to sort a bag of tuples. This time, you can use newSortedBag() from BagFactory.
You have to supply your own Comparator, which tells the bag how to order the tuples.
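A comparator of that kind might look like the sketch below. It is plain Java with no Pig dependency, so it runs on its own: a List<Object> stands in for a Pig Tuple, and the names TupleComparatorSketch, FirstFieldComparator and sortRows are made up for illustration. In a real UDF the comparator would implement Comparator<Tuple> and be passed to newSortedBag().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: a List<Object> plays the role of a Pig tuple.
class TupleComparatorSketch {

    // Orders rows by their first field, compared as text.
    // Assumes every row has at least one field.
    static class FirstFieldComparator implements Comparator<List<Object>> {
        @Override
        public int compare(List<Object> left, List<Object> right) {
            String a = String.valueOf(left.get(0));
            String b = String.valueOf(right.get(0));
            return a.compareTo(b);
        }
    }

    // Returns a sorted copy of the rows, leaving the input list untouched.
    static List<List<Object>> sortRows(List<List<Object>> rows) {
        List<List<Object>> sorted = new ArrayList<>(rows);
        sorted.sort(new FirstFieldComparator());
        return sorted;
    }

    public static void main(String[] args) {
        List<List<Object>> rows = Arrays.asList(
                Arrays.<Object>asList("Item D", 4),
                Arrays.<Object>asList("Item A", 1),
                Arrays.<Object>asList("Item C", 3));
        System.out.println(sortRows(rows)); // prints [[Item A, 1], [Item C, 3], [Item D, 4]]
    }
}
```

Keeping the ordering logic in its own Comparator class means the same class can be reused for a different field or sort direction without touching the rest of the UDF.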

Unit Test to Sort a Bag of Tuples in a Pig UDF

Here is a list of UDFs


    While storing the tuples from the DataBag in an ArrayList so that I can sort them, I am getting this error:


    List<Tuple> list = new ArrayList<>((int) dataBag.size());
    for (Tuple tuple : dataBag) {
        if (tuple != null) {
            list.add(tuple);
        }
    }

    Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.util.Arrays.copyOfRange(
    at java.lang.String.<init>(
    at java.lang.StringBuilder.toString(

    Probable reason:
    DataBag is spillable, but we cannot use sortedDatabag, which is not spillable and also adds performance overhead.
    The total number of records I want to store in the ArrayList for sorting is approximately 2.7 million.

    Is there any solution that would let me sort this many records?

    • vaiz84

      Did you try this in your Pig script to increase the memory space?

      I haven’t been able to replicate the issue, but I will try to do it during the weekend.