Unit Test Java User Defined Functions in Apache Pig


Unit Test Java UDFs

This article will explain how to unit test Java User Defined Functions (UDFs). Using data from a sample file, we will build a test-driven UDF and use it in a Hadoop Pig script.

In the following scenario, there are 5 records in a text file. The file is comma-delimited, so it is easy to pull out each field in a Pig script. The last field contains a date, and our job is to find out how many valid and invalid dates there are in the file.

Note: this is a made-up scenario; its only purpose is to show how everything glues together. Also, in the Pig script, I am loading a local file from my computer. You could change that to a directory inside the Hadoop Distributed File System (HDFS) if you have a running cluster.

Let’s crack it!

Unit Test Java UDF Scenario

A Pig script loads a file that contains 5 fields. The fifth field is a date, and we will create a Java UDF to check whether the date given in each record is valid.


These are the records inside sampleFile.txt

123456,hadoop A,hadoop B,hadoop C,09/10/2014
123457,hadoop A,hadoop B,hadoop C,09/11/2014
123458,hadoop A,hadoop B,hadoop C,09/12/2014
123459,hadoop A,hadoop B,hadoop C,09/13/2014
123450,hadoop A,hadoop B,hadoop C,15/09/2014


This is a Pig script example using a Java UDF

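As a sketch, a script like the following would wire the UDF in (the jar name, UDF class name, and file path here are assumptions, not the article's originals):

```pig
-- Register the jar that contains the compiled Java UDF and give it an alias.
REGISTER myudfs.jar;
DEFINE IsValidDate com.example.pig.IsValidDate();

-- Load the comma-delimited sample file; the fifth field is the date.
records = LOAD 'sampleFile.txt' USING PigStorage(',')
          AS (id:chararray, a:chararray, b:chararray, c:chararray, d:chararray);

-- Call the Java UDF on the date field.
checked = FOREACH records GENERATE id, IsValidDate(d);

DUMP checked;
```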

Expected Output from the Pig Script

The output will be a chararray, since we did not specify an @OutputSchema.

123456,true
123457,true
123458,true
123459,true
123450,false

Java User Defined Function to check whether a date is valid

If you want to make the Pig Java UDF more flexible, you could also pass the "EXPECTED_DATE_FORMAT" from the Pig script. That way you can reuse it when you have to validate different date formats in your project.
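The article's UDF code is not reproduced here; as a sketch, the date check it describes can be done with a strict `SimpleDateFormat`. In the real UDF, a class extending `org.apache.pig.EvalFunc<String>` would call this logic from its `exec(Tuple)` method; the class and method names below are assumptions:

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;

// Core validation logic for the UDF (hypothetical names). In Pig, a class
// extending org.apache.pig.EvalFunc<String> would call isValid() from its
// exec(Tuple) method and hand "true"/"false" back as a chararray.
public class DateValidator {

    // Could instead be passed in from the Pig script to make the UDF reusable.
    static final String EXPECTED_DATE_FORMAT = "MM/dd/yyyy";

    public static String isValid(String candidate) {
        if (candidate == null) {
            return "false";
        }
        SimpleDateFormat format = new SimpleDateFormat(EXPECTED_DATE_FORMAT);
        format.setLenient(false); // reject impossible dates such as 15/09/2014
        try {
            format.parse(candidate.trim());
            return "true";
        } catch (ParseException e) {
            return "false";
        }
    }
}
```

With `setLenient(false)`, a month value of 15 makes `parse` throw a `ParseException`, which is what lets the last sample record come back as `false`.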

Unit test Java User Defined Functions (UDF)

If you have experience with JUnit, this code will look familiar. These tests were written using test-driven development: each test was written before the actual code in the UDF.

Here is the process:

  1. Create the UDF test class.
  2. Write a test to assert which interface you will be using (FilterFunc, EvalFunc, etc.).
  3. Create the corresponding Java class.
  4. Write every possible test to validate and assert the desired outputs of your UDF.
  5. Code each piece of the Java UDF class according to the tests written.
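As a sketch of steps 4 and 5, assuming the UDF's `exec` logic boils down to strict `SimpleDateFormat` parsing, the assertions might look like this. They are written here with plain `assert` statements so the example is dependency-free; in the article's setup they would be JUnit `@Test` methods using `assertEquals` against the UDF class:

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;

// Dependency-free sketch of the TDD assertions; in a real project these
// would be JUnit @Test methods exercising the actual EvalFunc subclass.
public class IsValidDateTest {

    // Stand-in for the UDF's core logic (hypothetical; mirrors exec()).
    static String isValidDate(String candidate) {
        SimpleDateFormat format = new SimpleDateFormat("MM/dd/yyyy");
        format.setLenient(false); // reject month values like 15
        try {
            format.parse(candidate);
            return "true";
        } catch (ParseException e) {
            return "false";
        }
    }

    public static void main(String[] args) {
        // Step 4: assert the desired outputs before writing the UDF code.
        assert isValidDate("09/10/2014").equals("true");
        assert isValidDate("15/09/2014").equals("false"); // month 15 is invalid
        assert isValidDate("not-a-date").equals("false");
        System.out.println("all tests passed");
    }
}
```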


It is only 5 steps, but you will still have to think through the different scenarios for your UDF: different inputs from the Pig script may call for different outputs.

Note: in this example, an output schema is not given back to the Pig script.

Cracking Hadoop's Piggybank



  • anonymous

    Nice post, very helpful.

    However, I am using UDFContext to get the count of reducers. How can I write test cases for such a UDF?