[Hadoop Pig] LOAD file with double quotes and commas

apache pig

Loading files with double quotes and comma separated values can be tricky, especially if you try to use the standard PigStorage() loader. Luckily there are a few ways to deal with this scenario. We assume you have CSV files already loaded to HDFS and you just want to manipulate the data.

There are ways to fix the files using command line string operations, but we are not going to touch those topics here.

We will present 2 ideas to loading files with quotes/comma delimiters. One of the options is very easy to implement, which is CSVLoader() or CSVExcelStorage() from Apache’s PiggyBank.

The other options is using the built in TextLoader() function to load each line of the file, then using the STRSPLIT function to split the fields and then using the REPLACE function to replace any unwanted double quotes.

Apache Pig – LOAD files with quotes/comma delimiters

Option 1 – Using CSVLoader or CSVExcelStorage to Load Pig files

Option 2 – TextLoader + STRSPLIT + REPLACE

Option 1 is probably much better than Option 2. However, it is up to you to choose the way you want to handle these scenarios. Using CSVExcelStorage() has worked great for our pig scripts when dealing with files created in Excel or files that later need to be read from Excel.

Option 2 have extra steps since first it the line needs to be split and then remove unnecessary quotes from the first and last fields.

  • mohan 221213

    Hi

    I have a problem when I am trying to load a json file.
    Here what i have used the script and the file.

    sample.json

    {“created_at”:”Mon Aug 22 10:48:23 +0000 2016″,”id”:767674772662607873,”id_str”:”767674772662607873″,”text”:”KPIT Image Result for https://t.co/Nas2ZnF1zZhttps://t.co/9TnelwtIvm“,”source”:”u003ca href=”http://twitter.com” rel=”nofollow”u003eTwitter Web Clientu003c/au003e”,”truncated”:false,”in_reply_to_status_id”:123,”in_reply_to_status_id_str”:null,”in_reply_to_user_id”:null,”in_reply_to_user_id_str”:null,”in_reply_to_screen_name”:null,”geo”:null,”coordinates”:null,”place”:null,”contributors”:null,”is_quote_status”:false,”retweet_count”:0,”favorite_count”:0,”entities”:{“hashtags”:[],”urls”:[{“url”:”https://t.co/Nas2ZnF1zZ”,”expanded_url”:”http://miltonious.com/”,”display_url”:”miltonious.com”,”indices”:[24,47]}],”user_mentions”:[],”symbols”:[]},”favorited”:false,”retweeted”:false,”possibly_sensitive”:false,”filter_level”:”low”,”lang”:”en”,”timestamp_ms”:”1471862903167″}
    script:

    REGISTER piggybank.jar
    REGISTER json-simple-1.1.1.jar
    REGISTER elephant-bird-pig-4.3.jar
    REGISTER elephant-bird-core-4.1.jar
    REGISTER elephant-bird-hadoop-compat-4.3.jar

    json = LOAD ‘sample.json’ USING JsonLoader(‘created_at:chararray, id:chararray, id_str:chararray, text:chararray, source:chararray, in_reply_to_status_id:chararray, in_reply_to_status_id_str:chararray, in_reply_to_user_id:chararray, in_reply_to_user_id_str:chararray, in_reply_to_screen_name:chararray, geo:chararray, coordinates:chararray, place:chararray, contributors:chararray, is_quote_status:bytearray, retweet_count:long, favorite_count:chararray, entities:map[], favorited:bytearray, retweeted:bytearray, possibly_sensitive:bytearray, lang:chararray’);
    describe json; dump json;
    When I dump json,I am getting the following output and the worning

    (Mon Aug 22 10:48:23 +0000 2016,767674772662607873,767674772662607873,google Image Result for Twitter Web Client,false,1234,12345,3214,43215,,,,,,,,,,,,,,)

    WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger – org.apache.pig.builtin.JsonLoader(UDF_WARNING_1): Bad record, returning null for {complete json}

    By warning i guess it is getting NULL values. So how can we load a Json which is having null values in it.

    And I have tried in another way i.e

    json = LOAD ‘sample.json’ USING com.twitter.elephantbird.pig.load.JsonLoader(‘created_at:chararray, id:chararray, id_str:chararray, text:chararray, source:chararray, in_reply_to_status_id:chararray, in_reply_to_status_id_str:chararray, in_reply_to_user_id:chararray, in_reply_to_user_id_str:chararray, in_reply_to_screen_name:chararray, geo:chararray, coordinates:chararray, place:chararray, contributors:chararray, is_quote_status:bytearray, retweet_count:long, favorite_count:chararray, entities:map[], favorited:bytearray, retweeted:bytearray, possibly_sensitive:bytearray, lang:chararray’);

    describe json;
    Output

    Schema for json unknown.

    Please suggest me what i am missing.

    Thanks.