String Manipulation and DateTime Functions For Pig

Community Contributed Software

  • Amazon Web Services provides links to these packages as a convenience for our customers, but software not authored by an "@AWS" account has not been reviewed or screened by AWS.
  • Please review this software to ensure it meets your needs before using it.

This library provides user-defined functions (UDFs) for string manipulation and for creating and formatting DateTime values in Pig.

Details

Submitted By: Ian@AWS
AWS Products Used: Amazon Elastic MapReduce
License: Apache License 2.0
Created On: August 6, 2009 2:03 AM GMT
Last Updated: August 11, 2009 12:07 AM GMT

Download

Location of Jar: s3://elasticmapreduce/libs/pig/0.3/piggybank-0.3-amzn.jar
Source License: Apache License, Version 2.0
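
As a minimal sketch of how the jar might be loaded, the following registers it and defines aliases for two of the functions described below. It assumes that Pig on Amazon Elastic MapReduce can read the jar directly from the S3 location listed above; if that does not hold in your environment, copy the jar locally and register the local path instead.

```pig
-- Make the UDF classes in the jar available to this script or Grunt session.
-- Assumption: the Pig installation can register a jar directly from S3.
REGISTER s3://elasticmapreduce/libs/pig/0.3/piggybank-0.3-amzn.jar;

-- Give a short alias to each function you plan to call.
DEFINE REPLACE org.apache.pig.piggybank.evaluation.string.REPLACE();
DEFINE EXTRACT org.apache.pig.piggybank.evaluation.string.EXTRACT();
```

The remaining functions can be aliased the same way with the DEFINE statements shown in each Import section.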

The following functions are described.

1.0 FORMAT_DT

Description

Takes a DateTimeFormat string and a DateTime (i.e., a string produced by DATE_TIME()) and formats the DateTime as a string. The DateTimeFormat uses the Joda Time pattern syntax, documented in the Joda Time DateTimeFormat class reference.

Import

  DEFINE FORMAT_DT org.apache.pig.piggybank.evaluation.datetime.FORMAT_DT();

Signature

  FORMAT_DT(datetimeformat: chararray, date: DateTime(chararray))
    returns chararray;
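
A short usage sketch may help here. It assumes the jar has already been registered and that a hypothetical input file, events.txt, holds one timestamp in milliseconds per line; the relation and field names are illustrative only.

```pig
DEFINE DATE_TIME org.apache.pig.piggybank.evaluation.datetime.DATE_TIME();
DEFINE FORMAT_DT org.apache.pig.piggybank.evaluation.datetime.FORMAT_DT();

-- Hypothetical input: one epoch-millisecond value per line.
events = LOAD 'events.txt' AS (ts:long);

-- Build a DateTime from the instant, then render it with a Joda Time pattern.
formatted = FOREACH events GENERATE
    FORMAT_DT('yyyy-MM-dd HH:mm', DATE_TIME(ts)) AS day_minute;

DUMP formatted;
```

Note the argument order: the format string comes first and the DateTime second.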

2.0 DATE_TIME

Description

A function that returns a DateTime string, of the form yyyy-MM-dd'T'HH:mm:ss.SSSZZ.

A DateTime represents a precise point on the timeline, stored as the number of milliseconds since the Java epoch of 1970-01-01T00:00:00Z. A DateTime also holds a Timezone, which is used to interpret its fields.

The constructor for this function has overloads that allow you to specify a default Timezone and a default DateTimeFormat. The Timezone replaces the system default in any function that would use it. The DateTimeFormat is a fallback: it is tried against a string argument that does not otherwise match the overload.

A Timezone may be specified as:

  • 'Z' or 'UTC' to represent the UTC Timezone
  • '[+-]hh:mm' to represent a numeric offset from UTC
  • A long-form Timezone supported by the system, such as 'America/Los_Angeles'

The DateTimeFormat uses the Joda Time pattern syntax, described in the Joda Time DateTimeFormat class reference.
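
To illustrate the three Timezone forms listed above, here is a small sketch that passes each one to the DATE_TIME(instant, timezone) signature given later in this section. The input file events.txt and all relation and field names are hypothetical.

```pig
DEFINE DATE_TIME org.apache.pig.piggybank.evaluation.datetime.DATE_TIME();

-- Hypothetical input: one epoch-millisecond value per line.
events = LOAD 'events.txt' AS (ts:long);

-- The same instant expressed with each supported Timezone form.
zoned = FOREACH events GENERATE
    DATE_TIME(ts, 'UTC')                 AS utc_dt,
    DATE_TIME(ts, '-07:00')              AS fixed_offset_dt,
    DATE_TIME(ts, 'America/Los_Angeles') AS los_angeles_dt;

DUMP zoned;
```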

Import

  DEFINE DATE_TIME    org.apache.pig.piggybank.evaluation.datetime.DATE_TIME();
  DEFINE MY_DATE_TIME org.apache.pig.piggybank.evaluation.datetime.DATE_TIME(
    '-07:00', 'MM-dd-yyyy-HH-mm-ss'
  );

Signatures

  DATE_TIME() returns DateTime;

Creates a DateTime for now with the default timezone.

  DATE_TIME(timezone: chararray) returns DateTime;

Creates a DateTime for now with the given timezone.

  DATE_TIME(datetime: chararray) returns DateTime;

Converts the given DateTime to the default timezone.

  DATE_TIME(datetime: chararray, timezone: chararray) returns DateTime;

Converts the given DateTime to the given timezone.

  DATE_TIME(instant: long) returns DateTime;

Creates a DateTime for the given number of milliseconds since 1970-01-01 with the default timezone.

  DATE_TIME(instant: long, timezone: chararray) returns DateTime;

Creates a DateTime for the given number of milliseconds since 1970-01-01 with the given timezone.

  DATE_TIME(str: chararray, datetimeformat: chararray) returns DateTime;

Parses str into a DateTime using datetimeformat. If no timezone is parsed from str, the default timezone is used.

  DATE_TIME(str: chararray, datetimeformat: chararray, timezone: chararray)
    returns DateTime;

Parses str into a DateTime using datetimeformat. The string is interpreted in the given timezone, and the resulting DateTime carries that timezone.
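
The parsing signatures can be sketched as follows. This assumes a hypothetical tab-separated input file, requests.txt, whose first field is a timestamp such as 06/Aug/2009:02:03:00 and whose second field is an epoch value in milliseconds; the Joda Time pattern is chosen to match that assumed layout.

```pig
DEFINE DATE_TIME org.apache.pig.piggybank.evaluation.datetime.DATE_TIME();

-- Hypothetical input: a text timestamp and an epoch-millisecond value per line.
raw = LOAD 'requests.txt' AS (stamp:chararray, epoch_ms:long);

-- local_dt: str + datetimeformat, so the default timezone applies.
-- utc_dt:   str + datetimeformat + timezone.
-- from_ms:  instant in milliseconds since 1970-01-01.
parsed = FOREACH raw GENERATE
    DATE_TIME(stamp, 'dd/MMM/yyyy:HH:mm:ss')        AS local_dt,
    DATE_TIME(stamp, 'dd/MMM/yyyy:HH:mm:ss', 'UTC') AS utc_dt,
    DATE_TIME(epoch_ms)                             AS from_ms;

DUMP parsed;
```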

3.0 REPLACE

Description

Replaces a string with another string inside a larger string. A null reference passed to this method is a no-op.

Import

  DEFINE REPLACE org.apache.pig.piggybank.evaluation.string.REPLACE();

Signature

  REPLACE(string: chararray, pattern: chararray, replacement: chararray)
    returns chararray;

Note that the function does literal string matching only; pattern is not a regular expression.
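
As a brief illustration of the literal matching, the sketch below replaces a period, which would be a wildcard in a regular expression but is treated as a plain character here. The input file paths.txt and the relation and field names are hypothetical.

```pig
DEFINE REPLACE org.apache.pig.piggybank.evaluation.string.REPLACE();

-- Hypothetical input: one string per line.
entries = LOAD 'paths.txt' AS (path:chararray);

-- '.' is matched as a literal period, not as the regex "any character".
cleaned = FOREACH entries GENERATE REPLACE(path, '.', '_') AS cleaned_path;

DUMP cleaned;
```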

4.0 FORMAT

Description

Formats a list of arguments into a single string. See java.util.Formatter for the definition of format strings.

Import

  DEFINE FORMAT org.apache.pig.piggybank.evaluation.string.FORMAT();

Signature

  FORMAT(format: chararray, args: object...);
  FORMAT(format: chararray, args: tuple);
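
A possible use of the first signature is sketched below, with the format string followed by the values to substitute. The input file users.txt and the field names are hypothetical; %s and %d are standard java.util.Formatter conversions for strings and integers.

```pig
DEFINE FORMAT org.apache.pig.piggybank.evaluation.string.FORMAT();

-- Hypothetical input: tab-separated name and visit count.
visitors = LOAD 'users.txt' AS (name:chararray, visits:int);

-- Substitute each field into the format string in argument order.
labels = FOREACH visitors GENERATE
    FORMAT('%s visited %d times', name, visits) AS label;

DUMP labels;
```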

5.0 EXTRACT

Description

Parses the input string with a regular expression and returns all matching groups. The regular expression format is documented in java.util.regex.Pattern.

You may find it useful to combine EXTRACT with FLATTEN like so:

  grunt> numbers = FOREACH mylog GENERATE
    FLATTEN(EXTRACT(line, '([0-9]+) [0-9]+ ([0-9]+)')) as (first:chararray, second:chararray);
  grunt> dump numbers;
  (22, 64)

Note the 'as (...)' clause must be included in this case due to a shortcoming in the Pig type system.

Import

  DEFINE EXTRACT org.apache.pig.piggybank.evaluation.string.EXTRACT();

Signature

  EXTRACT(string: chararray, pattern: chararray) returns 
    tuple(chararray ...);
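
For a script-style variant of the Grunt example above, the following sketch pulls the parts of a yyyy-MM-dd date out of each line. It assumes a hypothetical input file, mylog.txt, with one log line per record; the pattern and field names are illustrative.

```pig
DEFINE EXTRACT org.apache.pig.piggybank.evaluation.string.EXTRACT();

-- Hypothetical input: one log line per record.
mylog = LOAD 'mylog.txt' AS (line:chararray);

-- Each parenthesized group in the pattern becomes one chararray field.
dates = FOREACH mylog GENERATE
    FLATTEN(EXTRACT(line, '([0-9]{4})-([0-9]{2})-([0-9]{2})'))
    AS (year:chararray, month:chararray, day:chararray);

DUMP dates;
```

The pattern avoids backslash escapes so that no extra quoting is needed inside the Pig string literal.
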