Parse Big Data with Informatica's HParser on Amazon EMR

The following tutorial walks you through the process of using Informatica's HParser hosted on Amazon EMR to process custom text files into an easy-to-analyze XML format.

Details

Submitted By: Syne@AWS
AWS Products Used: Amazon Elastic MapReduce
Created On: April 27, 2012 7:11 AM GMT
Last Updated: November 21, 2012 4:43 AM GMT

Informatica's HParser is a tool you can use to extract data stored in heterogeneous formats and convert it into a form that is easy to process and analyze. For example, if your company has legacy stock trading information stored in custom-formatted text files, you could use HParser to read the text files and extract the relevant data as XML. In addition to text and XML, HParser can extract and convert data stored in proprietary formats such as PDF and Word files.

HParser is designed to run on top of the Hadoop architecture, which means you can distribute operations across many computers in a cluster to efficiently parse vast amounts of data. Amazon Elastic MapReduce (Amazon EMR) makes it easy to run Hadoop in the Amazon Web Services (AWS) cloud. With Amazon EMR you can set up a Hadoop cluster in minutes and automatically terminate the resources when the processing is complete.

In our stock trade information example, you could use HParser running on top of Amazon EMR to efficiently parse the data across a cluster of machines. The cluster will automatically shut down when all of your files have been converted, ensuring you are only charged for the resources used. This makes your legacy data available for analysis, without incurring ongoing IT infrastructure expenses.

The following tutorial walks you through the process of using HParser hosted on Amazon EMR to process custom text files into an easy-to-analyze XML format. The parsing logic for this sample has been defined for you using HParser, and is stored in the transformation services file (services_basic.tar.gz). This file, along with other content needed to run this tutorial, has been preloaded onto Amazon Simple Storage Service (S3) at s3n://elasticmapreduce/samples/informatica/. You will reference these files when you run the HParser job.

As part of its tool set, HParser provides a Studio application that you can use to define parsing logic graphically, without the need to write code. For more information about HParser and how to use it to define parsing logic, go to http://www.informatica.com/us/products/b2b-data-exchange/hparser/.

For information about Amazon EMR, see the Amazon Elastic MapReduce Documentation.

Step 1: Sign Up for the Service

If you don't already have an AWS account, you'll need to get one. Your AWS account gives you access to all services, but you will be charged only for the resources that you use. For this example walkthrough, the charges will be minimal.

To sign up for AWS

  1. Go to http://aws.amazon.com and click Sign Up Now.
  2. Follow the on-screen instructions.

AWS notifies you by email when your account is active and available for you to use.

The credentials for your AWS account give you access to all the resources that you have deployed.

Step 2: Create a Key Pair

To run HParser, you'll need to create a key pair to connect to the Amazon EC2 instances that Amazon EMR launches. For security reasons, EC2 instances use a public/private key pair, rather than a user name and password, to authenticate connection requests. The public key half of this pair is embedded in the instance, so you can use the private key to log in securely without a password. In this step we will use the AWS Management Console to create a key pair.

To generate a key pair

  1. Open the Amazon EC2 console at https://console.aws.amazon.com/ec2/.
  2. In the Navigation pane, in the Region box, click US East (Virginia).
  3. In the Navigation pane, under Network and Security, click Key Pairs.
  4. Click Create Key Pair.
  5. In the Key Pair dialog box, in the Key Pair Name box, type "newkeypair" as the name of the new key pair, and then click Create.
  6. Download the private key file, which is named newkeypair.pem, and keep it in a safe place. You will need it to access any instances that you launch with this key pair.

    Important: If you lose the key pair, you cannot connect to your Amazon EC2 instances.
    For more information about key pairs, see Getting an SSH Key Pair in the Amazon Elastic Compute Cloud User Guide.
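
If you prefer working from the command line, you can create an equivalent key pair with the AWS CLI. The following is a minimal sketch; it assumes the AWS CLI is installed and configured with your account credentials, and it saves the private key as newkeypair.pem in the current directory.

    # Create the key pair in the US East (Virginia) region and save the private key locally.
    aws ec2 create-key-pair --key-name newkeypair --region us-east-1 \
        --query 'KeyMaterial' --output text > newkeypair.pem

    # Restrict the file permissions so that SSH will accept the key.
    chmod 400 newkeypair.pem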

Step 3: Create an Amazon S3 Bucket

In this tutorial, HParser stores output in a bucket on Amazon Simple Storage Service (Amazon S3). You'll need to create a bucket to receive the parsed results.

To create an Amazon S3 bucket

  1. Open the Amazon S3 console at https://console.aws.amazon.com/s3/.
  2. In the Amazon S3 console, click Create Bucket. The Create a Bucket dialog box appears.
  3. Enter a bucket name in the Bucket Name field.

    Note: Because of requirements imposed by Hadoop, this bucket name must contain only lower-case letters, numbers, and hyphens (-).

    The bucket name you choose must be unique across all existing bucket names in Amazon S3. One way to create unique bucket names is to prefix them with your company's name. For this example, we'll use hparser-output; however, you should choose your own unique name.

    Your bucket name must comply with Amazon S3 requirements. There might be additional restrictions on bucket names based on the region your bucket is in or how you intend to access the object. For more information, go to http://docs.amazonwebservices.com/AmazonS3/latest/dev/index.html?BucketRestrictions.html.

    Do not set up logging when you create the bucket. This is a temporary output bucket that you will delete at the end of this tutorial.

    Note: After you create a bucket, you cannot change its name. In addition, the bucket name is visible in the URL that points to the objects stored in the bucket. Make sure the bucket name you choose is appropriate.

  4. In the Region drop-down list box, select US Standard as the region. The rest of the tutorial files are stored in a public bucket in the US Standard region. Creating your bucket in the same region will avoid cross-region data transfer charges.
  5. Click Create.

    Create a Bucket - Select a Bucket Name and Region

    When Amazon S3 successfully creates your bucket, the console displays your empty bucket in the Buckets panel.

    Empty Bucket Created

    For more information about Amazon S3, go to http://aws.amazon.com/documentation/s3/.
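
If you prefer the command line, you can also create the bucket with the AWS CLI. This is a minimal sketch, assuming the AWS CLI is installed and configured; remember that the bucket name must be globally unique, so substitute your own name for hparser-output.

    # Create the output bucket in the US Standard (us-east-1) region.
    aws s3 mb s3://hparser-output --region us-east-1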

Step 4: Create an Interactive Job Flow Using the Console

To run Hadoop and HParser, you'll need to create an interactive Amazon EMR job flow that you can SSH into and run commands interactively.

To create a job flow using the console

  1. Sign in to the AWS Management Console and open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/
  2. In the Region box, click US East, and then click Create New Job Flow.
  3. On the DEFINE A JOB FLOW page, you can leave Job Flow Name as "My Job Flow," or you can rename it to "HParser Tutorial" as we've done in this example. Click Run your own application. Select Hive Program as the job flow type. Then click Continue.

    Define Job Flow

  4. On the SPECIFY PARAMETERS page, click Start an Interactive Hive Session. (Hive is an open-source tool that runs on top of Hadoop. With Hive, you can query job flows by using a simplified SQL syntax.) Then click Continue.

    For this tutorial, we will not be using Hive. We are selecting an interactive Hive session because that will give us the ability to issue commands from a terminal window. We will use these commands to run HParser.

    Specify Parameters

  5. On the CONFIGURE EC2 INSTANCES page, you can set the number and type of instances used to process the job flow. To ensure that the job flow runs on 64-bit instances, set Master Instance Group Instance Type to Large (m1.large). In the Core Instance Group set the Instance Count to 2 and the Instance Type to Large (m1.large). Do not change the settings for the Task Instance Group. Then click Continue.

    Configure EC2 Instances

  6. On the ADVANCED OPTIONS page, specify the key pair that you created earlier. We are not going to launch this job flow on Amazon VPC, so leave Amazon VPC Subnet Id set to Proceed without a VPC Subnet ID. Specify a location on Amazon S3 for the Amazon S3 Log Path. In this example, we specify "s3n://hparser-output" as the path. Set Enable Debugging to Yes. Then click Continue.

    Advanced Options

  7. On the BOOTSTRAP ACTIONS page, click Configure your Bootstrap Actions. For Action Type, select Custom Action. For Name, you can specify a name for the bootstrap action or leave it at the default, "Custom Action." For Amazon S3 Location, specify the following value:
    s3n://elasticmapreduce/samples/informatica/1-basic-samples/bootstrap_basic_95.sh  
    

    Bootstrap actions load custom software onto the virtual servers that Amazon EMR launches. The preceding location contains the bootstrap script that loads the HParser software. Click Continue.

    Bootstrap Actions

  8. On the REVIEW page, review the settings for your job flow. If everything looks correct, click Create Job Flow. When the confirmation window closes, your new job flow will appear in the list of job flows in the Amazon EMR console with the status STARTING. It will take a few minutes for Amazon EMR to provision the Amazon EC2 instances for your job flow.
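
For reference, a roughly equivalent cluster can also be launched from the command line. The following is only a sketch that uses the AWS CLI instead of the console wizard; the AMI version, default roles, log path, and other options shown here are assumptions and may need adjustment, since the HParser bootstrap script was written for the Hadoop versions available when this tutorial was published.

    # A sketch only: launch a keep-alive (interactive) cluster with the HParser bootstrap action.
    aws emr create-cluster \
        --name "HParser Tutorial" \
        --ami-version 2.4.2 \
        --applications Name=Hive \
        --instance-groups \
            InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m1.large \
            InstanceGroupType=CORE,InstanceCount=2,InstanceType=m1.large \
        --ec2-attributes KeyName=newkeypair \
        --use-default-roles \
        --log-uri s3://hparser-output/logs/ \
        --bootstrap-actions Path=s3://elasticmapreduce/samples/informatica/1-basic-samples/bootstrap_basic_95.sh \
        --no-auto-terminate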

Step 5: SSH into the Master Node

When your new job flow's status in the Amazon EMR console is RUNNING, the master node is ready for you to connect to it. Before you can do that, however, you must acquire the DNS name of the master node and configure your connection tools and credentials.

To locate the DNS name of the master node

  • Locate the DNS name of the master node in the Amazon EMR console by selecting the job flow you just created from the list of running job flows. Details about the job flow appear in the lower pane. The DNS name you will use to connect to the master node instance is listed on the Description tab as Master Public DNS Name. In our example, the DNS name is ec2-23-22-21-117.compute-1.amazonaws.com. Make a note of the DNS name; you'll need it in the next step.

    Master Public DNS Name

Next we'll use a Secure Shell (SSH) application to open a terminal connection to the master node. An SSH application is installed by default on most Linux, UNIX, and Mac OS systems. Windows users can use an application called PuTTY to connect to the master node. Platform-specific instructions for configuring a Windows application to open an SSH connection are described later in this tutorial.

You must first configure your credentials, or SSH will return an error message saying that your private key file is unprotected, and it will reject the key. You need to do this step only the first time you use the private key to connect.

To configure your credentials on Linux/UNIX/Mac OS X

  1. Open a terminal window. This is found at Applications/Utilities/Terminal on Mac OS X and at Applications/Accessories/Terminal on many Linux distributions.
  2. Set the permissions on the PEM file for your Amazon EC2 key pair so that only the key owner has permission to access the key. For example, if you saved the file as newkeypair.pem in your home directory, the command is:
    chmod og-rwx ~/newkeypair.pem

To connect to the master node using Linux/UNIX/Mac OS X

  1. At the terminal window, enter the following command, where ssh is the command, the -i parameter indicates the location of the private key file you saved in step 6 of Step 2: Create a Key Pair, hadoop is the user name you are using to connect, and the at symbol (@) joins the user name and the DNS name of the machine you are connecting to. In this example, we assume that you saved your private key file to your home directory.
    ssh -i ~/newkeypair.pem hadoop@ec2-23-22-21-117.compute-1.amazonaws.com
  2. You will receive a warning that the authenticity of the host you are connecting to can't be verified. Type "yes" to continue connecting.

If you are using a Windows-based computer, you will need to install an SSH program in order to connect to the master node. In this tutorial, we will use PuTTY. If you have already installed PuTTY and configured your key pair, you can skip this procedure.

To install and configure PuTTY on Windows

  1. Download PuTTYgen.exe and PuTTY.exe to your computer from http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html.
  2. Launch PuTTYgen.
  3. Click Load. Select the PEM file you created earlier. You may have to change the search parameters from file of type "PuTTY Private Key Files (*.ppk)" to "All Files (*.*)."
  4. Click Open.
  5. On the PuTTYgen Notice telling you the key was successfully imported, click OK.
  6. To save the key in the PPK format, click Save private key.
  7. When PuTTYgen prompts you to save the key without a pass phrase, click Yes.
  8. Enter a name for your PuTTY private key, such as newkeypair.ppk.

To connect to the master node using Windows/PuTTY

  1. Start PuTTY.
  2. In the Category list, click Session. In the Host Name box, type hadoop@ followed by the DNS name of the master node. The input looks similar to hadoop@ec2-23-22-21-117.compute-1.amazonaws.com.
  3. In the Category list, expand Connection, expand SSH, and then click Auth.
  4. In the Options controlling SSH authentication pane, click Browse for Private key file for authentication, and then select the private key file that you generated earlier. If you are using the name we're using in this tutorial, the file name is newkeypair.ppk.
  5. Click Open.
  6. To connect to the master node, click Open.
  7. In the PuTTY Security Alert, click Yes.

Note: For more information about how to install PuTTY and use it to connect to an EC2 instance, such as the master node, go to Connecting to a Linux/UNIX Instance from Windows using PuTTY in the Amazon Elastic Compute Cloud User Guide.

When you are successfully connected to the master node via SSH, you will see a welcome screen and prompt like the following.

-----------------------------------------------------------------------------  
Welcome to Amazon EMR running Hadoop and Debian/Lenny.   

Hadoop is installed in /home/hadoop. Log files are in /mnt/var/log/hadoop. 
Check /mnt/var/log/hadoop/steps for diagnosing step failures.  

The Hadoop UI can be accessed via the following commands:     
JobTracker    lynx http://localhost:9100/   
NameNode      lynx http://localhost:9101/   
----------------------------------------------------------------------------- 
hadoop@ip-10-118-101-76:~$
        

Step 6: Run HParser

Before you can run HParser, you need to download the required files from Amazon S3 to the master node. Once the files are available locally, you can run a Hadoop job that calls HParser.

To download HParser files from Amazon S3 to the master node

  1. Use Hadoop to copy the HParser JAR file by running the following command.
    hadoop fs -copyToLocal s3n://elasticmapreduce/samples/informatica/hparser95/hparser-1.1.jar   ./
                    
  2. Copy the HParser configuration file.
    hadoop fs -copyToLocal s3n://elasticmapreduce/samples/informatica/1-basic-samples/hparser-job-conf-basic.xml  ./
                   

To run HParser

  • Run the following command to launch a Hadoop job that calls HParser. In the output path, replace hparser-output with the name of the Amazon S3 bucket you created in Step 3: Create an Amazon S3 Bucket.
    hadoop jar hparser-1.1.jar  \
    com.informatica.b2b.dt.hadoop.DataTransformationJob \
    -conf hparser-job-conf-basic.xml  \
    s3n://elasticmapreduce/samples/informatica/1-basic-samples/input10-text  \
    s3n://hparser-output/output 10_Text
                    

In this tutorial, the parsing input and output are stored in Amazon S3. You can also use locations in HDFS as your input and output sources, as in the sketch below.
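
For example, a minimal sketch of the same job reading from and writing to HDFS might look like the following; the hdfs:///hparser-input and hdfs:///hparser-output directories are arbitrary names chosen for illustration.

    # Copy the sample input from Amazon S3 into HDFS (the directory name is an example).
    hadoop distcp s3n://elasticmapreduce/samples/informatica/1-basic-samples/input10-text hdfs:///hparser-input

    # Run the same HParser job, reading from and writing to HDFS.
    hadoop jar hparser-1.1.jar \
    com.informatica.b2b.dt.hadoop.DataTransformationJob \
    -conf hparser-job-conf-basic.xml \
    hdfs:///hparser-input \
    hdfs:///hparser-output 10_Text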

The command line output should be something like the following. Note that there are no reduce tasks needed in this parsing operation.

12/04/24 22:28:03 INFO hadoop.DTRecordReader: dt.service.name = 10_Text
12/04/24 22:28:05 INFO mapred.JobClient: Default number of map tasks: null
12/04/24 22:28:05 INFO mapred.JobClient: Setting default number of map tasks based on cluster size to : 12
12/04/24 22:28:05 INFO mapred.JobClient: Default number of reduce tasks: 0
12/04/24 22:28:05 INFO mapred.JobClient: Setting group to hadoop
12/04/24 22:28:06 INFO input.FileInputFormat: Total input paths to process : 4
12/04/24 22:28:06 INFO mapred.JobClient: Running job: job_201204242208_0002
12/04/24 22:28:07 INFO mapred.JobClient:  map 0% reduce 0%
12/04/24 22:28:21 INFO mapred.JobClient:  map 50% reduce 0%
12/04/24 22:28:27 INFO mapred.JobClient:  map 100% reduce 0%
12/04/24 22:28:32 INFO mapred.JobClient: Job complete: job_201204242208_0002
12/04/24 22:28:32 INFO mapred.JobClient: Counters: 20
12/04/24 22:28:32 INFO mapred.JobClient:   Job Counters 
12/04/24 22:28:32 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=33090
12/04/24 22:28:32 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
12/04/24 22:28:32 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
12/04/24 22:28:32 INFO mapred.JobClient:     Rack-local map tasks=4
12/04/24 22:28:32 INFO mapred.JobClient:     Launched map tasks=4
12/04/24 22:28:32 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
12/04/24 22:28:32 INFO mapred.JobClient:   File Output Format Counters 
12/04/24 22:28:32 INFO mapred.JobClient:     Bytes Written=3720
12/04/24 22:28:32 INFO mapred.JobClient:   FileSystemCounters
12/04/24 22:28:32 INFO mapred.JobClient:     S3N_BYTES_READ=27324
12/04/24 22:28:32 INFO mapred.JobClient:     HDFS_BYTES_READ=492
12/04/24 22:28:32 INFO mapred.JobClient:     S3N_BYTES_WRITTEN=3720
12/04/24 22:28:32 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=94886
12/04/24 22:28:32 INFO mapred.JobClient:   File Input Format Counters 
12/04/24 22:28:32 INFO mapred.JobClient:     Bytes Read=27324
12/04/24 22:28:32 INFO mapred.JobClient:   Map-Reduce Framework
12/04/24 22:28:32 INFO mapred.JobClient:     Map input records=4
12/04/24 22:28:32 INFO mapred.JobClient:     Physical memory (bytes) snapshot=535044096
12/04/24 22:28:32 INFO mapred.JobClient:     Spilled Records=0
12/04/24 22:28:32 INFO mapred.JobClient:     CPU time spent (ms)=2310
12/04/24 22:28:32 INFO mapred.JobClient:     Total committed heap usage (bytes)=469762048
12/04/24 22:28:32 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=9651716096
12/04/24 22:28:32 INFO mapred.JobClient:     Map output records=4
12/04/24 22:28:32 INFO mapred.JobClient:     SPLIT_RAW_BYTES=492
12/04/24 22:28:32 INFO hadoop.DTRecordReader: The 'Data Transformation job- 10_Text' job has finished successfully. 
        

Step 7: View the Results

The output of this tutorial is stored in the Amazon S3 bucket you created in Step 3: Create an Amazon S3 Bucket. You can download the output files to see the parsed results; they should be similar to the results in Appendix B. You can retrieve the files from the console as described below, or from the command line as shown after the procedure.

To view a file stored on Amazon S3

  1. Open the Amazon S3 console at https://console.aws.amazon.com/s3/.
  2. Click on the bucket you created in Step 3: Create an Amazon S3 Bucket. In this tutorial we named the bucket hparser-output.
  3. Open the output folder. You should see an empty file named "_SUCCESS" and four output files.
  4. Right-click an output file and select Download to download the file to your computer, where you can open it with the text editor of your choice.

    Download from Amazon S3
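
Alternatively, you can list and download the results with the AWS CLI. A minimal sketch, assuming the CLI is installed and configured and that your bucket is named hparser-output:

    # List the parsed output files.
    aws s3 ls s3://hparser-output/output/

    # Download them to a local directory for inspection.
    aws s3 cp s3://hparser-output/output/ ./hparser-results --recursive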

Step 8: Run Other Samples

HParser can run other transformation samples. The following table lists the sample data source directories and their corresponding transformation names:

Data Source Directory                                                        Transformation Name
s3n://elasticmapreduce/samples/informatica/1-basic-samples/input10-text      10_Text
s3n://elasticmapreduce/samples/informatica/1-basic-samples/input11-xml       11_XML_to_CSV
s3n://elasticmapreduce/samples/informatica/1-basic-samples/input30-binary    30_BinaryCDRParser
s3n://elasticmapreduce/samples/informatica/1-basic-samples/input40-edi       40_HIPAA
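
For example, to run the XML-to-CSV sample, reuse the command from Step 6 with the corresponding input directory and transformation name. The sketch below assumes the same hparser-job-conf-basic.xml configuration applies to all of the basic samples; replace hparser-output with your own bucket name, and note that output-xml is just an example output prefix.

    # Example: run the XML-to-CSV basic sample (the output path is an example).
    hadoop jar hparser-1.1.jar \
    com.informatica.b2b.dt.hadoop.DataTransformationJob \
    -conf hparser-job-conf-basic.xml \
    s3n://elasticmapreduce/samples/informatica/1-basic-samples/input11-xml \
    s3n://hparser-output/output-xml 11_XML_to_CSV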

You can also examine the implementation of these transformations with HParser Studio, which is included in the HParser download.

To view the transformation implementations, copy the implementation file from Amazon S3 to your local drive and unpack it. You can then import the transformations into HParser Studio.

Step 9: Clean Up

Now that you've learned how to run HParser on Amazon EMR, it's time to terminate your environment, clean up your resources, and avoid accruing any further charges.

To disconnect from SSH

  • At the SSH command prompt, type exit, and then press Enter. Afterwards you can close the terminal or PuTTY window.
    exit

To terminate a job flow

  • In the Amazon Elastic MapReduce console, click the job flow, and then click Terminate.
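
You can also terminate the job flow with the AWS CLI. The following is a sketch; the cluster ID is a placeholder, so substitute the ID shown for your job flow in the console.

    # Terminate the cluster; replace j-XXXXXXXXXXXXX with your job flow's ID.
    aws emr terminate-clusters --cluster-ids j-XXXXXXXXXXXXX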

To delete an Amazon S3 bucket

You cannot delete a bucket that contains objects, so before deleting the bucket you must first delete all of the objects it contains.

To delete an object

  1. Open the Amazon S3 console at https://console.aws.amazon.com/s3/.
  2. Click the bucket where the objects are stored.
  3. Right-click the object you want to delete. You can hold down Ctrl or Shift while clicking to select multiple objects and perform the same action on them simultaneously.

    A dialog box shows the actions you can take on the selected object(s).

  4. Click Delete.
  5. Confirm the deletion when the console prompts you to.

To delete a bucket

  1. Right-click the bucket you want to delete.

    A dialog box shows the actions you can take on the selected bucket.

  2. Click Delete.
  3. Confirm the deletion when the console prompts you to.

You have now deleted your bucket and all its contents.
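
If you prefer the command line, the same cleanup can be done with the AWS CLI. A minimal sketch; substitute your own bucket name for hparser-output.

    # Delete every object in the bucket, then delete the bucket itself.
    aws s3 rm s3://hparser-output --recursive
    aws s3 rb s3://hparser-output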

The next step is optional. It deletes the key pair you created in Step 2: Create a Key Pair. You are not charged for key pairs. If you are planning to explore Amazon EMR further, you should retain the key pair.

To delete a key pair

  1. From the Amazon EC2 console, select Key Pairs from the left-hand pane.
  2. In the right pane, select the key pair you created in Step 2: Create a Key Pair and click Delete.
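
The equivalent AWS CLI command is shown below as a sketch, assuming the key pair is named newkeypair.

    # Delete the key pair created in Step 2.
    aws ec2 delete-key-pair --key-name newkeypair --region us-east-1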

The next step is optional. It deletes two security groups created for you by Amazon EMR when you launched the job flow. You are not charged for security groups. If you are planning to explore Amazon EMR further, you should retain them.

To delete Amazon EMR security groups

  1. From the Amazon EC2 console, in the Navigation pane, click Security Groups.
  2. In the Security Groups pane, click the ElasticMapReduce-slave security group.
  3. In the details pane for the ElasticMapReduce-slave security group, delete all rules that reference ElasticMapReduce, and then click Apply Rule Changes.
  4. In the right pane, select the ElasticMapReduce-master security group.
  5. In the details pane for the ElasticMapReduce-master security group, delete all rules that reference ElasticMapReduce, and then click Apply Rule Changes.
  6. With ElasticMapReduce-master still selected in the Security Groups pane, click Delete. Click Yes to confirm.
  7. In the Security Groups pane, click ElasticMapReduce-slave, and then click Delete. Click Yes to confirm.

Appendix A: Input Data

The following is an example of the custom-formatted text files this tutorial uses as input.

Input Data

Appendix B: Output Data

The following is an example of the XML generated from the custom-formatted text input shown in Appendix A.

Output Data
