Up and Running with Big Data: 3 Day Deep-Dive
Over three days, explore the Big Data tools, technologies and techniques that allow organizations to gain insight and drive new business opportunities by finding signal in their data. Using Amazon Web Services, you'll learn how to use the flexible map/reduce programming model to scale your analytics, use Hadoop with Elastic MapReduce, write queries with Hive, develop real-world data flows with Pig, and understand the operational needs of a production data platform.
Schedules and Locations
What You Will Learn
Brought to you by AWS & Think Big Analytics, this course offers a hands-on training experience with short lectures and plenty of programming exercises. The agenda includes the following topics:
- Amazon Elastic MapReduce Overview and Hadoop Architecture.
- Amazon Elastic MapReduce Value Proposition.
- Starting and examining your first EMR cluster.
- Transient versus persistent clusters
- Dynamic cluster resizing
- Writing your first Amazon Elastic MapReduce job.
- Loading Data into the cluster.
- Amazon Elastic MapReduce Controls and Debugging EMR.
- CloudWatch integration
- S3 backup and disaster recovery
- Spot Integration
- Bootstrap Actions for cluster customization and configuration
- DynamoDB integration
- Data and Security.
- Elastic MapReduce Programming Models.
- Amazon Elastic MapReduce with streaming.
- Amazon Elastic MapReduce with Pig.
- Amazon Elastic MapReduce with Hive.
- Advanced Hadoop Features – UDFs, UDAFs.
- Amazon Elastic MapReduce Ecosystem.
Prerequisites
The following prerequisites ensure that you will gain the maximum benefit from the course.
- Programming experience: This is a developer course. We will write Java, Hive, and Pig applications. Prior Java experience is strongly recommended.
- Linux shell experience: Basic Linux shell (bash) commands will be used extensively. Some prior experience is recommended.
- Experience with SQL databases: SQL experience is helpful for learning Hive and Pig, but not essential.
What You Must Bring
We will log into remote EMR instances to build, test, and run our applications. You will also be provided with all the exercise software so you can view it on your laptop, if desired.
Bring your laptop with the following software installed in advance.
- JDK 1.6 or 1.7: The JDK (Java Development Kit), version 1.6 or newer. The full JDK is required, not just the JRE (Java Runtime Environment).
- Ant: The Java-based Ant build tool, version 1.7 or newer, if you want to build and test the Java exercises on your laptop.
- A programmer’s source code editor: Whatever you prefer. Either Eclipse or IntelliJ IDEA is recommended for the Java exercises, and project files for both environments will be provided. You might find a separate programmer’s text editor more convenient for the Hive and Pig exercises.
*Please Note*: This course is not taught by AWS employees.
Additional Elastic MapReduce Resources
Interested in learning more? Follow the links below to learn what EMR is and how it can help you!
- The agenda for the class, and why developers should consider using Amazon Elastic MapReduce.
- Signing up for an AWS account, generating a key-pair, and setting up an S3 bucket.
- Creating, monitoring, and getting results from your EMR Job Flow.
- EC2 instance types, pricing, and Hadoop cluster configuration.
- S3 architectures, pricing, and access control.
- How to use a Hadoop Job Flow to analyze text from Wikipedia.
- When and how to use the EMR and s3cmd tools.
- Best practices for debugging EMR Job Flows.
- Creating, monitoring, and getting results from Hive & Pig Job Flows.
- How to use a Hive Job Flow to analyze Wikipedia article data.
- Bootstrap actions, spot pricing and task groups.