Amazon EMR is a managed service that makes it fast, easy, and cost-effective to run Apache Hadoop and Spark to process vast amounts of data. Amazon EMR also supports other proven Hadoop tools, including Presto, Hive, Pig, and HBase. In this project, you will deploy a fully functional Hadoop cluster, ready to analyze log data, in just a few minutes. You will start by launching an Amazon EMR cluster and then use a HiveQL script to process sample log data stored in an Amazon S3 bucket. HiveQL is a SQL-like scripting language for data warehousing and analysis. You can then use a similar setup to analyze your own log files.
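The workflow described above (launch a cluster, then run a HiveQL script against log data in Amazon S3) can also be expressed programmatically. The following is a minimal sketch, not the procedure this project follows: it only builds the request that the boto3 EMR client's run_job_flow call accepts and does not contact AWS. The bucket names, script path, cluster name, release label, instance types, and instance count are all placeholder assumptions chosen for illustration, and the two role names assume the default EMR roles already exist in your account.

```python
# Sketch only: builds keyword arguments for boto3's EMR run_job_flow call.
# Every URI, the release label, and the instance settings are placeholders.


def build_cluster_request(script_uri, input_uri, output_uri, log_uri):
    """Return run_job_flow arguments for a cluster that runs one Hive step."""
    hive_step = {
        "Name": "Process sample logs with Hive",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "hive-script", "--run-hive-script", "--args",
                "-f", script_uri,               # HiveQL script stored in S3
                "-d", f"INPUT={input_uri}",     # where the log data lives
                "-d", f"OUTPUT={output_uri}",   # where results are written
            ],
        },
    }
    return {
        "Name": "log-analysis-cluster",         # hypothetical cluster name
        "ReleaseLabel": "emr-6.15.0",           # assumed release label
        "LogUri": log_uri,
        "Applications": [{"Name": "Hive"}],
        "Instances": {
            "MasterInstanceType": "m5.xlarge",  # assumed instance type
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 3,
            # Shut the cluster down once the step finishes, to limit cost.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [hive_step],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


request = build_cluster_request(
    script_uri="s3://example-bucket/scripts/process_logs.q",
    input_uri="s3://example-bucket/input",
    output_uri="s3://example-bucket/output",
    log_uri="s3://example-bucket/logs",
)
```

With AWS credentials configured, the request could be submitted with boto3.client("emr").run_job_flow(**request). Setting KeepJobFlowAliveWhenNoSteps to False is a design choice for short experiments: the cluster terminates on its own after the Hive step completes.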
What you'll need before starting:
An AWS Account: You will need an AWS account to begin provisioning the resources for your Amazon EMR cluster. Sign up for AWS.
IT Experience: Prior experience with Hadoop is recommended, but not required, to complete this project.
AWS Experience: Basic familiarity with Amazon S3 and Amazon EC2 key pairs is suggested, but not required, to complete this project.
Cost to complete project: The estimated cost to complete this project is $1.05. This cost assumes that you are within the AWS Free Tier limits, you follow the recommended configurations, and that you terminate all resources used in the project within an hour of creating them. Your use case may require different configurations that can impact your bill. Use the Pricing Calculator to estimate costs tailored for your needs.
Monthly billing estimate: The total cost of this project will vary depending on your usage and configuration settings. With the default configuration recommended in this guide, the project will typically cost $769/month. AWS pricing is based on your usage of each individual service, and the combined usage of all services makes up your monthly bill. For a breakdown of the services used and their associated costs, see Services Used and Costs.
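The gap between the $1.05 project estimate and the $769 monthly estimate is easier to see with the arithmetic written out. The sketch below rests on an assumption the guide does not state: that the monthly figure corresponds to leaving the same resources running around the clock at roughly the one-hour rate, over an average month of 30.5 days. Under that assumption the two figures agree to within rounding.

```python
# Assumption: the monthly estimate is the one-hour estimate sustained 24/7.
HOURLY_COST_USD = 1.05        # cost to complete the project within an hour
HOURS_PER_MONTH = 24 * 30.5   # assumed average month of 30.5 days


def monthly_cost(hourly_cost, hours):
    """Cost of running at a constant hourly rate for the given hours."""
    return hourly_cost * hours


always_on = monthly_cost(HOURLY_COST_USD, HOURS_PER_MONTH)
print(f"Always-on estimate: ${always_on:.2f} per month")
```

The practical takeaway matches the guide's advice: terminating all resources within an hour of creating them keeps the bill near the lower figure.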