Indexing Common Crawl Metadata on Amazon EMR Using Cascading and Elasticsearch

Hernan Vivani is a Big Data Support Engineer for Amazon Web Services

A previous post showed you how to get started with Elasticsearch and Kibana on Amazon EMR. In that post, we installed Elasticsearch and Kibana on an Amazon EMR cluster using bootstrap actions.

This post shows you how to build a simple application with Cascading for reading Common Crawl metadata, index the metadata on Elasticsearch, and use Kibana to query the indexed content.

What is Common Crawl?

Common Crawl is an open-source repository of web crawl data. This data set is freely available on Amazon S3 under the Common Crawl terms of use. The data is stored in several data formats. In this example, you work with the WAT response format that contains the metadata for the crawled HTML information. This allows you to build an Elasticsearch index, which can be used to extract useful information about tons of sites on the Internet.

What is Cascading?

Cascading is an application development platform for building data applications on Apache Hadoop. In this post, you use it to build a simple application that indexes JSON files in Elasticsearch, without the need to think in terms of MapReduce methods.

Launching an EMR cluster with Elasticsearch, Maven, and Kibana

As in the previous post, you launch a cluster with Elasticsearch and Kibana installed. You also install Maven to compile the application and run a script to resolve some library dependencies between Elasticsearch and Cascading. All the bootstrap actions are public, so you can download the code to verify the installation steps at any time.

To launch the cluster, use the AWS CLI and run the following command:

aws emr create-cluster --name "Elasticsearch_Getting_Started" --ami-version 3.11.0 \
--instance-type=m3.xlarge --instance-count 3 \
--ec2-attributes KeyName=your-key \
--log-uri s3://your-bucket/logs/ \
--bootstrap-action Name="Install EMR Dev Tools",Path=s3://awssupportdatasvcs.com/bootstrap-actions/EMR_Dev/setup_EMR_Dev.sh \
Name="Install Cascading",Path=s3://awssupportdatasvcs.com/bootstrap-actions/Cascading/cascading-install.sh \
Name="Configure Cascading Classpath",Path=s3://awssupportdatasvcs.com/bootstrap-actions/Cascading/cascading-set-classpath.sh \
Name="Install Elasticsearch",Path=s3://support.elasticmapreduce/bootstrap-actions/other/elasticsearch_install.rb \
Name="Install Kibana",Path=s3://support.elasticmapreduce/bootstrap-actions/other/kibananginx_install.rb \
--no-auto-terminate --use-default-roles --region us-east-1

Compiling Cascading Source Code with Maven

After you have the cluster up and running, you can connect using SSH into the master node to compile and run the application. Your Cascading application applies a filter before you start the indexing process, to remove the WARC envelope and obtain plain JSON output. For more information about the code, see the Github repository.

Clone the repository:

$ git clone https://github.com/awslabs/aws-big-data-blog.git

Compile the code:

$ cd aws-big-data-blog/aws-blog-elasticsearch-cascading-commoncrawl/commoncrawl.cascading.elasticsearch
$ mvn clean && mvn assembly:assembly -Dmaven.test.skip=true  -Ddescriptor=./src/main/assembly/job.xml -e

Compiled application is placed in the following directory: aws-big-data-blog/aws-blog-elasticsearch-cascading-commoncrawl/commoncrawl.cascading.elasticsearch/target

Listing the directory should show the packaged application, as shown in the following graphic:

Indexing Common Crawl Metadata on Elasticsearch

Using the application you just compiled, you can index a single Common Crawl file or a complete directory, by modifying the parameter. The following commands show you how to index a file or directory.

Index a single file:

hadoop jar /home/hadoop/aws-big-data-blog/aws-blog-elasticsearch-cascading-commoncrawl/commoncrawl.cascading.elasticsearch/target/commoncrawl.cascading.elasticsearch-0.0.1-SNAPSHOT-job.jar com.amazonaws.bigdatablog.indexcommoncrawl.Main s3://commoncrawl/crawl-data/CC-MAIN-2014-52/segments/1419447563504.69/wat/CC-MAIN-20141224185923-00099-ip-10-231-17-201.ec2.internal.warc.wat.gz

Index a complete directory:

hadoop jar /home/hadoop/aws-big-data-blog/aws-blog-elasticsearch-cascading-commoncrawl/commoncrawl.cascading.elasticsearch/target/commoncrawl.cascading.elasticsearch-0.0.1-SNAPSHOT-job.jar com.amazonaws.bigdatablog.indexcommoncrawl.Main s3://commoncrawl/crawl-data/CC-MAIN-2014-52/segments/1419447563504.69/wat/

Running the command to index a single file produces the following output:

Running the command to index a single file produces this output

The application writes each JSON entry directly into Elasticsearch using the Cascading and Hadoop connectors.

Checking Indexes and Mappings

The index on Elasticsearch is created automatically, using the default configuration. Now, run a couple of commands on the console to check the index and mappings.

List all indexes:

$ curl 'localhost:9200/_cat/indices?v'

Listing indexes

View the mappings:

curl -XGET 'http://localhost:9200/_all/_mapping' | python -m json.tool |more

If you look at the mapping output, you’ll see that it follows the structure showed on the Common Crawl WAT metadata description: http://commoncrawl.org/the-data/get-started/.

Viewing mappings

This mapping is shown in the Kibana menu and allows you to navigate the different metadata entries.

Querying Indexed Content

Because the Kibana bootstrap action configures the cluster to use port 80, you can point the browser to the master node public DNS address to access the Kibana console. On the Kibana console, click Sample Dashboard to start exploring the content indexed earlier in this post.

A sample dasbhard appears with some basic information extracted:

Querying Indexed Content

You can search Head.Metas headers for all the occurrences of “hello”; in the search box, type “HTML-Metadata.Head.Metas AND keywords AND hello”.

Searching for headers

That search returns all the records that contain ‘keywords’ and ‘hello’ on the “Metadata.Head.Metas” header. The result looks like the following:

Search results

Another useful way to find information is by using the mapping index. You can click “Envelope.Payload-Metadata.HTTP-Response-Metadata.Headers.Server” to see a ranking of the different server technologies of all the indexed sites:

using the mapping index

Click the magnifier icon to find all the details on the selected entry.

Or you can get the top ten technologies used in the indexed web application by clicking “Envelope.Payload-Metadata.HTTP-Response-Metadata.Headers.X-Powered-By”. The following graphic shows an example:

Getting the top ten technologies used in the indexed web application

Conclusion

This post has shown how EMR lets you build and compile a simple Cascading application and use it to index Common Crawl metadata on an Elasticsearch cluster.

Cascading provided a simple application layer on top of Hadoop to parallelize the process and fetch the data directly from the S3 repository location, while Kibana provided a presentation interface that allowed you to research the indexed data in many ways.

If you have questions or suggestions, please leave a comment below.

————————————

Do more with EMR:

Using IPython Notebook to Analyze Data with EMR

—————————————————————

Love to work on open source? Check out EMR’s careers page.

—————————————————————-

AWS Big Data Blog

Indexing Common Crawl Metadata on Amazon EMR Using Cascading and Elasticsearch

What is Common Crawl?

What is Cascading?

Launching an EMR cluster with Elasticsearch, Maven, and Kibana

Compiling Cascading Source Code with Maven

Indexing Common Crawl Metadata on Elasticsearch

Checking Indexes and Mappings

Querying Indexed Content

Conclusion

Resources

Follow

Learn

Resources

Developers

Help