Indexing Common Crawl Metadata on Amazon EMR Using Cascading and Elasticsearch
Hernan Vivani is a Big Data Support Engineer for Amazon Web Services
A previous post showed you how to get started with Elasticsearch and Kibana on Amazon EMR. In that post, we installed Elasticsearch and Kibana on an Amazon EMR cluster using bootstrap actions.
This post shows you how to build a simple application with Cascading for reading Common Crawl metadata, index the metadata on Elasticsearch, and use Kibana to query the indexed content.
What is Common Crawl?
What is Cascading?
Cascading is an application development platform for building data applications on Apache Hadoop. In this post, you use it to build a simple application that indexes JSON files in Elasticsearch, without the need to think in terms of MapReduce methods.
Launching an EMR cluster with Elasticsearch, Maven, and Kibana
As in the previous post, you launch a cluster with Elasticsearch and Kibana installed. You also install Maven to compile the application and run a script to resolve some library dependencies between Elasticsearch and Cascading. All the bootstrap actions are public, so you can download the code to verify the installation steps at any time.
To launch the cluster, use the AWS CLI and run the following command:
Compiling Cascading Source Code with Maven
After you have the cluster up and running, you can connect using SSH into the master node to compile and run the application. Your Cascading application applies a filter before you start the indexing process, to remove the WARC envelope and obtain plain JSON output. For more information about the code, see the Github repository.
Clone the repository:
$ git clone https://github.com/awslabs/aws-big-data-blog.git
Compile the code:
$ cd aws-big-data-blog/aws-blog-elasticsearch-cascading-commoncrawl/commoncrawl.cascading.elasticsearch $ mvn clean && mvn assembly:assembly -Dmaven.test.skip=true -Ddescriptor=./src/main/assembly/job.xml -e
Compiled application is placed in the following directory: aws-big-data-blog/aws-blog-elasticsearch-cascading-commoncrawl/commoncrawl.cascading.elasticsearch/target
Listing the directory should show the packaged application, as shown in the following graphic:
Indexing Common Crawl Metadata on Elasticsearch
Using the application you just compiled, you can index a single Common Crawl file or a complete directory, by modifying the parameter. The following commands show you how to index a file or directory.
Index a single file:
Index a complete directory:
Running the command to index a single file produces the following output:
The application writes each JSON entry directly into Elasticsearch using the Cascading and Hadoop connectors.
Checking Indexes and Mappings
The index on Elasticsearch is created automatically, using the default configuration. Now, run a couple of commands on the console to check the index and mappings.
List all indexes:
$ curl 'localhost:9200/_cat/indices?v'
View the mappings:
curl -XGET 'http://localhost:9200/_all/_mapping' | python -m json.tool |more
If you look at the mapping output, you’ll see that it follows the structure showed on the Common Crawl WAT metadata description: http://commoncrawl.org/the-data/get-started/.
This mapping is shown in the Kibana menu and allows you to navigate the different metadata entries.
Querying Indexed Content
Because the Kibana bootstrap action configures the cluster to use port 80, you can point the browser to the master node public DNS address to access the Kibana console. On the Kibana console, click Sample Dashboard to start exploring the content indexed earlier in this post.
A sample dasbhard appears with some basic information extracted:
You can search Head.Metas headers for all the occurrences of “hello”; in the search box, type “HTML-Metadata.Head.Metas AND keywords AND hello”.
That search returns all the records that contain ‘keywords’ and ‘hello’ on the “Metadata.Head.Metas” header. The result looks like the following:
Another useful way to find information is by using the mapping index. You can click “Envelope.Payload-Metadata.HTTP-Response-Metadata.Headers.Server” to see a ranking of the different server technologies of all the indexed sites:
Click the magnifier icon to find all the details on the selected entry.
Or you can get the top ten technologies used in the indexed web application by clicking “Envelope.Payload-Metadata.HTTP-Response-Metadata.Headers.X-Powered-By”. The following graphic shows an example:
This post has shown how EMR lets you build and compile a simple Cascading application and use it to index Common Crawl metadata on an Elasticsearch cluster.
Cascading provided a simple application layer on top of Hadoop to parallelize the process and fetch the data directly from the S3 repository location, while Kibana provided a presentation interface that allowed you to research the indexed data in many ways.
If you have questions or suggestions, please leave a comment below.
Do more with EMR:
Love to work on open source? Check out EMR’s careers page.