AWS Big Data Blog

Exploring Geospatial Intelligence using SparkR on Amazon EMR

Gopal Wunnava is a Senior Consultant with AWS Professional Services

The number of data sources that use location, such as smartphones and sensor devices used in the Internet of Things (IoT), is expanding rapidly. This growth has increased the demand for analyzing spatial data.

Geospatial intelligence (GEOINT) allows you to analyze data that has geographical or spatial dimensions and present it based on its location. GEOINT can be applied to many industries, including social and environmental sciences, law enforcement, defense, and weather and disaster management. In this blog post, I show you how to build a simple GEOINT application using SparkR that will allow you to appreciate GEOINT capabilities.

R has been a popular platform for GEOINT due to its wide range of functions and packages related to spatial analysis. SparkR helps you overcome the single-node memory and processing limitations of native R by letting you run geospatial applications in the distributed environment provided by the Spark engine. By implementing your SparkR geospatial application on Amazon EMR, you combine these benefits with the flexibility and ease of use that EMR provides.

Overview of a GEOINT application

You’ll use data from the GDELT project (the Global Database of Events, Language, and Tone) to build this GEOINT application. The GDELT dataset you will use is available on Amazon S3 in the form of a tab-delimited text file with an approximate size of 13 GB.

Your GEOINT application will generate images like the one below.

Building your GEOINT application

Create an EMR cluster (preferably with the latest release version) in your region of choice, specifying Spark, Hive, and Ganglia. To learn how to create a cluster, see Getting Started: Analyzing Big Data with Amazon EMR.

I suggest the r3 instance family for this application (1 master and 8 core nodes, all r3.8xlarge in this case), as it is suitable for the type of SparkR application you will create here. While I have chosen this cluster size for better performance, a smaller cluster could work as well.

After your cluster is ready and you SSH into the master node, run the following command to install the files required for this application to run on EMR:

sudo yum install libjpeg-turbo-devel

For this GEOINT application, you identify and display locations where certain events of interest related to the economy are taking place in the US. For a more detailed description of these events, see the CAMEO manual available from the GDELT website.

You can use either the SparkR shell (type sparkR on the command line) or RStudio to develop this GEOINT application. To learn how to configure RStudio on EMR, see the Crunching Statistics at Scale with SparkR on Amazon EMR blog post.

You need to install the required R packages for this application onto the cluster. This can be done by executing the following R statement:

install.packages(c("plyr","dplyr","mapproj","RgoogleMaps","ggmap"))

Note: The above step can take up to thirty minutes because a number of dependent packages must be installed onto your EMR cluster.

After the required packages are installed, the next step is to load them into your R environment:

library(plyr)
library(dplyr)
library(mapproj)
library(RgoogleMaps)
library(ggmap)

You can save the images generated by this application as a PDF document by opening a pdf() graphics device. Unless you use the setwd() function to set the path for this file, it is written to your current working directory.

setwd("/home/hadoop")
pdf("SparkRGEOINTEMR.pdf")

If you are using RStudio, the plots appear in the lower-right corner of your workspace.

Now, create the Hive context that is required to access the external table from within the Spark environment:

#set up Hive context
hiveContext <- sparkRHive.init(sc)

Note: If you are using the SparkR shell in EMR, the Spark context ‘sc’ is created automatically for you. If you are using RStudio, follow the instructions in the Crunching Statistics at Scale with SparkR on Amazon EMR blog post to create the Spark context.
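For reference, here is a minimal sketch of that RStudio setup, assuming a Spark 1.x SparkR installation under /usr/lib/spark (the usual location on EMR); adjust the paths and master setting for your environment:

# Sketch: initialize SparkR from RStudio on the EMR master node (Spark 1.x API)
Sys.setenv(SPARK_HOME = "/usr/lib/spark")
library(SparkR, lib.loc = file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))
sc <- sparkR.init(master = "yarn-client")
hiveContext <- sparkRHive.init(sc)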

Next, create an external table that points to your source GDELT dataset on S3.

sql(hiveContext,
"
CREATE EXTERNAL TABLE IF NOT EXISTS gdelt (
GLOBALEVENTID BIGINT,
SQLDATE INT,
MonthYear INT,
Year INT,
FractionDate DOUBLE,
Actor1Code STRING,
Actor1Name STRING,
Actor1CountryCode STRING,
Actor1KnownGroupCode STRING,
Actor1EthnicCode STRING,
Actor1Religion1Code STRING,
Actor1Religion2Code STRING,
Actor1Type1Code STRING,
Actor1Type2Code STRING,
Actor1Type3Code STRING,
Actor2Code STRING,
Actor2Name STRING,
Actor2CountryCode STRING,
Actor2KnownGroupCode STRING,
Actor2EthnicCode STRING,
Actor2Religion1Code STRING,
Actor2Religion2Code STRING,
Actor2Type1Code STRING,
Actor2Type2Code STRING,
Actor2Type3Code STRING,
IsRootEvent INT,
EventCode STRING,
EventBaseCode STRING,
EventRootCode STRING,
QuadClass INT,
GoldsteinScale DOUBLE,
NumMentions INT,
NumSources INT,
NumArticles INT,
AvgTone DOUBLE,
Actor1Geo_Type INT,
Actor1Geo_FullName STRING,
Actor1Geo_CountryCode STRING,
Actor1Geo_ADM1Code STRING,
Actor1Geo_Lat FLOAT,
Actor1Geo_Long FLOAT,
Actor1Geo_FeatureID INT,
Actor2Geo_Type INT,
Actor2Geo_FullName STRING,
Actor2Geo_CountryCode STRING,
Actor2Geo_ADM1Code STRING,
Actor2Geo_Lat FLOAT,
Actor2Geo_Long FLOAT,
Actor2Geo_FeatureID INT,
ActionGeo_Type INT,
ActionGeo_FullName STRING,
ActionGeo_CountryCode STRING,
ActionGeo_ADM1Code STRING,
ActionGeo_Lat FLOAT,
ActionGeo_Long FLOAT,
ActionGeo_FeatureID INT,
DATEADDED INT,
SOURCEURL STRING )
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION 's3://support.elasticmapreduce/training/datasets/gdelt'
");

Note: You might encounter an error in the above statement, with an error message that specifies an unused argument. This can be due to an overwritten Spark context. If this error appears, restart your SparkR or RStudio session.

Next, apply your filters to extract the desired events from the last two years for the country of interest and store the results in a SparkR dataframe named ‘gdelt’. In this post, I focus on spatial analysis of the data for the US only. For code samples that illustrate how this can be done for other countries, such as India, see the Exploring GDELT – Geospatial Analysis using SparkR on EMR GitHub site for the Big Data Blog.

gdelt<-sql(hiveContext,"SELECT * FROM gdelt WHERE ActionGeo_CountryCode IN ('US') AND Year >= 2014")

Register and cache this table in-memory:

registerTempTable(gdelt, "gdelt")
cacheTable(hiveContext, "gdelt")

Rename the columns for readability:

names(gdelt)[names(gdelt)=="actiongeo_countrycode"]="cn"
names(gdelt)[names(gdelt)=="actiongeo_lat"]="lat"
names(gdelt)[names(gdelt)=="actiongeo_long"]="long"
names(gdelt)

Next, extract a subset of columns from your original SparkR dataframe (‘gdelt’). Of particular interest are the “lat” and “long” columns, as these attributes provide you with the locations where these events have taken place. Register and cache this result table into another SparkR dataframe named ‘gdt’:

gdt <- gdelt[, c("sqldate", "eventcode", "globaleventid", "cn", "year", "lat", "long")]
registerTempTable(gdt, "gdt")
cacheTable(hiveContext, "gdt")

Now, filter down to specific events of interest. For this blog post, I have chosen certain event codes that relate to economic aid and cooperation. While the CAMEO manual provides more detail on what these event codes represent, the table below serves as a quick reference. Refer to the manual to choose events from other news categories that may be of particular interest to you.

0211: Appeal for economic cooperation
0231: Appeal for economic aid
0311: Express intent to cooperate economically
0331: Express intent to provide economic aid
061: Cooperate economically
071: Provide economic aid

Store your chosen event codes into an R vector object:

ecocodes <- c("0211","0231","0311","0331","061","071")	

Next, apply a filter operation and store the results in another SparkR dataframe named ‘gdeltusinf’. Follow the same approach of registering and caching this table.

gdeltusinf <- filter(gdt,gdt$eventcode %in% ecocodes)
registerTempTable(gdeltusinf, "gdeltusinf")
cacheTable(hiveContext, "gdeltusinf")

Now that you have a smaller dataset that is a subset of the original, collect this SparkR dataframe into a local R dataframe. By doing this, you can leverage the spatial libraries you installed into your local R environment earlier.

dflocale1 <- collect(select(gdeltusinf, "*"))

Save the R dataframe to a local file system in case you need to quit your SparkR session and want to reuse the file at a later point. You can also share this saved file with other R users and sessions.

save(dflocale1, file = "gdeltlocale1.Rdata")
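
Later, in a new R session, you can restore the saved dataframe with the standard load() function:

load("gdeltlocale1.Rdata")   # recreates the dflocale1 dataframe in your workspace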

Now, create separate dataframes by country as this allows you to plot the corresponding event locations on separate maps:

dflocalus1=subset(dflocale1,dflocale1$cn=='US')

Next, provide a suitable title for the maps and prepare to plot them:

plot.new()
title("GDELT Analysis for Economy related Events in 2014-2015")
map=qmap('USA',zoom=3)

The first plot identifies locations where all events related to economic aid and co-operation have taken place in the US within the last two years (2014-2015). For this example, these locations are marked in red, but you can choose another color.

map + geom_point(data = dflocalus1, aes(x = dflocalus1$long, y = dflocalus1$lat), color="red", size=0.5, alpha=0.5)
title("All GDELT Event Locations in USA related to Economy in 2014-2015")

It may take several seconds for the image to be displayed. From the above image, you can infer that the six chosen events related to the economy took place in locations all over the US in 2014-2015. While this map provides you with the insight that these six events were fairly widespread in the US, you might want to drill down further and identify locations where each of these six specific events took place.

For this purpose, display only certain chosen events. Start by displaying locations where events related to ‘0211’ (Appeal for Economic Cooperation) have taken place in the US, using the color blue:

dflocalus0211=subset(dflocalus1,dflocalus1$eventcode=='0211')
x0211=geom_point(data = dflocalus0211, aes(x = dflocalus0211$long, y = dflocalus0211$lat), color="blue", size=2, alpha=0.5)
map+x0211
title("GDELT Event Locations in USA: Economic Co-op(appeals)-Code 0211")

From the image above, you can see that the event ‘0211’ (Appeal for Economic Cooperation) was fairly widespread as well, but there is more of a concentration within the Eastern region of the US, specifically the Northeast.

Next, follow the same process, but this time for a different event, ‘0231’ (Appeal for Economic Aid). Notice the use of the color yellow for this purpose.

dflocalus0231=subset(dflocalus1,dflocalus1$eventcode=='0231')
x0231=geom_point(data = dflocalus0231, aes(x = dflocalus0231$long, y = dflocalus0231$lat), color="yellow", size=2, alpha=0.5)
map+x0231
title("GDELT Event Locations in USA:Economic Aid(appeals)-Code 0231")

From the above image, you can see that there is a heavy concentration of this event type in the Midwest, Eastern, and Western parts of the US while the North-Central region is sparser.

You can follow a similar approach to prepare separate R dataframes for each event of interest. Choosing a different color for each event allows you to identify each event type and locations much more easily.

dflocalus0311=subset(dflocalus1,dflocalus1$eventcode=='0311')
dflocalus0331=subset(dflocalus1,dflocalus1$eventcode=='0331')
dflocalus061=subset(dflocalus1,dflocalus1$eventcode=='061')
dflocalus071=subset(dflocalus1,dflocalus1$eventcode=='071')

x0211=geom_point(data = dflocalus0211, aes(x = dflocalus0211$long, y = dflocalus0211$lat), color="blue", size=3, alpha=0.5)
x0231=geom_point(data = dflocalus0231, aes(x = dflocalus0231$long, y = dflocalus0231$lat), color="yellow", size=1, alpha=0.5)
x0311=geom_point(data = dflocalus0311, aes(x = dflocalus0311$long, y = dflocalus0311$lat), color="red", size=1, alpha=0.5)
x0331=geom_point(data = dflocalus0331, aes(x = dflocalus0331$long, y = dflocalus0331$lat), color="green", size=1, alpha=0.5)
x061=geom_point(data = dflocalus061, aes(x = dflocalus061$long, y = dflocalus061$lat), color="orange", size=1, alpha=0.5)
x071=geom_point(data = dflocalus071, aes(x = dflocalus071$long, y = dflocalus071$lat), color="violet", size=1, alpha=0.5)

Using this approach allows you to overlay locations of different events on the same map, where each event is represented by a specific color. To illustrate this with an example, use the following code to display locations of three specific events from the steps above:

map+x0211+x0231+x0311

legend('bottomleft', c("0211:Appeal for Economic Co-op", "0231:Appeal for Economic Aid", "0311:Intent for Economic Co-op"), col=c("blue", "yellow", "red"), pch=16)
title("GDELT Locations In USA: Economy related Events in 2014-2015")

Note: If you are using RStudio, you might have to hit the refresh button on the Plots tab to display the map in the lower-right portion of your workspace.

From the above image, you can see that the Northeast region has the heaviest concentration of events, while certain areas such as North Central are sparser.
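
If you opened the pdf() graphics device at the beginning of the session, close it once you are done plotting so that the PDF file is written out to disk:

dev.off()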

Conclusion

I’ve shown you how to build a simple yet powerful geospatial application using SparkR on EMR. Though R is well suited to geospatial analysis, the memory and processing limits of a single R session prevent GEOINT applications from scaling to the extent required by large-scale workloads. You can overcome these limitations by mixing SparkR with native R workloads, a method referred to as big data / small learning. This approach lets you take your GEOINT application performance to the next level while still running the R analytics you know and love.

You can find the code samples for this GEOINT application, along with other use cases, on the Exploring GDELT – Geospatial Analysis using SparkR on EMR GitHub site. You can also find code samples there that use the concept of pipelines and the pipeR package to implement this functionality. Along similar lines, you can use the magrittr package to implement the functionality shown in this application.
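
As an illustration only, here is one way the subset-and-plot step for a single event code could be written as a magrittr pipeline, assuming the local dataframes and the map object created earlier in this post:

library(magrittr)

# Sketch: pipe the local US dataframe through subset() and overlay the result on the map
dflocalus1 %>%
  subset(eventcode == '0211') %>%
  { map + geom_point(data = ., aes(x = long, y = lat), color = "blue", size = 2, alpha = 0.5) }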

If you have any questions or suggestions, please leave a comment below.

————————————

Related

Crunching Statistics at Scale with SparkR on Amazon EMR
