Estimating the Location of Images Using MXNet and Multimedia Commons Dataset on AWS EC2

by Jaeyoung Choi and Kevin Li

This is a guest post by Jaeyoung Choi of the International Computer Science Institute and Kevin Li of the University of California, Berkeley. This project demonstrates how academic researchers can leverage our AWS Cloud Credits for Research Program to support their scientific breakthroughs.

Modern mobile devices can automatically assign geo-coordinates to images when you take them. However, most images on the web still lack this location metadata. Image geo-location is the process of estimating where an image was taken and applying a location label. Depending on the size of your dataset and how you pose the problem, the assigned location label can range from the name of a building or landmark to an actual geo-coordinate (latitude, longitude).

In this post, we show how to use a pre-trained model created with Apache MXNet to geographically categorize images. We use images from a dataset that contains millions of Flickr images taken around the world. We also show how to map the result to visualize it.

Our approach

The approaches to image geo-location can be divided into two categories: image-retrieval-based search approaches and classification-based approaches. (This blog post compares two state-of-the-art approaches in each category.)

Recent work by Weyand et al. posed image geo-location as a classification problem. In this approach, the authors subdivided the surface of the earth into thousands of geographic cells and trained a deep neural network with geo-tagged images. For a less technical description of their experiment, see this article.

Because the authors did not release their training data or their trained model, PlaNet, to the public, we decided to train our own image geo-locator. Our setup for training the model is inspired by the approach described in Weyand et al., but we changed several settings.

We trained our model, LocationNet, using MXNet on a single p2.16xlarge instance with geo-tagged images from the AWS Multimedia Commons dataset.

We split the training, validation, and test images so that images uploaded by the same person do not appear in more than one set. We used Google’s S2 Geometry Library to create the classes from the training data. The model converged after 12 epochs, which took about 9 days on the p2.16xlarge instance. A full tutorial with a Jupyter notebook is available on GitHub.
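One simple way to keep each uploader's images in a single set is to derive the split from a hash of the uploader ID. The following is a minimal sketch of that idea, not the procedure we used: the function name `assign_split`, the 5% validation and test shares, and the sample records are all hypothetical.

```python
import hashlib

def assign_split(user_id, val_pct=5, test_pct=5):
    """Map an uploader ID to 'train', 'val', or 'test' deterministically.

    The percentages are illustrative assumptions, not the article's values.
    """
    digest = hashlib.md5(user_id.encode('utf-8')).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 100)
    if bucket < test_pct:
        return 'test'
    if bucket < test_pct + val_pct:
        return 'val'
    return 'train'

# Hypothetical (photo_id, user_id) records
photos = [('p1', 'alice'), ('p2', 'bob'), ('p3', 'alice'), ('p4', 'carol')]

splits = {}
for photo_id, user_id in photos:
    splits.setdefault(assign_split(user_id), []).append(photo_id)
```

Because the split depends only on the uploader ID, every image from the same person lands in the same set, no matter how many images that person uploaded.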

The following table compares the setups used to train and test LocationNet and PlaNet.

                      LocationNet                                                         PlaNet
Dataset source        Multimedia Commons                                                  Images crawled from the web
Training set          33.9 million                                                        91 million
Validation set        1.8 million                                                         34 million
S2 cell partitioning  t1=5000, t2=500 → 15,527 cells                                      t1=10,000, t2=50 → 26,263 cells
Model                 ResNet-101                                                          GoogleNet
Optimization          SGD with momentum and LR schedule                                   Adagrad
Training time         9 days on 16 NVIDIA K80 GPUs (p2.16xlarge EC2 instance), 12 epochs  2.5 months on 200 CPU cores
Framework             MXNet                                                               DistBelief
Test set              Placing Task 2016 test set (1.5 million Flickr images)              2.3 million geo-tagged Flickr images
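The t1 and t2 values in the table are the two thresholds of the adaptive partitioning. As we read Weyand et al., a cell is subdivided until it holds no more than t1 images, and cells with fewer than t2 images are then discarded. The sketch below illustrates that behavior under a loud assumption: it uses a plain latitude/longitude quadtree as a stand-in for S2 cells, and the `partition` function and toy points are hypothetical.

```python
def partition(points, t1, t2, bounds=(-90.0, 90.0, -180.0, 180.0), max_depth=20):
    """Return a list of (bounds, points) cells.

    Assumption: a lat/lng quadtree stands in for the S2 cell hierarchy.
    """
    if len(points) <= t1 or max_depth == 0:
        # Leaf cell: keep it only if it has enough images to form a class
        return [(bounds, points)] if len(points) >= t2 else []
    lat_min, lat_max, lng_min, lng_max = bounds
    lat_mid = (lat_min + lat_max) / 2.0
    lng_mid = (lng_min + lng_max) / 2.0
    # Assign every point to one of the four child quadrants
    quadrants = {}
    for lat, lng in points:
        key = (lat >= lat_mid, lng >= lng_mid)
        quadrants.setdefault(key, []).append((lat, lng))
    cells = []
    for (north, east), members in quadrants.items():
        child = (lat_mid if north else lat_min,
                 lat_max if north else lat_mid,
                 lng_mid if east else lng_min,
                 lng_max if east else lng_mid)
        cells.extend(partition(members, t1, t2, child, max_depth - 1))
    return cells

# Toy data: a dense cluster, a small cluster, and an isolated point
tokyo = [(35.60 + 0.01 * i, 139.70 + 0.01 * i) for i in range(10)]
paris = [(48.85, 2.29), (48.86, 2.30), (48.87, 2.35)]
sydney = [(-33.86, 151.21)]

cells = partition(tokyo + paris + sydney, t1=4, t2=2)
```

With these toy thresholds, the dense cluster is split into several small cells, the small cluster stays in one cell, and the isolated point is dropped. This is why densely photographed areas end up with many fine-grained classes.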

At inference time, LocationNet outputs a probability distribution over the geographic cells. The center-of-mass geo-coordinate of the images in the cell with the highest likelihood is assigned as the geo-coordinate of the query image.
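To make the inference step concrete, here is a minimal sketch using NumPy only. The helper names `cell_center` and `decode_location` and the toy values are hypothetical. Taking the center of mass as the plain mean of latitudes and longitudes is an assumption for illustration; it would misbehave for a cell that straddles the antimeridian.

```python
import numpy as np

def cell_center(coords):
    """Center of mass of a cell, taken here as the mean (lat, lng).

    Assumption: a plain arithmetic mean is good enough for small cells.
    """
    arr = np.asarray(coords, dtype=float)
    return (float(arr[:, 0].mean()), float(arr[:, 1].mean()))

def decode_location(prob, cell_centers):
    """Return the center of the most likely cell and its probability."""
    top = int(np.argmax(prob))
    return cell_centers[top], float(prob[top])

# Toy example with three cells
cell_centers = [
    cell_center([(35.0, 139.0), (36.0, 140.0)]),  # mean is (35.5, 139.5)
    (48.86, 2.29),
    (40.69, -74.04),
]
prob = np.array([0.7, 0.2, 0.1])  # hypothetical network output
location, confidence = decode_location(prob, cell_centers)
```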

LocationNet is shared publicly in the MXNet Model Zoo.

Downloading LocationNet

Now download LocationNet, the pretrained model. LocationNet has been trained on the subset of geo-tagged images in the AWS Multimedia Commons dataset. The Multimedia Commons dataset contains more than 39 million images, and the model assigns an image to one of about 15 thousand geographic cells (classes).

LocationNet has two parts: a JSON file containing the model definition and a binary file containing the parameters. We load the necessary packages and download the files from S3.

import os
import urllib
import mxnet as mx
import logging
import numpy as np
from skimage import io, transform
from collections import namedtuple
from math import radians, sin, cos, sqrt, asin

path = ''  # S3 location of the model files
model_path = 'models/'
# Create the local model directory if it does not exist yet
if not os.path.exists(model_path):
    os.makedirs(model_path)
urllib.urlretrieve(path+'RN101-5k500-symbol.json', model_path+'RN101-5k500-symbol.json')
urllib.urlretrieve(path+'RN101-5k500-0012.params', model_path+'RN101-5k500-0012.params')

Then, load the downloaded model. If you don’t have a GPU available, replace mx.gpu() with mx.cpu():

# Load the pre-trained model
prefix = "models/RN101-5k500"
load_epoch = 12
sym, arg_params, aux_params = mx.model.load_checkpoint(prefix, load_epoch)
mod = mx.mod.Module(symbol=sym, context=mx.gpu())
mod.bind([('data', (1,3,224,224))], for_training=False)
mod.set_params(arg_params, aux_params, allow_missing=True)

The grids.txt file contains the geographic cells used for training the model.

The i-th line is the i-th class, and the columns are: S2 Cell Token, Latitude, and Longitude. We load the labels to a list named grids.

# Download and load grids file 

# Load labels.
grids = []
with open('grids.txt', 'r') as f:
    for line in f:
        line = line.strip().split('\t')
        lat = float(line[1])
        lng = float(line[2])
        grids.append((lat, lng))

We use the haversine formula to measure the great-circle distance between points p1 and p2 in kilometers. Each point is a (latitude, longitude) pair in degrees:

def distance(p1, p2):
    R = 6371 # Earth radius in km
    lat1, lng1, lat2, lng2 = map(radians, (p1[0], p1[1], p2[0], p2[1]))
    dlat = lat2 - lat1
    dlng = lng2 - lng1
    a = sin(dlat * 0.5) ** 2 + cos(lat1) * cos(lat2) * (sin(dlng * 0.5) ** 2)
    return 2 * R * asin(sqrt(a))
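For reference, the function above computes the following. The symbols are our own notation: \varphi is latitude and \lambda is longitude, both in radians, and R = 6371 km is the Earth radius used in the code.

```latex
\begin{aligned}
a &= \sin^2\!\left(\frac{\varphi_2 - \varphi_1}{2}\right)
   + \cos\varphi_1 \,\cos\varphi_2 \,\sin^2\!\left(\frac{\lambda_2 - \lambda_1}{2}\right) \\
d &= 2R \,\arcsin\!\left(\sqrt{a}\right)
\end{aligned}
```

Here a corresponds to the variable `a` in the code and d is the returned distance in kilometers.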

Before feeding the image to the deep learning network, we preprocess it by cropping it to a centered square, resizing it to 224x224 pixels, and subtracting the mean RGB values:

# mean image for preprocessing
mean_rgb = np.array([123.68, 116.779, 103.939])
mean_rgb = mean_rgb.reshape((3, 1, 1))

def PreprocessImage(path, show_img=False):
    # load image.
    img = io.imread(path)
    # Crop the largest centered square, then resize it to 224x224.
    short_side = min(img.shape[:2])
    yy = int((img.shape[0] - short_side) / 2)
    xx = int((img.shape[1] - short_side) / 2)
    crop_img = img[yy : yy + short_side, xx : xx + short_side]
    resized_img = transform.resize(crop_img, (224,224))
    if show_img:
        # display the cropped and resized image in the notebook
        io.imshow(resized_img)
    # convert to numpy.ndarray and rescale from [0, 1] back to pixel values
    sample = np.asarray(resized_img) * 256
    # swap axes to make image from (224, 224, 3) to (3, 224, 224)
    sample = np.swapaxes(sample, 0, 2)
    sample = np.swapaxes(sample, 1, 2)
    # sub mean 
    normed_img = sample - mean_rgb
    normed_img = normed_img.reshape((1, 3, 224, 224))
    return [mx.nd.array(normed_img)]
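If the cropping and axis swapping are hard to follow, the short check below may help. It is a sketch on a synthetic array, so it needs neither an image file nor MXNet. The helper `center_crop_box` and the 300x400 input size are hypothetical.

```python
import numpy as np

def center_crop_box(height, width):
    """Return (yy, xx, side) of the largest centered square crop."""
    short_side = min(height, width)
    yy = (height - short_side) // 2
    xx = (width - short_side) // 2
    return yy, xx, short_side

# Synthetic 300x400 RGB image in (height, width, channel) layout
img = np.random.rand(300, 400, 3)
yy, xx, side = center_crop_box(img.shape[0], img.shape[1])
crop = img[yy:yy + side, xx:xx + side]

# The two swapaxes calls used in PreprocessImage ...
chw_swapped = np.swapaxes(np.swapaxes(crop, 0, 2), 1, 2)
# ... produce the same (channel, height, width) layout as one transpose
chw_transposed = np.transpose(crop, (2, 0, 1))
```

The crop removes 50 columns from each side of the wider dimension, and both axis operations move the channel axis to the front, which is the layout the network expects.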

Evaluating and comparing models

For evaluation, we use two datasets: the IM2GPS dataset and a test dataset of Flickr images that is used in the MediaEval Placing 2016 benchmark.

Results for the IM2GPS test set

The following values indicate the percentage of images in the IM2GPS test set that were correctly located within each distance from the actual location.

Method 1km 25km 200km 750km 2500km
PlaNet 8.4% 24.5% 37.6% 53.6% 71.3%
LocationNet 16.8% 39.2% 48.9% 67.9% 82.2%
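The metric in this table can be computed from the per-image distance errors, that is, the haversine distance between each predicted and actual location. The following is a minimal sketch; the function name `accuracy_at_thresholds` and the sample errors are hypothetical, and counting an error exactly equal to a threshold as correct is an assumption.

```python
import numpy as np

THRESHOLDS_KM = (1, 25, 200, 750, 2500)

def accuracy_at_thresholds(errors_km, thresholds=THRESHOLDS_KM):
    """Percentage of images whose error is within each distance threshold."""
    errors = np.asarray(errors_km, dtype=float)
    return {t: 100.0 * int(np.count_nonzero(errors <= t)) / errors.size
            for t in thresholds}

# Hypothetical distance errors (in km) for five test images
errors_km = [0.4, 12.0, 180.0, 900.0, 5000.0]
accuracy = accuracy_at_thresholds(errors_km)
```

Each threshold is cumulative: an image located within 1 km also counts as located within 25 km, 200 km, and so on, which is why the percentages in each row never decrease from left to right.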

Results for Flickr images

These results are not directly comparable because the test set images used in PlaNet have not been publicly released. The values indicate the percentage of images in the test set that were correctly located within each distance from the actual location.

Method 1km 25km 200km 750km 2500km
PlaNet 3.6% 10.1% 16.0% 28.4% 48.0%
LocationNet 6.2% 13.5% 20.8% 35.6% 55.2%

By visually inspecting the geo-located images, we can see that the model does well with landmark locations, but it is also capable of correctly geo-locating non-landmark scenes.

Estimating the geo-location of an image using a URL

Now let’s try to geo-locate an image on the web using its URL.

Batch = namedtuple('Batch', ['data'])
def predict(imgurl, prefix='images/'):
    download_url(imgurl, prefix)
    imgname = imgurl.split('/')[-1]
    batch = PreprocessImage(prefix + imgname, True)
    # predict and show top 5 results
    mod.forward(Batch(batch), is_train=False)
    prob = mod.get_outputs()[0].asnumpy()[0]
    # class indices sorted from most to least likely
    pred = np.argsort(prob)[::-1]
    result = list()
    for i in range(5):
        pred_loc = grids[int(pred[i])]
        print('rank=%d, prob=%f, lat=%s, lng=%s' \
              % (i+1, prob[pred[i]], pred_loc[0], pred_loc[1]))
        # keep the (lat, lng) of each top-5 cell so it can be mapped later
        result.append(pred_loc)
    return result

def download_url(imgurl, img_directory):
    # create the image directory if it does not exist yet
    if not os.path.exists(img_directory):
        os.makedirs(img_directory)
    imgname = imgurl.split('/')[-1]
    filepath = os.path.join(img_directory, imgname)
    # download the image only if it is not already cached locally
    if not os.path.exists(filepath):
        filepath, _ = urllib.urlretrieve(imgurl, filepath)
        statinfo = os.stat(filepath)
        print('Successfully downloaded', imgname, statinfo.st_size, 'bytes.')
    return filepath

Let’s see how our model does with an image of Tokyo Tower. The following code downloads the image from the URL and outputs the model’s location prediction.

#download and predict geo-location of an image of Tokyo Tower
url = ''
result = predict(url)

The output lists the top five results, each with its confidence score (prob) and geo-coordinate:

rank=1, prob=0.139923, lat=35.6599344486, lng=139.728919109
rank=2, prob=0.095210, lat=35.6546613641, lng=139.745685815
rank=3, prob=0.042224, lat=35.7098435803, lng=139.810458528
rank=4, prob=0.032602, lat=35.6641725688, lng=139.746648114
rank=5, prob=0.023119, lat=35.6901996892, lng=139.692857396

It is hard to tell the quality of the geo-location output with just the raw latitude and longitude values. Let’s map the output to visualize the results.

Visualizing results using Google Maps on the Jupyter notebook

To check whether the prediction makes sense, we display the results on Google Maps inside the Jupyter notebook, using a plugin called gmaps. To install gmaps, follow the installation instructions on the gmaps GitHub page.

Visualizing the result with gmaps takes only a few lines of code. In your notebook, type the following:

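import gmaps

gmaps.configure(api_key="") # Fill in with your API key 

fig = gmaps.figure()

# Add one numbered marker per predicted location
for i in range(len(result)):
    marker = gmaps.marker_layer([result[i]], label=str(i+1))
    fig.add_layer(marker)
fig  # display the map in the notebook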

The top-1 geo-location estimation result is, indeed, right on the spot where Tokyo Tower is.

Now, try to geo-locate images of your choice!


Training LocationNet on AWS has been graciously supported by AWS Programs for Research and Education. We also thank the AWS Public Dataset program for hosting the Multimedia Commons dataset for public use. Our work is also partially supported by a collaborative LDRD led by Lawrence Livermore National Laboratory (U.S. Dept. of Energy contract DE-AC52-07NA27344).
