Do “Burns Cliff”, “Columbia Hills”, ”Endeavour”, and “Bonneville Crater” sound out of this world? You are right! These are some of the geographical structures visited by NASA’s Mars Exploration Rovers (MER). Over the years, the rovers beam back troves of exciting data including high-resolution images about the Red Planet. Now, Amazon Simple Workflow Service (Amazon SWF) joins some of the key computing technologies behind these missions, in enabling NASA scientists to reliably drive mission critical operations and efficiently process the growing knowledge we gather about our universe.
Mars Exploration Rover

NASA’s Jet Propulsion Laboratory (JPL) uses Amazon SWF as an integral part of several missions including the Mars Exploration Rover (MER) and Carbon in the Arctic Reservoir Vulnerability Experiment (CARVE). These missions continuously generate large volumes of data that must be processed, analyzed, and stored efficiently and reliably. The data processing pipelines for tactical operations and scientific analysis involve large scale ordered execution of steps with ample opportunity for parallelization across multiple machines. Examples include generation of stereoscopic data from pairs of images, stitching of multi-gigapixel panoramas to immerse the scientist into the Martian terrain, and tiling of these gigapixel images so that the data can be loaded on demand. There is a global community of operators and scientists who rely on such data. This community is frequently on a tight tactical operations timeline, as short as only a few hours. To meet these needs, JPL engineers set a goal of processing and disseminating Mars images within minutes.

JPL has long been storing and processing data in AWS. Much of this work is done with the Polyphony framework, which is the reference implementation of JPL’s Cloud Oriented Architecture. It provides support for provisioning, storage, monitoring, and task orchestration of data processing jobs in the cloud. Polyphony’s existing toolset for data processing and analytics consisted of Amazon EC2 for computational capacity, Amazon S3 for storage and data distribution, as well as Amazon SQS and MapReduce implementations such as Hadoop for task distribution and execution. However, there was a key missing piece: an orchestration service to reliably manage tasks for large, complex workflows.

While queues provide an effective approach to distribute massively parallel jobs, engineers quickly ran into shortcomings. The inability to express ordering and dependencies within queues made them unfit for complex workflows. JPL engineers also had to deal with duplication of messages when using queues. For example, when stitching images together, a duplicate task to stich an image would result in expensive redundant processing the result of which would drive further expensive computation as the pipeline proceeds to completion in futility. JPL also has numerous use cases that go beyond brute data processing and require mechanisms to drive control flow. While engineers were able to implement their data driven flows easily with MapReduce, they found it difficult to express every step in the pipeline within the semantics of the framework. Particularly, as the complexity of data processing increased, they faced difficulties in representing dependencies between processing steps and in handling distributed computation failures.

JPL engineers identified the need for an orchestration service with the following characteristics:

  • Highly Available: To support mission critical operations
  • Scalable: To facilitate parallel execution simultaneously from hundreds of Amazon EC2 instances
  • Consistent: A scheduled task must be executed once with very high probability
  • Expressive: Simple expression of complex workflows to expedite development
  • Flexible: Workflows execution must not be limited to Amazon EC2 and tasks must be routable
  • Performant: Tasks should be scheduled with minimal latency

JPL engineers used Amazon SWF and integrated the service with the Polyphony pipelines responsible for data processing of Mars images for tactical operations. They gained unprecedented control and visibility into the distributed execution of their pipelines. Most importantly, they were able to express complex workflows succinctly without being forced to express the problem in any specific paradigm.

To support the Fast Motion Field Test for the Curiosity rover, also known as Mars Science Laboratory, JPL engineers had to process images, generate stereo imagery, and make panoramas. Stereo imagery requires a pair of images acquired at the same time, and it generates range data that tells a tactical operator the distance and direction from the rover to the pixels in the images. The left and right images can be processed in parallel; however, stereo processing cannot start until each image has been processed. This classic split-join workflow is difficult to express with a queue based system, while expressing it with SWF requires a few simple lines of Java code together with AWS Flow Framework annotations.

swf-nasa-mer-stereo-cameras-whiteboard
swf-nasa-mer-swf-code-whiteboard

Generation of panoramas is also implemented as a workflow. For tactical purposes, panoramas are generated at each location where the rover parks and takes pictures. Hence, anytime a new picture arrives from a particular location, the panorama is augmented with the newly available information. Due to the large scale of the panoramas and the requirement to generate them as quickly as possible, the problem has to be divided and orchestrated across numerous machines. The algorithm employed by the engineers divides the panorama into large multiple rows. The first task of the workflow is to generate each of the rows based on images available at the location. Once the rows have been generated, they are scaled down to multiple resolutions and tiled for consumption by remote clients. Using the rich feature-set provided by Amazon SWF, JPL engineers have expressed this application flow as an Amazon SWF workflow.

swf-nasa-mer-tiling

An Opportunity Pancam mosaic of total size 11280×4280 pixels containing 77 color images. Tiles at six levels of detail are required to deliver this image to a viewer at any arbitrary size. The yellow grid lines indicate the tiles required for each image. The Panoramas for Mastcam instrument on Mars Science Laboratory consist of up to 1296 images and have a resolution of almost 2 Gigapixels! The corresponding panoramic image is shown below.

swf-nasa-mer-panorama-whiteboard

By making orchestration available in the cloud, Amazon SWF gives JPL the ability to leverage resources inside and outside its environment and seamlessly distribute application execution into the public cloud, enabling their applications to dynamically scale and run in a truly distributed manner.

Many JPL data processing pipelines are structured as automated workers to upload firewalled data, workers to process the data in parallel, and workers to download results. The upload and download workers run on local servers and the data processing workers can run both on local servers and on the Amazon EC2 nodes. By using the routing capabilities in Amazon SWF, JPL developers dynamically incorporated workers into the pipeline while taking advantage of worker characteristics such as data locality. This processing application is also highly available because even when local workers fail, cloud based workers continue to drive the processing forward. Since Amazon SWF does not constrain the location of the worker nodes, JPL runs jobs in multiple regions as well as in its local data centers to allow for the highest availability for mission critical systems. As Amazon SWF becomes available in multiple regions, JPL plans to integrate automatic failover to SWF across regions.

swf-nasa-mer-arch-whiteboard

JPL’s use of Amazon SWF is not limited to data processing applications. Using the scheduling capabilities in Amazon SWF, JPL engineers built a distributed Cron job system that reliably performed timely mission critical operations. In addition to the reliability, JPL gained unprecedented, centralized visibility into these distributed jobs through the Amazon SWF’s visibility capabilities available in the AWS Management Console. JPL has even built an application to backup vital data from MER on Amazon S3. With distributed Cron jobs, JPL updates the backups as well as audits the integrity of the data as frequently as desired by the project. All the steps of this application including encryption, upload to S3, random selection of data to audit, and actual auditing through comparison of onsite data with S3 are reliably orchestrated through Amazon SWF. In addition, several JPL teams have quickly migrated their existing applications to use orchestration in the cloud by leveraging the programming support provided through the AWS Flow Framework.

JPL continues to use Hadoop for simple data processing pipelines and Amazon SWF is now a natural choice for implementing applications with complex dependencies between the processing steps. Developers also frequently use the diagnostic and analytics capability available through the AWS Management Console to debug applications during development and in tracking distributed executions. Using AWS, mission critical applications that previously took months to develop, test, and deploy, now take days.