Category: Thought Pieces

Empowerment, Engagement, and Education for Women in Tech

I’ve been earning a living in the technology industry since 1977, when I worked in one of the first computer stores in the country as a teenager. Looking back over the past 40 years, and realizing that the Altair, IMSAI, Sol-20, and North Star Horizon machines that I learned about, built, debugged, programmed, sold, and supported can now be seen in museums (Seattle’s own Living Computer Museum is one of the best), helps me to appreciate that the world I live in changes quickly, and to understand that I need to do the same. This applies to technology, to people, and to attitudes.

I lived in a suburb of Boston in my early teens. At that time, diversity meant that one person in my public school had come all the way from (gasp) England a few years earlier. When I went to college I began to meet people from other countries and continents and to appreciate the fresh vantage points and approaches that they brought to the workplace and to the problems that we tackled together.

Back in those days, there were virtually no women working as software engineers, managers, or entrepreneurs. Although the computer store was owned by a couple and the wife did all of the management, this was the exception rather than the rule at that time, and for too many years after that. Today, I am happy to be part of a team that brings together the most capable people, regardless of their gender, race, background, or anything other than their ability to do a kick-ass job (Ana, Tara, Randall, Tina, Devin, and Sara, I’m talking about all of you).

We want to do all that we can to encourage young women to prepare to become the next generation of engineers, managers, and entrepreneurs. AWS is proud to support Girls Who Code (including the Summer Immersion Program), Girls in Tech, and other organizations supporting women and underrepresented communities in tech. I sincerely believe that these organizations will be able to move the needle in the right direction. However, like any large-scale social change, this is going to take some time with results visible in years and decades, and only with support & participation from those of us already in the industry.

In conjunction with me&Eve, we were able to speak with some of the attendees at the most recent Girls in Tech Catalyst conference (that’s our booth in the picture). Click through to see what the attendees had to say:

I’m happy to be part of an organization that supports such a worthwhile cause, and that challenges us to make our organization ever-more diverse. While reviewing this post with my colleagues I learned about We Power Tech, an AWS program designed to build skills and foster community and to provide access to Amazon executives who are qualified to speak about the program and about diversity. In conjunction with our friends at Accenture, we have assembled a strong Diversity at re:Invent program.


PS – I did my best to convince Ana, Tara, Tina, or Sara to write this post. Tara finally won the day when she told me “You have raised girls into women, and you are passionate in seeing them succeed in their chosen fields with respect and equity. Your post conveying that could be powerful.”

The Cloud as a Platform for Platforms

Of the many things I love about AWS, I will mention three of my favorites in this blog post:

  • AWS does not force developers to use any particular programming model, language, or operating system.
  • AWS does not force developers to use the entire suite of services – they can use any of our infrastructure services individually or in any combination.
  • AWS does not limit developers to a pre-set amount of storage, bandwidth, or computing resources they can consume – they can use as much or as little as they wish, and only pay for what they use.

Our customers love this flexibility. Today, a developer can run more experiments and achieve results much faster than before. If something does not work in a particular environment, the developer can drop that idea, click a few buttons, dispose of all of the infrastructure, and move on to the next experiment, starting with a fresh, new environment. Developers can try out several new ideas simultaneously by running multiple projects concurrently. Once the ideas are implemented, they can be further battle-tested using more resources in the AWS cloud until they become finished products. Developers love this because they are able to convert their concept or idea into a successful finished product quickly. As a result, we are seeing tremendous innovation happening at breakneck speed. The cloud is becoming a platform for innovation.

The inherent flexibility of the AWS cloud enables customers to use it as a Platform in a variety of ways, including:

  • The AWS cloud as a Platform for Collaboration
  • The AWS cloud as a Platform for Computation
  • The AWS cloud as a Platform for Software and Data Delivery
  • The AWS cloud as a Platform for Hot and Cold Storage
  • The AWS cloud as a Platform for Research Development and Experimentation

Every day, I find customers in each of the categories mentioned above. Some of them share their stories and architectures with us.

It does not stop there!

It’s inspiring when I see the AWS cloud being used as a Platform for Platforms.

AWS is not only a rich platform to build solutions but also a platform for building specialized platforms. Customers can choose to either use the AWS cloud directly or take advantage of these value-added platforms. Customers can also mix and match platforms from this rich ecosystem.  

In this post, we look at some of the best examples of specialized platforms built on AWS:

Ruby Platforms
Heroku, as most of you may already know, is one of the early platforms built on top of Amazon Web Services. This “Instant Ruby Platform” enables any Ruby developer to take their existing Ruby code and move it to the cloud. Customers of Heroku do not have to worry about scaling or managing their server farm in case there is a success disaster. Heroku deploys a Ruby app in a single step without changing the app or the process. They recently launched commercially and offer similar pay-as-you-go pricing for enterprises and hobbyists, along with a free tier for trying and testing your prototypes. By offering a deployment Platform-as-a-Service on top of Amazon’s reliable platform, everybody wins. The end user gets everything they need to build a modern web-scale application quickly, while Heroku manages the “magic” (via Slugs and Dynos) of Ruby deployment, without the end user worrying about the complexity of maintaining and managing the underlying infrastructure.

Engine Yard offers a rich and open Ruby deployment platform on Amazon Web Services. Developers can take advantage of Engine Yard’s pre-configured, standardized stacks, which make it easy to deploy a Ruby on Rails application. Using Engine Yard’s wizard-style web interface, developers can create the entire environment, including choosing which Unix packages and Ruby Gems to install and setting the frequency of database backups. They have a nice video on how to deploy your Ruby app on EC2 in 10 minutes.


CodeRun: Language-Agnostic Development Platform
I stumbled upon the CodeRun Online Development Platform on Amazon EC2 a few months ago and have been tracking it closely. The interface looks just like Visual Studio, but in the browser. End users can code in PHP, AJAX, or ASP.NET in an in-browser IDE which is fully hosted on multiple Amazon EC2 instances, and then deploy the code (by clicking Debug or Run) using several backend services that in turn run on various other Amazon EC2 instances (all managed). Free accounts may share instances, while premium accounts (which will be available in August) will run on stand-alone instances. Developers may deploy their code to Amazon EC2 more than once for testing, debugging, and production purposes. Code snippets are stored on Amazon S3 and can be shared among developers (check out AWS Code Samples). Amazon EBS is used to store users’ files and data, while Amazon CloudFront is used to distribute static files. Logs are stored in Amazon SimpleDB. They almost have a full house!

The platform includes a custom elasticity mechanism that monitors resource usage and performs automatic scaling based on a dynamic set of predefined business rules. Developers can code in existing technologies (.NET/PHP/JS) and outsource scalability to CodeRun. With a single click, you can deploy your app. CodeRun leverages the AWS APIs to completely automate the deployment process, which includes allocating resources (instances, addresses, and storage), copying files, synchronizing database structure, and configuring the web servers. The entire platform is not fully baked yet but, I think, it has tremendous potential.

Voice Platform
Twilio is a voice platform built on top of AWS. As Twilio’s website suggests, developers can build innovative voice apps like sales automation systems, order inquiry lines, CRM solutions, call routing apps, appointment reminders, and custom voicemail apps. Platform developers at Twilio are focused on building a powerful telephony platform on top of Amazon Web Services. Twilio is drop-dead simple and easy to get started with (friction-free development). With this simplicity, I think it won’t be too long until a developer will be able to write a phone tree app that calls up all your friends from your social network about an upcoming party and gets their RSVPs over the phone, which can then be viewed on a website. Take a look at Twilio’s presentation (from the AWS Start-Up Tour in Seattle), and you will be convinced that they are AWS experts and know what they are doing.

At VoiceCon, Siemens Enterprise Communications pre-announced that their Voice and Unified Communications product suite will be available as-a-Service on the AWS cloud. It will be interesting to see how this platform evolves.


Bottom line
Heroku, Engine Yard, Twilio, and CodeRun are all different in nature and behavior. All of them are built using different technologies and methodologies. All are targeting different market segments. But they share one thing in common: they are all built on AWS. All of them are built to scale and to take advantage of flexibility. Innovation thrives in an environment that permits flexibility. AWS gives them the flexibility they need, along with the scalability and elasticity their customers require.

This, to me, is very inspiring. What do you think?

– Jinesh Varia (evangelists at amazon dot com)

Virtual Stress-free Testing in the Cloud

I am giving a presentation at the upcoming FutureTest conference on “Testing in the Cloud”. Using Amazon EC2 for software testing is one of the “lowest-hanging fruit” use cases. In this blog post, I will review some of the highlights of my presentation: some of the cool things the cloud brings to the world of application testing.

Instant Virtual Test Lab
Our customers love the fact that they can spawn up test boxes when they need them, within minutes, and terminate them when they are done with testing. The time it took to procure new hardware, the configuration effort, the upfront investment, and the under-utilization of testing resources were major obstacles to good quality application testing, and hence testing was often either ignored or left incomplete. The fact that you can now get 10, 15, or even thousands of server instances encourages better and more complete testing, thereby creating more confidence. Some teams set up instant test labs for a 3-month test phase, spawning a few dozen instances to run their traditional usability and functional tests. Outsourcing/QA companies love it because they can now charge their customers for the infrastructure consumed that month (usage-based costing). The day does not seem far off when the cloud becomes the ideal testing infrastructure of the future.

Testing as a Service
Companies like SOASTA, LoadStorm, and BrowserMob are building their businesses around providing Testing-as-a-Service. SOASTA spawned 650 EC2 servers to simulate load from two different availability zones in order to stress test QTRAX, a music-sharing website. After a 3-month iterative test-analyze-fix-test cycle, QTRAX can now serve 10M hits/hour and handle 500K concurrent users. This was a significant improvement from their first test, which failed at 500 concurrent users. This “Let’s run it again” philosophy encourages an iterative process and helps you focus on your fixes while abstracting away all the complexity of generating load and analyzing results (they provide real-time graphs on a dashboard). Likewise, LoadStorm users can select a target server, build out a test plan, and run a load test right from their website. BrowserMob, built by the core developer of the Selenium project, has created a client parallelization testing tool that spawned 2,000 real browsers on 334 High-CPU XL instances for a media company, 1-cast, to test the complex streaming and AJAX features of its website. These vendors offer Testing as a Service, reduce the complexity of traditional enterprise tools (like LoadRunner), and have pay-as-you-go models (charging either by bandwidth consumed or by testing hours used), thereby eliminating huge upfront licensing costs, not to mention the time saved in learning and configuring these systems and in maintaining and operating physical test labs. This “Push it to the Cloud” philosophy helps us focus on building cool features in our applications and outsource QA without losing control (unlike traditional outsourcing/offshoring models).

Throw-away boxes
It was funny when I was talking to some developers about testing. One of the developers jumped up and said, “If you mess up the configuration, simply dump the instance and start a new one. Time is precious, dude!” I knew developers from server-less start-up companies in our ecosystem who start their dev boxes every morning, run them for 12 hours (the average developer uptime) per day, and shut them down every night before they go home. But I never thought that one could actually use Amazon EC2 to create thousands of test environments in the cloud, all fresh and new, dump them if they get messed up, and recreate them in the next test/sprint cycle. When you are testing your mobile application on different device platforms, or testing your database-agnostic and app server-agnostic middleware on different deployment configurations, the cloud becomes an ideal platform to create, dump, and recreate environments as you need them.

Virtualization (AMIs) for Repros
Creating test environments with one click has lots of advantages. The on-going battle between testers, developers, and operators over “I cannot reproduce the bug” can be resolved with a few clicks by recreating the environments and sharing Amazon Machine Images. It is a one-click BundleInstance request (for Windows) using our Management Console, or 3 command line calls to bundle/upload/register an AMI (Linux). Of course, it’s not advisable to create AMIs for every bug (that’s insane), but at least severe production bugs can be reproduced very quickly.

Automation through Web Services
One of the best technology benefits of cloud computing is automation via web service interfaces. It is now easy to set up a system that builds your code at 2 AM, runs it on 2 Large instances for 2 hours, and runs all your regression and unit tests to make sure that all the code compiles and passes (total cost: ~$1/day). One can also set up the system to replicate the environment and test the application by running clients from different geographical regions (US/EU) and Availability Zones.
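To make the “~$1/day” arithmetic concrete, here is a tiny sketch of the cost calculation. The hourly rate below is a hypothetical placeholder, not a quoted EC2 price; actual rates vary by instance type and region.

```python
# Back-of-the-envelope estimator for an automated nightly test run.
# The hourly rate is a hypothetical placeholder -- check current EC2 pricing.

def nightly_test_cost(instances: int, hours: float, hourly_rate: float) -> float:
    """Return the on-demand cost of one automated test run."""
    return instances * hours * hourly_rate

# Two Large instances for two hours at a (hypothetical) $0.25/hour:
cost = nightly_test_cost(instances=2, hours=2, hourly_rate=0.25)
print(f"Cost per nightly run: ${cost:.2f}")  # -> Cost per nightly run: $1.00
```

Because the instances exist only while the tests run, the cost scales with usage rather than with the size of a permanently provisioned test lab.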

Mechanical Turk for Innovative Testing
Using the on-demand workforce to help you in testing your app:

  1. Workers create actual test scenarios (Selenium)
  2. Workers enable Usability testing
  3. Workers help analyze results from cross-browser testing (present a screenshot and ask a turker to compare the pages)
  4. Workers analyze your test results/log files
  5. Workers test for broken links on your website
  6. Workers participate in surveys that rate look-and-feel, navigation, search features of your website

You will find a variety of customer stories and actual HITs. One of last year’s Start-Up Challenge Contest nominees uses Amazon Mechanical Turk to create real videos that capture user behavior.

Parallelization (Client and Server)

BrowserMob can start up thousands of real Firefox client browsers (not simulated HTTP traffic) and test your applications the way they should be tested. All of this happens in parallel. Imagine if you could do this on the server side as well; in other words, run all your tests (unit tests, integration tests, regression tests, browser tests, etc.) in parallel. The cost of 1 server running for 1,000 hours is the same as the cost of 1,000 servers running for 1 hour. But in the latter case, quality will be verified in minutes instead of hours and days. Cloud computing inspires us to think parallel.
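The cost symmetry above can be sketched in a few lines: the same total spend buys radically different wall-clock times. The hourly rate here is hypothetical.

```python
# Fixed amount of work (in instance-hours), variable degree of parallelism.
# Rates are hypothetical illustrations, not real EC2 prices.

def run_profile(total_instance_hours: float, servers: int, hourly_rate: float):
    """Return (wall_clock_hours, total_cost) for a fixed amount of work."""
    wall_clock = total_instance_hours / servers
    cost = total_instance_hours * hourly_rate
    return wall_clock, cost

serial = run_profile(1000, servers=1, hourly_rate=0.25)
parallel = run_profile(1000, servers=1000, hourly_rate=0.25)
print(serial)    # -> (1000.0, 250.0)  six weeks of waiting
print(parallel)  # -> (1.0, 250.0)     same cost, results in one hour
```

The simplification, of course, is that real test suites do not parallelize perfectly; coordination overhead eats into the ideal speedup.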

To reduce the uncertainty in today’s Era of Tera (you never know when your app will become successful), we have to perform all the necessary stress and load tests prior to launch so that we can set the right expectations (SLAs). Customers all around the globe will access your web application. If the website is slow or does not respond, it amounts to bad customer service.

Take advantage of all the technologies that make up the cloud (virtualization for reuse and repros, web services for automation, crowdsourcing for creating test scenarios) to stress, load, and performance test your web applications, thereby improving the quality and speed (response time) of your application while keeping your costs down.

James Whittaker wrote a series of posts on the “Future of Testing” which I highly recommend everybody read. He predicts that the next generation of testing is TestSourcing (after Crowdsourcing, Outsourcing, and Insourcing), in which the focus will be on Tests (and not Testers, Testing, and Tools). Building on that idea, I think:

TestSourcing = CrowdSourcing + CloudComputing

Testing in the Cloud – use the power of people (an on-demand workforce creating standardized, reusable test scenarios for your apps) and the power of cloud computing tools like Amazon EC2 (on-demand hosted virtualization) and Amazon S3 (for storing your logs and regression results).

How do you see Testing in the Cloud evolve?

If you would like to learn more about these thoughts, catch me at the FutureTest Conference in New York!

— Jinesh

Update: BrowserMob Blog has more details

Cloudbursting – Hybrid Application Hosting

I get to meet with lots of developers and system architects as part of my job. Talking to them about cloud computing and about the Amazon Web Services is both challenging and rewarding. Cloud computing as a concept is still relatively new. When I explain what it is and what it enables, I can almost literally see the light bulbs lighting up in people’s heads as they understand cloud computing and our services, and what they can do with them.

A typical audience contains a nice mix of wild-eyed enthusiasts and more conservative skeptics. The enthusiasts are ready to jump in to cloud computing with both feet, and start to make plans to move corporate assets and processes to the cloud as soon as possible. The conservative folks can appreciate the benefits of cloud computing, but would prefer to take a more careful and measured approach. When the enthusiasts and the skeptics are part of the same organization, they argue back and forth and often come up with an interesting hybrid approach.

The details vary, but a pattern is starting to emerge. The conservative side advocates keeping core business processes inside of the firewall. The enthusiasts want to run on the cloud. They argue back and forth for a while, and eventually settle on a really nice hybrid solution. In a nutshell, they plan to run the steady state business processing on existing systems, and then use the cloud for periodic or overflow processing.

After watching (sometimes in real time in the course of a meeting) this negotiation and ultimate compromise take place time and time again in the last few months, I decided to invent a new word to describe what they are doing. I could have come up with some kind of lifeless and forgettable acronym, but that’s not my style. I proposed cloudbursting in a meeting a month or two ago and everyone seemed to like it.

So, here we go. Cloudbursting is an application hosting model which combines existing corporate infrastructure with new, cloud-based infrastructure to create a powerful, highly scalable application hosting environment.

Earlier this week my colleague Deepak Singh pointed me to a blog post written by Thomas Brox Røst. In the post, Thomas talks about how he combined traditional hosting with an EC2-powered, batch mode page regeneration system. His site (Eventseer) contains over 600,000 highly interconnected pages. As traffic and content grew, serving up the pages dynamically became prohibitively expensive. Regenerating all of the pages on a single server would have taken an unacceptably long 7 days, and even longer as the site became more complex. Instead, Thomas used a cloudbursting model, regenerating the pages on an array of 25 Amazon EC2 instances in just 5 hours (or, as he notes, for “roughly the cost of a pint of beer in Norway”). There’s some more information about his approach on the High Scalability blog. Thomas has also written about running Django on EC2 using EBS.
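As a rough illustration of the capacity planning behind this kind of batch cloudburst, here is a sketch that estimates fleet size for a fixed deadline. The one-second-per-page figure is an assumption loosely derived from the “7 days on one server” estimate, not a measured number, and the sketch ignores startup overhead and imperfect parallelism (which is why it does not land exactly on Eventseer’s 25 instances).

```python
import math

# Capacity-planning sketch for a batch cloudburst.
# seconds_per_job is a hypothetical, Eventseer-like figure.

def instances_needed(total_jobs: int, seconds_per_job: float,
                     deadline_hours: float) -> int:
    """Minimum number of identical workers to finish within the deadline."""
    total_seconds = total_jobs * seconds_per_job
    return math.ceil(total_seconds / (deadline_hours * 3600))

# 600,000 pages at ~1 second each, to be finished within 5 hours:
print(instances_needed(600_000, 1.0, 5))  # -> 34
```

The appeal of cloudbursting is that this number can change from run to run: as the site grows, you simply burst a larger fleet instead of buying hardware for the peak.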

I’d be interested in hearing about more approaches to creating applications which cloudburst.

— Jeff;

White Paper on ‘Cloud Architectures’ and Best Practices of Amazon S3, EC2, SimpleDB, SQS

I am very happy to announce my white paper on Cloud Architectures is now ready. This is one incarnation of the Emerging Cloud Service Architectures that Jeff wrote about a few weeks ago.

If you are new to the cloud, the first section of the paper will help you understand the benefits of building applications in-the-cloud. If you are using the cloud already, the second section of the paper will help you to use the cloud more effectively by utilizing some of the best practices.

In this paper, I discuss a new way to design architectures. Cloud Architectures are service-oriented architectures designed to use on-demand infrastructure more effectively. Applications built on Cloud Architectures use the underlying computing infrastructure only when it is needed (for example, to process a user request), draw the necessary resources on demand (such as compute servers or storage), perform a specific job, and then relinquish the unneeded resources when the job is done. While in operation, the application scales up or down elastically based on the actual need for resources. Everything is automated and operates without any human intervention.


As an example of a Cloud Architecture, I discuss the GrepTheWeb application. This application runs a regular expression against millions of documents from the web and returns the filtered results which match the query. The architecture is interesting because it runs completely on demand in an automated fashion. Triggered by a regex request, hundreds of Amazon EC2 instances are launched and a Hadoop cluster is started on them; transient messages are stored in Amazon SQS queues, statuses in Amazon SimpleDB, and all Map/Reduce jobs are run in parallel. Each Map task fetches its file from Amazon S3 and runs the regular expression, the results are aggregated in the Reduce/Combine phase, and then all the infrastructure is disposed of back into the cloud once the Hadoop job is complete.
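To make the Map/Reduce shape of GrepTheWeb concrete, here is a toy, single-process sketch of the same flow. In the real system these phases are distributed across a Hadoop cluster on EC2, with SQS and SimpleDB handling coordination; the document names and pattern below are invented for illustration.

```python
import re
from collections import defaultdict

def map_grep(doc_id: str, text: str, pattern: str):
    """Map task: emit (doc_id, line) for every line matching the regex."""
    regex = re.compile(pattern)
    return [(doc_id, line) for line in text.splitlines() if regex.search(line)]

def reduce_grep(mapped):
    """Reduce phase: group the matching lines by document."""
    results = defaultdict(list)
    for doc_id, line in mapped:
        results[doc_id].append(line)
    return dict(results)

docs = {"a.html": "hello cloud\nplain line", "b.html": "cloudy skies"}
mapped = [kv for d, t in docs.items() for kv in map_grep(d, t, r"cloud")]
print(reduce_grep(mapped))
# -> {'a.html': ['hello cloud'], 'b.html': ['cloudy skies']}
```

The value of the cloud version is that each `map_grep` call can run on a different machine against a different S3 object, and the fleet disappears when the reduce phase finishes.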

GrepTheWeb is one of many applications built by Amazon that uses all our services (Amazon EC2, Amazon SimpleDB, Amazon SQS, Amazon S3) together.


A wide variety of applications can be built using this design approach, from nightly batch processing systems to media processing pipelines.

An excerpt:

Cloud Architectures address key difficulties surrounding large-scale data processing. In traditional data processing, first, it is difficult to get as many machines as an application needs. Second, it is difficult to get the machines when one needs them. Third, it is difficult to distribute and co-ordinate a large-scale job across different machines, run processes on them, and provision another machine to recover if one machine fails. Fourth, it is difficult to auto-scale up and down based on dynamic workloads. Fifth, it is difficult to get rid of all those machines when the job is done. Cloud Architectures solve such difficulties.

Applications built on Cloud Architectures run in-the-cloud where the physical location of the infrastructure is determined by the provider. They take advantage of simple APIs of Internet-accessible services that scale on-demand, that are industrial-strength, where the complex reliability and scalability logic of the underlying services remains implemented and hidden inside-the-cloud. The usage of resources in Cloud Architectures is as needed, sometimes ephemeral or seasonal, thereby providing the highest utilization and optimum bang for the buck.

In the first section I discuss the advantages and business benefits of Cloud Architectures and how each service was used. In the second section, I discuss best practices for the various Amazon Web Services.

You can download the PDF version or access it in the AWS Resource Center.

I talked about this briefly at the Hadoop Summit 2008 and at QCon 2007. I got some good reviews after the talks, and hence I decided to put all my thoughts into this paper, along with some best practices for the use of Amazon Web Services (Amazon EC2, Amazon SQS, Amazon S3, and Amazon SimpleDB together). Many developers from our community have been asking for a real-world example of a complex, large-scale application. I will be presenting this paper at the 2008 NSF Data-Intensive Scalable Computing Workshop at UW and at the 9th IEEE/NATEA Conference on Cloud Computing later this week.

I believe this new and emerging way of building applications, that run in-the-cloud, is going to change the way we do business.

— Jinesh

The Emerging Cloud Service Architecture

I’m going to go out on a limb today and try to paint a picture of where some of this cool and crazy cloud-based infrastructure may be going. While none of what I will write about is idle speculation, it is based on just a few data points, and may be totally off base. However, I do get to talk to plenty of entrepreneurs and developers on a daily basis, and I am starting to see a very interesting pattern emerge.

The existing state of the art in cloud-based architectures takes the shape of an application running in the cloud, calling upon services running within and provided by the operator of the cloud. There are any number of great examples of this type of architecture. Doug Kaye at IT Conversations built and documented his implementation over a year ago. Earlier today, Don MacAskill of SmugMug sent me a link to his new post, SkyNet Lives (aka EC2 @ SmugMug). In that article, Don provides a detailed review of SmugMug’s use of Amazon EC2 and S3 to implement a dynamic, highly scalable system which simultaneously minimizes response time and cost by optimizing the number of EC2 instances.

As I said, I am starting to see something which goes beyond this in a subtle yet important way. Developers are now building services in the cloud for other developers, with the understanding that important (and perhaps primary) consumers of the service will also be resident within the same cloud.

I’m going to call this the CSA, or Cloud Service Architecture.

Applications communicating with each other inside of the Amazon cloud enjoy some important benefits. They get high-bandwidth, low-latency communication, at little or no cost. They inherit all of the other attributes of cloud-based applications such as on-demand scalability, fault tolerance, cloud-wide network security, and cost efficiency. Applications running in loosely coupled fashion within the cloud can share data using SQS, S3, or other communication protocols of their choosing.
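As a sketch of the loose coupling described above, here is a minimal in-process stand-in for a queue-mediated producer/consumer pair. With SQS the queue would live in the cloud and the two sides could run on separate EC2 instances; the “media service” here is a hypothetical example, not a real product API.

```python
import queue

# One service publishes work to a queue; another consumes it. Neither side
# needs to know the other's address -- only the queue. SQS provides the same
# decoupling across machines, with the queue hosted inside the cloud.

def media_service(jobs: queue.Queue, done: list) -> None:
    """Consumer: drain the queue and 'process' each job."""
    while True:
        try:
            job = jobs.get_nowait()
        except queue.Empty:
            break
        done.append(f"transcoded {job}")

jobs: queue.Queue = queue.Queue()
for name in ("intro.mov", "demo.mov"):
    jobs.put(name)            # producer side: fire and forget

results: list = []
media_service(jobs, results)  # consumer side, possibly another instance
print(results)  # -> ['transcoded intro.mov', 'transcoded demo.mov']
```

Because the queue is the only shared contract, either side can be scaled, replaced, or restarted independently, which is exactly the property that makes intra-cloud service composition attractive.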

Right now, I see that forward-looking companies are starting to build components which fit into the CSA. On the database side, we have Vertica for the Cloud and MySQL Enterprise for EC2. On the media side, there’s Cruxy’s MuxCloud, IntrIdea’s MediaPlug, and Wowza Media Server Pro for Amazon EC2. I’m sure that there are others that I don’t know about.

So who’s calling these services from other EC2 instances within the cloud? Here are my first two data points (that’s enough to draw a trend line, right?):

  1. I had breakfast with the CEO of Sonian yesterday. He told me that they are now using the Vertica product to help them store, index, and retrieve massive amounts of data (more info can be found in their case study).
  2. Earlier this year I paid a visit to VisualCV in Reston, Virginia. They use MediaPlug to support uploading and processing of a variety of types of images and videos.

My sense is that this is the start of something big. Web services made it possible to cross organizational boundaries with a simple HTTP request. Now, running within the cloud makes it possible to do this with minimal network latency.

As individual developers learn more about cloud computing, they will naturally look for some very high-level components up and running within the cloud. Over time I am sure that there will be a need for more sophisticated tracking and billing mechanisms, key management, a catalog of services, and other facilities that we can’t even envision just yet. As always, we love to get this feedback from you, so let us know what you need.

I’m sure that there are some other CSA-style applications running in the Amazon cloud now. If you’ve built one, post a comment!

— Jeff;

Two Good Podcasts

I hardly ever listen to broadcast radio in my car anymore. Instead, I subscribe to a whole bunch of podcasts, some technical, some fun, and others educational. Here are two episodes which should be of interest to anyone who reads this blog:

The Mashable Podcast interviews Michael Crandell, CEO of RightScale. Michael talks about their product and how it helps organizations to use Amazon EC2 in a cost-effective fashion.

The IT Conversations Podcast captures Amazon CTO Werner Vogels as he talks about AWS at last year’s ETech conference.

You can listen to either or both of these on the respective sites or you can simply subscribe to their RSS feeds.

— Jeff;

PS – Congratulations are due to RightScale for the successful completion of their fund raising endeavor.

Taking Massive Distributed Computing to the Common Man – Hadoop on Amazon EC2/S3

Not so long ago, it was both difficult and expensive to perform massive distributed processing using a large cluster of machines. Mainly because:

  1. It was difficult to get the funding to acquire this ‘large cluster of machines’. Once acquired, it was difficult to manage (power/cooling/maintenance), and there was always the fear of what would happen if the experiment failed and how one would recover the losses from the investment already made.
  2. After the cluster was acquired and managed, there were technical problems. It was difficult to run massively distributed tasks on the machines and to store and access large datasets; parallelization was not easy, and job scheduling was error-prone. Moreover, if nodes failed, detecting the failure was difficult and recovery was very expensive. Tracking jobs and statuses was often ignored because it quickly became complicated as the number of machines in the cluster increased.

Hence it was difficult to innovate and/or solve real-world problems like these:

  • Web Company: Analyze large data sets of user behavior and clickstream logs
  • Social Networking Company: Analyze social, demographic, and market data
  • Phone Company: Locate all customers who have called in a given area
  • Large Retail Chain: Know what items a particular customer bought last month, or recall a certain product and inform the customers who bought it
  • Surveillance Company: Transcode video accumulated over several years
  • Pharma Company: Locate people who were prescribed a certain drug

Just a few years ago, it was difficult. But now, it is easy.

The Open Source Hadoop framework has given developers the power to do some pretty extraordinary things.

Hadoop gives developers the opportunity to focus on their idea and implementation rather than the software-level “muck” of distributed processing (#2 above). It handles job scheduling, automatic parallelization, and job/status tracking by itself while developers focus on their Map and Reduce implementations. It processes large datasets by splitting each dataset into manageable chunks, spreading them across a fleet of machines, launching jobs, processing each job wherever the data is physically located, and, at the end, aggregating the job outputs into a final result.

Large companies can afford to acquire 10,000-node clusters and run their experiments on massive distributed processing platforms that handle 20,000 TB per day.

But if I am a startup, a university with minimal funding, or a self-employed individual who would like to test distributed processing on a cluster of 1,000+ nodes, can I afford it? Or, even if I am a well-funded company (think “enterprise”) with a lot of free cash flow, will management approve the budget for my experiment? Every organization has a person who says “no.” Will I be able to win the battle with those people? Should I even fight that battle (of logistics)? Will I be able to get an environment in which to experiment with large datasets (think “weather data simulation” or “genome comparison”)?

Cloud Computing makes this a reality (solving #1 above). Click a button and get a server. Flick a switch and store terabytes of data geographically distributed. Click a button and dispose of temporary resources.

Posts like this and this inspired me to write this post. Amazon Web Services is leveling the playing field for experimentation, innovation, and competition. You can iterate on your ideas quickly: if an idea works, bingo! If it does not, shut down your “droplet” in the cloud, move on to the next idea, and start a new “droplet” whenever you are ready.

I would say:

The Open Source Hadoop framework on Amazon EC2/S3 has given every developer the power to do some pretty extraordinary things.

Every day, I hear new stories about running Hadoop on EC2. For example, The New York Times used 100 Amazon EC2 instances and a Hadoop application to process 4 TB of raw TIFF image data (stored in S3) into 1.1 million finished PDFs in the space of 24 hours, at a computation cost of just $240. Hadoop on EC2 not only makes massive distributed processing easy, it makes it headache-free.
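The arithmetic behind that $240 figure is worth spelling out. Assuming a rate of $0.10 per instance-hour (the small-instance EC2 price implied by the numbers in the story), the cost works out as:

```python
# Back-of-the-envelope cost for the New York Times TIFF-to-PDF job.
# The $0.10/instance-hour rate is an assumption inferred from the
# figures in the story, not a quoted price.
instances = 100
hours = 24
price_per_instance_hour = 0.10  # USD, assumed

cost = instances * hours * price_per_instance_hour
print(f"${cost:.2f}")  # → $240.00
```

The point of the exercise: pay-as-you-go pricing turns a capital-budget conversation into pocket change.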

Whether it is startup companies, university classrooms at UCSB, BYU, and Stanford, or even enterprise companies, it's just amazing to see every new story about using Hadoop on Amazon EC2/S3 in innovative ways.

That's what I love about Amazon Web Services: a common man with just a credit card can afford to think about massive distributed computing, compete with the rest, and rise to the top.


P.S. The real power and potential of Hadoop on Amazon EC2 will be realized when I see Hadoop-on-demand with Condor spawning EC2 instances on the fly when I need them (or when the situation demands them) and shutting them down when I don't. Has anybody tried that yet?

Amazon S3 for Science Grids

A team of researchers from the University of South Florida and the University of British Columbia has written a very interesting paper, Amazon S3 for Science Grids: A Viable Solution?

In this paper the authors review the features of Amazon S3 in depth, focusing on the core concepts, the security model, and data access protocols. After characterizing science storage grids in terms of data usage characteristics and storage requirements, they proceed to benchmark S3 with respect to data durability, data availability, access performance, and file download via BitTorrent. With this information as a baseline, they evaluate S3’s cost, performance, and security functionality.

They conclude by observing that many science grid applications don’t actually need all three of S3’s most desirable characteristics — high durability, high availability, and fast access. They also have some interesting recommendations for additional security functionality and some relaxing of limitations.

I do have one small update to the information presented in the article! Since the article was written, we have announced that S3 is now storing 5 billion objects, not the 800 million mentioned in section II.

— Jeff;

Search Engine Packed as an AMI?

It never hurts to try to wish a product into existence…

I received an email from an EC2 user asking me about search tools. This user runs a high traffic site on an array of EC2 instances, and is in need of a search solution. He knew that he could buy a search appliance, but this didn’t fit with his company’s model. As he told me:

“we don’t want to do anything that involves us owning and operating a server…since we’re big believers in web services.”

After thinking about this for a while, I believe that one really cool solution would involve a search engine installed into an EC2 AMI (Amazon Machine Image), perhaps made available for use on a by-the-hour basis. This hypothetical AMI would incorporate all of the usual components: a crawler, data storage, and a query page for access to the actual search engine. There are bonus points for APIs for inserting and retrieving data, of course.

Perhaps the crawler runs once every 24 hours and then generates some indexed data structures which it stores in S3, where they are picked up by the engine and loaded into the instance’s RAM for fast processing. Once again, I’ll offer bonus points if spinning up multiple instances of the crawler makes the entire crawling and indexing process run faster.

To top it all off, the query page would be customizable and skinnable, so that this could be plugged into an existing site in a seamless fashion.
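To make the crawl-then-query split concrete, here is a toy Python sketch of the idea: a "crawl" step that builds an inverted index (the sort of indexed data structure the hypothetical AMI would persist to S3), and a query step that answers searches from the index held in RAM. All of the names and data here are illustrative, not part of any real product:

```python
from collections import defaultdict

def build_index(pages):
    """'Crawl' step: build an inverted index mapping each word to the
    set of page URLs containing it. In the hypothetical AMI, this
    structure would be serialized to S3 once every 24 hours."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, query):
    """Query step: return the pages that contain every query word,
    served from the in-memory index for fast processing."""
    sets = [index.get(word, set()) for word in query.lower().split()]
    return set.intersection(*sets) if sets else set()

# Stand-in for crawled content.
pages = {
    "http://example.com/a": "Amazon EC2 instances on demand",
    "http://example.com/b": "search appliance for the enterprise",
    "http://example.com/c": "EC2 search solution for high traffic sites",
}
index = build_index(pages)
print(sorted(search(index, "EC2 search")))
# ['http://example.com/c']
```

A real AMI-packaged engine would of course add persistence, ranking, and the skinnable query page, but the index-offline/query-online split is the heart of the design.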

If you are doing something like this or have even thought about doing something similar, I’d like to hear from you. If you would pay to use it, same deal. Post some comments and let’s see what happens.

— Jeff;