Amazon CloudSearch – Even Better Searching for Less Than $100/Month
As I have said in the past, search plays a major role in many web sites and online applications!
The basic model is simple. Think of your set of documents or your data collection as a book or a catalog, composed of a number of pages. You know that you can find a desired page quickly and efficiently by simply consulting the index.
Search does the same thing by indexing each document in a way that facilitates rapid retrieval. You enter some terms into a search box and the site responds with a list of pages that match the search terms, sorted so that the best matches are at the top of the list.
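To make the book-index analogy concrete, here is a minimal sketch of an inverted index in Python. It illustrates the general idea only and says nothing about how CloudSearch is implemented; the names `build_index` and `search`, the sample posts, and the naive term-count scoring are all assumptions made for this example.

```python
from collections import Counter, defaultdict

def build_index(docs):
    """Map each lowercase term to the documents that contain it, with counts."""
    index = defaultdict(Counter)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term][doc_id] += 1
    return index

def search(index, query):
    """Return ids of documents containing every query term, best match first."""
    terms = query.lower().split()
    if not terms:
        return []
    # Only documents that contain all of the terms qualify.
    candidates = set(index.get(terms[0], {}))
    for term in terms[1:]:
        candidates &= set(index.get(term, {}))
    # Score by total occurrences of the query terms (a deliberately naive ranking).
    scores = {d: sum(index[t][d] for t in terms) for d in candidates}
    return sorted(scores, key=lambda d: (-scores[d], d))

# Made-up sample documents.
docs = {
    "post-1": "chocolate cake and chocolate frosting",
    "post-2": "elastic beanstalk deployment guide",
    "post-3": "chocolate chip cookies",
}
index = build_index(docs)
print(search(index, "chocolate"))  # ['post-1', 'post-3']
```

The index is built once, so each query only touches the entries for its own terms instead of scanning every document.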
Amazon CloudSearch, introduced in April of 2012, is a fully managed service, enabling you to focus on adding search functionality to your application rather than on building basic search functionality, scaling it, making it highly available, and so forth.
Today we are announcing a set of major enhancements to CloudSearch. We have added a plethora of new indexing and search features, support for a total of 33 languages, IAM integration, control over instance size and scaling, and a Multi-AZ option to enhance availability. We have also made major improvements to the CloudSearch Console.
I’ll talk about the features later in this post. First, I would like to review the concept of a managed service, as it applies to CloudSearch.
In the AWS lexicon, a managed service takes care of all of the messy details associated with running the service. Activities that once required error-prone human intervention and detailed runbooks are replaced with carefully designed and thoroughly tested workflows and verified by comprehensive monitoring.
Managed services can automatically handle provisioning, scaling, fault detection and recovery, operating system and service updates, and changes in operating conditions.
Because managed services are run at world-scale, the workflows quickly become mature and battle-tested. Unusual conditions that would occur once in a blue moon in a corporate data center surface quickly when tested at scale, and are addressed before they become problematic in production systems.
Further, because managed services are exposed in the form of high-level APIs, the operator of the service has the freedom to make changes behind the scenes, improving the features, reliability, scalability, and durability of the service while presenting an unchanging face to the programs which call into it via the API.
Now that you know a bit more about managed services, let’s take a look at what this means for website and document search.
CloudSearch hides all of the complexity and all of the search infrastructure from you. You simply provide it with a set of documents and decide how you would like to incorporate search into your application. Behind the scenes, CloudSearch will add and remove search instances as needed in order to make sure that it has adequate storage space for incoming documents and enough compute power to handle index and search requests.
You don’t have to write your own indexing, query parsing, query processing, results handling, or any of that other stuff. You don’t need to worry about running out of disk space or processing power, and you don’t need to keep rewriting your code to add more features.
With CloudSearch, you can focus on your application layer. You upload your documents, CloudSearch indexes them, and you can build a search experience that is custom-tailored to the needs of your customers.
CloudSearch is very easy to use. You simply create and configure a Search Domain, upload and index your documents, and perform searches. All of these operations can be initiated from the CloudSearch Console, the command line, or through the CloudSearch APIs.
As I mentioned earlier, we have added a plethora of new features to CloudSearch. Here’s the scoop:
Field Types – In addition to the original text field type, you can now index and search single or multiple (array) dates, doubles, integers, text, and literals. You can also index and search geographic locations (latlon):
Multiple Languages – CloudSearch can now handle text in 33 distinct languages (Arabic, Armenian, Basque, Bulgarian, Catalan, simplified Chinese, traditional Chinese, Czech, Danish, Dutch, English, Finnish, French, Galician, German, Greek, Hindi, Hungarian, Indonesian, Irish, Italian, Japanese, Korean, Latvian, Norwegian, Persian, Portuguese, Romanian, Russian, Spanish, Swedish, Thai, and Turkish), with stemming and text analysis for each language. You can send documents with more than one language in different fields, and there’s even a text processor to handle single fields that contain text in more than one language.
Enhanced Search – CloudSearch has been enhanced with the following new search features:
- Proximity Search
- Term Boosting
- Range Searching on all field types
- Match-All Queries
- Multiple, Configurable Query Parsers (simple, structured, Lucene, and DisMax)
- Partial Search Results
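As a rough illustration of how several of these features might appear in a search request, the sketch below builds query URLs with Python's standard library. The hostname and the field names (`title`, `plot`, `year`) are hypothetical, and the parameter names and query syntax reflect my reading of the 2013-01-01 search API rather than anything stated in this post, so please check them against the CloudSearch documentation before relying on them.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical search endpoint; a real domain has its own generated hostname.
SEARCH_ENDPOINT = (
    "https://search-example.us-east-1.cloudsearch.amazonaws.com/2013-01-01/search"
)

def search_url(query, parser="simple", **options):
    """Build a search URL; q.parser selects the query parser (assumed parameter name)."""
    params = {"q": query, "q.parser": parser}
    params.update(options)
    return SEARCH_ENDPOINT + "?" + urlencode(params)

examples = {
    # Phrase terms within 3 words of each other (assumed simple-parser syntax).
    "proximity": search_url('"star wars"~3'),
    # Weight matches in the title more heavily than matches in the plot.
    "term boosting": search_url(
        "(or (term field=title boost=2 'star') (term field=plot 'star'))",
        parser="structured",
    ),
    "range": search_url("(range field=year [2000,2010])", parser="structured"),
    "match-all": search_url("matchall", parser="structured"),
    # Ask for whatever results are available if part of the index is unreachable.
    "partial results": search_url("star", partial="true"),
}

for name, url in examples.items():
    # Decode the query string again to show the parameters that would be sent.
    print(name, parse_qs(urlsplit(url).query))
```

Nothing here sends a request; it only shows how the parser choice and the query string travel together as URL parameters.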
Geographic Queries – You can use the latlon (latitude / longitude) data type to represent a geographic location. You can search within a bounding box (northeast and southwest corners) and you can use the new haversin function to sort results by the great circle distance between two points.
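For readers who have not met the haversine formula, here is a small Python sketch of the math behind a great-circle distance sort and a bounding-box test. It is a local illustration, not a call into CloudSearch; the function names, the mean Earth radius of 6371 km, and the sample coordinates are assumptions for this example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption for this sketch

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def in_bounding_box(lat, lon, northeast, southwest):
    """True if the point lies inside the box (boxes crossing the antimeridian are not handled)."""
    ne_lat, ne_lon = northeast
    sw_lat, sw_lon = southwest
    return sw_lat <= lat <= ne_lat and sw_lon <= lon <= ne_lon

# Approximate city coordinates as (latitude, longitude).
places = {
    "seattle": (47.6062, -122.3321),
    "portland": (45.5152, -122.6784),
    "san-francisco": (37.7749, -122.4194),
}
here = (47.61, -122.33)

# Sort the places by distance from the reference point, nearest first.
nearest_first = sorted(places, key=lambda name: haversine_km(*here, *places[name]))
print(nearest_first)  # ['seattle', 'portland', 'san-francisco']
```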
Auto Complete, Supported by Suggestions – CloudSearch can generate exact or fuzzy suggestions based on a prefix match with a field:
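The difference between exact and fuzzy prefix matching can be sketched in a few lines of Python. This is an illustration of the concept under my own assumptions, not the algorithm CloudSearch uses; `suggest`, `max_edits`, and the sample titles are hypothetical names and data.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def suggest(prefix, titles, max_edits=0, size=5):
    """Return titles whose leading characters are within max_edits of the prefix."""
    prefix = prefix.lower()
    matches = []
    for title in titles:
        head = title.lower()[:len(prefix)]
        # max_edits=0 is an exact prefix match; larger values tolerate typos.
        if edit_distance(prefix, head) <= max_edits:
            matches.append(title)
    return sorted(matches)[:size]

titles = ["Star Wars", "Star Trek", "Stargate", "Starman", "The Sting"]
print(suggest("star", titles))               # exact prefix match
print(suggest("stra", titles, max_edits=2))  # tolerates the transposed letters
```

With `max_edits=0` the mistyped prefix "stra" matches nothing, while allowing two edits recovers the same four titles as the correctly typed "star".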
Highlights – CloudSearch can return excerpts (plain text or HTML) along with the search results to show where the search terms occur within a given field of a matching document. For example, I searched for “beanstalk”; you can see that it is highlighted in the search results:
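Here is a minimal Python sketch of what producing an HTML highlight involves: find the search term in the field text and wrap each occurrence in tags. The function name `highlight` and the default `<em>` tags are assumptions for this example; CloudSearch performs this step on the service side.

```python
import html
import re

def highlight(text, term, pre_tag="<em>", post_tag="</em>"):
    """Wrap whole-word, case-insensitive matches of term in tags; HTML-escape everything else."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    pieces = []
    last = 0
    for match in pattern.finditer(text):
        # Escape the text between matches so it is safe to embed in a page.
        pieces.append(html.escape(text[last:match.start()]))
        pieces.append(pre_tag + html.escape(match.group(0)) + post_tag)
        last = match.end()
    pieces.append(html.escape(text[last:]))
    return "".join(pieces)

print(highlight("Deploy to Elastic Beanstalk & relax", "beanstalk"))
# Deploy to Elastic <em>Beanstalk</em> &amp; relax
```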
Enhanced Availability – You can now run CloudSearch in a Multi-AZ configuration to enhance availability. As a managed service, CloudSearch will automatically launch replacement search instances as needed. Queries are load balanced across the Availability Zones; updates are sent to the search instances in both zones.
IAM Integration – You can now use AWS Identity and Access Management (IAM) to regulate access to the CloudSearch Configuration APIs, adding fine-grained control over which accounts can create and delete domains, set access policies, set indexing options, and so forth.
Configurable Instance Size and Scaling Options – You can now exercise control over the initial instance size and the index replication factor. If you are starting with a data set that contains more than 2 GB of source data, we recommend starting with an m1.large instance or larger:
CloudSearch in Action
In order to learn more about CloudSearch, I created a search domain and uploaded 1900 posts from this blog (2004 to present). Here’s what I did and what I saw.
I began by opening up the CloudSearch Console and creating a search domain called aws-blog-posts:
Then I configured the indexes by uploading a sample HTML document. CloudSearch understands and can parse many different document types, and will suggest an appropriate set of index fields for each type. Here’s what it set up for my HTML documents:
I downloaded and installed the CloudSearch CLI tools (with a ritual Java upgrade along the way), and then ran the cs-import-documents command on a directory containing the blog posts:
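For context, documents reach a search domain as batches of add and delete operations. The Python sketch below builds a small JSON batch by hand; the batch layout (`type`, `id`, `fields`) reflects my understanding of the 2013-01-01 document format, and the post ids and field names (`title`, `content`) are made up for this example, so treat it as an approximation rather than the exact output of cs-import-documents.

```python
import json

def make_batch(posts):
    """Build one 'add' operation per post; each post is a dict of field values."""
    return [
        {"type": "add", "id": post_id, "fields": fields}
        for post_id, fields in posts.items()
    ]

# Hypothetical blog posts keyed by document id.
posts = {
    "post-0001": {"title": "Hello, Cloud", "content": "First post text."},
    "post-0002": {"title": "Search Arrives", "content": "Second post text."},
}

batch = make_batch(posts)
# A delete operation needs only the id of the document to remove.
batch.append({"type": "delete", "id": "post-0000"})

batch_json = json.dumps(batch, indent=2)
print(batch_json)
```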
I was hungry so I searched for chocolate:
CloudSearch includes a sample data set from IMDB. The data set contains detailed information about 5,000 movies. In order to experiment with some of CloudSearch’s more advanced search features, I created another search domain and used the Upload Documents button to populate it with the movie data (CloudSearch can also import data from Amazon S3 and DynamoDB).
I started out by searching for Star Wars:
As you can see from the screen shot, the search results are automatically grouped into facets, and the facets can be used to refine the search.
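Conceptually, a facet is a count of how many matching documents carry each value of a field, and refining means keeping only the hits with a chosen value. The following Python sketch shows that idea on a few illustrative records (not taken from the sample data set); `facet_counts` and `refine` are hypothetical helper names.

```python
from collections import Counter

def as_list(value):
    """Treat single values and arrays uniformly."""
    return value if isinstance(value, list) else [value]

def facet_counts(hits, field):
    """Count how many hits carry each value of the given field."""
    counts = Counter()
    for hit in hits:
        if field in hit:
            counts.update(as_list(hit[field]))
    return counts

def refine(hits, field, value):
    """Keep only the hits whose field contains the chosen facet value."""
    return [hit for hit in hits if value in as_list(hit.get(field, []))]

# Illustrative records only.
hits = [
    {"title": "Movie A", "genres": ["Action", "Sci-Fi"], "year": 1977},
    {"title": "Movie B", "genres": ["Action", "Adventure"], "year": 1980},
    {"title": "Movie C", "genres": ["Documentary"], "year": 1980},
]

print(facet_counts(hits, "genres"))
print([hit["title"] for hit in refine(hits, "year", 1980)])
```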
CloudSearch also gives you the power to use an arithmetic expression to reorder the search results, including a powerful rank comparison tool to let you do A/B testing of changes to the ranking, with the results visible in real time:
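The idea of ranking by an arithmetic expression, and of comparing two candidate rankings side by side, can be sketched in Python as follows. The fields `relevance` and `rating`, the sample values, and the helper names are assumptions for this illustration; in CloudSearch the expressions are evaluated by the service, not by client code like this.

```python
def rank(hits, expression):
    """Order document ids by an expression's value, highest first."""
    return [hit["id"] for hit in sorted(hits, key=expression, reverse=True)]

def compare(hits, expr_a, expr_b):
    """Return (id, position under A, position under B) for each document, in A's order."""
    order_a = rank(hits, expr_a)
    position_b = {doc_id: i + 1 for i, doc_id in enumerate(rank(hits, expr_b))}
    return [(doc_id, i + 1, position_b[doc_id]) for i, doc_id in enumerate(order_a)]

# Made-up hits with a text relevance score and a user rating.
hits = [
    {"id": "m1", "relevance": 9.0, "rating": 6.0},
    {"id": "m2", "relevance": 7.5, "rating": 8.8},
    {"id": "m3", "relevance": 5.0, "rating": 7.0},
]

def by_relevance(hit):
    return hit["relevance"]

def by_relevance_and_rating(hit):
    return hit["relevance"] + hit["rating"]

for doc_id, pos_a, pos_b in compare(hits, by_relevance, by_relevance_and_rating):
    print(doc_id, pos_a, pos_b)
```

In this made-up data, adding the rating to the score moves `m2` ahead of `m1`, which is the kind of change a side-by-side comparison is meant to surface.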
Partners and Reviews
Our friends at SMART InSight Corporation of Japan have incorporated this new version of CloudSearch into their flagship product, which is also called SMART/InSight. You can view their Welcome Announcement to learn more about how CloudSearch helps them to deliver their product without thinking about data type, size, or location.
The good folks at Search Technologies have posted a First Look at Amazon CloudSearch, and note that it meets a wide variety of real-world search requirements and drives down the cost of ownership.
Start Searching Today
As part of today’s release we are making CloudSearch available in the Asia Pacific (Tokyo), Asia Pacific (Sydney), and South America (São Paulo) Regions.
PS – Despite the length of this post, I have omitted many powerful and useful features. Please spend some time reading the CloudSearch documentation to learn more about how you can put CloudSearch to work in your own application.
You may also want to sign up for the upcoming CloudSearch webinar. The session will provide an overview of CloudSearch, discuss popular use cases, and share best practices that will help you to put CloudSearch to use in your own environment.