Migrating from Solr to Amazon CloudSearch

This article explains how Amazon CloudSearch differs from Solr and describes the steps you need to take to migrate to Amazon CloudSearch from a self-hosted, Solr-based search solution.

Details

Submitted By: Deborah Adair
Created On: December 14, 2012 10:26 PM GMT
Last Updated: July 15, 2013 10:12 PM GMT

Amazon CloudSearch is a fully managed service that enables you to offload the administrative burden of operating and scaling your search platform. This is particularly valuable for large applications built on Apache Solr that require distributing sharded indices across multiple hosts. When you migrate from a Solr-based search solution to Amazon CloudSearch, you no longer have to worry about provisioning hardware, partitioning your data, and applying software patches. Amazon CloudSearch automatically provisions and scales the resources needed to operate your search service according to the volume of data and query traffic.

What's Different in Amazon CloudSearch?

Both Solr and Amazon CloudSearch are search platforms that enable you to search your data by submitting HTTP requests and receiving responses in either XML or JSON. The main differences are in:

  • How you manage resources and security
  • How you define your schema
  • How your data is indexed
  • How you sort search results

There are also differences in terminology between Solr and Amazon CloudSearch. For a mapping of Solr features and query parameters to the Amazon CloudSearch equivalents, see Understanding Amazon CloudSearch Terminology.

How You Manage Resources and Security

The biggest difference between Amazon CloudSearch and Solr is that Amazon CloudSearch is a fully managed service. When you migrate to Amazon CloudSearch, you no longer have to worry about provisioning and managing your own search fleet and scaling the fleet as your volume of data and traffic fluctuates. Amazon CloudSearch handles all of this for you behind the scenes. You don't need to make allowances for sharding in your application code; your index is automatically moved to a larger instance type or partitioned across multiple instances as needed for optimal performance.

Similarly, while Solr leaves managing the security of your search platform entirely up to you, Amazon CloudSearch provides built-in mechanisms for controlling access to your domain configuration, update stream, and search service. Access to your search domain's configuration is restricted using standard AWS authentication. Access to your domain's document and search services is restricted to authorized IP addresses and can easily be configured through the Amazon CloudSearch console.

How You Define Your Schema

In Solr, you create a schema file that describes the fields in your index. Fields are configured using a wide variety of specialized data types, such as text_en, text_fr, string, date, float, and tfloat. You must also specify configuration options such as what fields you want to index, what data you want to store in the index, whether you want to store the position of every single term relative to every single other term, and so on. The resulting schema must be a valid XML document.

In Amazon CloudSearch, you specify your indexing options by configuring individual index fields. Amazon CloudSearch supports three basic index field types that enable you to duplicate virtually any Solr schema:

  • text: Fields designated as text represent arbitrary alphanumeric data, such as names, sentences, and paragraphs. This data is tokenized (broken up into pieces) so that a search on part of the field, such as a single word, will match. Text fields are always searchable, and they are processed according to the options you set for stopwords, synonyms, and stemming.

  • literal: Fields designated as literal are much like string fields in Solr. They represent a string of text that is evaluated exactly as specified, such as an email address or category. In addition to being searched, literal fields can be used for faceting and sorting.

  • uint: Fields designated as uint store numeric data represented as an unsigned integer. A uint field can be used for anything from rankings to dates (stored as timestamps). You can store higher-precision data in a uint field by shifting the decimal point when processing the values. For example, you might have a field called percent_approved that stores percentages as integer values, such as 50 and 98. In your application logic, you could treat these as decimal values in the range 0-1.0 by shifting the decimal point: .50, .98, and so on. You can search uint fields for specific values or ranges of values.
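The decimal-shifting technique is easy to wrap in a pair of helper functions. This is a minimal sketch; the function names and the two-decimal precision are assumptions for illustration:

```python
def to_uint(value, decimals=2):
    # Shift the decimal point right so the value fits in a uint field
    return int(round(value * 10 ** decimals))

def from_uint(stored, decimals=2):
    # Shift the decimal point back when reading the value out of the index
    return stored / 10 ** decimals
```

For example, to_uint(0.98) stores the value 98, and from_uint(98) recovers 0.98 in your application logic.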

In Amazon CloudSearch, you don't have to directly edit and validate your index schema: you can easily configure index fields and modify your indexing options through the AWS Management Console. You can also configure index fields using the Amazon CloudSearch command line tools or REST API.

How Your Data is Indexed

When you upload data to Amazon CloudSearch, it is automatically indexed in near real time. You only have to explicitly rebuild your index when you make changes to your configuration options. When you initiate indexing, Amazon CloudSearch applies the updated configuration to the data that's already in your index. In Solr, changes that require re-indexing force you to go through the entire process of indexing your data from scratch, assuming the data is still available to re-index.

How You Sort Results

Amazon CloudSearch and Solr both enable you to sort your data based on calculated values. However, Solr requires you to use an arcane syntax to create functions and include those functions in every query. In Amazon CloudSearch, you can use regular arithmetic notation to create reusable rank expressions that can be referenced in any search request. A rank expression can make use of the data in a document as well as the document's default text_relevance score, which is analogous to a "score" in Solr. You can also use rank expressions to replicate Solr's boosting capabilities.

Understanding Amazon CloudSearch Terminology

Solr and Amazon CloudSearch support many of the same features and query options, but call them by different names. The following tables show the mappings between Solr terminology and Amazon CloudSearch terminology.

Features

Solr                 Amazon CloudSearch
score                text_relevance
copyField (source)   source
stored               returnable
filter query         Boolean query
sort order           rank
Solr XML             Search Data Format (SDF)
function query       rank expression
Solr instance        search domain
query URL            search endpoint
update URL           update endpoint

Query Language

Solr          Amazon CloudSearch
q             q
sort          rank
start         start
rows          size
fq            bq
fl            return-fields
wt            results-type
facet         (none, faceting is always enabled by default)
facet.query   facet-<fieldname>-constraints
facet.field   facet
facet.sort    facet-<fieldname>-sort
facet.limit   facet-top-n
facet.range   facet-<fieldname>-constraints

Migrating Your Application

Migrating your application from Solr to Amazon CloudSearch is a relatively straightforward process. You need to:

  1. Create an Amazon CloudSearch domain. A search domain encapsulates your searchable data and the search instances that handle your search requests.
  2. Map your Solr schema to Amazon CloudSearch index fields. Once you create your search domain, you define index fields to configure Amazon CloudSearch to handle your data the same way Solr did.
  3. Implement your boosting and sorting preferences using rank expressions. Rank expressions are reusable JavaScript-style expressions that you can define and use to customize how your results are ranked.
  4. Submit your data using the Amazon CloudSearch Search Data Format. You need to adapt your application to submit data to Amazon CloudSearch instead of Solr. Data can be submitted to Amazon CloudSearch in either JSON or XML.
  5. Convert your Solr queries to the Amazon CloudSearch search syntax. Once you've uploaded your data, you're ready to start submitting search requests to your domain's search endpoint using the Amazon CloudSearch search syntax.
  6. Display the results from Amazon CloudSearch. The response format is very similar to Solr's, but note that by default Amazon CloudSearch returns responses in JSON. To get responses in XML, you need to specify the response format in each search request.

The following sections look at each of these steps in detail.

Create an Amazon CloudSearch Domain

The first step in migrating your application to Amazon CloudSearch is to create a search domain. The easiest way to create a search domain is to go to the Amazon CloudSearch console and click Create Your First Domain to launch the Create New Search Domain wizard. When asked how you want to configure your index fields, select Manual Configuration and click Continue on the Review Configuration page; the next section describes how to configure index fields for your new domain.

When prompted to set up access policies, select the Recommended Rules. (See Configuring Access in the Amazon CloudSearch Developer Guide for more information about how you can control access to your search domain.)

Configuring access policies

You can also create domains with the command line tools or REST API. For step-by-step instructions, see Creating a Search Domain in the Amazon CloudSearch Developer Guide.

Map Your Schema to Amazon CloudSearch Index Fields

Dynamic Fields

Amazon CloudSearch doesn't support dynamic fields directly. However, if your application depends on them, you can replicate that functionality by using the Configuration API to programmatically create fields before indexing your data. For information about creating fields using the DefineIndexField API, see Configuring Index Fields in the Amazon CloudSearch Developer Guide.

Before you can configure index fields for your Amazon CloudSearch domain, you need to understand your existing Solr schema and how your application uses your search index. Ask yourself the following questions:

  • What fields does my application actually use?

  • How is each field used? Is the data indexed? Does the application retrieve the data from the index? Is the field used for faceting or sorting?

  • What kind of searches do I need to perform? Are they mostly free-text searches? Do I need to perform parameterized (faceted) searches?

  • Does my application use dynamic fields? If so, do I know what those fields are?

  • Does my application use custom request handlers? What do they do?

First, gather the information about your existing fields from your Solr schema:

  1. Create a list of all of the fields your application uses. For each field, specify the following information from your Solr schema:

    • The name of the field.
    • The field type.

    Make sure to include any fields your application creates using Solr's dynamic fields capabilities. To make things easier, we've created a spreadsheet template you can use to record your fields.

    Configuration spreadsheet template
  2. Indicate whether or not each field is searchable; in your Solr schema, searchable fields are marked indexed='true'. When you configure the field in Amazon CloudSearch, you'll need to make sure it's search enabled.

  3. Indicate whether or not each field is returnable; in your Solr schema, returnable fields are marked stored='true'. When you configure the field in Amazon CloudSearch, you'll need to make it result enabled.

  4. Identify any fields that your application uses as facets. There isn't any indication of this in your Solr schema, so you'll need to look at how your application uses the fields. When you configure the fields in Amazon CloudSearch, you'll need to make the fields you want to use as facets facet enabled.

  5. Record the default values for any fields that have a defaultValue attribute set in your Solr schema. You'll set this value when you configure the field in Amazon CloudSearch.

Next, identify the changes you need to make to map your index fields to Amazon CloudSearch:

  1. If necessary, convert your field names to conform to the Amazon CloudSearch naming conventions. Amazon CloudSearch field names must begin with a letter and be at least 3 and no more than 64 characters long. The allowed characters are a-z (lowercase letters), 0-9, and _ (underscore). The names body, docid, and text_relevance are reserved and cannot be used as field names. If your field names contain capital letters or hyphens, they must be converted. For example, you could convert a field called priceVariety or price-variety to the Amazon CloudSearch-compatible name price_variety.
  2. For each field, map the Solr data type to one of the three Amazon CloudSearch data types: text, literal, or uint. The following table shows the mappings from Solr types to Amazon CloudSearch types:

    Solr      Amazon CloudSearch
    text_*    text
    string    literal
    boolean   uint
    int       uint
    float     uint
    long      uint
    double    uint
    tint      uint
    tfloat    uint
    tlong     uint
    tdouble   uint
    date      uint
    tdate     uint

    In some cases, you might need to make additional changes to your application. For example, Amazon CloudSearch does not support storing binary data in the index. You must store binary data elsewhere, such as on the filesystem or in a database, and index an identifier for that data, such as a URL or database ID. Similarly, if you have text fields that are larger than 2 KB, you can index all of the data, but you cannot store more than 2 KB of data per field in an Amazon CloudSearch index. (If you make the field result enabled, the data will be truncated to 2 KB.)

  3. If you want to use a field as a facet, the data type should be either literal or uint. (While you can configure text fields as facets in Amazon CloudSearch, in most cases that won't produce the results you're looking for.)
  4. If you have a literal field that is result enabled and you want to also use the field as a facet, create a second field with the same data that you designate as facet enabled. In Amazon CloudSearch, a field can be either result enabled or facet enabled, but not both. For example, if you have a genre field that's result enabled, to facet on genre you could create a second field called genre_facet, and specify its source to be the same as the genre field. (By default, index fields are populated from the field of the same name in your SDF data, so by default the source for the genre field is genre.)

    Mapping configuration options

    Uint fields are always both facet enabled and result enabled.
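The field-name rules in step 1 can be captured in a small conversion helper. This is a hypothetical sketch, not part of any CloudSearch tool; it handles camel case, hyphens, and reserved names, but you'd still need to verify the begin-with-a-letter rule and the 3-64 character length limits against your own data:

```python
import re

RESERVED_NAMES = {'body', 'docid', 'text_relevance'}

def to_cloudsearch_name(solr_name):
    # Break camelCase with underscores, then fold everything to lowercase
    name = re.sub(r'([a-z0-9])([A-Z])', r'\1_\2', solr_name)
    name = name.replace('-', '_').lower()
    # Replace any remaining disallowed characters with underscores
    name = re.sub(r'[^a-z0-9_]', '_', name)
    # Reserved names can't be used as field names, so suffix them
    if name in RESERVED_NAMES:
        name += '_field'
    return name[:64]
```

For example, both priceVariety and price-variety convert to price_variety.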

Once you've mapped your existing Solr configuration to CloudSearch indexing options, you're ready to configure index fields for your search domain. The easiest way to configure index fields is to go to the Amazon CloudSearch console and click the Indexing Options link for your search domain.

For each field:

  1. Enter the field name and choose the field type.

  2. Select the checkboxes in the Search, Facet, and Result columns to configure the necessary options for the field.

  3. If the field has a default value, enter it in the Default column.

  4. If the source for the field is not the same as the field name, click Add in the Source column to configure the source. For example, if you create a duplicate of the genre field for faceting called genre_facet, you need to add genre as the source for the duplicate field. In this case, the mapping type should be set to Copy. You can copy up to 20 different source fields to an index field. Amazon CloudSearch also supports two other mapping types, Map and Trim Title. For more information, see Adding Sources for an Index Field in the Amazon CloudSearch Developer Guide.

    Configuring index fields

You can also add fields using the command line tools or REST API. For more information, see Configuring Index Fields in the Amazon CloudSearch Developer Guide.

Implement Your Boosting and Sorting Rules Using Rank Expressions

Amazon CloudSearch makes it easy to sort search results according to the value of any field; you just need to specify the field in the search request using the rank parameter. For example:

http://<my-endpoint-url>/2011-02-01/search?q=love
&return-fields=artist_name,genre,title,year 
&rank=-year

By default, results are sorted alphabetically or numerically in ascending order. To sort in descending order, you prefix the field name with a minus sign (-). In the previous example, specifying rank=-year sorts the results by year with the most recent year first.
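Building these request URLs in application code is mostly a matter of assembling query parameters. A minimal sketch in Python, where the endpoint value is a placeholder for your own domain's search endpoint:

```python
from urllib.parse import urlencode

def build_search_url(endpoint, query, return_fields, rank=None):
    # Assemble a 2011-02-01 search request; rank may be a field name or a
    # rank expression name, prefixed with '-' for descending order
    params = {'q': query, 'return-fields': ','.join(return_fields)}
    if rank:
        params['rank'] = rank
    return 'http://%s/2011-02-01/search?%s' % (endpoint, urlencode(params))
```

For example, passing rank='-year' reproduces the request shown above.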

To implement more complex sorting algorithms in Amazon CloudSearch, you define and use custom rank expressions. For example, you could create an expression that represents a combination of the relevance of a particular item, along with a boost for items that have brought in more overall revenue. You could express this with the following rank expression:

(100 * text_relevance) + (price * units_sold) 

The text_relevance value is a built-in score in the range 0 to 1000 that's automatically calculated for each matching result. The price and units_sold values are references to index fields.
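To get a feel for how such an expression ranks documents, you can mirror the arithmetic in ordinary code; this sketch simply restates the expression above with the same field names:

```python
def revenue_boost(text_relevance, price, units_sold):
    # Mirrors the rank expression: (100 * text_relevance) + (price * units_sold)
    return 100 * text_relevance + price * units_sold
```

A highly relevant document with no sales can still outrank a marginal match, unless the marginal match has brought in substantial revenue.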

You can use rank expressions to replicate document boosting by adding a boost field to your documents and using that field in your rank expressions.

The easiest way to configure rank expressions is to go to the Amazon CloudSearch console and click the Rank Expressions link for your search domain.

Configuring rank expressions

You can also define rank expressions using the command line tools or REST API. For more information, see Customizing Result Ranking in the Amazon CloudSearch Developer Guide.

To use a rank expression, you specify the expression's name in your search requests using the rank parameter:

http://<my-endpoint-url>/2011-02-01/search?q=love
&return-fields=artist_name,genre,title,year
&rank=-revenue_boost

Like sorting by field value, you can prefix the rank expression name with a minus sign (-) to sort in descending order.

You can also use rank expressions to constrain the search results using the threshold parameter. The threshold parameter enables you to specify a value or range of values for a field or rank expression; only documents that match the constraint are included in the search results. For more information, see Constraining Search Results in the Amazon CloudSearch Developer Guide.

Submit Your Data Using the Amazon CloudSearch Search Data Format

Once you've configured your search domain, the next step is uploading your data so it can be indexed. As with Solr, to send data to Amazon CloudSearch you need to structure it according to a standard XML or JSON format. The Amazon CloudSearch format is called the Search Data Format (SDF). To migrate your application to Amazon CloudSearch, you'll need to modify it to generate SDF and submit the data to your domain's document service endpoint instead of Solr.

Search Data Format

SDF is very similar to the format you use for Solr. The following example shows the same document formatted in XML for Solr and for Amazon CloudSearch.

Solr

<add>
  <doc>
    <field name="id">
      sogqtrz12a8c13c8b0
    </field>
    <field name="title">
      I Can't Stop Loving You
    </field>
    <field name="artist_name">
      Martina McBride
    </field>
    <field name="artist_id">
      arf3gx71187fb3eb66
    </field>
    <field name="year">2005</field>
    <field name="genre">country</field>
    <field name="genre">pop</field>
    <field name="genre">ballad</field>
  </doc>
  <doc>
    ...
  </doc>
</add>

Amazon CloudSearch

<batch>
  <add id="sogqtrz12a8c13c8b0" 
    version="1" lang="en">
    <field name="song_id">
      sogqtrz12a8c13c8b0
    </field>
    <field name="title">
      I Can't Stop Loving You
    </field>
    <field name="artist_name">
      Martina McBride
    </field>
    <field name="artist_id">
      arf3gx71187fb3eb66
    </field>
    <field name="year">2005</field>
    <field name="genre">country</field>
    <field name="genre">pop</field>
    <field name="genre">ballad</field>
  </add>
  <add id="sobhvzq12ac9618285" 
    version="1" lang="en">
    ...
  </add>
</batch>

As you can see, the field definitions for a document are identical: the name of the field is specified with the name attribute, and the content is the content of the element.

ex.fm on CloudSearch

According to Lucas Hrabovsky, Chief Technology Officer of ex.fm, Amazon CloudSearch—or something like it—has been on his radar since long before it ever emerged. "Every time I started a project I wanted this type of search in the cloud. So when it actually came out, I jumped on it."

By the time the opportunity arose, ex.fm had a Solr-based solution already in production, but it wasn't keeping up with the load. "Like any startup, Solr started out on a machine with a bunch of other things, then we moved it to its own machine, then a bigger machine, but it still couldn't handle the update rate. [Moving to CloudSearch] was a business decision. Do we spend an additional 2 cents an hour, or spend $100,000—more here in New York City—to hire someone to handle our Solr instances?"

For ex.fm, migration was a straightforward process, completed within two weeks. The company then ran both systems in parallel, sending updates to both Solr and CloudSearch—for about two hours. After that, Hrabovsky says, they thought, "OK, CloudSearch works."

The overall structure of CloudSearch SDF data differs from Solr in two ways:

  • In Amazon CloudSearch, documents are submitted in batches of up to 5 MB at a time. In SDF, each batch of documents is contained in a batch element. (The batch element is the root element of an SDF document.)
  • Instead of wrapping each document in a doc element, in SDF you wrap each document in either an add or delete element. Unlike Solr, which can only handle one type of operation at a time, an SDF batch can contain a combination of add and delete operations.

In SDF, you also have to specify attributes at the document level. The id and version attributes are specified on every add or delete element; the lang attribute is specified on add elements:

Why are version numbers important?

They're important because they save you from one of the more difficult tasks associated with operating a large Solr installation: managing indexes on multiple servers. Amazon CloudSearch automatically adds and removes search instances from your domain as the volume of data and traffic fluctuates. Even if multiple updates for the same document hit different search instances, CloudSearch will always use the latest one, as indicated by the version number.

  • id: The document ID (docid) is a unique ID that you use to reference the document when updating or deleting it. This corresponds to the field designated as the uniqueKey in Solr. However, Amazon CloudSearch document IDs can only contain lowercase letters, numbers, and underscores, and must start with a letter or number, so you might have to adjust your IDs.

  • version: The version number is used to guarantee that older updates aren't accidentally applied and to provide control over the ordering of concurrent updates to the service. Each subsequent add or delete must have a higher version than the previous update. One option is to use the Unix epoch time (the time in seconds since January 1, 1970) for versioning. Using a timestamp is convenient because you don't have to keep track of the last version you sent to the server, and it guarantees that the last update you sent has the highest version number.

  • lang: The language for the document. Currently, this can only be set to en for English. (The language controls what language-specific text processing, such as stemming, is performed on the data. Document fields can contain any UTF-8 characters that are valid in XML, regardless of the language specified.)
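These document-level attributes are straightforward to generate programmatically. A minimal sketch using the epoch-time versioning described above (the function names are illustrative):

```python
import time

def make_add_op(doc_id, fields):
    # Epoch-time version numbers ensure the newest update always wins
    return {'type': 'add', 'id': doc_id, 'version': int(time.time()),
            'lang': 'en', 'fields': fields}

def make_delete_op(doc_id):
    return {'type': 'delete', 'id': doc_id, 'version': int(time.time())}
```

Note that if you might send more than one update for the same document within a single second, a plain one-second timestamp isn't enough; you'd need a finer-grained or monotonically increasing counter.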

SDF can be formatted in XML or JSON. For the complete XML and JSON SDF schemas, see the Amazon CloudSearch Document Service API Reference.

Generating SDF

If your application directly accesses or generates your Solr XML data, you can transform it to SDF either via XSLT or by altering the code that generates it. If you use Solr Cell to handle various document formats, or the DataImportHandler to import content from a database or other DIH-compatible source, you'll need to implement your own mechanism for generating SDF from your source data.

The content extraction toolkit behind Solr Cell, Apache Tika, gives you easy access to the actual content, so you can use that content to create SDF. If you're using the DataImportHandler, there isn't an easy way to "intercept" the data, but in most cases you can programmatically access the database to extract your data and generate SDF.
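As a concrete illustration of the transform, the following sketch converts a Solr XML <add> batch into an SDF JSON batch. It is a simplified assumption-laden example: it assumes the Solr unique key field is named id, uses epoch-time versions, and doesn't rename any fields:

```python
import json
import time
import xml.etree.ElementTree as ET

def solr_xml_to_sdf(solr_xml, id_field='id'):
    # Walk each <doc> in the Solr <add> batch and build an SDF add operation
    batch = []
    for doc in ET.fromstring(solr_xml).iter('doc'):
        fields = {}
        for f in doc.iter('field'):
            name, value = f.get('name'), (f.text or '').strip()
            # Repeated field names become multi-value lists in SDF
            if name in fields:
                if not isinstance(fields[name], list):
                    fields[name] = [fields[name]]
                fields[name].append(value)
            else:
                fields[name] = value
        batch.append({'type': 'add', 'id': fields[id_field],
                      'version': int(time.time()), 'lang': 'en',
                      'fields': fields})
    return json.dumps(batch)
```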

You can also opt to use a third-party content processing system such as Aspire, which enables you to generate SDF from a variety of sources.

For more information about generating SDF, see Preparing Your Data in the Amazon CloudSearch Developer Guide.

Submitting Data to CloudSearch Instead of Solr

To submit your data to your search domain, you send an HTTP request that contains an SDF batch to your domain's document service endpoint. The changes you need to make to your application depend on how you're currently sending data to Solr. If you're directly submitting HTTP requests to send your data to Solr, you can just change the destination for your requests to your search domain's document service endpoint. If you're using an API such as SolrJ or SolrPHP, you'll need to replace the code that submits your data with HTTP POST requests to the documents/batch API. For example:

POST /2011-02-01/documents/batch HTTP/1.1
Accept: application/json
Content-Length: 1176
Content-Type: application/json
Host: doc.imdb-movies-rr2f34ofg56xneuemujamut52i.us-east-1.cloudsearch.amazonaws.com
[ { 
    "type": "add", 
    "id": "tt0484562", 
    "version": 1337648735, 
    "lang": "en", 
    "fields": { 
        "title": "The Seeker: The Dark Is Rising",        
        "director": "Cunningham, David L.", 
        "genre": [
            "Adventure",
            "Drama",
            "Fantasy",
            "Thriller"], 
        "actor": [
            "McShane, Ian",
            "Eccleston, Christopher",
            "Conroy, Frances"]  
    } 
  }, 
  { 
    "type": "delete", 
    "id": "tt0434409", 
    "version": 1337648735 
  } 
]
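If you're not already using an HTTP library, the standard library is enough to construct this request. This sketch builds the POST shown above; the doc_endpoint argument is a placeholder for your own domain's document service endpoint:

```python
import json
import urllib.request

def build_batch_request(doc_endpoint, batch):
    # Serialize the SDF batch and target the documents/batch resource
    body = json.dumps(batch).encode('utf-8')
    return urllib.request.Request(
        url='http://%s/2011-02-01/documents/batch' % doc_endpoint,
        data=body,
        headers={'Content-Type': 'application/json',
                 'Accept': 'application/json'},
        method='POST')
```

Passing the resulting request to urllib.request.urlopen() sends the batch; any HTTP client your application already uses works just as well.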

Using a Proxy to Divert Data to Amazon CloudSearch

Another approach is to leave your application as-is and write a proxy that intercepts the data when updates are posted. For example, a Solr update request might send data as an HTTP POST request to:

http://localhost:8983/solr/update

There's no reason that URL has to actually refer to Solr. Instead, it could reference a script that reads the data from the POST request, converts it to SDF, and submits it to CloudSearch. For example:

http://localhost:8983/solr/update/index.php

How you implement this proxy will depend on the structure of your Solr update requests.

Convert Your Solr Queries to the Amazon CloudSearch Search Syntax

Once you've uploaded your data to your search domain, you're ready to start searching. Converting Solr queries to the CloudSearch search syntax is fairly straightforward. Like Solr, CloudSearch search requests are submitted via HTTP and the results can be returned in either XML or JSON.

For simple searches, the syntax is virtually identical-both systems enable you to perform basic keyword searches with the q parameter:

Solr

http://{your-search-server}:{your-search-port}/solr/select?q=love&fl=title

Amazon CloudSearch

http://{your-search-endpoint}/2011-02-01/search?q=love&return-fields=title

One key difference, however, is that Solr returns all fields by default, while CloudSearch only returns the document ID by default. In CloudSearch, you specify the fields that you want to include in the results with the return-fields parameter.

Default Search Field

All Documents Search

You can duplicate Solr's *:* query by doing a negative query for a term that doesn't exist in any document, such as q=-bogus123.

In Solr, you often define a "default" text field that includes copies of other fields so they can be searched together. Amazon CloudSearch automatically creates a default search field that includes all of the text fields configured for your domain. (You can also explicitly specify which fields you want to search by default with the UpdateDefaultSearchField API.) If a search request does not specify which fields to search, Amazon CloudSearch searches this default search field. Requests that use the q parameter to specify search terms always search the default search field. The default search field is also searched when you use the Boolean query (bq) parameter and don't specify a particular field to search.

Boolean Searches

It would be difficult, if not downright impossible, to write a search-related application without using some sort of Boolean logic. For example, you might want to search for songs by Kiss that have the words "rock" and "roll" in them. Amazon CloudSearch provides two ways to construct Boolean queries:

  • You can use the prefix operators + (AND), - (NOT), and | (OR) when specifying the terms you want to search for. The prefix operators apply to individual terms and can be specified when using either the q (query) or bq (Boolean query) parameter.

  • You can construct more complex queries with nested Boolean logic using the bq parameter. The bq parameter also enables you to search particular fields.

For example, the simplest way to search for songs by Kiss that have the words "rock" and "roll" in them would be to specify:

q=rock roll kiss

This is equivalent to specifying +rock +roll +kiss. Amazon CloudSearch ANDs all of the search terms together by default, so you don't have to specify the + prefix operator. (This is the same as Solr configured to use AND as the default Boolean operator.)

Similarly, if you want to get songs that contain the terms "rock" and "roll" but that aren't by Kiss, you might specify:

q=rock roll -kiss

If you want to get songs that contain both the terms "rock" and "roll" or the term "kiss", you could specify:

(rock roll) |kiss

However, none of these queries guarantee that you'll only get (or exclude) songs from the band Kiss; some results might match because they have the term "kiss" in the title. In Solr, you filter the results by specifying the fq parameter. To search particular fields with Amazon CloudSearch, you use the Boolean query parameter, bq:

Solr

q=(+rock +roll)&fq=artist_name:kiss

Amazon CloudSearch

q=rock roll&bq=artist_name:'kiss'

Note that when you use bq to search a particular field for a string, you must enclose the search term(s) in quotes; if you leave them out, you'll get a syntax error. (When using bq to search uint fields for a value or range of values, the value or range is not enclosed in quotes.)
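The quoting rule is easy to get wrong when assembling bq strings by hand, so it can help to centralize it in a helper. A hypothetical sketch (the function name is illustrative):

```python
def bq_field(field, value):
    # Strings must be single-quoted in bq; uint values and ranges must not be
    if isinstance(value, str) and '..' not in value:
        return "%s:'%s'" % (field, value)
    return '%s:%s' % (field, value)
```

For example, bq_field('artist_name', 'kiss') produces artist_name:'kiss', while bq_field('year', '1990..2000') produces the unquoted year:1990..2000.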

Document Boosting

In Solr, the behavior of your queries depends on whether you're using the standard Solr query parser, the DisMax (Disjunction Max) query parser, or its cousin Extended DisMax (eDisMax). The DisMax parsers enable Solr to search for individual terms across multiple fields; Amazon CloudSearch does this by default. You can replicate the boosting capabilities they offer by implementing custom rank expressions in Amazon CloudSearch.

The bq parameter doesn't just enable you to search particular fields; you can also use it when you need to construct complex queries with nested Boolean logic. Instead of just using the +, -, and | operators with specific terms, you can use the and, or, and not operators to combine field-specific searches.

For example, the following query finds songs that either contain the term "rock" in the title, or do not contain the term "roll" in the title:

bq=(or title:'rock' (not title:'roll'))

This same query could be specified using the - prefix operator on the term "roll":

bq=(or title:'rock' (and title:'-roll'))

If you specify both the q parameter and the bq parameter in a search request, they are ANDed together. A search request can only contain one q parameter and one bq parameter.
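For example, here is one way you might assemble such a request with the Python standard library. This is a sketch, not production code; {my-search-endpoint} is a placeholder for your domain's actual search endpoint.

```python
from urllib.parse import urlencode

# Build a search URL that supplies both q and bq; CloudSearch ANDs the
# two parameters together. "{my-search-endpoint}" is a placeholder for
# your domain's search endpoint.
params = {
    "q": "rock roll",
    "bq": "(or title:'rock' (not title:'roll'))",
    "return-fields": "title",
}
url = "http://{my-search-endpoint}/2011-02-01/search?" + urlencode(params)
print(url)
```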

Range Searches

In addition to using the bq parameter to search uint fields for specific values, you can search for values that fall within a certain range. Amazon CloudSearch uses double dot notation to specify ranges. Either end of the range can be omitted to perform an open-ended search.

Solr Amazon CloudSearch
q=year:[1990 TO 2000] bq=year:1990..2000
q=year:[* TO 2000] bq=year:..2000
q=year:[1990 TO *] bq=year:1990..
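If you're converting queries programmatically, a small translation helper covers the common uint-range cases in the table above. This is an illustrative sketch, not an exhaustive Solr query parser:

```python
import re

# Translate a Solr uint range like "year:[1990 TO 2000]" into
# CloudSearch's double-dot notation ("year:1990..2000"). A "*" on
# either side becomes an open end of the range.
def solr_range_to_cloudsearch(expr):
    m = re.fullmatch(r"(\w+):\[(\*|\d+) TO (\*|\d+)\]", expr)
    field, lo, hi = m.groups()
    lo = "" if lo == "*" else lo
    hi = "" if hi == "*" else hi
    return f"{field}:{lo}..{hi}"

print(solr_range_to_cloudsearch("year:[1990 TO 2000]"))  # year:1990..2000
print(solr_range_to_cloudsearch("year:[* TO 2000]"))     # year:..2000
print(solr_range_to_cloudsearch("year:[1990 TO *]"))     # year:1990..
```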

Faceted Searches

In Solr, you must explicitly turn on faceting in your search queries. In Amazon CloudSearch, faceting is automatically enabled for fields that are facet-enabled in the domain configuration.

To get facet counts for a field, you include the facet parameter in the search request. This is equivalent to enabling faceting and specifying the facet.field parameter in a Solr query:

Solr Amazon CloudSearch
q=January&facet=true&facet.field=year&facet.field=genre q=January&facet=year&facet=genre

Both Solr and Amazon CloudSearch provide several query parameters that enable you to control how data is faceted. The following table shows how the Solr parameters map to Amazon CloudSearch parameters:

Solr Amazon CloudSearch
facet.limit=10 facet-top-n=10
f.genre.facet.limit=5 facet-genre-top-n=5
facet.sort=count facet-sort=count
facet.sort=index facet-sort=alpha
facet.query=year:2000&facet.query=year:2001&facet.query=year:[2002 TO 2004]&facet.query=year:[2005 TO *] facet-year-constraints=2000,2001,2002..2004,2005..

For more information about performing faceted searches, see Getting and Using Facet Information in the Amazon CloudSearch Developer Guide.
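Putting several of these parameters together, a faceted request might be assembled as follows. This Python sketch uses the CloudSearch parameter names from the table above; the values simply mirror the examples.

```python
from urllib.parse import urlencode

# Sketch of a faceted search request (2011-02-01 API). Facet counts are
# requested for year and genre, the genre facet is capped at its top 5
# constraints, and explicit buckets are defined for year.
params = {
    "q": "January",
    "facet": "year,genre",
    "facet-genre-top-n": "5",
    "facet-sort": "count",
    "facet-year-constraints": "2000,2001,2002..2004,2005..",
}
query_string = urlencode(params)
print(query_string)
```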

Displaying Amazon CloudSearch Results

Once you've converted your queries to the Amazon CloudSearch syntax, the final step is to update how you display the results. The format of an Amazon CloudSearch response is similar to responses returned by Solr, so this is pretty straightforward.

The main difference is in what Amazon CloudSearch returns by default:

  • Amazon CloudSearch only returns the document IDs for each search hit by default. To include additional data in the response, you must explicitly specify the fields you want to get when you submit the search request. You specify the fields by including the return-fields parameter in your request.

    http://{my-search-endpoint}/2011-02-01/search?q=January&return-fields=title
  • Amazon CloudSearch returns results in JSON unless you explicitly set the results-type parameter to XML when you submit the request. For example:

    http://{my-search-endpoint}/2011-02-01/search?q=January&return-fields=title&results-type=xml

To see how an Amazon CloudSearch response differs from a Solr response, let's compare the JSON responses for the same simple query:

Solr

{
  "responseHeader":{
    "status":0,
    "QTime":1,
    "params":{
      "indent":"on",
      "wt":"json",
      "q":"January"
    }
  },
  "response":{
    "numFound":105,
    "start":0,
    "docs":[
      {
        "id":"soaxcoy12ab0182ecf",
        "title":"January Summer"
      },
      {
        "id":"soabbkz12a8c141f8c",
        "title":"The Month of January"
      },
      ...
    ]
  }
}

Amazon CloudSearch

{
  "rank": "-text_relevance",
  "match-expr": "(label 'January')",
  "hits": {
    "found": 105,
    "start": 0,
    "hit":[
      {
        "id": "soaxcoy12ab0182ecf",
        "data": {
          "title": ["January Summer"]
        }
      },
      {
        "id": "soabbkz12a8c141f8c",
        "data": {
          "title": ["The Month Of January"]
        }
      },
      ...
    ]
  },
  "info": {
    "rid": "7f1dfca4b37f...",
    "time-ms": 3,
    "cpu-time-ms": 0
  }
}

A couple of important things to note about the structure of the Amazon CloudSearch response:

  • CloudSearch encloses the data for each document in the data property.

  • Field values are always returned as an array. Instead of having to designate in your schema that a field is a multi-value field, in Amazon CloudSearch any field can contain multiple values.
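Given that shape, extracting field values is straightforward in most languages. Here is a Python sketch against a trimmed-down sample response; note the [0] to unwrap the single-valued title field, since every field value arrives as a list.

```python
import json

# Parse a (trimmed) CloudSearch JSON response and pull the title out of
# each hit. Field values are always arrays, so take element 0 for
# single-valued fields.
response = json.loads("""
{
  "hits": {
    "found": 105,
    "start": 0,
    "hit": [
      {"id": "soaxcoy12ab0182ecf", "data": {"title": ["January Summer"]}},
      {"id": "soabbkz12a8c141f8c", "data": {"title": ["The Month Of January"]}}
    ]
  }
}
""")

titles = [hit["data"]["title"][0] for hit in response["hits"]["hit"]]
print(titles)  # ['January Summer', 'The Month Of January']
```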

Now let's look at the results of a faceted search, such as:

http://{my-search-endpoint}/2011-02-01/search?
     bq=title%3A'January'
     &return-fields=title%2Cyear%2Cgenre
     &facet=year,genre_facet
     &facet-year-constraints=2000,2001,2002..2004,2005..

Solr

{
  …
  "facet_counts": {
    "facet_fields": {
      "genre_facet": [
        "pop", 67,
        "rock", 57,
        "electronic", 52,
        ...
      ]
    },
    "facet_queries": {
      "year:[2005 TO *]": 13,
      "year:[2000 TO 2000]": 7,
      "year:[2002 TO 2004]": 4,
      "year:[2001 TO 2001]": 3
    },
    "facet_dates": {},
    "facet_ranges": {}
  }
}

Amazon CloudSearch

{
  …
  "facets": {
    "genre_facet": {
      "constraints": [
        {
          "value": "pop",
          "count": 67
        },
        {
          "value": "rock",
          "count": 57
        },
        {
          "value": "electronic",
          "count": 52
        },
        ...
      ]
    },
    "year": {
      "min": 1959,
      "max": 2010,
      "constraints": [
        {
          "value": "2005..",
          "count": 13
        },
        {
          "value": "2000",
          "count": 7
        },
        {
          "value": "2002..2004",
          "count": 4
        },
        {
          "value": "2001",
          "count": 3
        }
      ]
    }
  },
  "info": {
    "rid": "7f1dfca4b...",
    "time-ms": 4,
    "cpu-time-ms": 0
  }
}

As you can see, Amazon CloudSearch doesn't make a distinction between how you created the facet buckets; all of the facet information is returned in the facets object. Also, while Solr presents regular field facet counts in a single array of names and values, CloudSearch breaks individual buckets out into objects, each of which contains a value and count.
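If your display code was written against Solr's flat name/count arrays, one option is to flatten the CloudSearch constraints back into simple value-to-count maps before rendering. A Python sketch, using sample data drawn from the faceted response:

```python
# Flatten CloudSearch facet constraints into {value: count} dicts, which
# is closer to the flat shape Solr's facet_fields arrays give you.
facets = {
    "genre_facet": {"constraints": [
        {"value": "pop", "count": 67},
        {"value": "rock", "count": 57},
        {"value": "electronic", "count": 52},
    ]},
    "year": {"constraints": [
        {"value": "2005..", "count": 13},
        {"value": "2000", "count": 7},
    ]},
}

counts = {
    name: {c["value"]: c["count"] for c in facet["constraints"]}
    for name, facet in facets.items()
}
print(counts["genre_facet"]["pop"])  # 67
```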

Testing the Migration

Once you've finished modifying your search application to use Amazon CloudSearch, you need to test your migration to ensure that everything works and you're getting the results that you expect to get. It's likely that you'll need to do some fine-tuning of your indexing options and rank expressions to get the best possible results.

If you already have a test suite in place to verify your search results, you can simply run that same test suite against the results from your Amazon CloudSearch domain.

If you don't already have a test suite for evaluating search results, you can run your old and new applications side-by-side and compare results. You should also consider investing the time in building out a test suite so you can easily evaluate future modifications to your application.
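One simple way to structure such a side-by-side comparison is to run the same logical query against both systems and diff the returned document IDs. Exact ordering may differ while you tune rank expressions, so comparing sets of IDs is a reasonable first pass; the fetching itself is stubbed out in this sketch.

```python
# Compare the document IDs returned by Solr and CloudSearch for the same
# logical query. Returns the IDs missing from CloudSearch and the extra
# IDs it returned, ignoring result order.
def compare_results(solr_ids, cloudsearch_ids):
    missing = set(solr_ids) - set(cloudsearch_ids)
    extra = set(cloudsearch_ids) - set(solr_ids)
    return missing, extra

# Stubbed result lists stand in for real responses from each system.
missing, extra = compare_results(
    ["doc1", "doc2", "doc3"],
    ["doc2", "doc3", "doc4"],
)
print(missing, extra)  # {'doc1'} {'doc4'}
```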

Summary

Migrating an existing Solr-based search solution to Amazon CloudSearch is a relatively straightforward process:

  1. Create a CloudSearch domain.
  2. Map your Solr schema to Amazon CloudSearch index fields.
  3. Implement your boosting and sorting preferences using rank expressions.
  4. Submit your data using the Amazon CloudSearch Search Data Format (SDF).
  5. Convert your Solr queries to the Amazon CloudSearch search syntax.
  6. Update your result processing to display results from Amazon CloudSearch.

When you're done, the reward is a system that enables you to focus on your own application and user experience, rather than operational issues and systems management.

For more information about Amazon CloudSearch or any of the functions or capabilities you saw here, see the Amazon CloudSearch Overview and the Amazon CloudSearch Developer Guide.

Search...At Scale

For Paul Nelson, Chief Architect of Search Technologies Corp., the company behind the Aspire framework for importing different types of data into Amazon CloudSearch, the decision to migrate from Solr to Amazon CloudSearch is all about scale, "because your own instance requires hardware and a person to manage it. You want to push that off to Amazon. They handle all the architecture and hardware, so lifecycle costs will be much less. Amazon handles the load balancing and other architecture work. You don't have to do monitoring, swapping failed machines, and so on."

But it's more than just basic maintenance, he points out. "CloudSearch means never having to worry about redesigning your application for 10, 20, 500 million documents."

©2014, Amazon Web Services, Inc. or its affiliates. All rights reserved.