Guide to Configuring Index Fields for an Amazon CloudSearch Domain

A search domain's index fields control how document data in Search Data Format (SDF) is handled during indexing and how you will be able to search and use the data once your index is built. This document will help you understand how the index field configuration impacts index size and the cost of your Amazon CloudSearch domain. It describes how to decide what index field types to use, how the document data is mapped to the index fields, and how you can configure fields to support faceting and return data in the search results.

Details

Submitted By: Deborah Adair
AWS Products Used: Amazon CloudSearch
Created On: August 16, 2012 5:34 PM GMT
Last Updated: August 16, 2012 5:34 PM GMT

Note: For information about generating SDF, see the Guide to Formatting Your Data in SDF for Amazon CloudSearch.

Choosing the Field Type

Amazon CloudSearch supports three types of index fields:

  • text - a text field contains arbitrary alphanumeric data. The contents of a text field are processed to split the text stream into individual words (tokens) based primarily on whitespace. The tokens are then further processed to remove stopwords and apply stemming and synonym options.
  • literal - a literal field contains an identifier or other data that you want to be able to match exactly. No text processing is performed on literal fields. The contents of a literal field are stored as-is and search terms must match the field value exactly for the document to be included in the search results.
  • uint - a uint field contains a 32-bit unsigned integer value in the range 0-4294967295.

Which type you use for a particular field depends on your data and the searches you want to support:

  • Does the field contain numeric values? If so, you can configure the field as a uint field. To store floating point values, you can multiply by a constant factor to convert the values to integers. For example, to store a price that's represented in dollars and cents, multiply the price by 100 to store it as an integer that represents the value in cents - 48.75 becomes 4875. Similarly, if you have a mix of positive and negative values in a relatively narrow range, you could just add a fixed amount to all of your values so they can all be stored as unsigned integers. If you aren't going to use the field for range searches or to construct rank expressions, you can also store numeric values in a literal field.
  • Do you need to search for individual words within the field? If so, configure the field as a text field. When a text field is searched, a document will be returned as a match if it contains the search terms anywhere within the field, in any order. Text fields are used for free text searches of data such as names, descriptions, or even the entire body of a document.
  • Does the field contain string values that you want to match exactly or use as facets? If so, configure the field as a literal field. When a literal field is searched, the search terms must match the entire field value exactly for the document to be included in the search results. Literal fields are often used for fields that have a small set of possible values, as well as for more arbitrary values like email addresses or brand names where an exact match is important. Literal fields are frequently used to enable faceted searches where you want to count the number of exact matches for a particular value.
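
The numeric conversions described above can be sketched in a few lines of Python. This is only an illustration: the helper names to_uint and from_uint, and the offset of 1000, are hypothetical and not part of Amazon CloudSearch. The sketch scales a decimal value by a constant factor, optionally shifts it by a fixed amount, and checks that the result fits in the uint range.

```python
from decimal import Decimal

UINT_MAX = 4294967295  # largest value a uint field can hold


def to_uint(value, scale=1, offset=0):
    """Scale and shift a number so it can be stored in a uint field.

    Hypothetical helper for illustration only.
    """
    # Decimal(str(...)) avoids binary floating-point rounding surprises.
    # int() truncates any precision beyond the chosen scale.
    encoded = int(Decimal(str(value)) * scale) + offset
    if not 0 <= encoded <= UINT_MAX:
        raise ValueError(f"{value!r} does not fit in a uint field")
    return encoded


def from_uint(encoded, scale=1, offset=0):
    """Reverse to_uint, for example when displaying a returned value."""
    return (Decimal(encoded) - offset) / scale


price_cents = to_uint("48.75", scale=100)  # dollars and cents -> cents
shifted = to_uint(-12, offset=1000)        # shift a signed range upward
print(price_cents, shifted)
```

Whatever scale and offset you choose must be applied consistently, both when you generate SDF and when you build range searches against the field.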

As an example, consider a website that enables users to build and listen to playlists of songs that are available for purchase. In this application, users will search for and retrieve songs that they can purchase and add to their playlists. For this song search domain, the SDF contains six document fields for each song:

{
	"type": "add",
	"id": "sogqtrz12a8c13c8b0",
	"version": 1,
	"lang": "en",
	"fields": {
		"title": "I Can't Stop Loving You",
		"description": "A country standard performed live by Martina McBride.",
		"artist_name": "Martina McBride",
		"year": 2005,
		"price": 100,
		"genre": ["country", "pop", "ballad"]
	}
}

We want to enable users to search a range of years or prices, so we need to configure uint fields for both the year and price values. Note that the prices are multiplied by 100 to store the price in pennies in the uint field.
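
If the song data lives in an application database, a document like the one above could be assembled programmatically. The following Python sketch is one way to do it. The function name make_song_document and its parameters are hypothetical; only the field names and structure are taken from the example document.

```python
import json


def make_song_document(song_id, version, title, description,
                       artist_name, year, price_dollars, genres):
    """Build one "add" operation shaped like the example document.

    Hypothetical helper for illustration only.
    """
    return {
        "type": "add",
        "id": song_id,
        "version": version,
        "lang": "en",
        "fields": {
            "title": title,
            "description": description,
            "artist_name": artist_name,
            "year": year,
            # Store the price in cents so it fits a uint field.
            "price": round(price_dollars * 100),
            "genre": list(genres),
        },
    }


doc = make_song_document(
    song_id="sogqtrz12a8c13c8b0",
    version=1,
    title="I Can't Stop Loving You",
    description="A country standard performed live by Martina McBride.",
    artist_name="Martina McBride",
    year=2005,
    price_dollars=1.00,
    genres=["country", "pop", "ballad"],
)
print(json.dumps(doc, indent=2))
```

Serializing with json.dumps takes care of quoting and separators, which avoids hand-editing mistakes such as a missing comma between fields.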

For the title, description, and artist name, we want to enable free text search, so we'll configure text fields for these values. This way, if a user searches by artist and enters McBride, the results will include all of the songs by Martina McBride, as well as any songs by Justin McBride, Christian McBride, and so on. If we configured the artist field as a literal, a search for McBride wouldn't match Martina McBride. The user would have to enter an artist's full name to find their songs. When we configure the title, description, and artist fields as text fields, the field values and search terms are tokenized, so it doesn't matter where in the field or in what order the specified terms appear. If the search terms appear in the field being searched, the document is considered a match.
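
The difference between the two behaviors can be modeled with a short Python sketch. It is a deliberate simplification, not the actual Amazon CloudSearch implementation: it splits on whitespace only and ignores stopwords, stemming, synonyms, and any case handling. The function names are hypothetical.

```python
def text_match(field_value, query):
    """Simplified text field match: every query word appears somewhere
    in the field, in any order."""
    field_tokens = set(field_value.split())
    return all(term in field_tokens for term in query.split())


def literal_match(field_value, query):
    """Literal field match: the query must equal the whole field value."""
    return field_value == query


artist = "Martina McBride"

print(text_match(artist, "McBride"))             # a single word is enough
print(text_match(artist, "McBride Martina"))     # word order does not matter
print(literal_match(artist, "McBride"))          # partial value does not match
print(literal_match(artist, "Martina McBride"))  # exact value matches
```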

Each song can be associated with multiple genres, but we know there's a limited set of predefined values for genre. Since we always want to match these values exactly and want to use genre as a facet, we'll use a literal field. (In most cases, fields that you use as facets will be literal, rather than text fields, because the values won't make sense if they are tokenized.)

So far, our indexing options for the song search domain look like this:

Index Field Name    Index Field Type
title               text
description         text
artist_name         text
year                uint
price               uint
genre               literal

Note that our index field names are the same as the document field names in our SDF. We don't have to use the same names, but when we do, the index fields are automatically populated with the data from the SDF document field of the same name.

Configuring Field Options

An index field has three attributes that control how you can use the field:

  • Search - when a field is search enabled, you can explicitly search that field for a value using the bq search syntax. Text and uint fields are always search enabled - you cannot disable search for a text or uint field. To search literal fields, you must explicitly enable the search option.
  • Facet - when a field is facet enabled, you can retrieve facet counts for the field when you search by specifying the facet parameter in the search request. Facet counts are the number of documents that contain a particular value in the field. Uint fields are always facet enabled. To use text and literal fields as facets, you must explicitly enable the facet option.
  • Result - when a field is result enabled, the original field values are stored in the index and you can retrieve that data when you search by specifying the return-fields parameter in the search request. Uint fields are always result enabled. To store and retrieve additional data from your index, you can enable the result option for selected text or literal fields.

    Note: Keep in mind that storing the original values in the index increases the size of the index and can increase the cost of running your domain. For example, if a field contains 1 KB of data and you make it result enabled, each document in your domain is going to add about 1 KB to the size of your index.
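
The defaults described above can be summarized as a small Python function. This is a sketch of the rules as stated in this guide, not an Amazon CloudSearch API. The function name effective_options and its return format are hypothetical.

```python
def effective_options(field_type, search=False, facet=False, result=False):
    """Return the options a field ends up with, given its type and the
    options that were explicitly requested.

    Hypothetical helper that encodes the rules described in this guide.
    """
    if field_type == "uint":
        # Uint fields are always search, facet, and result enabled.
        return {"search": True, "facet": True, "result": True}
    if field_type == "text":
        # Text fields are always search enabled; facet and result are opt-in.
        return {"search": True, "facet": facet, "result": result}
    if field_type == "literal":
        # Literal fields have every option off unless explicitly enabled.
        return {"search": search, "facet": facet, "result": result}
    raise ValueError(f"unknown field type: {field_type!r}")


print(effective_options("uint"))
print(effective_options("text", result=True))
print(effective_options("literal", search=True, facet=True))
```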

In our song search domain scenario, the title, description, and artist_name fields are defined as text fields, so they are automatically search enabled. To make it easy to list the search results by song title, we can make the title field result enabled. If the user wants to view additional information about the song, they can click on the title and we can use the song's document ID to look up the rest of the song information in our database.

Because the year and price fields are defined as uint fields, we can automatically use those fields as facets. This enables us to let the user view the matching songs by price range or the year that the song was released. However, we'd also like to let the user view the matching songs by genre. To do that, we need to enable faceting for the genre field. To enable users to search by genre, we also need to enable searching for the genre field.
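
With these options in place, a search request can combine the bq, facet, and return-fields parameters mentioned earlier. The Python sketch below only builds a query string. The example bq expression and the use of comma-separated lists for facet and return-fields are assumptions for illustration, so check the Amazon CloudSearch Developer Guide for the exact syntax. The function name build_search_query is hypothetical.

```python
from urllib.parse import urlencode


def build_search_query(bq, facets=(), return_fields=()):
    """Build a URL-encoded query string for a search request.

    Hypothetical helper; the parameter names come from this guide, but the
    value formats are assumptions.
    """
    params = {"bq": bq}
    if facets:
        # Assumption: multiple facet fields are given as a comma-separated list.
        params["facet"] = ",".join(facets)
    if return_fields:
        # Assumption: multiple return fields are given as a comma-separated list.
        params["return-fields"] = ",".join(return_fields)
    return urlencode(params)


# Assumed bq expression: match documents whose genre field is exactly "country".
query = build_search_query(
    bq="genre:'country'",
    facets=["genre", "year"],
    return_fields=["title"],
)
print(query)
```

The resulting string would be appended to your domain's search endpoint. Only facet enabled fields can be named in facet, and only result enabled fields in return-fields.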

Now the indexing options for our song search domain look like this:

Index Field Name    Index Field Type    Search    Facet    Result
title               text                Yes       No       Yes
description         text                Yes       No       No
artist_name         text                Yes       No       No
year                uint                Yes       Yes      Yes
price               uint                Yes       Yes      Yes
genre               literal             Yes       Yes      No

Mapping Document Data to Index Fields

If you don't explicitly specify a source for an index field, the document field with the same name is used to populate the index field. For example, the title field in our songs SDF is used to populate the title field configured in our indexing options.

You can override the default behavior by explicitly defining one or more sources for an index field.

Creating Duplicate Fields

One common reason to explicitly define the source for a field is when you want to store the value of a document field in the index and also use it for faceting. Since a text or literal field can't be both facet and result enabled, you can create a duplicate field that uses the same source.

For example, if we want to store the genre information in our song search index so we can get it back with the search results, we could make our existing genre field search and facet enabled, and create an additional genre_result field that is result enabled. We can configure the source for the new field through the console:

  1. Go to the Amazon CloudSearch console at https://console.aws.amazon.com/cloudsearch/home.
  2. In the Navigation panel, click the domain name and then click the domain's Indexing Options link.
  3. To create the duplicate field, click Add Index Field to add a field specification to the list.
  4. Specify the name for the duplicate field, genre_result.
  5. In the Source column, click the add link.
  6. In the Add Source dialog box, enter the name of the document field you want to use as a source for the duplicate index field, genre.
  7. Make sure that Copy is selected and click the Add button.
  8. Click Submit to save your changes.
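
Conceptually, the result of these steps is a mapping from index fields to source document fields. The Python sketch below models that mapping to show how one document field can feed two index fields. It illustrates the idea only and is not how Amazon CloudSearch is implemented; the function name and the dictionary format are hypothetical.

```python
def populate_index_fields(document_fields, index_field_sources):
    """Copy document data into index fields.

    index_field_sources maps each index field name to a list of source
    document field names, or to None to use the document field with the
    same name (the default behavior). Hypothetical model for illustration.
    """
    populated = {}
    for index_field, sources in index_field_sources.items():
        for source in (sources or [index_field]):
            if source not in document_fields:
                continue
            value = document_fields[source]
            values = value if isinstance(value, list) else [value]
            populated.setdefault(index_field, []).extend(values)
    return populated


song_fields = {
    "title": "I Can't Stop Loving You",
    "genre": ["country", "pop", "ballad"],
}

index_field_sources = {
    "title": None,              # default: populated from "title"
    "genre": None,              # default: populated from "genre"
    "genre_result": ["genre"],  # explicit source: copy of "genre"
}

index_data = populate_index_fields(song_fields, index_field_sources)
print(index_data["genre"])
print(index_data["genre_result"])
```

Both genre and genre_result end up with the same values, so one can be facet enabled while the other is result enabled.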

Now our song search indexing options look like this:

Index Field Name    Index Field Type    Search    Facet    Result
title               text                Yes       No       Yes
description         text                Yes       No       No
artist_name         text                Yes       No       No
year                uint                Yes       Yes      Yes
price               uint                Yes       Yes      Yes
genre               literal             Yes       Yes      No
genre_result        literal             No        No       Yes

Cost Considerations

The indexing options you configure can have a direct impact on the cost of running your search domain.

The bigger your index is, the more it will cost to run your search domain. The size of your index depends on the number of documents uploaded to the domain, how many index fields are configured, the number of fields configured as facets, and how much additional data you store in the index by making index fields result enabled.
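
As a rough back-of-the-envelope check, the earlier note that a result enabled field adds about as much data per document as the field contains can be turned into a quick estimate. The Python sketch below is only an approximation under that assumption; the document count is a made-up example and the function name is hypothetical. It ignores any other indexing overhead.

```python
def added_index_bytes(num_documents, avg_result_bytes_per_document):
    """Estimate the extra index size from storing result enabled field data.

    Hypothetical helper; assumes each document adds roughly the size of
    its stored field values to the index.
    """
    return num_documents * avg_result_bytes_per_document


# Example: 1,000,000 documents, each storing about 1 KB of returnable data.
extra_bytes = added_index_bytes(1_000_000, 1024)
print(f"{extra_bytes / 1024 ** 3:.2f} GiB of additional index data")
```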

Here are some guidelines to consider when configuring indexing options:

  • Only make fields searchable if you're actually going to search them. If you're not going to use a field from your source data, indexing it will unnecessarily increase the size of your index.
  • If you can easily retrieve document data from another source, return a lookup key instead of the document data. For example, you might index all of the data describing a product and the product ID, and use the product ID to retrieve the complete product data from your document store. Reducing the amount of data you store in your index will help minimize the cost of running your search domain.
  • If you find yourself storing a large amount of data in your index so it can be returned in results, carefully consider whether it's worth the convenience. Keep in mind that only the first 2 KB of data can be returned from a text field, and the maximum document size is 1 MB.
  • Use facets judiciously. To support faceting, additional information has to be stored in the index for each facet enabled field.
  • Only create duplicate fields when you really need to. The more index fields you configure, the bigger your index will be. If you want to store the value of a document field in the index and also use it for faceting, you'll need to use a duplicate field.
  • If you're not using the q parameter in searches, set the default search field to an index field that contains no data. Otherwise, the default search field will be populated with all of the configured text fields, unnecessarily increasing the size of the search index.
  • If you want to use the q parameter, but don't need to search all of the text fields by default, set the default search field to specify the subset of fields that you want to search. That way, you are only populating the default search field with the data you need.

Summary

The document data in your SDF batches is mapped onto index fields that control how the data is indexed and whether you can search the data, use it for faceting, or return it in the search results. Amazon CloudSearch supports three types of index fields: text, literal, and uint.

By default, an index field is populated using the SDF document field with the same name. You can also explicitly define one or more sources for an index field.

The more index fields and facets you configure, and the more data you store in the index by making text and literal fields returnable, the larger your index will be. The larger your index, the more it will cost to run your search domain.

For more information about configuring indexing options for your search domain, see Mapping Document Data to Index Fields and Configuring Index Fields in the Amazon CloudSearch Developer Guide.
