Guide to Formatting Your Data in SDF for Amazon CloudSearch

Search Data Format (SDF) is the structured data format that you use to represent the data that you want to index and search with Amazon CloudSearch. This guide describes how to structure your data to support searching, describe it in SDF, and validate the SDF before uploading it to your search domain.

Details

Created On: June 27, 2012 8:23 PM GMT
Last Updated: July 15, 2013 10:16 PM GMT


What Do You Want to Search?

In SDF, each item that you can search for, such as a product, is represented as a search document. The item data that you want to be able to search, such as the product name and description, is represented as a collection of named field-value pairs. When you send a search request, Amazon CloudSearch returns a set of search documents that match the search criteria.

The first step in creating your search application is determining what data you want to search and how to structure it to support searching:

  • What will your users retrieve when they run a search? For example, when you search for something on Amazon.com, you get a set of products. Your search documents will be the items that you want to show in the search results. Some applications will return more than one kind of document as a search result. For example, an online forums application might have both threads and posts and allow users to retrieve either.
  • What information about each item needs to be searchable? This is the data you want to index. For example, each product might have a name and description. These attributes are represented as field-value pairs within each search document. Applications typically provide a text entry form that enables users to specify the words they want to search for--those search words are compared against the contents of the search document fields to generate a list of matching documents. To optimize the cost and performance of your search domain, you should only index information that is relevant to user searches--if it's not something the user cares about, don't add it as a document field.
  • Are there additional attributes that you want to use to filter the results? If there are categories or values that you don't need to search, but want to use to filter the results, they also need to be represented as field-value pairs within each search document. For example, each product might have a size and color, or you might want to be able to filter on a numeric value such as date or price.

As an example, consider a website that enables users to build and listen to playlists of songs that are available for purchase. The songs are stored in a database. In this application, users will search for and retrieve songs that they can purchase and add to their playlists. For this song search domain, each search document represents a single song.

To make it easy for users to find songs that they want to buy, we want to let them perform text searches on the song titles, song descriptions, and artist names. To enable this, we'll specify three fields for each search document: title, description, and artist_name.

Since these searches might generate a lot of matches, we want to let users narrow the search results by specifying the genre of the song, the year or range of years that the song was released, or a price range. To enable this, we'll specify three additional fields for each search document: year, genre, and price.

These six fields in the song documents will be used as the sources for the index fields we configure for the song domain. Amazon CloudSearch index fields can be configured as text fields to enable free text search, literal fields to enable exact matching and faceting using text strings, or integer fields that can be used to search and filter on exact values and ranges of values. For more information about configuring options for index fields, see the "Guide to Configuring Index Fields for an Amazon CloudSearch Domain".

Creating SDF Batches

Once you determine what your search documents are going to be and what fields they will contain, you need to describe them according to the SDF schema. You can create SDF using either JSON or XML.

When creating SDF, you group your search documents in batches of 5 MB or less. Individual documents within a batch cannot exceed 1 MB. A batch can specify documents that you want to add or update, as well as documents that you want to remove. You upload each batch to your search domain separately.
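The size limits above can be enforced while you assemble batches. The following is a minimal sketch, assuming the batch is serialized with Python's json.dumps and that "5 MB" and "1 MB" mean 5 x 1024 x 1024 and 1024 x 1024 bytes; the function name split_into_batches and both constants are hypothetical and not part of Amazon CloudSearch.

```python
import json

MAX_BATCH_BYTES = 5 * 1024 * 1024  # assumed interpretation of the 5 MB batch limit
MAX_DOC_BYTES = 1 * 1024 * 1024    # assumed interpretation of the 1 MB document limit


def split_into_batches(docs, max_batch=MAX_BATCH_BYTES, max_doc=MAX_DOC_BYTES):
    """Yield lists of documents whose json.dumps encoding fits within max_batch bytes."""
    batch, size = [], 2  # 2 bytes for the enclosing "[" and "]"
    for doc in docs:
        encoded = len(json.dumps(doc).encode("utf-8"))
        if encoded > max_doc:
            raise ValueError(f"document {doc.get('id')!r} exceeds {max_doc} bytes")
        extra = encoded + (2 if batch else 0)  # json.dumps separates items with ", "
        if batch and size + extra > max_batch:
            yield batch
            batch, size = [], 2
            extra = encoded
        batch.append(doc)
        size += extra
    if batch:
        yield batch
```

Each list that the generator yields can then be serialized and uploaded as a separate batch.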

Should I Use JSON or XML?

When choosing between JSON and XML, a number of different factors come into play, including:

  • Your existing systems--if your application already uses JSON to represent all of the data you're moving around, it typically makes sense to stay with that.
  • Available export options--some systems make it easier to export data in one format over another. For example, the existing database in which the data resides might provide an easy way to export XML data.
  • The nature of the data--some data is easier to represent in one format than another. For example, if the target data consists of a string of XML, it is often easier to create the SDF as JSON to avoid encoding issues.
  • Storage and bandwidth issues--for large amounts of data, it might be more efficient to create, store, and upload SDF in JSON's more compact syntax.

Specifying Search Documents in SDF

Whether you are using JSON or XML, you must specify the following properties for each document that you want to add, update, or delete:

  • type--the action you want the system to take: add or delete. (To update an existing document, you use add and specify a version number that is greater than that of the document you want to replace.)
  • id--the unique document ID of the search document you are adding, updating, or deleting. The document ID identifies the search document within your search domain. The ID must be unique across all of the documents you upload to the domain and can contain the following characters: a-z (lowercase letters), 0-9, and the underscore character (_). Document IDs must start with a letter or number and can be up to 64 characters long.
  • version--an incremented version number. The version number is used to serialize document updates. Because a version number is specified with every add or delete request, you can publish document updates in parallel without partitioning by document ID or worrying about them being received out of sequence. Amazon CloudSearch will only apply an update if it has a higher version number. If the version number is lower than in a previously received update, the change is ignored. If the version number is the same as a previously received update, the results are undefined--there's no way to predict which update will take precedence.
  • lang--the language of the document. The current version of CloudSearch supports English only. Set the language to en for English. (The lang property is omitted when deleting a search document.)
  • fields--a collection of name-value pairs that contain the search document's data. Every search document must contain at least one field. The search document fields contain the data you want to be able to search, use as a filter, or return in the search results. Field names can contain the following characters: a-z (lowercase letters), 0-9, and the underscore character (_). Field names can be up to 64 characters long. The names "body", "docid", and "text_relevance" are reserved names and cannot be specified as field names. (The collection of fields is omitted when deleting a search document.)
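The naming rules for document IDs and field names can be checked before you build a batch. This is a minimal sketch of those rules as stated in this guide; the helper names are hypothetical, and the check that a field name cannot start with an underscore comes from the format-specific rules later in this guide.

```python
import re

# a-z, 0-9, and underscore; must start with a letter or digit; 1 to 64 characters
NAME_RE = re.compile(r"[a-z0-9][a-z0-9_]{0,63}")
RESERVED_FIELD_NAMES = {"body", "docid", "text_relevance"}


def is_valid_doc_id(doc_id):
    """Return True if doc_id satisfies the document ID rules described above."""
    return isinstance(doc_id, str) and NAME_RE.fullmatch(doc_id) is not None


def is_valid_field_name(name):
    """Return True if name follows the same character rules and is not reserved."""
    return is_valid_doc_id(name) and name not in RESERVED_FIELD_NAMES
```

For example, is_valid_doc_id("sogqtrz12a8c13c8b0") is True, while IDs containing capital letters or hyphens, or field names such as "body", are rejected.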

Version Numbers

To simplify generating and tracking version numbers, we recommend using a timestamp as the version number, such as Unix time (the number of seconds since January 1, 1970). This technique can be particularly useful if you are frequently updating documents. Keep in mind that whatever you use as a timestamp, it must fit within a 32-bit unsigned integer. For example, if you're using Unix time, you need to strip off the milliseconds and only use second resolution or the value will overflow the uint.
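As an illustration, a version number based on Unix time could be generated as follows. This is a sketch only; the function name version_from_unix_time is hypothetical.

```python
import time

UINT32_MAX = 2**32 - 1  # largest value that fits in a 32-bit unsigned integer


def version_from_unix_time(now=None):
    """Return Unix time in whole seconds, checked against the 32-bit unsigned range."""
    seconds = int(time.time() if now is None else now)  # drops fractional seconds
    if not 0 <= seconds <= UINT32_MAX:
        raise ValueError(f"{seconds} does not fit in a 32-bit unsigned integer")
    return seconds
```

A millisecond timestamp such as 1340828580000 is larger than 4294967295 and is rejected by this check, which is why only second resolution is used.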

Valid Values

SDF can contain only UTF-8 characters that are valid in XML, even if it is specified using JSON. Valid characters are the control characters tab (0009), carriage return (000D), and line feed (000A), and the legal characters of Unicode and ISO/IEC 10646. FFFE, FFFF, and the surrogate blocks D800-DBFF and DC00-DFFF are invalid and will cause errors. For more information, see Extensible Markup Language (XML) 1.0 (Fifth Edition).
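One way to keep invalid characters out of your SDF is to filter field values as you generate them. The sketch below removes any code point outside the character ranges that XML 1.0 allows; the function name is hypothetical, and whether to remove or replace such characters is a choice for your application.

```python
import re

# Matches any code point outside the XML 1.0 character ranges:
# tab, line feed, carriage return, 0020-D7FF, E000-FFFD, and 10000-10FFFF.
INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text):
    """Return text with characters that are not valid in XML 1.0 removed."""
    return INVALID_XML_CHARS.sub("", text)
```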

If you are creating your SDF documents using XML, and your data also contains XML, you must either encode your XML (replace < with &lt; and so on) or enclose your data in a CDATA section. We recommend using CDATA as the most straightforward solution.

Formatting SDF in JSON

In JSON, an SDF batch is an array of objects where each object represents an add or delete request for a search document. The array is enclosed in square brackets, [], and each document object is enclosed in curly braces, {}.

For example, we could use the following JSON to add two songs to our song domain:

[
  {
    "type": "add",
    "id": "sogqtrz12a8c13c8b0",
    "version": 1,
    "lang": "en",
    "fields": {
      "title": "I Can't Stop Loving You",
      "description": "A country standard performed live by Martina McBride.",
      "artist_name": "Martina McBride",
      "year": 2005,
      "price": 100,
      "genre": ["country", "pop", "ballad"]
    }
  },
  {
    "type": "add",
    "id": "sobhvzq12ac9618285",
    "version": 1,
    "lang": "en",
    "fields": {
      "title": "I'm Gonna Love You Through It",
      "description": "An emotional track written by Ben Hayslip, Jimmy Yeary, and Sonya Isaacs.",
      "artist_name": "Martina McBride",
      "year": 2011,
      "price": 100,
      "genre": ["country", "pop", "ballad"]
    }
  }
]

You must specify the type, id, and version properties for each search document in the batch. When adding a document, you must also specify the lang and fields properties. Each property is specified as a string: value pair, for example "type": "add".

  • The property name is always specified as a string. All string values must be enclosed in quotes.
  • The value of the type property is a string and must be either add or delete.
  • The value of the id property is a string that can be up to 64 characters and can contain a-z (lowercase letters), 0-9, and the underscore character (_). Document IDs cannot start with an underscore.
  • The value of the version property is a 32-bit unsigned integer. (Note that integer values are not enclosed in quotes--enclosing the version number in quotes will cause it to be treated as a string and will generate errors.)
  • The value of the lang property is a two-letter string that represents the language. Currently, en is the only supported value. (The lang property only needs to be specified when adding documents.)
  • The value of the fields property is an object that contains a collection of field: value pairs. Each document must contain at least one field. (The fields property only needs to be specified when adding documents.)
  • Field names can be up to 64 characters and can contain a-z (lowercase letters), 0-9, and the underscore character (_). Field names cannot start with an underscore.
  • Each field value can be a string, 32-bit unsigned integer, or an array of string or integer values. For example: "genre": ["country", "pop", "ballad"]. To store floating point values, you can multiply by a constant factor to convert the values to integers. For example, to store a price such as $3.99, you would store the value in pennies, rather than dollars: 399.
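These rules can be applied when building a JSON batch programmatically. The following sketch, with hypothetical helper names, builds one add request and one delete request and converts a decimal price to pennies as described above. It uses Decimal rather than float so that "3.99" becomes exactly 399.

```python
import json
from decimal import Decimal


def to_cents(price):
    """Convert a decimal price string such as "3.99" to an integer number of pennies."""
    return int((Decimal(price) * 100).to_integral_value())


def add_request(doc_id, version, fields, lang="en"):
    """Build an add request; version is passed as an int so it is not quoted in JSON."""
    return {"type": "add", "id": doc_id, "version": version, "lang": lang, "fields": fields}


def delete_request(doc_id, version):
    """Build a delete request; lang and fields are omitted."""
    return {"type": "delete", "id": doc_id, "version": version}


batch = [
    add_request(
        "sogqtrz12a8c13c8b0",
        2,
        {
            "title": "I Can't Stop Loving You",
            "price": to_cents("3.99"),
            "genre": ["country", "pop", "ballad"],
        },
    ),
    delete_request("sobhvzq12ac9618285", 2),
]
sdf_json = json.dumps(batch, indent=2)
```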

For more information about JSON syntax, see http://www.json.org/. For more information about the SDF JSON schema, see the Document Service API Reference in the Amazon CloudSearch Developer Guide.

Formatting SDF in XML

In XML, an SDF batch is contained in a <batch> element. The <batch> element contains a collection of <add> and <delete> elements where each element represents an add or delete request for a search document.

For example, we could use the following XML to add two songs to our song domain:

<batch>
  <add id="sogqtrz12a8c13c8b0" version="1" lang="en">
    <field name="title">I Can't Stop Loving You</field>
    <field name="description">
      A country standard performed live by Martina McBride.
    </field>
    <field name="artist_name">Martina McBride</field>
    <field name="year">2005</field>
    <field name="genre">country</field>
    <field name="genre">pop</field>
    <field name="genre">ballad</field>
  </add>
  <add id="sobhvzq12ac9618285" version="1" lang="en">
    <field name="title">I'm Gonna Love You Through It</field>
    <field name="description">
      An emotional track written by Ben Hayslip, Jimmy Yeary 
      and his wife, Sonya Isaacs.
    </field>
    <field name="artist_name">Martina McBride</field>
    <field name="year">2011</field>
    <field name="genre">country</field>
    <field name="genre">pop</field>
    <field name="genre">ballad</field>
  </add>
</batch>

A <batch> must contain at least one <add> or <delete> element. For an <add> element, you must specify the id, version, and lang attributes and at least one <field> element. For a <delete> element, you only need to specify the id and version attributes.

  • The value of the id attribute can be up to 64 characters and can contain a-z (lowercase letters), 0-9, and the underscore character (_). Document IDs cannot start with an underscore.
  • The value of the version attribute is a 32-bit unsigned integer.
  • The value of the lang attribute is a two-letter language code. Currently, en is the only supported value.

An <add> element must contain at least one <field> element. For each <field> element, you must specify the name attribute and a value. Field names can be up to 64 characters and can contain a-z (lowercase letters), 0-9, and the underscore character (_). Field names cannot start with an underscore. The value can be a string or 32-bit unsigned integer. To specify multiple values for a field, you specify multiple <field> elements with the same name attribute. For example:

  <field name="genre">country</field>  
  <field name="genre">pop</field>  
  <field name="genre">ballad</field> 
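If you generate XML batches from code, a standard XML library can handle the element structure and the encoding of special characters for you. The sketch below uses Python's xml.etree.ElementTree, which escapes characters such as < and & in field values when the batch is serialized. The helper name add_element and the markup inside the sample description are illustrative assumptions.

```python
import xml.etree.ElementTree as ET


def add_element(batch, doc_id, version, fields, lang="en"):
    """Append an <add> element; list values become repeated <field> elements."""
    add = ET.SubElement(batch, "add", id=doc_id, version=str(version), lang=lang)
    for name, value in fields.items():
        values = value if isinstance(value, list) else [value]
        for item in values:
            ET.SubElement(add, "field", name=name).text = str(item)
    return add


batch = ET.Element("batch")
add_element(
    batch,
    "sogqtrz12a8c13c8b0",
    1,
    {
        "title": "I Can't Stop Loving You",
        # Sample value containing markup, to show that it is escaped on output
        "description": "A <b>country</b> standard performed live by Martina McBride.",
        "year": 2005,
        "genre": ["country", "pop", "ballad"],
    },
)
ET.SubElement(batch, "delete", id="sobhvzq12ac9618285", version="2")
sdf_xml = ET.tostring(batch, encoding="unicode")
```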

For more information about the SDF XML schema, see the Document Service API Reference in the Amazon CloudSearch Developer Guide.

Validating SDF Batches

Once you've generated your first SDF batches, you can make sure that they are well-formed JSON or XML before attempting to upload them to your domain. To do that, you use a validation tool such as xmllint, the JSON Validator, or the W3C Markup Validation Service. This will tell you immediately if there are syntax problems or invalid characters in your SDF. Resolving these issues before you upload your SDF is faster than relying on the document service to report SDF errors and having to iterate on the upload process.
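A basic well-formedness check can also be scripted with a standard JSON or XML parser. The sketch below is a local alternative to the tools named above, not a replacement for them; the function name is hypothetical. It only confirms that the text parses and does not check SDF-specific rules such as required properties.

```python
import json
import xml.etree.ElementTree as ET


def check_well_formed(text, fmt):
    """Return None if text parses as JSON or XML, otherwise the parser's error message."""
    try:
        if fmt == "json":
            json.loads(text)
        elif fmt == "xml":
            ET.fromstring(text)
        else:
            raise ValueError(f"unknown format: {fmt!r}")
    except (json.JSONDecodeError, ET.ParseError) as exc:
        return str(exc)
    return None
```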

Sending Documents to Amazon CloudSearch

You can submit SDF batches to a search domain using the Amazon CloudSearch console, command-line tools, or the documents/batch API.

For example, you could use cURL to upload an SDF file called data.json through your domain's documents/batch endpoint:

curl -X POST --upload-file data.json --header "Content-Type: application/json" \
  doc-your-search-domain-name-yourrandomsearchdomainid.us-east-1.cloudsearch.amazonaws.com/2011-02-01/documents/batch

In addition to uploading SDF from a local file or an object in S3, you can programmatically generate SDF data and send it to your document endpoint without creating a physical file.
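For example, a batch held in memory could be posted to the documents/batch endpoint with Python's standard library. This is a sketch under stated assumptions: the http:// scheme is assumed because the cURL example gives no scheme, the function names are hypothetical, and you supply your own domain's document endpoint. Error handling and retries are omitted.

```python
import json
import urllib.request


def build_batch_request(endpoint, batch):
    """Build a POST request that carries a JSON SDF batch in its body."""
    body = json.dumps(batch).encode("utf-8")
    return urllib.request.Request(
        url=f"http://{endpoint}/2011-02-01/documents/batch",  # scheme is an assumption
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_batch(request):
    """Send the request and return the HTTP status code and the response body."""
    with urllib.request.urlopen(request) as response:
        return response.status, response.read().decode("utf-8")
```

You would call build_batch_request with your document endpoint host name and a list of add and delete requests, then pass the result to send_batch.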

For more information about sending data to a search domain, see Uploading Data in the Amazon CloudSearch Developer Guide.


Letting Amazon CloudSearch Do the Heavy Lifting: Generate SDF Automatically

Amazon CloudSearch provides an experimental tool called cs-generate-sdf that can convert many common types of files into SDF. One of the easiest ways to generate SDF is to pass comma-delimited data (CSV files) to the cs-generate-sdf command. This command creates one search document per row. The cs-generate-sdf command can also create search documents from the following types of files:

  • Adobe Portable Document Format (.pdf)
  • HTML (.htm, .html)
  • Microsoft Excel (.xls, .xlsx)
  • Microsoft PowerPoint (.ppt, .pptx)
  • Microsoft Word (.doc, .docx)
  • Text Documents (.txt)
  • JSON Documents (.json)
  • XML Documents (.xml)

For these file types, the document data is extracted as a single text field. Unlike CSV files, the internal structure of the document is ignored and the contents are not parsed into separate fields. For example, the cs-generate-sdf command does not parse XML tags into separate fields in the search document. (However, the document metadata is parsed as separate search document fields where available.)

The cs-generate-sdf command is part of the Amazon CloudSearch command line tools. For information about downloading and installing the tools, see the Command Line Tool Reference in the Amazon CloudSearch Developer Guide.
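To illustrate the idea of creating one search document per CSV row, here is a minimal Python sketch. It is not the cs-generate-sdf tool and makes no claim about how that tool works internally; the function name, the id_column parameter, and the choice to skip empty cells are all assumptions. Values are left as strings, so fields that should be integers would still need to be converted.

```python
import csv
import io


def csv_to_sdf(csv_text, id_column, version, lang="en"):
    """Create one add request per CSV row, using id_column as the document ID."""
    docs = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        doc_id = row.pop(id_column)
        # Skip empty cells so that no field is emitted without a value
        fields = {name: value for name, value in row.items() if value != ""}
        docs.append(
            {"type": "add", "id": doc_id, "version": version, "lang": lang, "fields": fields}
        )
    return docs
```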

Troubleshooting

If your SDF is not formatted correctly or contains invalid values, you will get errors when you attempt to upload it or use it to configure fields for your domain. Here are some common problems and their solutions:

Invalid JSON--if you are using JSON, the first thing to do is make sure there are no JSON syntax errors in your SDF batch. To do that, run it through a validation tool such as the JSON Validator. This will identify any fundamental issues with the data.

Invalid XML--SDF batches must be well-formed XML. You are especially likely to encounter issues if your fields contain XML data--the data must be XML-encoded or enclosed in CDATA sections. To identify any problems, run your SDF batch through a validation tool such as the W3C Markup Validation Service.

Not Recognized as SDF--if you are configuring your domain from SDF and Amazon CloudSearch doesn't recognize your data as valid SDF, it responds with a list of generic metadata fields:

content
content_encoding
content_language
content_type
language
resourcename

For example, this can happen if there are invalid document IDs or version numbers. Make sure that your SDF data contains all of the required properties for each document.

Document IDs with bad values--capital letters, hyphens, and other special characters are not allowed in document IDs. Document IDs can only contain the characters a-z (lowercase letters), 0-9, and underscore (_). Document IDs must start with a letter or number; they cannot start with an underscore.

Bad version numbers--version numbers must fit within a 32-bit unsigned integer. When specifying your SDF in JSON, make sure that the version number is not enclosed in quotes. If it is, the version is treated as a string and Amazon CloudSearch will reject the SDF as invalid.

Multi-valued fields without a value--when specifying SDF in JSON, you cannot specify an empty array as the value of a field. Multi-valued fields must contain at least one value.

Bad characters--one problem that can be difficult to detect if you do not filter your data while generating your SDF batch is that it can contain characters that are invalid in XML. Both JSON and XML batches can contain only UTF-8 characters that are valid in XML. You can use a validation tool such as the JSON Validator or W3C Markup Validation Service to identify invalid characters.
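Several of these problems can be caught with a small script before you upload. The sketch below checks a parsed JSON batch for missing properties, version numbers that are quoted or out of range, and fields without values. The function name find_problems is hypothetical, and the checks cover only the issues listed in this section.

```python
UINT32_MAX = 2**32 - 1
REQUIRED = {
    "add": ("type", "id", "version", "lang", "fields"),
    "delete": ("type", "id", "version"),
}


def find_problems(batch):
    """Return a list of human-readable problems found in a parsed JSON batch."""
    problems = []
    for i, doc in enumerate(batch):
        required = REQUIRED.get(doc.get("type"))
        if required is None:
            problems.append(f"document {i}: type must be 'add' or 'delete'")
            continue
        for prop in required:
            if prop not in doc:
                problems.append(f"document {i}: missing required property {prop!r}")
        if "version" in doc:
            version = doc["version"]
            # A quoted version parses as str; bool is excluded by the exact type check
            if type(version) is not int or not 0 <= version <= UINT32_MAX:
                problems.append(f"document {i}: version must be an unquoted 32-bit unsigned integer")
        fields = doc.get("fields")
        if isinstance(fields, dict):
            if not fields:
                problems.append(f"document {i}: at least one field is required")
            for name, value in fields.items():
                if value == []:
                    problems.append(f"document {i}: field {name!r} is an empty array")
    return problems
```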

Summary

Search Data Format (SDF) is the structured data format used to submit data to Amazon CloudSearch for indexing. SDF batches can be submitted as well-formed JSON or XML.

To structure your data for Amazon CloudSearch, start by thinking about the ways in which you expect users to search your data. Define search documents and fields based on the searches that you want to support, then use the source data from your primary store to populate the search document fields.

Before uploading the resulting SDF, use a validation tool such as the JSON Validator or the W3C Markup Validation Service to ensure that it's well-formed JSON or XML and doesn't contain characters that are invalid in XML. If you run into problems, make sure that:

  • Document IDs and field names conform to the naming restrictions.
  • Version numbers are specified as 32-bit unsigned integers.
  • At least one value is specified for each multi-valued field.

Amazon CloudSearch provides a suite of tools to help you generate, upload, and manage your data. For more information, see the Amazon CloudSearch Overview page and the Amazon CloudSearch Developer Guide.

©2014, Amazon Web Services, Inc. or its affiliates. All rights reserved.