What is document search?
Document search is search that works primarily on unstructured free text, whether or not that text lives in a formal document. Whether you look for a web page, find a product, or work with curated content, you use a search engine to do it. You go to a web page, type a query in the search box, and click “search”. The engine then returns items that are (hopefully) relevant to your information goal.
Search engines grew out of database technology – they store data, and they process queries against that data. Traditional databases work primarily with structured content – data is organized into tables and columns, with a built-in schema. The database’s job is to retrieve all the rows of data that match a query on the values in the columns. Search engines work with semi-structured data (documents), which contain both metadata and large blocks of unstructured text (free text). Search engines use linguistic rules to break up these large text blocks into matchable terms. Search engines also come with a built-in ranking capability to order the results and bring the best to the top. Where relational and NoSQL databases retrieve all results, search engines retrieve the best results.
Applications of search engines break down into three big categories: document search, which works primarily on unstructured free text; e-commerce search, which works on a mix of structured and unstructured data; and query offloading, which works mostly on structured data.
Does document search work with metadata?
In document search, you search the main document, which can be as small as a paragraph or as large as thousands of pages. Documents also carry a variety of other fields: unstructured text fields (title and summary), semi-structured fields (author), and structured fields (publication date, originating group, category). Together, these fields are the metadata. The search engine handles a mix of text and metadata in user queries.
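As a minimal sketch of a query that mixes free text with metadata, the following Python example models a document with the field types described above and filters on both. The `Document` class, the `search` function, and the sample records are hypothetical and only illustrate the idea; a real search engine would do this through its own query language.

```python
import re
from dataclasses import dataclass
from datetime import date


@dataclass
class Document:
    title: str        # unstructured text field
    summary: str      # unstructured text field
    body: str         # the main document text
    author: str       # semi-structured field
    published: date   # structured metadata
    category: str     # structured metadata


def words(text):
    """Lowercase the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def search(docs, text, category=None, published_after=None):
    """Return docs that contain every query word and pass the metadata filters."""
    query_words = words(text)
    hits = []
    for doc in docs:
        doc_words = words(f"{doc.title} {doc.summary} {doc.body}")
        if not query_words <= doc_words:
            continue  # at least one query word is missing from the text fields
        if category is not None and doc.category != category:
            continue
        if published_after is not None and doc.published <= published_after:
            continue
        hits.append(doc)
    return hits


docs = [
    Document("Knee surgery outcomes", "Trial summary",
             "Recovery after knee surgery in adults",
             "A. Rivera", date(2023, 5, 1), "clinical"),
    Document("Knee injury claims", "Brief",
             "Legal brief on knee injury compensation",
             "B. Chen", date(2021, 3, 10), "legal"),
]

legal_hits = search(docs, "knee", category="legal")
recent_hits = search(docs, "knee", published_after=date(2022, 1, 1))
```

Here the text part of the query decides which documents match at all, and the metadata arguments narrow that set further.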
What are the main challenges of document search?
The main challenges of document search fall into two areas – data preparation and ingestion, and search relevance.
In document search use cases, the body of documents (corpus) originates from user-generated or other uncurated content. This content usually contains typos or other errors, repetitions, and nonsense data. Before loading this data into a search engine, you need to curate, cleanse, and normalize the data. After the data is prepared, you need to load that data into the engine (by calling the ingestion APIs). Finally, you need a process to update the documents as they change.
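The preparation step can be sketched in a few lines of Python. This is an illustrative example under simple assumptions, not a complete pipeline: the `normalize` and `prepare` functions and the `id` and `body` keys are hypothetical, duplicates are detected only by exact (case-insensitive) match, and the call to the engine's ingestion API is left out because it depends on the engine you use.

```python
import re
import unicodedata


def normalize(text):
    """Normalize Unicode, drop control characters, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    text = "".join(ch for ch in text if ch.isprintable() or ch.isspace())
    return re.sub(r"\s+", " ", text).strip()


def prepare(raw_docs):
    """Yield cleaned documents, skipping empty bodies and exact repetitions."""
    seen = set()
    for doc in raw_docs:
        body = normalize(doc.get("body", ""))
        if not body:
            continue  # nothing searchable left after cleaning
        key = body.casefold()
        if key in seen:
            continue  # same text already queued for ingestion
        seen.add(key)
        yield {"id": doc["id"], "body": body}


raw_docs = [
    {"id": 1, "body": "  Hello\u00a0 world\n"},
    {"id": 2, "body": "hello world"},
    {"id": 3, "body": "   "},
]
cleaned = list(prepare(raw_docs))
```

Each cleaned record would then be sent to the engine's ingestion API, and the same function can be rerun on documents that change.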
The core value of document search is to retrieve documents that are relevant to the user’s query – search relevance. During retrieval, the search engine scores and sorts all matching documents via a statistical measure (BM25). BM25 combines how rare each search term is across the corpus with how often that term appears in a matching document. A document that matches more of the rarer query terms, more often, receives a higher score. You typically need to tune the scoring function for your particular data set, and machine learning (ML) techniques can help you improve your ranking further. The search is only as good as the relevance of the documents it retrieves, and you want the best.
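To make the scoring idea concrete, here is a small Python sketch of one common formulation of BM25. It assumes whitespace tokenization, a non-empty corpus, and the frequently used defaults `k1=1.2` and `b=0.75`; production engines differ in tokenization and in details of the formula, so treat the numbers as illustrative only.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document in `docs` against `query` with BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Number of documents that contain each term at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq[term]
            if tf == 0:
                continue
            # Rarer terms (low document frequency) get a higher weight.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Repeated terms help with diminishing returns; long documents are damped.
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


docs = [
    "the red sofa",
    "the blue sofa and the blue chair",
    "the green lamp",
]
scores = bm25_scores("blue sofa", docs)
```

In this toy corpus the second document ranks first because it matches both query terms, including the rarer term "blue", while the third document matches nothing and scores zero.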
What are other search use cases?
E-commerce search
You go to an e-commerce site to find and buy products from a catalog of available products. These products comprise many metadata fields – size, color, brand, and so on – along with longer fields like title, product description, and reviews. The engine’s primary job is to retrieve the most relevant results, which brings revenue. Site designers employ many tools to build a good relevance function, from numerical values embedded in the product data to ML models based on user behavior.
To improve the end-user experience, e-commerce sites frequently add faceted search. The engine provides a bucketed count for the values in various fields (size, color, and so on), and the UI presents those values as a clickable list that the user can use to narrow the results.
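The bucketed counts behind a facet list can be sketched as follows. The `facet_counts` and `narrow` helpers and the sample products are hypothetical; a search engine typically computes these counts as part of the query rather than in application code.

```python
from collections import Counter


def facet_counts(results, fields):
    """Count how many results carry each value of each facet field."""
    return {
        field: Counter(item[field] for item in results if field in item)
        for field in fields
    }


def narrow(results, field, value):
    """Keep only the results whose facet field has the clicked value."""
    return [item for item in results if item.get(field) == value]


results = [
    {"name": "tee", "size": "M", "color": "red"},
    {"name": "tee", "size": "M", "color": "blue"},
    {"name": "dress", "size": "L", "color": "red"},
]

facets = facet_counts(results, ["size", "color"])
red_only = narrow(results, "color", "red")
```

The UI would render `facets` as the clickable list, and a click on "red" would rerun the search with the equivalent of `narrow`.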
Some types of e-commerce search depend heavily on personalization and recommendations. If a shopper searches for “dresses”, the search engine should find dresses that the customer might be interested in, even though the query itself is very open-ended. Similarity search techniques like k-nearest neighbor (k-NN) help with that.
Curated data set search
Curated data set search covers collections like an enterprise document repository (clinical trial data, legal briefs, real estate, and so on). Search engines contain linguistic rules and other language-specific features that help them break down large blocks of text into component terms (words from a field or large block of text) for matching. The engine’s rich query language lets you search these large blocks of text for combinations of terms, like “long sleeveless dress”. But the engine doesn’t retrieve everything that matches: it uses relevance scoring to rank and sort documents and return only the best matches.
Search engines contain specialized data structures to facilitate high-volume, low-latency search. The most important of these structures is the inverted index, which maps individual terms to a list of documents that contain those terms. Because of these data structures, search engines outperform relational databases for text query processing. The trade-off is that search engines are not relational. It’s common to run a relational database and a search engine in tandem. You use the relational database to serve application data, and the search engine to provide low-latency, relevant search across that data.
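A toy inverted index shows why term lookups are fast: the work of scanning text happens once at indexing time, and a query only intersects precomputed sets. This sketch assumes a very simple tokenizer (lowercase alphanumeric runs) in place of real linguistic rules, and the function names are illustrative.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def lookup(index, query):
    """Return the ids of documents that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


docs = {
    1: "Long sleeveless dress",
    2: "Long sleeve shirt",
    3: "Sleeveless summer dress",
}
index = build_index(docs)
matches = lookup(index, "sleeveless dress")
```

This lookup only finds matches; a real engine would go on to score and rank them.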
Who builds document search?
Building a rich, delightful search experience requires many job functions. Developers integrate a search solution, create a search interface, and understand how to structure the data to get the best search results. Product managers deliver requirements for metadata structure and search interface user experiences. Data scientists curate source data, and track and analyze user behavior. Executives set business KPIs, which guide the product and development teams in meeting the business goals for the engine.
What is the future of document search?
Search engines have been optimized to match terms. Searching for “8-foot sofa” should bring you results that are 8-foot sofas, and it does that by matching “8”, “foot”, and “sofa”. This is keyword search. In many cases, searchers don’t know the exact terms they are looking for and want to search by meaning. This is semantic search, and it is at the frontier of search and ML technologies. With semantic search you use queries like “comfy place to sit by the fire” to retrieve items like an 8-foot sofa.
Semantic search requires ML techniques. You must build a vector space of items and queries and then use vector similarity calculations to find items that are close in that space. With vector search, a document doesn’t need any words or synonyms in common with a query to be relevant. For example, a search on “bicycle maintenance” could match a document on “derailleur lubrication”, because the ML algorithm knows that “derailleur lubrication” often appears close to discussions of bicycle maintenance.
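The similarity calculation itself is simple once the vectors exist. In the sketch below, the three-dimensional vectors are hand-written stand-ins chosen for illustration; in a real system an ML embedding model would produce them, usually with hundreds of dimensions, and the engine would use an approximate nearest-neighbor index instead of sorting every item.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest(query_vec, item_vecs, k=2):
    """Return the names of the k items closest to the query vector."""
    ranked = sorted(
        item_vecs,
        key=lambda name: cosine(query_vec, item_vecs[name]),
        reverse=True,
    )
    return ranked[:k]


# Made-up vectors: the first axis loosely stands for "bicycle care",
# the last for "furniture". A trained model would learn such structure.
item_vecs = {
    "derailleur lubrication": [0.9, 0.1, 0.0],
    "chain cleaning": [0.8, 0.2, 0.1],
    "sofa upholstery": [0.0, 0.1, 0.9],
}
query_vec = [1.0, 0.0, 0.0]  # stand-in for the query "bicycle maintenance"

top = nearest(query_vec, item_vecs, k=2)
```

Note that the query and the best match share no words; the match comes entirely from closeness in the vector space.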
How can you make your search results better?
The key to effective document and e-commerce search is relevance: do the search results meet the searcher's needs? Search engines attempt to put the best results on top using a variety of techniques. This is called relevance ranking. Databases return everything that matches, whereas search engines are optimized for scoring relevant items.
- Your search can span multiple fields with differential weighting. For example, if you search a movie database, you may want to span fields like title, director, and actor, and give title matches more weight than actor matches.
- Consider adjusting your search results for freshness. Add a release date field to your index and an exponential decay function based on that date to your score function.
- Consider adding facets or filters to your search results to help your users drill down through specific elements. Many document search systems support faceting on metadata, typically presented as categories along the left side of the search result page.
- Consider adding synonyms. Synonyms can help your end users find the results they are looking for. In clothing, a tee is a T-shirt or teeshirt. Your end users should find the same results whether they search for “tee” or “t-shirt”. Adding synonyms can return these results.
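Three of the techniques above (field weighting, freshness decay, and synonyms) can be sketched together in Python. Everything here is an assumption made for illustration: the weights, the five-year half-life, the synonym group, and the sample movie records are invented, and a real engine would expose these as query or index settings rather than application code.

```python
# Hypothetical weights: a title match counts more than an actor match.
FIELD_WEIGHTS = {"title": 3.0, "director": 2.0, "actor": 1.0}

# Hypothetical synonym groups: every term in a group is treated as equivalent.
SYNONYM_GROUPS = [{"tee", "t-shirt", "teeshirt"}]


def expand(term):
    """Return the term together with any synonyms configured for it."""
    term = term.lower()
    for group in SYNONYM_GROUPS:
        if term in group:
            return set(group)
    return {term}


def field_score(doc, term):
    """Sum the weights of all fields that contain the term as a whole word."""
    term = term.lower()
    return sum(
        weight
        for field, weight in FIELD_WEIGHTS.items()
        if term in doc.get(field, "").lower().split()
    )


def freshness(release_year, current_year, half_life_years=5.0):
    """Exponential decay: the multiplier halves every `half_life_years`."""
    age = max(0, current_year - release_year)
    return 0.5 ** (age / half_life_years)


def score(doc, term, current_year):
    """Combine the weighted field match with the freshness multiplier."""
    return field_score(doc, term) * freshness(doc["year"], current_year)


# Invented sample records.
movie_a = {"title": "Harbor Lights", "director": "Ana Ruiz",
           "actor": "Lee Harbor", "year": 2020}
movie_b = {"title": "Night Train", "director": "Lee Park",
           "actor": "Sam Harbor", "year": 2020}

score_a = score(movie_a, "harbor", current_year=2025)
score_b = score(movie_b, "harbor", current_year=2025)
```

Both movies match "harbor", but `movie_a` ranks higher because the match is in the title as well as the actor field. Both scores are halved by the freshness multiplier because the releases are one half-life old.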
How are customers using document search?
Document search spans many different applications.
- E-commerce sites use document search to retrieve products that their users want to buy.
- Photo sites use document search to find photos based on metadata like title and description, or even based on matching image vectors.
- Legal users use document search to find relevant case law.
- Doctors use document search to find drugs for their patients’ conditions.
- Customer Relationship Management (CRM) solutions use document search to retrieve notes, interactions, and customers to target.
When you need to find something, use a search engine!