AWS Database Blog

Creating a Simple Autocompletion Service with Redis: Part One of Two

Darin Briskman (@briskmad) is a developer evangelist for Amazon Web Services.

So you’ve figured out how to use a cool framework to make a great-looking web form. You’ve made it super-easy for people to understand each field and how to fill it out. But there is still a lot of stuff to type. How can we make it even easier, especially for mobile devices where a lot of data entry is a challenge?

One easy thing to do is add an autocomplete function, which detects when characters are entered into a field and suggests which words or phrases are desired. All you need is a list of phrases, Amazon ElastiCache for Redis, and a little bit o’ code.

If you’re not a Redis expert, don’t worry: For the purposes of autocompletion, everything you’ll need to know about Redis is in this blog post. Because you’re using Amazon ElastiCache, all the rest of the work (installation, updates, data management, maintenance, and so forth) is automated and done for you—all you need to do is provide the data and fetch the data.

Using Redis Sorted Sets
What makes this fast and easy is the Redis Sorted Sets data type. This type is simply a nonrepeating collection of strings, with a score associated with each one. The strings are always stored in descending order of scores. When more than one string has the same score, the set is ordered lexicographically. Because the set is always kept in order, retrieving data from the beginning or the end or the middle is easy, using the Redis ZRANGE command.

If we put the names of a few countries into a Redis sorted set, we’ll get the following.
redis> zadd mylist 0 "United States"
(integer) 1
redis> zadd mylist 0 "Sweden"
(integer) 1
redis> zadd mylist 0 "Union of South Africa"
(integer) 1
redis> zadd mylist 0 "United Kingdom"
(integer) 1
redis> zrange mylist 0 -1
1. "Sweden"
2. "Union of South Africa"
3. "United Kingdom"
4. "United States"

The other Redis command we need is ZRANK, which can tell us where a string lies inside an ordered set.
redis> zrank mylist "United Kingdom"
(integer) 2

(Indexes in Redis always start at zero, so 2 is the third item on the list.)

And now that we know that United Kingdom is item 2 in the sorted set, we can easily find what comes before and after it.
redis> zrange mylist 1 2
1) "Union of South Africa"
2) "United Kingdom"
redis> zrange mylist 2 3
1) "United Kingdom"
2) "United States"

For autocompletion, we need a sorted set that includes the letters that someone would type. We also need a character to show that we’ve reached the end of the phrase. I’m going to use %.
redis> zadd mylist2 0 "United States%"
(integer) 1
redis> zadd mylist2 0 "United State"
(integer) 1
redis> zadd mylist2 0 "United Stat"
(integer) 1
redis> zadd mylist2 0 "United Sta"
(integer) 1
redis> zadd mylist2 0 "United St"
(integer) 1
redis> zadd mylist2 0 "United S"
(integer) 1
redis> zadd mylist2 0 "United "
(integer) 1
redis> zadd mylist2 0 "United"
(integer) 1
redis> zadd mylist2 0 "Unite"
(integer) 1
redis> zadd mylist2 0 "Unit"
(integer) 1
redis> zadd mylist2 0 "Uni"
(integer) 1
redis> zadd mylist2 0 "Un"
(integer) 1
redis> zadd mylist2 0 "U"
(integer) 1
redis> zrange mylist2 0 -1
1. "U"
2. "Un"
3. "Uni"
4. "Unit"
5. "Unite"
6. "United
7. "United "
8. "United S"
9. "United St"
10. "United Sta"
11. "United Stat"
12. "United State"
13. "United States%"

Now we can use code to loop through the entries to find the complete entry (ending with %).

Putting the pieces together
Let’s put it all together. First, we’ll set up an ElastiCache Redis node. If there are no repeating prefixes, this algorithm will create (Number of Items) x (Average Length of Each Item + 1) x (Average Length of each item) ÷ 2 entries in our sorted set. Unless you have a very weird set of items, there will be repetition among the prefixes, so the actual amount of memory needed will be smaller. For this example, countries.txt has 195 items that are an average of 10.4 characters each, including spaces (and the file’s encoded in UTF-8, so it’s one byte per character). So we’ll need no more than 11,560 bytes to store all the possible prefixes for this list. For more information on how to size your Redis node, see Selecting Your Redis Node Size in the ElastiCache User Guide.

This isn’t a lot of data, so we can use the very smallest cache.t2.micro node, which can run using the AWS Free Tier.

Start by going to the Amazon ElastiCache for Redis page and choosing Get Started or by signing on to your AWS Management Console and choosing ElastiCache.

Console

Then create a Redis cluster. For simplicity, we’ll avoid replication (though for a production instance, replication is a good idea!). Uncheck Enable Replication, name the cluster redis-autocomplete, and choose cache.t2.micro for node type. You can leave all the other settings at their defaults, then launch the cluster.

RedisParam
It can be handy to have redis-cli (the Redis command line interface) where you can test and debug your redis commands. If you’re anything like me, writing code is easy… debugging it to make it actually work, not so easy (sigh).

For the CLI, create a new Amazon EC2 instance (t2.micro will be enough) and log in to it. If you haven’t done this before, see Getting Started with Amazon EC2 Linux Instances for a guide on creating your EC2 instance and the needed security groups.

In theory, all you need to install is redis-cli, but there are a lot of dependencies. Redis is pretty small, so it’s easiest to just install the whole package.
sudo yum --enablerepo=epel install redis
To actually use the CLI, you’ll need to know the name of your ElastiCache instance. On your AWS Management Console, choose ElastiCache from the Services menu, then choose Cache Clusters.

CacheCluster

Then choose your node.

node

Look for the DNS name of your cache instance under Endpoint.

endpoint

Now from your EC2 instance, you can use redis-cli.
$ redis-cli -h <your ElastiCache endpoint DNS>
<Your ElastiCache endpoint DNS>.cache.amazonaws.com:6379>

Try using the INFO command at the redis> prompt to validate that all is working as it should.
When you’re done, use CTRL-D to end the interactive redis> prompt and return to the system shell.
In a future post, we’ll use AWS Lambda to make this completely serverless. For now, we’ll keep it simple, using Python 2.7 (the version included with most Linux distributions) on our EC2 node.

Testing our simple autocompletion service
Let’s do a test run. First, we need some code that reads our text file and parses it into prefixes. We will need to grab a file with a list of items for autocompletion—you can use your own or you can use the list of UN member countries I’ve posted online.
$ wget http://autocomplete.s3.amazonaws.com/Countries.txt
$ cp Countries.txt autocomplete.txt

We also need to install the Redis SDK for Python.
$ sudo pip install redis

Now that we have all of the prerequisites, we can use some code to load and parse the file. Use your favorite text editor (I’m old school, so I use vi) to create a file called ac_setup.py.

import redis
r = redis.StrictRedis(host='<your ElastiCache endpoint DNS>', port=6379, db=0)
f = open(‘autocomplete.txt',"r")
  for line in f:
    n = line.strip()
    for l in range(1,len(n)):
      prefix = n[0:l]
      r.zadd('autocomplete',0,prefix)
    r.zadd('autocomplete',0,n+"%")
else:
  exit

This code should pull in countries.txt and parse it into an ordered set of every unique prefix, with the full names identified by ending with %. Let’s execute the code.
$ python ac_setup.py

To check that this code worked, we can go back into the redis> interface and see what we have.
$ redis-cli -h <your ElastiCache endpoint DNS>
<Your ElastiCache endpoint DNS>.cache.amazonaws.com:6379> zrange autocomplete 0 -1
1) "A"
2) "Af"
3) "Afg"
4) "Afgh"
5) "Afgha"
6) "Afghan"
7) "Afghani"
8) "Afghanis"
9) "Afghanist"
10) "Afghanista"
11) "Afghanistan%"
12) "Al"
13) "Alb"
14) "Alba"
15) "Alban"
...

Now, we need the code to pull the potential autocompletion list from Redis. Again, use your favorite text editor (I’m still using vi!) to create a file named ac.py.

import sys
import redis
r = redis.StrictRedis(host=<your ElastiCache endpoint DNS>..cache.amazonaws.com', port=6379, db=0)
def complete(r,prefix,count):
  results = []
  grab = 42
  start = r.zrank('autocomplete',prefix)
  if not start:
    return []
  while (len(results) != count):
    range = r.zrange('autocomplete',start,start+grab-1)
    start += grab
    if not range or len(range) == 0:
      break
    for entry in range:
      minlen = min(len(entry),len(prefix))
      if entry[0:minlen] != prefix[0:minlen]:
        count = len(results)
        break
      if entry[-1] == "%" and len(results) != count:
        results.append(entry[0:-1])
  return results

def autoComplete():
  print complete(r,sys.argv[1],50)

if __name__ == "__main__":
  autoComplete()

This program needs one argument, which is the prefix we want to search. Try it out!
$ python ac.py U
['Uganda', 'Ukraine', 'United Arab Emirates', 'United Kingdom', 'United States', 'Uruguay', 'Uzbekistan']

$ python ac.py Uni
['United Arab Emirates', 'United Kingdom', 'United States']

Congratulations! We’ve now built a working recommendation engine. If you want to try this with a larger dataset, let’s use the list of all of the cities and settlements in the US (provided by the US Census Bureau).

Flush the data out of Redis.
$ redis-cli -h <your ElastiCache endpoint DNS>
<Your ElastiCache endpoint DNS>.cache.amazonaws.com:6379>FLUSHDB
OK

Then double-check to make sure it’s gone.
<Your ElastiCache endpoint DNS>.cache.amazonaws.com:6379> zcard autocomplete
(integer) 0

Use CTRL-D to exit the Redis shell.
re<your ElastiCache Endpoint DNS>.cache.amazonaws.com:6379>^D
$

Then load the cities data.
$ wget http://autocomplete.s3.amazonaws.com/Cities.txt
$ cp Cities.txt autocomplete.txt
$ python ac_setup.py

This test will take a bit longer to run, because there are 15,637 items on the list to be parsed, which will take a few minutes on our micro family instance. Once done, we can now search for cities.
$ python ac.py Chi
['Chicago Heights, Illinois', 'Chicago Park, California', 'Chicago Ridge, Illinois', 'Chicago, Illinois', 'Chichester, New Hampshire', 'Chickamauga, Georgia', 'Chickasha, Oklahoma', 'Chico, California', 'Chicopee, Massachusetts', 'Chiefland, Florida', 'Childersburg, Alabama', 'Childress, Texas', 'Chilhowee, Missouri', 'Chillicothe, Illinois', 'Chillicothe, Missouri', 'Chillicothe, Ohio', 'Chilmark, Massachusetts', 'Chiloquin, Oregon', 'Chilton, Texas', 'Chilton, Wisconsin', 'Chimacum, Washington', 'China Spring, Texas', 'Chincoteague Island, Virginia', 'Chinese Camp, California', 'Chinle, Arizona', 'Chino Hills, California', 'Chino Valley, Arizona', 'Chino, California', 'Chinook, Montana', 'Chipley, Florida', 'Chippewa Falls, Wisconsin', 'Chisago City, Minnesota', 'Chisholm, Minnesota', 'Chittenango, New York', 'Chittenden, Vermont']

$ python ac.py Chic
['Chicago Heights, Illinois', 'Chicago Park, California', 'Chicago Ridge, Illinois', 'Chicago, Illinois', 'Chichester, New Hampshire', 'Chickamauga, Georgia', 'Chickasha, Oklahoma', 'Chico, California', 'Chicopee, Massachusetts']

In part two of this post, we’ll explore using AWS Lambda and Amazon S3 to automate loading the data and running the code, plus look at ways to integrate this autocompletion service into a webpage.

 

I hope you found this fun and useful! If you’d like more on this Redis pattern, this blog post was inspired by Salvatore “antirez” Sanfilippo, Redis’ creator, who wrote about Redis autocompletion on his blog. Amazon ElastiCache for Redis is a quickly evolving service, and you can always get the latest updates at the Amazon ElastiCache for Redis marketing website.