AWS Open Source Blog

Run Rally with Open Distro for Elasticsearch

It’s hard to size and scale an Elasticsearch cluster. You need sufficient storage for your data, but your mappings and the contents of the data largely determine its size on disk. You need capacity for your queries and updates, but the amount of CPU, JVM heap, disk, and network bandwidth you use depends critically on the queries you run and the updates you send. No formula is guaranteed to get it right; instead, you deploy, monitor, and adjust.

To test your deployment, you can play back your own logs and data, or you can use an automated tool. Elastic has created Rally, a performance testing framework for Elasticsearch which you can use to generate a simulated load for your Open Distro for Elasticsearch cluster. Through testing, you can ensure that your cluster is sized correctly and its performance is within your desired specifications.

Rally is full-featured, letting you run your own tests against your Elasticsearch cluster. Rally “tracks” are different types of benchmarks that you can run to exercise and measure your workload. You can add new tracks to extend Rally for custom workloads, or use one of the pre-configured tracks. In this blog post, you will run Rally on an Open Distro for Elasticsearch instance and measure its performance.

Set up

You can follow our documentation to set up Open Distro for Elasticsearch via RPM or you can follow the instructions in our prior post to set up Open Distro for Elasticsearch with Docker Desktop.

Rally is written in Python. You need to install Python 3 and pip3 on the host where you plan to run Rally.
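
Before installing, it can help to confirm that both tools are on your PATH. The following is a minimal sketch; require_cmd is a helper name invented for this post and is not part of Rally.

```shell
# require_cmd is a hypothetical helper for this post, not part of Rally.
# It reports whether a command is available on the PATH.
require_cmd() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "found: $1"
  else
    echo "missing: $1" >&2
    return 1
  fi
}

# Check the two prerequisites without aborting the shell if one is missing.
for tool in python3 pip3; do
  require_cmd "$tool" || true
done
```

If either tool is reported as missing, install it with your platform's package manager before continuing.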

To install Rally, run:

$ pip3 install esrally

Rally supports custom configurations that let you fine-tune many aspects of your test, including the location of your benchmarking directory, the tests you want to run, the level of forensic data to store, and more. You can read the Rally docs to find out how to customize your tests. For this post, start with the default configuration:

$ esrally configure

Rally will set up configuration, data, and log directories in your home directory under the .rally directory.

Run your first race

By default, Rally creates an Elasticsearch cluster to test. You already have an Open Distro for Elasticsearch cluster running. You can use the --pipeline benchmark-only command line parameter to point Rally at your existing cluster instead.

Rally’s tracks specify test configurations. There are twelve pre-configured tracks, each complete with data sets. You can see these tracks with the command esrally list tracks. For your first test, you’ll use the nyc_taxis track.

The --client-options parameter is where you specify credentials that allow Rally to authenticate against the security plugin. I’ve used the default credentials for the admin user (which I haven’t changed, though best practice is to change them). I also specified verify_certs:false, since Rally will otherwise reject the demo certificate.

$ esrally --pipeline benchmark-only --track=nyc_taxis --challenge append-no-conflicts-index-only --target-hosts=localhost:9200 --client-options="use_ssl:true,basic_auth_user:'admin',basic_auth_password:'admin',verify_certs:false"

The above command assumes you are running Open Distro for Elasticsearch’s Security plugin with basic auth and SSL transport. Be sure to replace the username and password with your own.
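
One way to avoid hardcoding credentials in the command is to assemble the --client-options value from environment variables. This is a sketch rather than anything Rally provides: client_options, rally_with_auth, ES_USER, and ES_PASSWORD are names invented for this example, and it assumes the user name and password contain no commas or single quotes.

```shell
# Build the --client-options value from a user name ($1) and a password ($2).
# Hypothetical helper; the option keys match the command shown in this post.
client_options() {
  printf "use_ssl:true,basic_auth_user:'%s',basic_auth_password:'%s',verify_certs:false" "$1" "$2"
}

# Run esrally with whatever flags you pass in, plus the assembled client options.
# ES_USER and ES_PASSWORD are hypothetical variable names; when they are unset,
# the function falls back to the demo admin credentials.
rally_with_auth() {
  esrally "$@" --client-options="$(client_options "${ES_USER:-admin}" "${ES_PASSWORD:-admin}")"
}
```

Export ES_USER and ES_PASSWORD, then call rally_with_auth with the same flags as the command above, leaving out --client-options.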

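Rally will download a portion of the New York City Taxi and Limousine Trip Record data set and send the data to Elasticsearch. Challenges that include queries also run the corresponding search workloads; the append-no-conflicts-index-only challenge used here only indexes. At the end, Rally prints a summary of the metrics it collected, such as indexing times, query latencies, and cluster metrics.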
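
If you want to post-process the summary, Rally can also write it to a file with its --report-format and --report-file flags. The sketch below filters a CSV report by metric name. It assumes the metric name is the first comma-separated column and that fields are not quoted, so check the header line of your own file first; metric_rows and the report.csv file name are invented for this example.

```shell
# metric_rows METRIC FILE
# Hypothetical helper: print the rows of a CSV report whose first column
# equals METRIC. Assumes unquoted, comma-separated fields.
metric_rows() {
  awk -F',' -v metric="$1" '$1 == metric' "$2"
}

# Example, after a race run with --report-format=csv --report-file=report.csv:
#   metric_rows "Min Throughput" report.csv
```

Keeping one report file per race makes it easier to compare runs after you change instance types or shard counts.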

Where to go from here

You can customize Rally’s tracks, or create your own track to load your data and run your queries against Open Distro for Elasticsearch. With solid testing, you can dial in your cluster’s instance types and your shard strategy for your workload. With the right setup, Open Distro for Elasticsearch will be easier to run and manage.

Have an issue or question? Want to contribute? You can get help and discuss Open Distro for Elasticsearch on our forums. You can file issues here.

Atri Sharma

Query Processing and Search Geek, hacking OSS since 2012. Worked on a variety of data engines, now hacking Elasticsearch and Lucene.

Jon Handler

Jon Handler (@_searchgeek) is a Principal Solutions Architect at Amazon Web Services based in Palo Alto, CA. Jon works closely with the CloudSearch and Elasticsearch teams, providing help and guidance to a broad range of customers who have search workloads that they want to move to the AWS Cloud. Prior to joining AWS, Jon's career as a software developer included four years of coding a large-scale eCommerce search engine. Jon holds a Bachelor of Arts from the University of Pennsylvania, and a Master of Science and a Ph.D. in Computer Science and Artificial Intelligence from Northwestern University.