AWS Open Source Blog

Open Source Korean Analyzer Support in Amazon Elasticsearch Service


Need for an improved Korean analyzer in Amazon Elasticsearch Service (AES)

Processing content for search in Asian languages such as Chinese, Japanese, and Korean presents unique challenges. Typical processing done by a search engine, such as tokenizing on whitespace and stemming to derive root forms, is not sufficient for these languages, because many words are compounds whose meaning depends on their compound context. AES has improved its support for Korean language processing by adding support for the widely used open source Korean language analyzer called the Seunjeon plugin. The launch announcement of Seunjeon plugin support was accompanied by a blog post describing the steps to use the plugin with AES. As part of this support, we made several optimizations to the plugin to reduce its memory footprint, so that it can run on instances with limited memory. This post covers the analysis behind the decision to support the Seunjeon plugin, as well as the details of the memory optimizations we made to reduce its heap utilization.

Choosing the best Korean analysis plugin for Amazon Elasticsearch Service

We evaluated several Korean analysis plugins before deciding on the Seunjeon plugin. Below we cover the main alternatives that were considered along with the pros and cons of each:

  1. elasticsearch-analysis-openkoreantext This plugin supports all Elasticsearch versions up to 6.1.1. It uses Twitter's open source open-korean-text module, which has a very powerful Korean analyzer and a built-in dictionary, and is licensed under the Apache License Version 2.0. The drawbacks of this plugin included a less active community and the lack of support for adding a custom inline dictionary for tokenization. Although it provides a way to add custom dictionaries as files, they must be placed inside the plugin folder at a pre-determined location that cannot be made dynamic, which makes adding a custom dictionary file difficult.
  2. elasticsearch-twitter-korean This plugin only supports Elasticsearch up to version 2.3.
  3. open-korean-text-elastic-search This plugin is obsolete and is no longer maintained.
  4. Seunjeon This plugin supports all Elasticsearch versions up to 6.1.1, and leverages mecab-ko-dic internally to perform text analysis. It is licensed under Apache License Version 2.0 and includes the ability to specify a custom dictionary inline, as well as a relative file path. A minor shortcoming of this plugin is the necessity to edit the plugin descriptor file to use it with the newer versions of Elasticsearch (> 5.4).

We chose the Seunjeon plugin because it has key functionality we need, such as custom dictionary support, and because it is being actively developed in the open source community.
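As an illustration of the custom dictionary support, the Seunjeon plugin lets you supply user words inline in the index settings when defining a tokenizer. The sketch below shows the general shape of such a configuration; the analyzer name and the example dictionary entries are hypothetical, so consult the plugin documentation for the exact options available in your version:

```json
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "seunjeon_custom_tokenizer": {
            "type": "seunjeon_tokenizer",
            "user_words": ["낙동강", "밀양강"]
          }
        },
        "analyzer": {
          "korean": {
            "type": "custom",
            "tokenizer": "seunjeon_custom_tokenizer"
          }
        }
      }
    }
  }
}
```

The inline `user_words` list is what distinguishes Seunjeon from the alternatives above, which either lack custom dictionaries entirely or require dictionary files at a fixed location inside the plugin folder.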

Optimizations done to the Seunjeon plugin

In this section, we cover the details of the heap analysis that was performed leading up to the heap optimizations.

Heap analysis

AES supports running Elasticsearch on smaller instance types like t2.small which has limited RAM (2 GB). In our testing of the Seunjeon plugin, we observed that it was consuming too much heap (~560 MB). This level of heap utilization can cause smaller instance types to run out of memory frequently. Hence, we decided to optimize the memory footprint of the plugin.

We used the Eclipse Memory Analyzer (MAT) to analyze the heap and identify the largest objects used by the plugin, using the "leak suspects report" provided by the tool, as shown in the screenshot below:

[Screenshot: "leak suspects report" provided by the MAT tool]

From the leak suspects report, the top contributor to the heap was an array of objects of the class "Morpheme" used by the plugin. We went on to check which classes use the array of "Morpheme" objects, using the details in the report, which show all outbound references to objects, as in the screenshot below:

[Screenshot: leak suspects report details showing outbound references to objects]

Detailed analysis of the report showed that the top member holding the array of Morpheme objects was the "lexiconDict" object, which in turn led us to check what was present within the "lexiconDict", as shown below:

[Screenshot: contents of the "lexiconDict" object]

As shown above, the “termDict” member of the “lexiconDict” object contained the array of Morpheme and was the predominant contributor to heap consumption (around 477 MB). Below we show the structure of the Morpheme class using the plugin source code in Scala:

@SerialVersionUID(1000L)
case class Morpheme(var surface:String,                       // surface form of the morpheme
                    var leftId:Short,                         // left context ID
                    var rightId:Short,                        // right context ID
                    var cost:Int,                             // cost used when scoring candidate analyses
                    var feature:mutable.WrappedArray[String], // metadata strings, e.g. part-of-speech tags
                    var mType:MorphemeType,                   // morpheme type ENUM
                    var poses:mutable.WrappedArray[Pos]) extends Serializable {

  //class logic goes here
}

The Morpheme class has a member variable named "feature", an array of strings, which is its largest member. Of each Morpheme object's ~664 bytes, the "feature" array alone occupies 520 bytes. We drilled down into the contents of the feature array as shown below:

[Screenshot: contents of the feature array]

With this level of granular analysis, we had sufficient insight into heap consumption to decide on optimizations.

Heap optimizations performed

The feature array of Morpheme includes metadata to store different characteristics of a word, e.g. whether the word is a noun or a pronoun. In the above screenshot example, you can see two feature arrays with the same word “NNP” repeated multiple times across the arrays, and even within the same array. Different copies of the same “NNP” string point to different memory locations, which means that a new string object is created for each word in the array.

Because strings are immutable, there is no need for multiple objects with the same underlying string; the references can all point to the same memory location. This led to our first optimization: string canonicalization, in which we identify duplicate strings and point them at a single copy (a single memory location). We built a cache for the strings, as shown below, and converted them to compressed UTF-8 byte arrays to reduce the space used.

// Cache mapping each string to its canonical compressed UTF-8 representation.
private static Map<String, byte[]> stringCache = new ConcurrentHashMap<String, byte[]>(1000);

public static byte[] compressStr(String str) {
    if (stringCache.containsKey(str)) {
        return stringCache.get(str);
    }
    final byte[] compressedStringBytes = compress(str.getBytes(UTF_8));
    // Bound the cache size so the cache itself cannot grow without limit.
    if (stringCache.size() >= 10000) {
        stringCache.clear();
    }
    stringCache.putIfAbsent(str, compressedStringBytes);
    return compressedStringBytes;
}
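To illustrate the canonicalization idea on its own (independently of the plugin code, whose compress() helper is internal), the minimal sketch below uses a hypothetical Canonicalizer class: equal strings resolve to one shared byte array, so duplicate occurrences cost only a reference rather than a full copy.

```java
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical illustration of string canonicalization: equal strings share
// a single canonical byte[] instead of each holding a private copy.
public class Canonicalizer {
    private static final ConcurrentMap<String, byte[]> CACHE = new ConcurrentHashMap<>();

    public static byte[] canonicalize(String s) {
        // computeIfAbsent stores the UTF-8 bytes once; subsequent calls with
        // an equal string return the same array reference.
        return CACHE.computeIfAbsent(s, k -> k.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        byte[] first = canonicalize("NNP");
        byte[] second = canonicalize("NNP");
        // Same reference: the duplicate costs only a pointer.
        System.out.println(first == second); // prints "true"
    }
}
```

The plugin's version additionally compresses the bytes and bounds the cache size, but the sharing mechanism is the same.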

After applying string canonicalization, the feature array elements looked like this:

[Screenshot: feature array elements after string canonicalization]

In the above screenshot, you can see that the repeated "NNP" entries in the string array now point to the same memory location. The index positions 0, 5, and 6 in the array, each with "NNP" as the value, all point to memory location 0x75ac86550. This optimization reduced the footprint for "NNP" to 48 bytes, down from the 144 bytes (3 × 48) needed to store the full value at three different index positions.

We also optimized other member variables of the Morpheme object. An example is "mType", an ENUM of type MorphemeType, declared in Scala as:

var mType:MorphemeType

We converted the ENUM to a byte, as shown below, because it has only four underlying values, which can be represented in a single byte. Before the optimization, each ENUM of type MorphemeType occupied 32 bytes, compared to one byte after the optimization.

var _mType: Byte
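The same technique can be sketched in Java: store the enum's ordinal as a byte and map it back to the enum constant on read. The constant names below are illustrative, not the plugin's actual MorphemeType values.

```java
// Illustrative stand-in for the plugin's MorphemeType; the constant names
// here are hypothetical.
enum MorphemeType { WORD, COMPOUND, INFLECT, PREANALYSIS }

public class EnumAsByte {
    // Encode: a byte suffices because there are only four constants.
    static byte toByte(MorphemeType t) {
        return (byte) t.ordinal();
    }

    // Decode: look the constant back up by its ordinal.
    static MorphemeType fromByte(byte b) {
        return MorphemeType.values()[b];
    }

    public static void main(String[] args) {
        byte b = toByte(MorphemeType.COMPOUND);
        System.out.println(fromByte(b)); // prints "COMPOUND"
    }
}
```

The trade-off is a cheap array lookup on each read in exchange for shrinking every stored value from a full object reference to one byte.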

Conclusion

After the optimizations described above, the heap utilized by the Seunjeon plugin dropped from 560 MB to 271.5 MB – an overall reduction of about 51%. Users of the Seunjeon plugin can take advantage of these heap optimizations both in the plugin included with Amazon Elasticsearch Service and in the standalone open source version. Amazon Elasticsearch Service offers the heap-optimized Seunjeon plugin out of the box. For the open source version, the pull request for our heap optimizations is at: https://bitbucket.org/eunjeon/seunjeon/pull-requests/11/optimizing-heap-utilization/diff. In the open source version, the heap-optimized build is currently the default only on machines where the heap size is <= 1 GB; otherwise, the heap optimizations are offered as an option. You can now get high-fidelity parsing and matching for Korean text using the Seunjeon plugin – with an optimized memory footprint.


Pallavi Priyadarshini

Pallavi is an Engineering Manager at Amazon Web Services, leading the design and development of high-performing search technologies. Prior to AWS, she led global teams in several analytics and database products and worked closely with enterprise customers on mission-critical applications.

Vengadanathan Srinivasan

Vengadanathan is a Software Development Engineer at Amazon Web Services. He likes working on distributed-systems problems. He is also a Java enthusiast and likes to optimize code to improve performance. In his free time, he likes to listen to music and read tech blogs.
