AWS Open Source Blog
Introducing DenseClus, an open source clustering package for mixed-type data
Today we announce the alpha release of DenseClus, an open source package for clustering high-dimensional, mixed-type data. DenseClus uses the uniform manifold approximation and projection (UMAP) and hierarchical density-based spatial clustering of applications with noise (HDBSCAN) algorithms to arrive at a clustering solution for both categorical and numerical data. With DenseClus, you provide a dataframe, and it generates homogeneous clusters with no need for extensive preprocessing or worrying about how to treat categorical features. This capability opens a wide range of use cases, from customer segmentation in marketing to mapping cells in biomedicine.
All the software in the DenseClus project is released under the MIT license. We invite you to check out the code for DenseClus on GitHub and join the community.
What is DenseClus?
Clustering is a hard problem because there is never truly a “right” answer when labels are unknown. To complicate matters further, there is no free lunch for clustering algorithms. Even if an algorithm fits one dataset well, there is no guarantee that it will work as well on a different dataset. Likewise, clustering is noted as being “strongly dependent on contexts, aims and decisions of the researcher,” which adds fuel to the argument that there is no such thing as a “universally optimal method that will just produce natural clusters” (refer to “What Are True Clusters?” by Christian Hennig).
Moreover, clustering techniques that generalize well, such as KMeans, assume that data is numerical and that clusters are roughly spherical. High-dimensional data of mixed types also presents challenges for the downstream clustering task, because classical dimensionality reduction methods such as principal component analysis (PCA) do not work when categorical values are included. This leaves the practitioner in a conundrum, where a specific featurization scheme must be chosen, such as including only numerical values, or transforming all values to categorical and then using multiple correspondence analysis (MCA) instead.
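To make the two workarounds concrete, here is a minimal sketch in plain Python. It is not part of DenseClus; the helper names and the equal-width binning are assumptions chosen for illustration. One function keeps only the numerical columns, and the other bins every numerical column so that all features become categorical (the form a method like MCA expects).

```python
from numbers import Number


def split_columns(rows):
    """Return (numeric_cols, categorical_cols), judged from the first record."""
    first = rows[0]
    numeric = [
        key for key, value in first.items()
        if isinstance(value, Number) and not isinstance(value, bool)
    ]
    categorical = [key for key in first if key not in numeric]
    return numeric, categorical


def numeric_only(rows):
    """Scheme 1: drop every categorical column."""
    numeric, _ = split_columns(rows)
    return [{key: row[key] for key in numeric} for row in rows]


def all_categorical(rows, n_bins=3):
    """Scheme 2: replace each numerical value with an equal-width bin label."""
    numeric, _ = split_columns(rows)
    out = [dict(row) for row in rows]
    for col in numeric:
        values = [row[col] for row in rows]
        low, high = min(values), max(values)
        # Fall back to a width of 1.0 when the column is constant.
        width = (high - low) / n_bins or 1.0
        for new_row, value in zip(out, values):
            index = min(int((value - low) / width), n_bins - 1)
            new_row[col] = f"{col}_bin{index}"
    return out
```

Either scheme throws information away: the first ignores the categorical features entirely, and the second loses the ordering and spacing of the numerical ones.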
DenseClus seeks both to remove the difficulty of finding a default clustering algorithm and to circumvent the difficulties that arise when data is of mixed types. DenseClus uses a combination of UMAP and HDBSCAN to map mixed-type data into a dense, lower-dimensional space. From this dense space, it then builds groups hierarchically into clusters based on the density of points. This approach makes DenseClus an easy-to-use solution that can be applied to a wide variety of data to find meaningful clusters.
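To give a feel for what grouping by density means, the sketch below implements a toy DBSCAN-style routine in plain Python. This is a deliberately simpler stand-in and not what DenseClus runs: HDBSCAN builds a hierarchy over many density levels, whereas this version uses a single fixed radius. The function name and the `eps` and `min_pts` parameters are illustrative assumptions, and the input points stand in for data that has already been embedded in a low-dimensional space.

```python
def density_clusters(points, eps, min_pts):
    """Label each point with a cluster id, or -1 for noise (DBSCAN-style)."""
    labels = [None] * len(points)
    eps_sq = eps * eps

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [
            j for j, other in enumerate(points)
            if sum((a - b) ** 2 for a, b in zip(points[i], other)) <= eps_sq
        ]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # j is a core point, keep expanding
    return labels
```

Points in sparse regions are left as noise rather than forced into a cluster, which is one reason density-based methods suit data where not every record belongs to a meaningful group.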
Getting started with DenseClus
DenseClus is registered on PyPI, and the code is available on GitHub. The easiest way to install it is with pip, for Python 3.7 or 3.8:
python3 -m pip install Amazon-DenseClus
DenseClus requires a pandas DataFrame as input with both numerical and categorical columns. All preprocessing and extraction are done under the hood; call the fit function and then retrieve the clusters.
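As a rough illustration of that workflow, the sketch below builds a small synthetic dataset and passes it to DenseClus. The column names and the two helper functions are hypothetical. We also assume that the alpha release returns the cluster labels from `score()`; check the GitHub repository for the accessor in the version you install.

```python
import random


def make_example_rows(n_rows=1000, seed=0):
    """Build synthetic mixed-type records (hypothetical columns)."""
    rng = random.Random(seed)
    return [
        {
            "age": rng.randint(18, 80),
            "monthly_spend": round(rng.uniform(5.0, 500.0), 2),
            "plan": rng.choice(["basic", "plus", "premium"]),
            "region": rng.choice(["north", "south", "east", "west"]),
        }
        for _ in range(n_rows)
    ]


def cluster_rows(rows):
    """Fit DenseClus on the records and return one cluster label per row."""
    # Imported here so the data helper above works without these packages.
    import pandas as pd
    from denseclus import DenseClus

    df = pd.DataFrame(rows)  # numerical and categorical columns together
    clf = DenseClus()
    clf.fit(df)
    # Assumption: score() is the label accessor in the alpha release.
    return clf.score()
```

With the package installed, `cluster_rows(make_example_rows())` should return one label per record. Note that no encoding or scaling step appears in the caller's code.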
Try it out
We are excited about the alpha launch of DenseClus. You can find a more detailed walkthrough in the DenseClus Example NB.ipynb notebook in the GitHub repository. We invite you to try it out, report issues, send pull requests, and let us know what you think.