AWS Open Source Blog

Introducing DenseClus, an open source clustering package for mixed-type data

Today we announce the alpha release of DenseClus, an open source package for clustering high-dimensional, mixed-type data. DenseClus uses the uniform manifold approximation and projection (UMAP) and hierarchical density-based spatial clustering of applications with noise (HDBSCAN) algorithms to arrive at a clustering solution for both categorical and numerical data. With DenseClus, you provide a dataframe, and it generates homogeneous clusters with no need for extensive preprocessing or decisions about how to treat categorical features. This capability opens a wide range of use cases, from customer segmentation in marketing to mapping cells in biomedicine.

All the software in the DenseClus project is released under the MIT license. We invite you to check out the code for DenseClus on GitHub and join the community.

What is DenseClus?

Clustering is a hard problem because there is never truly a “right” answer when labels are unknown. To complicate matters further, there is no free lunch for clustering algorithms. Even if one algorithm fits a certain dataset well, there is no guarantee that it will work the same way on a different dataset. Likewise, clustering is noted as being “strongly dependent on contexts, aims and decisions of the researcher,” which adds fuel to the argument that there is no such thing as a “universally optimal method that will just produce natural clusters” (refer to “What Are True Clusters?” by Christian Hennig).

Moreover, clustering techniques that generalize well, such as KMeans, assume that the data is numerical and that clusters are roughly sphere-shaped. Mixed-type, high-dimensional data also presents challenges for the downstream clustering task, because classical dimensionality reduction methods such as principal component analysis (PCA) do not work when categorical values are included. This situation leads to a conundrum for the practitioner, who must commit to a specific featurization scheme, such as including only numerical values or transforming everything to categorical and then using multiple correspondence analysis (MCA) instead.
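To make the featurization problem concrete, here is a small sketch of how the choice of encoding changes the distances that a method such as KMeans would see. The `plan` categories are hypothetical, and this only illustrates the problem; it is not how DenseClus treats categorical features.

```python
import math

# Hypothetical categorical feature with no natural ordering.
categories = ["basic", "premium", "enterprise"]


def integer_encode(value):
    """Map a category to its position in the list (imposes an arbitrary order)."""
    return [float(categories.index(value))]


def one_hot_encode(value):
    """Map a category to a 0/1 indicator vector (no order implied)."""
    return [1.0 if value == c else 0.0 for c in categories]


def euclidean(u, v):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


# Integer encoding: "enterprise" looks twice as far from "basic" as "premium" does.
int_near = euclidean(integer_encode("basic"), integer_encode("premium"))
int_far = euclidean(integer_encode("basic"), integer_encode("enterprise"))

# One-hot encoding: every pair of distinct categories is equally far apart.
oh_near = euclidean(one_hot_encode("basic"), one_hot_encode("premium"))
oh_far = euclidean(one_hot_encode("basic"), one_hot_encode("enterprise"))

print(int_near, int_far)
print(oh_near, oh_far)
```

Neither encoding is wrong in itself, but each one builds an assumption into the distance metric, and that assumption then shapes the clusters.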

Chart with two boxes: Numerical Data and Categorical Data. Arrows below the boxes point down to Numerical Data UMAP Embedding and Categorical Data UMAP Embedding, respectively. One arrow points down to Union Two Embeddings. An arrow points to the right to a box that says HDBSCAN Clustering, and an arrow points up from that box to a box that shows Clustering Results.

DenseClus seeks both to ease the difficulty of choosing a default clustering algorithm and to circumvent the difficulties presented by mixed-type data. DenseClus uses a combination of UMAP and HDBSCAN to map mixed-type data into a dense, lower-dimensional space. From this dense space, it then builds groups hierarchically into clusters based on the density of points. This approach makes DenseClus an easy-to-use solution that can be applied to a wide variety of data to find meaningful clusters.
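The idea of grouping points by density can be sketched with a much simpler relative of HDBSCAN. The block below is a minimal, pure-Python DBSCAN-style procedure, not HDBSCAN itself and not the DenseClus implementation: it uses a fixed radius `eps` and a minimum neighbor count `min_pts` (both illustrative values), whereas HDBSCAN builds a hierarchy over varying density levels. The 2D points stand in for an already-embedded dataset.

```python
import math


def region_query(points, i, eps):
    """Return indices of all points within eps of points[i], including i."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def density_cluster(points, eps=0.5, min_pts=3):
    """Simplified DBSCAN: one label per point, with -1 meaning noise."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point
            continue
        # points[i] is a core point: start a new cluster and grow it.
        cluster_id += 1
        labels[i] = cluster_id
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster_id  # noise reached from a core point
            if labels[j] is not None:
                continue  # already assigned to a cluster
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also core: keep expanding
    return labels


# Two dense groups and one isolated point (made-up coordinates).
points = [
    (0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (0.3, 0.2),
    (5.0, 5.0), (5.1, 5.2), (5.2, 4.9), (4.9, 5.1),
    (10.0, 0.0),
]
labels = density_cluster(points, eps=0.5, min_pts=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

Unlike KMeans, this style of algorithm needs no preset number of clusters and can leave isolated points unassigned, which is the behavior the density-based step relies on.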

Getting started with DenseClus

DenseClus is registered on PyPI, and the code is available on GitHub. The easiest way to install it is with pip for Python 3.7 or 3.8:

python3.8 -m pip install Amazon-DenseClus

DenseClus requires a pandas DataFrame as input with both numerical and categorical columns. All preprocessing and extraction are done under the hood; call the fit function and then retrieve the clusters.

from denseclus import DenseClus

# df is a pandas DataFrame with both numerical and categorical columns
clf = DenseClus(
    umap_combine_method="intersection_union_mapper",
)
clf.fit(df)

# Retrieve the clusters
print(clf.score())
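The snippet above assumes that a `df` already exists. As a minimal sketch, a mixed-type input could be built like this; the column names and values are hypothetical, and a real run would need far more rows than shown here. The last two lines only inspect which columns pandas treats as numerical or not, as a quick sanity check before fitting.

```python
import pandas as pd

# Hypothetical customer data; column names and values are illustrative only.
df = pd.DataFrame(
    {
        "age": [34, 51, 27, 45],
        "monthly_spend": [120.5, 80.0, 42.3, 230.9],
        "plan": ["basic", "premium", "basic", "enterprise"],
        "region": ["apac", "emea", "apac", "amer"],
    }
)

# Check which columns pandas sees as numerical and which as non-numerical.
numerical_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

print(numerical_cols)
print(categorical_cols)
```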

Try it out

We are excited about the alpha launch of DenseClus. You can find a more detailed walkthrough in the DenseClus Example NB.ipynb notebook in the GitHub repository. We invite you to try it out, report issues, send pull requests, and let us know what you think.

Charles Frenzel

Charles is a Senior Data Scientist for Professional Services based in Tokyo, Japan. He works directly with AWS customers to build machine learning models for production. In his spare time he enjoys biking with his children, kettlebell training, and drinking matcha tea.

Baichuan Sun

Dr. Baichuan Sun, currently serving as a Sr. AI/ML Solution Architect at AWS, focuses on generative AI and applies his knowledge in data science and machine learning to provide practical, cloud-based business solutions. With experience in management consulting and AI solution architecture, he addresses a range of complex challenges, including robotics computer vision, time series forecasting, and predictive maintenance, among others. His work is grounded in a solid background of project management, software R&D, and academic pursuits. Outside of work, Dr. Sun enjoys the balance of traveling and spending time with family and friends, reflecting a commitment to both his professional growth and personal well-being.

Eden Duthie

Eden is the AWS Professional Services Machine Learning lead for the APJC region.

Yin Song

Yin Song has been a data scientist on the AWS ProServe ML APJC team since May 2019. He works closely with enterprises across several industries (e.g., telecommunication, mining, and FSI) to design and apply machine learning and AI solutions and create value for customers. Before joining AWS, Yin worked for Telstra, the largest telecommunication company in Australia, where he delivered several projects on customer and network experience optimisation. Before that, he worked as a data scientist in online advertising, leading ML-based advertising optimisation. He obtained his PhD in 2014; his thesis was about probabilistic machine learning and applications.