K-Means Clustering: How to Use Unsupervised Learning Techniques

Let’s discuss ways you can help businesses analyze customer behavior and make decisions designed to drive customer satisfaction and loyalty.

Alex Jonas
Product Coalition

Source: vertica.com

The beauty of machine learning is that data doesn’t lie. With a few specific steps based on decades-old statistical models, one can uncover predictive insights from seemingly random data sets.

AI is now more publicly accessible than ever. Due to advances in processing power and the abundance of low-cost technologies, storing data and running complex models is no longer restricted to large corporations with big budgets and tremendous resources.

Most people are familiar with GenAI and applications like natural language processing. Some may even have dabbled in Midjourney, where text prompts are run through generative image models to create original images. Few, however, may have been exposed to the underlying machine learning (ML) principles of supervised and unsupervised learning.

Supervised and unsupervised learning

Supervised learning trains on labeled examples and uses regression or classification techniques to make very specific predictions. Unsupervised learning is less specific. It works with unlabeled data, approaches it from a more general perspective, and looks for patterns amidst perceived chaos.

The best part about unsupervised learning is that it’s a methodology that embraces self-acknowledged ignorance. Just imagine: there’s something immediately admirable about an organization that admits it may not already know everything about its customers.

Unsupervised learning is different because there are purposely fewer rules in place. It answers the broad question of what trends may exist in a large dataset rather than narrowing the focus down to a specific goal or output. Its ambiguity at the outset is its secret weapon.

Too often when setting up an ML model, we assume connections between inputs that may not tell the whole story. If you instead use unsupervised techniques such as clustering and association rule mining, you may be surprised by what you find. One clear application for this type of approach is customer segmentation.

Visualization of customers segmented on pie chart
Image Source: LinkedIn

Customer segmentation

It’s rare for any web experience today to be without some form of personalization or segmentation built into the user interface (UI). Most modern content management systems (CMS) are designed to handle concurrently running campaigns with distinct customer journeys broken down by audience. But how can you clearly delineate who is supposed to get what experience?

Sometimes the answer is easy if you’re looking at geography or demographics, but time and time again we find there are potential audiences out there that don’t meet such definitive criteria. This is where K-Means clustering comes in.

Source: serokell.io

K-Means clustering

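K-Means clustering uses unlabeled and unclassified data to establish cohorts or groups of data points (customers) that behave similarly. You choose the number of clusters, k, up front. Each cluster is represented by a centroid, the mean of its members, and every data point is assigned to the cluster whose centroid is nearest, with distance measured across however many dimensions (two, three, four, five…) the data has. The algorithm alternates between assigning points and recomputing centroids until the assignments stop changing.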
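To make the mechanics concrete, here is a minimal sketch of the standard K-Means loop (often called Lloyd's algorithm) in plain Python. The customer data, the two features, and the choice of starting centroids are all hypothetical, and a real project would more likely lean on an ML studio or library than on hand-rolled code.

```python
import math

def nearest(point, centroids):
    """Index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def k_means(points, centroids, max_iter=100):
    """Alternate between assigning points and updating centroids until stable."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [nearest(p, centroids) for p in points]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for i, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == i]
            # Keep the old centroid if a cluster ends up empty.
            new_centroids.append(mean_point(members) if members else old)
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical customers: (orders per year, average order value in tens of dollars)
customers = [(1, 2), (2, 1), (1, 1), (8, 9), (9, 8), (9, 9)]
labels, centroids = k_means(customers, centroids=[customers[0], customers[3]])
print(labels)     # cluster index for each customer
print(centroids)  # final centroid of each cluster
```

The result depends on the starting centroids, which is why tools typically run the algorithm several times from different starting points and keep the best run.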

These clusters are easily represented in two dimensions below, where color is used to define a cohort. It’s a bit of a treasure hunt and actually a pretty fun exercise when done by hand. The machines running these over and over again, however, may or may not agree.

K Means Clustering Diagram on Two Dimensional Grid
Image Source

What you quickly discover, though, is that there are previously unknown relationships hiding in plain sight. The data often shows that it may not be as simple as grouping your customers into traditional verticals such as age, gender, geography, or income. More detailed clusters provide opportunities to outsmart the competition with hard data. They can easily be applied to define new audiences that are made up of multiple variables.

Source: boldbusiness.com

Example: Zappos digital marketing campaign

Let’s say for instance that you work for Zappos and are preparing for a July 4th digital marketing campaign. You’re investigating which populations are interested in which products, and you’re looking at 50,000 Black Friday purchases from 2023 as a baseline to train and execute your model.

Here are steps you might take towards executing a targeted campaign:

1. Identify variable-agnostic data:

When working with unsupervised learning, one of the most important tasks is to expand your scope beyond a limited set of variables. Besides including the basic demographic data described above (age, gender, geography, income), let’s say you expand the scope to be as detailed as possible and also include user actions.

For the purpose of this exercise, let’s call these: products purchased, products viewed, time spent per product viewed, scroll-depth per product viewed, product rating views, product sizing customizations, and product material customizations.
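Because K-Means groups customers by distance, a variable measured in large numbers (income in dollars) can drown out one measured in small numbers (products viewed). A common preparation step is to standardize each variable first. The sketch below shows one way to do that; the records and field names are hypothetical, and only a few of the variables listed above are included to keep it short.

```python
from statistics import mean, pstdev

# Hypothetical customer records; field names are illustrative only.
customers = [
    {"age": 34, "income": 120_000, "products_viewed": 12, "avg_seconds_per_view": 45.0},
    {"age": 52, "income": 65_000, "products_viewed": 3, "avg_seconds_per_view": 20.0},
    {"age": 27, "income": 48_000, "products_viewed": 25, "avg_seconds_per_view": 70.0},
]
FEATURES = ["age", "income", "products_viewed", "avg_seconds_per_view"]

def standardize(records, features):
    """Return one z-scored feature vector per record."""
    columns = {f: [r[f] for r in records] for f in features}
    stats = {f: (mean(col), pstdev(col)) for f, col in columns.items()}
    vectors = []
    for r in records:
        # A constant column has zero spread; map it to 0.0 instead of dividing by zero.
        vectors.append([
            (r[f] - stats[f][0]) / stats[f][1] if stats[f][1] else 0.0
            for f in features
        ])
    return vectors

vectors = standardize(customers, FEATURES)
for v in vectors:
    print([round(x, 2) for x in v])
```

After this step every variable has a mean of zero and a comparable spread, so no single column dominates the distance calculation. Categorical fields such as geography would need their own numeric encoding, which is not shown here.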

2. Establish a K-Means cluster:

Now that you have a wealth of data to run your model against, you execute a K-Means clustering algorithm using your studio of choice (more on publicly available ML studios below). You define three hierarchical data categories: ‘customer demographics’, ‘products purchased’, and ‘site activities.’ Because K-Means needs the number of clusters as an input, you try a range of values and find that 27 unique clusters fit the data best.
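K-Means takes the number of clusters as an input, so a figure like 27 is typically arrived at by comparing candidate values. One common yardstick is the within-cluster sum of squares (WCSS): the total squared distance from each point to its cluster's mean. The sketch below computes it for three hand-picked groupings of a tiny, hypothetical dataset; in practice the groupings would come from running K-Means at each value of k.

```python
def wcss(points, labels):
    """Within-cluster sum of squared distances to each cluster's mean."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    total = 0.0
    for members in clusters.values():
        center = [sum(c) / len(members) for c in zip(*members)]
        total += sum((x - m) ** 2 for p in members for x, m in zip(p, center))
    return total

# Hypothetical two-feature data with two visually obvious groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]

# Hand-assigned groupings for k = 1, 2, 3 (stand-ins for real K-Means output).
candidate_labelings = {
    1: [0, 0, 0, 0, 0, 0],
    2: [0, 0, 0, 1, 1, 1],
    3: [0, 0, 0, 1, 1, 2],
}
scores = {k: wcss(points, labels) for k, labels in candidate_labelings.items()}
for k, score in scores.items():
    print(k, round(score, 2))
```

WCSS always shrinks as k grows, so the goal is not the lowest number but the point where adding another cluster stops paying off. Here the drop from one cluster to two is dramatic and the drop from two to three is marginal, which points to k = 2.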

3. Refine with classification:

At this point you’re psyched that you have 27 clusters but still might not have a great idea of what makes each one unique. To get more information, you can run a binary classification technique such as logistic regression on each cluster, testing membership in that cluster against all the others (this is also now available in most ML studios).

The trends should begin to present themselves. For example, you may find that one cluster is uniquely defined as women with high net incomes who look at comfort ratings and view designer shoes priced above $200 but most often purchase shoes under $150. Let’s call this cohort: Price-conscious Fashionistas. You may also find a cluster of men over 6'5" who look at hiking boots of all styles but in sizes greater than 14, with few or no purchases tied to the cluster. Let’s call this cohort: Out of Stock Outdoorsmen.
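As a rough illustration of the classification step, the sketch below fits a one-vs-rest logistic regression by plain gradient descent: customers in one cluster are labeled 1, everyone else 0, and the learned weights hint at which variables set that cluster apart. The feature names, the standardized values, and the learning-rate settings are all hypothetical, and an ML studio would handle the fitting for you.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(rows, targets, lr=0.5, epochs=2000):
    """Batch gradient descent for binary logistic regression; returns (weights, bias)."""
    n_features = len(rows[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in zip(rows, targets):
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            grad_b += error
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias

# Hypothetical standardized features for seven customers.
FEATURES = ["income", "comfort_rating_views", "avg_purchase_price"]
rows = [
    [1.2, 1.1, -0.3], [0.9, 1.4, -0.5], [1.1, 0.8, -0.2],    # cluster 0
    [-0.8, -0.9, 0.4], [-1.0, -1.2, 0.1], [-0.7, -0.6, 0.3],  # cluster 1
    [-0.6, -0.5, 0.2],                                        # cluster 1
]
cluster_labels = [0, 0, 0, 1, 1, 1, 1]

# One-vs-rest: is this customer in cluster 0 or not?
targets = [1 if lab == 0 else 0 for lab in cluster_labels]
weights, bias = fit_logistic(rows, targets)

# Larger absolute weights mark the variables that best separate the cluster.
for name, w in sorted(zip(FEATURES, weights), key=lambda pair: -abs(pair[1])):
    print(f"{name}: {w:+.2f}")
```

A positive weight means higher values of that variable make membership in the cluster more likely. Repeating the fit once per cluster gives a rough profile of each cohort.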

4. Put the results to work:

The two identified cohorts each require a unique digital marketing strategy (as well as a possible discussion with inventory/fulfillment teams). For the Price-conscious Fashionistas, you could run an email campaign recommending comfortable designer shoe styles that fall within their typical purchase price of under $150. For the Out of Stock Outdoorsmen, you could use paid search (SEM) to promote newly in-stock hiking boots in larger sizes and also pair them on site with your Big and Tall clothing selection.
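Once the clusters exist, routing a new visitor to a campaign is a matter of finding the nearest centroid. The sketch below shows the idea under stated assumptions: the centroid coordinates, the three standardized features, and the campaign descriptions are hypothetical placeholders, not output from a real model.

```python
import math

# Hypothetical centroids in standardized feature space:
# (income, comfort_rating_views, shoe_size)
COHORTS = {
    "Price-conscious Fashionistas": (1.1, 1.2, -0.2),
    "Out of Stock Outdoorsmen": (0.1, -0.4, 2.3),
}
CAMPAIGNS = {
    "Price-conscious Fashionistas": "email: comfortable designer styles at their price point",
    "Out of Stock Outdoorsmen": "paid search: in-stock hiking boots in larger sizes",
}

def assign_cohort(customer, cohorts):
    """Return the cohort whose centroid is nearest to the customer's feature vector."""
    return min(cohorts, key=lambda name: math.dist(customer, cohorts[name]))

new_customer = (0.9, 1.0, 0.0)  # hypothetical visitor, same feature order as above
cohort = assign_cohort(new_customer, COHORTS)
print(cohort, "->", CAMPAIGNS[cohort])
```

The new customer's features must be standardized with the same means and spreads used when the clusters were built, or the distances will not be comparable.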

It’s about learning more about your customers.

The big takeaway from the example above is that clusters derived from unsupervised learning will give you a leg up when defining your digital audiences. Custom cohorts can then be targeted with digital marketing software (Adobe Campaign, Marketo, Salesforce Marketing Cloud, HubSpot, or Microsoft Dynamics) to provide the right message to the right people at the right time. Ultimately it comes down to learning more about your customers, what they are interested in, and how your product is serving their needs.

Hopefully by now you’re convinced of unsupervised learning’s potential. What’s even more exciting is that this is an especially good time to make it part of your product and marketing strategy, because new and established resources are everywhere to help even a novice get started. With ML studios, out-of-the-box data lakes, and easy-to-provision non-relational databases, there isn’t much standing in a team’s way of having a fully functional unsupervised learning platform at their fingertips.

Data Science concepts today are more accessible.

When I got my MBA from Johns Hopkins a few years back, you had to spend hours preparing your data, training your models, and running algorithms to reach any meaningful conclusions. From learning the R programming language to painstakingly sifting through spreadsheets to applying sum-of-squares calculations to establish the centroids of your models, the time invested was significant. No one would have expected a busy product manager or digital marketer to be able to put that effort into ML in years past. This is no longer the case.

You may have heard of or experimented with ChatGPT and been astounded by its flexibility and ease of use, but few recognize the advances across the rest of the data science industry. IBM Watson Studio and Amazon SageMaker now make it easy for even a novice to introduce data science concepts into their business operations.

This is a huge leg up, especially for digital marketers, who need to focus most of their time on organizing and executing campaigns even more complex than the Zappos example discussed above. Automating some of the process of audience creation with Watson or SageMaker saves time and resources, but it’s not all roses.

Despite the newly available non-technical AI tools from IBM and Amazon, you still might need development support to capture and store your user data. Luckily, Apache Cassandra and MongoDB, two of the most common non-relational databases, are now available from AWS for $0.30/GB-month and $0.80/hr, respectively.
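To put those two rates in perspective, here is a back-of-the-envelope estimate. The rates are the figures quoted above; the workload (50 GB of stored data, one instance running around the clock for a 30-day month) is a hypothetical assumption, and real bills include other charges not modeled here.

```python
STORAGE_RATE_PER_GB_MONTH = 0.30  # dollars, Cassandra figure quoted in the article
INSTANCE_RATE_PER_HOUR = 0.80     # dollars, MongoDB figure quoted in the article

def storage_cost(stored_gb, rate=STORAGE_RATE_PER_GB_MONTH):
    """Monthly charge for a service billed per GB stored."""
    return stored_gb * rate

def instance_cost(hours, rate=INSTANCE_RATE_PER_HOUR):
    """Monthly charge for a service billed per instance-hour."""
    return hours * rate

# Hypothetical workload.
STORED_GB = 50
HOURS_PER_MONTH = 24 * 30

cassandra_monthly = storage_cost(STORED_GB)
mongodb_monthly = instance_cost(HOURS_PER_MONTH)
print(f"Per-GB pricing:   ${cassandra_monthly:,.2f} per month")
print(f"Per-hour pricing: ${mongodb_monthly:,.2f} per month")
```

The two pricing models scale differently: per-GB pricing grows with how much data you keep, while per-hour pricing grows with how long the instance stays on.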

Amazon also has inexpensive data lake capabilities with its S3 service, although there are many other providers to choose from: Microsoft, Google, Oracle, Snowflake. So although you might need to allocate dollars in your budget for technical support, you won’t necessarily be breaking the bank. And don’t forget, each of the technologies listed above offers fully managed versions of its software as well, so you don’t necessarily need technical resources on staff to get these set up.

Source: datasklr.com

Unsupervised learning offers a prudent strategy for understanding your data.

It’s an exciting time, to say the least, to be involved in the predictive (and now generative) field of data science. When it comes to applying what you learn to business operations, don’t let your marketing strategy get stuck in traditional forms of segmentation.

Unsupervised learning provides a low-risk approach to getting your audiences and cohorts right. Even if you go through the exercise of setting up a few clusters, as in the Zappos example above, but don’t end up using them, the knowledge you gain about your users will be worth the effort.

The data ultimately won’t lie. On top of all of this, there’s little getting in your way of kicking things off even if you don’t have deep pockets or a background in engineering or data science. Good luck, but I don’t think you’ll need it!

I would like to thank Tremis Skeete, Executive Editor of Product Coalition, for his valuable contributions to the editing of this article.

I also thank Product Coalition founder Jay Stansell, who has provided a collaborative product management education environment.

Digital Product for Comcast Business. MBA — Johns Hopkins Carey Business School.