Build a Machine Learning Model - Unsupervised Learning


Understanding Clustering

Measuring Clustering Quality

Inertia is a way to measure how well K-Means has grouped the data. It is the sum of the squared distances between each point and the center of its group (centroid). For a fixed number of clusters, lower inertia means tighter, more compact clusters.


# Calculate the Euclidean distance between two 2-D points
import numpy as np

def calculate_distance(a, b):
    return np.sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2)

point1 = [1, 2]
point2 = [4, 6]
print(calculate_distance(point1, point2))
# Output: 5.0
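The distance idea above extends naturally to inertia itself. The following is a minimal sketch, assuming the cluster assignments and centroids are already known; the helper name `compute_inertia` and the hand-picked centroids are illustrative assumptions, not part of any library. It follows the definition scikit-learn uses for `inertia_`: the sum of squared distances from each sample to its assigned centroid.

```python
import numpy as np

def compute_inertia(points, centroids, labels):
    """Sum of squared distances from each point to its assigned centroid."""
    points = np.asarray(points, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    labels = np.asarray(labels)
    # centroids[labels] picks, for every point, the centroid of its cluster
    diffs = points - centroids[labels]
    return float(np.sum(diffs ** 2))

# Hypothetical example: two clusters with hand-computed centroids
points = [[1, 2], [2, 3], [5, 6], [8, 9]]
labels = [0, 0, 1, 1]
centroids = [[1.5, 2.5], [6.5, 7.5]]

inertia_value = compute_inertia(points, centroids, labels)
print(inertia_value)
# Output: 10.0
```

Each of the first two points contributes 0.5 and each of the last two contributes 4.5, which gives the total of 10.0.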

Basics of Unsupervised Learning

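In unsupervised learning, we find patterns in data without predefined labels. K-Means clustering groups similar data points into clusters. A good model has low inertia and an appropriate number of clusters, but the two pull against each other: inertia keeps falling as clusters are added (reaching zero when every point is its own cluster), so the goal is the smallest number of clusters that still gives low inertia.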


from sklearn.cluster import KMeans
# Example data
data_samples = [[1, 2], [2, 3], [5, 6], [8, 9]]
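# Create and fit K-Means model (random_state makes the run repeatable)
model = KMeans(n_clusters=2, n_init=10, random_state=0)
model.fit(data_samples)
# Assign each sample to its nearest centroid
labels = model.predict(data_samples)
print(labels)
# Output will show the cluster each sample belongs to; cluster numbering is arbitrary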

Finding the Right Number of Clusters

To find the best number of clusters (K) for K-Means, use the Elbow method. Plot inertia for different values of K and look for the point where adding more clusters no longer significantly improves the model. This point is called the ‘elbow’.


import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
# Sample data
data_samples = [[1, 2], [2, 3], [5, 6], [8, 9]]
# Calculate inertia for different K values
inertia = []
for k in range(1, 5):
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    model.fit(data_samples)
    inertia.append(model.inertia_)
# Plot the Elbow graph
plt.plot(range(1, 5), inertia, marker='o')
plt.xlabel('Number of Clusters')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.show()

How K-Means Works

K-Means is a clustering algorithm that groups data into clusters. The algorithm works in steps: initially, centroids are placed randomly, then data points are assigned to the nearest centroid. The centroids are recalculated, and the process repeats until the centroids no longer move significantly.


from sklearn.cluster import KMeans
# Sample data
data_samples = [[1, 2], [2, 3], [5, 6], [8, 9]]
# Create and fit K-Means model
model = KMeans(n_clusters=2, n_init=10, random_state=0)
model.fit(data_samples)
print(model.cluster_centers_)
# Output will show the center points of each cluster
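The steps described above can also be written out by hand. The following is a rough sketch with NumPy, not scikit-learn's actual implementation. The function names `assign` and `kmeans_sketch`, the choice of random data points as starting centroids, and the `tol` threshold are all assumptions made for illustration.

```python
import numpy as np

def assign(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)

def kmeans_sketch(points, k, max_iter=100, tol=1e-4, seed=0):
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    # Step 1: start from k distinct data points chosen at random (an assumption)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest centroid
        labels = assign(points, centroids)
        # Step 3: move each centroid to the mean of its assigned points
        new_centroids = centroids.copy()
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # leave an empty cluster's centroid in place
                new_centroids[j] = members.mean(axis=0)
        # Step 4: stop once the centroids have effectively stopped moving
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
    # Final assignment against the final centroids
    return centroids, assign(points, centroids)

data_samples = [[1, 2], [2, 3], [5, 6], [8, 9]]
centroids, labels = kmeans_sketch(data_samples, k=2)
print(centroids)
print(labels)
```

Which grouping comes out depends on the random starting centroids, which is one reason library implementations typically run several initializations and keep the best result.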

Using Scikit-Learn for K-Means

Scikit-Learn provides an easy-to-use implementation of K-Means. You can use it to group your data into clusters by specifying the number of clusters you want, and the algorithm will handle the rest.


from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
# Generate sample data
data_samples, _ = make_blobs(n_samples=300, centers=4, random_state=42)
# Create and fit K-Means model (fixed seeds make the run repeatable)
model = KMeans(n_clusters=4, n_init=10, random_state=42)
model.fit(data_samples)
labels = model.predict(data_samples)
print(labels)
# Output will show the cluster assignments for each data point

Cross Tabulation Overview

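Cross tabulation compares the clustering results with actual categories or labels. Each cell of the table counts the samples that fall in a given cluster and a given known group, which shows how well the clusters match the known groups. Cluster numbers are arbitrary, so a good match means each cluster is dominated by one group, not that the numbers agree.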


import pandas as pd
# Sample data with predicted (cluster) labels and actual labels
pred_labels = [1, 0, 1, 0]
user_labels = [1, 0, 1, 0]
# Create a cross-tabulation; naming each Series labels the rows and columns
cross_tab = pd.crosstab(
    pd.Series(pred_labels, name='predicted'),
    pd.Series(user_labels, name='actual'),
)
print(cross_tab)
# Output is a 2x2 table of counts: 2 in each diagonal cell, 0 elsewhere
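To see cross tabulation applied to real cluster output, the sketch below clusters synthetic data from `make_blobs` and compares the result against the group labels that `make_blobs` returns. The variable names and the choice of three clusters are assumptions made for illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with known group membership (true_labels)
data_samples, true_labels = make_blobs(n_samples=300, centers=3, random_state=42)

# Cluster the data without using the known labels
model = KMeans(n_clusters=3, n_init=10, random_state=42)
cluster_labels = model.fit_predict(data_samples)

# Rows are K-Means clusters, columns are the known groups
cross_tab = pd.crosstab(
    pd.Series(cluster_labels, name="cluster"),
    pd.Series(true_labels, name="actual"),
)
print(cross_tab)
```

When reading the table, look for rows where nearly all the counts sit in one column. Cluster 0 does not have to line up with group 0, since K-Means numbers its clusters arbitrarily.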

Convergence in K-Means

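Convergence in K-Means occurs when the centroids (cluster centers) no longer change significantly from one iteration to the next. The algorithm stops when it reaches this state, meaning the clusters are stable. Implementations usually also cap the number of iterations so that the loop always terminates.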
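In scikit-learn, convergence is controlled by two `KMeans` parameters: `tol`, the tolerance on how much the centroids move between iterations, and `max_iter`, an upper limit on iterations per run. The sketch below sets both explicitly (the values shown are scikit-learn's documented defaults) and reads `n_iter_` after fitting to see how many iterations the best run took.

```python
from sklearn.cluster import KMeans

data_samples = [[1, 2], [2, 3], [5, 6], [8, 9]]

# tol: centroid-movement tolerance used to declare convergence
# max_iter: hard cap on iterations for a single run
model = KMeans(n_clusters=2, n_init=10, max_iter=300, tol=1e-4, random_state=0)
model.fit(data_samples)

# Number of iterations run before stopping
print(model.n_iter_)
```

A value of `n_iter_` well below `max_iter` suggests the run stopped because the centroids stabilized, not because it hit the cap.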

Assigning Data to Clusters

In the K-Means algorithm, each data point is assigned to the cluster whose centroid is closest. This is done by calculating the distance from the data point to each centroid and selecting the nearest one.
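The assignment step can be sketched on its own with NumPy. The centroid positions below are made up for illustration; in a real run they would come from the previous iteration.

```python
import numpy as np

points = np.array([[1, 2], [2, 3], [5, 6], [8, 9]], dtype=float)
# Hypothetical centroid positions
centroids = np.array([[1.5, 2.5], [6.5, 7.5]])

# distances[i, j] is the Euclidean distance from point i to centroid j
distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)

# Each point is assigned to the centroid with the smallest distance
labels = distances.argmin(axis=1)
print(labels)
# Output: [0 0 1 1]
```

The first two points are nearest to the first centroid and the last two are nearest to the second.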

Initial Step of K-Means

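The first step in K-Means involves choosing initial positions for the centroids. These positions are then updated iteratively to reduce inertia. Because the final clusters can depend on the starting positions, K-Means is often run several times with different initial centroids and the best result is kept.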
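In scikit-learn, the `init` parameter of `KMeans` selects how the starting centroids are chosen. The sketch below compares `"random"` (pick starting centroids at random from the data) with `"k-means++"` (the default, which spreads the starting centroids apart). It uses a single initialization per run (`n_init=1`) so that the effect of the starting positions is visible. The data set and seeds are arbitrary choices for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

data_samples, _ = make_blobs(n_samples=300, centers=4, random_state=42)

results = {}
for init_method in ("random", "k-means++"):
    # n_init=1: one run from one set of starting centroids
    model = KMeans(n_clusters=4, init=init_method, n_init=1, random_state=0)
    model.fit(data_samples)
    results[init_method] = model.inertia_
    print(init_method, model.n_iter_, round(model.inertia_, 2))
```

On easy, well-separated data both strategies may reach the same inertia. Differences tend to show up on harder data or with unlucky seeds.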
