Inertia measures how tightly the data points in each cluster are grouped in K-Means clustering. It is calculated by taking the distance from each data point to the centroid of its cluster, squaring those distances, and summing them. Lower inertia indicates tighter, better-defined clusters.
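As a minimal sketch of that calculation (the `data_samples` array here is hypothetical example data), we can compute inertia by hand and compare it against the `inertia_` attribute that scikit-learn's `KMeans` exposes after fitting:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2D data points forming two obvious groups
data_samples = np.array([[1.0, 1.0], [1.1, 0.9],
                         [5.0, 5.0], [5.1, 4.9]])

model = KMeans(n_clusters=2, n_init=10, random_state=0)
model.fit(data_samples)

# Inertia by hand: sum of squared distances from each point
# to the centroid of its assigned cluster
assigned_centers = model.cluster_centers_[model.labels_]
manual_inertia = ((data_samples - assigned_centers) ** 2).sum()

print(manual_inertia, model.inertia_)  # the two values should match
```

Both numbers agree because `inertia_` is defined as exactly this sum of squared point-to-centroid distances.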
```python
# Function to calculate the Euclidean distance between two 2D points
def calculate_distance(point1, point2):
    x_diff = (point1[0] - point2[0]) ** 2
    y_diff = (point1[1] - point2[1]) ** 2
    return (x_diff + y_diff) ** 0.5
```
In unsupervised learning, we find patterns in data without pre-existing labels. A good model balances low inertia with a small number of clusters. Increasing the number of clusters usually decreases inertia, but finding the right balance is key.
```python
from sklearn.cluster import KMeans

# Create a KMeans model with 3 clusters
model = KMeans(n_clusters=3)
model.fit(data_samples)

# Predict cluster labels for the data samples
labels = model.predict(data_samples)
```
To determine the best number of clusters (K), use the Elbow method. Plot inertia against different values of K and look for the 'elbow' point where the rate of decrease slows down. This point suggests the optimal number of clusters.
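The Elbow method described above can be sketched as follows. This is a minimal example in which `data_samples` is a small hypothetical dataset; in practice you would plot `inertias` against K and look for the bend in the curve:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical sample data: two loose groups of 2D points
data_samples = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                         [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])

# Compute inertia for a range of K values
inertias = []
for k in range(1, 6):
    model = KMeans(n_clusters=k, n_init=10, random_state=42)
    model.fit(data_samples)
    inertias.append(model.inertia_)

# Inertia decreases as K grows; the 'elbow' marks the point
# where adding more clusters stops paying off
print(inertias)
```

Here the sharp drop from K=1 to K=2 and the flat tail afterward would place the elbow at K=2, matching the two groups in the data.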
```python
import pandas as pd

# Example predicted and actual labels (e.g., from a clustering task)
predicted_labels = [0, 1, 0, 1]
actual_labels = [0, 1, 1, 1]

# Create a DataFrame for cross-tabulation
df = pd.DataFrame({'predicted_labels': predicted_labels,
                   'actual_labels': actual_labels})

# Create the cross-tabulation
cross_tab = pd.crosstab(df['predicted_labels'], df['actual_labels'])
print(cross_tab)
```
Unsupervised learning helps find patterns in data without labeled examples. Clustering, a common unsupervised learning technique, groups data into clusters based on similarity. It’s useful for analyzing unlabeled datasets.
```python
# Example of clustering using KMeans
from sklearn.cluster import KMeans

model = KMeans(n_clusters=4)
model.fit(data_samples)
labels = model.predict(data_samples)
```
Clustering can be used in various applications such as customer segmentation, image compression, and anomaly detection.
K-Means clustering groups data into K clusters using an iterative process. Each data point is assigned to the nearest cluster center (centroid), and the algorithm works to minimize the total squared distance between points and their assigned centroids.
Continue updating clusters and centroids until the centroids no longer change significantly, indicating that the algorithm has converged.
In K-Means, after setting initial cluster centers, each data point is assigned to the nearest center. This helps in forming more accurate clusters as the algorithm progresses.
Use the distance formula to measure how close each data point is to each cluster center. Each point is then assigned to the cluster whose center is nearest.