A simple guide to k-Means Clustering with scikit-learn
K-means clustering is an unsupervised machine learning method for grouping data into clusters. It partitions a set of data points into K clusters and assigns each point to the cluster whose mean (the centroid) is closest. The method is iterative: it alternates between assigning points to the nearest centroid and recomputing each centroid as the mean of its assigned points, with the aim of minimizing the sum of squared distances between each data point and its cluster centroid. Here’s how to run k-means clustering in Python with scikit-learn:
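Before turning to scikit-learn, it can help to see the iteration itself written out. The following is only a minimal NumPy sketch of the assign-then-update loop, not scikit-learn's implementation; the names kmeans_sketch, assign_labels, n_iters, seed and X_demo are made up for this illustration, and the initial centroids are simply picked at random from the data.

```python
import numpy as np

def assign_labels(X, centroids):
    # Squared Euclidean distance from every point to every centroid,
    # then the index of the closest centroid for each point
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

def kmeans_sketch(X, k, n_iters=100, seed=0):
    """Illustrative k-means loop; a sketch, not scikit-learn's implementation."""
    rng = np.random.default_rng(seed)
    # Assumption: start from k distinct data points chosen at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point joins its nearest centroid
        labels = assign_labels(X, centroids)
        # Update step: each centroid moves to the mean of its points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving
        centroids = new_centroids
    labels = assign_labels(X, centroids)
    # Sum of squared distances between points and their assigned centroid
    inertia = ((X - centroids[labels]) ** 2).sum()
    return labels, centroids, inertia

# Hypothetical demo data: two small 2-D groups
demo_rng = np.random.default_rng(0)
X_demo = np.vstack([
    demo_rng.normal(0.0, 0.5, size=(50, 2)),
    demo_rng.normal(5.0, 0.5, size=(50, 2)),
])
labels, centroids, inertia = kmeans_sketch(X_demo, k=2)
print(centroids)
print(inertia)
```

Because the starting centroids are random, a loop like this can settle on different solutions from run to run, which is one reason library implementations restart from several initializations and keep the best result.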
K-Means Clustering Algorithm in Python:
Import the necessary libraries first:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
The second step is to generate sample data or load a local dataset. As an example, let’s generate some artificial data using make_blobs from sklearn.datasets:
# Generate synthetic data with 4 clusters
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=42)
Next, apply k-means clustering:
# Define the number of clusters (K)
k = 4
# Initialize the KMeans object
# (n_init and random_state are set explicitly so repeated runs give the same result)
kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
# Fit the model to the data
kmeans.fit(X)
# Get the cluster assignments and cluster centers
cluster_labels = kmeans.labels_
cluster_centers = kmeans.cluster_centers_
To visualize the clusters:
# Plot the data points and cluster centers
plt.scatter(X[:, 0], X[:, 1], c=cluster_labels, cmap='viridis', edgecolors='k')
plt.scatter(cluster_centers[:, 0], cluster_centers[:, 1], marker='x', s=200, c='red', label='Cluster Centers')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.title('K-means Clustering')
plt.show()
The data points will be plotted with colors corresponding to the clusters to which they have been assigned, and the cluster centers will be indicated by red “x” markers.
It is important to remember that the quality of the result depends on the chosen number of clusters (K), so you may need to experiment with different K values to find the best clustering solution for your dataset. A suitable K can be estimated with the “Elbow Method”, which compares the sum of squared distances (the inertia) across several K values and looks for the point where adding more clusters stops giving a large improvement, or with other clustering evaluation metrics such as the silhouette score.
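As a rough sketch of that comparison, the loop below fits KMeans for several K values on the same kind of synthetic data and records the inertia and the silhouette score for each. The range of K values (2 to 8) and the variable names k_values, inertias and silhouettes are arbitrary choices for this example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Same kind of synthetic data as in the walkthrough
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=42)

k_values = list(range(2, 9))  # candidate K values (an arbitrary range)
inertias = []
silhouettes = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    # inertia_: sum of squared distances of points to their closest center
    inertias.append(model.inertia_)
    # silhouette_score: between -1 and 1, higher means better-separated clusters
    silhouettes.append(silhouette_score(X, model.labels_))

for k, inertia, sil in zip(k_values, inertias, silhouettes):
    print(f"K={k}  inertia={inertia:.1f}  silhouette={sil:.3f}")
```

Plotting inertias against k_values (for example with plt.plot) gives the usual elbow curve. Inertia tends to keep falling as K grows, so the bend in the curve is what matters, not the minimum.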
For real-world datasets, it’s also essential to handle any missing values and preprocess the data. Because k-means relies on distances, features with larger numeric ranges can dominate the result, so scaling or normalizing the features may be necessary to ensure that all features are compared on an equal footing. Since we used simple synthetic data, these preprocessing steps are not required in this case.
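For data that does need this treatment, one possible arrangement is to chain the preprocessing and the clustering in a scikit-learn pipeline. The sketch below is an illustration under stated assumptions: the two-column dataset is invented, median imputation and standard scaling are just one reasonable choice among several, and K=3 is arbitrary.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two features on very different scales, with some gaps
rng = np.random.default_rng(0)
X_raw = np.column_stack([
    rng.normal(50_000, 15_000, 200),  # large-range feature
    rng.normal(40, 12, 200),          # small-range feature
])
X_raw[rng.choice(200, size=10, replace=False), 1] = np.nan  # inject missing values

pipeline = make_pipeline(
    SimpleImputer(strategy="median"),  # fill missing values with the column median
    StandardScaler(),                  # zero mean, unit variance per feature
    KMeans(n_clusters=3, n_init=10, random_state=42),
)
labels = pipeline.fit_predict(X_raw)

# Cluster centers live in the scaled space; map them back to original units
scaled_centers = pipeline.named_steps["kmeans"].cluster_centers_
centers = pipeline.named_steps["standardscaler"].inverse_transform(scaled_centers)
print(centers)
```

Keeping the steps in one pipeline means the same imputation and scaling are applied automatically whenever the fitted model is used on new data.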