How to do Principal Component Analysis (PCA) with Python
Principal component analysis (PCA) is a dimensionality reduction technique that projects high-dimensional data onto a lower-dimensional space while preserving as much of the data’s variance as possible. It is widely used to speed up the training of machine learning models, to reveal structure in data, and to visualize high-dimensional data.
Here’s how to apply PCA in Python using scikit-learn’s sklearn.decomposition module:
Include the required libraries:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
# Import PCA from scikit-learn
from sklearn.decomposition import PCA
Prepare or load the dataset:
For this example, we’ll use the scikit-learn Iris dataset:
# Load the Iris dataset
data = load_iris()
X, y = data.data, data.target
Because PCA is sensitive to the scale of the features, it is a good idea to standardize the data before applying it:
# Standardize the data
scaler = StandardScaler()
X_std = scaler.fit_transform(X)
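To illustrate why standardization matters, here is a minimal sketch. It assumes a hypothetical unit change that multiplies the first Iris feature by 1000 (the array name `X_rescaled` and the factor are illustrative, not part of the original example), then compares the first principal component with and without scaling.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_iris().data

# Hypothetical unit change: express the first feature in much smaller units,
# which inflates its numeric values by a factor of 1000
X_rescaled = X.copy()
X_rescaled[:, 0] *= 1000

# Without standardization, the inflated feature dominates the first component
pca_raw = PCA(n_components=2).fit(X_rescaled)
print("Unscaled PC1 loadings:", np.round(pca_raw.components_[0], 3))

# After standardization, every feature contributes on a comparable scale
X_rescaled_std = StandardScaler().fit_transform(X_rescaled)
pca_std = PCA(n_components=2).fit(X_rescaled_std)
print("Standardized PC1 loadings:", np.round(pca_std.components_[0], 3))
```

In the unscaled case the first component should point almost entirely along the inflated feature, which reflects the choice of units rather than any structure in the data.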
Apply PCA:
# Create a PCA object specifying the number of components to keep (2 in this example)
pca = PCA(n_components=2)
# Fit the PCA model to the standardized data and project the data onto the components
X_pca = pca.fit_transform(X_std)
Visualize the transformed data:
# Scatter plot of the transformed data on the first and second principal components
plt.figure(figsize=(8, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis')
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.title('PCA: Iris Dataset')
plt.colorbar(label='Class')
plt.show()
The scatter plot shows the data points projected onto the first two principal components, with each class of the Iris dataset drawn in a different color.
Steps of Principal Component Analysis (PCA)
PCA automatically sorts the components in decreasing order of explained variance. The first principal component explains the largest share of the variance in the data, the second explains the largest share of what remains, and so forth. By setting n_components, you can specify the number of principal components you want to retain.
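The ordering by explained variance can be inspected directly. The sketch below is a minimal example, assuming the same standardized Iris data as above (the names `pca_full` and `pca_95` are illustrative). It reads the fitted model's `explained_variance_ratio_` attribute and also passes a float to `n_components`, which asks scikit-learn to keep just enough components to reach that fraction of the variance.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_iris().data
X_std = StandardScaler().fit_transform(X)

# Keep all components so the full variance breakdown is visible
pca_full = PCA().fit(X_std)
ratios = pca_full.explained_variance_ratio_
cumulative = np.cumsum(ratios)

for i, (r, c) in enumerate(zip(ratios, cumulative), start=1):
    print(f"PC{i}: {r:.3f} of variance, {c:.3f} cumulative")

# A float between 0 and 1 keeps the smallest number of components
# whose cumulative explained variance exceeds that fraction
pca_95 = PCA(n_components=0.95).fit(X_std)
print("Components kept for 95% of the variance:", pca_95.n_components_)
```

Looking at the cumulative values is a common way to choose n_components: pick the smallest number of components that covers the fraction of variance you are willing to keep.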
Recall that PCA is an unsupervised technique: the transformation does not take class labels into account. It is often used as a preprocessing step before supervised machine learning algorithms to reduce the dimensionality of the data and to filter out some noise or redundant information.
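As a rough illustration of PCA as a preprocessing step, the sketch below chains standardization, PCA, and a classifier in a scikit-learn pipeline. The choice of logistic regression and 5-fold cross-validation is an assumption made for this example, not something prescribed above. Any supervised estimator could take its place.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Scaling and PCA are fitted inside each training fold,
# so no information from the held-out fold leaks into them
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    LogisticRegression(max_iter=1000),
)

# The labels y are used only by the classifier, never by PCA
scores = cross_val_score(model, X, y, cv=5)
print(f"Mean cross-validated accuracy: {scores.mean():.3f}")
```

Wrapping the steps in a pipeline keeps the scaler and the PCA projection tied to the training data, which is usually what you want when evaluating a model.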