// Security Analytics Project
EigenFlow Profiler
Unsupervised attack-type classification using K-means cluster images and PCA eigenprofiles, with a 6.27x benign separation ratio.
Project Overview
Traditional signature-based detection misses novel and subtle threats by design. This project takes a fundamentally different approach: instead of learning from labeled attack signatures, it learns what each attack family looks like structurally and flags anything that doesn't fit.
Inspired by the "eigenfaces" technique in facial recognition, the pipeline encodes each NetFlow attack file as a series of K-means cluster-boundary images, trains group-specific PCA eigenprofile models on those images, then classifies new traffic by measuring which group's model reconstructs it with the lowest L2 error. No labels are required at inference time.
Source data is the CIC-IDS-2018 dataset (Canadian Institute for Cybersecurity / University of New Brunswick), ten parquet files covering four attack families across multiple days of simulated enterprise traffic. The method achieves 70% test classification accuracy and a 6.27x benign separation ratio.
How It Works
Step 1: Shared Normalization
A single MinMaxScaler is fitted once on a 20,000-row representative sample (5,000 rows from one file per attack group) and serialized. All subsequent pipelines use this same fitted scaler, ensuring the eigenprofiles and inference operate on the same feature distribution.
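The shared-scaler step could look roughly like the sketch below. It assumes each attack group is already loaded as a 2D NumPy array; the function name `fit_shared_scaler`, the group names, and the toy data are illustrative stand-ins, not the project's actual code.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

def fit_shared_scaler(group_arrays, rows_per_group=5000, seed=0):
    """Fit one MinMaxScaler on an equal-sized random sample from each group.

    group_arrays: dict mapping a group name to a 2D feature array
    (hypothetical input shape; the source does not show its loader).
    """
    rng = np.random.default_rng(seed)
    samples = []
    for features in group_arrays.values():
        n = min(rows_per_group, len(features))
        rows = rng.choice(len(features), size=n, replace=False)
        samples.append(features[rows])
    return MinMaxScaler().fit(np.vstack(samples))

# Toy stand-in data: four groups, 6,000 rows each, 77 features.
rng = np.random.default_rng(1)
groups = {name: rng.normal(size=(6000, 77)) * 50
          for name in ("application", "credential", "denial", "exploit")}
scaler = fit_shared_scaler(groups)

# Every later stage reuses the same fitted object instead of refitting.
scaled = scaler.transform(groups["denial"][:10])
```

A fitted scaler like this can be persisted with something like `joblib.dump(scaler, path)` and reloaded at inference, which is one common way to get the "fit once, reuse everywhere" behaviour described above.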
Step 2: Stratified Sampling
Each source file is balanced across label classes using a stratified sample capped at 10,000 rows per class. This prevents memory exhaustion on heavily imbalanced files (e.g. Web1-Thursday: 829,883 benign vs 341 attack rows).
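A minimal sketch of the per-class cap, assuming the data sits in a pandas DataFrame. The column names `Label` and `flow_duration` and the helper name `stratified_cap` are assumptions made for illustration.

```python
import pandas as pd

def stratified_cap(df, label_col="Label", cap=10_000, seed=0):
    """Keep at most `cap` rows per label class; smaller classes are kept whole."""
    parts = [
        group.sample(n=min(len(group), cap), random_state=seed)
        for _, group in df.groupby(label_col)
    ]
    return pd.concat(parts).reset_index(drop=True)

# Toy frame mimicking the imbalance quoted above for Web1-Thursday.
toy = pd.DataFrame({
    "Label": ["Benign"] * 829_883 + ["Attack"] * 341,
    "flow_duration": range(829_883 + 341),
})
balanced = stratified_cap(toy)
counts = balanced["Label"].value_counts().to_dict()
```

With this cap the benign class shrinks to 10,000 rows while all 341 attack rows survive, so the rare class is never downsampled.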
Step 3: K-means Cluster-Boundary Images
The scaled feature matrix is compressed to two dimensions via PCA. A K-means (k=3) model is fit on the 2D projection and reused for both scatter point labels and mesh-grid background, ensuring colors are always consistent. The resulting cluster-boundary plot is saved as a 100x100 grayscale PNG, the visual fingerprint of that attack's structure. Five augmented images per file are produced by injecting relative Gaussian noise before clustering.
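The image-generation step might be sketched as follows. This simplified version renders only the mesh-grid background as a 100x100 array and leaves out the scatter overlay and PNG export. It also reads "relative Gaussian noise" as multiplicative noise, which is an assumption, and the function name is hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def cluster_boundary_image(scaled, size=100, k=3, noise=0.0, seed=0):
    """Return a size x size uint8 image of K-means regions in 2D PCA space."""
    rng = np.random.default_rng(seed)
    # Assumed form of "relative" noise: perturbation scales with each value.
    noisy = scaled * (1.0 + noise * rng.standard_normal(scaled.shape))
    points = PCA(n_components=2, random_state=seed).fit_transform(noisy)
    # One fitted model labels both the points and the background mesh.
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(points)
    (x_min, y_min), (x_max, y_max) = points.min(axis=0), points.max(axis=0)
    xs = np.linspace(x_min, x_max, size)
    ys = np.linspace(y_min, y_max, size)
    grid_x, grid_y = np.meshgrid(xs, ys)
    mesh = np.column_stack([grid_x.ravel(), grid_y.ravel()])
    regions = km.predict(mesh).reshape(size, size)
    # Map cluster ids 0..k-1 onto evenly spaced gray levels.
    return (regions * (255 // (k - 1))).astype(np.uint8)

# Five augmented images from one toy "file" of 500 scaled flows.
toy = np.random.default_rng(0).random((500, 77))
images = [cluster_boundary_image(toy, noise=0.05, seed=s) for s in range(5)]
```

Because the mesh is labelled by the same fitted `KMeans` object as the points, region shading and point labels cannot disagree within an image.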
Step 4: Eigenprofile Training
Augmented images are grouped by attack family. For each of the four groups, all images are converted to grayscale, resized to 100x100, and flattened into 10,000-dimensional vectors; a PCA model (5 components) is then fit on the group's image matrix. The resulting principal components, the "eigenprofiles", are serialized for reuse at inference.
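A rough sketch of per-group training, assuming the images are already 100x100 grayscale arrays (the grayscale conversion and resizing are omitted here). The function name and the random toy images are illustrative only.

```python
import numpy as np
from sklearn.decomposition import PCA

def train_eigenprofiles(images_by_group, n_components=5):
    """Fit one PCA per group on flattened 100x100 grayscale images."""
    models = {}
    for group, images in images_by_group.items():
        # Each image becomes one 10,000-dimensional row of the group matrix.
        matrix = np.stack(
            [np.asarray(img, dtype=np.float64).ravel() for img in images]
        )
        models[group] = PCA(n_components=n_components, random_state=0).fit(matrix)
    return models

# Toy input: ten random "images" per group, standing in for real cluster plots.
rng = np.random.default_rng(0)
toy_images = {
    group: rng.integers(0, 256, size=(10, 100, 100))
    for group in ("application", "credential", "denial", "exploit")
}
models = train_eigenprofiles(toy_images)
eigenprofiles = models["application"].components_
```

Each row of `components_` can be reshaped back to 100x100 to view an eigenprofile as an image.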
Step 5: Reconstruction Error Classification
A new flow image is projected onto each group's 5-dimensional eigenspace and reconstructed. The L2 norm between the original and reconstructed image is computed for each group and the group yielding the lowest error wins. For benign rejection, raw NetFlow rows are normalized with the shared scaler and zero-padded to 10,000 dimensions before projection.
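The classification rule can be sketched as below, assuming each group model is a fitted scikit-learn `PCA`. The helper names and the synthetic two-group setup are hypothetical and exist only to make the example runnable.

```python
import numpy as np
from sklearn.decomposition import PCA

def reconstruction_errors(vector, models):
    """L2 error of `vector` after a round trip through each group's eigenspace."""
    row = np.asarray(vector, dtype=np.float64).reshape(1, -1)
    errors = {}
    for group, pca in models.items():
        rebuilt = pca.inverse_transform(pca.transform(row))
        errors[group] = float(np.linalg.norm(row - rebuilt))
    return errors

def classify(vector, models):
    """Return the group with the lowest reconstruction error, plus that error."""
    errors = reconstruction_errors(vector, models)
    best = min(errors, key=errors.get)
    return best, errors[best]

def pad_flow_row(scaled_row, dim=10_000):
    """Zero-pad a scaled flow row up to the image dimensionality."""
    padded = np.zeros(dim)
    padded[: len(scaled_row)] = scaled_row
    return padded

# Toy models: each group is a fixed base pattern plus small noise.
rng = np.random.default_rng(0)
bases = {g: rng.random(10_000) * 255 for g in ("application", "denial")}
models = {
    g: PCA(n_components=5, random_state=0).fit(b + rng.normal(0, 5, (8, 10_000)))
    for g, b in bases.items()
}

sample = bases["denial"] + rng.normal(0, 5, 10_000)
label, error = classify(sample, models)

# A padded raw row looks nothing like either group's images.
padded = pad_flow_row(rng.random(77))
_, padded_error = classify(padded, models)
```

In this toy setup the padded raw row reconstructs far worse than the in-distribution sample, which is the effect the benign-rejection step relies on.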
Results
Classification Accuracy: 70% (21/30)
Application (6/6), Credential (3/3), Botnet (3/3), and Infil2 (3/3) all classify correctly with low, stable reconstruction errors. The misclassifications are systematic rather than random: DDoS1-Tuesday and DoS1-Thursday (6 images) are consistently predicted as exploit, as their cluster-boundary images visually resemble the exploit group's training data more than the other denial files. Infil1-Wednesday (3 images) is consistently predicted as application, because low-rate infiltration traffic can produce 2D cluster patterns geometrically similar to web application traffic. Notably, misclassified files show errors of 4,900 to 5,540 vs. correctly classified denial files at 896 to 1,679, so the error magnitude itself is a useful anomaly signal even when group assignment is wrong.
Benign Separation: 6.27x Ratio
8,000 benign flow rows (from four source files) are passed through the shared scaler and zero-padded before projection. Mean best-group reconstruction error for benign traffic is ~21,110 vs. ~3,369 for matched attack traffic, a 6.27x separation. All four group models produce consistently high errors on benign data; none of them reconstructs this out-of-distribution traffic well. This gap enables a simple threshold-based anomaly filter: any sample whose best-group error exceeds ~10,000 can be flagged as likely benign before it reaches the group classifier.
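The ratio and the threshold filter reduce to a few lines of arithmetic. The sketch below uses the mean errors quoted above; the function names are illustrative, and the 10,000 cutoff is the approximate value from the text rather than a tuned parameter.

```python
def separation_ratio(benign_errors, attack_errors):
    """Mean benign reconstruction error divided by mean attack error."""
    benign_mean = sum(benign_errors) / len(benign_errors)
    attack_mean = sum(attack_errors) / len(attack_errors)
    return benign_mean / attack_mean

def likely_benign(best_group_error, threshold=10_000.0):
    """Flag a sample whose best-group error is above the cutoff."""
    return best_group_error > threshold

# Using the reported means as single-value stand-ins for the full error lists.
ratio = round(separation_ratio([21_110], [3_369]), 2)

flags = {
    "mean benign error": likely_benign(21_110),
    "mean attack error": likely_benign(3_369),
    "worst misclassified attack": likely_benign(5_540),
}
```

Even the highest error reported for a misclassified attack file (5,540) stays under the cutoff, so the filter and the group classifier fail in different ways.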
Eigenprofiles by Attack Group
The 5 principal components learned for each group. Application traffic is highly consistent, with PC1 alone explaining 79.4% of variance. Credential traffic spreads across three components of near-equal weight (33.6%, 33.1%, 19.0%), reflecting more diverse cluster patterns within brute-force traffic.
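For reference, per-component percentages like these are the squared singular values of the centered image matrix divided by their total. The sketch below computes them on synthetic data with one dominant direction; it is a generic illustration, not the project's code or figures.

```python
import numpy as np

def explained_variance_ratio(matrix, n_components=5):
    """Share of total variance captured by each leading principal component."""
    centered = matrix - matrix.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    return variances[:n_components] / variances.sum()

# Toy group: 12 flattened images that vary mostly along a single direction.
rng = np.random.default_rng(0)
direction = rng.normal(size=10_000)
direction /= np.linalg.norm(direction)
coefficients = np.linspace(-1.0, 1.0, 12) * 100
matrix = np.outer(coefficients, direction) + rng.normal(0, 0.1, (12, 10_000))

ratios = explained_variance_ratio(matrix)
```

A group whose images differ along a single pattern puts nearly all its variance in PC1, as Application does, while several comparable ratios indicate several distinct patterns, as with Credential.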
Sample Cluster Images & Reconstructions
Sample cluster-boundary images (left column) and their eigenprofile reconstructions (right column) for correctly classified test cases. Each cluster image shows NetFlow records in 2D PCA space with K-means decision boundaries. The reconstruction shows how well the group's eigenprofile captures the image's visual structure.
Application
Credential
Denial-of-Service
Exploit / Infiltration
Notebook
Complete Jupyter notebook covering shared normalization, stratified sampling, cluster-boundary image generation, eigenprofile training, test classification, and benign rejection evaluation.
View Notebook →
Full Report
Complete writeup including problem statement, methodology, per-group results with explained variance tables, classification results table, benign rejection analysis, and discussion of limitations and future work.
Read the Full Report →
Key Takeaways
- 70% classification accuracy reflects real structural similarity between some attack families; the misclassifications are systematic and informative, not random noise.
- The 6.27x benign separation ratio is the more operationally valuable result: a simple reconstruction-error threshold can reliably filter benign traffic before the group classifier runs.
- Chaining two lossy projections (77D to 2D, then a 100x100 image to a 5D eigenspace) is a significant limitation. Replacing the first PCA projection with a supervised method such as LDA would likely improve classification accuracy substantially.
- Image-style encoding of network flow data opens the door to the full computer vision toolkit for network security problems.
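As a rough illustration of the LDA suggestion in the takeaways, the sketch below projects 77-dimensional flows to 2D with scikit-learn's `LinearDiscriminantAnalysis`. It is one possible reading of that idea rather than something the project implements: it assumes attack-family labels are available when the projection is fit, and the data is synthetic.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic stand-in: four attack families with shifted feature means.
rng = np.random.default_rng(0)
n_per_class, n_features = 200, 77
centers = rng.normal(0, 3, size=(4, n_features))
features = np.vstack(
    [c + rng.normal(size=(n_per_class, n_features)) for c in centers]
)
labels = np.repeat(np.arange(4), n_per_class)

# With four classes LDA yields at most three components; two are kept
# so the output can feed the same 2D clustering and image step.
lda = LinearDiscriminantAnalysis(n_components=2)
points_2d = lda.fit_transform(features, labels)
```

Unlike PCA, this projection is chosen to separate the labelled families, so it needs labels when it is fit even though applying it to new traffic does not.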