
Project Overview

The goal of this project is to build a practical, high-performance classifier that can handle the full triage workflow a SOC analyst faces: first, is this flow an attack at all? And if so, what kind? Rather than collapsing everything into one multiclass problem, the pipeline uses two chained XGBoost models — one binary, one multiclass — so each stage can be tuned, evaluated, and replaced independently.

All 10 source files are concatenated into a single dataset, an 80/20 train/test split is made (stratified on the binary target to preserve class ratios), and both models are trained on the same split for a fair end-to-end evaluation.
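The stratified split described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the notebook's own code: `stratified_split`, the seed, and the toy `y_binary` list are hypothetical. In practice scikit-learn's `train_test_split` with its `stratify` argument does the same job.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx), splitting each class at about test_fraction."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # shuffle within the class only
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy binary target (assumed encoding): 0 = benign, 1 = attack.
y_binary = [0] * 80 + [1] * 20
train_idx, test_idx = stratified_split(y_binary)
```

In this toy case both halves keep the 80:20 benign-to-attack ratio, which is what stratifying on the binary target is meant to guarantee.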

Stage 01

Attack vs. Benign

A binary XGBoost classifier is trained to flag each flow as "benign" or "attack." It achieves ~98% accuracy and 0.99 AUC on held-out data and trains on millions of flows in under 20 seconds on commodity hardware.
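To make the two reported metrics concrete, here is a small sketch of how accuracy and ROC AUC can be computed from Stage 1 outputs. The helper names, the 0.5 threshold, and the toy scores are assumptions for illustration. The pairwise AUC loop is only suitable for small inputs; library implementations use a rank-based formula instead.

```python
def accuracy(y_true, y_pred):
    """Fraction of flows whose predicted class matches the true class."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def roc_auc(y_true, scores):
    """Probability that a random attack flow outscores a random benign flow."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))

# Toy example: 1 = attack, 0 = benign; scores are predicted attack probabilities.
y_true = [0, 0, 0, 1, 1, 1]
scores = [0.1, 0.4, 0.6, 0.35, 0.8, 0.9]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 decision threshold

acc = accuracy(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

Accuracy depends on the chosen threshold, while AUC summarizes ranking quality across all thresholds, which is why the two numbers are reported together.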

Stage 02

Attack Type Identification

Flows predicted as "attack" by Stage 1 are passed to a multiclass XGBoost model trained exclusively on attack-labeled flows. It outputs one of 14 specific attack labels — DDoS-HTTP, SSH-BruteForce, SQL Injection, Bot, Infiltration, and more. Dominant classes reach F₁ > 0.99; minority classes with very few training samples (e.g. FTP-BruteForce: 53 records, SQL Injection: 85 records) show lower scores, highlighting the real-world challenge of long-tail attack distributions.
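The routing between the two stages can be sketched as follows. This is an illustrative outline, not the project's implementation: `classify_flows` and the two stand-in model classes are hypothetical, and they only mimic the `predict` interface that a trained classifier would expose. A real XGBoost model would return arrays and not lists.

```python
def classify_flows(flows, stage1, stage2, benign_label="Benign"):
    """Run the binary detector, then label only the flagged flows by attack type."""
    is_attack = stage1.predict(flows)  # 1 = attack, 0 = benign
    attack_rows = [row for row, flag in zip(flows, is_attack) if flag == 1]
    attack_types = stage2.predict(attack_rows) if attack_rows else []
    type_iter = iter(attack_types)
    # Merge results back into the original flow order.
    return [next(type_iter) if flag == 1 else benign_label for flag in is_attack]

class ThresholdDetector:
    """Stand-in for Stage 1: flags flows whose first feature exceeds a cutoff."""
    def __init__(self, cutoff):
        self.cutoff = cutoff

    def predict(self, rows):
        return [1 if row[0] > self.cutoff else 0 for row in rows]

class RuleTypeClassifier:
    """Stand-in for Stage 2: picks an attack label from the second feature."""
    def predict(self, rows):
        return ["DDoS-HTTP" if row[1] > 0.5 else "SSH-BruteForce" for row in rows]

flows = [[0.1, 0.9], [0.8, 0.7], [0.9, 0.2], [0.3, 0.1]]
labels = classify_flows(flows, ThresholdDetector(0.5), RuleTypeClassifier())
```

Because Stage 2 only ever sees flows that Stage 1 flagged, a benign flow misclassified in Stage 1 will still receive an attack label, so end-to-end evaluation has to account for errors from both stages.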

Pipeline Overview
Figure: Two-stage pipeline flowchart

Dataset

Data is sourced from the CSE-CIC-IDS2018 dataset (often written CIC-IDS-2018), a joint project of the Communications Security Establishment (CSE) and the Canadian Institute for Cybersecurity (CIC) at the University of New Brunswick, hosted on the AWS Open Data Registry. The dataset simulates a realistic enterprise network over 10 days in February–March 2018, capturing both benign background traffic and coordinated attack scenarios.

Each flow is represented by 77 numeric features extracted by CICFlowMeter — including packet lengths, inter-arrival times, flag counts, flow duration, and byte ratios — plus a label column. All 10 files are combined for a total of ~6.66 million flow records and 14 distinct attack labels.
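As a rough sketch of how the label column feeds both stages, the snippet below derives the binary Stage 1 target and counts attack labels for Stage 2. The exact spelling of the benign label (`"Benign"`), the helper name, and the toy label list are assumptions; the real files may encode labels differently.

```python
from collections import Counter

BENIGN = "Benign"  # assumed spelling of the benign label in the label column

def to_binary_target(labels, benign=BENIGN):
    """Map each label to 0 (benign) or 1 (any attack type)."""
    return [0 if label == benign else 1 for label in labels]

# Toy label column standing in for the concatenated files.
labels = ["Benign"] * 6 + ["DDoS-HTTP"] * 3 + ["SQL Injection"]

y_binary = to_binary_target(labels)  # Stage 1 target
attack_counts = Counter(l for l in labels if l != BENIGN)  # Stage 2 class sizes
```

Inspecting `attack_counts` before training is a quick way to spot the rare classes that are likely to score poorly in Stage 2.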

6.66M
Total Flow Records

Across 10 daily capture files, Feb–Mar 2018

77
Features per Flow

CICFlowMeter-derived: timing, packet stats, flags, byte ratios

14
Attack Labels

Spanning DDoS, DoS, Brute Force, Botnet, Infiltration, and Web attacks

80/20
Train / Test Split

Stratified on the binary target to preserve class ratios

Source Files & Attack Labels

Highlights

Visualizations

F₁ score breakdown by attack class alongside the Sankey diagram showing how the two-stage pipeline routes traffic from the full dataset through binary detection to specific attack type classification.
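The per-class F₁ breakdown can be reproduced from true and predicted labels with a few counters. The function name and toy labels below are hypothetical; scikit-learn's `f1_score` with `average=None` is the usual library route.

```python
from collections import Counter

def f1_by_class(y_true, y_pred):
    """Return {label: F1}, where F1 = 2*TP / (2*TP + FP + FN) for each label."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted label gains a false positive
            fn[t] += 1  # true label gains a false negative
    scores = {}
    for label in set(y_true) | set(y_pred):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores

# Toy example: one rare-class flow is absorbed by the dominant class.
y_true = ["Bot", "Bot", "Bot", "Bot", "SQL Injection", "SQL Injection"]
y_pred = ["Bot", "Bot", "Bot", "Bot", "Bot", "SQL Injection"]
f1 = f1_by_class(y_true, y_pred)
```

A single misclassified flow costs the two-sample class far more than the four-sample class, which mirrors the long-tail effect seen for the smallest attack categories.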

Figure: F₁ Score by Attack Class
Figure: Two-Stage Classification Flow (Sankey)
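A Sankey diagram of this kind is drawn from (source, target, count) links. The sketch below shows one way to aggregate them from the pipeline's final labels. The node names, the function name, and the toy labels are assumptions and may not match the diagram in the notebook.

```python
from collections import Counter

def sankey_links(final_labels, benign_label="Benign"):
    """Aggregate final pipeline labels into (source, target, count) triples."""
    links = Counter()
    for label in final_labels:
        if label == benign_label:
            links[("All flows", "Predicted benign")] += 1
        else:
            links[("All flows", "Predicted attack")] += 1
            links[("Predicted attack", label)] += 1
    return [(src, dst, n) for (src, dst), n in links.items()]

final_labels = ["Benign", "Benign", "Bot", "Bot", "DDoS-HTTP"]
links = sankey_links(final_labels)
```

The resulting triples can be handed to a plotting library's Sankey chart after mapping node names to indices.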

Notebook

Full pipeline in a single notebook: data loading and concatenation, 80/20 stratified split, Stage 1 binary XGBoost training and evaluation, Stage 2 multiclass training on attack-only flows, end-to-end classification report, and Sankey diagram generation.


Key Takeaways