Intrusion Detection with Scikit-learn Models


Overview#

This project builds predictive models that distinguish bad connections (intrusions or attacks) from good connections. Attack categories include:

DoS/DDoS: denial of service, e.g. SYN flooding
R2L: unauthorized access from a remote machine
U2R: unauthorized access to local superuser (root) privileges, e.g. buffer overflow
Probing: surveillance of the network, e.g. port scanning


Key Achievements#

Achieved 100% training accuracy and 99.88% prediction accuracy with the Random Forest classifier.
All six models exceeded 87% on both training accuracy and prediction accuracy.
Identified and removed highly correlated features to improve model interpretability and reduce redundancy.
Created correlation heatmaps, numeric feature bar plots, and line graphs using Matplotlib and Seaborn.
Demonstrated understanding of the machine learning model pipeline: data preprocessing -> training -> tuning -> prediction -> evaluation.
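
The removal of highly correlated features listed above can be illustrated with a minimal sketch. This is not the project's actual code: the helper name drop_highly_correlated, the 0.9 threshold, and the toy columns a, b, and c are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def drop_highly_correlated(df: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Drop the later column of every pair whose absolute Pearson
    correlation exceeds `threshold` (hypothetical helper)."""
    corr = df.select_dtypes(include="number").corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)


# Toy data (assumption): "b" is almost a scaled copy of "a", "c" is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + 0.01 * rng.normal(size=200),
    "c": rng.normal(size=200),
})

reduced = drop_highly_correlated(toy, threshold=0.9)
print(list(reduced.columns))
```

Only the second column of each correlated pair is dropped, so one representative of every redundant group survives. The right threshold depends on the data and would need to be chosen by inspecting the heatmap.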

Implementation#

Data Preprocessing: Identify and plot the categorical features, filter the numeric columns, build the feature matrix and target, split the data 67/33 into train and test sets, then map protocol_type and flag to integers and update the train/test data.
Exploratory Data Analysis: Compute the correlations between all numeric features of the training data and plot them as a heatmap, remove the highly correlated features and plot the heatmap again, then drop the columns that add no value from both the train and test data.
Model Training: Apply fit-transform to trainX and testX, train each model (Naive Bayes, Decision Tree, Random Forest, SVM, Logistic Regression, Gradient Boosting) with appropriate parameters while timing the training, predict on the training data and the test data while timing the predictions, then calculate accuracy scores.
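
The steps above can be sketched end to end on synthetic data. Everything beyond the names protocol_type, flag, trainX, and testX is an assumption: the columns duration and src_bytes, the labelling rule, the choice of StandardScaler for the fit-transform step, and the model parameters are hypothetical stand-ins, and only two of the six models are shown to keep the example short.

```python
import time

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the connection records (hypothetical values).
rng = np.random.default_rng(42)
n = 600
data = pd.DataFrame({
    "duration": rng.exponential(scale=2.0, size=n),
    "src_bytes": rng.integers(0, 5000, size=n),
    "protocol_type": rng.choice(["tcp", "udp", "icmp"], size=n),
    "flag": rng.choice(["SF", "S0", "REJ"], size=n),
})
# Hypothetical labelling rule so the toy problem is learnable.
is_attack = (data["flag"] == "S0") | (data["src_bytes"] > 4000)
data["label"] = np.where(is_attack, "attack", "normal")

# Map the two categorical columns to integer codes.
for col in ("protocol_type", "flag"):
    mapping = {value: code for code, value in enumerate(sorted(data[col].unique()))}
    data[col] = data[col].map(mapping)

# Feature matrix, target, and a 67/33 train/test split.
X = data.drop(columns="label")
y = data["label"]
trainX, testX, trainY, testY = train_test_split(
    X, y, test_size=0.33, random_state=0, stratify=y
)

# Assumption: the fit-transform step is feature scaling. The scaler is fitted
# on the training split only and then applied to the test split.
scaler = StandardScaler()
trainX_scaled = scaler.fit_transform(trainX)
testX_scaled = scaler.transform(testX)

models = {
    "Naive Bayes": GaussianNB(),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

results = {}
for name, model in models.items():
    start = time.perf_counter()
    model.fit(trainX_scaled, trainY)
    train_time = time.perf_counter() - start

    train_acc = accuracy_score(trainY, model.predict(trainX_scaled))

    start = time.perf_counter()
    test_pred = model.predict(testX_scaled)
    predict_time = time.perf_counter() - start

    results[name] = {
        "train_acc": train_acc,
        "test_acc": accuracy_score(testY, test_pred),
        "train_time_s": train_time,
        "predict_time_s": predict_time,
    }

for name, row in results.items():
    print(name, row)
```

Fitting the scaler on the training split alone keeps test-set statistics out of training; whether the original project did this is not stated. The timings depend on hardware and will not match the reported results.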

Results#

Model	Train Accuracy	Prediction Accuracy	Train Time	Prediction Time
Naive Bayes	87.99%	88.28%	0.92s	1.25s
Decision Tree	90.05%	87.64%	1.57s	0.08s
Random Forest	100%	99.88%	8.58s	1.38s
SVM	99.69%	99.68%	7.86s	0.13s
Logistic Regression	99.41%	99.41%	7.41s	0.12s
Gradient Boosting	99.94%	99.16%	125.49s	1.97s

The Random Forest classifier has the highest accuracy for both training and prediction.
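
As a quick check of that reading, the table can be loaded into plain Python and queried. The numbers are copied from the table above; the variable names and the "gap" measure (training accuracy minus prediction accuracy, a rough indicator of overfitting) are illustrative additions, not part of the original analysis.

```python
# Rows copied from the results table:
# (model, train accuracy %, prediction accuracy %, train time s, prediction time s)
rows = [
    ("Naive Bayes", 87.99, 88.28, 0.92, 1.25),
    ("Decision Tree", 90.05, 87.64, 1.57, 0.08),
    ("Random Forest", 100.0, 99.88, 8.58, 1.38),
    ("SVM", 99.69, 99.68, 7.86, 0.13),
    ("Logistic Regression", 99.41, 99.41, 7.41, 0.12),
    ("Gradient Boosting", 99.94, 99.16, 125.49, 1.97),
]

best_train = max(rows, key=lambda r: r[1])[0]
best_pred = max(rows, key=lambda r: r[2])[0]
slowest_train = max(rows, key=lambda r: r[3])[0]

# Training accuracy minus prediction accuracy, in percentage points.
gaps = {name: round(train - pred, 2) for name, train, pred, _, _ in rows}
largest_gap = max(gaps, key=gaps.get)

print("Best training accuracy:", best_train)
print("Best prediction accuracy:", best_pred)
print("Slowest to train:", slowest_train)
print("Largest train/prediction gap:", largest_gap, gaps[largest_gap])
```

The same table also shows the cost side of the comparison: Gradient Boosting takes by far the longest to train, and the Decision Tree has the widest gap between training and prediction accuracy.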

Technologies#

Python, Seaborn, Matplotlib, NumPy, Scikit-learn, Logistic Regression, Decision Trees, Random Forest, Support Vector Machines (SVM), Gradient Boosting, Naive Bayes
