Intrusion Detection with Scikit-learn Models
Personal Projects #Python #Machine Learning #Data Science

Overview#

This project builds a predictive model that distinguishes bad connections (intrusions or attacks) from good, normal connections. Attack categories include:

  • DoS/DDoS: denial of service, e.g. SYN flooding
  • R2L: unauthorized access from a remote machine
  • U2R: unauthorized access to local superuser (root) privileges, e.g. buffer overflow
  • Probing: surveillance of the network, e.g. port scanning

View the project on GitHub

Key Achievements#

  • Achieved 100% training accuracy and 99.88% prediction accuracy with the Random Forest classifier
  • All six models achieved training and prediction accuracy above 80% (the lowest score was 87.64%)
  • Identified and removed highly correlated features to increase model interpretability and reduce redundancy
  • Created correlation heatmaps, numeric feature bar plots, and line graphs using Matplotlib and Seaborn
  • Demonstrated an understanding of the full machine learning pipeline: data preprocessing -> training -> tuning -> prediction -> evaluation
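
The correlated-feature removal mentioned above can be sketched as follows. This is a minimal illustration, not the project's actual code: it assumes the features live in a pandas DataFrame, and the helper name drop_highly_correlated, the 0.9 threshold, and the demo column names are all hypothetical.

```python
import numpy as np
import pandas as pd


def drop_highly_correlated(df, threshold=0.9):
    """Drop one column from each pair whose absolute Pearson correlation exceeds `threshold`.

    Hypothetical helper for illustration; the threshold is an assumed value.
    """
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is examined once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    # A column is dropped if it correlates strongly with any earlier column.
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop


# Synthetic demo data with made-up column names.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
demo = pd.DataFrame({
    "src_bytes": base,
    "dst_bytes": base * 2.0 + rng.normal(scale=0.01, size=200),  # near-duplicate of src_bytes
    "duration": rng.normal(size=200),                            # independent feature
})

reduced, dropped = drop_highly_correlated(demo, threshold=0.9)
print("dropped:", dropped)
print("kept:", list(reduced.columns))
```

The matrix returned by df.corr() is also what a Seaborn heatmap would typically be drawn from, so the same computation can serve both the plot and the feature filter.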

Implementation#

  • Data Preprocessing: Identify and plot the categorical features; filter the numeric columns and build the target vector and feature matrix; split the data 67/33 into train and test sets; map protocol_type and flag to integers and update the train/test data.
  • Exploratory Data Analysis: Compute the correlations between all numeric features of the training data and plot them as a heatmap; remove highly correlated features and plot the heatmap again; drop columns that add no value from the train/test data.
  • Model Training: Fit and transform trainX and testX; train the models (Naive Bayes, Decision Tree, Random Forest, SVM, Logistic Regression, Gradient Boosting) with appropriate parameters and record the training time; predict on the training and test data and record the prediction time; calculate accuracy scores.
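
The steps above can be sketched end to end on toy data. This is a rough sketch under stated assumptions rather than the project's code: the data is synthetic, the "count" feature and the labelling rule are invented, pandas and StandardScaler are assumed choices, and only three of the six models are shown. The scaler here is fitted on the training split only and then applied to both splits, which is one common way to read "fit transform trainX and testX".

```python
import time

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the connection data (assumption: not the real dataset).
rng = np.random.default_rng(42)
n = 600
df = pd.DataFrame({
    "protocol_type": rng.choice(["tcp", "udp", "icmp"], size=n),
    "flag": rng.choice(["SF", "S0", "REJ"], size=n),
    "count": rng.integers(0, 500, size=n),  # hypothetical numeric feature
})
# Invented labelling rule so the toy problem is learnable.
df["label"] = np.where((df["count"] > 250) & (df["flag"] != "SF"), "attack", "normal")

# Map the two categorical columns to integer codes.
for col in ("protocol_type", "flag"):
    mapping = {value: code for code, value in enumerate(sorted(df[col].unique()))}
    df[col] = df[col].map(mapping)

X = df.drop(columns="label")
y = df["label"]

# 67/33 train/test split, stratified so both classes appear in each split.
trainX, testX, trainY, testY = train_test_split(
    X, y, test_size=0.33, random_state=0, stratify=y
)

# Fit the scaler on the training data only, then transform both splits.
scaler = StandardScaler()
trainX = scaler.fit_transform(trainX)
testX = scaler.transform(testX)

models = {
    "Naive Bayes": GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

results = {}
for name, model in models.items():
    start = time.perf_counter()
    model.fit(trainX, trainY)
    train_time = time.perf_counter() - start

    start = time.perf_counter()
    test_pred = model.predict(testX)
    predict_time = time.perf_counter() - start

    results[name] = {
        "train_acc": accuracy_score(trainY, model.predict(trainX)),
        "test_acc": accuracy_score(testY, test_pred),
        "train_time": train_time,
        "predict_time": predict_time,
    }
    print(f"{name}: train={results[name]['train_acc']:.4f} "
          f"test={results[name]['test_acc']:.4f} "
          f"fit={train_time:.3f}s predict={predict_time:.3f}s")
```

Timing with time.perf_counter() around fit and predict mirrors the train-time and prediction-time columns reported in the results; the absolute numbers from this toy run are not comparable to those in the table.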

Results#

Model                 Train Accuracy   Prediction Accuracy   Train Time   Prediction Time
Naive Bayes           87.99%           88.28%                0.92s        1.25s
Decision Tree         90.05%           87.64%                1.57s        0.08s
Random Forest         100%             99.88%                8.58s        1.38s
SVM                   99.69%           99.68%                7.86s        0.13s
Logistic Regression   99.41%           99.41%                7.41s        0.12s
Gradient Boosting     99.94%           99.16%                125.49s      1.97s

The Random Forest classifier has the highest accuracy for both training and prediction.
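
As a small cross-check, the figures reported above can be ranked programmatically. The snippet below only re-enters the reported numbers by hand; the tuple layout and variable names are illustrative choices, and the train-minus-prediction gap is offered as a rough overfitting signal rather than a measure used in the project.

```python
# (model, train accuracy %, prediction accuracy %, train time s, prediction time s)
reported = [
    ("Naive Bayes", 87.99, 88.28, 0.92, 1.25),
    ("Decision Tree", 90.05, 87.64, 1.57, 0.08),
    ("Random Forest", 100.0, 99.88, 8.58, 1.38),
    ("SVM", 99.69, 99.68, 7.86, 0.13),
    ("Logistic Regression", 99.41, 99.41, 7.41, 0.12),
    ("Gradient Boosting", 99.94, 99.16, 125.49, 1.97),
]

# Model with the highest accuracy on each split.
best_train = max(reported, key=lambda row: row[1])
best_pred = max(reported, key=lambda row: row[2])
print("best train accuracy:", best_train[0])
print("best prediction accuracy:", best_pred[0])

# Gap between train and prediction accuracy for each model.
gaps = {name: round(train - pred, 2) for name, train, pred, *_ in reported}
for name, gap in gaps.items():
    print(f"{name}: gap {gap:+.2f} points")

# Slowest model to train.
slowest = max(reported, key=lambda row: row[3])
print("slowest to train:", slowest[0])
```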

Technologies#

Python, Seaborn, Matplotlib, NumPy, Scikit-learn, Logistic Regression, Decision Trees, Random Forest, Support Vector Machines (SVM), Gradient Boosting, Naive Bayes
