Intrusion Detection with Scikit-learn Models
Personal Projects #Python #Machine Learning #Data Science
Overview
A predictive model that differentiates between bad connections (intrusions or attacks) and good connections. Attack categories include:
- DoS/DDoS: denial of service, e.g. SYN flooding
- R2L: unauthorized access from a remote machine
- U2R: unauthorized access to local superuser (root) privileges, e.g. buffer overflow
- Probing: surveillance of a network, e.g. port scanning
Key Achievements
- Achieved 100% training accuracy and 99.88% prediction accuracy with the Random Forest classifier
- All models achieved both training accuracy and prediction accuracy above 80%
- Identified and removed highly correlated features to increase model interpretability and reduce redundancy
- Created correlation heatmaps, numeric feature bar plots, and line graphs using Matplotlib and Seaborn
- Demonstrated understanding of the machine learning model pipeline: data preprocessing -> training -> tuning -> prediction -> evaluation
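The correlated-feature removal listed above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the helper name `drop_highly_correlated`, the 0.9 threshold, and the toy column names are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def drop_highly_correlated(df: pd.DataFrame, threshold: float = 0.9):
    """Drop one column from every pair whose absolute correlation exceeds threshold.

    Hypothetical helper; the 0.9 default is an assumed cutoff.
    Returns the reduced frame and the list of dropped column names.
    """
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once
    # and a column is never compared with itself.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop


# Toy data (assumed): dst_bytes is almost a scaled copy of src_bytes.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
demo = pd.DataFrame({
    "src_bytes": base,
    "dst_bytes": base * 2.0 + rng.normal(scale=0.01, size=200),
    "duration": rng.normal(size=200),
})

reduced, dropped = drop_highly_correlated(demo, threshold=0.9)
print("dropped:", dropped)
print("kept:", list(reduced.columns))
```

Of each correlated pair, the column that appears later in the frame is the one dropped, so column order decides which feature survives. A heatmap of `reduced.corr()` could then be drawn with Seaborn to confirm that the remaining features are no longer strongly correlated.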
Implementation
- Data Preprocessing: Identify and plot categorical features, filter numeric columns and create target and feature matrix, 67/33 train/test split, map `protocol_type` and `flag` into integers and update train/test data
- Exploratory Data Analysis: Determine correlations between all numeric features of training data and create heatmap, remove highly correlated features and create heatmap, remove columns that don’t add value from train/test data
- Model Training: Fit transform trainX and testX, train models (Naive Bayes, Decision Tree, Random Forest, SVM, Logistic Regression, Gradient Boosting) with appropriate parameters, track time taken to train models, predict on training data, predict on test data, track time taken to make predictions, calculate accuracy scores
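The preprocessing and training steps above might look something like the sketch below. It runs on synthetic stand-in data rather than the project's dataset, and several details are assumptions: the column values, the toy labeling rule, the use of `StandardScaler` for the "fit transform" step, the model hyperparameters, and the restriction to three of the six models for brevity.

```python
import time

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumed); the real connection records are not reproduced here.
rng = np.random.default_rng(42)
n = 600
df = pd.DataFrame({
    "protocol_type": rng.choice(["tcp", "udp", "icmp"], size=n),
    "flag": rng.choice(["SF", "S0", "REJ"], size=n),
    "src_bytes": rng.exponential(scale=100.0, size=n),
    "count": rng.integers(0, 500, size=n),
})
# Toy labeling rule, invented purely so the example has something to learn.
df["label"] = np.where((df["flag"] == "S0") | (df["count"] > 400), "attack", "normal")

X = df.drop(columns="label")
y = df["label"]

# 67/33 train/test split, as described in the list above.
trainX, testX, trainY, testY = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=y
)
trainX, testX = trainX.copy(), testX.copy()

# Map the categorical columns to integers using categories seen in training.
for col in ["protocol_type", "flag"]:
    mapping = {value: code for code, value in enumerate(sorted(trainX[col].unique()))}
    trainX[col] = trainX[col].map(mapping)
    testX[col] = testX[col].map(mapping)

# Assumed scaling step: fit on the training split only, then reuse on the test split.
scaler = StandardScaler()
trainX_scaled = scaler.fit_transform(trainX)
testX_scaled = scaler.transform(testX)

# Hyperparameters here are placeholders, not the project's tuned values.
models = {
    "Naive Bayes": GaussianNB(),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

results = {}
for name, model in models.items():
    start = time.perf_counter()
    model.fit(trainX_scaled, trainY)
    train_time = time.perf_counter() - start

    train_acc = accuracy_score(trainY, model.predict(trainX_scaled))

    start = time.perf_counter()
    test_pred = model.predict(testX_scaled)
    predict_time = time.perf_counter() - start
    test_acc = accuracy_score(testY, test_pred)

    results[name] = {
        "train_acc": train_acc,
        "test_acc": test_acc,
        "train_time": train_time,
        "predict_time": predict_time,
    }
    print(f"{name}: train={train_acc:.4f} test={test_acc:.4f} "
          f"fit={train_time:.3f}s predict={predict_time:.3f}s")
```

Fitting the scaler on the training split alone keeps test-set statistics out of training. Building the integer mapping from the training split has a similar motivation, though any category that appears only in the test split would map to `NaN` and need separate handling.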
Results
| Model | Train Accuracy % | Prediction Accuracy % | Train Time | Prediction Time |
|---|---|---|---|---|
| Naive Bayes | 87.99% | 88.28% | 0.92s | 1.25s |
| Decision Tree | 90.05% | 87.64% | 1.57s | 0.08s |
| Random Forest | 100% | 99.88% | 8.58s | 1.38s |
| SVM | 99.69% | 99.68% | 7.86s | 0.13s |
| Logistic Regression | 99.41% | 99.41% | 7.41s | 0.12s |
| Gradient Boosting | 99.94% | 99.16% | 125.49s | 1.97s |
The Random Forest classifier has the highest accuracy for both training and prediction.
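As a quick check of that reading, the table figures can be ranked in a few lines. The numbers below are copied from the results table; the dictionary layout and variable names are just one convenient arrangement for illustration.

```python
# (train accuracy %, prediction accuracy %, train time s, prediction time s)
# Values copied from the results table.
table = {
    "Naive Bayes": (87.99, 88.28, 0.92, 1.25),
    "Decision Tree": (90.05, 87.64, 1.57, 0.08),
    "Random Forest": (100.0, 99.88, 8.58, 1.38),
    "SVM": (99.69, 99.68, 7.86, 0.13),
    "Logistic Regression": (99.41, 99.41, 7.41, 0.12),
    "Gradient Boosting": (99.94, 99.16, 125.49, 1.97),
}

best_train = max(table, key=lambda name: table[name][0])
best_prediction = max(table, key=lambda name: table[name][1])
slowest_to_train = max(table, key=lambda name: table[name][2])

print("highest training accuracy:", best_train)
print("highest prediction accuracy:", best_prediction)
print("longest training time:", slowest_to_train)
```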
Technologies
Python, Seaborn, Matplotlib, NumPy, Scikit-learn, Logistic Regression, Decision Trees, Random Forest, Support Vector Machines (SVM), Gradient Boosting, Naive Bayes