Data preprocessing transforms raw data into a suitable format for machine learning algorithms. Key techniques include:
Data Cleaning: Handling missing values, removing duplicates, and smoothing noisy data.
Data Transformation: Normalizing, encoding, scaling, and reducing data.
Data Integration: Merging and aggregating data from multiple sources.
Data Formatting: Converting data types and reshaping data structures.
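Most of what follows focuses on data cleaning. As a brief illustration of the transformation step, the sketch below min-max scales a numeric column and one-hot encodes a categorical one; the column names come from the demo dataset used later, and the specific scaling and encoding choices are just one reasonable option.

import pandas as pd

df = pd.read_csv('./demo_dataset.csv')

# Coerce to numeric first; corrupted entries become NaN (handled later during imputation)
col = pd.to_numeric(df['bytes_transferred'], errors='coerce')

# Min-max scaling squeezes a numeric feature into the [0, 1] range
df['bytes_transferred_scaled'] = (col - col.min()) / (col.max() - col.min())

# One-hot encoding turns a categorical feature into indicator columns
df = pd.get_dummies(df, columns=['protocol'], prefix='protocol')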
import pandas as pd
import re
data = pd.read_csv('./demo_dataset.csv')

To identify invalid source_ip values, you can use a regular expression to validate the IP addresses:
def is_valid_ip(ip):
    # Match four dot-separated octets, each in the range 0-255
    pattern = re.compile(r'^((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$')
    return bool(pattern.match(ip))
# Check for invalid IP addresses
invalid_ips = data[~data['source_ip'].astype(str).apply(is_valid_ip)]
print(invalid_ips)

    log_id       source_ip destination_port protocol bytes_transferred threat_level
40      41      10.0.0.300               25     SMTP              4096            0
51      52    10.10.10.450      STRING_PORT      FTP              4096            ?
55      56             NaN               53      DNS              1024            0
57      58   192.168.1.475              NaN      UDP              2048            1
63      64      MISSING_IP               53      DNS              1024            0
65      66   192.168.1.600      UNUSED_PORT      UDP              2048            1
71      72      MISSING_IP               53      DNS              1024            0
74      75    172.16.1.400               80     HTTP              1024            0
82      83    172.16.1.450               80     HTTP              1024            0
87      88      MISSING_IP               53      DNS              1024            0
88      89    10.10.10.700              443      TLS               512            1
92      93      INVALID_IP              110     POP3              4096            1
93      94  192.168.1.1050               53      DNS       NON_NUMERIC            0
95      96      MISSING_IP               25     SMTP              4096            1
97      98  192.168.1.1100      UNUSED_PORT      UDP              2048            0
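An equivalent check can lean on the standard library's ipaddress module instead of a regular expression; a minimal drop-in alternative to the helper above (note that, unlike the regex, it also accepts IPv6 addresses):

from ipaddress import ip_address

def is_valid_ip(ip):
    # ip_address() raises ValueError for anything that is not a valid IP address
    try:
        ip_address(str(ip))
        return True
    except ValueError:
        return False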
To identify invalid destination_port values, you can check if the port numbers are within the valid range (0-65535):
def is_valid_port(port):
    try:
        port = int(port)
        return 0 <= port <= 65535
    except ValueError:
        return False
# Check for invalid port numbers
invalid_ports = data[~data['destination_port'].apply(is_valid_port)]
print(invalid_ports)

    log_id       source_ip destination_port protocol bytes_transferred threat_level
0       10      10.0.0.100      STRING_PORT      FTP              4096            ?
34      35   192.168.1.200      STRING_PORT      FTP              4096            ?
51      52    10.10.10.450      STRING_PORT      FTP              4096            ?
57      58   192.168.1.475              NaN      UDP              2048            1
65      66   192.168.1.600      UNUSED_PORT      UDP              2048            1
67      68     10.10.10.77      STRING_PORT      FTP              4096            ?
78      79   172.16.254.77           999999     HTTP              2048            1
97      98  192.168.1.1100      UNUSED_PORT      UDP              2048            0
To identify invalid protocol values, you can check against a list of known protocols:
valid_protocols = ['TCP', 'TLS', 'SSH', 'POP3', 'DNS', 'HTTPS', 'SMTP', 'FTP', 'UDP', 'HTTP']
# Check for invalid protocol values
invalid_protocols = data[~data['protocol'].isin(valid_protocols)]
print(invalid_protocols)

    log_id      source_ip destination_port protocol bytes_transferred threat_level
30      31  192.168.1.119              443  UNKNOWN              9513            2
80      81  192.168.1.224               25  UNKNOWN              1161            1
To identify invalid bytes_transferred values, you can check if the values are numeric and non-negative:
def is_valid_bytes(value):
    try:
        value = int(value)
        return value >= 0
    except ValueError:
        return False
# Check for invalid bytes transferred
invalid_bytes = data[~data['bytes_transferred'].apply(is_valid_bytes)]
print(invalid_bytes)

    log_id       source_ip destination_port protocol bytes_transferred threat_level
1       12  172.16.254.100              110     POP3          NEGATIVE            1
2       27  172.16.254.200              110     POP3       NON_NUMERIC            1
93      94  192.168.1.1050               53      DNS       NON_NUMERIC            0
To identify invalid threat_level values, you can check if the values are within a valid range (e.g., 0-2):
def is_valid_threat_level(threat_level):
    try:
        threat_level = int(threat_level)
        return 0 <= threat_level <= 2
    except ValueError:
        return False
# Check for invalid threat levels
invalid_threat_levels = data[~data['threat_level'].apply(is_valid_threat_level)]
print(invalid_threat_levels)

    log_id      source_ip destination_port protocol bytes_transferred threat_level
0       10     10.0.0.100      STRING_PORT      FTP              4096            ?
34      35  192.168.1.200      STRING_PORT      FTP              4096            ?
51      52   10.10.10.450      STRING_PORT      FTP              4096            ?
67      68    10.10.10.77      STRING_PORT      FTP              4096            ?
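Before moving on to imputation, it can be handy to tally how many rows fail each rule; a short summary built from the helpers and lists defined above:

# Count how many rows fail each validation rule
summary = pd.Series({
    'invalid source_ip': (~data['source_ip'].astype(str).apply(is_valid_ip)).sum(),
    'invalid destination_port': (~data['destination_port'].apply(is_valid_port)).sum(),
    'invalid protocol': (~data['protocol'].isin(valid_protocols)).sum(),
    'invalid bytes_transferred': (~data['bytes_transferred'].apply(is_valid_bytes)).sum(),
    'invalid threat_level': (~data['threat_level'].apply(is_valid_threat_level)).sum(),
})
print(summary)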
Imputation is the process of replacing missing or invalid values in a dataset with estimated values. It is crucial for maintaining the integrity and usability of the data, especially in machine learning and data analysis tasks, where missing values can lead to biased or inaccurate results.
First, convert all invalid or corrupted entries, such as MISSING_IP, INVALID_IP, STRING_PORT, UNUSED_PORT, NON_NUMERIC, or ?, into NaN. This approach standardizes the representation of missing values, enabling uniform downstream imputation steps.
import pandas as pd
import numpy as np
import re
df = pd.read_csv('demo_dataset.csv')
invalid_ips = ['INVALID_IP', 'MISSING_IP']
invalid_ports = ['STRING_PORT', 'UNUSED_PORT']
invalid_bytes = ['NON_NUMERIC', 'NEGATIVE']
invalid_threat = ['?']
df.replace(invalid_ips + invalid_ports + invalid_bytes + invalid_threat, np.nan, inplace=True)
df['destination_port'] = pd.to_numeric(df['destination_port'], errors='coerce')
df['bytes_transferred'] = pd.to_numeric(df['bytes_transferred'], errors='coerce')
df['threat_level'] = pd.to_numeric(df['threat_level'], errors='coerce')
def is_valid_ip(ip):
    # Keep only well-formed IPv4 addresses; everything else becomes NaN
    pattern = re.compile(r'^((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$')
    if pd.isna(ip) or not pattern.match(str(ip)):
        return np.nan
    return ip

df['source_ip'] = df['source_ip'].apply(is_valid_ip)

After this step, NaN represents all missing or invalid data points.
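A quick count of the NaN values per column shows how much imputation remains to be done:

# Count missing values per column after standardizing invalid entries to NaN
print(df.isna().sum())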
For basic numeric columns like bytes_transferred, use simple methods such as the median or mean. For categorical columns like protocol, use the most frequent value.
from sklearn.impute import SimpleImputer
numeric_cols = ['destination_port', 'bytes_transferred', 'threat_level']
categorical_cols = ['protocol']
num_imputer = SimpleImputer(strategy='median')
df[numeric_cols] = num_imputer.fit_transform(df[numeric_cols])
cat_imputer = SimpleImputer(strategy='most_frequent')
df[categorical_cols] = cat_imputer.fit_transform(df[categorical_cols])

For more sophisticated scenarios, employ advanced techniques like KNNImputer or IterativeImputer. These methods consider relationships among features to produce contextually meaningful imputations.
from sklearn.impute import KNNImputer
knn_imputer = KNNImputer(n_neighbors=5)
df[numeric_cols] = knn_imputer.fit_transform(df[numeric_cols])
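IterativeImputer, mentioned above, can be used in place of the KNN approach (pick one or the other, since whichever runs first already fills the NaN values); a minimal sketch reusing the same numeric_cols list:

# IterativeImputer models each feature with missing values as a function of the others
from sklearn.experimental import enable_iterative_imputer  # required to expose IterativeImputer
from sklearn.impute import IterativeImputer

iter_imputer = IterativeImputer(max_iter=10, random_state=0)
df[numeric_cols] = iter_imputer.fit_transform(df[numeric_cols])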
For source_ip values that remain missing, assign a default such as 0.0.0.0. Validate protocol values against known valid protocols. For ports, ensure values fall within the valid range 0-65535, and for protocols that imply certain ports, consider mode-based assignments or domain-specific mappings. Note that this step does not force protocols and ports to align with one another; it is purely data cleaning.
valid_protocols = ['TCP', 'TLS', 'SSH', 'POP3', 'DNS', 'HTTPS', 'SMTP', 'FTP', 'UDP', 'HTTP']
df.loc[~df['protocol'].isin(valid_protocols), 'protocol'] = df['protocol'].mode()[0]
df['source_ip'] = df['source_ip'].fillna('0.0.0.0')
df['destination_port'] = df['destination_port'].clip(lower=0, upper=65535)

print(df.describe(include='all'))

            log_id source_ip  destination_port protocol  bytes_transferred  threat_level
count   100.000000       100        100.000000      100         100.00000    100.000000
unique         NaN        76               NaN        9               NaN           NaN
top            NaN   0.0.0.0               NaN     HTTP               NaN           NaN
freq           NaN        15               NaN       27               NaN           NaN
mean     50.500000       NaN        776.860000      NaN        4138.64000      0.930000
std      29.011492       NaN       6542.582099      NaN        2526.40978      0.781801
min       1.000000       NaN         22.000000      NaN         498.00000      0.000000
25%      25.750000       NaN         53.000000      NaN        1693.25000      0.000000
50%      50.500000       NaN         80.000000      NaN        4096.00000      1.000000
75%      75.250000       NaN        110.000000      NaN        5971.75000      2.000000
max     100.000000       NaN      65535.000000      NaN        9765.00000      2.000000
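As one example of the domain-specific mappings mentioned earlier, ports that were still missing at this stage (for instance, if the numeric imputation above were skipped for destination_port) could be filled from a protocol-to-default-port lookup. The mapping below is a hypothetical illustration, not part of the original dataset; transport-level protocols such as TCP and UDP have no single default port, so they are left out and their rows stay unchanged.

# Hypothetical default ports per application protocol; adjust to match your environment
default_ports = {'HTTP': 80, 'HTTPS': 443, 'DNS': 53, 'SMTP': 25,
                 'FTP': 21, 'SSH': 22, 'POP3': 110}

# Fill any still-missing ports from the protocol's default
df['destination_port'] = df['destination_port'].fillna(df['protocol'].map(default_ports))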