Data preprocessing transforms raw data into a suitable format for machine learning algorithms. Key techniques include:
Data Cleaning: Handling missing values, removing duplicates, and smoothing noisy data.
Data Transformation: Normalizing, encoding, scaling, and reducing data.
Data Integration: Merging and aggregating data from multiple sources.
Data Formatting: Converting data types and reshaping data structures.
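Most of what follows focuses on data cleaning. As a brief illustration of the transformation step, the sketch below min-max scales a numeric column and one-hot encodes a categorical one; the column names come from the demo dataset used later, and the specific scaling and encoding choices are just one reasonable option.

import pandas as pd

df = pd.read_csv('./demo_dataset.csv')

# Coerce to numeric first; corrupted entries become NaN (handled later during imputation)
col = pd.to_numeric(df['bytes_transferred'], errors='coerce')

# Min-max scaling squeezes a numeric feature into the [0, 1] range
df['bytes_transferred_scaled'] = (col - col.min()) / (col.max() - col.min())

# One-hot encoding turns a categorical feature into indicator columns
df = pd.get_dummies(df, columns=['protocol'], prefix='protocol')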
import pandas as pd
import re
data = pd.read_csv('./demo_dataset.csv')

To identify invalid source_ip values, you can use a regular expression to validate the IP addresses:
def is_valid_ip(ip):
    # Match four dot-separated octets, each in the range 0-255
    pattern = re.compile(r'^((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$')
    return bool(pattern.match(ip))
# Check for invalid IP addresses
invalid_ips = data[~data['source_ip'].astype(str).apply(is_valid_ip)]
print(invalid_ips)

    log_id       source_ip destination_port protocol bytes_transferred threat_level
40      41      10.0.0.300               25     SMTP              4096            0
51      52    10.10.10.450      STRING_PORT      FTP              4096            ?
55      56             NaN               53      DNS              1024            0
57      58   192.168.1.475              NaN      UDP              2048            1
63      64      MISSING_IP               53      DNS              1024            0
65      66   192.168.1.600      UNUSED_PORT      UDP              2048            1
71      72      MISSING_IP               53      DNS              1024            0
74      75    172.16.1.400               80     HTTP              1024            0
82      83    172.16.1.450               80     HTTP              1024            0
87      88      MISSING_IP               53      DNS              1024            0
88      89    10.10.10.700              443      TLS               512            1
92      93      INVALID_IP              110     POP3              4096            1
93      94  192.168.1.1050               53      DNS       NON_NUMERIC            0
95      96      MISSING_IP               25     SMTP              4096            1
97      98  192.168.1.1100      UNUSED_PORT      UDP              2048            0
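An equivalent check can lean on the standard library's ipaddress module instead of a regular expression; a minimal drop-in alternative to the helper above (note that, unlike the regex, it also accepts IPv6 addresses):

from ipaddress import ip_address

def is_valid_ip(ip):
    # ip_address() raises ValueError for anything that is not a valid IP address
    try:
        ip_address(str(ip))
        return True
    except ValueError:
        return False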
To identify invalid destination_port values, you can check if the port numbers are within the valid range (0-65535):
def is_valid_port(port):
    try:
        port = int(port)
        return 0 <= port <= 65535
    except ValueError:
        return False
# Check for invalid port numbers
invalid_ports = data[~data['destination_port'].apply(is_valid_port)]
print(invalid_ports)

    log_id       source_ip destination_port protocol bytes_transferred threat_level
0       10      10.0.0.100      STRING_PORT      FTP              4096            ?
34      35   192.168.1.200      STRING_PORT      FTP              4096            ?
51      52    10.10.10.450      STRING_PORT      FTP              4096            ?
57      58   192.168.1.475              NaN      UDP              2048            1
65      66   192.168.1.600      UNUSED_PORT      UDP              2048            1
67      68     10.10.10.77      STRING_PORT      FTP              4096            ?
78      79   172.16.254.77           999999     HTTP              2048            1
97      98  192.168.1.1100      UNUSED_PORT      UDP              2048            0
To identify invalid protocol values, you can check against a list of known protocols:
valid_protocols = ['TCP', 'TLS', 'SSH', 'POP3', 'DNS', 'HTTPS', 'SMTP', 'FTP', 'UDP', 'HTTP']
# Check for invalid protocol values
invalid_protocols = data[~data['protocol'].isin(valid_protocols)]
print(invalid_protocols)

    log_id      source_ip destination_port protocol bytes_transferred threat_level
30      31  192.168.1.119              443  UNKNOWN              9513            2
80      81  192.168.1.224               25  UNKNOWN              1161            1
To identify invalid bytes_transferred values, you can check if the values are numeric and non-negative:
def is_valid_bytes(value):
    try:
        value = int(value)
        return value >= 0
    except ValueError:
        return False
# Check for invalid bytes transferred
invalid_bytes = data[~data['bytes_transferred'].apply(is_valid_bytes)]
print(invalid_bytes)

    log_id       source_ip destination_port protocol bytes_transferred threat_level
1       12  172.16.254.100              110     POP3          NEGATIVE            1
2       27  172.16.254.200              110     POP3       NON_NUMERIC            1
93      94  192.168.1.1050               53      DNS       NON_NUMERIC            0
To identify invalid threat_level values, you can check if the values are within a valid range (e.g., 0-2):
def is_valid_threat_level(threat_level):
    try:
        threat_level = int(threat_level)
        return 0 <= threat_level <= 2
    except ValueError:
        return False
# Check for invalid threat levels
invalid_threat_levels = data[~data['threat_level'].apply(is_valid_threat_level)]
print(invalid_threat_levels)

    log_id      source_ip destination_port protocol bytes_transferred threat_level
0       10     10.0.0.100      STRING_PORT      FTP              4096            ?
34      35  192.168.1.200      STRING_PORT      FTP              4096            ?
51      52   10.10.10.450      STRING_PORT      FTP              4096            ?
67      68    10.10.10.77      STRING_PORT      FTP              4096            ?
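Before moving on to imputation, it can be handy to tally how many rows fail each rule; a short summary built from the helpers and lists defined above:

# Count how many rows fail each validation rule
summary = pd.Series({
    'invalid source_ip': (~data['source_ip'].astype(str).apply(is_valid_ip)).sum(),
    'invalid destination_port': (~data['destination_port'].apply(is_valid_port)).sum(),
    'invalid protocol': (~data['protocol'].isin(valid_protocols)).sum(),
    'invalid bytes_transferred': (~data['bytes_transferred'].apply(is_valid_bytes)).sum(),
    'invalid threat_level': (~data['threat_level'].apply(is_valid_threat_level)).sum(),
})
print(summary)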
Imputation is the process of replacing missing or invalid values in a dataset with estimated values. It is crucial for maintaining the integrity and usability of the data, especially in machine learning and data analysis tasks, where missing values can lead to biased or inaccurate results.
First, convert all invalid or corrupted entries, such as MISSING_IP, INVALID_IP, STRING_PORT, UNUSED_PORT, NON_NUMERIC, or ?, into NaN. This approach standardizes the representation of missing values, enabling uniform downstream imputation steps.
import pandas as pd
import numpy as np
import re
df = pd.read_csv('demo_dataset.csv')
invalid_ips = ['INVALID_IP', 'MISSING_IP']
invalid_ports = ['STRING_PORT', 'UNUSED_PORT']
invalid_bytes = ['NON_NUMERIC', 'NEGATIVE']
invalid_threat = ['?']
df.replace(invalid_ips + invalid_ports + invalid_bytes + invalid_threat, np.nan, inplace=True)
df['destination_port'] = pd.to_numeric(df['destination_port'], errors='coerce')
df['bytes_transferred'] = pd.to_numeric(df['bytes_transferred'], errors='coerce')
df['threat_level'] = pd.to_numeric(df['threat_level'], errors='coerce')
def is_valid_ip(ip):
    # Keep only well-formed IPv4 addresses; everything else becomes NaN
    pattern = re.compile(r'^((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$')
    if pd.isna(ip) or not pattern.match(str(ip)):
        return np.nan
    return ip

df['source_ip'] = df['source_ip'].apply(is_valid_ip)

After this step, NaN represents all missing or invalid data points.
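A quick count of the NaN values per column shows how much imputation remains to be done:

# Count missing values per column after standardizing invalid entries to NaN
print(df.isna().sum())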
For basic numeric columns like bytes_transferred, use simple methods such as the median or mean. For categorical columns like protocol, use the most frequent value.
from sklearn.impute import SimpleImputer
numeric_cols = ['destination_port', 'bytes_transferred', 'threat_level']
categorical_cols = ['protocol']
num_imputer = SimpleImputer(strategy='median')
df[numeric_cols] = num_imputer.fit_transform(df[numeric_cols])
cat_imputer = SimpleImputer(strategy='most_frequent')
df[categorical_cols] = cat_imputer.fit_transform(df[categorical_cols])

For more sophisticated scenarios, employ advanced techniques like KNNImputer or IterativeImputer. These methods consider relationships among features to produce contextually meaningful imputations.
from sklearn.impute import KNNImputer
knn_imputer = KNNImputer(n_neighbors=5)
df[numeric_cols] = knn_imputer.fit_transform(df[numeric_cols])
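IterativeImputer, mentioned above, can be used in place of the KNN approach (pick one or the other, since whichever runs first already fills the NaN values); a minimal sketch reusing the same numeric_cols list:

# IterativeImputer models each feature with missing values as a function of the others
from sklearn.experimental import enable_iterative_imputer  # required to expose IterativeImputer
from sklearn.impute import IterativeImputer

iter_imputer = IterativeImputer(max_iter=10, random_state=0)
df[numeric_cols] = iter_imputer.fit_transform(df[numeric_cols])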
For source_ip values that remain missing, assign a default such as 0.0.0.0. Validate protocol values against known valid protocols. For ports, ensure values fall within the valid range 0-65535, and for protocols that imply certain ports, consider mode-based assignments or domain-specific mappings. Note that this step does not force protocols and ports to align with one another; it is purely data cleaning.
valid_protocols = ['TCP', 'TLS', 'SSH', 'POP3', 'DNS', 'HTTPS', 'SMTP', 'FTP', 'UDP', 'HTTP']
df.loc[~df['protocol'].isin(valid_protocols), 'protocol'] = df['protocol'].mode()[0]
df['source_ip'] = df['source_ip'].fillna('0.0.0.0')
df['destination_port'] = df['destination_port'].clip(lower=0, upper=65535)

print(df.describe(include='all'))

            log_id source_ip  destination_port protocol  bytes_transferred  threat_level
count   100.000000       100        100.000000      100         100.00000    100.000000
unique         NaN        76               NaN        9               NaN           NaN
top            NaN   0.0.0.0               NaN     HTTP               NaN           NaN
freq           NaN        15               NaN       27               NaN           NaN
mean     50.500000       NaN        776.860000      NaN        4138.64000      0.930000
std      29.011492       NaN       6542.582099      NaN        2526.40978      0.781801
min       1.000000       NaN         22.000000      NaN         498.00000      0.000000
25%      25.750000       NaN         53.000000      NaN        1693.25000      0.000000
50%      50.500000       NaN         80.000000      NaN        4096.00000      1.000000
75%      75.250000       NaN        110.000000      NaN        5971.75000      2.000000
max     100.000000       NaN      65535.000000      NaN        9765.00000      2.000000
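As one example of the domain-specific mappings mentioned earlier, ports that were still missing at this stage (for instance, if the numeric imputation above were skipped for destination_port) could be filled from a protocol-to-default-port lookup. The mapping below is a hypothetical illustration, not part of the original dataset; transport-level protocols such as TCP and UDP have no single default port, so they are left out and their rows stay unchanged.

# Hypothetical default ports per application protocol; adjust to match your environment
default_ports = {'HTTP': 80, 'HTTPS': 443, 'DNS': 53, 'SMTP': 25,
                 'FTP': 21, 'SSH': 22, 'POP3': 110}

# Fill any still-missing ports from the protocol's default
df['destination_port'] = df['destination_port'].fillna(df['protocol'].map(default_ports))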