Data preprocessing is a crucial step in the machine-learning pipeline: it transforms raw data into a format that algorithms can process effectively.

Preprocessing operates on datasets, which are collections of data points used for analysis and model training. Datasets come in various forms, including:

  • Tabular Data: Data organized into tables with rows and columns, common in spreadsheets or databases.
  • Image Data: Sets of images represented numerically as pixel arrays.
  • Text Data: Unstructured data composed of sentences, paragraphs, or full documents.
  • Time Series Data: Sequential data points collected over time, emphasizing temporal patterns.
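
As a rough illustration of the four forms listed above, the sketch below builds a tiny example of each with pandas and NumPy. All values and variable names are hypothetical and chosen only for illustration; they are not taken from the demo dataset.

```python
import numpy as np
import pandas as pd

# Tabular data: rows and columns (hypothetical two-row table).
tabular = pd.DataFrame(
    {"protocol": ["HTTP", "FTP"], "bytes_transferred": [1024, 4096]}
)

# Image data: a hypothetical 2x2 grayscale image stored as a pixel array.
image = np.array([[0, 255], [128, 64]], dtype=np.uint8)

# Text data: unstructured documents kept as plain strings.
text = ["Login failed for user admin.", "Connection closed by remote host."]

# Time series data: values indexed by the time they were observed.
timestamps = pd.date_range("2024-01-01", periods=3, freq="D")
time_series = pd.Series([10, 12, 9], index=timestamps)

print(tabular.shape)      # (rows, columns)
print(image.shape)        # (height, width)
print(len(text))          # number of documents
print(time_series.index)  # the timestamps that order the observations
```

Each form ends up as a different Python object, which is one reason preprocessing steps tend to differ from one kind of dataset to another.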

Jupyter Notebook
In [1]:
import pandas as pd

# Load the dataset from a CSV file in the current directory
data = pd.read_csv("./demo_dataset.csv")
In [2]:
# Display the first few rows of the dataset
print(data.head())
Out[2]:
   log_id       source_ip destination_port protocol bytes_transferred  \
0      10      10.0.0.100      STRING_PORT      FTP              4096   
1      12  172.16.254.100              110     POP3          NEGATIVE   
2      27  172.16.254.200              110     POP3       NON_NUMERIC   
3       1   192.168.1.100               80     HTTP              1024   
4       2    192.168.1.81               53      TLS              9765   

  threat_level  
0            ?  
1            1  
2            1  
3            0  
4            0  
In [3]:
# Get a summary of column data types and non-null counts
# (info() prints the summary itself and returns None, so no print() is needed)
data.info()
Out[3]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   log_id             100 non-null    int64 
 1   source_ip          99 non-null     object
 2   destination_port   99 non-null     object
 3   protocol           100 non-null    object
 4   bytes_transferred  100 non-null    object
 5   threat_level       100 non-null    object
dtypes: int64(1), object(5)
memory usage: 4.8+ KB
In [4]:
# Identify columns with missing values
print(data.isnull().sum())
Out[4]:
log_id               0
source_ip            1
destination_port     1
protocol             0
bytes_transferred    0
threat_level         0
dtype: int64
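
The missing-value counts above only cover true nulls. Placeholder strings such as STRING_PORT, NEGATIVE, NON_NUMERIC, and ? in the first rows are ordinary strings, so isnull() does not flag them. One possible way to surface them, sketched here on a small hypothetical sample modeled on those first rows (not on the full demo_dataset.csv), is to coerce the columns to numbers with pd.to_numeric and count what fails to parse.

```python
import pandas as pd

# Hypothetical sample modeled on the first rows printed earlier; the real
# demo_dataset.csv is not reproduced here.
sample = pd.DataFrame(
    {
        "destination_port": ["STRING_PORT", "110", "110", "80", "53"],
        "bytes_transferred": ["4096", "NEGATIVE", "NON_NUMERIC", "1024", "9765"],
        "threat_level": ["?", "1", "1", "0", "0"],
    }
)

# isnull() finds nothing here, because placeholder strings are not NaN.
print(sample.isnull().sum())

# errors="coerce" turns every value that cannot be parsed as a number into NaN.
numeric = sample.apply(pd.to_numeric, errors="coerce")

# Entries that became NaN only after coercion are the hidden invalid values.
hidden_invalid = numeric.isnull().sum() - sample.isnull().sum()
print(hidden_invalid)
```

This is only a sketch: coercion catches non-numeric placeholders, but a value that parses as a number and is still invalid for the column (a negative byte count, for example) would need a separate range check.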