Datasets

Datasets are collections of data points used for analysis and model training.

Data preprocessing is a crucial step in the machine-learning pipeline: it transforms raw data into a format that algorithms can process effectively.

Datasets come in various forms, including:

Tabular Data: Data organized into tables with rows and columns, common in spreadsheets or databases.
Image Data: Sets of images represented numerically as pixel arrays.
Text Data: Unstructured data composed of sentences, paragraphs, or full documents.
Time Series Data: Sequential data points collected over time, emphasizing temporal patterns.
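
As a rough illustration, the four forms above could be represented in Python as follows. The values, shapes, and variable names are made up for this sketch and are not taken from any particular dataset.

```python
import numpy as np
import pandas as pd

# Tabular data: rows and columns, here as a pandas DataFrame.
tabular = pd.DataFrame({"log_id": [1, 2], "protocol": ["HTTP", "TLS"]})

# Image data: a pixel array; here a blank 28x28 grayscale image.
image = np.zeros((28, 28), dtype=np.uint8)

# Text data: unstructured strings, one document per element.
texts = ["Login failed for user admin.", "Connection closed by peer."]

# Time series data: values indexed by timestamp.
series = pd.Series(
    [1024, 9765, 4096],
    index=pd.date_range("2024-01-01", periods=3, freq="D"),
)

print(tabular.shape)  # (2, 2)
print(image.shape)    # (28, 28)
print(len(texts))     # 2
print(series.index[0])
```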

Jupyter Notebook

dataset_sample.ipynb

In [1]:

import pandas as pd

# Load the dataset
data = pd.read_csv("./demo_dataset.csv")

In [2]:

# Display the first few rows of the dataset
print(data.head())

Out[2]:

   log_id       source_ip destination_port protocol bytes_transferred  \
0      10      10.0.0.100      STRING_PORT      FTP              4096   
1      12  172.16.254.100              110     POP3          NEGATIVE   
2      27  172.16.254.200              110     POP3       NON_NUMERIC   
3       1   192.168.1.100               80     HTTP              1024   
4       2    192.168.1.81               53      TLS              9765   

  threat_level  
0            ?  
1            1  
2            1  
3            0  
4            0

In [3]:

# Get a summary of column data types and non-null counts
# info() prints the summary itself and returns None, so no print() is needed
data.info()

Out[3]:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   log_id             100 non-null    int64 
 1   source_ip          99 non-null     object
 2   destination_port   99 non-null     object
 3   protocol           100 non-null    object
 4   bytes_transferred  100 non-null    object
 5   threat_level       100 non-null    object
dtypes: int64(1), object(5)
memory usage: 4.8+ KB
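
The summary lists destination_port, bytes_transferred, and threat_level as object columns, most likely because of placeholder strings such as STRING_PORT, NEGATIVE, NON_NUMERIC, and ? seen in the first rows. One possible way to handle this, shown here as a sketch rather than the notebook's own method, is pd.to_numeric with errors="coerce". The small frame below only copies the five rows from the data.head() output so the example runs on its own.

```python
import pandas as pd

# Rows copied from the data.head() output; values are kept as strings,
# which is how pandas stores a column that mixes numbers and text.
sample = pd.DataFrame({
    "destination_port": ["STRING_PORT", "110", "110", "80", "53"],
    "bytes_transferred": ["4096", "NEGATIVE", "NON_NUMERIC", "1024", "9765"],
    "threat_level": ["?", "1", "1", "0", "0"],
})

# errors="coerce" converts anything that cannot be parsed as a number to NaN.
numeric_cols = ["destination_port", "bytes_transferred", "threat_level"]
for col in numeric_cols:
    sample[col] = pd.to_numeric(sample[col], errors="coerce")

print(sample.dtypes)          # every column is now float64
print(sample.isnull().sum())  # placeholders show up as missing values
```

Coercing makes the invalid entries visible to isnull(), so they can be handled together with the values that were missing from the start.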

In [4]:

# Identify columns with missing values
print(data.isnull().sum())

Out[4]:

log_id               0
source_ip            1
destination_port     1
protocol             0
bytes_transferred    0
threat_level         0
dtype: int64
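
The counts above show one missing value each in source_ip and destination_port. A minimal sketch of two common ways to deal with such gaps follows. The rows are hypothetical stand-ins, and the placeholder values "unknown" and -1 are assumptions chosen for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical rows for illustration; None and NaN stand in for the
# missing values that isnull() reported.
sample = pd.DataFrame({
    "log_id": [1, 2, 3, 4],
    "source_ip": ["192.168.1.100", None, "10.0.0.100", "172.16.254.100"],
    "destination_port": [80.0, 53.0, np.nan, 110.0],
})

# Option 1: drop every row that is missing a value in either column.
dropped = sample.dropna(subset=["source_ip", "destination_port"])

# Option 2: keep all rows and fill the gaps with explicit placeholders.
filled = sample.fillna({"source_ip": "unknown", "destination_port": -1})

print(len(dropped))                 # 2
print(filled.isnull().sum().sum())  # 0
```

Dropping is the simpler choice when only a few rows are affected, as here; filling keeps the row count intact at the cost of introducing placeholder values that later steps must recognize.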
