2. Data Preprocessing#
In this section, we learn how to preprocess data using CSV.jl, DataFrames.jl, FeatureTransforms.jl, and Statistics.
2.1. Reading the Dataset#
Comma-separated values (CSV) files are ubiquitous for storing tabular (spreadsheet-like) data. In such files, each line corresponds to one record and consists of several (comma-separated) fields, e.g., “Albert Einstein,March 14 1879,Ulm,Federal polytechnic school,Accomplishments in the field of gravitational physics”. To demonstrate how to load CSV files with CSV.jl, we create a CSV file ../data/house_tiny.csv below. This file represents a dataset of homes, where each row corresponds to a distinct home and the columns correspond to the number of rooms (NumRooms), the roof type (RoofType), and the price (Price).
using CSV
csv_data = """
NumRooms,RoofType,Price
,,127500
2,,106000
4,Slate,178100
,,140000
"""
dir_path = joinpath("..", "data")
file_path = joinpath(dir_path, "house_tiny.csv")
mkpath(dir_path)
CSV.write(file_path, CSV.File(IOBuffer(csv_data)))
"../data/house_tiny.csv"
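To sanity-check what was written, we can read the raw file back with Base Julia's `read` (reusing the `file_path` defined above):

```julia
# Print the raw contents of the file we just wrote;
# they should match the csv_data string above.
print(read(file_path, String))
```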
Now let’s import DataFrames.jl and load the dataset with `CSV.read`.
using DataFrames
data = CSV.read(file_path, DataFrame)
| Row | NumRooms | RoofType | Price |
|---|---|---|---|
| | Int64? | String7? | Int64 |
| 1 | missing | missing | 127500 |
| 2 | 2 | missing | 106000 |
| 3 | 4 | Slate | 178100 |
| 4 | missing | missing | 140000 |
2.2. Data Preparation#
In supervised learning, we train models to predict a designated target value, given some set of input values. Our first step in processing the dataset is to separate out columns corresponding to input versus target values. We can select columns either by name or via integer-location based indexing.
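As a quick illustration of the two indexing styles, here is a minimal sketch on a small stand-in DataFrame (the column names follow our dataset):

```julia
using DataFrames

df = DataFrame(NumRooms=[2, 4], RoofType=["Slate", missing], Price=[106000, 178100])

# Select the input columns by name...
by_name = df[:, [:NumRooms, :RoofType]]
# ...or by integer position; both yield the same sub-DataFrame.
by_index = df[:, 1:2]
isequal(by_name, by_index)  # true
```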
You might have noticed that CSV.jl replaced all empty CSV entries, e.g., in “3,,270000”, with a special `missing` value. Missing values are the “bed bugs” of data science, a persistent menace that you will confront throughout your career. Depending upon the context, missing values might be handled either via imputation or deletion. Imputation replaces missing values with estimates of their values, while deletion simply discards either the rows or the columns that contain missing values.
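For deletion, DataFrames.jl offers `dropmissing`, which removes every row containing at least one `missing` entry; on our toy dataset only the third row would survive. A minimal sketch:

```julia
using DataFrames

df = DataFrame(NumRooms=[missing, 2, 4, missing],
               RoofType=[missing, missing, "Slate", missing],
               Price=[127500, 106000, 178100, 140000])

# Keep only the rows with no missing values; here, just row 3.
complete = dropmissing(df)
```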
Here are some common imputation heuristics. For categorical input fields, we can treat `missing` as a category. Since the RoofType column takes values `Slate` and `missing`, FeatureTransforms.jl can convert this column into two columns `RoofType_Slate` and `RoofType_missing`. A row whose roof type is `Slate` will set the values of `RoofType_Slate` and `RoofType_missing` to `true` and `false`, respectively. The converse holds for a row with a missing `RoofType` value.
using FeatureTransforms
inputs, targets = data[:, 1:2], data[:, 3]
# get all unique roof types
roof_types = unique(inputs.RoofType)
# one-hot encode the RoofType column and append the new columns to the DataFrame
inputs = FeatureTransforms.apply_append(inputs, OneHotEncoding(roof_types);
                                        cols=:RoofType,
                                        header="RoofType_" .* string.(roof_types))
# drop the original RoofType column
inputs = inputs[:, Not(:RoofType)]
| Row | NumRooms | RoofType_missing | RoofType_Slate |
|---|---|---|---|
| | Int64? | Bool | Bool |
| 1 | missing | true | false |
| 2 | 2 | true | false |
| 3 | 4 | false | true |
| 4 | missing | true | false |
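Because `RoofType` takes only two values here, the same pair of indicator columns can also be built by hand with broadcasting, without FeatureTransforms.jl; a sketch using only DataFrames.jl and Base:

```julia
using DataFrames

df = DataFrame(NumRooms=[missing, 2, 4, missing],
               RoofType=[missing, missing, "Slate", missing])

# isequal and ismissing both return plain Bools even for missing entries.
df.RoofType_Slate   = isequal.(df.RoofType, "Slate")
df.RoofType_missing = ismissing.(df.RoofType)
df = df[:, Not(:RoofType)]
```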
For missing numerical values, one common heuristic is to replace the `missing` entries with the mean value of the corresponding column.
using Statistics
inputs.NumRooms = replace(inputs.NumRooms, missing => mean(skipmissing(inputs.NumRooms)))
inputs
| Row | NumRooms | RoofType_missing | RoofType_Slate |
|---|---|---|---|
| | Float64 | Bool | Bool |
| 1 | 3.0 | true | false |
| 2 | 2.0 | true | false |
| 3 | 4.0 | false | true |
| 4 | 3.0 | true | false |
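An equivalent way to express this mean imputation uses Base's `coalesce`, which returns its first non-missing argument, broadcast over the column:

```julia
using Statistics

v = [missing, 2, 4, missing]

# Each missing entry falls back to the mean of the observed values (3.0).
imputed = coalesce.(v, mean(skipmissing(v)))
# imputed == [3.0, 2.0, 4.0, 3.0]
```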
2.3. Conversion to the Matrix Format#
Now that all the entries in `inputs` and `targets` are numerical, we can load them into a `Matrix` and a `Vector`, respectively.
X, y = Matrix(inputs), Vector(targets)
display(X), display(y);
4×3 Matrix{Float64}:
3.0 1.0 0.0
2.0 1.0 0.0
4.0 0.0 1.0
3.0 1.0 0.0
4-element Vector{Int64}:
127500
106000
178100
140000