2. Data Preprocessing#

In this section, we learn how to preprocess data using CSV.jl, DataFrames.jl, FeatureTransforms.jl, and Statistics.

2.1. Reading the Dataset¶#

Comma-separated values (CSV) files are ubiquitous for storing tabular (spreadsheet-like) data. Here, each line corresponds to one record and consists of several (comma-separated) fields, e.g., “Albert Einstein,March 14 1879,Ulm,Federal polytechnic school,Accomplishments in the field of gravitational physics”. To demonstrate how to load CSV files with CSV.jl, we create a CSV file below ../data/house_tiny.csv. This file represents a dataset of homes, where each row corresponds to a distinct home and the columns correspond to the number of rooms (NumRooms), the roof type (RoofType), and the price (Price).

using CSV

csv_data = """
NumRooms,RoofType,Price
,,127500
2,,106000
4,Slate,178100
,,140000
"""

dir_path = joinpath("..","data")
file_path = joinpath(dir_path,"house_tiny.csv")
mkpath(dir_path)
CSV.write(file_path,CSV.File(IOBuffer(csv_data)))
"../data/house_tiny.csv"

Now let’s import DataFrames.jl and load the dataset with CSV.read.

using DataFrames
data = CSV.read(open(file_path),DataFrame)
4×3 DataFrame
RowNumRoomsRoofTypePrice
Int64?String7?Int64
1missingmissing127500
22missing106000
34Slate178100
4missingmissing140000

2.2. Data Preparation#

In supervised learning, we train models to predict a designated target value, given some set of input values. Our first step in processing the dataset is to separate out columns corresponding to input versus target values. We can select columns either by name or via integer-location based indexing.

You might have noticed that replaced all CSV entries with value with a special missing value. This can also happen whenever an entry is empty, e.g., “3,,,270000”. These are called missing values and they are the “bed bugs” of data science, a persistent menace that you will confront throughout your career. Depending upon the context, missing values might be handled either via imputation or deletion. Imputation replaces missing values with estimates of their values while deletion simply discards either those rows or those columns that contain missing values.

Here are some common imputation heuristics. For categorical input fields, we can treat missing as a category. Since the RoofType column takes values Slate and missing, FeatureTransforms.jl can convert this column into two columns RoofType_Slate and RoofType_missing. A row whose roof type is Slate will set values of RoofType_Slate and RoofType_missing to true and false, respectively. The converse holds for a row with a missing RoofType value.

using FeatureTransforms

inputs, targets = data[:,1:2], data[:,3]

# get all unique roof types
roof_types = unique(inputs.RoofType)
# one-hot encoding column RoofType, and append to original DataFrames
inputs = FeatureTransforms.apply_append(inputs,OneHotEncoding(roof_types);cols=:RoofType,header="RoofType_".*string.(roof_types))
# drop column RoofType
inputs = inputs[:,Not(:RoofType)]
4×3 DataFrame
RowNumRoomsRoofType_missingRoofType_Slate
Int64?BoolBool
1missingtruefalse
22truefalse
34falsetrue
4missingtruefalse

For missing numerical values, one common heuristic is to replace the missing entries with the mean value of the corresponding column.

using Statistics
inputs.NumRooms = replace(inputs.NumRooms, missing=>mean(skipmissing(inputs.NumRooms)))
inputs
4×3 DataFrame
RowNumRoomsRoofType_missingRoofType_Slate
Float64BoolBool
13.0truefalse
22.0truefalse
34.0falsetrue
43.0truefalse

2.3. Conversion to the Matrix#

We can load inputs and targets into Matrix

X, y = Matrix(inputs), Vector(targets)
display(X), display(y);
4×3 Matrix{Float64}:
 3.0  1.0  0.0
 2.0  1.0  0.0
 4.0  0.0  1.0
 3.0  1.0  0.0
4-element Vector{Int64}:
 127500
 106000
 178100
 140000