2. Data Preprocessing#
In this section, we learn how to preprocess data using CSV.jl, DataFrames.jl, FeatureTransforms.jl, and Statistics.
2.1. Reading the Dataset#
Comma-separated values (CSV) files are ubiquitous for storing tabular (spreadsheet-like) data. In such files, each line corresponds to one record and consists of several (comma-separated) fields, e.g., “Albert Einstein,March 14 1879,Ulm,Federal polytechnic school,Accomplishments in the field of gravitational physics”. To demonstrate how to load CSV files with CSV.jl, we create a CSV file ../data/house_tiny.csv below. This file represents a dataset of homes, where each row corresponds to a distinct home and the columns correspond to the number of rooms (NumRooms), the roof type (RoofType), and the price (Price).
using CSV
csv_data = """
NumRooms,RoofType,Price
,,127500
2,,106000
4,Slate,178100
,,140000
"""
dir_path = joinpath("..", "data")
file_path = joinpath(dir_path, "house_tiny.csv")
mkpath(dir_path)
CSV.write(file_path, CSV.File(IOBuffer(csv_data)))
"../data/house_tiny.csv"
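To sanity-check what was written, we can read the raw file back with Base Julia's `read` (reusing the `file_path` defined above):

```julia
# Print the raw contents of the file we just wrote;
# they should match the csv_data string above.
print(read(file_path, String))
```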
Now let’s import DataFrames.jl and load the dataset with `CSV.read`.
using DataFrames
data = CSV.read(file_path, DataFrame)
| Row | NumRooms | RoofType | Price |
|---|---|---|---|
| | Int64? | String7? | Int64 |
| 1 | missing | missing | 127500 |
| 2 | 2 | missing | 106000 |
| 3 | 4 | Slate | 178100 |
| 4 | missing | missing | 140000 |
2.2. Data Preparation#
In supervised learning, we train models to predict a designated target value, given some set of input values. Our first step in processing the dataset is to separate out columns corresponding to input versus target values. We can select columns either by name or via integer-location based indexing.
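As a quick illustration of the two indexing styles, here is a minimal sketch on a small stand-in DataFrame (the column names follow our dataset):

```julia
using DataFrames

df = DataFrame(NumRooms=[2, 4], RoofType=["Slate", missing], Price=[106000, 178100])

# Select the input columns by name...
by_name = df[:, [:NumRooms, :RoofType]]
# ...or by integer position; both yield the same sub-DataFrame.
by_index = df[:, 1:2]
isequal(by_name, by_index)  # true
```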
You might have noticed that CSV.jl replaced all empty CSV entries, e.g., in “3,,270000”, with a special `missing` value. Missing values are the “bed bugs” of data science, a persistent menace that you will confront throughout your career. Depending upon the context, missing values might be handled either via imputation or deletion. Imputation replaces missing values with estimates of their values, while deletion simply discards either the rows or the columns that contain missing values.
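For deletion, DataFrames.jl offers `dropmissing`, which removes every row containing at least one `missing` entry; on our toy dataset only the third row would survive. A minimal sketch:

```julia
using DataFrames

df = DataFrame(NumRooms=[missing, 2, 4, missing],
               RoofType=[missing, missing, "Slate", missing],
               Price=[127500, 106000, 178100, 140000])

# Keep only the rows with no missing values; here, just row 3.
complete = dropmissing(df)
```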
Here are some common imputation heuristics. For categorical input fields, we can treat `missing` as a category. Since the RoofType column takes values `Slate` and `missing`, FeatureTransforms.jl can convert this column into two columns `RoofType_Slate` and `RoofType_missing`. A row whose roof type is `Slate` will set the values of `RoofType_Slate` and `RoofType_missing` to `true` and `false`, respectively. The converse holds for a row with a missing `RoofType` value.
using FeatureTransforms
inputs, targets = data[:, 1:2], data[:, 3]
# get all unique roof types
roof_types = unique(inputs.RoofType)
# one-hot encode the RoofType column and append the new columns to the DataFrame
inputs = FeatureTransforms.apply_append(inputs, OneHotEncoding(roof_types);
                                        cols=:RoofType,
                                        header="RoofType_" .* string.(roof_types))
# drop the original RoofType column
inputs = inputs[:, Not(:RoofType)]
| Row | NumRooms | RoofType_missing | RoofType_Slate |
|---|---|---|---|
| | Int64? | Bool | Bool |
| 1 | missing | true | false |
| 2 | 2 | true | false |
| 3 | 4 | false | true |
| 4 | missing | true | false |
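Because `RoofType` takes only two values here, the same pair of indicator columns can also be built by hand with broadcasting, without FeatureTransforms.jl; a sketch using only DataFrames.jl and Base:

```julia
using DataFrames

df = DataFrame(NumRooms=[missing, 2, 4, missing],
               RoofType=[missing, missing, "Slate", missing])

# isequal and ismissing both return plain Bools even for missing entries.
df.RoofType_Slate   = isequal.(df.RoofType, "Slate")
df.RoofType_missing = ismissing.(df.RoofType)
df = df[:, Not(:RoofType)]
```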
For missing numerical values, one common heuristic is to replace the `missing` entries with the mean value of the corresponding column.
using Statistics
inputs.NumRooms = replace(inputs.NumRooms, missing => mean(skipmissing(inputs.NumRooms)))
inputs
| Row | NumRooms | RoofType_missing | RoofType_Slate |
|---|---|---|---|
| | Float64 | Bool | Bool |
| 1 | 3.0 | true | false |
| 2 | 2.0 | true | false |
| 3 | 4.0 | false | true |
| 4 | 3.0 | true | false |
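An equivalent way to express this mean imputation uses Base's `coalesce`, which returns its first non-missing argument, broadcast over the column:

```julia
using Statistics

v = [missing, 2, 4, missing]

# Each missing entry falls back to the mean of the observed values (3.0).
imputed = coalesce.(v, mean(skipmissing(v)))
# imputed == [3.0, 2.0, 4.0, 3.0]
```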
2.3. Conversion to the Matrix Format#
Now that all the entries in `inputs` and `targets` are numerical, we can load them into a `Matrix` and a `Vector`, respectively.
X, y = Matrix(inputs), Vector(targets)
display(X), display(y);
4×3 Matrix{Float64}:
3.0 1.0 0.0
2.0 1.0 0.0
4.0 0.0 1.0
3.0 1.0 0.0
4-element Vector{Int64}:
127500
106000
178100
140000