we define “population” - all possible data (e.g. all people in the world, all possible cases etc.)
each dataset is a sample of the population, and by sampling we narrow it down further (= pick specific data samples)
it has to be a representative sample
sampling reduces computational costs and allows us to focus on the business objective
focus on the future (by sampling only recent data, not data from long ago)
focus on the positive class (bank_fraud = yes) that is really rare compared to the negative class (bank_fraud = no)
strategies:
stratified sampling = divide population into subgroups and then sample proportionally from each subgroup
temporal sampling = sampling based on a time window (seasonal sampling, recent-data-only sampling etc.)
random sampling = select instances uniformly at random
oversampling the positive class = when training for positive class detection and the positive class is really rare in the data
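A minimal sketch of stratified splitting and naive oversampling of the rare positive class; the bank_fraud dataset (column names and values) is made up for illustration:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# made-up dataset with a rare positive class (bank_fraud = 1)
df = pd.DataFrame({
    "amount":     [10, 250, 30, 990, 75, 40, 1200, 15, 60, 3000],
    "bank_fraud": [0,  0,   0,  1,   0,  0,  1,    0,  0,  1],
})

# stratified sampling: keep the class proportions in both splits
train_df, test_df = train_test_split(
    df, test_size=0.3, stratify=df["bank_fraud"], random_state=42
)

# naive oversampling of the positive class, on the training set only
pos = train_df[train_df["bank_fraud"] == 1]
neg = train_df[train_df["bank_fraud"] == 0]
pos_oversampled = pos.sample(n=len(neg), replace=True, random_state=42)
balanced_train = pd.concat([neg, pos_oversampled]).sample(frac=1, random_state=42)

print(balanced_train["bank_fraud"].value_counts())
```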
problem of selective labels:
in some cases, we are sampling from an already biased dataset where labels only exist for a selected subset
for example, when we train a model to predict job performance to help with hiring decisions
we have data on job performance (and therefore samples) only for the hired applicants, which skews the data
we don’t know how the rejected applicants would have performed at the job if they had been hired
the outcome (=label) is only observable for the selected group (hired applicants)
outcome - the model learns only from the hired employees, but in practice it must evaluate all applicants (including those similar to previously rejected applicants)
so we need to be aware of this problem - we cannot expect good generalization from this biased data
there are techniques to resolve this (causal inference methods, propensity score weighting etc.)
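A rough sketch of the propensity-score-weighting idea for the hiring example, on fully synthetic data; real causal-inference workflows need extra care (overlap checks, clipping extreme weights etc.):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# synthetic applicant features available at decision time
X_all = rng.normal(size=(500, 3))
# hiring decision depends on the first feature (selection bias)
hired = (rng.random(500) < 1 / (1 + np.exp(-X_all[:, 0]))).astype(int)
# job performance label - in reality observed only for hired applicants
performance = (X_all[:, 1] + rng.normal(size=500) > 0).astype(int)

# 1) model the probability of being selected (hired) from pre-decision features
selection_model = LogisticRegression().fit(X_all, hired)
p_hired = selection_model.predict_proba(X_all)[:, 1]

# 2) train the outcome model on hired applicants only, weighting each by 1 / P(hired)
#    so that they also stand in for similar applicants who were rejected
mask = hired == 1
outcome_model = LogisticRegression().fit(
    X_all[mask], performance[mask], sample_weight=1.0 / p_hired[mask]
)
```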
Encoding
turning features into a more machine-readable format
examples:
continuous numerical features → normalization (into some defined range)
could be min-max normalization OR standardization (subtract the mean, divide by the standard deviation)
continuous numerical features → discretization into bins
equal-width partitioning OR equal-frequency partitioning
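A small sketch contrasting both normalization options and both binning options, using a made-up age column:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

ages = pd.DataFrame({"age": [18, 22, 25, 31, 40, 47, 52, 63, 70]})  # made-up feature

# min-max normalization into [0, 1] vs. standardization (zero mean, unit variance)
minmax = MinMaxScaler().fit_transform(ages)
standard = StandardScaler().fit_transform(ages)

# equal-width partitioning (same bin width) vs. equal-frequency partitioning (same count per bin)
equal_width = pd.cut(ages["age"], bins=3)
equal_freq = pd.qcut(ages["age"], q=3)
```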
categorical features
nominal features - encoding categories that have no inherent order
e.g. one-hot encoding
ordinal features - encoding categories that have a natural order
assign each category a unique integer based on its order
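A short sketch of one-hot vs. ordinal encoding with scikit-learn; the color/size columns are invented, and sparse_output requires scikit-learn ≥ 1.2:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

df = pd.DataFrame({
    "color": ["red", "green", "blue", "green"],       # nominal - no inherent order
    "size":  ["small", "large", "medium", "small"],   # ordinal - has an order
})

# nominal → one-hot: one binary column per category
onehot = OneHotEncoder(sparse_output=False).fit_transform(df[["color"]])

# ordinal → integers respecting the explicit order small < medium < large
ordinal = OrdinalEncoder(categories=[["small", "medium", "large"]]).fit_transform(df[["size"]])
```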
beware of “high cardinality variables”, i.e. nominal variables with a large number of unique values (40+ unique values)
can be tackled with one-hot encoding, but that severely increases the dimensionality
better solution: convert discrete variables into numeric values
tackled with the Weight of Evidence (WoE) approach
it connects the category values with the target values
so if I have a feature with 1000+ cities, for a concrete city = “Prague”, I would calculate the occurrence of “Prague” with target = 1 and the occurrence of “Prague” with target = 0 and take the logarithm of their ratio
WoE = ln(% of Bads / % of Goods)
WoE = 0 → no additional value (the number of Goods and Bads is equal)
WoE > 0 or WoE < 0 → it carries meaningful information for the classifier about whether the “Prague” label correlates more with the 0 or 1 value of the target label
then I get a useful numeric value instead of X city names
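A sketch of the WoE computation on a made-up city/target table, following the % of Bads / % of Goods convention from the formula above (some sources invert the ratio, which only flips the sign):

```python
import numpy as np
import pandas as pd

# made-up data: a high-cardinality "city" feature and a binary target
df = pd.DataFrame({
    "city":   ["Prague", "Prague", "Brno", "Prague", "Brno", "Ostrava", "Ostrava", "Prague"],
    "target": [1, 0, 0, 1, 1, 0, 0, 1],
})

eps = 1e-6  # avoids log(0) / division by zero for categories seen with only one class
goods = df[df["target"] == 0].groupby("city").size()
bads = df[df["target"] == 1].groupby("city").size()

cities = df["city"].unique()
pct_goods = (goods / goods.sum()).reindex(cities, fill_value=0) + eps
pct_bads = (bads / bads.sum()).reindex(cities, fill_value=0) + eps

woe = np.log(pct_bads / pct_goods)      # % of Bads / % of Goods, as in the formula above
df["city_woe"] = df["city"].map(woe)
```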
or another approach: Supervised ratio
just the mean of the target variable for this category
needs smoothing because of overfitting risk
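A sketch of the supervised ratio with simple additive smoothing toward the global mean; alpha is an arbitrary smoothing strength and the data is the same made-up table as in the WoE sketch:

```python
import pandas as pd

# same made-up city/target table as in the WoE sketch
df = pd.DataFrame({
    "city":   ["Prague", "Prague", "Brno", "Prague", "Brno", "Ostrava", "Ostrava", "Prague"],
    "target": [1, 0, 0, 1, 1, 0, 0, 1],
})

# supervised ratio = per-category mean of the target,
# shrunk toward the global mean so that rare categories do not overfit
alpha = 10                                  # arbitrary smoothing strength
global_mean = df["target"].mean()
stats = df.groupby("city")["target"].agg(["mean", "count"])
smoothed = (stats["count"] * stats["mean"] + alpha * global_mean) / (stats["count"] + alpha)
df["city_ratio"] = df["city"].map(smoothed)
```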
Missing values
“the fact that the value is missing can be informative”
strategies:
keep - the missing value is important information
we can add a boolean feature indicating whether the value in another feature (= column) is missing or not
replace
impute the missing value (mean, median, or mode for categorical features) or predict it with regression
could create bias in the data
the statistical values (mean, median etc.) should be calculated from the training set (not the test set) to avoid data leakage
delete
if the sample/feature has a lot of missing values, delete it to reduce noise
results in data loss
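A sketch of the keep + replace strategies: a missing-value indicator plus a median imputer fitted on the training set only; the income values are made up:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# made-up income data with missing values
train = pd.DataFrame({"income": [2400.0, np.nan, 3100.0, 2800.0, np.nan]})
test = pd.DataFrame({"income": [np.nan, 2900.0]})

# keep: boolean indicator that the value was missing (the missingness itself is a feature)
train["income_missing"] = train["income"].isna()
test["income_missing"] = test["income"].isna()

# replace: impute with the median, fitted on the training set only (no data leakage)
imputer = SimpleImputer(strategy="median")
train["income"] = imputer.fit_transform(train[["income"]]).ravel()
test["income"] = imputer.transform(test[["income"]]).ravel()
```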
Outliers
= values that are outside of the expected range
could be found using boxplots or by computing the z-score (standardization using mean and standard deviation); outliers have |z| > 3
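A sketch of z-score-based outlier detection on synthetic data with one injected extreme value:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
values = pd.Series(rng.normal(loc=50, scale=5, size=200))  # made-up feature
values.iloc[10] = 120                                      # inject one extreme value

z = (values - values.mean()) / values.std()
outliers = values[z.abs() > 3]   # flags values more than 3 standard deviations from the mean
print(outliers)
```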
Modelling
selecting the right type of model for the data (clustering, classification, regression, ANN, GBDT, anomaly detection etc.)
tips:
using GridSearchCV or RandomizedSearchCV (not better, but faster) for systematic tuning of the parameters (see the sketch after these tips)
each model has more and less important hyperparameters; we should always focus on the more important ones first
start with a broad selection of values for the first iteration, like C = [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000], and if I get C = 100, run again with values closer to 100
for validation, use cross-validation, do not use the test set
use pipelines for standardized and reusable data preprocessing from raw data to modelling-ready data
watch feature importance to see which features contribute to the prediction and which do not
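A sketch tying the tips together: a preprocessing + model pipeline, a broad first grid for C, cross-validation on the training set, and the test set touched only at the end; the data is synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=42)   # synthetic data

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# pipeline: preprocessing + model in one reusable object
pipe = Pipeline([("scale", StandardScaler()), ("clf", SVC())])

# broad first grid; if the best C lands at an edge (e.g. 100), rerun with values around it
param_grid = {"clf__C": [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000]}
search = GridSearchCV(pipe, param_grid, cv=5)   # cross-validation on the training set only
search.fit(X_train, y_train)

print(search.best_params_)
print(search.score(X_test, y_test))   # touch the test set only for the final evaluation
```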
Evaluation
checking whether the trained model and its outputs are valid and reliable