What is the CRISP-DM process?

  • = Cross-Industry Standard Process for Data Mining
    • a process/framework for data mining and machine learning projects that consists of 6 stages which we iterate through repeatedly
    • it’s flexible, iterative (not linear)

Business understanding

  • understand the goal of the project, business objectives etc.
  • define the success criteria, resources, constraints and risks
  • the fun part: transforming the business problem into the data science problem!
    • all about creativity

Data understanding/exploration

  • collecting all initial data and understanding their structure
  • exploring data quality issues, complexity, initial insights etc.

Data preparation

  • building/selecting/formatting the final dataset for modelling
  • cleaning data, handling missing values, outliers etc.
    • ! garbage in = garbage out (the same principle applies to LLMs)
  • feature engineering (= constructing new features)
  • what are proxy data?
    • data collected for purposes other than our data science problem, or data collected with no particular purpose at all
    • such data can help modelling if they are correlated with or related to the data we actually want (which are unavailable - too little of them, too expensive to obtain etc.)
    • such data are often not in the form we need, so they require a lot of data preparation
  • quality data with labels are often costly
  • dataset preprocessing: sampling, encoding, missing values, outliers
Sampling
  • we define a “population” - which is all possible data (e.g. all people in the world, all possibilities etc.)
    • each dataset is a sample of the population, and by sampling we narrow it down even further (= pick specific data samples)
      • it has to be a representative sample
  • sampling reduces computational costs and allows us to focus on the business objective
    • focus on the future (by sampling only recent data, not data from long ago)
    • focus on the positive class (bank_fraud = yes) that is really rare compared to the negative class (bank_fraud = no)
  • strategies (see the pandas sketch at the end of this section):
    • stratified sampling = divide population into subgroups and then sample proportionally from each subgroup
    • temporal sampling = sampling based on the time window (season sampling, recent data only sampling etc.)
    • random sampling = select instances uniformly at random
    • oversampling the positive class = when training for positive class detection and the positive class is really rare in the data
  • problem of selective labels:
    • in some cases, we are sampling from an already biased dataset where labels only exist for a selected subset
    • for example, when we train a model to predict job performance to help with hiring decisions
      • we have data on job performance (and therefore samples) only for the hired applicants, which skews the data
        • we don’t know how the rejected applicants would have performed at the job if they had been hired
      • the outcome (=label) is only observable for the selected group (hired applicants)
      • consequence: the model learns only from hired employees, but in practice it must evaluate all applicants (including those similar to previously rejected applicants)
    • so we need to be aware of this problem - we cannot get good generalization from such biased data
      • there are techniques to resolve this (causal inference methods, propensity score weighting etc.)
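A rough pandas sketch of the sampling strategies above; the column names (“region”, “timestamp”, “bank_fraud”) are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["A", "A", "B", "B", "B", "C"] * 100,
    "timestamp": pd.date_range("2020-01-01", periods=600, freq="D"),
    "bank_fraud": ([0] * 99 + [1]) * 6,          # rare positive class
})

# random sampling: select instances uniformly at random
random_sample = df.sample(frac=0.1, random_state=42)

# stratified sampling: sample proportionally from each subgroup
stratified_sample = df.groupby("region").sample(frac=0.1, random_state=42)

# temporal sampling: keep only a recent time window
recent_sample = df[df["timestamp"] >= "2021-01-01"]

# oversampling the positive class: replicate rare positives (with replacement)
pos, neg = df[df["bank_fraud"] == 1], df[df["bank_fraud"] == 0]
oversampled = pd.concat([neg, pos.sample(n=len(neg), replace=True, random_state=42)])
```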
Encoding
  • turning features into a more machine-readable format
  • examples (a scikit-learn sketch follows at the end of this section):
    • normalization of continuous numerical features (into some defined range)
      • could be min/max normalization OR standardization (mean and standard deviation)
    • discretization of continuous numerical features into bins
      • equal-width partitioning OR equal-frequency partitioning
  • categorical features
    • nominal features - encoding categories that have no inherent order
      • e.g. one-hot encoding
    • ordinal features - encoding features whose categories have an order
      • assign each category a unique integer based on its order
    • beware of “high cardinality variables”, nominal variables with a large number of unique values (40+ unique values)
      • can be tackled with one-hot encoding, but that severely increases the dimensionality
      • better solution: convert discrete variables into numeric values
        • tackled with Weight of Evidence (WoE) approach
          • it connects the category values with the target values
          • so if I have a feature with 1000+ cities, then for a concrete city = “Prague”, I calculate how often “Prague” occurs with target = 1 and how often with target = 0, and take the logarithm of their ratio
            • WoE(“Prague”) = ln( P(city = “Prague” | target = 1) / P(city = “Prague” | target = 0) )
              • WoE = 0 → no additional value (the proportions of Goods/Bads are equal)
              • WoE ≠ 0 → meaningful information for the classifier about whether the “Prague” label correlates more with target = 0 or target = 1
            • then I get one useful numeric value instead of X city names
        • or another approach: Supervised ratio
          • just the mean of the target variable for this category
            • needs smoothing because of overfitting risk
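A scikit-learn sketch of the basic encodings above (the columns and category order are illustrative; the sparse_output argument of OneHotEncoder assumes scikit-learn ≥ 1.2):

```python
import pandas as pd
from sklearn.preprocessing import (
    KBinsDiscretizer, MinMaxScaler, OneHotEncoder, OrdinalEncoder, StandardScaler,
)

df = pd.DataFrame({
    "income": [1200.0, 3400.0, 560.0, 9800.0, 2100.0, 4700.0],
    "city": ["Prague", "Brno", "Prague", "Ostrava", "Brno", "Prague"],                        # nominal
    "education": ["primary", "secondary", "tertiary", "secondary", "primary", "tertiary"],    # ordinal
})

# continuous feature: min/max normalization into [0, 1] OR standardization (mean, std)
income_minmax = MinMaxScaler().fit_transform(df[["income"]])
income_standardized = StandardScaler().fit_transform(df[["income"]])

# discretization into bins: "uniform" = equal-width, "quantile" = equal-frequency
income_bins = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform").fit_transform(df[["income"]])

# nominal feature: one-hot encoding (one column per category)
city_onehot = OneHotEncoder(sparse_output=False).fit_transform(df[["city"]])

# ordinal feature: a unique integer per category, respecting the given order
edu_codes = OrdinalEncoder(categories=[["primary", "secondary", "tertiary"]]).fit_transform(df[["education"]])
```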
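And a pandas sketch of the Weight of Evidence and supervised-ratio encodings for a high-cardinality “city” feature (the smoothing constants eps and m are arbitrary illustrative choices):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "city":   ["Prague", "Prague", "Brno", "Brno", "Ostrava", "Prague", "Ostrava", "Brno"],
    "target": [1, 0, 0, 0, 1, 1, 0, 1],
})

# Weight of Evidence: WoE(c) = ln( P(city = c | target = 1) / P(city = c | target = 0) )
eps = 0.5                                              # additive smoothing so empty cells don't break the log
counts = pd.crosstab(df["city"], df["target"]) + eps   # rows = cities, columns = target 0/1
woe = np.log((counts[1] / counts[1].sum()) / (counts[0] / counts[0].sum()))
df["city_woe"] = df["city"].map(woe)

# Supervised ratio: mean of the target per category, smoothed towards the global mean
m = 10.0                                               # smoothing strength
global_mean = df["target"].mean()
stats = df.groupby("city")["target"].agg(["mean", "count"])
smoothed = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)
df["city_supervised_ratio"] = df["city"].map(smoothed)
```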
Missing values
  • “the fact that the value is missing can be informative”
  • strategies:
    • keep - the missing value is important information
      • we can add a boolean feature indicating whether the value in another feature (= column) is missing or not
    • replace
      • compute the missing value (mean, median, mode for categorical features) or predict it with regression
      • could create bias in the data
      • the statistical values (mean, median etc.) should be calculated from the training set (not the test set) to avoid data leakage
    • delete
      • if the sample/feature has a lot of missing values, delete it to reduce noise
      • results in data loss
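A short sketch of the keep and replace strategies (the “age” column is illustrative; note the imputer is fitted on the training split only):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"age": [25.0, np.nan, 40.0, 31.0, np.nan, 52.0, 28.0, np.nan]})
X_train, X_test = train_test_split(df, test_size=0.25, random_state=42)

# keep: add a boolean feature saying whether the original value was missing
X_train = X_train.assign(age_missing=X_train["age"].isna())
X_test = X_test.assign(age_missing=X_test["age"].isna())

# replace: impute with a statistic computed from the training set only (no leakage)
imputer = SimpleImputer(strategy="median")                    # or "mean" / "most_frequent"
X_train["age"] = imputer.fit_transform(X_train[["age"]]).ravel()
X_test["age"] = imputer.transform(X_test[["age"]]).ravel()    # test set reuses training statistics
```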
Outliers
  • = values that are outside of the expected range
  • could be found using boxplots or by computing the z-score (standardization using mean and standard deviation); outliers have |z| > 3 (see the sketch below)
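A tiny numpy sketch of the z-score approach (synthetic data with one injected extreme value):

```python
import numpy as np

rng = np.random.default_rng(0)
values = np.append(rng.normal(loc=10, scale=1, size=200), 25.0)  # one injected outlier

z = (values - values.mean()) / values.std()   # standardization: (x - mean) / std
outliers = values[np.abs(z) > 3]              # common rule of thumb: |z| > 3
```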

Modelling

  • selecting the right type of model for the task (clustering, classification, regression, ANN, GBDT, anomaly detection etc.)
  • tips:
    • use GridSearchCV or RandomizedSearchCV (not better, but faster) for systematic tuning of the hyperparameters
      • each model has more and less important hyperparameters; we should always focus on the more important ones first
      • start with a broad selection of values for the first iteration, like C = [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000], and if I get C = 100, run again with values closer to 100
    • for validation, use cross-validation, do not use the test set
    • use pipelines for standardized and reusable data preprocessing from raw data to modelling-ready data
    • watch feature importance to see which features contribute to the prediction and which do not
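A sketch of these tips combined: a preprocessing + model pipeline tuned with GridSearchCV over a broad C grid (the dataset and the logistic-regression estimator are just placeholders):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# pipeline = standardized, reusable preprocessing from raw data to modelling-ready data
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=5000)),
])

# first iteration: broad grid over the most important parameter, scored with cross-validation
param_grid = {"clf__C": [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000]}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)

print(search.best_params_)                  # e.g. if C = 100 wins, re-run with a finer grid around 100
print(search.score(X_test, y_test))         # touch the test set only once, at the very end

# feature importance proxy for a linear model: magnitude of the coefficients
print(search.best_estimator_.named_steps["clf"].coef_)
```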

Evaluation

  • checking whether the trained model and its outputs are valid and reliable, and whether they meet the business success criteria defined earlier

Deployment

  • deploying the model in production
  • done by machine learning engineers