• MLB lectures 3, 4 and 5 - Classification: binary classification models output one of two labels (positives = 1 and negatives = 0)
    • positives imply something interesting/alarming
    • negatives are not interesting cases
    • the dataset will often be imbalanced (negatives outnumber positives)

Accuracy

  • (number of correct predictions) / (number of all samples)
  • it ignores the costs and benefits of individual predictions, so it is not sufficient by itself
    • if the dataset is imbalanced with 96 % negatives and the classifier predicts negative 100 % of the time, it still has 96 % accuracy (see the sketch below)
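  • a minimal sketch of this pitfall (assuming numpy), with the class ratio taken from the example above:

      import numpy as np

      rng = np.random.default_rng(0)
      y_true = (rng.random(1000) < 0.04).astype(int)   # ~4 % positives, ~96 % negatives

      always_negative = np.zeros_like(y_true)          # classifier that always predicts 0
      accuracy = (always_negative == y_true).mean()
      print(f"all-negative classifier accuracy: {accuracy:.1%}")
      # ~96 % accuracy, although it never catches a single positive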

Confusion matrix

  • the confusion matrix carries the complete information about the predictions (the TP, TN, FP and FN counts)
  • a related matrix is the cost-benefit matrix
    • according to the desired usage of the model, we define:
      • costs for false prediction
        • costs for FP and FN can differ
      • benefit for true prediction
        • benefit = value - cost (beware of double counting)
        • benefits for TP and TN can differ
    • it’s totally different across domains (e.g. showing an ad vs. ordering a medical test have very different costs/benefits)
    • using this matrix we can calculate the Expected profit
      • Expected Profit = p(TP)*benefit(TP) + p(TN)*benefit(TN) + p(FP)*cost(FP) + p(FN)*cost(FN), where the costs enter as negative values (see the sketch below)
      • p(X) means the probability of X
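  • a small sketch of the expected-profit calculation; the counts and cost-benefit values here are made up for illustration, with costs entering as negative benefits:

      # hypothetical cost-benefit matrix for a marketing campaign:
      # a conversion brings 99, a wasted offer costs 1, doing nothing is free
      benefit = {"TP": 99.0, "TN": 0.0, "FP": -1.0, "FN": 0.0}

      # hypothetical counts from a confusion matrix
      counts = {"TP": 30, "TN": 900, "FP": 60, "FN": 10}
      n = sum(counts.values())

      # Expected profit = sum over outcomes of p(outcome) * benefit(outcome)
      expected_profit = sum(counts[k] / n * benefit[k] for k in counts)
      print(f"expected profit per customer: {expected_profit:.2f}")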

Baseline

  • a simple model or strategy to which we can compare other models or approaches
    • it should tell us whether it is worth investing more effort (or not)
  • which baseline to use depends on the intended application of the model
  • baselines:
    • random classifier = assigns a random label to every sample
      • compared against this classifier, we measure how much the model learned from the data
      • if the model has the same or similar accuracy, it learned nothing (it behaves like a random classifier)
    • majority classifier = assigns the single most frequent label to every sample (often it is 0)
      • so if the majority target label is 0, it will assign 0 to every sample without exception
      • if the majority classifier has 97 % accuracy (because 97 % of all samples are negative), a model with 97.1 % accuracy is not that impressive
    • single data source classifier = uses only one data source
      • lets us measure whether adding a new data source improves the model
    • simple model (logistic regression, decision tree)
      • tells us whether we should invest in more advanced modelling techniques (see the sketch below)
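  • a sketch of the random, majority and simple-model baselines using scikit-learn's DummyClassifier; the data is synthetic, just for illustration:

      import numpy as np
      from sklearn.dummy import DummyClassifier
      from sklearn.linear_model import LogisticRegression
      from sklearn.model_selection import train_test_split

      rng = np.random.default_rng(1)
      X = rng.normal(size=(2000, 5))
      y = (X[:, 0] + rng.normal(scale=2.0, size=2000) > 2.0).astype(int)  # imbalanced target
      X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

      baselines = {
          "random":   DummyClassifier(strategy="uniform", random_state=1),
          "majority": DummyClassifier(strategy="most_frequent"),
          "simple":   LogisticRegression(),
      }
      for name, model in baselines.items():
          score = model.fit(X_train, y_train).score(X_test, y_test)  # accuracy
          print(f"{name} baseline accuracy: {score:.3f}")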

Visualizing the performance

Profit curve

  • we score the customers with our classifier (a higher score means a higher probability that the customer will buy/convert)
  • the goal is to calculate the profit of targeting the top X % of customers
    • profit = revenue from the customers who buy - the cost of targeting them (e.g. marketing)
  • if we target 0 % of customers, the profit is zero (logically)
  • if we target 100 %, the profit is typically negative (the total cost of targeting everyone exceeds the revenue from the buyers) + at that point no model is needed, since we target everyone regardless of score
    • the added value of the model is the area between the random classifier's straight profit line and the model's profit curve (see the sketch below)
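  • a minimal profit-curve sketch (assuming numpy); the revenue of 50 and targeting cost of 5 per customer are hypothetical numbers:

      import numpy as np

      def profit_curve(y_true, scores, revenue=50.0, cost=5.0):
          """Profit of targeting the top X % of customers, ranked by model score."""
          order = np.argsort(-np.asarray(scores))      # best-scored customers first
          buys = np.asarray(y_true)[order]
          profit = np.cumsum(buys * revenue - cost)    # each target costs `cost`, each buyer brings `revenue`
          targeted = np.arange(1, len(buys) + 1) / len(buys)
          return targeted, profit

      y_true = np.array([1, 1, 0, 1, 0, 0, 0, 0, 1, 0])
      scores = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05])
      pct, profit = profit_curve(y_true, scores)       # plot pct vs. profit to get the curve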

ROC curve

  • plots the true positive rate (TPR) against the false positive rate (FPR) as the classification threshold varies
  • a random classifier lies on the diagonal; the closer the curve bends to the top-left corner, the better the model
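  • a sketch of computing the curve with scikit-learn (the y_true / scores arrays are tiny illustrative examples):

      import numpy as np
      from sklearn.metrics import roc_curve, roc_auc_score

      y_true = np.array([1, 1, 0, 1, 0, 0, 0, 0, 1, 0])
      scores = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05])

      fpr, tpr, thresholds = roc_curve(y_true, scores)    # one (FPR, TPR) point per threshold
      print(f"AUC = {roc_auc_score(y_true, scores):.2f}") # 0.5 = random, 1.0 = perfect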

Cumulative Response Curve

  • “How many of the true positives am I reaching if I target X% of the population?”
  • for example, if I decide to target 20 % of the population, I get 50 % of the true positives (better than a random classifier, which would reach only 20 %)
  • but the information about the false positive rate is missing
    • this chart can be used if the FPs are spread uniformly or the cost of an FP is small (e.g. sending an e-mail is cheap); see the sketch below
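  • a minimal sketch of the cumulative response curve (assuming numpy):

      import numpy as np

      def cumulative_response(y_true, scores):
          """Share of all true positives reached when targeting the top X %."""
          order = np.argsort(-np.asarray(scores))
          hits = np.asarray(y_true)[order]
          targeted = np.arange(1, len(hits) + 1) / len(hits)   # X % of population targeted
          reached = np.cumsum(hits) / hits.sum()               # % of true positives reached
          return targeted, reached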

Lift curve

  • it is derived from the Cumulative Response Curve
  • it shows the ratio of the model's response to a random classifier's at each targeting depth (e.g. reaching 50 % of positives by targeting 20 % of the population is a lift of 2.5); see the sketch below
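  • a minimal sketch of the lift curve derived exactly this way (assuming numpy); a random classifier reaches X % of positives by targeting X % of the population, so its lift is 1.0 everywhere:

      import numpy as np

      def lift_curve(y_true, scores):
          """Lift of the model over a random classifier at each targeting depth."""
          order = np.argsort(-np.asarray(scores))
          hits = np.asarray(y_true)[order]
          targeted = np.arange(1, len(hits) + 1) / len(hits)
          reached = np.cumsum(hits) / hits.sum()      # the cumulative response
          return targeted, reached / targeted         # lift = model / random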