Different areas
- Data science - process of extracting knowledge and insights from data (using statistics, computer science and domain expertise)
- the broadest topic in MLB
- Data analytics - process of examining, cleaning, transforming and modeling data in order to discover useful information and support decision-making
- Data mining - process of discovering patterns, anomalies and relationships in large datasets
- Machine learning - process of building algorithms that learn from data and predict future values without being explicitly programmed to do so
- Artificial intelligence - the art of creating systems to perform tasks where human intelligence is required (reasoning, learning and perception)
- General AI - theoretical form of AI that could perform any task a human can, with general reasoning and learning abilities across domains
- Big Data - see Big Data
- Data engineering - discipline of designing and building systems that can collect, store and process big data efficiently and reliably
- really important for any data analysis or machine learning
- Machine learning engineering - the practice of implementing, deploying and maintaining ML models in production environments
- Data governance - set of policies, processes and standards that ensure data is accurate, secure and used responsibly across the whole organization
Structured data
- stored in tables (rows, columns)
- relational database
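A minimal sketch of structured data in a relational table, using Python's built-in sqlite3 module. The table name `patients` and its columns are hypothetical, chosen only for illustration.

```python
import sqlite3

# In-memory database; the table and column names are made up for this example.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE patients (id INTEGER PRIMARY KEY, age INTEGER, is_sick INTEGER)"
)

# Every row has the same columns with the same types: that is the "structure".
conn.executemany(
    "INSERT INTO patients (id, age, is_sick) VALUES (?, ?, ?)",
    [(1, 34, 0), (2, 58, 1), (3, 45, 0)],
)

rows = conn.execute("SELECT id, age, is_sick FROM patients ORDER BY id").fetchall()
conn.close()
```

Each tuple in `rows` is one row of the table, with one value per column.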
Unstructured data
- data primarily meant for humans, not machines (images, videos, audio, text)
- needs to be preprocessed before use in a classifier
- commonly estimated at around 80 % of the data on the internet (a rough, often-quoted figure rather than a precise measurement)
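One possible preprocessing step for text, sketched below, is a bag-of-words representation: each document becomes a vector of word counts. This is only one of many options; the example sentences and the helper name `bag_of_words` are assumptions made for illustration.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, split it into alphabetic tokens and count them."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)

# Two toy documents (made up for this example).
documents = ["The cat sat on the mat.", "The dog chased the cat."]

# The vocabulary is the sorted set of all words seen in any document.
vocabulary = sorted({word for doc in documents for word in bag_of_words(doc)})

# One numeric vector per document: the count of each vocabulary word.
vectors = [[bag_of_words(doc)[word] for word in vocabulary] for doc in documents]
```

After this step the free text has a fixed-length numeric form that a classifier can take as input.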
Semi-structured data
- sits between structured and unstructured data
- it gives some structure to unstructured data (e.g. text labelling)
- but the structure is not as rigid as in structured data
- XML, HTML, JSON, YAML
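A small sketch of why JSON counts as semi-structured: records carry named fields, but fields can be missing or nested. The records and field names below are invented for illustration.

```python
import json

# Two records with different shapes: the second has no tags and no address.
raw = """
[
  {"name": "Alice", "tags": ["vip"], "address": {"city": "Prague"}},
  {"name": "Bob"}
]
"""
records = json.loads(raw)

# Flatten into rows with a fixed set of columns, filling gaps with defaults.
rows = [
    {
        "name": r["name"],
        "city": r.get("address", {}).get("city"),
        "n_tags": len(r.get("tags", [])),
    }
    for r in records
]
```

The flattening step is where the loose structure gets turned into a rigid, table-like one.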
Key concepts in ML
- features
- = input variables, could be boolean features, categorical features etc.
- used as input variables to predict the outcome I am interested in
- instance/sample
- individual observation in the dataset, described by a set of features
- labels/targets
- the predicted/outcome variable
- label in classification, target in regression
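The three concepts above can be sketched as plain Python lists. The feature names and values are made up; real datasets would usually live in a table or array library.

```python
# Names of the features (the columns of X); hypothetical for this example.
feature_names = ["age", "smoker", "blood_type"]

# X: one row per instance/sample, one column per feature.
# Features can be numeric (age), boolean (smoker) or categorical (blood_type).
X = [
    [34, False, "A"],
    [58, True, "O"],
    [45, False, "B"],
]

# y: one label per instance (here 1 = is sick, 0 = is not sick).
y = [0, 1, 0]

n_samples = len(X)
n_features = len(feature_names)
```

Row `X[i]` and label `y[i]` always describe the same instance.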
Supervised learning
- predicting the output variable (target/label) based on previous (already labelled) data
- labelling past data sometimes requires human labour
Classification
- predicting the category or class label based on previously labelled data
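- we have binary classification (1 - is sick, 0 - is not sick) or multi-class classification (more than two classes); multi-label classification is a separate case where one sample can have several labels at once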
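A minimal sketch of binary classification using a 1-nearest-neighbour rule: a new sample gets the label of the closest labelled sample. The algorithm choice, the function name `predict_1nn` and the toy data are assumptions for illustration, not a recommended model.

```python
def predict_1nn(train_X, train_y, x):
    """Return the label of the training sample closest to x (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    nearest = min(range(len(train_X)), key=lambda i: sq_dist(train_X[i], x))
    return train_y[nearest]

# Toy labelled data: [body temperature, heart rate]; 1 = is sick, 0 = is not sick.
train_X = [[36.6, 60], [36.8, 65], [39.1, 95], [38.7, 100]]
train_y = [0, 0, 1, 1]

label_a = predict_1nn(train_X, train_y, [38.9, 98])
label_b = predict_1nn(train_X, train_y, [36.7, 62])
```

Here `label_a` comes out as 1 and `label_b` as 0, because each query lies closest to a training sample of that class.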
Regression
- predicting a continuous numerical value (based on previous data whose outcomes are continuous numerical values)
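A minimal sketch of regression with one feature: fitting a straight line by ordinary least squares and using it to predict a new value. The function name `fit_line` and the data are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept with a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Toy data generated from y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
prediction = slope * 5.0 + intercept  # predicted outcome for a new sample x = 5
```

Unlike classification, the output is a number on a continuous scale rather than a class.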
Unsupervised learning
- we have only the matrix X of samples and features (no vector of outcomes Y)
- for finding patterns and relationships
Clustering
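A minimal sketch of clustering, assuming a simple k-means procedure on one-dimensional data with fixed starting centres. The function name `kmeans_1d`, the data and the starting centres are chosen only for illustration; note that no labels are used.

```python
def kmeans_1d(values, centers, n_iter=10):
    """Very small k-means: assign each value to its nearest centre, then recompute centres."""
    for _ in range(n_iter):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # A centre with no assigned values keeps its previous position.
        centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
    return centers, clusters

# Unlabelled toy data with two visible groups.
values = [1.0, 1.2, 0.8, 9.0, 9.5, 10.1]
centers, clusters = kmeans_1d(values, centers=[0.0, 5.0])
```

The two groups are recovered purely from the distances between the values.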
Pattern mining
- for finding patterns, associations, relationships in large datasets
- e.g. finding items that are often purchased together
- Association rule mining
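A minimal sketch of the "purchased together" idea: counting item pairs across shopping baskets and computing support and confidence for one rule. The baskets are made up, and this only counts pairs; it is not a full association rule mining algorithm such as Apriori.

```python
from collections import Counter
from itertools import combinations

# Toy transactions (shopping baskets), invented for this example.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
]

item_counts = Counter()
pair_counts = Counter()
for basket in transactions:
    item_counts.update(basket)
    # Sort so that each pair is always stored in the same order.
    pair_counts.update(combinations(sorted(basket), 2))

n = len(transactions)
pair = ("bread", "butter")

# Support: share of all baskets that contain both items.
support = pair_counts[pair] / n
# Confidence of the rule bread -> butter: share of bread baskets that also contain butter.
confidence = pair_counts[pair] / item_counts["bread"]
```

With these baskets, bread and butter appear together in 2 of 4 baskets (support 0.5), and 2 of the 3 baskets with bread also contain butter.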