The Data Science Roadmap, 8 main chapters
β†’ AI engineering notes for more study related to AI specifically

  1. Maths and Statistics
  2. Coding
  3. Data101, EDA
  4. Machine Learning
  5. Deep Learning, AI
  6. MLOps, cloud
  7. Data Engineering
  8. Data Science Challenges
  9. Appendix

… other related study roadmaps:


1. Maths and Statistics

def:

  • Mathematics is the foundation of Data Science (DS) and AI, it’s the study of numbers, quantities, shapes and patterns using logic and abstract reasoning.
    • β†’ find universal truths and relationships through deduction.
    • β‡’ in DS, this provides the fundamental language for algorithms, it’s the β€œhow” things work computationally
  • Statistics is the science of collecting, analysing, interpreting and presenting data.
    • β†’ framework for making sense of information in the face of uncertainty
    • β‡’ in DS, this provides the methods for drawing conclusions from data, it’s the β€œwhat does this mean?”

Mathematics is about certainty,
Statistics is about uncertainty.

1.1 Mathematics foundations

1.1.1. Linear Algebra

def: maths of data structures, it deals with vectors, matrices and operations on them.
β†’ datasets = matrices and all ML model computations are matrix operations.

  • Matrices
  • Eigenvalues

1.1.2. Calculus

def: maths of change, focusing on limits, functions, derivatives, integral, infinite series, gradients
β†’ it’s related to how ML models learn, gradients point the direction to update the model parameters to reduce errors.

> Gradients ❓
def: generalisation of derivatives that rpz the slope and direction of steepest ascent of multi-dim functions

  • Gradient Descent
  • Stochastic Gradient Descent
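
e.g. a minimal gradient descent sketch in Python (toy 1-D function, made-up learning rate, purely illustrative):

```python
# Minimise f(w) = (w - 3)^2, whose gradient is f'(w) = 2 * (w - 3)
def grad(w):
    return 2 * (w - 3)

w = 0.0                # initial parameter value
lr = 0.1               # learning rate (step size)
for _ in range(100):
    w -= lr * grad(w)  # step against the gradient = steepest descent

print(round(w, 4))     # ~3.0, the minimum of f
```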

> Optimisation ❓
def: xxx

  • xxx

1.1.3. Probability Theory

def: maths of uncertainty, it quantifies how likely events are; foundation for stats that deals with the analysis of random phenomena and uncertainty quantification
β†’ it’s the foundation for making predictions and dealing with noisy, real-world data

> Probability Distributions

def: mathematical functions that provide the probability of occurrence of different possible outcomes in an experiment

Common ones:

  • Normal distribution (a.k.a. Gaussian)
  • Uniform distribution
  • Binomial distribution
  • Poisson distribution
  • Exponential distribution

Each distribution is either continuous or discrete (some families, like the uniform, exist in both forms):

  • Continuous: describe measurable quantities (value in a range) e.g. total value of money
  • Discrete: describe countable outcomes (whole numbers) e.g. number of coins
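
e.g. sampling from these common distributions with numpy (parameters are arbitrary, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(seed=42)

normal   = rng.normal(loc=0.0, scale=1.0, size=1000)   # continuous, mean / std
uniform  = rng.uniform(low=0.0, high=1.0, size=1000)   # continuous, flat
binomial = rng.binomial(n=10, p=0.5, size=1000)        # discrete, successes in 10 trials
poisson  = rng.poisson(lam=3.0, size=1000)             # discrete, events per interval
expo     = rng.exponential(scale=1/3.0, size=1000)     # continuous, time between events

print(normal.mean(), binomial.mean(), poisson.mean())
```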

>> Normal Distribution (Gaussian)
def: the classic β€œbell curve”

  • Symmetrical, continuous
  • Mean = Median = Mode
  • Parameters: Mean (μ) and Standard Deviation (σ)
  • The most common distribution in natural phenomena, e.g. height distribution, test scores

>> Uniform Distribution
def: every outcome has equal probability

  • Flat and constant probabilities
  • Can be continuous or discrete
  • e.g. dice rolls, lottery draws, random number generators

>> Binomial Distribution
def: counts successes in a fixed number of yes/no trials

  • Discrete, two possible outcomes
  • Parameters: Number of Trials (n) and Probability of Success (p)
  • e.g. coin flips, A/B test conversions

>> Poisson Distribution
def: counts events happening in a fixed interval of time/space

  • Discrete, events are independent
  • Parameters: Average Rate (λ)
  • e.g. website visits per hour, customer arrivals per hour, system failures per day

>> Exponential Distribution
def: models time between events in a poisson process

  • Continuous, β€œmemoryless”
  • e.g. time between customer arrivals, time between earthquakes

> Bayes Theorem

def: probability of an event based on prior knowledge of conditions related to the event

  • start with an initial belief
    • prior P(H)
  • see new data
    • evidence P(E),
    • and its likelihood P(E|H)
  • then update to get a revised belief
    • posterior P(H|E) = P(E|H) × P(H) / P(E)

Example:

  • P(Fire) = 1%
  • P(Smoke) = 10%
  • P(Smoke|Fire) = 90%
  • What’s the probability of P(Fire|Smoke)?
    • = P(Smoke|Fire) × P(Fire) / P(Smoke) = 0.9 × 0.01 / 0.1 = 0.09 = 9%
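
the same update as a tiny Python sketch (function name is just illustrative):

```python
# Bayes' theorem: P(Fire|Smoke) = P(Smoke|Fire) * P(Fire) / P(Smoke)
def posterior(prior, likelihood, evidence):
    return likelihood * prior / evidence

print(posterior(prior=0.01, likelihood=0.90, evidence=0.10))  # 0.09 -> 9%
```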

>> Naive Bayes
cf. >> ii. Bayesian ML in 4. Machine Learning

>> Bayesian Inference
def: statistical framework that treats probability as a degree of belief that gets updated as new data arrives

  • Traditional approach (frequentist): β€œthere’s a 95% probability that data would look like this if my hypothesis were true”
    • β†’ Hypothesis inform Data
  • Bayesian approach: β€œthere’s a 95% probability that my hypothesis is true given this data”
    • β†’ Learn as you go, update beliefs with new evidence
    • β‡’ Data inform Hypothesis

1.2. Statistics foundations

1.2.1. Descriptive Stats

def: methods to summarise and describe the main features of a dataset quantitatively

for more

> Mean
def: arithmetic average of a set of values β†’ sum all values, divide by number of values
βž– sensitive to outliers

> Median
def: middle value separating the higher half from the lower half of a data set β†’ robust measure of central tendency
βž• unaffected by outliers

> Mode
def: value that appears most often in a dataset
βž• useful for categorical data β†’ most common category

> Variance
def: measure of dispersion that rpz how far each number in the set is from the mean, thus from every other number in the set

> Standard Deviation
def: the sq root of the variance β†’ variance in the same units as the data

> Covariance
def: joint variability of two random variables β†’ direction of linear relationship

  • positive = same trend
  • negative = opposite trend

> Correlation
def: standardised covariance β†’ measure the strength and direction of linear relationship between two variables ([-1, 1])
βž• easier to interpret cause normalised

  • positive and close to 1 = strong positive relationship
  • 0 = no linear relationship

→ correlation = covariance / (std of X × std of Y)
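
a quick numpy/pandas sketch of these summary stats (made-up data, illustrative only):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(50, 10, size=200)          # some numeric feature
y = 2 * x + rng.normal(0, 5, size=200)    # roughly linear in x

s = pd.Series(x)
print(s.mean(), s.median())               # central tendency
print(s.var(), s.std())                   # spread (variance, std)
print(np.cov(x, y)[0, 1])                 # covariance: direction of the joint trend
print(np.corrcoef(x, y)[0, 1])            # correlation: standardised, in [-1, 1]
```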

> Skewness vs. Kurtosis

  • Skewness: measures asymmetry of a distribution
    • positive skew: tail on the right, mean > median (order: mode, median, mean)
    • negative skew: tail on the left, mean < median (order: mean, median, mode)
    • zero skew: symmetric distribution e.g. normal curve (mean, median, mode similar)
  • Kurtosis: measures the tail heaviness and peak sharpness relative to a normal distribution
    • high: heavy tails, sharp peak β†’ more outliers
    • low: light tails, flatter peak β†’ fewer outliers
    • normal distribution: kurtosis=3
    • β‡’ tail thickness / outlier proneness

> QQ plot
def: Quantile-Quantile plot is a visual tool to check if a dataset follows a theoretical distribution (usually the normal distribution, i.e. it’s a normality visual check tool)
β†’ plot data’s quantiles against the quantiles of a theoretical distribution

  • Straight diagonal line β†’ distribution matches the theoretical one
  • S-shaped curve β†’ skewness (positive or negative)
  • Points curved above/below line at ends β†’ heavy or light tails (kurtosis difference)
  • Outliers β†’ individual points far from the line at ends
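
e.g. a normality check with scipy's probplot (simulated data, illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(1)
data = rng.normal(loc=0, scale=1, size=300)   # swap in your own sample

# Sample quantiles vs. theoretical normal quantiles;
# points close to the diagonal suggest approximate normality.
stats.probplot(data, dist="norm", plot=plt)
plt.title("Q-Q plot vs. normal distribution")
plt.show()
```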

1.2.2. Inferential Stats

def: methods to draw conclusions and make predictions about a population, based on a sample data
β†’ which includes hypothesis testing, confidence intervals, estimations, …
β†’ it’s about going beyond the data you have to understand something bigger

i. Hypothesis Testing

def: a process within inferential stats where you can make an assumption (the null hypothesis) and then use sample data to see if there is enough evidence to reject that assumption

  • Null Hypothesis (H₀): default assumption that there is no effect or difference, nothing's happening, might be pure luck
  • Alternative Hypothesis (H₁): hypothesis that there is an effect or a difference, something else is happening, not exactly pure luck

πŸ”‘ basically to determine the effect of chance on the differences

There are 3 common statistical tests:

  • t-tests: for comparing means (2 groups)
  • Chi-square tests: for categorical data
  • ANOVA: for comparing means of multiple groups (sort of a more advanced t-test)

| Test | Definition | Use | Example |
| --- | --- | --- | --- |
| t-tests | Compares means between two groups. | Used to compare the average values of two groups. | For instance, to see if the average sales differ between two regions. |
| Chi-square | Examines frequency distribution. | Used when analyzing frequencies across categories. | For example, to check if the distribution of customers between two stores is different. |
| ANOVA | Compares means between three or more groups. | Used to compare the average values across more than two groups. | For instance, you can compare average sales across multiple cities, such as New York, Chicago, and Los Angeles. |

> t-tests

def: statistical test used to determine if there’s a significant difference between the means of two groups

  • t-value or t-score: ratio of the difference between the means of the two sets and the variation that exists within them
    • t-score >>> critical t-score
      • → groups are different, can reject H₀
    • t-score <<< critical t-score
      • → groups are similar
  • degrees of freedom: values in a study that have the freedom to vary
  • t-distribution or t-table: reference table that lists critical values of t, which define the threshold of significance for certain stat tests
    • two-tailed tests: used when H₁ is non-directional i.e. the hypothesis states that a population parameter is not equal to a certain value (≠) (e.g. one population is ≠ from the other)
    • one-tailed tests: used when H₁ is directional i.e. the hypothesis states that a population parameter is > or < something (e.g. one population is bigger than the other)

different types of t-tests:

  • paired (dependent): two dependent groups, can be same group but β‰  conditions
  • equal variance or pooled (independent): two different populations
  • unequal variance (independent): two different populations
  • one-sample: one group being compared against a standard reference value

depending on

  • similarity of the sample records
  • number of data records in each sample set
  • variance in each sample set

e.g.

  • t-score = 2.24
  • degrees of freedom = 24
  • alpha level or level of significance: α = 0.05 (generally)
  • method 1: compare to the critical t-value from the t-distribution (one-tailed or two-tailed)
    • critical t-score: 2.064 (two-tailed, df = 24, α = 0.05)
    • ⇒ t-score > critical t-score
    • ⇒ can reject H₀!
  • method 2: calculate p-value
    • if p-value << 0.05
    • ⇒ can reject H₀
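
e.g. an independent two-sample t-test with scipy (made-up data, illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
region_a = rng.normal(loc=100, scale=15, size=30)   # e.g. sales in region A
region_b = rng.normal(loc=110, scale=15, size=30)   # e.g. sales in region B

# Welch's t-test (independent samples, unequal variances)
t_score, p_value = stats.ttest_ind(region_a, region_b, equal_var=False)
print(t_score, p_value)
if p_value < 0.05:
    print("reject H0: the group means differ significantly")
else:
    print("fail to reject H0")
```
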
> Chi-Square Test

def: hypothesis test used to examine relationships between categorical variables (e.g. yes/no, categories) such as whether two variables are independent (chance) or related.
β†’ non-parametric test (e.g. no assumption of normal distribution) since categorical vars can be nominal, ordinal, or just groupings β€” not continuous values, hence no normal distribution

examples:

  • chi-square test checks if the distribution of disease (yes/no) depends on smoking status (smoker, non-smoker)
  • flavour preference (categories) differs between males or females
  • left-right handed vs. nationality

>> independence test
def: used to test if 2 categorical variables are related to each other, or independent

>> goodness of fit
def: used to test if the frequency distribution of a categorical variable is different from your expectations (hypothesis)
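
e.g. both chi-square variants with scipy (toy counts, illustrative):

```python
import numpy as np
from scipy import stats

# Independence test: disease (yes/no) vs. smoking status as a contingency table
observed = np.array([[30, 70],    # smokers:     disease yes / no
                     [15, 185]])  # non-smokers: disease yes / no
chi2, p_value, dof, expected = stats.chi2_contingency(observed)
print(chi2, p_value, dof)

# Goodness of fit: is a die fair? observed counts vs. expected counts
counts = [18, 22, 16, 25, 20, 19]
chi2, p_value = stats.chisquare(counts, f_exp=[20] * 6)
print(chi2, p_value)
```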

> ANOVA

def: ANalysis Of VAriance, a collection of stat tests used when comparing the means of three or more groups to see if at least one group’s mean is different from others.
β†’ experiments where you want to compare multiple groups on a numeric outcome
β†’ variance within the group vs. between the groups
β†’ determine the influence of independent variables on the dependent variable in a regression study

  • ANOVA coefficient = F-statistic = Between group Variance / Within group Variance = MS_between / MS_within
    • MS = Mean sum of squares
  • if F >> F-critical: between-group variance big, within-group variance small ⇒ at least one of the group means is significantly ≠ from the others
    • ⇒ can reject H₀, something must be going on!
  • if F << F-critical: between-group variance small and within-group variance big ⇒ dispersed, homogeneous, not significantly ≠
    • ⇒ cannot reject H₀, maybe due to chance?

more details:

  • arithmetic mean for each group i: x̄_i
  • overall mean: x̄
  • sum of squares for each group i: SS_i = Σ_j (x_ij - x̄_i)²
  • sum of squares between groups: SS_between = Σ_i n_i (x̄_i - x̄)²
  • sum of squares within groups: SS_within = Σ_i SS_i
  • total sum of squares: SS_total = SS_between + SS_within
  • mean sum of squares between groups: MS_between = SS_between / df_between
    • degrees of freedom between groups: df_between = k - 1 (k = number of groups)
  • mean sum of squares within groups: MS_within = SS_within / df_within
    • degrees of freedom within groups: df_within = N - k (N = total number of observations)
  • F-statistic = MS_between / MS_within
    → you can calculate the p-value after this, remember that p-value <<< means that H₀ can be rejected

some assumptions need to be met:

  • independence of observations
  • homogeneity of variances
  • normal distribution
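
e.g. a one-way ANOVA with scipy on the teaching-methods example below (made-up scores, illustrative):

```python
from scipy import stats

# Exam scores under 3 different teaching methods
method_a = [78, 82, 88, 75, 90, 85]
method_b = [72, 74, 69, 80, 77, 71]
method_c = [91, 87, 94, 89, 95, 90]

f_stat, p_value = stats.f_oneway(method_a, method_b, method_c)
print(f_stat, p_value)
if p_value < 0.05:
    print("reject H0: at least one group mean differs")
```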

examples:

  • crop yields vs. 3Β β‰ Β fertilisers
  • reduce blood pressure vs. 4Β β‰ Β drugs
  • exam test scores vs. 3 β‰ Β teaching methods
    β‡’ are the differences in mean between the groups significant enough to be due to the different testing variables? or is it more likely due to chance (i.e. not high enough)?

>> one-way ANOVA

  • 1 independent variable (i.e. a single factor)
  • β‰₯ 2 groups
  • goal: determine if a significant β‰ Β exists between the means of the groups.
    β†’ simpler to interpret

example:

  • 1 independent variable = teaching method
  • 3 groups of student
  • 1 dependent variable = exam score
  • null hypothesis H₀: mean exam scores across all 3 teaching methods are equal
  • alternative hypothesis H₁: at least one group's mean significantly differs
  • β‡’ ANOVA tells if the variation in exam scores can be due to differences in teaching methods or if it’s likely due to chance

>> two-way ANOVA

  • 2 independent variables
  • each with β‰₯ 2 groups
  • goal: analyse how both indep var influence the dependent var (i.e. the result)
    β‡’ deeper insight into how different factors together can impact outcomes

example:

  • 1 dependent variable = exam score
  • 1/2 independent variable = does the teaching method affect exam score?
  • 2/2 independent variable = does the study method affect exam score?
  • interaction = does effectiveness of teaching method depend on study method used?

note: Post-Hoc Tests after ANOVA to determine which groups are different

ii. P-values

def: probability of obtaining test results at least as extreme as the results actually observed, assuming the null hypothesis is true (i.e. "nothing special is happening" until you find good evidence to say otherwise)
→ i.e. how likely it is to see your data happening purely by chance
→ i.e. an indicator of statistical significance, but not of the size or importance of an effect
→ i.e. helps decide if the observed patterns in the data are likely due to random chance or reflect a real underlying phenomenon
→ i.e. how surprising your data would be if the null hypothesis were true

  • p-value < 0.05 (5%)
    • β†’ β€œpurely by chance” <<<
    • β‡’ observed data is unlikely under the null hypothesis
    • β‡’ there is enough evidence to reject the null hypothesis in favour of an alternative hypothesis β†’ something else is going on
  • vs. p-value >>
    • β†’ β€œpurely by chance” >>>
    • β‡’ data is consistent with the null hypothesis
    • β‡’ there is insufficient evidence to reject it
    • β†’ well nothing is going on then

iii. Confidence Intervals

def: range of values that is used to estimate an unknown population parameter (e.g. mean) with a certain level of confidence
β†’ instead of giving just one number, it’s a range where the true value is likely to be
β†’ sense of reliability, uncertainty and precision of your estimate from sample data

  • confidence interval = range where true value likely falls
  • confidence level (e.g. 95%) = how sure you are about this range

Concrete example:

  • goal: estimate student height at a school
  • sample data: measure the height of 30 students → mean estimate x̄
  • Standard Error: SE = s / √n (s = sample std, n = 30)
  • Margin of Error for 95% (i.e. z-value = 1.96, related to the normal distribution): ME = 1.96 × SE
  • ⇒ the confidence interval would be x̄ ± ME with a confidence lvl of 95%
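
the same computation as a short Python sketch (fabricated heights, illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
heights = rng.normal(loc=170, scale=8, size=30)   # 30 measured students (cm)

n = len(heights)
mean = heights.mean()                      # mean estimate
se = heights.std(ddof=1) / np.sqrt(n)      # standard error = s / sqrt(n)
me = 1.96 * se                             # margin of error at 95% (z = 1.96)

print(f"95% CI: [{mean - me:.1f}, {mean + me:.1f}] cm")
```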

iv. Stats Cheatsheet

… for finance and investing apps.

source: https://www.investopedia.com/terms/a/anova.asp

| Test | Purpose | When to Use | Applications in Finance/Investing |
| --- | --- | --- | --- |
| ANCOVA | Compares the arithmetical means of two or more groups while controlling for the effects of a continuous variable | • Normal distribution • Comparing multiple independent variables with a covariate | • Analyzing investment returns while controlling for market volatility • Evaluating the effectiveness of financial strategies while accounting for economic conditions |
| ANOVA | Compares the means of three or more groups | • Data is normally distributed | • Comparing financial performance across different sectors or investment strategies |
| Chi-Square Test | Tests for association between two categorical variables (can't be measured on a numerical scale) | • Data is categorical (e.g., investment choices, market segments) | • Analyzing customer demographics and portfolio allocations |
| Correlation | Measures the strength and direction of a linear relationship between two variables | • Data is continuous | • Assessing risk and return of assets, portfolio diversification |
| Durbin-Watson Test | Checks if errors in a prediction model are related over time | • Time series data | • Detecting serial correlation in stock prices, market trends |
| F-Test | Compares the variances of two or more groups | • Data is normally distributed | • Testing the equality of variances in stock returns and portfolio performance |
| Granger Causality Test | Tests for a causal relationship between two time series | • Time series data | • Determining if one economic indicator predicts another |
| Jarque-Bera Test | Tests for normality of data | • Continuous data | • Assessing if financial data follows a normal distribution |
| Mann-Whitney U Test | Compares medians of two independent samples | • Data is not normally distributed | • Comparing the financial performance of two groups with non-normal distributions |
| MANOVA | Compares means of two or more groups on multiple dependent variables simultaneously | • Data is normally distributed • Analyzing multiple related outcome variables | • Assessing the impact of different investment portfolios on multiple financial metrics • Evaluating the overall financial health of companies based on various performance indicators |
| One-Sample T-Test | Compares a sample mean to a known population mean | • Data is normally distributed, or the sample size is large | • Comparing actual versus expected returns |
| Paired T-Test | Compares means of two related samples (e.g., before and after measurements) | • Data is normally distributed, or the sample size is large | • Evaluating if a financial change has been effective |
| Regression | Predicts the value of one variable based on the value of another variable | • Data is continuous | • Modeling stock prices • Predicting future returns |
| Sign Test | Tests for differences in medians between two related samples | • Data is not normally distributed | • Non-parametric alternative to the paired t-test in financial studies |
| T-Test | Compares the means of two groups | • Data is normally distributed, or the sample size is large | • Comparing the performance of two investment strategies |
| Wilcoxon Rank-Sum Test | Compares the medians of two independent samples | • Data is not normally distributed | • Non-parametric alternative to the independent t-test in finance |
| Z-Test | Compares a sample mean to a known population mean | • Data is normally distributed, and the population standard deviation is known | • Testing hypotheses about market averages |

1.2.3. Time Series Analysis

def: statistical methods for analysing data points collected in chronological order to identify patterns, trends and make forecasts.

💡 Datapoints are not independent: each point depends on previous ones → temporal dependence

  • Time series: set of observations recorded over time (can be daily, monthly, etc.)
> Key Components of a Time Series

def: series = trend + seasons + cycles + error/noise/residue

  • Trend: Persistent long-term increase or decrease in the mean of the series
    • β†’ slowest moving part of a series, i.e. the largest time scale of importance
    • time-dependent concept
  • Seasonality: Regular, predictable/periodic patterns that repeat (e.g. daily, weekly, yearly) in the mean of the series
    • β†’ often driven by the cycles of the natural world, or by conventions of social behaviour surrounding dates and times
    • time-dependent concept
    • e.g. seasons, time of year, day of week dependent
  • Cyclical: Irregular, long-term fluctuations (e.g. economic cycles)
    • β†’ patterns in a time series associated with how the value in a series at one time depends on values at previous times, but not necessarily on the time step itself
    • serial-dependent concept
    • e.g. volcano eruptions, epidemics, animal populations
  • Noise/Random: Unexplained variation

Some modeling tools for each of the component:

>> Trend modeling tools
  • Moving Average
  • Engineering Trend
>> Seasonality tools
  • Seasonal indicators
  • Fourier features
>> Cycle tools
  • Lags
> Essential Concepts of Time Series
  • Stationarity: A time series whose statistical properties (mean, variance) don’t change over time
    • tests: xxx
  • Autocorrelation: How a series correlates with its own past values
  • Decomposition: Breaking a series into Trend + Seasonality + Residual components
> Common Time Series Models & Methods

… generally for forecasting.

>> ARIMA
def: AutoRegressive Integrated Moving Average, the classical statistical model

notes: and all the derivatives…
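
e.g. a minimal ARIMA fit/forecast sketch, assuming statsmodels is available (toy series, illustrative only):

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Toy monthly series: upward trend + noise (replace with real data)
rng = np.random.default_rng(3)
idx = pd.date_range("2020-01-01", periods=48, freq="MS")
series = pd.Series(np.linspace(100, 150, 48) + rng.normal(0, 3, 48), index=idx)

model = ARIMA(series, order=(1, 1, 1))   # (p, d, q): AR terms, differencing, MA terms
fit = model.fit()
print(fit.forecast(steps=6))             # forecast the next 6 months
```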

>> Exponential Smoothing
def: weighted averages where recent observations get more weight

>> LSTM
def: Long Short Term Memory
β†’ DL approach for complex temporal patterns

>> Prophet
def: additive forecasting model open-sourced by Facebook/Meta that fits trend, seasonality and holiday effects; designed to be easy to use on business time series

> Time Series β€” misc

>> Applications

  • Stock market forecasting
  • Demand/sales prediction
  • Weather forecasting
  • IoT sensor monitoring
  • Web traffic analysis

>> Special considerations

  • No random splitting: must use time aware train/test splits β†’ cf. > Time Series Split
  • Cross-Validation: use techniques like β€œrolling window” validation
  • Feature Engineering: lags, rolling average, seasonal indicators, …

2. Coding

def: programming languages specifically suited for data manipulation, stat analysis, ML implementation across different scales and domains

2.1. Python

def: versatile, high-level programming language renowned for its simplicity and extensive ecosystem of data science libraries (incl. numpy, pandas, scikit-learn, …)

More

2.1.1. numpy

def: fundamental package for scientific computing in python, providing support for large, multi-dim arrays and matrices along with mathematical functions

  • ndarray: N-dim array object providing efficient storage and operations for numerical data (i.e. matrix❓)
  • vectorise: element-wise array operations that avoid explicit loops, significantly improving computational performance through optimised C implementations
  • broadcasting: rules that allow arithmetic operations between arrays of different shapes by implicitly stretching the smaller array ❓❓
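
a small sketch of all three ideas (illustrative):

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])             # ndarray: a 2 x 3 matrix

doubled = a * 2                       # vectorised, element-wise (no Python loop)
shifted = a + np.array([10, 20, 30])  # broadcasting: the 1-D row is stretched over both rows

print(doubled)
print(shifted)
print(a @ a.T)                        # matrix multiplication (2 x 2 result)
```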

2.1.2. pandas

More

def: fast, powerful, flexible open-source data analysis and manipulation tool

  • DataFrame: 2D labeled data structure with columns, holding any data type
  • Series: 1D labeled array, can hold any data type

Data Processing: methods to prepare data for analysis

  • missing data
  • remove duplicates
  • correct inconsistencies
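
e.g. a few of these steps on a toy DataFrame (column names are made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "city": ["Paris", "Lyon ", "Paris", "Lyon", "Paris"],
    "sales": [100, np.nan, 100, 250, 300],
})

df["city"] = df["city"].str.strip().str.title()         # correct simple inconsistencies
df = df.drop_duplicates()                               # remove duplicate rows
df["sales"] = df["sales"].fillna(df["sales"].median())  # impute missing values

print(df.describe())                                    # quick stats summary
print(df.groupby("city")["sales"].mean())               # aggregate by category
```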

2.1.3. ML/DL libraries

i. scikit-learn

def: comprehensive ML library featuring simple and efficient tools for data mining and analysis, built on numpy, scipy and matplotlib.

note: ah well actually yes, this is indeed the library I use for the whole ML part xd

❓ this deserves its own note

  • pipeline
  • gridsearchCV
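
a minimal Pipeline + GridSearchCV sketch (iris as a stand-in dataset, hyper-parameter grid is arbitrary):

```python
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# Pipeline: scaler + model are fit together, which avoids preprocessing leakage
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", SVC()),
])

# GridSearchCV: try hyper-parameter combinations with 5-fold cross-validation
grid = GridSearchCV(pipe, param_grid={"clf__C": [0.1, 1, 10]}, cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```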

ii. PyTorch

def: research-friendly DL framework from Facebook, more intuitive for many and popular in academia

iii. TensorFlow / Keras

def:

  • TensorFlow (Google): powerful, production-level DL framework, can be complex, but very scalable
  • Keras: user-friendly API that runs on top of TensorFlow β€” easiest way to start building NNs

2.1.4. Visualisation libraries

i. matplotlib

def: fundamental plotting lib for python, comprehensive and highly customisable 2D plotting library

❓ deserves its own note too

  • figures and axes
  • charts and plots

ii. seaborn

def: built on top of matplotlib, provides stats visualisations with nice defaults and simpler syntax

2.2. R

def: specialised programming language and environment for stat computing and graphics

2.3. SQL

def: Structured Query Language, standard language for managing and querying relational DB systems. β†’ it’s a language, not a product

some definitions about databases:

  • Relational DB (RDB): organise data into structured tables composed of rows and columns
    • each row = record, uniquely identified by a primary key
    • each column = field, attribute
    • tables are linked through relationships using foreign keys
    • βž• data integrity, data consistency (ACID properties: Atomicity, Consistency, Isolation, Durability)
    • β‡’ SQL is the primary language used to manip and query RDB
  • Non-Relational DB (NoSQL): store data in formats other than tables, such as documents (MongoDB), key-val pairs (Redis), wide-column stores (Cassandra), or graphs (Neo4j).
    • βž• flexibility with unstructured or semi-struct data
    • βž• scale horizontally
    • βž– not fully enforce ACID compliance
  • NewSQL DB: combine scalability of NoSQL with ACID transactions of RDB, providing both high performance and consistency.

a bit further:

  • DB Management System (DBMS): full software sys that users and apps interact with for data management
    • β†’ tools and interfaces to create, update, query data
    • β†’ overall DB ops: user management, transaction control, integrity checks, concurrency, backups, query parsing
    • e.g. MySQL, PostgreSQL, Microsoft SQL Server, Oracle
  • DB Engine: a.k.a. storage engine, it’s a core component or subsystem within a DBMS responsible specifically for data handling, storage and retrieval.
    • β†’ executes CRUD (Create, Read, Update, Delete) ops
    • β†’ manages low-level tasks: indexing, caching, transaction support
    • e.g. SQLite, InnoDB, MyISAM
      • β†’ can have its own tweaks in querying language right?

| Relational DB | NoSQL (Document) | NoSQL (Wide-col) | NoSQL (Graph) | NoSQL (Key-value) |
| --- | --- | --- | --- | --- |
| MySQL | MongoDB | Cassandra | Neo4j | Redis |
| PostgreSQL | ElasticSearch | BigTable | Neptune | DynamoDB |
| Oracle | CosmosDB | HBase | | Memcached |
| MS SQL | CouchDB | | | |

cf. 7.1. Data Eng Cycle for more

More

code: SQL 101

2.3.1. MySQL

def: open-source RDB Management System (RDBMS) that uses SQL
β†’ popular for webapps, easy to use and widely supported

2.3.2. PostgreSQL

def: advanced open-source RDB Management System (RDBMS)
β†’ standards compliance, extensibility, powerful features like JSON support and robustness
β†’ suitable for complex queries and large-scale apps

2.3.3. SQLite

def: lightweight, serverless, file-based DB engine.
β†’ embeddable in apps, used for mobile, desktop, and testing env
β†’ not designed for heavy concurrent loads but very convenient

2.4. SE for DS

> Version control Git
def: git is a system for tracking code changes, collaborating with others, reverting mistakes, etc.
β†’ essential for reproducibility

  • GitHub / GitLab: platforms for hosting and managing Git repositories

More

code: git 101

> APIs and Deployment
def: Application Programming Interface, a set of rules allowing different applications to communicate

  • FastAPI / Flask: python framework for building web APIs.

> Basic Scripting
def: CLI & Scripting used for automating tasks and running pipelines from the command line

More

code: CLI-bash 101

3. Data101, EDA

def: foundation of all data work, involving core concepts, processes and methodologies for

  1. collecting,
  2. cleaning,
  3. processing, and
  4. understanding data before analysis.

Main data disciplines, focus on distinct aspects of the data lifecycle:

  • Data Engineering (cf 7. Data Engineering)
    def: build and maintain the infrastructure and pipelines that collect, clean, transform, and store data for analysis and use.
  • Data Science
    def: analyse and model data to extract insights, build predictive models and support decision-making using stats, ML, and domain knowledge.
  • Data Analytics (cf. 3.6. Data Visualisation Tools)
    def: focus on interpreting processed data through visualisation, reporting, and descriptive stats to inform business actions
  • Machine Learning Engineer (cf. 6. MLOps, cloud)
    def: bridge data science and software engineering by deploying and optimising ML in production env.
  • AI Engineering (cf. 5. Deep Learning, AI)
    def: build, test and deploy AI models and systems, combining ML, SE and data handling to create intelligent apps.
  • Data Governance
    def: ensure data quality, security, privacy and compliance through policies and processes

Some data handling terms:

  • Data Wrangling (or munging): broadest term to refer to entire process of taking raw, messy data and transforming into clean, structured format.
    β†’ include cleaning, enriching, transforming, merging data from various sources.
  • Data Pre-processing: subset of wrangling that specifically prepares the data for ML models. the focus is on making the data digestible for algorithms.
    • Feature Engineering: create better inputs for ML models, key part of data pre-processing
  • Data Processing: general term for any operation on data, it can be synonymous with wrangling, or refer to large-scale data transformation in data engineering (ETL)
  • Data Cleaning: correct the imperfections in data, which is a critical sub-task within wrangling
  • Data Mining: discover hidden patterns in data, this comes after data is prepared, and typically makes use of ML techniques
  • ETL/ELT: create data pipelines from sources to storage, can be considered as a part of data engineering or a form of large-scale processing.
  • Data Modeling: design how data is structured and organised (e.g. schema design for DB), ensuring data is stored and accessed efficiently and meaningfully β†’ more DB/data engineering design, often involved early in pipeline
  • Data Analysis: answer questions with data (human-led) β‡’ the overarching goal with all of these data related processes

3.1. Data Types

def: classification of data based on its structure, organisation, format

> Structured Data

def: highly organised data with a predefined schema, typically stored in relational DB with rows and columns
β†’ SQL tables, CSV files, spreadsheets, …

> Unstructured Data

def: data without a predefined organisational structure, which requires specialised processing techniques β†’ no predefined structure
β†’ text documents, images, videos, audio files, social media posts, …

> Semi-structured Data

def: data that doesn’t conform to rigid structure but contains organisational properties like tags and markers
β†’ json, xml, email formats

3.2. Data Preprocessing

def: crucial data cleaning and preparation phase where raw, messy data is transformed into a clean, structured format suitable for analysis (or specifically ML models).

There are common steps to address in this process (not all covered here):

  1. πŸ”œ data acquisition and import (DB, csv, APIs)
  2. πŸ”œ data integration (combine from sources, formats)
    cf. 7.1.1. Data Generation (Collection & Integration) in the 7. Data Engineering chapter
  3. 🟒 data cleaning (na, duplicates, outliers, inconsistencies)
  4. 🟒 data transformation (type conversion, norm/stand num, encode cat)
  5. πŸ”œ data reduction (reduce dim, feature selection or extraction)
    cf. > Dimensionality reduction for some examples
  6. πŸ”œ feature engineering (new meaningful features)
    cf. 3.4. Feature Engineering in this same chapter
  7. πŸ”œ data splitting (train, validation, test)
    cf. 3.5. Data Splitting in this same chapter

🟒: covered here / done
βœ”: covered already
πŸ”œ: covered later
⏭: skipped / omitted

3.2.1. Missing data handling

def: deal with incomplete data points including deletion, mean/med/mode imputation, or advanced methods like KNN imputation and predictive modeling.
na: not available, i.e. missing value

  • Imputation: general operation of filling missing data pointsβ€” anywhere, can be inside or outside range of existing data
  • Interpolation: estimate unknown values that fall within the range of known data points
    β†’ guess what’s between data, assuming continuity
  • Extrapolation: predict values outside the range of observed data
    β†’ forecast into the future or to estimate for unobserved areas, beyond
    β†’ can also be backward
> mean, median, mode imputation
  • Mean imputation: replace na values with the mean of the available data
    • βž• preserve overall distribution
    • βž– reduce variance
  • Median imputation: replace with the median value
    • βž• more robust to outliers, preserve central tendency
  • Mode imputation: the most frequent value
    • βž• ok for nominal data
    • βž– introduce bias for small datasets
> KNN imputation

def: use k-nearest neighbours algo to impute na values based on similar instances
βž• preserve relationships between variables
βž– computationally heavy
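
e.g. with scikit-learn's KNNImputer (tiny toy matrix, illustrative):

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0],
              [2.0, np.nan],   # missing value to impute
              [3.0, 6.0],
              [4.0, 8.0]])

# Each missing value is filled in from the k most similar rows
imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))
```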

3.2.2. Outlier detection

def: identify and handle anomalous data points that deviate significantly from other observations, which can skew analysis results or model performance, represent errors, rare events, etc.

Casual methods to detect outliers:

> z-score method

def: a.k.a standard score, identify outliers as data points that fall beyond a certain number of std from the mean of the dataset
β†’ assuming normal distribution of data
βž– skewed distributions

> IQR method

def: use interquartile range to detect outliers as points below or above a range
βž• robust to non-normal distributions
βž• skewed distributions

  • low bound: Q1 - 1.5 × IQR
  • high bound: Q3 + 1.5 × IQR

with:

  • Q1, Q3 = 25th and 75th percentiles

  • IQR = Q3 - Q1 = measure of statistical dispersion that captures a stable sense of data spread in the middle 50% of a dataset (without the outliers, i.e. not the full range)
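
e.g. flagging outliers with the IQR rule in pandas (toy series, illustrative):

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 11, 14, 95])   # 95 is the suspect point

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

print(s[(s < low) | (s > high)])   # -> 95
```
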
> more basic?

percentile / quantile

3.2.3. Normalisation

def: scaling techniques that transform numerical features to a common scale while preserving relationships

> min-max scaling

def) normalisation: rescale data to a fixed range (usually [0, 1], sometimes [-1, 1]).
β†’ use min-max
βž• features have different scales
βž• work well with distance-based models (k-NN, NN)
βž– sensitive to outliers

> standard scaling

def) standardisation (z-score scaling): transform data to have mean μ = 0 and std σ = 1, no fixed range
β†’ keep shape of data distribution
βž• less sensitive to outliers
βž• work well with model assuming normally distributed data (SVM, PCA), or gaussian data

> robust scaler

def: scale data using stats that are robust to outliers, such as median and IQR β€” no fixed range
βž• robust to outliers
βž• robust to skewed distributions
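
the three scalers side by side in scikit-learn (one artificial outlier to show the difference):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])   # 100 is an outlier

print(MinMaxScaler().fit_transform(X).ravel())    # squashed into [0, 1], outlier dominates
print(StandardScaler().fit_transform(X).ravel())  # mean 0, std 1, still pulled by the outlier
print(RobustScaler().fit_transform(X).ravel())    # centred on median, scaled by IQR
```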

3.2.4. Encoding (cat vars)

def: convert categorical text data into numerical format that ML algorithms can process.
β†’ choosing the right encoding method depends on

  • data type
  • number of categories (cardinality)
  • algorithm
> label encoding

def: assign a unique integer to each category
βž• ordinal data (i.e. ordered)
βž– nominal data (i.e. no intrinsic order)

> ordinal encoding

def: assign integers based on the order or ranking of categories
βž• ordinal data (i.e. ordered)

> one-hot encoding

def: create binary columns for each category, value of 1 or 0 indicates presence or absence
βž• nominal data (i.e. no intrinsic order)
βž– high-dimensionality if category number high

> target encoding

def: replace categories with mean of the target variable for that category
βž• high-cardinality features (i.e. lots of categories)
βž– overfitting

> binary encoding

def: represent categories as binary digits, splitting across multiple columns
βž• high-cardinality features (i.e. lots of categories) β†’ reduce feature space
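
e.g. ordinal vs. one-hot encoding on a toy DataFrame (column names made up):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

df = pd.DataFrame({"size": ["S", "M", "L", "M"],
                   "colour": ["red", "blue", "green", "red"]})

# Ordinal encoding: explicit order for ordered categories
ord_enc = OrdinalEncoder(categories=[["S", "M", "L"]])
print(ord_enc.fit_transform(df[["size"]]).ravel())     # [0, 1, 2, 1]

# One-hot encoding: one binary column per category (nominal data)
print(pd.get_dummies(df["colour"]))

# scikit-learn equivalent, handy inside pipelines
print(OneHotEncoder().fit_transform(df[["colour"]]).toarray())
```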

3.3. EDA

def: Exploratory Data Analysis, a systematic initial investigative process of analysing datasets to understand, summarise and visualise main characteristics, uncover patterns, spot anomalies, and insights.

  1. ⏭ understand the problem and the data (clarify the business/research q, data available, domain-specific constraints)
  2. ⏭ import and inspect data
  3. βœ” data preprocessing steps (data processing part 1)
    cf. 3.2. Data Preprocessing
  4. 🟒 explore data and variable characteristics (stats, distribution)
  5. 🟒 visualise data (plots)
  6. 🟒 examine relationships between variables (correlation)
  7. βœ” detect and handle outliers (data processing part 2)
    cf. 3.2.2. Outlier detection
  8. πŸ”œ data transformation and feature engineering
    cf. 3.4. Feature Engineering
  9. ⏭ communicate finding
    cf. 7.1.4. Data Serving for more hands-on use of the data analysis

🟒: covered here / done
βœ”: covered already
πŸ”œ: covered later
⏭: skipped / omitted

3.3.1. Stats summary

def: calculate descriptive stats (mean, med, mode, std, quartiles)
β†’ to understand data distribution (central tendency and spread),
β†’ detect patterns or anomalies
β†’ identify skewness

Some more definitions (in addition to cf. 1.2.1. Descriptive Stats from 1. Maths and Statistics)

  • Variance: measures the average of the squared differences between each data point and the mean β€” a measure of spread or dispersion
    β†’ how spread out the values are from the mean, in squared unit!
    β†’ less intuitive to interpret
  • Std: square root of the variance
    β†’ same unit as the data β‡’ more interpretable
  • Central tendency: measure that identifies the center of a data distributionβ€” can be mean, med, mode
  • Spread: i.e. dispersion, how much data values vary around the central tendencyβ€” can be range, variance, std
  • Normally distributed: symmetrical, bell-shaped curve where data is evenly distributed around the mean
  • Skewed: asymmetrical, can have a longer tail on the right (positive skew) or on the left (negative skew)
  • Multi-modal: two or more peaks or modes, indicating multiple values or clusters within the data

3.3.2. Data Viz (charts, plots)

def: use visual tools to gain intuitive understanding
β†’ spot trends, patterns
β†’ outliers, anomalies
β†’ dependent variables

cf. 2.1.4. Visualisation libraries in 2. Coding
cf. 3.6. Data Visualisation Tools in the same chapter

> basic charts

def: fundamental graphical rpz used to display categorical and numerical data relationships

Bar chart
def: rectangular bars with length proportional to the values they rpz
βž• comparing categorical data across different groups

  • grouped bar chart
  • stack bar chart

Pie chart
def: circular stat graphic divided into slices = numerical proportion
βž• show parts of a whole relationship

Line chart
def: connect data points with lines
βž• show trends over time

> statistical plots

def: specialised viz designed to rpz data distributions, relationships, statistical properties for analytical purposes

Histogram
def: distribution of numerical data, using bars/bins to show frequency counts

Scatter plot
def: cartesian coordinates, display values for 2 variables
βž• reveal correlation patterns

Box plot
def: summarise data distribution showing median, quartiles, outliers

Density plot
def: smoothed visualisation of distribution, showing probability density

Violin plot
def: combine box plot and density plot to visualise distribution shape

> specialised maps

def: advanced visual rpz for more complex data relationships

Heatmap
def: graph rpz where values are depicted as colors in a matrix format
βž• complex correlation matrices, or density distributions

Confusion matrix
def: table layout of the performance of a classification algo
β†’ shows true vs. predicted classifications

3.3.3. Correlation

def: examine relationships β€”strength and directionβ€” between two variables using correlation coefficients
β†’ to identify potential predictors (which var move together)
β†’ and multi-collinearity (when predictor variables are too highly correlated with each other)

More on multi-collinearity:

  • def: two or more predictors in a model have a very high correlation
  • how to detect: examine correlation matrix (pearson correlation > .8 β‡’ very high)
  • issue with high corr: can cause instability or distortions in regression and other models

How to measure correlation:

  • categorical variables: cross-tabulations and group comparisons
  • continuous variables: correlation coefficients like pearson’s r ([-1, 1]) to quantify the linear association
    • ~ +1: strong positive linear relationship (variables increase together)
    • ~ -1: strong negative linear relationship (one increases when the other decreases)
    • ~ 0: little to no linear relationship
  • note: but there are other correlation coeffs to measure other types of relationships
    • pearson: correlation
    • spearman: rank correlation
    • kendall: tau rank correlation

3.4. Feature Engineering

def: process of creating new features or transforming existing ones to improve ML model performance by better representing the underlying patterns in the data
β‡’ improve model performance (reduce overfitting by avoiding irrelevant or redundant features)
β†’ accuracy (because better inputs)
β†’ interpretability (by focusing on key predictive variables)
β†’ efficiency (by reducing dimensionality and computational load)

This step is actually crucial, often iterative, relying on domain expertise, experimentation and evaluation to refine features for optimal model performance.

e.g. add day, month, year from time-based features (feature extraction)

> feature creation
def: generate new features based on domain knowledge or by combining existing features
β†’ e.g. interaction terms, polynomial features

> feature transformation
def: apply transformations like log or binning to make features more suitable for modeling
β†’ binning: a.k.a. data discretisation or bucketing, transforms continuous numerical data into discrete intervals or β€œbins”

> feature extraction
def: reduce dimensionality using PCA, or extract key characteristics from complex data

> feature selection
def: choose the most relevant features using filter methods (correlation, chi-square), wrapper methods (recursive elimination), or embedded methods (Lasso, tree-based importance)

> feature scaling
cf. 3.2.3. Normalisation in 3.2. Data Preprocessing previously

3.5. Data Splitting

def: strategies for partitioning datasets into training - validation - testing subsets to properly evaluate model performance (without data leakage) and prevent overfitting.

  • Data leakage: information that wouldn’t be available at the time of prediction is mistakenly used during model training (e.g. preprocessing steps done on the entire dataset before splitting, future info such as outcomes leaks into training features, …)
    β†’ cf. Data Preparation & Engineering in 8.1. Data challenges
  • Overfitting: when a ML model learns not only the underlying patterns in the training data but also the noise and random fluctuations
    β†’ amazing perf on training data but very poor on new, unseen data
    β†’ basically fails to generalise
    β†’ cf. Model Development in 8.2. Model challenges

β†’ purpose: ensure models are tested on unseen data to assess generalisation and avoid overfitting

  • Training set: used to learn model parameters
  • Validation set: used to tune hyper-parameters and select models
  • Test set: used for final eval to estimate real-world performance

Some other best practices:

  • Randomise data before splitting (except time series)
  • keep test set completely separate until final evaluation
  • use cross-validation techniques for robustness
  • use stratification to handle class imbalance

3.5.1. Train-Test split

def: split data once into training and test subsets
β†’ typical ratios:
- train: 70-80 %
- test: 20-30 %

if tuning the model parameters requires validation apart from testing, the data can be split into 3 distinct sets.

3.5.2. Cross-Validation

def: resampling procedure used to evaluate ML models on limited data samples by partitioning data into complementary subsets.

cf. 4.2.2. Cross-Validation Eval in 4. Machine Learning for more (kind of redundant, but still relevant)

> K-Fold Cross-Validation

def: split data into K equal folds, using K-1 folds for training and 1 fold for testing, rotating through all folds.
β†’ model trained k times,
β†’ each fold serves once as test data
β†’ performance averaged over folds
βž• suitable for small datasets

> Stratified K-Fold

def: variation that preserves original class proportions in each split
βž• important for imbalanced datasets

> Time Series Split

def: specialised method for temporal data that respects time ordering (split chronologically), using past data for training and future data for testing.
βž• important for time-dependent data (e.g. stock prices)
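
a short scikit-learn sketch of the three splitting strategies (dummy data, illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold, TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)
y = np.array([0] * 10 + [1] * 10)

# Hold-out split, stratified to preserve class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Stratified K-Fold: every fold keeps the original class balance
for train_idx, test_idx in StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, y):
    pass  # fit/evaluate a model per fold, then average the scores

# Time series split: training indices always precede test indices
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X):
    print(train_idx.max(), "<", test_idx.min())
```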

3.6. Data Visualisation Tools

def: tools that create interactive dashboards and reports for business stakeholders
e.g. Tableau, PowerBI, Looker, Metabase

cf. 7.1.4. Data Serving where BI is discussed for more hands-on, interpretation and use of data for business decision-making support

❓ can go further

4. Machine Learning

4.0. Intro

def: ML is a subfield of AI focused on developing algo that enable computers to learn patterns from data, and make predictions or decisions without being explicitly programmed for every task. β†’ no hard-coded rules
β†’ algo improve automatically by experience (data) vs. hard-coded instructions
β†’ ML models learn relationships within data to predict outcomes or classify information

β‡’ blend of computer science, stats (aaaaand domain expertise)
sikes exactly what data science also is 😐
let’s say:

  • ML = CS and Maths
  • DS = CS, Maths/Stats and Domain Expertise

Goals of Machine Learning:

  • enable computers to automatically learn patterns from data
  • automate complex decision-making tasks without explicit programming
  • make accurate predictions or decisions on unseen data
  • continuously adapt and improve performance (and accuracy) as more data becomes available

Some definitions related to CS:

  • code: actual lines of instructions written in a programming language
  • algorithm: step-by-step procedure or set of rules to solve a problem or perform a task (e.g. sorting, searching algo)
  • program: collection of algorithms and instructions written in a programming language to perform specific functions.
  • script: type of program, often shorter and interpreted rather than compiled, usually automating tasks.
  • model: (in ML) mathematical representation trained on data that makes predictions or decisions.
  • function: standalone, independent, reusable piece of code designed to perform a specific task, called by name (function())
  • method: function tied to an object/class in OOP (object-oriented programming), called on the object to operate on its data a.k.a. its attributes (object.method())

4.1. ML Models

def: different algo approaches and architectures for learning patterns from data, categorised by their learning methodology and application domain

4.1.1. Supervised Learning

def: learning (i.e. trained) from labeled data where the desired output is known
β†’ learn a mapping from inputs to outputs
β†’ so it can accurately predict labels for new, unseen data

πŸ’‘ Core Idea: Learn from known input-output pairs

Some examples:

  • image recognition
  • fraud detection
  • stock price prediction

> Linear models

def: models that assume linear relationship between input variables and the target (what we want to predict)

  • Linear relationship: output changes at a constant rate as the input changes, they are connected linearly, the model can draw a straight line/plane/hyperplane through the data points
    • for 1 feature (simple linear regression): y = w·x + b
    • for multiple features (multiple linear regression): y = w_1·x_1 + w_2·x_2 + … + w_n·x_n + b
    • with
      • y = target
      • x = input feature
      • w = slope
      • b = intercept (when x = 0)
      • w_i = feature weight (how important feature i is)
    • e.g. house_price = 200 x house_size
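
e.g. fitting that kind of straight-line relationship with scikit-learn (fabricated house data, illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: price ~ 200 * size + noise (completely made-up units)
rng = np.random.default_rng(0)
size = rng.uniform(30, 150, size=100).reshape(-1, 1)
price = 200 * size.ravel() + rng.normal(0, 500, size=100)

model = LinearRegression().fit(size, price)
print(model.coef_[0], model.intercept_)   # learned slope w and intercept b
print(model.predict([[100.0]]))           # predicted price for a size of 100
```
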
>> i. Linear Regression

def: predict continuous values (e.g. house prices)
β†’ by fitting a linear equation to observed data

Example:

  • predict house prices (continuous)
  • based on features: size and location

→ cf. > Regularisation in ML, sometimes needed when too many features are involved and the model overfits

>> ii. Logistic Regression

def: despite its name, not really a regression but used for classification i.e. discrete categories β‡’ binary or multi-class
β†’ estimate probabilities of discrete outcomes using logistic sigmoid function

Example:

  • predict spam or not (binary classification)
  • based on features: word frequencies, sender characteristics, presence of links, etc.

>>> ii. a. Sigmoid Function
def: math function σ(x) = 1 / (1 + e^(-x)) that maps any real-valued number into a probability ([0, 1]), used for binary classification or as an activation function (NN)

→ S-shaped curve between 0 and 1

> Classification

def: predict discrete categories β‡’ again, binary or multi-class

>> i. K-Nearest Neighbours (KNN)

def: classify data points based on labels of the k closest points in the feature space, relying on distance metrics.

note: different from >> i. K-Means in > Clustering that is an 4.1.2. Unsupervised Learning technique

  • KNN = Supervised classification
>> ii. Bayesian ML

def: use bayes’ thm to update the probability estimate for a hyp as more evidence/data becomes available.

cf. 1.1.3. Probability Theory for some background theory

>>> ii.a. Naive Bayes Classifiers
def: based on bayes’ thm with the naive assumption that features are independent
β†’ calculate the probability of each class given the data, and predicts the class with the highest posterior probability

>> iii. Discriminant Analysis

def: statistical method that models the difference between classes based on feature distributions

>>> Linear Discriminant Analysis (LDA)
def: find linear combination of features that best separates two or more classes
β†’ assume normal distributions of predictors and equal covariance among classes
βž• works well when class distributions are gaussian
βž• helps reduce dimensionality

> SVM

def: Support Vector Machines, supervised ML algo used for classification and regression tasks
β†’ find optimal hyperplane that best separates classes in the feature space
β†’ by maximising the margin i.e. distance between hyperplane and support vectors of each group
βž• effective in high-dim spaces
βž• robust to outliers with soft margin
βž• memory efficient (only sv matter in defining the model)
βž• both binary and multi-class classification

Some definitions:

  • Hyperplane: decision boundary that separates different categories of data β€” a line in 2D space, a plane in 3D or more
  • Support vectors: critical data points lying closest to the decision boundary (and directly influence it)
  • Margin: distance between hyperplane and the nearest data points from each class (sv)
  • Soft margin: allow for some misclassifications (slack variables) to handle noise and improve generalisation
  • Kernel: mathematical function that implicitly transforms data β†’ higher-dim space; this allows SVM to handle non-linear data by finding more complex boundaries in the original space.
    • Kernel trick: calculate the similarity (= dot product) between pairs of data points in the transformed high-dim space without explicitly computing their coordinates.
  • High-dimension: number of features >>> number of samples

πŸ’‘ At the core of SVM:

  • Kernel trick allows SVM to transform data β†’ higher-dim space where classes become linearly separable without explicitly computing coordinates
  • β†’ can use different kernels for this:
    • Linear Kernel: no transformation (ok for data that is linearly separable)
    • Polynomial Kernel: data β†’ polynomial feature spaces β‡’ curved boundaries
    • Radial Basis Function (RBF) / Gaussian Kernel: data β†’ infinite-dim space β‡’ complex boundaries
    • Sigmoid Kernel: similar to neural activation functions

There are also different types of SVM:

  • Linear SVM
  • Non-linear SVM
  • One-class SVM
  • Support Vector Regression (SVR)
  • Multi-Class SVM
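
e.g. an RBF-kernel SVM on non-linearly separable data (scikit-learn, illustrative):

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two interleaving half-moons: not separable by a straight line
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# RBF kernel -> curved boundary; C controls the softness of the margin
clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_train, y_train)
print(clf.score(X_test, y_test))       # accuracy on unseen data
print(len(clf.support_vectors_))       # only these points define the boundary
```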

> Decision Trees

def: supervised ML algo used for both classification and regression tasks
β†’ models decisions and their possible consequences in a flowchart-like tree structure
βž• intuitive, easy to interpret
βž• works with both numerical and categorical data
βž• needs little data preprocessing
βž• can capture non-linear relationships
βž– prone to overfitting if not controlled
βž– unstable to small changes in data
βž– can create biased trees if some classes dominate

Some definitions on the structure of the tree:

  • Root node: represent the entire dataset and is the starting point of the tree
  • Decision nodes: internal nodes, nodes where the data is split (based on attribute tests or feature values)
  • Branches: represent the outcomes of the tests (leading to further nodes or leaves)
  • Leaf nodes: terminal nodes, represent final predictions or class labels

Definition of Core Concepts:

  • Purity: metric describing how homogeneous the data within a node is (how well the split separates classes)
    • β†’ so a node is β€œpure” when all its data points belong to the same class (for classification) or have similar target values (for regression)
    • β†’ split decisions aims to create child nodes that are as pure as possible i.e. splits that reduce impurity the most
  • Pruning: process of removing sections of a tree (branches) that provide little power for prediction, to prevent overfitting, improve generalisation and simplify the model.
    • Pre-pruning: set criteria such as max depth, min sample per leaf, min impurity decrease to stop tree growth before it’s too detailed and overfitting.
    • Post-pruning: grow the full tree then trim back branches to limit complexity, often based on validation data performance.
>> i. metrics for splitting

def: the overall idea is to make the split as "decisive" as possible by decreasing impurity, entropy or variance within the node for each split.

>>> i.a. Gini Index (measure of impurity)

def: measure how mixed or impure a dataset is
β†’ gini = [0, 0.5] = [pure, impure]
β†’ measure likelihood of incorrect classification if randomly classify it according to the class distribution in the dataset

Gini = 0 is the lowest and best possible outcome for each branch i.e. when everything in the node is the same class.

>>> i.b. Information Gain (based on entropy)

def: measure how much entropy decreases after a split
β†’ entropy = overall disorder or unpredictability

>>> i.c. SSE or MSE

def: Sum of Squared Errors or Mean Squared Error can be used for regression, to measure variance within a node, aiming to reduce it after splitting.

>> ii. how does the tree work
  • the tree recursively splits the dataset based on features that maximise the purity
  • the splitting continues until
    • the node is pure i.e. all data points belong to one class
    • max tree depth is reached
    • min number of samples in a node is too low to split further
    • further splits yield no meaningful reduction in impurity or improvement (e.g. impurity gain ~ 0)
    • no remaining features to split
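
e.g. a pre-pruned decision tree in scikit-learn (iris as a stand-in dataset, depth/leaf limits arbitrary):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# Pre-pruning via max_depth / min_samples_leaf to limit overfitting
tree = DecisionTreeClassifier(criterion="gini", max_depth=3,
                              min_samples_leaf=5, random_state=0)
tree.fit(X, y)

print(export_text(tree))           # the learned flowchart of splits
print(tree.feature_importances_)   # which features drive the splits
```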

> Ensemble Learning

def: combine multiple models (= β€œlearners”) to improve performance over individual models

Core concept:

  • leverage collective intelligence
    • combine, average outputs, vote among their predictions
    • β†’ reduce errors, improve accuracy, limit overfitting
  • compensate for each other’s mistakes
    • β†’ more robust overall

Type of learners:

  • Base learners: individual model in an ensemble learning model
  • Weak learners: a base learner that performs slightly better than random guessing (which is bad)
    • examples:
      • weak = shallow decision trees
      • base/strong = fully grown decision trees
>> i. Bagging

def: a.k.a. bootstrap aggregating, build multiple models (usually same type) on different subsets of training data (bootstrap samples) and combine their predictions
β†’ Bootstrap: sampling technique where multiple datasets are created by random sampling with replacement from the original data (i.e. can be duplicates)

>>> i. a. Random forest
def: ensemble learning method using bagging with decision trees
β†’ many uncorrelated trees on bootstrapped samples, using random subsets of features at each split
β†’ aggregates tree predictions for improved accuracy and reduced overfitting

>> ii. Boosting

def: sequentially build models that learn from mistakes of previous models, emphasising harder cases to improve overall accuracy
β†’ final prediction is a weighted combination of all models

>>> ii.a. AdaBoost (Adaptive Boosting)
def: sequentially trains weak learners (often decision trees) where each model focuses on the errors of its predecessor, combining them weightedly for improved accuracy

>>> ii.b. Gradient Boosting Machines (GBM)
def: sequentially build learners by optimising a loss function using gradient descent method

  • ii.b.1. XGBoost (Extreme Gradient Boosting)
    • def: designed for speed and performance, emphasising regularisation and efficient parallel processing
      • parallel processing: optimised and efficient distributed computing
      • regularisation: additional regularisation terms (penalty) in the objective function that control model complexity and prevent overfitting
  • ii.b.2. CatBoost
    • def: specifically designed to handle categorical features, reducing overfitting with ordered boosting
      • ordered boosting: permutation-driven technique that prevents target leakage and overfitting
  • ii.b.3. LightGBM
    • def: optimised for even faster training with a novel leaf-wise tree growth strategy and particularly good for large datasets
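
a minimal boosting sketch using scikit-learn’s GradientBoostingClassifier (XGBoost, CatBoost and LightGBM expose a very similar fit/predict interface); dataset and hyper-parameters are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# each new tree is fit on the gradients of the loss w.r.t. the current predictions
gbm = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
gbm.fit(X_train, y_train)
print("test accuracy:", gbm.score(X_test, y_test))
```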
>> iii. Stacking

def: different models (possibly different types) are trained, and a meta-model is used to combine their predictions

Common example:

  • Base learners of different types are stacked, e.g. combining
    • decision trees
    • NN
    • linear models
  • Trained on the same dataset
  • Then combine predictions using a logistic regression or gradient boosting model as the meta-learner
    • trained on these outputs
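
a minimal stacking sketch with scikit-learn’s StackingClassifier, assuming a random forest and an SVM as base learners and a logistic regression as the meta-learner (all choices illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# base learners of different types; a logistic regression meta-learner combines their predictions
stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=100)),
                ("svm", SVC(probability=True))],
    final_estimator=LogisticRegression(),
)
stack.fit(X_train, y_train)
print("test accuracy:", stack.score(X_test, y_test))
```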

> Neural Networks

def: stack of connected layers that progressively extract more meaningful patterns from data
β†’ inspired by the structure and functioning of the human brain: neurons (or nodes)

cf. 5. Deep Learning, AI when the layers become deep, i.e. numerous

>> The basic unit: The Neuron (or Node)

  • takes multiple inputs
  • weights them by importance
  • adds them up
  • applies an activation function (e.g. sigmoid, tanh, ReLU, …) to decide β€œhow much to fire/activate” (i.e. how important)
  • sends output to neurons in the next layer

β†’ activation determines how much and in what way a neuron contributes to the final decision
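
a minimal numpy sketch of a single neuron’s forward pass β€” the input values, weights and bias are made up for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# one neuron: weighted sum of inputs + bias, passed through an activation
inputs  = np.array([0.5, -1.2, 3.0])   # illustrative feature values
weights = np.array([0.8,  0.1, -0.4])  # "importance" of each input
bias    = 0.2

z = np.dot(weights, inputs) + bias     # weighted sum
activation = sigmoid(z)                # "how much to fire" (between 0 and 1)
print(activation)
```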

>> The structure: Layers

  • input layer: where data enters (one neuron per feature)
  • hidden layers: where the magic happens, these layers find the patterns
  • output layer: produce the final prediction (e.g. probability of each class)

cf. > Network Layers in 5. Deep Learning, AI

>> The Learning Process

>> Hierarchical Learning

  • first hidden layer learns simple patterns (e.g. edges in images, basic word combination in text)
  • second hidden layer combines those to learn more complex patterns (e.g. shapes, phrases)
  • third hidden layer combines those to learn even more complex patterns (e.g. object, sentences)

β‡’ similar to an assembly line where
basic components β†’ assembled parts β†’ assembled complex units β†’ … β†’ final product

>> Importance of Activation Functions
cf. > Activation functions in 5. Deep Learning, AI

but basically, without them, NNβ€”no matter how many layersβ€” would just be fancy linear regression.
β†’ the non-linearity introduced by activation functions is what allows NN to learn complex, CURVED patterns instead of straight lines

in fact, each layer would just be doing output = (weight x input) + bias
which is a linear transformation,
stacking them would just be a big linear transformation.

but activation functions like ReLU, sigmoid, tanh are non-linear,
β†’ so they β€œbend” the data at each layer
β‡’ results in a model that is more flexible and capable of learning curved decision boundaries vs. straight lines
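
a small numpy sketch of that point, under illustrative random weights: two stacked linear layers collapse into one equivalent linear layer, while inserting a ReLU in between breaks that equivalence:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))            # 5 samples, 3 features
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=4)
W2, b2 = rng.normal(size=(4, 2)), rng.normal(size=2)

# without activation: two layers collapse into ONE equivalent linear layer
linear_stack = (x @ W1 + b1) @ W2 + b2
W_eq, b_eq = W1 @ W2, b1 @ W2 + b2
assert np.allclose(linear_stack, x @ W_eq + b_eq)

# with ReLU in between: no single linear layer can reproduce this mapping
relu = lambda z: np.maximum(z, 0)
nonlinear_stack = relu(x @ W1 + b1) @ W2 + b2
```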

4.1.2. Unsupervised Learning

def: learning by finding patterns in unlabelled data β€” without pre-existing labels
β†’ the model finds underlying patterns, relationships, structure without predefined outputs

πŸ’‘ Core Idea: Discover hidden patterns or groupings

Some examples of application:

  • customer segmentation
  • anomaly detection
  • recommendation systems
  • EDA in various domains

note: another type of unsupervised learning that’s not discussed here, association rule methods = find rules that describe relationships between variables in large datasets (e.g. the Apriori algorithm)

> Clustering

def: group similar data points into clusters based on similarity

>> i. K-Means

def: partitioning method that consists in assigning data points to a fixed number K of exclusive clusters based on proximity (feature similarity) to cluster centroids

How?

  • choose K initial centroids (often randomly selected points in the data space)
  • assign each datapoint to the nearest centroid based on a distance metric (commonly euclidean)
  • recalculate centroids as the mean of all points assigned to each cluster
  • repeat iteratively until convergence β†’ centroids stabilise
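
a minimal K-Means sketch with scikit-learn, on an illustrative synthetic dataset; inertia_ is the WCSS quantity used by the Elbow Method below:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)       # assign each point to its nearest centroid
print(kmeans.cluster_centers_)       # final centroids after convergence
print(kmeans.inertia_)               # WCSS: the quantity plotted by the elbow method
```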

note: different from >> i. K-Nearest Neighbours (KNN)

  • KNN (supervised classification): assigns labels based on nearest neighbours
  • vs. K-means (unsupervised clustering): grouping data points into clusters

>>> Elbow Method
def: heuristic used to determine the optimal number of clusters K by finding the β€œelbow” point in the within-cluster sum of squares (WCSS) plot

  • WCSS: measures how tight the clusters are

>>> Silhouette Score
def: a performance metric that measures how similar an object is to its own cluster compared to other clusters ([-1, 1])
β†’ used as an unsupervised clustering validation step, β€œhow meaningful are these clusters?”

cf. 4.2.1. Performance Metrics for more on the topic of metrics

>>> Fuzzy K-Means
def: allow datapoints to belong to multiple clusters with varying degrees of membership, expressed as probabilities rather than hard labels.
β†’ i.e. soft cluster membership

>> ii. Hierarchical clustering

def: build a tree-like structure (dendrogram) by iteratively merging or dividing clusters, either agglomerative (bottom-up) or divisive (top-down)

  • agglomerative: start with individual points and merge the closest clusters iteratively
  • divisive: start with all points in one cluster and split recursively
>> iii. DBSCAN (Density-Based Spatial Clustering)

def: group points based on data density β†’ identify clusters as dense regions separated by sparser areas
βž• good for detecting clusters of arbitrary shape and spotting outliers, noise

>> iv. GMM (Gaussian Mixture Models)

def: use probabilistic models assuming data = mixture of several Gaussian distributions
β†’ assign soft cluster memberships vs. hard assignments
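
a minimal sketch of the non-K-Means options using scikit-learn (agglomerative, DBSCAN, GMM) on an illustrative two-moons dataset; the parameters are arbitrary:

```python
from sklearn.cluster import DBSCAN, AgglomerativeClustering
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)   # two curved, non-spherical clusters

db_labels  = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)         # density-based; label -1 marks noise/outliers
agg_labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)  # bottom-up hierarchical merging
gmm        = GaussianMixture(n_components=2, random_state=0).fit(X)
soft_probs = gmm.predict_proba(X)                                  # soft cluster memberships (probabilities)
```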

> Dimensionality reduction

def: simplify data by reducing number of features (dimensions) while preserving important information

>> i. Principal Component Analysis (PCA)

def: reduce the dimensionality of data by finding the principal components (PC) that capture the most variance
β†’ linear transformation technique that converts possibly correlated variables into linearly uncorrelated PC

  • principal components: new axes formed as linear combinations of the original variables that capture the most variance i.e. spread/diversity in the data
    • first PC: captures the max variance possible along a single axis
    • each subsequent PC: captures the max remaining variance while being orthogonal (i.e. uncorrelated) to the previous ones.
  • β€œcapturing the most variance”: these components rpz directions in the data where the points spread out the most
    • β†’ thus carrying the most info about the differences in the data
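
a minimal PCA sketch with scikit-learn on the iris dataset (illustrative); explained_variance_ratio_ shows how much variance each PC captures:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris().data)   # PCA is sensitive to feature scale

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)                 # project onto the first 2 principal components
print(pca.explained_variance_ratio_)        # share of total variance captured by each PC
```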
>> ii. UMAP

def: Uniform Manifold Approximation and Projection, a non-linear dimensionality reduction technique for visualising high-dimensional data
β†’ transforms high-d data into embeddings
β†’ preserve both local (i.e. similarities among nearest neighbours) and global (i.e. distances/relationships between clusters) structure in the data
β†’ faster β‡’ scale well to large datasets
β†’ clustering in low-dimensional embeddings

note: the newer challenger to the OG pioneer t-SNE, and generally considered an upgrade.

>> iii. t-SNE

def: t-Distributed Stochastic Neighbour Embedding, also a non-linear dimensionality reduction technique for visualising high-dimensional data
β†’ transforms data into 2D or 3D embeddings
β†’ preserve local structure BUT can distort global relationships
β†’ visually distinct clusters
β†’ slower on large datasets
β†’ β€œstochastic” because the same data can give visually different results on different runs

note: the OG method, now often outperformed by the newer UMAP.

>> iv. Autoencoders (NN)

def: a type of NN trained to reconstruct their input, they consist of:

  • an encoder: compress input data into a lower-dim latent rpz
  • a decoder: reconstruct original input from this compressed encoding
    β†’ by learning to minimise reconstruction error, they effectively learn compact and meaningful rpz of data
    β‡’ useful for dimensionality reduction and noise reduction

4.1.3. Semi-Supervised Learning

def: hybrid of (small) labeled and (large) unlabelled data for training
β†’ guide learning process with the labeled data
β†’ still extract useful structure from the unlabelled data
β†’ improve the model performance overall
βž• real-world problems where labeled data isn’t easily accessible or available

πŸ’‘ Core Idea: Learn from a little labeled + lots of unlabeled data

> Self-Training

def: model is trained on small labeled data, then predicts (pseudo-)labels for the unlabeled data, which are then used (add to training set) to retrain the model iteratively.

> Co-Training

def: two or more models teach each other by labeling data for each other based on their predictions

> Label Propagation

def: uses graph-theory β†’ creates a similarity-graph where

  • labeled nodes have fixed labels,
  • unlabeled nodes propagate labels from their neighbours
  • works great when you can define similarity between points
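
a minimal sketch of label propagation with scikit-learn, where roughly 90% of the labels are hidden (marked -1, the library’s convention for unlabeled points); the dataset is synthetic and illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.semi_supervised import LabelPropagation

X, y = make_classification(n_samples=300, random_state=0)
y_partial = y.copy()
rng = np.random.default_rng(0)
unlabeled = rng.random(len(y)) < 0.9        # hide ~90% of the labels
y_partial[unlabeled] = -1                   # -1 = unlabeled point

lp = LabelPropagation().fit(X, y_partial)   # labels spread through the similarity graph
print((lp.transduction_[unlabeled] == y[unlabeled]).mean())  # accuracy on points that had no label
```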

> Consistency Regularisation

def: deep-learning based technique, force model predictions to be stable under perturbation (transformation, noise)
β†’ β€œdata augmentation”

the core idea: model should produce similar outputs for

  • nearby points in the data space
  • same input under different transformation (augmentation, dropout, rotation, etc.) or noise

4.1.4. Reinforcement Learning

def: type of ML where an autonomous agent learns to make decisions by interacting with an environment, getting feedback through rewards and penalties
β†’ the goal is to maximise cumulative rewards (i.e. minimise penalties) over time
β†’ no labeled input-output pairs

πŸ’‘ Core idea: Trial and Error process

The step-by-step process:

  • agent observes current state of the environment
  • takes an action
  • receive feedback, in the form of reward or penalty
  • transitions to a new state

note: Deep RL combines RL with NN β‡’ can solve high-D and complex tasks

> Markov Decision Process (MDP)

def: most RL problems are modeled as an MDP, defined by these key components:

  • Agent: the decision-maker or learner
  • Environment: the system or world with which the agent interacts
  • State (s): a representation of the current situation of the environment the agent is in
  • Action (a): the choices available to/taken by the agent to transition between states
  • Reward (r): the feedback signal from the environment based on the action taken β€” indicates how good the action was in that state (can be a reward or a penalty)
  • Policy (Ο€): the strategy the agent follows to decide its next action
    • mapping from states to actions / β€œwhat actions lead to rewards”
    • basically the agent’s brain
  • Value function (V(s) or Q(s,a)): the expected cumulative future reward.
    • not about immediate gratification but long-term success (damn it’s better wired than a hooman in theory)
  • the GOAL of the agent: learn aΒ policy (Ο€)Β that maximizes theΒ cumulative future reward.

> Categories of RL algorithms

>> i. Model-based vs. Model-free

>>> i.a. Model-based
def: the agent learns a model of the environment’s dynamics (i.e. the probability of transitioning to a new state and the rewards for doing so)
β†’ the agent can plan by simulating future states within its internal model
e.g. Dyna-Q, MuZero

  • What they learn: A model of the environment dynamics
  • How they work: Learn transition probabilities P(s’|s,a) and reward function R(s,a)
  • Planning: Can simulate future states before taking actions

>>> i.b. Model-free
def: most common approach, the agent doesn’t learn a model of how the environment works
β†’ it learns directly which actions are good or bad through trial and error
e.g. > Q-Learning, Policy Gradients

β†’ can be value-based or policy-based… or both

>> ii. Value-based vs. Policy-based vs. Actor-critic

>>> ii.a. Value-based
def: agent learns a Value Function Q(s,a), which estimates the quality of an action in a state
β†’ the policy is implicit: always choose the action with the highest value
βž• excellent for discrete action spaces (e.g. left/right/jump)
e.g. > Q-Learning, Deep Q-Networks

  • What they learn: A value function (V(s) or Q(s,a))
  • How they work: Learn which states or state-action pairs are most valuable
  • Policy: Implicit - choose the action with highest value
  • What gets updated during training: Q-Table (or value estimates) with Bellman Equation
    β†’ adjust expectation for a state-action pair based on what actually happened

>>> ii.b. Policy-based
def: agent directly learns the optimal Policy Ο€ without needing a value function.
β†’ outputs a probability distribution over actions
βž• excellent for continuous action spaces (e.g. steering a car) or stochastic policies

  • What they learn: The policy directly (Ο€(a|s)) β€” i.e. what to do directly, the actions
  • How they work: Learn the probability distribution over actions for each state
  • Value function: Not learned explicitly
  • What gets updated during training: The probability distribution over actions
    β†’ increase or decrease policy probability of an action (~ instincts) depending on outcome

>>> ii.c. Actor-critic
def: hybrid approach that combines the best of both Value-based and Policy-based
β†’ the Critic: measures how good the action taken was (value-based)
β†’ the Actor: updates the policy based on the Critic’s feedback (policy-based)
βž• actually the foundation for most modern, state-of-the-art RL algorithms

  • What they learn: Both policy (actor) AND value function (critic)
  • How they work:
    • Actor suggests actions (like policy-based)
    • Critic evaluates those actions (like value-based)
  • What gets updated during training: Bothβ€” Actor improves its actions while Critic improves its predictions in a feedback loop (~ understanding of the world)

> Exploitation vs. Exploration

def: RL involves balancing between

  • Exploitation: choosing known actions that yield high rewards (i.e. make best decision given current knowledge)
  • Exploration: trying new actions to discover better rewards (i.e. gather more info by trying new things)

It is a trade-off because:

  • too much exploitation: might never find optimal strategy
  • too much exploration: will never reap the rewards of what is learnt
  • β‡’ need a balance

> Q-Learning

def: model-free value-based RL algo
β†’ no model of the env

Goal of Q-Learning:

  • agent learn the best actions in various states
  • β‡’ maximum cumulative rewards

How?

  • algo builds a Q-table
    • in which each entry has a Q-value representing
      • expected future rewards for a specific action in a given state
  • the agent interacts with the env
  • the Q-values get updated using a learning rule
    • based on receiving rewards and new states
  • over time, the agent discovers the optimal policy
    • = strategy of choosing actions that yields the highest long-term reward

Step-by-step:

  • start in a state,
  • select an action,
  • observe rewards and next state
  • update Q-val for the state-action (Q(s,a)) pair using the Bellman equation (learning process)
    • adjust Q-val based on observed rewards and highest Q-val for next possible actions
  • repeat and refine Q-table through exploration and exploitation
    • until agent learns which actions are best in each state

Bellman equation (Q-learning update):

Q(s,a) ← Q(s,a) + Ξ± Β· [ r + Ξ³ Β· max Q(s',a') βˆ’ Q(s,a) ]

where:

  • Ξ± = learning rate
  • Ξ³ = discount factor (how much we care about future rewards)
  • r = immediate reward
  • max Q(s',a') = estimate of the best future rewards from the next state
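
a minimal numpy sketch of one Q-table update using the rule above; the environment, states and reward are placeholders:

```python
import numpy as np

n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))          # the Q-table
alpha, gamma = 0.1, 0.9                      # learning rate, discount factor

def q_update(s, a, r, s_next):
    """One Bellman update: nudge Q(s,a) towards r + gamma * max_a' Q(s', a')."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])

q_update(s=0, a=1, r=1.0, s_next=2)          # illustrative transition
print(Q[0])
```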

βž• Advantages

  • Trial and error
  • Self-improvement and autonomous learning
  • Simple and efficient

βž– Disadvantages

  • Slow learning
  • Expensive in some environments
  • Curse of dimensionality
  • Limited to Discrete actions

4.1.5. Deep Learning

def: using multilayer Neural Networks (NN) for complex data like images, speech or text

πŸ’‘ Core Idea: Learn hierarchical representations automatically

  • hierarchical representations: learning patterns in layers, a bit like feature recognition assembly line

cf. > Neural Networks in 4.1.1. Supervised Learning
cf. next chapter 5. Deep Learning, AI for more

4.2. Model Evaluation

def: process of assessing how well a trained model will perform on unseen data for a given task

πŸ’‘ Does a trained model generalise well to new, unseen data?

4.2.1. Performance Metrics

def: the performance metrics depend on the problem type (classification, regression, clustering)
β‡’ some metrics can be more important than others depending on the application

> Metrics for Classification

  • True Positive (TP): correctly predicted positive (actually a P (1))
  • True Negative (TN): correctly predicted negative (actually an N (0))
  • False Positive (FP): wrongly predicted positive (actually an N (0))
    • Type I error: detect an effect that is not present
  • False Negative (FN): wrongly predicted negative (actually a P (1))
    • Type II error: fail to detect an effect that is present
>> Accuracy

def: proportion of correct predictions (TP and TN) out of all predictions
βž– not good for imbalanced datasets

>> Precision

def: ratio of TP to All Positive predictions (TP + FP)
β‡’ measure quality of positive predictions
β†’ precision is best when FP is low, i.e. not many False Alarms (Type I)

β€œfrom what you got, how much of it is actually right… doesn’t tell if you got them ALL though”

>> Recall (Sensitivity)

def: ratio of TP detected among all Actual Positives (TP + FN)
β‡’ measure model’s ability to find ALL positives
β†’ recall is best when FN is low, i.e. not many Miss Out (Type II)

β€œdid you get them all from the pool? doesn’t tell how β€œprecise” you were with the ones you got though…”

>> Rates
  • TPR = True Positive Rate = TP / (TP + FN) = Sensitivity = Recall = Hit-Rate

  • TNR = True Negative Rate = TN / (TN + FP) = Specificity / Selectivity

  • FPR = False Positive Rate = FP / (FP + TN) = Probability of False Alarm / Fall-Out

  • FNR = False Negative Rate = FN / (FN + TP) = Miss-Rate (= 1 βˆ’ TPR)

>> Confusion matrix

def: shows TP/TN/FN/FP
β†’ a strong diagonal (high TP and TN counts) = goooood, most predictions are correct

>> F1-Score

def: harmonic mean of precision and recall β€” balances the two in a single score

>> ROC/AUC

def:

  • ROC (receiver operating characteristic curve): visualise trade-offs between y=TPR (recall, sensitivity) and x=FPR at various thresholds of classification (usually .5, but can be tweaked depending on goals)
  • AUC (area under the curve): measure overall separability / discriminatory power of model (i.e. ability to distinguish between classes)

β‡’ ideally the ROC curve hugs the top-left corner (TPR β†’ 1 at low FPR)
β‡’ ideally AUC β†’ 1

>>> Precision-Recall AUC
def: for imbalanced classification problems, more informative than classic ROC-AUC

  • balanced classification: plot TPR (recall) vs. FPR β†’ classic ROC-AUC
  • imbalanced classification: plot Precision vs. TPR (recall) β†’ PR-AUC
    • Precision = TP / All positive predictions = TP / (TP + FP)
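
a minimal sketch computing the classification metrics above with scikit-learn; the labels and scores are made-up toy values:

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix)

y_true  = [0, 0, 1, 1, 1, 0, 1, 0]                   # actual labels
y_pred  = [0, 1, 1, 1, 0, 0, 1, 0]                   # thresholded predictions
y_score = [0.2, 0.6, 0.9, 0.8, 0.4, 0.1, 0.7, 0.3]   # predicted probabilities

print(confusion_matrix(y_true, y_pred))   # [[TN, FP], [FN, TP]]
print(accuracy_score(y_true, y_pred))
print(precision_score(y_true, y_pred))    # TP / (TP + FP)
print(recall_score(y_true, y_pred))       # TP / (TP + FN)
print(f1_score(y_true, y_pred))           # harmonic mean of precision and recall
print(roc_auc_score(y_true, y_score))     # uses the scores, not the thresholded labels
```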

> Metrics for Regression

>> Mean Absolute Error (MAE)

def: average of absolute errors (with error = predicted - actual)

>> Mean Squared Error (MSE)

def: average of squared errors
β†’ punishes large errors more heavily

>> Root Mean Squared Error (RMSE)

def: square root of MSE
β†’ interpretable in the original data units

>> RΒ² Score

def: a.k.a. coefficient of determination, evaluates the goodness of fit of a regression model β€” the predictive power, usually in [0, 1] (can be negative for models worse than predicting the mean)
β†’ provide the proportion of variance in the dependent variable that is explained by the independent variables in the model

  • RΒ² = 0: the model doesn’t explain any of the variance (equivalent to predicting the mean) β†’ bad
  • RΒ² = 1: the model perfectly explains all the variance in the target variable β†’ best
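
a minimal sketch of the regression metrics above with scikit-learn, on made-up toy values:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.8, 5.4, 2.0, 8.0])

mae  = mean_absolute_error(y_true, y_pred)
mse  = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)                       # back in the original data units
r2   = r2_score(y_true, y_pred)           # proportion of variance explained
print(mae, mse, rmse, r2)
```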

> Metrics for Clustering

>> Silhouette Score

def: how similar points are to their own cluster vs. other clusters
β†’ used by cf. >> i. K-Means

>> Calinski-Harabasz Index

def: ratio of between-cluster to within-cluster dispersion

4.2.2. Cross-Validation Eval

def: resampling procedure used to select and evaluate ML models.
β†’ can be used for:

  • model selection: choosing between different models or hyperparameters
  • model evaluation: getting a reliable estimate of model performance

cf. 3.5.2. Cross-Validation in 3. Data101, EDA (kind of an intro, redundant but it’s okay)

naive approach:
β†’ use a single train/test split

  • performance estimate depends heavily on which random split you get
  • might be lucky or not with the test set
  • waste of data potential by not using it all for training or evaluation

β‡’ cross-validation solves this by using data more efficiently and providing a more robust performance estimate
βž• less bias, more reliable
βž• useful when data is limited
βž• robust performance, less overfitting

> i. k-Fold CV

def: most common method,

  • dataset is split into k equal parts (= folds),
  • model is trained on k βˆ’ 1 folds and tested on the remaining fold
  • repeat the process k times, each time rotating i.e. using a different fold as the test set
  • final evaluation metric = average performance across all k trials

> ii. Stratified k-Fold CV

def: variation of k-fold CV, but this preserves the same class distributions/proportions in each fold as in the full dataset
β‡’ super useful for imbalanced classification problems to ensure each fold is representative
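
a minimal sketch of stratified 5-fold cross-validation with scikit-learn on an illustrative imbalanced synthetic dataset; the estimator and scoring choice are arbitrary:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=300, weights=[0.8, 0.2], random_state=0)  # imbalanced classes

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)   # preserves class proportions per fold
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="f1")
print(scores.mean(), scores.std())        # average performance across the 5 folds
```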

> iii. Leave-One-Out CV (LOOCV)

def: special case of k-fold CV where k = number of datapoints

  • each fold = a single datapoint used as the test set
  • model is trained on all remaining points

βž– computationally costly

note: Leave-P-Out CV is the generalisation of this, where p datapoints are left out as the test set each time, iterating over all possible combinations of p points
β‡’ heavily computationally demanding

> iv. Time Series CV

def: designed for time-dependent/temporal data, where training sets respect temporal order (no future data leaks into training)
β†’ typically grows the training window forward and tests on subsequent periods to mimic forecasting scenarios
β‡’ basically past data for training, future data for testing

4.2.3. Fundamental Eval Concepts

> Bias-Variance Trade-Off
  • bias: error from overly simplistic assumptions β†’ underfitting because the model can’t capture the underlying trend
  • variance: error from excessive sensitivity to noise and small fluctuations in the training data β†’ overfitting because the model memorises the training data instead of learning the generalisable pattern
  • the trade-off: increase a model’s complexity typically reduces bias but increases variance, and vice versa β€” the goal is to find the sweet spot of model complexity to minimise both error (i.e. total error)
| | High Bias | High Variance |
|---|---|---|
| What it means | The model is too simple and misses patterns in the data. | The model is too complex and learns the noise in the data. |
| Problem | Underfitting | Overfitting |
| Performance | Bad on training data AND bad on test data. | Excellent on training data, but bad on test data. |
> Overfitting

model performs well on training data but poorly on test data
β†’ the model memorises the noise and fits too closely to the training data (weak generalisation); too complex a model

  • Detection: Large gap between training and validation performance
  • Solutions: Regularisation, simpler models, more data, dropout, early stopping
> Underfitting

model performs poorly on both training and test data
β†’ model too simple to capture underlying patterns in data, bad in training and testing

  • Detection: Poor performance everywhere
  • Solutions: More complex models, better features, longer training
> Regularisation in ML

def: add a penalty during model training to manage complexity and prevent overfitting (i.e. relying too heavily on particular features or patterns in the training data)

methods to mitigate overfitting, especially when there is a high number of features and a simple model such as Regression (cf. >> i. Linear Regression) is used:

  • L1 Regularisation (Lasso)
    • Lasso: Least Absolute Shrinkage and Selection Operator
  • L2 Regularisation (Ridge)

βš β— too high regularisation can lead to underfitting

>> i. Lasso Regression (L1 Regularisation)
def: adds a penalty to the loss function equal to the sum of the absolute values of the coefficients
β†’ encourages sparsity (many coefficients = 0) by shrinking some coefficients exactly to 0
β‡’ perform feature selection by effectively removing less important features
βž• useful when only a subset of predictors are truly relevant, huge number of features
βž• more interpretability and simplicity (fewer features)
βž– struggles with multi-collinear features (tends to arbitrarily keep one and drop the others)

β†’ Loss = MSE + Ξ» Β· Ξ£|coefficients|
with regularisation parameter Ξ» controlling penalty strength (trade-off between bias and variance)
and MSE = Mean Squared Error

>> ii. Ridge Regression (L2 Regularisation)
def: adds a penalty to the loss function equal to the sum of the squared coefficients (Loss = MSE + Ξ» Β· Σ coefficientsΒ²)
β†’ shrink coeff towards 0, but not = 0 β‡’ keeps all features but reduces their influence if less important
β‡’ can’t perform feature selection
βž• handles multi-collinearity
βž• more model stability and accuracy
βž– less interpretability
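
a minimal sketch contrasting Lasso (L1) and Ridge (L2) with scikit-learn on an illustrative synthetic regression problem where only a few features are truly informative; alpha plays the role of Ξ» above:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=20, n_informative=5, noise=10, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)   # alpha = regularisation strength (the lambda above)
ridge = Ridge(alpha=1.0).fit(X, y)

print("lasso zero coefficients:", np.sum(lasso.coef_ == 0))   # L1 typically drives many coefficients exactly to 0
print("ridge zero coefficients:", np.sum(ridge.coef_ == 0))   # L2 shrinks coefficients but rarely zeroes them
```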

note: cf. > Regularisation techniques in DL/AI in 5.2. Learning & Optimisation for more complex Deep Learning methods

4.3. ML Applications

ML applications can be broadly split into 2 categories:

  • Perception & Understanding: β€œwhat’s happening?”
    • classifying images, detecting spam, understanding speech, etc.
  • Action & Decision-Making: β€œwhat should we do?”
    • recommendation sys, self-driving cars, optimising supply chain, etc.

But also some key Overarching Themes:

  • Automation (replace repetitive and manual tasks)
  • Personalisation (tailoring exp, content, products to users)
  • Optimisation (making systems more efficient)
  • Augmentation (assisting human experts)

Across all the fields:

  • Healthcare
  • Finance
  • Transportation
  • Retail and e-commerce
  • Cybersecurity
  • Manufacturing and logistics
  • Customer service
  • Robotics
  • …

or by applications:

> Computer Vision (CV)

def: teaching machines to β€œsee and understand” visual data
object: image, video, faces …
types:

  • classification, categorisation, labeling
  • detection, localisation
  • segmentation, understanding, grouping/clustering
  • recognition, identification, verification
  • generation, synthesis

examples:

  • image and video classification β†’ categorise / label / identify
  • image segmentation β†’ pixel-level understanding / grouping
  • facial recognition β†’ identify and verify
  • object detection β†’ locate and classify

> Natural Language Processing (NLP)

def: teaching machines to β€œunderstand, read and write” human language
object: text, document, sentiment, language, speech, …
types:

  • translation
  • analysis, understanding
  • extraction
  • recognition, conversion
  • summarisation
  • synthesis, generation
  • identification, classification
  • chatbot, virtual assistants, conversational AI

examples:

  • machine translation β†’ google translate
  • sentiment analysis β†’ analysis and understanding of tone
  • text summarisation
  • named entity recognition β†’ identify, extract, classify
  • text generation
  • speech recognition β†’ conversion speech to text

> Speech & Audio Processing

def: teaching machines to β€œhear, interpret and generate” sounds and speech
object: voice, command, speaker
type:

  • assistant
  • recognition
  • identification and verification
  • detection and classification

> Predictive Analytics & Forecasting

def: use historical (past) data to predict future outcomes and trends

examples:

  • demand forecasting
  • predictive maintenance
  • financial forecasting β†’ predict stock price, market trends, credit risk
  • healthcare prognosis β†’ predict outcomes or disease progression

> Recommendation Systems

def: algorithms that suggest relevant items to users based on their preferences and behaviours.
object: content, product, user, feed, …

  • personalised recommendations
  • collaborative filtering
  • profiling

examples:

  • content recs β†’ netflix, youtube, news feed
  • product recs β†’ amazon

> Robotics & Control

def: programming physical systems to perceive their environment and take intelligent actions autonomously

  • Human-Robot interaction and collaboration
  • Autonomous navigation and path planning
  • Manipulation and grasping using RL

> Anomaly & Fraud detection

def: identify rare, unusual patterns or events that deviate significantly from the norm

  • fraud detection in finance and cybersecurity
  • fault detection in manufacturing and infrastructure

> GenAI

def: AI generation of β€œnew” content that is similar to (but not entire copy) of its training data

  • art generation β†’ images, music, videos
  • code generation β†’ Github copilot, …
  • synthetic data generation β†’ to train ML

5. Deep Learning, AI

def: Deep Learning is a subset of AI/ML that uses NNs with many layers (β€œdeep” architectures) to automatically learn hierarchical representations of data β‡’ for more advanced complex patterns
β†’ multilayered artificial neural networks inspired by human brain to analyse and learn from large and complex datasets (like image, text, sound)
βž• complex patterns, large datasets (actually needed for good performance)
βž– small datasets, black box (poor interpretability)

cf. > Neural Networks in 4.1.1. Supervised Learning for an introduction to NNs!

  • Hierarchical representations: learning patterns in layers, a bit like feature recognition assembly line, one layer = 1 feature / step

  • AI: umbrella concept of machines doing intelligent tasks

    • broad field of CS focused on creating systems capable of performing tasks that typically require human intelligence (reasoning, learning, problem-solving, understanding language, perception)
  • NN: foundation of DL, neural networks are AI/ML models inspired by the structure and functioning of the human brain, neurons, they process data through layers and learn patterns to solve complex tasks

    • interconnected layers of units called artificial neurons or nodes (or perceptrons) β†’ more about Network Layers
    • each node receives inputs and processes them using a mathematical function called an activation function, then passes the output to neurons in the next layer
    • NN learn by adjusting strengths/weights of connections between neurons during training
      • β‡’ enable them to recognise patterns and make predictions from data
  • Feed-forward vs. Back-propagation

    • cf > Training Cycle Forward vs. Backward propagation
    • note: mostly all NN use both for training phase:
      • forward passes to compute predictions (needed for error calculation too)
      • followed by backward passes to learn optimal parameters (use gradients)

5.1. Core architectures

β†’ NN architectures used in DL but each optimised for different types of data and tasks

  • FFNs
  • RNNs (feedback loop (i.e. memory capacity) β†’ sequential data β†’ text/time series)
  • CNNs (convolutional layers + kernels β†’ spatial patterns β†’ image)
  • Transformers (self-attention β†’ long-distance relationships/context β†’ language)

5.1.0. Feed-forward Neural Networks (FNNs)

def: simplest form of NN where information flows in one direction, from input to output.
β†’ no cycle or loop in the network architecture

note: the layers are called β€œFeed-forward” layers, but they still use backpropagation for the learning phase in order to update the weights!

> Multi-Layer Perceptron (MLP)

def: type of FNN consisting of fully connected neurons with a nonlinear kind of activation function
β†’ each neuron in one layer connects to every neuron in the next
β†’ used in various fields: image recognition, NLP, speech recognition
βž– but cannot exploit spatial or sequential structure of data

5.1.1. Convolutional Neural Networks (CNNs)

def: specialised NN for processing grid-like data such as images, using convolutional layers to learn spatial patterns followed by pooling layers that downsample spatial dimensions

  • purpose: image and video data
  • key idea: uses filters to detect spatial patterns (edges, shapes, objects, textures)
  • applications: image classification, object detection, computer vision tasks, medical imaging

> Convolution layers
def: layers that apply convolution operations to extract spatial features through learnable filters or kernels

> Pooling layers
def: layers that reduce spatial dimensions while retaining important features through operations like max pooling or average pooling

5.1.2. Recurrent Neural Networks (RNNs)

def: NN specialised in sequential data processing, they have loops in their architecture that allow information to persist across sequence steps, which enables the network to maintain context / internal memory
β†’ recurrency to retain temporal context

  • purpose: sequential data (text, time series, speech)
  • key idea: has memory to process sequences step-by-step
  • variants: LSTM (Long Short Term Memory), GRU (handle long-term dependencies better by controlling the information flow)
  • applications: text generation, time series forecasting, speech recognition

> Vanishing gradient problem

def: challenge in deep NN where the gradients (which guide learning by adjusting weights) become very small as they are back propagated through layers.
β†’ when the gradients shrink too much, especially in early layers, those layers learn very slowly or stop learning altogether

Why this happens?

  • gradients get multiplied repeatedly by values < 1 (e.g. derivatives of activation functions) causing them to shrink exponentially as they move backward through many layers

Some techniques to address this:

  • ReLU activations
  • Residual connections (i.e. skip connections❓)
  • Careful weight initialisation
  • Batch normalisation

> Long Short-Term Memory Networks (LSTM)

def: specialised types of RNNs designed to address the vanishing gradient problem in traditional RNN.
β†’ incorporate gated mechanisms to better capture long-range dependencies in sequential data
β‡’ particularly effective for tasks like speech recognition, machine translation, sentiment analysis

> Gated Recurrent Units (GRU)

def: ❓

5.1.3. Transformers

def: NN architectures that use self-attention mechanisms to process sequential data, revolutionising NLP tasks
β†’ evaluate importance of all parts of the input sequence simultaneously vs. sequentially like RNNs
β‡’ basically parallel processing for more efficiency and scalability + performance

  • purpose: model NLP and beyond
  • key idea: uses β€œattention” to weigh importance of different input parts
  • application: BERT, GPT models, machine translation, text summarisation

> Attention Mechanism
def: NN technique that allows the model to dynamically focus on the most relevant parts of the input data when processing it β€” vs. treating all input elements equally
β†’ assigns different β€œattention weights” to various components of the input, based on importance for specific task
β‡’ better context understanding, improving performance in lots of NLP tasks

πŸ”‘ different importance weights to each component of the input β†’ help dynamic focus

> Self-Attention
def: specific form of attention mechanism that relates different positions within a single sequence to each other
β†’ the model can weigh the importance of each element in that sequence wrt others
β‡’ compute attention scores among all elements in the sequence simultaneously thus capture long-range dependencies and context more effectively than traditional sequential processing like RNNs

πŸ”‘ the weights of each element is wrt others in a sequence β†’ capture contextual dependencies
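
a minimal numpy sketch of (single-head, unmasked) self-attention for a toy sequence; the embeddings and projection matrices are random stand-ins:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# toy sequence: 4 tokens, embedding dim 8; Wq/Wk/Wv are illustrative random projections
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))

Q, K, V = X @ Wq, X @ Wk, X @ Wv
scores  = Q @ K.T / np.sqrt(K.shape[-1])   # how much each token attends to every other token
weights = softmax(scores, axis=-1)         # attention weights, each row sums to 1
output  = weights @ V                      # context-aware representation of each token
print(weights.shape, output.shape)         # (4, 4) (4, 8)
```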

5.2. Learning & Optimisation

> Core Concept: Gradient Descent

def: the core concept, iterative optimisation algo used to minimise the loss function by adjusting model parameters (weights and biases)
β†’ calculate the gradient (direction and steepness) of the loss and take a step β€œdownhill” towards the minimum
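
a minimal sketch of gradient descent on a toy one-parameter loss, just to show the β€œcompute gradient, step downhill” loop; the loss and learning rate are illustrative:

```python
# minimise a toy loss L(w) = (w - 3)^2 with plain gradient descent
w = 0.0
learning_rate = 0.1

for step in range(50):
    grad = 2 * (w - 3)           # dL/dw: direction and steepness of the loss
    w -= learning_rate * grad    # take a step "downhill"

print(w)                         # converges towards the minimum at w = 3
```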

> Training Cycle: Forward vs. Backward propagation

  • Forward propagation: process where input data flows through network, layer by layer, to generate an output prediction
    β†’ pure calculation from input to output
    β‡’ make predictions
  • Backward propagation: algorithm that calculates how to update each weight based on error (calculate the gradient of the loss function wrt each weight)
    β†’ β€œhow much did each weight contribute to the final error?”
    β‡’ learning from mistakes + tuning

> Loss Functions & Optimisers

  • Loss function (cost function): mathematical function that measures the β€œwrongness” of the model’s predictions compared to the true labels
    • e.g. MSE for regression
  • Optimisers: algorithms that update the model’s weights based on the gradients computed during backprop β€” they decide how to take a step downhill
    • SGD (Stochastic Gradient Descent)
    • Adam (Adaptive Moment Estimation)

> Regularisation techniques in DL/AI

def: methods to prevent overfitting (model memorises training data, but fails on new data)

  • Dropout: randomly β€œdrop out” / turn off neurons during training β†’ force network to not rely on any single neuron
  • Batch Normalisation: stabilise training by normalising layer inputs (i.e. the outputs of the previous layer)
  • Early Stopping: stop training when validation performance stops improving and starts degrading
  • Weight Decay: L1/L2 Regularisation adds a penalty to the loss function based on the magnitude of the weights, encouraging simpler models

5.3. Key Components & Techniques

> Activation functions

def: non-linear functions applied to a neuron’s output to determine whether and how strongly it should β€œfire” (activate)
β†’ they introduce non-linearity, allowing the network to learn complex patterns

  • ReLU (Rectified Linear Unit): f(x) = max(0, x) β€” most common default choice, simple, efficient, helps mitigate the vanishing gradient problem
  • Softmax: used typically in the final output layer for multi-class classification, it converts a vector of raw scores β†’ probability distribution where all values sum to 1

> Network Layers

  • input layer: entry point for the feature data
  • hidden layers: layers between input and output where the complex feature learning happens; the β€œdeep” in deep learning : D
  • output layer: produces the final prediction (e.g. a class probability or a continuous value)
  • note: in the input layer, 1 node / neuron = 1 feature (e.g. 1 pixel, BRUHHH)

> Transfer Learning

def: cornerstone technique where a model developed for one task is reused as the starting point for a model on a second task
β†’ basically leveraging pre-trained models vs. training from scratch
β‡’ fine-tune a pre-trained model on your specific (often smaller) dataset

5.4. Adv Models & Practical Aspects

> Adv Model Types

>> Autoencoders
def: unsupervised NN used for learning efficient data codings (dim reduction) and denoising
β†’ compress input into latent-space rpz and reconstruct output from this rpz

>> GANs
def: Generative Adversarial Networks, framework where 2 NNs are trained in competition

  • a Generator, which creates β€œfakes” (but really it’s synthetic data samples that are indistinguishable from real data)
  • a Discriminator, which spots fakes (vs. the real data)

β†’ widely used for generating realistic images, videos and other types of data

> Practical aspects

>> Hardware & Frameworks
def: DL is computationally intensive β†’ training typically runs on GPUs/TPUs rather than CPUs; the main frameworks are PyTorch and TensorFlow (with Keras as a high-level API)

> Interpretability
def: XAI or Explainable AI, the field concerned with making the β€œblack box” decisions of deep learning models understandable to humans
β†’ techniques include SHAP, LIME and saliency maps which are crucial for building trust and debugging models in sensitive domains like healthcare and finance

6. MLOps, cloud

6.1. MLOps

def: MLOps is the practice of managing and automating the entire ML lifecycle, from data prep and model training to deployment (production), monitoring and maintenance
β†’ streamline ML project management: make ML dev and deployment faster, more scalable, more reliable, collaborative between data scientists, engineers and IT ppl

key components (automation of these!):

  • data ingestion and versioning
  • data preprocessing and feature engineering
  • model training, hyper-parameter tuning
  • model validation, testing
  • model packaging and deployment (via CI/CD pipelines)
  • continuous production monitoring and alerting
  • model retraining or updating triggered by monitoring
  • governance and compliance throughout the lifecycle
  • automation of workflows for scalability and repeatability

advantages:
β†’ faster model updates
β†’ better model governance: compliance with regulations, transparency (explainability), ethics, security, accountability etc.
β†’ risk reduction
β†’ continuous improvement

version/source control

def: track, store and manage different versions of datasets, scripts, models

  • git or similar tools
automated model testing and validation

def: crucial step before deployment, automation ensures consistent QA for accuracy, robustness, performance

  • code: unit testing
  • ML pipelines: integration tests
    ❓
CI/CD

def: continuous integration, delivery and deployment is a software dev practice
β†’ automation of steps to build, test, deploy models rapidly and reliably
β‡’ frequent, reliable, error-free model updates!

  • CI: integrates changes continuously, + automated tests to ensure QA
  • CD: automates packaging models + ensures they’re production-ready (continuous delivery keeps a manual approval step for release; continuous deployment automates that too)

tools:

  • Jenkins
  • GitHub Actions
  • Azure DevOps
orchestration

def: coordination and automation of ML lifecycle tasks and workflows to run smoothly
β†’ manages workflows and pipelines
β†’ orchestration tools ensure tasks happen in the right order, handle dependencies, manages failures automatically, streamline collab between teams

tools: (cf. iii. Data Ingestion for data pipeline tools)

  • Apache Airflow
  • Prefect
  • Kubeflow Pipelines
monitoring and observability

def:

  • monitoring tracks the health and performance of ML models in prod continuously (e.g. data drift = the distribution of live data changing over time) β†’ catch anomalies, trigger alerts
    • also more generally tracking resources and workflows to detect and rectify any ops issue in the pipeline
  • observability refers to how well you can understand what’s happening inside your ML (logs, metrics, events)

tools:

  • Prometheus
  • Grafana
  • ELK Stack (Elasticsearch, Logstash, Kibana)
  • Cloud-native monitoring services

6.2. Cloud Computing

def: cloud refers to delivering computing resources (servers, storage, DB, ML tools) over the internet vs. local machines (so basically… data centers)
β†’ flexible and on-demand access
β†’ scalable infrastructure and service
β†’ storage, data processing (big data), MLOps
β†’ cost efficient: pay-as-you-go models (depends on usage) i.e. no large upfront cost

at the core, foundation of cloud computing:

  • distributed computing: model where multiple independent computers work together on a shared task through network communication, each handling a part of the workload
    • β‡’ cloud systems are essentially large-scale distributed systems.

other core advantages of cloud computing:

  • elasticity: instantly scale resources up/down based on demand β†’ you only pay for what you use and can handle sudden workload spikes smoothly
  • managed services: cloud providers offer fully managed platforms for DB, ML, analytics, security, etc. β†’ less ops complexity and speed up dev
  • global reach: deploy apps across multiple geo regions β†’ low latency, redundancy
  • security and compliance: heavily secured and compliant with local regulations

note: edge computing is a distributed computing model that brings data processing and storage closer to the location where the data is generated (e.g. near sensors and devices vs. centralised cloud data centers)
β†’ the proximity helps reduce latency (lags, delays, response time) and thus more efficient real-time performance
β†’ enhanced data privacy and security too

6.2.1. Cloud concepts

virtualisation

def: abstracts physical hardware (servers) into virtual machines (VM) or containers
β†’ better hardware/resource usage by running multiple isolated env on the same physical machine

  • VMs: acts like a separate, indep computer with its own CPU, memory, storage etc.
    β†’ run with its own OS (manages hardware, run programs)
    β†’ resource-heavy but strong isolation
  • Containers: packaged and virtualised apps
    β†’ share the host OS kernel
    β†’ but still isolate apps and dependencies
    β†’ + lightweight and faster to start
containerisation

def: containers package apps (+ everything they need to run = code, libs, deps) in a portable, lightweight, consistent unit.

>> docker
def: build, package and run containers on a single machine or host.
β‡’ simple installation, small-scale apps

>> kubernetes
def: orchestration platform that manages many containers across multiple machines
β†’ handles deployment, scaling, networking and health
β‡’ production-grade, large-scale, distributed apps

serverless computing

def: allows running code without managing servers
β†’ cloud provider auto handles scaling, availability and infrastructure concerns
β†’ pay-as-you-go

Infrastructure as Code (IaC)

def: manages cloud resources/infrastructure using config files (json, yaml)
β†’ repeatable deployments and controllable versions

tools:

  • Terraform
  • AWS Cloud Formation

6.2.2. Cloud providers

AWS

def: Amazon Web Services

Google Cloud

def: Google’s suite of cloud computing services

Microsoft Azure

def: Microsoft’s cloud computing service for building, testing, deploying and managing applications and services

7. Data Engineering

def: discipline focused on designing, building and maintaining the infrastructure and systems that enable efficient collection, storage, processing and delivery of data.
β†’ core purpose: transform raw data into usable formats for data scientists, analysts, business users

7.1. Data Eng Cycle

The 4 steps of the Data Engineering Lifecycle:

  1. Data Generation
  2. Data Storage & Management
  3. Data Ingestion
  4. Data Serving

7.1.1. Data Generation (Collection & Integration)

def: collect/extract data from various sources and unify it in consistent formats
β†’ the possible sources

  • Database: organised, structured collection of electronic data that is stored, managed and accessed by a Database Management System (cf. 2.3. SQL for more info on database)
  • API: Application Programming Interface, sets of protocols, routines and tools that enable different software apps to communicate with each other.
    • allow interactions with a service or platform through defined set of rules and endpoints
    • β†’ data exchange and functionality use without the need to understand / access the underlying code
  • Logs: files that record events, activities, system operation over time
    • β†’ historical record of what has happened within a sys including timestamps, event details, performance data, errors, user actions.
  • Mobile Apps, or IoT, …

7.1.2. Data Storage (& Management)

def: implementing scalable storage solutions such as data warehouses or lakes for future processing and analysis β€” ensuring data accessibility, security, and governance
β†’ data (digital information) can be stored on physical (hard drives) or cloud-based media (cloud platforms)

> how’s data managed ie what’s a β€œschema”?
  • a schema turns raw data into structured information by enforcing a consistent format, i.e. it’s like a template or a structure that defines how data is organised
    • schema-on-write: define the schema before loading the data β†’ inflexible but data is clean and reliable (warehouse)
    • schema-on-read: apply the schema when reading the data β†’ flexible but can lead to garbage in, garbage out (lake)
    • star schema: intuitive, simplest and most common way to model/structure data in a data warehouse, it uses a central fact table connected to multiple dimension tables, forming a shape like a star
      • β†’ separate what you measure (facts) vs. how you describe it (dimensions)
    • snowflake schema: another way of organising data where the dimension tables are split into smaller sub-dimensions to keep data more organised and detailed
      • β†’ structure is normalised (i.e. hierarchical vs. denormalised = flat dimensions)
> how’s data stored ie in what kind of architecture?
  • data warehouse: centralised repo for storing structured, processed and filtered data that is optimised for analysis and reporting
    • data: structured and semi-structured / data is cleaned, transformed and modeled (often into a star schema)
    • schema: schema-on-write
    • users: business and data analysts, ppl running sql queries for BI dashboard and reports
    • purpose: Business Intelligence (BI), reporting
    • cost: typically more expensive than massive storage
  • data lake: vast, centralised repo that stores raw, unprocessed data in its native form, at any scale.
    • data: all, (semi-)(un)structured
    • schema: schema-on-read
    • users: data scientists, engineers
    • purpose: advanced analytics, ML, data discovery
    • cost: typically cheaper than the warehouse
  • data lakehouse: modern architecture that get the best of both world
    • low-cost, flexible storage of a data lake
    • management, performance and ACID transactions of a data warehouse (so BI tools can query it directly)

some examples of Data Cloud Platforms for each structure:

| Data Warehouse | Data Lake | Data Lakehouse |
|---|---|---|
| Snowflake | Amazon S3 (Simple Storage Service) | Databricks Delta Lake |
| Google BigQuery | (MS) Azure Data Lake Storage (ADLS) | Snowflake |
| Amazon Redshift | Google Cloud Storage | Onehouse |

7.1.3. Data Ingestion

def: collect, import data files from various sources into a database for storage, processing and analysis
β†’ goal is to clean, transform and store data in an accessible and consistent central repo to prepare it for use within the organisation

the different types of data ingestion:

  • batch: process data in large, scheduled chunks/batches β†’ ok for non-time-sensitive and repetitive tasks e.g. monthly reports
  • streaming (real-time): handle data as it arrives β†’ time sensitive tasks e.g. fraud detection
  • hybrid: both depending on case β‡’ more flexibility for diverse business needs
> Data Pipelines

def: series of automated processes that transport and transform data from various sources to a destination for analysis and storage
β†’ typically involve data ETL into DB, lakes, warehouses.

> ETL Process

def: Extract, Transform, Load

  • E: extract raw data from various sources
  • T: transform raw data (e.g. process and clean) into structured data ready to be stored or analysed
  • L: load the clean data into the data storage solution (warehouse or lake)

some tools used during the data pipeline: (cf. MLOps components orchestration for similar concept)

  • Apache Airflow: open-source tool that helps schedule, organise and monitor workflows β€” can automate data pipeline
  • Prefect: open-source orchestration engine that turns python functions into production-grade data pipelines β†’ can build and schedule workflows in python
> Big Data Tools

cf. 7.2. Big Data

> MLOps & Cloud Concepts

cf. containerisation
cf. CI/CD
cf. monitoring and observability
cf. Infrastructure as Code (IaC)

7.1.4. Data Serving

def: last step of the data engineering process, once the data is stored in the data architecture and transformed into coherent and useful format, basically β†’ provide data to end-users for decision-making and operational purposes

  • Data Analytics: broader discipline of data, focus on interpreting processed data through visualisation, reporting, and descriptive stats β€” but also can delve into diagnostic, predictive and prescriptive analytics
  • Business Intelligence: under data analytics, solely focus on descriptive stats, it’s more so about reporting, dashboard, data visualisation β‡’ monitoring and reporting
    • Tableau: powerful, visual, drag-and-drop dashboards
    • (Microsoft) Power BI: good for Excel integration
    • (Google) Looker: uses a modeling layer called β€œLookML”
    • Qlik Sense: associative analytics engine
    • Streamlit: open-source python framework to build interactive web apps for DS and ML
> Reverse ETL

def: the reverse ETL is the process of extracting data from a data architecture then transforming it to fit requirements of operational systems and then loading it into those operational systems.

  • operational systems: different from operating system (OS), they are software that run the day-to-day core operations of a business (CRM, ERP, E-commerce platform, …)

so basically, vs. ETL (cf. > ETL Process):

  • traditional ETL: Extract data from various sources, Transform to fit Data Warehouse requirements, Load data into Data Warehouse for analysis
    • Production Apps (raw data) β†’ ETL Pipeline β†’ Data Warehouse (analysed and enriched data)
    • β‡’ β€œhow can we analyse our business?”
  • reverse ETL: Extract data from Data Warehouse, Transform to fit Ops Systems requirements, Load data into Operational Systems
    • Data Warehouse β†’ Rev ETL Pipeline β†’ Business Apps (CRM, Marketing tools, etc.)
    • β‡’ β€œhow can we use our analysis to run our business?”

an example of flow:

  • traditional ETL
    • Shopify β†’ ETL β†’ Data Warehouse β†’ Tableau Dashboard
  • reverse ETL
    • Customer LTV Score from (Data Warehouse) β†’ Reverse ETL β†’ Salesforce β†’ Sales team take actions based on those insights

β‡’ it closes the loop between data analysis and business operations

  • data team does the complex analysis in the warehouse
  • rev ETL allows non-tech team to use those with their tools to make impact on business
>> Business Apps

def: specialised software that each department in a company uses to do their job

note...

slightly off topic section, but i don’t know where to put this for now; still interesting to see and understand the big picture.

>>> i. Data Infrastructure Apps
def: move and manage data between systems

  • Segment: it is a Customer Data Platform (CDP) β€” main job is to collect, clean, and control customer data from everywhere and send it to all other tools
    β‡’ collect and route customer event data
  • Fivetran / Stitch: ETL tools that sync data from apps (like Salesforce) to the DW
    β‡’ ETL sync data from biz apps to DW
  • Hightouch / Census: Reverse ETL tools that sync data from the W back to business apps
    β‡’ reverse ETL from DW to biz apps
  • Airflow / Prefect: workflow orchestration tools that schedule and manage data pipelines

>>> ii. Customer-Facing Operations (β€œexternal”)

  • CRM: Customer Relationship Management, central system for all customer data and interactions
    • Salesforce: giant, highly customisable market leader
    • HubSpot: an all-in-one platform combining CRM, marketing, sales, and service β€” often seen as a more user-friendly alternative to Salesforce for growing companies
  • Marketing Automation: email campaigns, lead nurturing
    • HubSpot
    • Marketo (Adobe): Enterprise-level marketing automation
    • Mailchimp: Famous for email marketing, now expanding into broader marketing platforms.
  • Customer Support:
    • Zendesk, Intercom, Freshdesk

>>> iii. Internal Operations Apps

  • ERP: Enterprise Resource Planning, central nervous system of a company, basically manages core processes like finance, inventory, manufacturing and HR.
    • SAP, Oracle NetSuite, Microsoft Dynamics
  • HR & People Ops: Workday, BambooHR
  • Finance: QuickBooks, Xero
  • Productivity & Collaboration:
    • Slack / Microsoft Teams: communication
    • Asana / Jira: project and task management (Asana is general, Jira is very popular with software/engineering teams)
    • Google Workspace / Notion: all-in-one workspace for notes, docs, and wikis

>>> iv. Analytics & BI Apps

  • BI Tools:
    • Tableau, Power BI, Looker, …
  • Product & Analytics: track user behavior inside app/website to understand how ppl use and interact with the product
    • Amplitude: product & user behaviour analytics app, understand how users interact with the product β†’ user journey and events, informs what drives retention, conversion // complex, cross-platform user journey analysis
      • data focus: user events (clicked buttons, completed levels, upgraded plans), funnels, retention, cohort analysis, A/B testing
    • Mixpanel: very similar to Amplitude, track specific user actions and build funnels to analyse conversion and retention // intuitive interface and strong funnel/reporting capabilities
    • Google Analytics: slightly different, more for marketing & acquisition analytics, understand where the website traffic comes from and what users do at a high level (not so specific and event-based)
      • data focus: page views, traffic sources, demographics, session data

> Data Governance

def: overall management of availability, quality, usability, integrity and security of data in an organisation β€” in compliance with legal and regulatory requirements
β†’ it’s about establishing rules and processes for handling data

>> Data Quality
def: ensure data is accurate, complete and reliable

>> Data Catalog
def: a β€œlibrary catalog” β€” what data exists, where it is, what it means
β†’ i.e. data discovery

>> Data Lineage
def: track where data comes from and how it moves/transforms through systems
β†’ i.e. data lifecycle

>> Data Ownership
def: accountability, stewardship and decision-making rights over a data asset
β†’ quality, security, management, maintenance, …

>> Access control
def: determine who can see and use what data

>> Compliance
def: ensure data handling meets legal and regulatory requirements

> Data Privacy

def: a critical subset of data governance (the overall system), focuses on the proper handling of personal and sensitive data β€” how it is collected, stored, shared and used in compliance with laws and individual rights.

>> Consent
def: getting permission from individuals to collect and use their data

>> Right to Access/Deletion
def: laws like GDPR give individuals the right to see what data you have on them and request its deletion

>> Data Minimisation
def: only collect data that you absolutely need

>> Major Regulations

  • GDPR: General Data Protection Regulation, the landmark EU data protection law
  • CCPA/CPRA: California Consumer Privacy Act (and its amendment, the California Privacy Rights Act), California state law
  • (bit off-topic) EU AI Act: world’s first comprehensive legal framework for AI passed by the EU β€” law that regulates AI systems based on their potential risk to health, safety and fundamental rights.
  • some others depending on industry, data type, regions…

7.2. Big Data

def: big data refers to the extremely large and complex datasets that are too big or diverse to be handled by traditional data processing methods.
β†’ characterised by the 5Vs:

  • volume: massive amounts of data (e.g. petabytes; 1 PB β‰ˆ 1M GB)
  • velocity: high speed at which the data is generated and processed
  • variety: different types of data (structured, unstructured, semi-structured)
  • veracity: accuracy and trustworthiness
  • value: useful insights and benefits extracted from the data

β†’ large-scale data processing

Other core principle: FAIR data principle

  • Findability: easy to find data for both humans and machines
  • Accessibility: know how to access the data (authentication and authorisation)
  • Interoperability: data usually needs to be integrated with other data, apps, workflows, etc.
  • Reuse: ultimate goal of FAIR is to optimise reuse and replication of data

7.2.1. Hadoop

def: open-source framework for distributed storage and processing of large datasets across clusters of computers

  • core includes
    • HDFS (hadoop distributed file system): splits and stores data across multiple machines (low-cost servers/computers i.e. commodity hardware)
    • MapReduce: programming model that enables parallel, disk-based data processing (a toy illustration follows this list)
  • fault tolerance: if some parts of the sys fail, it’ll still keep working properly without interruption
  • scalability: ability of a sys to handle growing amount of work by adding more resources (e.g. more servers to manage more data, or more users without performance loss)
  • horizontal scaling: increasing sys capacity by adding more machines to a network vs. making a single machine more powerful = vertical scaling β€” basically distributes workload across many servers
  • batch processing
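
To make the MapReduce model concrete, here is a tiny single-machine, pure-Python word count showing the map β†’ shuffle β†’ reduce phases; real Hadoop distributes these phases across a cluster on HDFS, so this only illustrates the programming model, not Hadoop's API.

```python
# Single-machine word count written in MapReduce style (map -> shuffle -> reduce).
from collections import defaultdict

documents = ["big data on hadoop", "hadoop stores big data on hdfs"]

# Map: emit (key, 1) pairs for every word
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group all values by key
grouped = defaultdict(list)
for word, one in mapped:
    grouped[word].append(one)

# Reduce: aggregate the values of each key
word_counts = {word: sum(ones) for word, ones in grouped.items()}
print(word_counts)  # e.g. {'big': 2, 'data': 2, 'on': 2, 'hadoop': 2, 'stores': 1, 'hdfs': 1}
```
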

7.2.2. Apache Spark

def: fast, flexible, in-memory data processing engine (often used alongside Hadoop)
β†’ in-memory = data processing directly in a computer’s RAM (memory) instead of slower storage/hard drives (disk) β‡’ ++ SPEED
β†’ but RAM capacity is limited compared to disk

  • in-memory processing is much faster than traditional disk-based processing like MapReduce (β†’ Apache Spark is typically faster than Hadoop MapReduce alone)
  • offers batch processing, stream processing, ML, graph computations
    • β†’ unified analytics platform (a minimal PySpark sketch follows)
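
A minimal PySpark sketch, assuming `pyspark` is installed and a local `events.csv` file with an `event_date` column exists (both the file and the column name are illustrative):

```python
# Minimal PySpark job: read a CSV into a distributed DataFrame and aggregate it in memory.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-example").getOrCreate()

# "events.csv" and the "event_date" column are illustrative
df = spark.read.csv("events.csv", header=True, inferSchema=True)

# Batch-style aggregation: number of events per day
daily = df.groupBy("event_date").count()
daily.show()

spark.stop()
```
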

7.2.3. Kafka

def: platform for real-time data streaming and messaging

  • high-throughput: capacity to process a large volume of data in a given time
  • fault tolerance
  • processes data streams continuously and in real time (a minimal producer/consumer sketch follows)
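
A hedged sketch using the `kafka-python` client; the broker address, topic name and message payload are assumptions for illustration:

```python
# Sketch with the kafka-python client; broker address and topic name are assumptions.
from kafka import KafkaProducer, KafkaConsumer

# Producer: publish an event to a topic
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("page-views", b'{"user_id": 42, "page": "/pricing"}')
producer.flush()

# Consumer: read the stream continuously, message by message
consumer = KafkaConsumer(
    "page-views",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
)
for message in consumer:
    print(message.value)  # process each event as it arrives
```
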

7.2.4. NoSQL

def: type of DB designed for flexibility and scalability of unstructured or semi-structured data
vs. traditional relational DB

Different data models supported:

  • document: e.g. MongoDB, a popular NoSQL DB that stores data as JSON-like documents (a short pymongo sketch follows this list)
  • wide-column: e.g. Cassandra
  • key-value
  • graph
  • …
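
A short `pymongo` sketch showing the schema-flexible document model; the connection string, database, collection and field names are assumptions:

```python
# Sketch with pymongo; connection string, database, collection and field names are assumptions.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
users = client["appdb"]["users"]

# Documents are schema-flexible, JSON-like dicts: fields can differ between records
users.insert_one({"name": "Ada", "plan": "pro", "tags": ["beta", "eu"]})
users.insert_one({"name": "Bob", "signup_source": "ads"})

# Query by field, no fixed table schema required
for doc in users.find({"plan": "pro"}):
    print(doc)
```
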

7.2.5. Hive

def: a high-level data query and scripting tool built on Hadoop; it allows data analysis with SQL-like queries (HiveQL) without deep programming knowledge

8. Data Science Challenges

def: common obstacles, limitations, and practical considerations encountered throughout the data lifecycle from collection to deployment.

8.1. Data challenges

def: issues related to acquiring, cleaning, and preparing high-quality data from diverse sources, ensuring data is reliable and fit-for-purpose for analysis and modeling.

Data Acquisition & Quality

def: ensuring the collection of relevant, accurate, and complete data while overcoming issues like data silos, missing values, inconsistencies, and privacy compliance.

> Finding Data
data is often siloed across different company departments, or simply doesn't exist
and it can be difficult to merge data from diverse, isolated systems with inconsistent formats and access restrictions

> Bias & Ethical concerns in Data
historical data can contain human and societal biases β†’ biased models
β‡’ address fairness explicitly, since biases embedded in training data end up in the resulting models too

> Missing & Noisy Data
real-world data is messy: incorrect (inaccurate) entries, missing values (incomplete records), outdated or duplicated data, inconsistencies
β†’ these have to be thoroughly cleaned to avoid distorting the analysis (a small pandas sketch follows)
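
A small pandas sketch of typical clean-up steps; the toy DataFrame, column names and thresholds are made up for illustration:

```python
# Toy example of common cleaning steps; the DataFrame and thresholds are made up.
import pandas as pd

df = pd.DataFrame({
    "age": [25, None, 40, 40, 200],                    # a missing value and an implausible outlier
    "city": ["Paris", "paris", "Lyon", "Lyon", None],  # inconsistent formatting
})

df = df.drop_duplicates()                                      # remove duplicated rows
df["city"] = df["city"].str.title()                            # harmonise formatting
df = df[df["age"].between(0, 120) | df["age"].isna()].copy()   # drop implausible entries
df["age"] = df["age"].fillna(df["age"].median())               # impute missing values
print(df)
```
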

> Labeling
for supervised learning, labeling data is often expensive, time-consuming, requires expert knowledge

> Unstructured Data
processing and extracting meaningful insights from unstructured formats like text, images, logs, …

> Imbalanced Datasets
classes in classification problems are not represented equally, which can cause models to be biased toward majority classes

β‡’ Some solutions (sketched in code after this list):

  • SMOTE: Synthetic Minority Over-Sampling Technique that creates artificial examples of the minority class to balance the dataset
  • Class weighting: adjust loss function to give more importance to minority classes during model training
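
A minimal sketch of both approaches on a synthetic imbalanced dataset, assuming scikit-learn and imbalanced-learn are installed:

```python
# Two ways to handle class imbalance; assumes scikit-learn and imbalanced-learn are installed.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE

# Synthetic dataset: roughly 95% majority class, 5% minority class
X, y = make_classification(n_samples=2000, weights=[0.95], random_state=0)

# Option 1: SMOTE - synthesise new minority-class samples before training
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
model_smote = LogisticRegression(max_iter=1000).fit(X_res, y_res)

# Option 2: class weighting - make minority-class errors cost more in the loss
model_weighted = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
```
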

> Privacy & Security
protecting sensitive information amid growing regulations and increasing cyber threats

Data Preparation & Engineering

> 80% Rule
cleaning, transforming, feature engineering can take up to 80% of the project time

> Data Leakage
when information from the future / the test set β€œleaks” into the training process, which leads to over-optimistic evaluation and models that fail in production

e.g.

  • Preprocessing (like scaling) done on the entire dataset before splitting
  • Using future information to predict the past (sequential data, time-based data)

β‡’ Solution: always split the data first, then preprocess using statistics computed on the training set only
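
A minimal sketch of that leakage-safe pattern with scikit-learn: split first, then let a Pipeline fit the scaler on the training data only (dataset chosen arbitrarily for illustration):

```python
# Leakage-safe pattern: split first, then fit preprocessing inside a Pipeline on the training data only.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The scaler learns its mean/std from X_train only, so no test-set statistics leak into training
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```
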

8.2. Model challenges

def: focusing on designing, training, evaluating, and interpreting models that generalise well, avoid bias, and balance accuracy with explainability and computational constraints.

Model Development

def: designing and training models that balance complexity and performance, while addressing issues like overfitting, bias, reproducibility and computational efficiency.

cf. 4.2.3. Fundamental Eval Concepts

> No one size fits all β€” Choosing the Right Model
no single model is best for every problem β†’ choosing the right algorithm is non-trivial

> Overfitting
model performs well on training data but poorly on test data
β†’ regularisation, simpler models, more data, dropout, early stopping

> Underfitting
model performs poorly on both training and test data
β†’ more complex models, better features, longer training

> Bias-Variance Trade-off
trade-off between a model that is too simple (high bias) and one that is too complex (high variance)
β†’ basically the balance between underfitting and overfitting (illustrated below)
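
A tiny scikit-learn sketch of the trade-off: an unconstrained decision tree memorises the training set, while a depth-limited (simpler) tree shows a smaller train/test gap; the dataset and the depth value are arbitrary choices for illustration.

```python
# An unconstrained tree memorises the training data; a depth-limited (simpler) one generalises better.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)                   # high variance
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)   # more bias, less variance

print("deep:   ", deep.score(X_tr, y_tr), deep.score(X_te, y_te))        # ~1.0 on train, lower on test
print("shallow:", shallow.score(X_tr, y_tr), shallow.score(X_te, y_te))  # smaller train/test gap
```
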

> Reproducibility
getting different results from the same code due to unpinned random seeds, different software versions or hardware
β†’ it's important to ensure that results can be reliably reproduced across teams and deployments (see the seed-setting sketch below)
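
A common seed-setting pattern; a sketch rather than a complete recipe, since deep learning frameworks keep their own RNGs and library versions should be pinned as well:

```python
# Common seed-setting pattern for reproducible runs (sketch; not exhaustive).
import random

import numpy as np

SEED = 42
random.seed(SEED)      # Python's built-in RNG
np.random.seed(SEED)   # NumPy's global RNG
# deep learning frameworks keep their own RNGs, e.g. torch.manual_seed(SEED) if PyTorch is used

# also pass random_state/seed arguments explicitly (e.g. in scikit-learn) and pin library versions
```
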

> Computational cost and efficiency
def: increasingly critical challenges in model development due to the growing complexity and size of modern deep learning models

  • Resource Intensive Training: large models require GPU resources, long training times β†’ can be costly and inaccessible for smaller teams or organisations
  • Inference Latency and Throughput: need for models to perform quickly and at scale esp. for real-time apps like autonomous driving or recommendation systems β†’ efficient architectures and model compression techniques are necessary
  • Algorithmic Efficiency: research to reduce computational requirements while preserving accuracy β†’ e.g. efficient transformer variants or lightweight CNNs
  • Trade-offs Between Accuracy and Efficiency: balance between model complexity and feasibility

> Keep up with Rapid Tool Evolution
adapting quickly to new algo, framework, AI tech

Evaluation & Interpretation

def: assessing model accuracy and generalisation with appropriate metrics and ensuring interpretability and explainability for trustworthy decision-making.

> Choosing the Right Metric
accuracy is often misleading β†’ just like choosing the right model, choosing the right metric that aligns with business goals is also critical
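
A quick sketch of why accuracy misleads on imbalanced data: a model that always predicts the majority class scores 95% accuracy but 0 recall (the counts below are made up):

```python
# A model that always predicts the majority class: great accuracy, useless recall.
from sklearn.metrics import accuracy_score, f1_score, recall_score

y_true = [0] * 95 + [1] * 5   # 95 negatives, 5 positives (illustrative)
y_pred = [0] * 100            # "predict negative for everyone"

print(accuracy_score(y_true, y_pred))  # 0.95 - looks great
print(recall_score(y_true, y_pred))    # 0.0  - misses every positive case
print(f1_score(y_true, y_pred))        # 0.0
```
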

> The Black Box Problem
many powerful models (like deep neural networks) are difficult to interpret, making it hard to explain why a decision was made
β†’ balancing model performance with the ability to explain decisions is essential for transparency and trust (an explainability sketch follows)
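
One common mitigation is post-hoc explanation; a hedged sketch with the `shap` library (assumed installed) on an arbitrary tree model, and not the only way to tackle the black-box problem:

```python
# Post-hoc explanation sketch with the shap library; the model and dataset are arbitrary.
import shap
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
model = RandomForestClassifier(random_state=0).fit(X, y)

# SHAP values estimate each feature's contribution to each individual prediction
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X.iloc[:100])
```
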

> Concept Drift
the statistical properties of the target variable (or its relationship with the inputs) change over time, causing model performance to decay

8.3. Deployment challenges

def: managing the transition from development to production, including experimentation, validation, monitoring, and ongoing maintenance to ensure models deliver consistent, real-world value.

From Prototype to Production

def: controlled experimentation and validation (during rollout) before full-scale deployment

> Jupyter Notebook to Production Gap
a model that works well in a research notebook is very different from a reliable, scalable production system

> A/B Testing
def: a controlled experiment where two versions (A and B) of a feature, web page, app, or other solution are randomly presented to users to determine which version performs better based on chosen metrics (clicks, conversions, engagement, …)
β†’ evidence-based decision-making: validate changes using real-world user data vs. intuition and assumptions
β‡’ optimise UX, marketing strategies, product features, …

step-by-step:

  • formulate a hypothesis about which change might improve outcomes
  • create 2 versions: A(control) and B(variant)
  • randomly assign users to each group
  • collect data on outcomes for both groups
  • use statistical analysis (often hypothesis testing, cf. i. Hypothesis Testing)
    to determine whether observed differences are significant and not due to chance (see the sketch below)
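
A hedged sketch of that statistical step using a two-proportion z-test from statsmodels; the visitor and conversion counts are illustrative:

```python
# Two-proportion z-test on conversion counts; all numbers are illustrative.
from statsmodels.stats.proportion import proportions_ztest

conversions = [480, 530]       # conversions observed in A (control) and B (variant)
visitors = [10_000, 10_000]    # users assigned to each group

stat, p_value = proportions_ztest(count=conversions, nobs=visitors)
print(p_value)  # if p < 0.05 (a common threshold), the difference is unlikely to be chance alone
```
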

> Infrastructure & Scalability
building pipelines for data ingestion, model serving, monitoring that can handle real-world load
β†’ Tool Fragmentation: managing multiple platforms needed for different parts of the data science pipeline, which can complicate workflows
β†’ Scalability: challenges in maintaining model performance and response times as data volume, velocity and user load increase in production

Some solutions:

  • Distributed Computing: using multiple machines or processors to handle large-scale data processing and model training through parallelisation (cf. 6.2. Cloud Computing)

> Operationalisation
integrating models into business processes with continuous monitoring for performance degradation or bias (model drift)

Monitoring & Maintenance

def: continuous tracking of model performance, detecting drift, managing versioning, and updating models to maintain accuracy and relevance in production.

> Performance Decay
models need to be continuously monitored and retrained as the data and the world change
(cf. monitoring and observability in MLOps)
β†’ concept / data / model drift lead to performance decay

Some solutions (a simple drift check is sketched after this list):

  • data drift detection: identify when the statistical properties of input data change over time, potentially degrading model performance
  • concept drift detection: identify when the relationships between inputs and outputs change, requiring model retraining or updating
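
A simple data-drift check sketched with a two-sample Kolmogorov–Smirnov test from SciPy; the synthetic features and the 0.01 threshold are illustrative assumptions:

```python
# Compare a feature's training distribution with its live/production distribution.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_feature = rng.normal(loc=0.0, scale=1.0, size=5_000)   # what the model was trained on
live_feature = rng.normal(loc=0.4, scale=1.0, size=5_000)    # the live data has shifted

result = ks_2samp(train_feature, live_feature)
if result.pvalue < 0.01:  # illustrative threshold
    print("distribution shift detected: consider investigating the pipeline or retraining")
```
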

> Versioning
managing versions of data, model code, and the trained model artifacts themselves

8.4. Other challenges

def: encompassing cross-cutting issues like aligning technical work with business goals, navigating ethical and regulatory requirements, and fostering collaboration across stakeholders.

Tech-Business Relationship

  • understand the business context, align tech solution
  • collaborate closely and efficiently
  • communicate complex results and limitations to non-tech stakeholders

Ethical & Regulatory Challenges

  • Responsible AI: implement safeguards for secure, ethical and transparent AI use
  • Bias Mitigation: proactively detect and reduce biases through ethical reviews and audits
  • Governance and Compliance: meet regulatory requirements and maintain thorough documentation for accountability

Appendix

notes of stuff you wanna include?

  • 🟣 time series stuff β†’ WIP!

  • 🟣 Maths

  • 🟣 coding

  • AWS SageMaker

  • next:

    • git β‡’ SE section in Coding
    • data visu β‡’ in EDA, Data subsection β‡’ ok but just the placeholder honestly, will have to go slightly deeper
    • bias and variance β†’ Bias-variance tradeoff
    • data processing, preprocessing, wrangling β†’ Data101
    • finish 4.2.3. Fundamental Eval Concepts
    • add more to data science challenges
    • A/B testing β‡’ DS Challenges
    • ETL β†’ Data engineering section
    • type of data… and dbase? β‡’ Data engineering section
    • big rabbit hole into business apps
  • done:

    • Markov decision process
    • k-nearest neighbours vs… the other one?
    • regressions?
    • AI for sure
    • sql coding
    • variance vs. std
    • CI/CD
    • Docker
  • check all the links

  • clean up the cf. and ref :(

  • review all quickly

  • churn? ROI? (aren’t they business terms) β†’ general glossary created

  • GANs or GenAI, diffusion β†’ AI engineering notes probably here :)

β†’ there are still some stuff to review and complete!
❓: to review later -- make sure to check them after

sources: