Algorithms for Data Analysis

Télécom Saint-Etienne
2nd year, Data Engineering Apprenticeship Engineer
24 hours
License CC BY-NC-SA 4.0 - Jordan Frécon-Deloire - 2022

This course is partly inspired by the one written by Charlotte Laclau.

The present course aims to develop a solid understanding of various algorithms and techniques used for processing and analyzing data. This includes gaining proficiency in implementing and evaluating algorithms, selecting appropriate algorithms for specific tasks, understanding algorithmic complexity and efficiency, and applying algorithms to solve real-world data analysis problems. The goal is to equip students with the necessary knowledge and skills to effectively and efficiently analyze large datasets, extract valuable insights, and make data-driven decisions.

Prerequisites

Prior knowledge of the fundamentals of both mathematics (analysis, linear algebra, probability, statistics) and computer science is required. In particular, students are advised to be proficient in:

Basic concepts of probability and statistical inference
Vector manipulation and matrix multiplication
Basics of calculus of variations
Linear algebra (eigenvalue - eigenvector, hyperplane, projection, etc.)
Python programming and vector manipulation using NumPy

Educational Goals

This course is designed to provide a quick introduction to data analysis algorithms for data engineers. As such, it meets different educational objectives:

Understand the differences between artificial intelligence and machine learning
Know which model / algorithm to choose depending on the task
Learn how to train, validate and test machine learning models
Discover the basics of numerical optimization
Master Python libraries for data analysis
Apply the most popular supervised learning algorithms (linear regression, SVM, decision trees, etc.)
Use two unsupervised learning algorithms (K-means, PCA)

Course Program

The course is given in 10 sessions of 1.5 or 3 hours each.

Session 1. Data analysis with Python

- Tutorial class on the concepts & libraries of data analysis (Slides)

Data analysis; Types of data; Data representation
Differences between AI and ML
Python libraries: NumPy, Pandas, Scikit-learn

- Lab exercise: Video game sales (Jupyter notebook)

Collecting and cleaning data
Exploratory data analysis
Data visualization

Session 2. Introduction to machine learning

- Tutorial class on machine learning (Slides)

Formal definition; Supervised learning; Unsupervised learning
What is learning; Hypothesis space; Generalization
Training, validating and testing models
Hyperparameters; Data splitting techniques; Regularization

Session 3. ML for non-ML practitioners

- Lab exercise: Diamonds market (Jupyter notebook)

1D linear regression
Link between statistics and ML
Grid-search algorithm
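
As an illustration of how the three lab topics above can fit together, here is a minimal sketch of a grid search for the slope of a 1D linear model y ≈ a·x, compared with the least-squares formula from statistics. The data, the grid and the function name are made up for the example and are not taken from the lab notebook; the notebook may well use grid search in a different way (for instance to tune hyperparameters).

```python
def mean_squared_error(slope, xs, ys):
    """Mean squared error of the model y = slope * x on the data."""
    return sum((y - slope * x) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical toy data, roughly following y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

# Grid search: evaluate the error on a regular grid of candidate slopes
# (0.00, 0.01, ..., 4.00) and keep the candidate with the smallest error.
candidate_slopes = [i / 100 for i in range(401)]
best_slope = min(candidate_slopes, key=lambda a: mean_squared_error(a, xs, ys))

# Least-squares estimate for a line through the origin: sum(x*y) / sum(x*x).
closed_form_slope = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

print(best_slope, closed_form_slope)
```

The grid search can only be as precise as the grid spacing, which is one reason the later sessions on numerical optimization look at gradient-based alternatives.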

Session 4. Panorama of numerical optimization algorithms

- Tutorial class on numerical optimization (Slides)

Principle
Convex vs. nonconvex
Gradient descent; Impact of the step-size
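
To give a feel for the impact of the step-size, the sketch below runs plain gradient descent on the convex function f(x) = x², chosen here purely for illustration. The function names and the two step-size values are assumptions for the example, not material from the slides.

```python
def gradient_descent(grad, x0, step_size, n_iters):
    """Run gradient descent from x0 and return the list of iterates."""
    x = x0
    iterates = [x]
    for _ in range(n_iters):
        # Move against the gradient, scaled by the step-size.
        x = x - step_size * grad(x)
        iterates.append(x)
    return iterates

def grad_f(x):
    """Gradient of the illustrative objective f(x) = x**2 (minimizer at 0)."""
    return 2.0 * x

# Small step-size: each iterate is multiplied by 0.8, so it shrinks toward 0.
small = gradient_descent(grad_f, x0=1.0, step_size=0.1, n_iters=50)

# Too large a step-size: each iterate is multiplied by -1.2, so it
# oscillates in sign and grows in magnitude.
large = gradient_descent(grad_f, x0=1.0, step_size=1.1, n_iters=50)

print(abs(small[-1]), abs(large[-1]))
```

On this particular function, any step-size strictly between 0 and 1 converges, and anything above 1 diverges; the threshold depends on the curvature of the objective.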

- Presentations by students

Random search; SGD; Adam; BFGS; Coordinate descent; Line search
Newton's method; Genetic algorithm; Branch-and-bound algorithm

Session 5. Fundamental supervised models

- Tutorial class on fundamental supervised models (Slides)

K-nearest neighbors
Linear regression
Ridge regression

Session 6. Train and use simple supervised models to study penguins

- Lab exercise: Penguins of Antarctica (Jupyter notebook)

Visualize meaningful features
Cross-validation; K-fold cross-validation
KNN Classification; Linear and KNN Regression
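
For readers unfamiliar with K-fold cross-validation, the following sketch shows how the sample indices can be split into K folds, each fold serving once as the validation set. It is written in plain Python for clarity; the helper name `k_fold_indices` is hypothetical, and in practice a library such as Scikit-learn provides an equivalent splitter. Shuffling the data beforehand, which is usually advisable, is left out.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

folds = list(k_fold_indices(n_samples=10, k=3))
for train, validation in folds:
    print("train:", train, "validation:", validation)
```

A model would then be trained on each `train` set and scored on the matching `validation` set, and the K scores averaged.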

Session 7. Advanced supervised models

- Tutorial class on advanced supervised models (Slides)

Decision trees
Logistic regression (Animation)
Support vector machine (Animation)

Session 8. Forecast loan repayment success with supervised classification models

- Lab exercise: Loan prediction (Jupyter notebook)

Model and visualize decision trees
Impact of the tree depth on the decision boundaries
Comparison with logistic regression and SVM

Session 9. Real-time camera classification using neural networks

- Tutorial class on neural networks (Slides)

Linear layers; Neurons; Activation function; Softmax layer
Backpropagation
Adversarial attacks

- Presentations by students

CNNs; RNNs; GNNs; GANs
Autoencoders; Attention mechanisms
Transfer learning; Interpretability; Ethical considerations

- Lab exercise: Live camera (Jupyter notebook)

Train and evaluate multi-layer perceptrons
Prevent overfitting with early stopping
Hyperparameters and architecture search
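
As a rough sketch of the early-stopping idea mentioned above: training is halted once the validation loss has failed to improve for a given number of epochs (the "patience"), and the model from the best epoch is kept. The function name, the patience value and the loss values below are invented for illustration; a real training loop would compute the validation loss after each epoch rather than read it from a list.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses.

    Training stops at the first epoch that lies `patience` epochs after
    the best one without having improved on it.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    # The losses ran out before the patience was exhausted.
    return len(val_losses) - 1, best_epoch

# Hypothetical validation losses: they improve, then start rising (overfitting).
val_losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.75, 0.9]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=2)
print(stop_epoch, best_epoch)
```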

Session 10. Apply unsupervised learning algorithms to predict digits

- Tutorial class on unsupervised learning (Slides)

K-means
Principal component analysis
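
To make the K-means item concrete, here is a minimal plain-Python sketch of the usual alternating scheme (often called Lloyd's algorithm): assign each point to its nearest centroid, then move each centroid to the mean of its points. The toy points and the explicit initial centroids are assumptions for the example; library implementations such as the one in Scikit-learn use more careful initialization.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def k_means(points, initial_centroids, n_iters=10):
    """Alternate assignment and update steps; return (labels, centroids)."""
    centroids = [tuple(c) for c in initial_centroids]
    labels = [0] * len(points)
    for _ in range(n_iters):
        # Assignment step: each point gets the index of its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its assigned points.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return labels, centroids

# Two hypothetical, well-separated groups of 2D points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = k_means(points, initial_centroids=[(0, 0), (0, 1)])
print(labels, centroids)
```

Even though both initial centroids sit inside the first group, the iterations separate the two groups on this toy data; on harder data the result can depend strongly on the initialization.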

- Lab exercise: Digit recognition (Jupyter notebook)

Representing and clustering data in 2D
Choosing PCA components and K-means clusters

Course Materials

Slides: Introduction; Machine Learning; Numerical Optimization; Fundamental Supervised Models; Advanced Supervised Models; Neural Networks; Unsupervised Learning

License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
