Algorithms for Data Analysis

This course is partly inspired by the one written by Charlotte Laclau.

The present course aims to develop a solid understanding of various algorithms and techniques used for processing and analyzing data. This includes gaining proficiency in implementing and evaluating algorithms, selecting appropriate algorithms for specific tasks, understanding algorithmic complexity and efficiency, and applying algorithms to solve real-world data analysis problems. The goal is to equip students with the necessary knowledge and skills to effectively and efficiently analyze large datasets, extract valuable insights, and make data-driven decisions.

Prerequisites

Prior knowledge of the fundamentals of both mathematics (analysis, linear algebra, probability, statistics) and computer science is required. In particular, students are advised to be proficient in:

• Basic concepts of probability and statistical inference
• Vector manipulation and matrix multiplication
• Basics of calculus of variations
• Linear algebra (eigenvalues and eigenvectors, hyperplanes, projections, etc.)
• Python programming and vector manipulation using NumPy

Educational Goals

This course is designed to provide a quick introduction to data analysis algorithms for data engineers. As such, it pursues several educational objectives:

• Understand the differences between artificial intelligence and machine learning
• Know which model / algorithm to choose depending on the task
• Learn how to train, validate and test machine learning models
• Discover the basics of numerical optimization
• Master data analysis Python libraries
• Apply the most popular supervised learning algorithms (linear regression, SVM, decision trees, etc.)
• Use two unsupervised learning algorithms (K-means, PCA)

Course Program

The course is given in 10 sessions of 1.5 or 3 hours each.

- Tutorial class on the concepts & libraries of data analysis (Slides)
• Data analysis; Types of data; Data representation
• Differences between AI and ML
• Python libraries: NumPy, Pandas, Scikit-learn

- Lab exercise: Video games sales (Jupyter notebook)
• Collecting and cleaning data
• Exploratory data analysis
• Data visualization

- Tutorial class on machine learning (Slides)
• Formal definition; Supervised learning; Unsupervised learning
• What is learning; Hypothesis space; Generalization
• Training, validating and testing models
• Hyperparameters; Data splitting techniques; Regularization
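
As an illustration of the data splitting item above, here is a minimal sketch of a train / validation / test split. The 60 / 20 / 20 proportions, the sample count and all variable names are assumptions chosen for the example, not values taken from the course.

```python
import numpy as np

# Hypothetical dataset size, chosen only for illustration.
n_samples = 100

# Shuffle the sample indices once so the three subsets are random and disjoint.
rng = np.random.default_rng(0)
indices = rng.permutation(n_samples)

# Assumed 60 / 20 / 20 split.
n_train, n_val = 60, 20
train_idx = indices[:n_train]                  # used to fit the model
val_idx = indices[n_train:n_train + n_val]     # used to tune hyperparameters
test_idx = indices[n_train + n_val:]           # used once, for the final evaluation

print(len(train_idx), len(val_idx), len(test_idx))
```

Keeping the test indices untouched until the very end is what makes the final score an honest estimate of generalization.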

- Lab exercise: Diamonds market (Jupyter notebook)
• 1D linear regression
• Link between statistics and ML
• Grid-search algorithm
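
The topics of this lab can be sketched on synthetic data. The example below is one possible reading of them, not the content of the notebook: it fits a 1D linear regression with the statistical closed form (slope = covariance / variance) and then recovers similar parameters with a brute-force grid search over the mean squared error. The data, the grid bounds and the variable names are all assumptions.

```python
import numpy as np

# Synthetic 1D data: a hypothetical stand-in for a price-versus-size relation.
rng = np.random.default_rng(0)
x = rng.uniform(0.2, 2.0, size=200)
y = 3.0 * x + 1.0 + rng.normal(scale=0.1, size=200)

# Statistical view: least-squares slope = cov(x, y) / var(x).
slope_ls = np.cov(x, y, bias=True)[0, 1] / np.var(x)
intercept_ls = y.mean() - slope_ls * x.mean()

# ML view: grid search over (slope, intercept) minimizing the mean squared error.
slopes = np.linspace(0.0, 6.0, 121)
intercepts = np.linspace(-2.0, 4.0, 121)
S, B = np.meshgrid(slopes, intercepts, indexing="ij")
predictions = S[..., None] * x + B[..., None]      # shape (121, 121, 200)
mse = ((predictions - y) ** 2).mean(axis=-1)       # one error value per grid point
i, j = np.unravel_index(np.argmin(mse), mse.shape)
slope_gs, intercept_gs = slopes[i], intercepts[j]

print(f"closed form: slope={slope_ls:.3f}, intercept={intercept_ls:.3f}")
print(f"grid search: slope={slope_gs:.3f}, intercept={intercept_gs:.3f}")
```

The grid search can only be as precise as its grid, and its cost grows quickly with the number of parameters, which is one motivation for the optimization methods covered next.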

- Tutorial class on numerical optimization (Slides)
• Principle
• Convex vs. nonconvex
• Gradient descent; Impact of the step-size
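
To make the impact of the step-size concrete, here is a minimal sketch of gradient descent on the one-dimensional function f(x) = x², whose gradient is 2x. The function, the starting point and the three step-sizes are assumptions picked for illustration.

```python
def gradient_descent(grad, x0, step_size, n_steps=50):
    """Run plain gradient descent and return the final iterate."""
    x = x0
    for _ in range(n_steps):
        x = x - step_size * grad(x)
    return x


def grad(x):
    # Gradient of f(x) = x**2, whose minimizer is x = 0.
    return 2.0 * x


# Each update multiplies x by (1 - 2 * step_size).
for step_size in (0.1, 0.9, 1.1):
    x_final = gradient_descent(grad, x0=1.0, step_size=step_size)
    print(f"step_size={step_size}: x after 50 steps = {x_final:.3e}")
```

For this particular function the iterates shrink toward 0 only when the step-size lies strictly between 0 and 1: a small step converges smoothly, a step close to 1 converges while oscillating around the minimum, and a step above 1 diverges.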

- Presentations by students
• Random search; SGD; Adam; BFGS; Coordinate descent; Line search
• Newton's method; Genetic algorithm; Branch-and-bound algorithm

- Tutorial class on fundamental supervised models (Slides)
• K-nearest neighbors
• Linear regression
• Ridge regression

- Lab exercise: Penguins of Antarctica (Jupyter notebook)
• Visualize meaningful features
• Cross-validation; K-fold cross-validation
• KNN Classification; Linear and KNN Regression
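
The combination of K-fold cross-validation and KNN classification listed above might look like the following sketch with scikit-learn. The Iris dataset is used here only as a stand-in that ships with the library (the lab itself uses penguin data), and the candidate values of k are assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in dataset: 150 flowers, 4 numeric features, 3 classes.
X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: each sample is used for validation exactly once.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

mean_accuracy = {}
for k in (1, 5, 15):
    # Scaling inside the pipeline keeps validation folds out of the fit.
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores = cross_val_score(model, X, y, cv=cv)
    mean_accuracy[k] = scores.mean()
    print(f"k={k}: mean accuracy over 5 folds = {scores.mean():.3f}")
```

Comparing the mean scores across values of k is a simple way to choose the number of neighbors without touching a held-out test set.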

- Tutorial class on advanced supervised models (Slides)
• Decision trees
• Logistic regression
• Support vector machine

- Lab exercise: Loan prediction (Jupyter notebook)
• Model and visualize decision trees
• Impact of the tree depth on the decision boundaries
• Comparison with logistic regression and SVM
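
The effect of tree depth can be explored with a sketch like the one below. It uses a synthetic two-class dataset from scikit-learn as a stand-in for the loan data, and the depths compared are assumptions chosen for illustration.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic 2D dataset with two interleaved classes (stand-in for real data).
X, y = make_moons(n_samples=500, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

train_acc, test_acc = {}, {}
for depth in (1, 3, None):  # None lets the tree grow until its leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    train_acc[depth] = tree.score(X_train, y_train)
    test_acc[depth] = tree.score(X_test, y_test)
    print(f"max_depth={depth}: train={train_acc[depth]:.3f}, test={test_acc[depth]:.3f}")
```

An unrestricted tree fits this training set perfectly, which usually comes with a more irregular decision boundary. Comparing training and test accuracy across depths is one way to spot overfitting.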

- Tutorial class on neural networks (Slides)
• Linear layers; Neurons; Activation function; Softmax layer
• Backpropagation
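
As a small worked instance of these topics, the sketch below chains one linear layer with a softmax layer, computes the cross-entropy loss, and backpropagates its gradient by hand. A finite-difference check then compares one analytic gradient entry with a numerical estimate. The layer sizes, the random data and the function names are assumptions made for the example.

```python
import numpy as np


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def loss_and_grads(W, b, x, y):
    """Cross-entropy of softmax(x @ W + b), with gradients w.r.t. W and b."""
    n = x.shape[0]
    p = softmax(x @ W + b)                       # forward pass
    loss = -np.log(p[np.arange(n), y]).mean()
    dz = p.copy()                                # backward pass: dL/dz = (p - onehot) / n
    dz[np.arange(n), y] -= 1.0
    dz /= n
    return loss, x.T @ dz, dz.sum(axis=0)


rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))                      # 8 samples, 4 features
y = rng.integers(0, 3, size=8)                   # 3 classes
W = rng.normal(scale=0.1, size=(4, 3))
b = np.zeros(3)

loss, dW, db = loss_and_grads(W, b, x, y)

# Finite-difference check of a single weight gradient.
eps = 1e-6
W_plus, W_minus = W.copy(), W.copy()
W_plus[0, 0] += eps
W_minus[0, 0] -= eps
numeric = (loss_and_grads(W_plus, b, x, y)[0]
           - loss_and_grads(W_minus, b, x, y)[0]) / (2 * eps)
print(f"analytic={dW[0, 0]:.6f}, numeric={numeric:.6f}")
```

Deeper networks apply the same chain rule layer by layer, which is what automatic differentiation libraries do in practice.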

- Presentations by students
• CNNs, RNNs, GNNs, GANs
• Autoencoders; Attention mechanisms
• Transfer learning; Interpretability; Ethical considerations

- Lab exercise: Live camera (Jupyter notebook)
• Train and evaluate multi-layer perceptrons
• Prevent overfitting with early stopping
• Hyperparameters and architecture search

- Tutorial class on unsupervised learning (Slides)
• K-means
• Principal component analysis

- Lab exercise: Digits recognition (Jupyter notebook)
• Representing and clustering data in 2D
• Choosing PCA components and K-means clusters
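
The two steps of this lab could be sketched as follows with scikit-learn, assuming its bundled handwritten digits dataset is a reasonable stand-in for the lab data. The 90% variance threshold and the candidate cluster counts are assumptions, not values from the course.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# 8x8 grayscale digit images, flattened to 64 features per sample.
X, y = load_digits(return_X_y=True)

# Represent the data in 2D with PCA, then cluster the projected points.
X_2d = PCA(n_components=2).fit_transform(X)
labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(X_2d)

# Choosing PCA components: smallest number explaining at least 90% of the variance.
cumulative = np.cumsum(PCA().fit(X).explained_variance_ratio_)
n_components_90 = int(np.searchsorted(cumulative, 0.90) + 1)

# Choosing K: compare the K-means inertia for a few cluster counts.
inertias = {
    k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_2d).inertia_
    for k in (2, 5, 10, 15)
}

print("components for 90% variance:", n_components_90)
print("inertia per K:", inertias)
```

Plotting `X_2d` colored by `labels`, and the inertia against K, gives the usual visual aids for these two choices.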