Machine learning for data science - CHE00045M

Department: Chemistry
Credit value: 20 credits
Credit level: M
Academic year of delivery: 2023-24

Module summary

Data Science is the science and craft of extracting information from, and testing hypotheses against, data. A universe of statistical and machine learning techniques exists to help us do this. This module explores that universe through a series of lectures and ‘hands-on’ Python/PyTorch/Scikit-Learn workshops that are structured to give you a strong foundation stretching from design through development to deployment. It equips you to build robust, reliable, and scalable data pipelines that integrate machine learning models for your own data science projects.

Module will run

Occurrence	Teaching period
A	Semester 1 2023-24

Module aims

How can data help us to answer scientific questions? The aim of this module is to familiarise you with different machine learning problem domains (e.g. supervised, unsupervised, and reinforcement learning) and give you an appreciation of the kinds of machine learning models available, together with ‘hands-on’ experience designing, developing, and deploying robust, reliable, and scalable data pipelines that integrate these models in Python/PyTorch/Scikit-Learn.

You will learn how to preprocess and partition data, select and/or design useful features, and implement, evaluate, and improve machine learning models. You will also learn to work with deep learning models, e.g. convolutional, graph, and recurrent neural networks; you will learn how to use these models with structured data, and get ‘hands-on’ experience implementing them in Python/PyTorch. On completion of this module, you will come away with a strong foundation of applications-focused knowledge and practical skills stretching across the whole data science pipeline; you will be able to design, develop, deploy, and evaluate your own (deep) machine learning solutions.
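As an illustration of the preprocess, partition, implement, and evaluate steps described above, here is a minimal sketch using `scikit-learn`. The synthetic data set, the 75/25 split, and the choice of logistic regression are assumptions made for the example and are not taken from the module materials.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a real data set (an assumption for illustration).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Partition: hold out 25% of the samples for evaluation, preserving class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Preprocess and model in one pipeline, so the scaler is fitted on training data only.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Evaluate on the held-out partition.
test_accuracy = model.score(X_test, y_test)
print(f"Held-out accuracy: {test_accuracy:.2f}")
```

Wrapping the scaler and the classifier in a single pipeline is one way to avoid leaking information from the test partition into the preprocessing step.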

Module learning outcomes

Students will be able to:

Distinguish different machine learning problem types: supervised vs. unsupervised vs. reinforcement; classification vs. regression.
Carry out data preprocessing, partitioning, and feature selection.
Implement supervised and unsupervised machine learning algorithms using `scikit-learn`.
Select and appraise alternative machine learning algorithms.
Evaluate the performance of a machine learning algorithm and implement techniques to improve it.
Implement deep learning algorithms to work with structured and unstructured data in the domains of image classification and natural language processing.
Design, develop, and deploy components across the data exploration, preprocessing, and prediction pipelines, constructing ‘end-to-end’ solutions.
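
To give a flavour of the unsupervised side of these outcomes, the following sketch reduces a small data set to two dimensions with PCA and then clusters it with k-means in `scikit-learn`. The Iris data set, the number of components, and the number of clusters are assumptions chosen for illustration only.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# 150 samples with 4 continuous features; the class labels are deliberately unused.
X = load_iris().data

# Standardise the features so that each contributes equally to the PCA.
X_scaled = StandardScaler().fit_transform(X)

# Project onto the first two principal components.
X_2d = PCA(n_components=2).fit_transform(X_scaled)

# Group the projected samples into three clusters.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(X_2d)

print(X_2d.shape, sorted(int(c) for c in set(cluster_labels)))
```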

Module content

Machine learning problem domains.
Data preprocessing: categorical and continuous data.
Supervised learning: e.g., (non)linear and logistic regression, support vector machines (SVMs), decision trees.
Unsupervised learning: e.g., principal component analysis (PCA), t-distributed stochastic neighbour embedding (t-SNE), clustering.
Model (cross-)validation and evaluation.
Model improvement: hyperparameter optimisation.
Neural networks: multilayer perceptrons (MLP); convolutional (CNN), graph (GNN), and recurrent (RNN) neural networks.
Generative machine learning models: autoencoders and generative adversarial networks (GANs).
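
The cross-validation and hyperparameter optimisation topics listed above can be sketched together as a grid search in `scikit-learn`. This is a minimal example under stated assumptions: the Iris data set, the SVM classifier, and the small parameter grid are illustrative choices, not the module's prescribed workflow.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Keep a final test partition that the search never sees.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

pipeline = Pipeline([("scale", StandardScaler()), ("svc", SVC())])

# Hypothetical grid: 3 values of C x 2 kernels = 6 candidate settings.
param_grid = {"svc__C": [0.1, 1.0, 10.0], "svc__kernel": ["linear", "rbf"]}

# 5-fold cross-validation on the training partition for each candidate.
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Mean cross-validated accuracy:", search.best_score_)
print("Held-out accuracy:", search.score(X_test, y_test))
```

Scoring the selected model on a partition that was excluded from the search gives a less optimistic estimate than the cross-validated score used to choose the hyperparameters.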

Indicative assessment

Task	% of module mark
Essay/coursework	50.0
Essay/coursework	50.0

Special assessment rules

None

Additional assessment information

Task	Components	% of module mark
Machine learning classification and regression exercises	2× computer programs (25% each)	50.0
Deep learning exercise	Written report (25%) + computer program (25%)	50.0

Indicative reassessment

Task	% of module mark
Essay/coursework	50.0
Essay/coursework	50.0

Module feedback

Feedback will be provided through workshops, online exercises and a formative assessment. Feedback on summative work will be provided within 25 working days of the assessment.

Indicative reading

Introduction to data science: a Python approach to concepts, techniques, and applications
Laura Igual, Santi Seguí. Springer 2017
Python for data analysis: data wrangling with Pandas, NumPy, and IPython
Wes McKinney. O'Reilly 2017
The hundred-page machine learning book
Andriy Burkov. Andriy Burkov 2019
Machine learning engineering
Andriy Burkov. True Positive Ltd. 2020
Machine learning with PyTorch and Scikit-Learn
Sebastian Raschka, Yuxi Liu, Vahid Mirjalili. Packt Publishing Ltd. 2022