Data analysis and machine learning - CHE00045M

  • Department: Chemistry
  • Module co-ordinator: Prof. Kevin Cowtan
  • Credit value: 20 credits
  • Credit level: M
  • Academic year of delivery: 2023-24
    • See module specification for other years: 2024-25

Module summary

Data Science is the science and craft of extracting information from, and testing hypotheses against, data. A universe of statistical and machine learning techniques exists to help us do this. This module explores that universe through a series of lectures and ‘hands-on’ Python/PyTorch/Scikit-Learn workshops that are structured to give you a strong foundation stretching from design through development to deployment. It equips you to build robust, reliable, and scalable data pipelines that integrate machine learning models for your own data science projects.

Module will run

Occurrence | Teaching period
A | Semester 1 2023-24

Module aims

How can data help us to answer scientific questions? The aim of this module is to familiarise you with different machine learning problem domains (e.g. supervised, unsupervised, and reinforcement learning) and give you an appreciation of the kinds of machine learning models available, in addition to ‘hands-on’ experience designing, developing, and deploying robust, reliable, and scalable data pipelines that integrate these models in Python/PyTorch/Scikit-Learn.

You will learn how to preprocess and partition data, select and/or design useful features, and implement, evaluate, and improve machine learning models. You will also learn to work with deep learning models, e.g. convolutional, graph, and recurrent neural networks; you will learn how to use these models with structured data, and get ‘hands-on’ experience implementing them in Python/PyTorch. On completion of this module, you will come away with a strong foundation of applications-focused knowledge and practical skills stretching across the whole data science pipeline; you will be able to design, develop, deploy, and evaluate your own (deep) machine learning solutions.

Module learning outcomes

Students will be able to:

  • Distinguish different machine learning problem types: supervised vs. unsupervised vs. reinforcement; classification vs. regression.

  • Carry out data preprocessing, partitioning, and feature selection.

  • Implement supervised and unsupervised machine learning algorithms using `scikit-learn`.

  • Select and appraise alternative machine learning algorithms.

  • Evaluate the performance of a machine learning algorithm and implement techniques to improve it.

  • Implement deep learning algorithms to work with structured and unstructured data in the domains of image classification and natural language processing.

  • Design, develop, and deploy components across the data exploration, preprocessing, and prediction pipelines, constructing ‘end-to-end’ solutions.
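
As an illustration of how several of these outcomes fit together (preprocessing, partitioning, evaluation, and improvement), the following is a minimal sketch using `scikit-learn`. It is not taken from the module materials: the synthetic data, the helper names `make_toy_data` and `build_search`, the choice of logistic regression, and the grid of `C` values are all assumptions made for the example.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def make_toy_data(n_samples=200, seed=0):
    """Synthetic data (assumed for illustration): two continuous features
    in columns 0-1 and one categorical feature (codes 0, 1, 2) in column 2."""
    rng = np.random.default_rng(seed)
    continuous = rng.normal(size=(n_samples, 2))
    category = rng.integers(0, 3, size=(n_samples, 1))
    X = np.hstack([continuous, category])
    # The label depends on the first continuous feature and on the category.
    y = ((continuous[:, 0] + 0.5 * (category[:, 0] == 2)) > 0).astype(int)
    return X, y


def build_search():
    """Preprocessing + classifier pipeline wrapped in a cross-validated grid search."""
    preprocess = ColumnTransformer([
        ("scale", StandardScaler(), [0, 1]),                       # continuous columns
        ("onehot", OneHotEncoder(handle_unknown="ignore"), [2]),  # categorical column
    ])
    pipeline = Pipeline([
        ("preprocess", preprocess),
        ("model", LogisticRegression(max_iter=1000)),
    ])
    # Hyperparameter optimisation: try several regularisation strengths.
    param_grid = {"model__C": [0.01, 0.1, 1.0, 10.0]}
    return GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")


X, y = make_toy_data()

# Partition the data: hold out 25% as a test set that the search never sees.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

search = build_search()
search.fit(X_train, y_train)                  # 5-fold cross-validation on the training set
test_accuracy = search.score(X_test, y_test)  # accuracy of the refitted best model

print("Best C:", search.best_params_["model__C"])
print("Held-out accuracy:", round(test_accuracy, 3))
```

Placing the preprocessing steps inside the pipeline means they are refitted on each training fold, so no information from a validation fold leaks into the scaler or encoder.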

Module content

  • Machine learning problem domains.

  • Data preprocessing: categorical and continuous data.

  • Supervised learning: e.g., (non)linear and logistic regression, support vector machines (SVMs), decision trees.

  • Unsupervised learning: e.g., principal component analysis (PCA), t-distributed stochastic neighbour embedding (t-SNE), clustering.

  • Model (cross-)validation and evaluation.

  • Model improvement: hyperparameter optimisation.

  • Neural networks: multilayer perceptrons (MLP); convolutional (CNN), graph (GNN), and recurrent (RNN) neural networks.

  • Generative machine learning models: autoencoders and generative adversarial networks (GANs).
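
To make the "Model (cross-)validation" item in the list above concrete, here is a small sketch of how k-fold cross-validation partitions a dataset, written with the Python standard library only. The function name `k_fold_indices` and its parameters are hypothetical and chosen for illustration; in practice a library routine such as `sklearn.model_selection.KFold` would normally be used.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (training, validation) index lists for k-fold cross-validation.

    Every sample appears in exactly one validation fold, and fold sizes
    differ by at most one.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must satisfy 2 <= k <= n_samples")

    # Shuffle once so that folds are not tied to the original sample order.
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)

    # The first (n_samples % k) folds receive one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]

    folds = []
    start = 0
    for size in fold_sizes:
        folds.append(indices[start:start + size])
        start += size

    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


splits = list(k_fold_indices(n_samples=10, k=3))
for fold_number, (training, validation) in enumerate(splits):
    print(f"fold {fold_number}: train on {len(training)}, validate on {len(validation)}")
```

A model would be trained and scored once per fold, and the k validation scores averaged to estimate how well it generalises.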

Assessment

Task | Length | % of module mark
Essay/coursework: Data analysis problem: Deep learning exercise | N/A | 50
Essay/coursework: Data analysis problem: Machine learning classification and regression exercises | N/A | 50

Special assessment rules

None

Additional assessment information

Machine learning classification and regression exercises (50%): 2 × computer programs (25% each).

Deep learning exercise (50%): written report (25%) + computer program (25%).

Reassessment

Task | Length | % of module mark
Essay/coursework: Data analysis problem: Individual data analysis problem | N/A | 50
Essay/coursework: Data analysis problem: Individual data analysis problem | N/A | 50

Module feedback

Feedback will be provided through workshops, online exercises and a formative assessment. Feedback on summative work will be provided within 25 working days of the assessment.

Indicative reading

  • Introduction to data science: a Python approach to concepts, techniques, and applications
    Laura Igual, Santi Seguí. Springer 2017

  • Python for data analysis: data wrangling with Pandas, NumPy, and IPython
    Wes McKinney. O'Reilly 2017

  • The hundred-page machine learning book
    Andriy Burkov. Andriy Burkov 2019

  • Machine learning engineering
    Andriy Burkov. True Positive Ltd. 2020

  • Machine learning with PyTorch and Scikit-Learn
    Sebastian Raschka, Yuxi Liu, Vahid Mirjalili. Packt Publishing Ltd. 2022



The information on this page is indicative of the module that is currently on offer. The University is constantly exploring ways to enhance and improve its degree programmes and therefore reserves the right to make variations to the content and method of delivery of modules, and to discontinue modules, if such action is reasonably considered to be necessary by the University. Where appropriate, the University will notify and consult with affected students in advance about any changes that are required in line with the University's policy on the Approval of Modifications to Existing Taught Programmes of Study.