Linguistics as Data Science - LAN00058I

Department: Language and Linguistic Science
Credit value: 20 credits
Credit level: I
Academic year of delivery: 2026-27

Module summary

This module will familiarise students with the essential tools of data science as they apply to linguistics. Students will develop skills and knowledge in data management, visualisation and statistical modelling through the analysis of linguistic data sets in the R statistical software environment.

Module will run

Occurrence	Teaching period
A	Semester 2 2026-27

Module aims

Data science is a way of using evidence to better understand the world around us, through a combination of scientific methods, statistics, and subject-specific knowledge. This module will familiarise students with the essential tools of data science as they apply to linguistics. Students will develop skills and knowledge in data management, visualisation and statistical modelling through the analysis of linguistic data sets in the R statistical software environment. An additional goal of this module is to foster quantitative literacy in general, helping students become critical consumers of arguments based on numbers, regardless of whether those arguments are made in linguistics or other fields such as politics or economics.

The module directly feeds into advanced research-oriented modules that students typically attend in their third year.

Module learning outcomes

Students who complete this module will be able to:

understand and critically evaluate quantitative arguments and statistical analyses in linguistics and elsewhere;
perform a wide variety of data-related tasks in the R statistical software environment;
create, manage and manipulate tidy data sets;
design and produce professional and informative visualisations;
build statistical models that can be used to make predictions and evaluate hypotheses;
present quantitative results following established conventions in the field of linguistics.

Module content

This module will help students learn to understand and analyse evidence using linguistic data. It breaks with the conventional format of statistics modules, which tend to spend an inordinate amount of time on learning about and running simple statistical tests (such as t-tests and chi-squared tests). Instead, we take a holistic approach based on the principles of data science, in which understanding and analysing evidence involves more than simply running a statistical test. It also involves tidying data sets (i.e. spreadsheets) to remove errors, managing data sets, describing patterns in a data set, and visualising data using graphs.

Indicative assessment

Task	% of module mark
Groupwork	20.0
Open Exam (7-day week)	80.0

Special assessment rules

None

Additional assessment information

Summative group project (20%): Group project presented in the form of a wiki and group-work diary

Students will be assigned to small groups of around four or five members. Each group will choose a project from a short list of three or four options. Each project has an associated data set. The groups will be required to critically evaluate previous studies on the topic, particularly in terms of methods, analysis and statistics. The project must include an analysis of the data set itself, and students must present visualisations of the data (i.e. figures/graphs) as well as the results of statistical modelling. The final project will be submitted in the form of a wiki (maximum 1,500 words) along with a group-work diary outlining who did what at each stage of the project.

This assessment is designed to test students’ abilities to understand and evaluate the data analysis and statistical methods employed in previous research, in order to make them ‘informed consumers’ of data. The project will also test practical skills using R. We aim to keep progress towards the group project incremental throughout the semester by setting weekly milestones (based on lecture and lab content) that describe what students should have completed by a given point. This spreads the workload for this assessment across a number of weeks.

Summative open exam (80%): 7-day open exam

The main summative element of the module will be a 7-day open exam in two parts. The first part will take a quiz-style question-and-answer format. In the second part, students will be provided with a data set and asked to tidy and visualise the data and run statistical analyses. For this second part, students will be required to submit a written report, as well as the tidy data set and R code. The final mark will be based on a combination of all of these elements.

Indicative reassessment

Task	% of module mark
Open Exam (7-day week)	100.0

Module feedback

Feedback will be provided within 25 days of submission.

Indicative reading

Field, A. P. (2000) Discovering Statistics using SPSS for Windows: Advanced Techniques for the Beginner. London: Sage.

Hudson, T. (2015) Presenting quantitative data visually. In L. Plonsky (Ed.) Advancing Quantitative Methods in Second Language Research. New York: Routledge. pp. 98-125.

Larson-Hall, J. (2016) A Guide to Doing Statistics in Second Language Research using SPSS and R. London: Routledge.

Levshina, N. (2015) How to do Linguistics with R: Data Exploration and Statistical Analysis. Amsterdam: John Benjamins.

Wickham, H. and Grolemund, G. (2016) R for Data Science: Import, Tidy, Transform, Visualise, and Model Data. Beijing: O’Reilly.