Linguistics as Data Science - LAN00058I

Department: Language and Linguistic Science
Credit value: 20 credits
Credit level: I
Academic year of delivery: 2022-23
- See module specification for other years: 2023-24 2024-25 2025-26

Module summary

This module prepares students to analyse linguistic data sets, from small to very large, using quantitative methods, including data tidying, visualisation and statistics. It serves as a foundation for advanced third-year modules that rely on independent research.

Module will run

Occurrence	Teaching period
A	Spring Term 2022-23 to Summer Term 2022-23

Module aims

Data science is a way of using evidence to better understand the world around us, through a combination of scientific methods, statistics, and subject-specific knowledge. This module will familiarise students with the essential tools of data science as they apply to linguistics. Students will develop skills and knowledge in data management, visualisation and statistical modelling through the analysis of linguistic data sets in the R statistical software environment. An additional goal of this module is to foster quantitative literacy in general, helping students become critical consumers of arguments based on numbers, regardless of whether those arguments are made in linguistics or other fields such as politics or economics.

The module directly feeds into advanced research-oriented modules that students typically attend in their third year. It is currently a prerequisite for Advanced Topics in Phonetics and Phonology (L13H) and Methods in Language Variation and Change (E/L06H).

Module learning outcomes

Students who complete this module will be able to:

understand and critically evaluate quantitative arguments and statistical analyses in linguistics and elsewhere;
perform a wide variety of data-related tasks in the R statistical software environment;
create, manage and manipulate tidy data sets;
design and produce professional and informative visualisations;
build statistical models that can be used to make predictions and evaluate hypotheses;
present quantitative results following established conventions in the field of linguistics.

Module content

Data and statistics are all around us. They are used and misused to build narratives and to support agendas. Being able to understand and critically evaluate data and statistics is, therefore, a key skill in today’s society.

This module will help students learn to do this using linguistic data. It will break with the conventional format of statistics modules, which tend to spend an inordinate amount of time on learning about and running simple statistical tests (such as t-tests and chi-squared tests). Instead, we take a holistic approach based on the principles of data science. In data science, understanding and analysing evidence involves more than simply running a statistical test. It also involves tidying data sets (i.e. spreadsheets) to remove errors, managing data sets, describing the patterns in a data set, and visualising them using graphs.

The first few weeks of the module are devoted entirely to descriptive statistics, data visualisation and data exploration, giving students a sense of what they can do with data and enabling them to communicate statistical results in an elegant and efficient way. The rest of the module focuses on modelling. All statistical techniques rely on models of reality. Models are a way of generalising patterns to the wider population and making predictions about future behaviour (for example, weather forecasts are inherently statistical and rely on data). In this module, we will start with simple models and gradually introduce more complexity in order to deal with different types of linguistic data. Throughout, we will use examples from phonetics, sociolinguistics, syntax, psycholinguistics and language acquisition.

This structure reflects current practices in data analysis much better than the conventional format does, and it allows us to give students up-to-date skills and knowledge that will serve them well not just in third-year modules but also in their future careers.

Indicative assessment

Task	% of module mark
Groupwork	20.0
Open Exam (7-day week)	80.0

Special assessment rules

None

Additional assessment information

This section provides a more detailed overview of the assessments in the module. Assessment will consist of two pieces of summative work:

Summative group project (20%): Group project presented in the form of a wiki and group-work diary in Week 10 (T2)
Students will be assigned to small groups of around four or five members. Each group will choose a project from a short list of three or four options. Each project has an associated data set. The groups will be required to critically evaluate previous studies on the topic, particularly in terms of methods, analysis and statistics. The project must include an analysis of the data set itself and students must present visualisations of the data (i.e. figures/graphs) as well as the results of statistical modelling (by week 10, they will be able to run simple models). The final project will be submitted in the form of a wiki (max 1500 words) along with a ‘group-work’ diary outlining who did what at each stage of the project. Submission of the project will be in Week 10 of T2.

This assessment is designed to test students’ abilities to understand and evaluate the data analysis and statistical methods employed in previous research, in order to make them ‘informed consumers’ of data. The project will also test practical skills using R. We aim to make sure that progress towards the group project is incremental throughout the Spring term by setting weekly milestones (based on lecture and lab content) describing what students should have completed by a given point. This means that the workload for this assessment is spread across a number of weeks.

Summative open exam (80%): 7-day open exam (set Wednesday Week 4, T3)

The main summative element of the module will be a 7-day open exam. The exam will be in two parts. The first will be in a quiz-style question-and-answer format. In the second part, students will be provided with a data set and asked to tidy and visualise the data and run statistical analyses. For this part, students will be required to submit a written report, as well as the tidy data set and R code. The final mark will be based on a combination of all of these elements.

Formative assignments

For this module, formative assessment will be in the form of small biweekly exercises. Some of these will be directed tasks where students have to create a specific graph or run a specific model, while others will be mini-quizzes run through the VLE. Students will be assessed on a number of elements as part of the formative assessment. These will include factual information (quiz-style questions similar to those in the summative open exam), quality of R code, tidy data sets and appropriate statistical analyses. The exercises for the formative assessments will align with the weekly topics in the lectures and the labs.

Indicative reassessment

Task	% of module mark
Open Exam (7-day week)	100.0

Module feedback

In line with University policy, students will receive written feedback on summative work along with a mark on the University scale within 20 working days of submission, or 25 working days for work submitted in Summer Week 5.

Formative exercises will be mostly small directed tasks such as creating a specific graph or running a specific model, and mini-quizzes managed through the VLE. Students will be given written (often automated) feedback and a mark on the University scale for the formative assessments. We will provide feedback within a reasonable time scale for it to be useful for the summative assessments.

Throughout the module students will be given other forms of feedback. Students will receive feedback on lab exercises in the form of corrections and model answers, as well as email feedback as required. Students will also receive a range of oral feedback, including group-level advice for the purposes of the group summative assessment in T2. Students will be encouraged to seek individual oral feedback during staff open hours.

Indicative reading

Wickham, H. and Grolemund, G. (2017). R for Data Science. O'Reilly, Sebastopol, CA.

Freedman, D., Pisani, R. and Purves, R. (2007). Statistics. W. W. Norton & Company, New York, NY.

Levshina, N. (2015). How to do linguistics with R: Data exploration and statistical analysis. John Benjamins, Amsterdam/Philadelphia.