Organising and documenting your data
Good organisation of data means that you'll be able to find what you want when you need it.
Documenting your data will help you understand your data and how you derived it, and will help other researchers interpret your results. Although this page concentrates on digital data, you should remember to extend the same care and attention to data held in physical form.
It is important at the planning stage to consider what format your files will be held in. This will be determined by a number of factors including:
- The nature of your research and its discipline-specific context
- What software you have prior experience of and what is available to you
- The hardware you are using to collect your data (you may have no choice in what file formats are produced by a particular instrument or piece of equipment)
- The workflows you intend to put in place to collect and analyse your data
- The digital storage space that is available to you - your choice of file format may have an impact on the size of your final dataset.
However, it is also useful to give some thought to the life of your research data after your project is complete. Your data may have a use beyond your project: others may wish to consult it to verify your findings or to inform further research. You should also consider:
- What formats are easiest to share
- What formats are easiest for others to reuse (both now and in the future)
- What formats are at least risk of obsolescence as new versions are produced
- What formats include appropriate levels of metadata (many formats have metadata embedded in them that is essential for future reuse of the data)
- If you are planning to deposit your files in a data archive or repository, what file formats it will accept.
You may find that you need to use one format for your own data recording and analysis and another for data preservation at the end of a project.
Future proof your file formats
We do not know which data formats will still be readable in 10 or 20 years' time, but the following guidance may help inform decision making:
| Recommendation | Explanation | Example |
|---|---|---|
| Use ASCII formats rather than binary formats | ASCII formats can be opened in a basic text editor and the text within them should be readable. If your data can be viewed successfully in a text editor it should be readable into the future. | A comma separated values (csv) file can be opened within a text editor, and the data can be inspected there or imported into a range of other software packages. An Excel spreadsheet (xlsx) can't be viewed within a text editor, and future users of the data are reliant on having Microsoft Office or another package that can read these files. |
| Use file formats with open specifications rather than closed formats | Some file formats have open published specifications - documentation that describes the format in detail. This is true of many of the popular standards currently in use. If a file format has a published specification it would be possible for a developer to re-engineer a piece of software to read that file format even if the original creating application was no longer accessible. | TIFF is a good format for long term preservation of digital image files. The TIFF image specification was published in 1992 and is available for anyone to consult. |
| Use uncompressed rather than compressed formats | Compression is often introduced into file formats in order to make the data smaller and easier to store; however, this can lead to data loss. Uncompressed versions of files should be maintained where possible. | Uncompressed TIFF is the de facto standard for storage of image files. Use of other formats that include compression (for example JPEG) will result in a lower image quality. |
| Adopt widely used formats not obscure ones | Ubiquitous file formats that are in very wide circulation are more likely to continue to be readable for some time. Even if the creating application becomes obsolete, it is likely that the critical mass of users will ensure future software providers will supply the necessary import routines. For cutting edge research projects where the use of new techniques for data creation and analysis necessitates the use of obscure and niche file formats, this will not be a major consideration. | Popular file formats include Microsoft Office files, PDF files, TIFF, JPEG and PNG images. |
| Keep a copy of your files in their original format as produced by the creating application | Original files from the creating application often contain more information than those that are subsequently derived from them. Derived files may be of poorer quality due to the introduction of compression or the loss of other information such as metadata. Derived files may also be harder to extract information from. | An original Microsoft Word document is easier to preserve for the long term than a PDF version of the same file. If your images were originally produced in TIFF format, the TIFF originals will be uncompressed and may also include valuable metadata that is not translated across into subsequent image formats. |
| If you want to use PDF files, consider using PDF/A | PDF/A is the archival version of the PDF standard. Based on PDF 1.4, it reduces the complexity of some of the more recent versions of PDF and ensures that all necessary information is embedded into the file. | PDF/A isn't reliant on certain system fonts being present on the device on which it is accessed, so this makes it more likely to be readable in the future. |
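As an illustration of the first recommendation, the following minimal Python sketch writes a small table as CSV and applies a rough check of whether a file's bytes would be readable in a text editor. The function names, the example column names and the heuristic itself are assumptions made for illustration; they are not part of any standard or of this guidance.

```python
import csv
import io


def rows_to_csv_text(header, rows):
    """Serialise tabular data as CSV, which is plain text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


def looks_like_plain_text(data: bytes) -> bool:
    """Rough heuristic (an assumption, not a formal test): plain text
    contains no NUL bytes and decodes as UTF-8, a superset of ASCII."""
    if b"\x00" in data:
        return False
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    # Hypothetical example data
    text = rows_to_csv_text(
        ["sample_id", "temperature_c"],
        [["S001", 21.5], ["S002", 19.8]],
    )
    print(text)
    print(looks_like_plain_text(text.encode("utf-8")))
```

A binary format such as xlsx would normally fail this check, because it is a compressed archive whose bytes are not readable text.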
Further useful sources of guidance on formats
- UK Data Service File formats and software
- Archaeology Data Service Guides to good practice
File names and structures
A well organised directory structure and clear, meaningful and consistent file naming practices are essential to identify files and find information quickly and accurately. Make use of directories to help organise your files into groups and structures that are meaningful to you and your colleagues. Good file and directory naming improves searching, helps you and others distinguish documents from one another, allows documents to be sorted into a logical order, and makes it easier to interpret documents and information from their file name.
- Agree the adoption of a standard file naming convention at the start of a project. Check if any existing conventions are already in use.
- Think about ordering the elements within a filename logically and to enable sorting.
- Think about how to structure your files into meaningful folders before the number of files proliferates.
- Structure your folders in a logical hierarchy.
- Consider the use of YYYY-MM-DD date conventions if appropriate. This is important if you want to sort your files into date order.
- Consider how you are going to approach version control to manage multiple versions of the same file. Including clear version information in a file name will help.
Find out more about good practice in this area by consulting the University's Records Management guidance for naming files and folders and version control.
The UK Data Service's Organising data pages provide useful information on file naming, file structures and version control.
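The file naming advice above can be sketched in a few lines of Python. This is one possible convention, not a prescribed one: the element order (date, project, description, version), the underscore separators, the two-digit version number and the function names are all assumptions chosen for illustration.

```python
import re
from datetime import date


def slugify(text: str) -> str:
    """Lower-case the text and replace runs of other characters with hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def build_filename(created: date, project: str, description: str,
                   version: int, extension: str) -> str:
    """Compose a name as YYYY-MM-DD_project_description_vNN.ext (hypothetical scheme)."""
    return (
        f"{created.isoformat()}_{slugify(project)}_"
        f"{slugify(description)}_v{version:02d}.{extension}"
    )


if __name__ == "__main__":
    name = build_filename(date(2024, 3, 5), "Soil Survey", "Site A readings", 2, "csv")
    print(name)  # 2024-03-05_soil-survey_site-a-readings_v02.csv
```

Because the date comes first and is written as YYYY-MM-DD, an ordinary alphabetical sort of the file names also places them in date order. Zero-padding the version number keeps v02 ahead of v10 in the same sort.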
Documentation and metadata
Documentation and metadata are the contextual information required to make data meaningful and to aid its interpretation both now and in the future.
Documentation should include information on who created data and why, a description of the data, methodology and methods, units of measurement, and definitions of codes/jargon. It may also include references to related data or to software/computer code (see note 1).
Metadata (or data about data) may be a more structured description of data which could be indexed and stored within a database or catalogue but it may also be information extracted from the instruments that you have used to collect your data (which record the settings that have been used).
Why do you need it?
- helps you to understand your own data when you need to come back to it
- enables you to find and use your own data quickly and easily
- helps in the sharing of data with others
- provides context to minimise the risk of misunderstanding or misuse
- is essential to the longer term preservation of data as a record of provenance, licensing and access arrangements
- is required to make data FAIR: Findable, Accessible, Interoperable and Reusable.
When should you create it?
You need to think about the documentation you might need for your data at an early stage of your project and while data collection and analysis are being carried out. Documentation is not something that should be left until the end of the project. It is much easier to record relevant information about your dataset while it is fresh in your memory, and doing so helps ensure that key details are not forgotten.
What form should your documentation take?
Documentation can be in any form that is appropriate to your research and the dataset it describes. It may be in the form of a simple free text document (often called a readme file), an .xml file extracted from an instrument, a spreadsheet of captions for a set of images, or information embedded within the files themselves. The use of good file names and logical directory structures may be enough to adequately describe and give context to some types of data.
Whatever form your documentation takes, you need to ensure that it is accessible alongside your data if anyone needs to use it to interpret your results. For example, you could place a readme file at the root of your dataset. This is typically a plain text file with the file name 'readme', to encourage users to read it before looking at the content.
Cornell University provides a Guide to writing "readme" style metadata.
Readme file template (2kb download): an example of a generic readme template that you can use to document your data.
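As a rough sketch of how a readme might be generated alongside a dataset, the Python below writes a plain text file at the root of a dataset folder. The field names are taken loosely from the list of things documentation should include, but the exact headings, the file name readme.txt and the placeholder text are assumptions for illustration; this is not the University's readme template.

```python
from pathlib import Path

# Hypothetical headings, based on the kinds of information documentation should cover
README_FIELDS = [
    "Creator",
    "Purpose",
    "Description",
    "Methodology",
    "Units of measurement",
    "Definitions of codes",
]


def render_readme(details: dict) -> str:
    """Return readme text with one 'Field: value' line per heading."""
    lines = []
    for field in README_FIELDS:
        value = details.get(field, "").strip()
        # Flag missing information rather than silently omitting the heading
        lines.append(f"{field}: {value or 'TO BE COMPLETED'}")
    return "\n".join(lines) + "\n"


def write_readme(dataset_root: Path, details: dict) -> Path:
    """Write readme.txt at the root of the dataset and return its path."""
    path = dataset_root / "readme.txt"
    path.write_text(render_readme(details), encoding="utf-8")
    return path
```

Writing a placeholder for every unanswered heading makes gaps visible early, which fits the advice to document data while details are still fresh.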
What should you document?
There are no hard and fast rules here. Try to look at your data with a fresh pair of eyes and imagine trying to reuse it in five or ten years' time. Think about the level of information you might need in order to fully understand it. What seems obvious to you now may not be so clear to another researcher in the future.
The levels and types of documentation you will need will depend on a number of factors:
- Standards within your discipline or subject area. The Digital Curation Centre (DCC) has guidance on metadata standards for different disciplines.
- The nature of your research and the types of data that you are creating or collecting
- The reuse potential of your data. If you envisage your data being widely reused beyond the period of your research you may need a higher level of documentation to enable this.
- The retention period for your research data. If your data is to be kept (and remain usable) for a long period, it is likely that more documentation will be required to make sense of it as time progresses. Knowledge that we may assume today may not be so obvious in 20 years' time.
You understand your research data better than anyone else, so you are best placed to make the final decision about the level of documentation needed.
The UK Data Service Document your data web pages give more detail on documentation and metadata.
1 If the software is essential to validating your research findings the software should be made available where possible (if the Terms and Conditions/licence of the software allow), with suitable documentation and an appropriate licence. A clear link should also be established between the data and the software. If it is not possible to make the software available, adequate information should be provided to enable its re-running by third parties. If you generate data as a result of running software code, then it may be helpful to provide a link to that code in the metadata. Further guidance is available from the Software Sustainability Institute.
University guidance on data management:
The University's Records Management Policy defines the framework within which records (and data) are managed across the University.
Records management guides (login required): a series of guides with tips and examples, to help support the maintenance and good management of records, including data (print and electronic).
The Data Security (PDF, 255kb) guide is particularly helpful, offering guidance on data acquisition, storage, data retention, access controls, mailing records, third party access/processing, back-ups, and the disposal of data - including advice on the management of personal and sensitive data.
University of Edinburgh, Research Data MANTRA: