Organising your data files

Good file and folder organisation will help you to locate, identify and retrieve your data quickly and accurately, thereby making it easier for you to manage your data. To achieve this, you need to do at least two things:

  • use folders to sort your files into a series of meaningful and useful groups;
  • use naming conventions to give your files and folders meaningful names according to a consistent pattern.

You should establish a file organisation scheme at the start of your project, to avoid having to apply one retrospectively:

  • If you are new to a group, or are working with a research facility, check whether there is an established procedure to follow.
  • If you are working within a research group, it is essential that the whole group agrees on a file organisation scheme so that everyone can find data within the group's shared storage area.
  • If you are working alone, it is still important for you to set up a scheme for yourself.

Once you have set up a file organisation scheme, you should document it: write down what should go in each folder, the naming conventions you are using, and any codes or abbreviations you are using. Save this documentation in a ‘readme’ file, preferably in plain text, and store it in the top-level folder for your project where you (or anyone in your group) will be able to access it easily.

Consider scheduling a regular review of your file organisation scheme:

  • Make sure your files and folders conform to your scheme. It is easy to forget certain details, or to skip them when in haste, but if you tidy up your files regularly you can avoid problems.
  • Make sure your scheme is working for you. If you find you need to change or refine it, do not be afraid to do so. Just make sure you apply the changes consistently and update your ‘readme’ file accordingly.

Although these principles are aimed at digital files and folders, it is just as important to organise physical files, folders and other materials in a meaningful, consistent and documented fashion.

Structuring files and folders

There are many "right" ways of organising your files, so think about what makes sense for your research.

If you are doing experimental work, for example, you might want to organise the results into folders by the date you did the experiment, or by a key experimental condition.

The following suggestions will help you to organise your data:

  • Use folders to group files with common properties: Think about how you might want to browse for your files in future. Are you more likely to want files from a particular day or a particular instrument? You should avoid grouping files by the individual responsible as this can cause confusion when group members leave or join.
  • Apply meaningful folder names: Ensure that you use clear and appropriate folder names that concisely convey the common property of the files inside.
  • Keep the number of files per folder manageable: If you end up with only one or two files in each folder, you may find your structure tedious to navigate, but if you have hundreds it can be time-consuming to look through them all for the file you want.
  • Structure folders hierarchically: Design a folder structure with broad topics at the highest level and specific folders within these. Try to avoid nesting folders too deeply, however, as this may cause problems with path lengths.
  • Separate current and completed work: You may find it helpful to move temporary drafts and completed work into separate folders. This will also make it easier to review what you need to keep as you go along.
  • Control access at the highest level: It is easier to set access permissions near the top of your folder structure than to control permissions for multiple deeply nested folders. This is particularly important if you need to grant someone access to only a subset of your data, in which case you could move these data to a new, higher-level folder.
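
The suggestions above can be sketched in a short Python script that creates a shallow, hierarchical folder skeleton and an empty ‘readme’ file. This is only an illustration: the folder names in LAYOUT and the function name create_skeleton are hypothetical, and you should substitute whatever groupings make sense for your research.

```python
from pathlib import Path

# Hypothetical layout: broad topics at the top level, specific folders beneath.
# Only two levels deep, to keep path lengths short.
LAYOUT = {
    "current": ["drafts"],
    "completed": [],
    "data": ["raw", "processed"],
}


def create_skeleton(root, layout=LAYOUT):
    """Create the folder hierarchy under root and return the folders made."""
    root = Path(root)
    created = []
    for top, children in layout.items():
        # A top-level folder without children is still created on its own.
        for sub in children or [""]:
            folder = root / top / sub
            folder.mkdir(parents=True, exist_ok=True)
            created.append(folder)
    # Add a plain-text readme at the top level if one is not already there.
    readme = root / "readme.txt"
    if not readme.exists():
        readme.write_text(
            "Describe each folder and the naming convention here.\n",
            encoding="utf-8",
        )
    return created
```

Calling create_skeleton("my-project") a second time is harmless, because existing folders and an existing readme file are left untouched.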

Unlike with physical records, it is possible for digital folders to appear in more than one place in the hierarchy by means of shortcut links. This can help if different members of a group need the files to be organised in different ways, but the technique should be used sparingly.

Naming files and folders

Naming conventions are rules that allow electronic and physical records to be named in a consistent and logical way.

Use of consistent and meaningful names will enable you to identify and distinguish between similar records, making data retrieval easier.

If you create large numbers of data files that would be difficult to rename individually, apply your naming convention at the folder level instead.

When you agree your naming convention, consider the following suggestions:

  • Keep names short but meaningful. If you use abbreviations, keep a record of what these are with the data, so that others can understand and use them.
  • Include dates in YYYY-MM-DD format, according to the international ISO 8601 standard. This makes alphabetical order coincide with chronological order and avoids confusion when national conventions vary.
  • Avoid using spaces. Use punctuation such as hyphens or underscores to separate words, particularly for files that will be available online.
  • Avoid using dots (other than the one before the file extension) and special characters such as \ / : * ? " < > | as these may be reserved by the operating system.
  • Capture relevant information in file names rather than relying on basic file properties such as date of creation. This will allow processed data relating to a single experiment or study to be grouped together.
  • If you are repeatedly capturing the same information in a file name, consider grouping the files in a folder named with that information instead.
  • When personal names are used in file or folder names, use the family name followed by initials.
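
As a rough illustration of several of these suggestions, the Python sketch below builds a file name from an ISO 8601 date and a few descriptive blocks, replacing spaces with hyphens and dropping dots and reserved characters. The function names clean_block and build_name are hypothetical, and using underscores between blocks is just one possible convention.

```python
import re
from datetime import date

# Dots and the characters \ / : * ? " < > | are removed from each block.
FORBIDDEN = r'[\\/:*?"<>|.]'


def clean_block(text):
    """Drop dots and reserved characters, then turn spaces into hyphens."""
    text = re.sub(FORBIDDEN, "", text.strip())
    # Underscores are reserved for separating blocks, so they become hyphens too.
    return re.sub(r"[\s_]+", "-", text)


def build_name(day, *blocks, ext):
    """Join an ISO 8601 date and the cleaned blocks with underscores."""
    parts = [day.isoformat()] + [clean_block(block) for block in blocks]
    return "_".join(parts) + "." + ext.lstrip(".")


print(build_name(date(2012, 3, 22), "Subject A", "Transcript raw", ext="docx"))
# 2012-03-22_Subject-A_Transcript-raw.docx
```

Putting the date first means that sorting the names alphabetically also sorts them chronologically.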

If you are likely to have multiple versions of the same file, see our guidance below on keeping track of versions.

Applying a file naming convention

Consider a folder containing the following files:

  • 12-03-22_Anonymised_Interview_Transcript_SubA.docx
  • 18.04.12-Transcript of Subject B Interview.docx
  • 22_03_12_Subject_A_recording.mp3
  • 220312_Raw_transcript_Subject_A.docx
  • Interview with Subject B on 18/04/12.mp3
  • interviews.rtf
  • SA52387.docx
  • Welcome. My name is X.docx

This is an extreme example with many problems. Dates are written in many different formats, and files are described inconsistently, meaning that related files are not grouped together. This makes it hard to tell at a glance which file is which. In addition, some have quite mysterious names and some are using characters which may cause problems.

Contrast that with the consistently named files below.

  • 2012-03-22_Subject-A_Audio.mp3
  • 2012-03-22_Subject-A_Transcript-anonymised.docx
  • 2012-03-22_Subject-A_Transcript-raw.docx
  • 2012-04-18_Subject-B_Audio.mp3
  • 2012-04-18_Subject-B_Transcript-raw.docx
  • Interview-plan.docx
  • Readme.rtf
  • Summary.docx

Underscores are used to separate blocks of information, while hyphens are used within blocks. Note how materials from the same day and concerning the same subject are grouped together, and that corresponding files in the different groups are easy to spot.
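
A convention like this can also be checked automatically. The following Python sketch assumes a pattern of date, subject and content blocks for the dated interview files only (undated files such as Interview-plan.docx are outside its scope); the names NAME_PATTERN and parse_name are hypothetical.

```python
import re
from datetime import date

# Assumed pattern: <YYYY-MM-DD>_<subject>_<content>.<extension>
NAME_PATTERN = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2})"
    r"_(?P<subject>[A-Za-z0-9-]+)"
    r"_(?P<content>[A-Za-z0-9-]+)"
    r"\.(?P<ext>[A-Za-z0-9]+)"
)


def parse_name(filename):
    """Return the blocks of a conforming file name as a dict, or None."""
    match = NAME_PATTERN.fullmatch(filename)
    if match is None:
        return None
    try:
        # Reject strings that look like dates but are not, e.g. 2012-13-40.
        date.fromisoformat(match["date"])
    except ValueError:
        return None
    return match.groupdict()


names = [
    "2012-04-18_Subject-B_Audio.mp3",
    "22_03_12_Subject_A_recording.mp3",
    "2012-03-22_Subject-A_Transcript-raw.docx",
    "Interview with Subject B on 18/04/12.mp3",
    "2012-03-22_Subject-A_Audio.mp3",
]

# Conforming names sort into chronological order, grouped by subject.
conforming = sorted(n for n in names if parse_name(n))
nonconforming = [n for n in names if parse_name(n) is None]
```

A check of this kind could form part of a regular review of your file organisation scheme, flagging files that need renaming.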

Keeping track of versions

As you work with your data it is important to distinguish between different versions or drafts of your files. Version control can help you to easily identify the current version of your data so that you avoid working on older or outdated copies. If you are working with others it can also help to link versions of the data to the time and author of the change.

There are a number of ways that different versions of data can be managed:

File naming: A simple method of version control is to duplicate the file or folder and then update the version information in the name of the copy, so that each version has a unique name.

  • Successive versions can be numbered sequentially using one to three integers:
    • If you only expect to generate a small number of versions, a single integer may suffice (e.g. 1, 2, 3).
    • If you expect a moderate number of revisions, a two-integer scheme will be more useful (e.g. 1-0, 1-1, 1-2, 2-0, 2-1). The first number is used for major revisions and the second for minor edits.
    • For more complex files, a three-integer scheme might be needed (e.g. 1-0-0, 1-0-1, 1-1-0, 2-0-0). The first number is increased if the change might 'break' references from other files. The second and third numbers are increased for additions and fixes, respectively, that would not 'break' any incoming references.
    • Major version number 0 (e.g. 0-1, 0-2-1) is sometimes used to indicate an incomplete initial draft.
  • If you are working as part of a group, it may help to include the initials of the person who made the change (e.g. v1-0jm, v1-1ke, v2-0gb).
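
The numbering schemes above can be sketched in a few lines of Python. The helper names bump and version_tag are hypothetical; the sketch assumes that version numbers are written with hyphens and that the numbers to the right of the one being increased are reset to zero.

```python
def bump(version, level=-1):
    """Increase one integer of a hyphen-separated version such as '1-0-1'.

    level 0 is the first (major) number and -1 is the last. Any numbers
    to the right of the one increased are reset to zero.
    """
    numbers = [int(n) for n in version.split("-")]
    index = level % len(numbers)
    numbers[index] += 1
    numbers[index + 1:] = [0] * (len(numbers) - index - 1)
    return "-".join(str(n) for n in numbers)


def version_tag(version, initials=""):
    """Format a version for use in a file name, e.g. 'v1-0jm'."""
    return f"v{version}{initials.lower()}"


print(bump("1-2"))               # 1-3 (minor edit)
print(bump("1-2", level=0))      # 2-0 (major revision)
print(bump("1-0-1", level=1))    # 1-1-0 (addition)
print(version_tag("1-0", "JM"))  # v1-0jm
```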

Version control tables: These are included within documents and can capture more information than using file naming conventions. Version control tables typically include the new version number, date of the change, person who made the change and the nature or purpose of the change.

Version control systems: There are many automated systems available that can store a repository of files and monitor access to them, logging who made what change and when. Version control systems are particularly useful for collaborative development of code or software. Computing Services provide an institutional GitHub service. Please contact Computing Services via the IT help form to arrange access.

Example case of organising data files

The principal investigator (P.I.) of a large multi-institutional project was faced with the issue that each partner would hold an overlapping subset of the project data, with files shared with other partners as needed. In order to ensure consistency and coordination across the partner institutions, the P.I. drafted a folder structure and file naming convention.

The research workflow would involve taking detailed measurements of samples, and analysing the raw data in a variety of ways to generate different characterisations. The data might therefore need to be accessed by sample, characterisation technique, or characterisation purpose. The raw data would also need to be kept separate from derived data, to protect the raw data from inadvertent changes and permit timely sharing between partners.

The convention chosen was as follows. Within the project folder, subfolders were created for each work package ('WP1', 'WP2', etc.), for raw data ('Raw_Data'), and for characterisation templates ('Templates'):

  • Within the work package folders, derived data files were organised into folders named after the sample number. These folders also contained filesystem shortcut links to corresponding raw data folders.

  • Within the top raw data folder, files were organised in a two-level folder structure, with the first level folders named after the session identifier (consisting of a data type code, an underscore, and the date in YYYYMMDD[HHMM] format) and the second level folders after the sample number, for example SEM_20150609\SEM_20150609_AB123-4-5-6. The latter folder names had the session ID prepended so that the shortcut links mentioned above would have unique names even if the same sample was characterised across several sessions.

  • Templates were stored directly within the templates folder, named after the characterisation type to which they referred.
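
The raw data naming rule in this example can be expressed as a short Python sketch. The function names session_id and raw_folder are hypothetical and are not part of the project described; PureWindowsPath is used only so that the path is shown with backslashes, as in the example above.

```python
from pathlib import PureWindowsPath


def session_id(data_type, stamp):
    """Data type code, an underscore, then the date as YYYYMMDD or YYYYMMDDHHMM."""
    if not (stamp.isdigit() and len(stamp) in (8, 12)):
        raise ValueError("stamp must be YYYYMMDD or YYYYMMDDHHMM")
    return f"{data_type}_{stamp}"


def raw_folder(data_type, stamp, sample):
    """Build the two-level raw data path for one sample in one session."""
    session = session_id(data_type, stamp)
    # The session ID is prepended to the sample number so that shortcut
    # links to this folder stay unique across sessions.
    return PureWindowsPath("Raw_Data") / session / f"{session}_{sample}"


print(raw_folder("SEM", "20150609", "AB123-4-5-6"))
# Raw_Data\SEM_20150609\SEM_20150609_AB123-4-5-6
```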

We acknowledge the work of the UK Data Service, the University of Glasgow, the University of Leicester and the University of Southampton in the development of this guidance.