Describing your data

Writing documentation to accompany research data is a vital step in managing it correctly. It can also be one of the hardest, but it can be made much simpler if you are able to work out in advance what you will want to say.

At the beginning of your project, think about the sorts of documentation you will need:

  • The archive to which you will submit your data will ask you a set of questions about them: find out what they will be in advance.
  • You may need to describe how you collected your data in a journal article, in enough detail to allow your results to be reproduced and verified.
  • You might need to remind yourself how you collected and organised your data if you come to re-use them in another project.
  • A peer researcher might need to know why and how you collected, organised and processed your data, in order to build on your research and use your data in new ways.

You may find it helpful to make a list of what you will need to record, and fill out the corresponding information as and when you know it, so you have it all in place when you are eventually asked for it. Indeed, some information is much easier to collect at the time you are working with the data than after the fact.

Choosing what information to include

When documenting your data, the aim is to provide enough information so that a fellow researcher who is familiar with your field, but not necessarily your work within it, should be able to understand the data, interpret them correctly, and use them in new research. You may find it easier to consider what you would need to know in order to use someone else's data in your research. Typically this will include the method used to collect the data and how they have been recorded, structured, processed or manipulated. You may also need to provide some broader context to explain the motivation for the design decisions you have taken and the significance of what you found.

More specifically, you may need to include some of the following elements:

  • details of the equipment used, such as the make and model of the instrument, the settings used, and information on how it was calibrated;
  • the text of survey instruments used, including questionnaires and interview templates;
  • details of who collected the data and when;
  • key features of the methodology, such as the sampling technique, whether the experiment was blinded, and how sample groups were subdivided;
  • legal and ethical agreements relating to the data, such as consent forms, data licences, approval documents or COSHH forms;
  • citations for any third-party data you have used;
  • details of the file formats and standard data structures used to record data and supporting information;
  • a glossary of column names and abbreviations used, explaining for example which measurement resulted in the given column and what units were used;
  • the codebook used to analyse and encode content;
  • the workflow used to process and manipulate data, including steps such as applying a statistical test or removing outliers;
  • details of the software used to generate or process the data, including version number and platform.

You may be recording some of this information in a lab notebook or research journal. If so, you may find it convenient to maintain an index file that links data files to the corresponding page numbers until you have an opportunity to transfer the information into a documentation file.
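
One way to keep such an index is a small CSV file with one row per data file. The following is a minimal sketch in Python, not a prescribed format: the column headings, file names, notebook label and page numbers are all hypothetical and should be adapted to your own project.

```python
import csv
import io

# Hypothetical column headings for the index; adapt them to your own project.
FIELDS = ["data_file", "notebook", "pages", "note"]


def write_index(entries, stream):
    """Write one row per data file, pointing to the notebook pages that describe it."""
    writer = csv.DictWriter(stream, fieldnames=FIELDS)
    writer.writeheader()
    for entry in entries:
        writer.writerow(entry)


def pages_for(index_text, data_file):
    """Return the notebook and page references recorded for a given data file."""
    reader = csv.DictReader(io.StringIO(index_text))
    return [(row["notebook"], row["pages"]) for row in reader if row["data_file"] == data_file]


# Illustrative entries only; these files and page numbers are made up.
entries = [
    {"data_file": "run01.csv", "notebook": "Lab book 3", "pages": "12-14", "note": "calibration"},
    {"data_file": "run02.csv", "notebook": "Lab book 3", "pages": "15", "note": "first full run"},
]

buffer = io.StringIO()
write_index(entries, buffer)
index_text = buffer.getvalue()
print(index_text)
```

In practice you would write to a file on disk rather than an in-memory buffer, and add a row each time you create a new data file.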

Choosing where to record the information

Depending on the context, there are several places where you can record the documentation:

  • Within the data file: Some file formats can record information in addition to the main data content. For example, the Observations and Measurements XML standard provides a way of recording sampling strategies and observation procedures as well as measurement values.
  • In a separate metadata file: Some disciplines have developed special file formats or data structures for recording supporting information. For example, the Agricultural Metadata Element Set (AgMES) provides a way of describing an agricultural dataset using the subject-predicate-object structure of the Resource Description Framework (RDF).
  • In a file mimicking a web form: In some cases, archives generate specialist metadata files from their submission forms. In such cases, you might find it handy to copy out the elements of the form into a file at the start of your project, fill out the details as you go, then copy your answers into the form when you submit your data.
  • In a readme file: Any information that cannot be recorded in a structured way (i.e. as the values of fields in a data or metadata file) can be recorded as free text within a readme file.
  • In a published journal article: Some of the information needed to understand data would normally be provided in a journal article reporting the research. In order to prevent duplication of effort, it is possible to refer to an article to provide more information about a dataset, but before doing so you should be sure that (a) the article provides sufficient detail, and (b) the article will be available on open access.
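
The file mimicking a web form, described in the list above, could be as simple as a set of named fields that you fill in over time. The sketch below assumes Python and JSON as the storage format. The field names are hypothetical stand-ins for whatever your archive's submission form actually asks, and the example answers are made up.

```python
import json

# Hypothetical form fields; copy the real ones from your archive's submission form.
FORM_FIELDS = ["title", "creators", "description", "collection_dates", "licence"]


def blank_form(fields):
    """Create a template with every field present but unanswered."""
    return {field: "" for field in fields}


def unanswered(form):
    """List the fields that still need a value before submission."""
    return [field for field, value in form.items() if not value.strip()]


form = blank_form(FORM_FIELDS)
# Fill in answers as they become known during the project (example values are made up).
form["title"] = "Example dataset title"
form["collection_dates"] = "2024-03-01 to 2024-03-15"

# Serialise the form so it can be saved alongside the data and updated as you go.
form_text = json.dumps(form, indent=2)
print(form_text)
print("Still to complete:", unanswered(form))
```

Listing the unanswered fields gives a quick check of what remains to be recorded before you submit your data.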

Writing a readme file

A readme file is a plain text file that is named 'readme' to encourage users to read it before looking at the remainder of the content. It can contain documentation directly or tell the reader where to look for more information. Even though it is free text, the file should be structured into sections as an aid to the reader. The following are suggestions for what to include:

  • Citation information. Give the information someone would need in order to cite your data, such as the title of the dataset, the names of the people responsible for it, the year it was (or will be) released, the archive that holds (or will hold) the data, and an identifier. You could also include more detailed information such as the funding body, project name, or date and place of data collection.
  • Methodology. Describe how you collected your original data. If referring to a published article, this could simply be a statement such as, 'Full details of the methods used to create the dataset are provided in,' followed by the reference. Be sure to include a direct link to an open access copy as well as the DOI of the article. Otherwise, sufficient information should be given to enable another researcher to recreate the dataset or create a comparable one. Avoid reproducing the text of a published article verbatim if you have not retained copyright.
  • Third-party inputs. If you used third-party data, provide a data citation or, if they are not available from a repository, describe how you accessed them.
  • Workflow. Provide details of the steps you took to process the data, including preparatory steps such as data cleaning and reformatting. State the software, services or scripts you used, as well as where they can be found, how to install/invoke/run them, and any special settings they require.
  • Outputs. If your workflow generates auxiliary files as well as data files, explain which are which. Relate the outputs of your workflow to the data files you have submitted or will submit for archiving.
  • Inventory of files. Give the names of the files in the dataset and give a short description of each and how they interrelate. You could also mention related data that were not selected for inclusion, such as auxiliary files generated by your workflow, or 'garbage' data from unsuccessful experimental runs.
  • File structure and conventions. Provide details of how to interpret your data files. For example, explain which measurement each column heading represents, the units of measurement used, and any abbreviations, coding or controlled vocabulary used.
  • Licence information. Give a short statement about the terms under which others may use the dataset. If necessary, the full text of the licence may be given in a separate plain-text file called 'license.txt'.
  • Relationships. If applicable, give links to related publications, datasets and alternative records.
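
If you would like to start from an empty outline, the suggested sections above can be turned into a plain-text skeleton. The following is a minimal sketch in Python. The underlined heading style, the 'TODO' placeholders and the dataset title are assumptions made for illustration, not a required layout.

```python
# Section headings taken from the suggestions above.
SECTIONS = [
    "Citation information",
    "Methodology",
    "Third-party inputs",
    "Workflow",
    "Outputs",
    "Inventory of files",
    "File structure and conventions",
    "Licence information",
    "Relationships",
]


def readme_skeleton(title, sections=SECTIONS):
    """Return plain text with an underlined heading and a placeholder for each section."""
    lines = [title, "=" * len(title), ""]
    for heading in sections:
        lines += [heading, "-" * len(heading), "TODO", ""]
    return "\n".join(lines)


# The dataset title here is a made-up placeholder.
readme_text = readme_skeleton("Readme for Example Dataset")
print(readme_text)
```

Saving the result as a readme file at the start of the project gives you somewhere to record each piece of information as soon as you know it.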

The University of Bath Research Data Archive contains some examples of readme files you can look at for inspiration.

The University of Minnesota provides an example readme file template.

Writing a structured metadata file

Metadata are the information someone would need about some data in order to perform a specific task with them, such as discovering or preserving them. Metadata are most useful when they have been structured, that is, arranged as properties and values.
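
To illustrate why structure helps, the sketch below (in Python) arranges two made-up dataset descriptions as properties and values and then selects records by property. The property names and values are hypothetical and do not follow any particular metadata standard.

```python
# Two made-up records, each arranged as properties and values.
records = [
    {"title": "River temperature survey", "creator": "A. Researcher", "year": 2021, "format": "CSV"},
    {"title": "Interview transcripts", "creator": "B. Researcher", "year": 2023, "format": "TXT"},
]


def find(records, **criteria):
    """Return the records whose properties match every given value."""
    return [r for r in records if all(r.get(k) == v for k, v in criteria.items())]


matches = find(records, format="CSV")
print([r["title"] for r in matches])
```

The same selection would be much harder to make reliably if the descriptions had been written as free text.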

As a researcher, the three main types of metadata you will be asked to provide are contextual metadata, discovery metadata, and metadata for reuse.

You will provide contextual metadata when you create a record of a dataset in Pure. This helps to connect your data to your own profile, and to your project, funding body and publications.

You will provide discovery metadata when you complete a record in the University of Bath Research Data Archive. This helps other researchers to find your data, and as a result may help to increase the impact of your research.

The metadata you provide for reuse will depend on the field of your research:

  • Social scientists often package their data and metadata together using DDI or, if the data are strongly statistical in nature, SDMX.
  • Many types of biological and biomedical investigation have a corresponding Minimum Information standard, setting out what information would be needed to interpret the data unambiguously and reproduce the experiment.
  • Geospatial data are usually packaged in a format that complies with the standard ISO 19115. There are many profiles of this standard aimed at different communities; UK researchers are encouraged to use UK GEMINI, which is in turn compliant with the European INSPIRE Directive.
  • Some subject-specific data archives ask for data to be submitted in a particular format. For example, the NCBI Gene Expression Omnibus specifies a metadata set to be submitted along with data, and has developed the spreadsheet-based GEOarchive format for capturing it.

If you decide or are required to offer your data to a subject-specific data centre, you should contact them in the early stages of your project to discuss their metadata requirements. This can save a lot of additional work later on as some metadata can only be collected accurately at the point of data creation.

For more information about the data and metadata standards available for your subject area, see the following directories:

Using standardised vocabularies

As an aid to clarity, some subject areas have agreed on a common set of terminology to use when describing data. If metadata standards list the properties that need to be known, vocabularies help with providing useful values.

  • The NERC Vocabulary Server provides access to many different vocabularies in use in geoscience and oceanography.
  • The Open Knowledge Foundation runs the Linked Open Vocabularies service, which provides access to many different vocabularies that are suitable for use in Resource Description Framework (RDF) applications.
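
As a simple illustration of how a vocabulary helps with providing useful values, the sketch below (in Python) checks a list of values against a set of permitted terms. The vocabulary shown is made up for the example. A real project would take its terms from a published vocabulary such as those offered by the services above.

```python
# A made-up controlled vocabulary; a real project would take its terms from a published one.
ALLOWED_UNITS = {"degrees Celsius", "metres", "seconds"}


def check_values(values, vocabulary):
    """Split values into those found in the vocabulary and those that are not."""
    accepted = [v for v in values if v in vocabulary]
    rejected = [v for v in values if v not in vocabulary]
    return accepted, rejected


accepted, rejected = check_values(["metres", "deg C", "seconds"], ALLOWED_UNITS)
print("Accepted:", accepted)
print("Needs correcting:", rejected)
```

Running a check like this before you submit your data catches inconsistent spellings and abbreviations while they are still easy to correct.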

Further information about describing data