Appraising and selecting data

Research data come in all shapes and sizes, and the value of a particular type of data can vary between disciplines, researchers and projects. For any given circumstance, though, it is important to distinguish between incidental data and those that have long-term value, as keeping everything can have negative consequences:

  • It can be harder to find, navigate and understand high-value data if they are surrounded by incidental data.
  • Keeping data unnecessarily drives up the costs of storage, administration and dissemination, potentially reducing the period for which high-value data can be kept.

In order to determine which data should be archived and which should be discarded, there are a number of factors you should consider.

Selecting data for retention

All data supporting a publication, including both quantitative and qualitative data, should be retained and made accessible where possible. These data should be sufficient to enable other researchers to reproduce or validate your findings. This is often a subset of the data generated over the course of a research project.

As well as preserving the data that support publications, you should also archive data with acknowledged long-term value. Determining the potential value of a dataset is a matter of judgement, but there are several areas that should be considered.

Reproducibility

If it would be impossible or unreasonably expensive to reproduce or recreate the data, they should be retained. Examples of data which can fall into this category include the following:

  • behavioural observations;
  • interview transcripts;
  • survey results, whether quantitative or qualitative;
  • sensor readings in natural conditions (e.g. weather readings or building monitoring where external conditions are a factor);
  • unique data related to a specific event;
  • data produced on equipment which is expensive to use or difficult to access;
  • data derived from sources which are expensive to access again.

As a rule of thumb, if a significant proportion of the funding for a project is for data generation, then the data are probably valuable and should be retained.

Importance

When selecting data for retention, it is not always clear whether they have historical or future significance. Indicators that data might be valuable because of their importance include the following:

  • data that have contributed to securing future funding for the research;
  • data underpinning practical applications of the research;
  • data relating to a landmark discovery or precedent.

Persistence

If the data are re-used regularly by your group or by others in your field, they should be archived to ensure that authoritative versions are clearly defined and that they are available for future users. Examples include the following:

  • models;
  • model inputs;
  • cumulative datasets (where each version should be archived);
  • longitudinal study data.

Some data are required by law to be held for a period of time.

Confidentiality

If the data are confidential, they should not be shared, but it may be valuable to retain them. Personal data collected for research purposes can be retained for future analysis, and there is an exemption in the Data Protection Act for this purpose. However, if no future analysis is planned, then the data should be securely disposed of. Some funders require you to keep personal data for a fixed period of time, before securely destroying them.

You must keep consent forms for as long as data are retained in a form where they could be linked to individuals. Once the data only exist in anonymised form, we recommend that you retain the wording of the consent forms (e.g. a blank form) but dispose of the completed forms.

If collaboration agreements include data retention and sharing permissions, the agreements themselves should be retained for as long as you hold the data.

Selecting data for disposal

Not all data are suitable for retaining after the end of a project.

Reproducibility

If the data are easily and cheaply reproduced, you may not need to retain them. Likewise, data which are so large that the cost of retaining them outweighs the cost of recreating them should not be kept.

If you dispose of this kind of data, it is important to ensure you have a full and clearly described method for how to recreate the data.
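
The comparison between the cost of retaining data and the cost of recreating them can be sketched as simple arithmetic. The following Python example is only an illustration, not an official costing method: the function names, the flat annual rate per terabyte and all of the figures are hypothetical assumptions chosen to show the reasoning.

```python
def retention_cost(size_tb, cost_per_tb_year, years):
    """Total cost of keeping a dataset, assuming a flat annual rate per terabyte.

    The flat rate is a simplifying assumption; real storage pricing may differ.
    """
    return size_tb * cost_per_tb_year * years


def cheaper_to_recreate(size_tb, cost_per_tb_year, years, recreation_cost):
    """Return True when recreating the data would cost less than retaining them."""
    return recreation_cost < retention_cost(size_tb, cost_per_tb_year, years)


# Hypothetical figures: 200 TB of simulation output kept for 10 years at
# 50 currency units per TB per year, against 20,000 units to re-run the model.
keep = retention_cost(200, 50, 10)
print("Cost of retaining:", keep)
print("Cheaper to recreate:", cheaper_to_recreate(200, 50, 10, 20_000))
```

A comparison like this only supports disposal if the method for recreating the data has been fully and clearly documented.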

Importance

If the data are superfluous or unsuitable for supporting published findings, then they should not be retained. Examples include the following:

  • temporary files created during a process, if the inputs, outputs and process or model are retained;
  • data which were the result of a known error in methodology and have not been reported – note that this is not the same as biased reporting of only positive findings or those which support a particular hypothesis.

Third-party data

If your research data have been obtained from a third party, and there are no provisions in the licence or agreement for retaining or sharing the data, then they should be securely destroyed at the end of the project. You may be able to share derived datasets, such as those created through content mining.

Format transfer

Some data are not suitable for retaining in their original formats, or might be more useful in another format. Some examples include the following:

  • survey forms – data can be transcribed into tabular format (usually an automatic process in online survey tools);
  • video or audio interviews – content can be transcribed into text format which makes analysis and anonymisation easier;
  • proprietary software formats – these should be saved in a more accessible format.

For more information about digitisation, see our guides to working with non-digital data and archiving or disposing of non-digital data.

Confidentiality

The Data Protection Act requires that personal data should be securely disposed of once they are no longer required for the purpose for which they were collected. As mentioned above, however, personal data collected for research purposes can be retained if future analysis is felt to be likely.

Funding agreements, consent forms or collaboration agreements may contain additional stipulations about when sensitive data should be securely destroyed.
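
The retention and disposal criteria described in this guide can be summarised as a rough checklist. The Python sketch below is one possible way of recording that checklist, under stated assumptions: the `Dataset` record, its field names and the `appraise` function are all hypothetical, and the result is only a starting point for the judgement that appraisal requires.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    """Hypothetical appraisal record; the field names are illustrative only."""
    name: str
    supports_publication: bool = False
    costly_to_reproduce: bool = False
    regularly_reused: bool = False
    legally_required: bool = False
    third_party_without_permission: bool = False


def appraise(d):
    """Suggest 'dispose', 'retain' or 'review' based on the criteria in this guide."""
    # Third-party data with no provision for retention should be securely destroyed.
    if d.third_party_without_permission:
        return "dispose"
    # Any one of these indicators is a reason to keep the data.
    if (d.supports_publication or d.costly_to_reproduce
            or d.regularly_reused or d.legally_required):
        return "retain"
    # No clear indicator either way: the decision needs human judgement.
    return "review"


interviews = Dataset("interview transcripts", costly_to_reproduce=True)
temp_files = Dataset("temporary files")
print(interviews.name, "->", appraise(interviews))
print(temp_files.name, "->", appraise(temp_files))
```

The checklist deliberately leaves out confidentiality and consent, which depend on the terms under which the data were collected and should be assessed case by case.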

Next steps

You should register any data you have selected to keep in Pure. This ensures that we know that the dataset exists within the institution, and is important for compliance with several funders' data policies.

Example case of appraising and selecting data

Researchers in the Department of Architecture and Civil Engineering published a paper describing how the reaction of a chemical compound to exposure was studied both experimentally and through computer simulation. The underlying data consisted of scanning electron microscopy (SEM) images, specimen composition data obtained from energy-dispersive X-ray spectroscopy (EDS), X-ray photoelectron spectroscopy (XPS) results and PHREEQC simulation data.

The journal could accommodate a limited number of images and tabular data within the paper, but would not provide storage for supplementary data. The researchers therefore had to decide which data to submit to the University of Bath Research Data Archive to comply with the EPSRC's expectations. Considering the nature of the techniques used, they judged that the data might be useful for validating their findings but would not be useful as the basis for future research. This influenced the selections they made:

  • The unpublished SEM images were deemed of negligible value and were not submitted.
  • The EDS data had been included in full in the paper, so were not submitted.
  • The VMS file from the XPS software was submitted, so that a peer researcher familiar with the technique and software could process it and verify that the spectra published in the paper were correct.
  • The transcripts of the PHREEQC simulations, including the model, input parameters and full output tables, were submitted. These would enable a peer researcher to re-run the simulations and confirm the summary data presented in the paper.

Further information about appraisal and selection