Citing research data

When you use third-party data in your research, you must acknowledge this in the resulting research outputs. Acknowledgement is a common condition of use, and it helps to avoid charges of plagiarism. The ideal way to do this is to cite the data directly, just as you would cite an academic paper. If this is not possible, there are indirect methods you can use instead.

Formatting a reference to a dataset

The format of the reference depends on the referencing style required by your publisher.

If your document uses the University of Bath's Harvard style, format the reference as follows, omitting the version number if not provided:

  • Smith, M. and Jones, G. R., 2015. Title of dataset. Version 1. University of Bath. Available from: http://doi.org/10.15125/12345 [Accessed 1 March 2016].
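As an illustration only, the pattern above can be assembled from its parts with a small Python helper. The function name and parameters are hypothetical, the creators are assumed to be supplied already formatted (for example "Smith, M. and Jones, G. R."), and plain text cannot carry any typographic styling your publisher may require, so treat this as a sketch rather than an official formatter.

```python
from datetime import date
from typing import Optional


def bath_harvard_dataset_reference(
    creators: str,
    year: int,
    title: str,
    publisher: str,
    url: str,
    accessed: date,
    version: Optional[str] = None,
) -> str:
    """Assemble a dataset reference following the example pattern above.

    Hypothetical helper: `creators` must already be formatted as it
    should appear in the reference list.
    """
    parts = [f"{creators}, {year}.", f"{title}."]
    # The version number is omitted when the archive does not provide one.
    if version:
        parts.append(f"Version {version}.")
    parts.append(f"{publisher}.")
    # Day without a leading zero, full month name, four-digit year.
    accessed_text = f"{accessed.day} {accessed:%B} {accessed.year}"
    parts.append(f"Available from: {url} [Accessed {accessed_text}].")
    return " ".join(parts)


print(
    bath_harvard_dataset_reference(
        creators="Smith, M. and Jones, G. R.",
        year=2015,
        title="Title of dataset",
        publisher="University of Bath",
        url="http://doi.org/10.15125/12345",
        accessed=date(2016, 3, 1),
        version="1",
    )
)
```

Leaving out the `version` argument produces the same reference without the "Version 1." element.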

Some style manuals have specific rules for referencing databases and datasets:

APA
Smith, M., & Jones, G. R. (2015). Title of dataset [Data set]. doi:10.15125/12345

Chicago
Footnote: Melville Smith and G. R. Jones, Title of dataset (accessed March 1, 2016), doi:10.15125/12345.
Reference list: Smith, Melville, and G. R. Jones. Title of dataset (accessed March 1, 2016). doi:10.15125/12345.

MLA
Smith, Melville, and G. R. Jones. "Title of dataset." University of Bath, 2015. Web. 1 March 2016. <http://doi.org/10.15125/12345>.

If your publisher or style manual does not provide an example reference for datasets, we recommend adapting the style in use for reports:

  • Put the names of the data creators where the authors would go, formatted likewise.
  • Put the title of the dataset where the report title would go.
  • Put the year that the version of the dataset you used was released in place of the publication year.
  • Put the name of the institution hosting the data, or the name of the data archive, as the publisher. You do not need to include the geographical location of the data archive.
  • Include an identifier for the dataset at the end of the reference. Use the DOI, if the dataset has one, following the convention used by your publisher to print DOIs. If the dataset has another identifier instead of a DOI, give both the scheme and the identifier, e.g. Accession E-MTAB-01234. If the dataset has no identifier at all, provide the URL of the archive page that describes it, in the usual way for printing URLs.

If you are unsure what to put for any of the above, many archives provide sample references that you can use to deduce the correct values.
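The order of preference for identifiers described in the list above (DOI first, then another scheme plus identifier, then the URL of the archive page) can be sketched as a short Python function. The function name, its parameters and the default `doi:` prefix are assumptions made for illustration; substitute whatever convention your publisher uses to print DOIs.

```python
from typing import Optional


def identifier_for_reference(
    doi: Optional[str] = None,
    scheme: Optional[str] = None,
    identifier: Optional[str] = None,
    url: Optional[str] = None,
    doi_prefix: str = "doi:",  # assumption: set to your publisher's convention
) -> str:
    """Pick the identifier text to print at the end of a dataset reference."""
    # 1. Prefer the DOI when the dataset has one.
    if doi:
        return f"{doi_prefix}{doi}"
    # 2. Otherwise give both the scheme and the identifier.
    if scheme and identifier:
        return f"{scheme} {identifier}"
    # 3. Failing that, fall back to the URL of the archive page.
    if url:
        return url
    raise ValueError("Provide a DOI, a scheme with an identifier, or a URL.")


print(identifier_for_reference(doi="10.15125/12345"))
print(identifier_for_reference(scheme="Accession", identifier="E-MTAB-01234"))
```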

Citing datasets indirectly

Unfortunately, some journals are hesitant to include direct citations of datasets, and may ask you to remove them. If this happens, check whether the data originators have published a data paper you can cite instead. As distinct from a regular paper, a data paper reports the availability of a dataset and provides documentation for it, without drawing scientific conclusions from it.

Otherwise, you should mention the dataset as part of your data access statement. In such cases you need not reproduce all the information from the reference. Instead, you should include the name of the data archive, a link to the archive's record for the dataset, and the dataset's identifier if this is not obvious from the link. You should also mention if the dataset has any access restrictions. If the dataset is dynamic, you should include the date and time you accessed it, or the version number if appropriate.
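To show how these elements fit together, here is a minimal Python sketch that builds a data access statement from them. The function, its parameters and the sentence wording are all hypothetical, and the archive name in the example is a placeholder; your journal may prescribe its own wording.

```python
from datetime import datetime
from typing import Optional


def data_access_statement(
    archive: str,
    url: str,
    identifier: Optional[str] = None,
    restrictions: Optional[str] = None,
    version: Optional[str] = None,
    accessed: Optional[datetime] = None,
) -> str:
    """Draft a data access statement for a third-party dataset."""
    first = f"The data used in this study are available from {archive} at {url}"
    # Only spell out the identifier when it is not obvious from the link.
    if identifier and identifier not in url:
        first += f" ({identifier})"
    sentences = [first + "."]
    if restrictions:
        sentences.append(f"Access restrictions apply: {restrictions}.")
    # For dynamic datasets, give the version if there is one,
    # otherwise the date and time of access.
    if version:
        sentences.append(f"Version {version} was used.")
    elif accessed:
        stamp = accessed.isoformat(timespec="minutes")
        sentences.append(f"The data were accessed at {stamp}.")
    return " ".join(sentences)


print(
    data_access_statement(
        archive="Example Data Archive",  # placeholder name
        url="http://doi.org/10.15125/12345",
        identifier="10.15125/12345",
        accessed=datetime(2016, 3, 1, 14, 30),
    )
)
```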

Citing a subset of data

Just as you may want to cite a particular passage from a textual work, you may need to cite a subset of data. For example, you may have queried a database and worked with the result set, or filtered out parts of a large dataset that were less relevant to your research. There are two ways of approaching this.

The first is similar to the approach you would take with a textual work. Cite the whole database or dataset, then provide the information the reader would need to extract the same subset. This could be the query you submitted to the database, and the date and time you submitted it. If there were multiple or complex steps involved, you may need to include this information in the supplementary data section instead.
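One way to keep this information together is to record it at the moment you run the query. The sketch below is only a suggestion: the function name, the field names and the example SQL query are hypothetical, and the timestamp is assumed to be timezone-aware so that it can be stored unambiguously in UTC.

```python
import json
from datetime import datetime, timezone
from typing import Sequence


def subset_provenance(
    dataset_reference: str,
    query: str,
    submitted_at: datetime,
    steps: Sequence[str] = (),
) -> dict:
    """Record what a reader would need to extract the same subset."""
    if submitted_at.tzinfo is None:
        raise ValueError("submitted_at must be timezone-aware")
    return {
        "dataset_reference": dataset_reference,
        "query": query,
        # Store the submission time in UTC, in ISO 8601 format.
        "submitted_at": submitted_at.astimezone(timezone.utc).isoformat(),
        # Any further filtering steps, in the order they were applied.
        "steps": list(steps),
    }


record = subset_provenance(
    dataset_reference="Smith, M. and Jones, G. R., 2015. Title of dataset.",
    query="SELECT * FROM samples WHERE year = 2014",  # hypothetical query
    submitted_at=datetime(2016, 3, 1, 14, 30, tzinfo=timezone.utc),
    steps=["Removed rows with missing measurements"],
)
print(json.dumps(record, indent=2))
```

A record like this is short enough to quote in the text when there is a single query, and can go in the supplementary data section when several steps were involved.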

The other approach is to archive a snapshot of the subset you actually used. Some archives give you this option when you query a dynamic database. Otherwise, you must make sure that the terms and licence conditions of the data you used allow you to archive your copy. For more information, see our guide to using third party research data.

Further information about citing research data

You may find the following external resources useful.