Citing data sources: why it matters and how to do it

When was the last time you used data in your research or a classroom assignment? Be it primary or secondary, data are a pillar of scientific research. And just as sources such as books and journal articles are expected to be cited in your work, the data used in research also deserve proper credit.

Why cite data?

Employing proper data citation practices benefits both data producers and data users during the scientific research process. Data citation encourages collaboration by making it easier for researchers to find and use each other's datasets, thereby promoting a culture of sharing and reuse and fostering a more efficient research environment. These goals are the driving force behind the NIH Data Management and Sharing Policy, in effect as of January 25, 2023, for all research supported by the National Institutes of Health.

For data users, data citation supports reproducibility by allowing other researchers to locate and access the underlying research data more easily. It also makes that research more transparent and encourages the creation of high-quality datasets. Last but not least, citation encourages the reuse of data in the development of new research questions.

For data producers, data citation provides the creators of the dataset with appropriate credit, increases the findability of their research, and helps establish formalized standards for data to be recognized as a legitimate, citable scholarly contribution. In addition, data citation allows for tracking and measuring the impact of data, providing a more comprehensive view of the influence and importance of a particular dataset or data repository.

How to cite data?

Like citations to books and journals, the format for a data citation depends on which style guide your publisher is following. While different style guides and publications have their own particular formats for data citation, the following components are generally required:

  • Author(s)
  • Date of publication
  • Title of dataset
  • Version, when appropriate
  • Publisher or repository
  • Persistent locator/identifier (e.g., DOI)
  • Date accessed, when appropriate

The American Psychological Association (APA) style guide, used by many medical journals, follows this format:

AuthorLastName, AuthorFirstInitial OR Organization. (Year). Title of dataset (Version number) [Dataset]. Publisher. DOI or URL

Here's an example:

National Center for Health Statistics. (2023). Percentage of coronary heart disease for adults aged 18 and over, United States, 2019–2022 [Dataset]. Centers for Disease Control and Prevention. https://wwwn.cdc.gov/NHISDataQueryTool/SHS_adult/index.html

Remember to adjust the details according to the specific information available for the dataset you are citing. If a DOI is available, it's preferable to use that, as it provides a stable and persistent link to the dataset. If no DOI is available, provide the direct URL.
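
If you build citations programmatically, the template and the DOI-over-URL preference described above can be sketched in a few lines of Python. This is a minimal illustration, not an official APA tool: the class name DatasetCitation, the function format_apa, and the field names are all hypothetical, and the output is plain text (it does not italicize the title, which APA style normally expects).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DatasetCitation:
    """Components of a dataset citation (hypothetical field names)."""
    author: str                    # "Lastname, F." or an organization name
    year: str
    title: str
    publisher: str
    version: Optional[str] = None  # e.g. "Version 2"; omit if unversioned
    doi: Optional[str] = None      # bare DOI without the https://doi.org/ prefix
    url: Optional[str] = None      # used only when no DOI is available


def format_apa(c: DatasetCitation) -> str:
    """Assemble: Author. (Year). Title (Version) [Dataset]. Publisher. DOI or URL"""
    title = c.title
    if c.version:
        title += f" ({c.version})"

    # Prefer the DOI because it is a persistent identifier;
    # fall back to the direct URL when no DOI exists.
    if c.doi:
        locator = f"https://doi.org/{c.doi}"
    else:
        locator = c.url or ""

    # Personal names such as "Doe, J." already end in a period.
    author = c.author if c.author.endswith(".") else c.author + "."

    parts = [author, f"({c.year}).", f"{title} [Dataset].", f"{c.publisher}.", locator]
    return " ".join(p for p in parts if p)


if __name__ == "__main__":
    example = DatasetCitation(
        author="National Center for Health Statistics",
        year="2023",
        title=("Percentage of coronary heart disease for adults aged 18 "
               "and over, United States, 2019–2022"),
        publisher="Centers for Disease Control and Prevention",
        url="https://wwwn.cdc.gov/NHISDataQueryTool/SHS_adult/index.html",
    )
    print(format_apa(example))
```

Because the example dataset has no DOI, the sketch falls back to the URL; supplying a doi value would replace the URL with a https://doi.org/ link.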

For examples from other style guides, visit the data citation page from Columbia University Libraries.

Are you using citation management software for your references? All three of the main citation management tools (EndNote, Mendeley, and Zotero) include the reference type "dataset," but some datasets may still require manual entry or cleanup if they have not been tagged well by the data creator or the data repository.
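
When a dataset does need manual entry, one alternative to typing each field into the tool is to write a small import file. The sketch below produces a record in the RIS interchange format using the DATA reference type. It assumes that your citation manager's RIS importer maps DATA to its "dataset" reference type, which is worth confirming in the tool's own documentation; the function name to_ris and its parameters are hypothetical.

```python
def to_ris(authors, year, title, publisher, url=None, doi=None):
    """Return one RIS record describing a dataset.

    Assumption: the importing tool maps the RIS type DATA to its
    "dataset" reference type.
    """
    lines = ["TY  - DATA"]                       # reference type: data file
    lines += [f"AU  - {a}" for a in authors]     # one AU line per author
    lines += [
        f"PY  - {year}",                         # publication year
        f"TI  - {title}",                        # dataset title
        f"PB  - {publisher}",                    # publisher or repository
    ]
    if doi:
        lines.append(f"DO  - {doi}")             # persistent identifier
    if url:
        lines.append(f"UR  - {url}")             # direct link
    lines.append("ER  - ")                       # end of record
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    record = to_ris(
        authors=["National Center for Health Statistics"],
        year="2023",
        title=("Percentage of coronary heart disease for adults aged 18 "
               "and over, United States, 2019–2022"),
        publisher="Centers for Disease Control and Prevention",
        url="https://wwwn.cdc.gov/NHISDataQueryTool/SHS_adult/index.html",
    )
    # Save the record so it can be imported into a citation manager.
    with open("dataset.ris", "w", encoding="utf-8") as f:
        f.write(record)
```

After importing, check the resulting entry against the component list above, since tools differ in which fields they keep.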
