Date Published: March 26, 2014
Publisher: Public Library of Science
Author: Christopher W. Belter (Editor: Howard I. Browman)
Evaluation of scientific research is becoming increasingly reliant on publication-based bibliometric indicators, which may result in the devaluation of other scientific activities – such as data curation – that do not necessarily result in the production of scientific publications. This issue may undermine the movement to openly share and cite data sets in scientific publications, because researchers are unlikely to devote the effort necessary to curate their research data if they are unlikely to receive credit for doing so. This analysis demonstrates the bibliometric impact of properly curated and openly accessible data sets by generating citation counts for three data sets archived at the National Oceanographic Data Center (NODC). My findings suggest that all three data sets are highly cited, with estimated citation counts in most cases higher than those of 99% of all journal articles published in Oceanography during the same years. I also find that the methods used to cite and refer to these data sets in scientific publications are highly inconsistent, despite the fact that a formal citation format is suggested for each data set. These findings have important implications for developing a data citation format, encouraging researchers to properly curate their research data, and evaluating the bibliometric impact of individuals and institutions.
In recent years there has been increasing interest in, and use of, bibliometric indicators for the evaluation and ranking of research institutions. Bibliometric indicators feature prominently in global mixed-method ranking schemes such as the Academic Ranking of World Universities and the Times Higher Education ranking. They also feature in national mixed-method research assessment exercises in the UK, Belgium, Italy, and Australia. Other global ranking schemes are based solely on bibliometric indicators. Bibliometric indicators are also often recommended to supplement – or even replace – peer review in evaluating research institutions.
In consultation with NODC, I selected three highly used data sets for this analysis: the World Ocean Atlas and World Ocean Database (WOA/WOD), the Pathfinder Sea Surface Temperature (PSST) data set, and the Group for High Resolution Sea Surface Temperature (GHRSST) data set. The World Ocean Atlas is a quality-controlled set of objectively analyzed global in situ observational data, published in four volumes focused on the variables of temperature, salinity, oxygen, and nutrients. NODC considers the World Ocean Atlas a data product rather than a raw data set, because it is a compilation of many individual data sets gathered at various times and locations around the world and because of the quality control and analysis performed on the underlying data; nevertheless, I consider it a data set for the purposes of this analysis. The World Ocean Database is an interactive database of the data used to create the World Ocean Atlas. Since the Atlas and the Database utilize the same underlying data, I will refer to them in combination as the WOA/WOD. The WOA/WOD was initially published in 1982 as the ‘Climatological Atlas of the World Ocean’ and rereleased with updated data in 1994, 1998, 2002, 2006, and 2009–2010. The PSST data set is a long-term set of global sea surface temperature data derived from the Advanced Very High Resolution Radiometer (AVHRR) sensor mounted on NOAA’s polar-orbiting satellites. The GHRSST data set is a global set of combined satellite and in situ sea surface temperature data contributed by a number of institutions from around the world. GHRSST data are initially collected from these institutions by the NASA Jet Propulsion Laboratory and then transferred to NODC 30 days after observation for long-term preservation and access.
The citation counts generated for WOA/WOD, PSST, and GHRSST during the first phase of this analysis are summarized in Figure 1. Two consistent patterns emerge in these counts. First, the total number of citations generated for each data set increases as the coverage of the data source increases. WoS is the most limited of the three data sources, since it indexes only article metadata, acknowledgements, and cited references. Publishers’ sites have wider coverage, since they allow access to articles’ full text, but I searched only a limited number of these sites. Google Scholar has the broadest coverage, in that it offers access to the full text of a broad range of publishers’ websites as well as to conference proceedings, institutional repositories, and other websites. The citation counts generated from these data sources follow this pattern: counts generated from publishers’ sites were nearly four times higher than those generated from WoS, and counts generated from Google Scholar were nearly eight times higher than those from WoS.
These results have a number of implications for data curation and data citation initiatives. First, my results indicate that all three of these data sets are highly cited. My phase 1 results suggest that, if they were counted as journal articles in WoS, both the WOA/WOD and the PSST data sets would have citation counts higher than those of 99% of all articles in Oceanography in WoS from any single publication year from 1995 to the present. Using the more expansive journal full-text method, each of the three data sets would be ranked in the top 1% of all articles published in Oceanography during the same year by citation count, and the WOA/WOD and PSST data sets would be ranked in the top 0.1%. My phase 2 results indicate that each version of the WOA/WOD would be ranked in the top 0.1% of articles in Oceanography published during the same year, and that the 1982 and 1994 versions have been cited more than twice as often as the most highly cited article published in Oceanography in 1982 and 1994, respectively. Percentile values and article citation counts for journal articles in Oceanography were obtained by using the search string “WC = oceanography AND PY = 1995” and sorting the results by “Times Cited – highest to lowest”; this search was then repeated for the other publication years. Because of the limitations of my search methods noted above, these citation counts are likely to be underestimates of the actual totals for each data set.
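The percentile comparison described above – ranking a data set's citation count against the citation counts of journal articles from the same publication year – can be sketched in a few lines. The citation counts below are invented placeholders for illustration, not figures from this study.

```python
# Sketch of the percentile-rank comparison: where would a data set fall if it
# were treated as a journal article from the same publication year?
# All counts here are hypothetical, not values reported in this analysis.

def percentile_rank(dataset_citations, article_citations):
    """Percentage of articles cited less often than the data set."""
    below = sum(1 for c in article_citations if c < dataset_citations)
    return 100.0 * below / len(article_citations)

# Hypothetical WoS citation counts for articles from a single year:
article_counts = [0, 1, 1, 2, 3, 5, 8, 13, 21, 150]
print(percentile_rank(200, article_counts))  # prints 100.0: outcites all ten
```

In practice the article distribution would come from a WoS query such as the “WC = oceanography AND PY = 1995” search described above, repeated per publication year.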
In this analysis, I attempted to generate citation counts for three oceanographic data sets curated by NODC by searching WoS, publishers’ websites, and Google Scholar for mentions of these data sets in the bibliographic information or full text of scientific articles. I found that, although there were substantial differences in the citation counts derived from each source, all three data sets were highly cited in all sources. The WOA/WOD was particularly highly cited, with all versions of the data set together having received over 8,000 citations since its first release in 1982. My results suggest that scientific articles are more likely to mention these data sets in the text than in the acknowledgements or cited references sections. I also found wide discrepancies in the methods used to refer to these data sets, both in the full text and in the cited references sections. I found 377 variant methods of citing different versions of a single data set, the WOA/WOD, suggesting that researchers are not consistently using the citation formats provided for these data sets.
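The 377 citation variants illustrate why counting data citations requires collapsing superficially different strings that refer to the same data set. One simple normalization approach is sketched below; the variant strings are invented examples, not citations collected in this study.

```python
# Sketch: grouping variant citation strings that refer to the same data set
# by normalizing case, punctuation, and whitespace before comparing.
# The example variants are hypothetical, not strings from this analysis.
import re

def normalize(citation):
    s = citation.lower()
    s = re.sub(r"[^\w\s]", " ", s)    # replace punctuation with spaces
    s = re.sub(r"\s+", " ", s).strip()  # collapse runs of whitespace
    return s

variants = [
    "World Ocean Atlas 2009,",
    "world ocean atlas  2009",
    "World-Ocean Atlas 2009",
]
groups = {normalize(v) for v in variants}
print(len(groups))  # prints 1: all three variants collapse to one form
```

A full deduplication pass would also need fuzzier matching (abbreviations, version numbers, author-style vs. title-style references), but exact matching after normalization is a reasonable first step.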