Uncovering ‘Sneaked References’ in an Article’s Metadata
A researcher working alone – apart from the world and the rest of the wider scientific community – is a classic yet misguided image. Research is, in reality, built on continuous exchange within the scientific community: First you understand the work of others, and then you share your findings.
Reading and writing articles published in academic journals and presented at conferences is a central part of being a researcher. When researchers write a scholarly article, they must cite the work of peers to provide context, detail sources of inspiration and explain differences in approaches and results. A positive citation by other researchers is a key measure of visibility for a researcher’s own work.
But what happens when this citation system is manipulated? A recent Journal of the Association for Information Science and Technology article by our team of academic sleuths – which includes information scientists, a computer scientist and a mathematician – has revealed an insidious method to artificially inflate citation counts through metadata manipulations: sneaked references.
Hidden manipulation
People are becoming more aware of scientific publications and how they work, including their potential flaws. Just last year more than 10,000 scientific articles were retracted. The issues around citation gaming and the harm it causes the scientific community, including damaging its credibility, are well documented.
Citations of scientific work abide by a standardized referencing system: Each reference explicitly mentions at least the title, authors’ names, publication year, journal or conference name, and page numbers of the cited publication. These details are stored as metadata, not visible in the article’s text directly, but assigned to a digital object identifier, or DOI – a unique identifier for each scientific publication.
References in a scientific publication allow authors to justify methodological choices or present the results of past studies, highlighting the iterative and collaborative nature of science.
However, we found through a chance encounter that some unscrupulous actors have added extra references, invisible in the text but present in the articles’ metadata, when they submitted the articles to scientific databases. The result? Citation counts for certain researchers or journals have skyrocketed, even though these references were not cited by the authors in their articles.
Chance discovery
The investigation began when Guillaume Cabanac, a professor at the University of Toulouse, wrote a post on PubPeer, a website dedicated to postpublication peer review, in which scientists discuss and analyze publications. In the post, he detailed how he had noticed an inconsistency: a Hindawi journal article that he suspected was fraudulent because it contained awkward phrases had far more citations than downloads, which is very unusual.
The post caught the attention of several sleuths who are now the authors of the JASIST article. We used a scientific search engine to look for articles citing the initial article. Google Scholar found none, but Crossref and Dimensions did find references. The difference? Google Scholar is likely to mostly rely on the article’s main text to extract the references appearing in the bibliography section, whereas Crossref and Dimensions use metadata provided by publishers.
A new type of fraud
To understand the extent of the manipulation, we examined three scientific journals that were published by the Technoscience Academy, the publisher responsible for the articles that contained questionable citations.
Our investigation consisted of three steps:
- We listed the references explicitly present in the HTML or PDF versions of an article.
- We compared these lists with the metadata recorded by Crossref, discovering extra references added in the metadata but not appearing in the articles.
- We checked Dimensions, a bibliometric platform that uses Crossref as a metadata source, finding further inconsistencies.
In the journals published by Technoscience Academy, at least 9 percent of recorded references were “sneaked references.” These additional references were only in the metadata, distorting citation counts and giving certain authors an unfair advantage. Some legitimate references were also lost, meaning they were not present in the metadata.
In addition, when analyzing the sneaked references, we found that they highly benefited some researchers. For example, a single researcher who was associated with Technoscience Academy benefited from more than 3,000 additional illegitimate citations. Some journals from the same publisher benefited from a couple hundred additional sneaked citations.
We wanted our results to be externally validated, so we posted our study as a preprint, informed both Crossref and Dimensions of our findings and gave them a link to the preprinted investigation. Dimensions acknowledged the illegitimate citations and confirmed that their database reflects Crossref’s data. Crossref also confirmed the extra references in Retraction Watch and highlighted that this was the first time that it had been notified of such a problem in its database. The publisher, based on Crossref’s investigation, has taken action to fix the problem.
Implications and potential solutions
Why is this discovery important? Citation counts heavily influence research funding, academic promotions and institutional rankings. Manipulating citations can lead to unjust decisions based on false data. More worryingly, this discovery raises questions about the integrity of scientific impact measurement systems, a concern that has been highlighted by researchers for years. These systems can be manipulated to foster unhealthy competition among researchers, tempting them to take shortcuts to publish faster or achieve more citations.
To combat this practice we suggest several measures:
- Rigorous verification of metadata by publishers and agencies like Crossref.
- Independent audits to ensure data reliability.
- Increased transparency in managing references and citations.
This study is the first, to our knowledge, to report a manipulation of metadata. It also discusses the impact this may have on the evaluation of researchers. The study highlights, yet again, that the overreliance on metrics to evaluate researchers, their work and their impact may be inherently flawed and wrong.
Such overreliance is likely to promote questionable research practices, including hypothesizing after the results are known, or HARKing; splitting a single set of data into several papers, known as salami slicing; data manipulation; and plagiarism. It also hinders the transparency that is key to more robust and efficient research. Although the problematic citation metadata and sneaked references have now been apparently fixed, the corrections may have, as is often the case with scientific corrections, happened too late.
Disclosures
Thierry Viéville, directeur de recherche inria en charge de la médiation scientifique at Inria, contributed to this article.
Lonni Besançon receives funding from the Marcus And Amalia Wallenberg Foundation.
Guillaume Cabanac receives funding from the European Research Council (ERC) and the Institut Universitaire de France (IUF). He is the administrator of the Problematic Paper Screener, a public platform that uses metadata from Digital Science and PubPeer via no-cost agreements.Thierry Viéville does not work for, consult, own shares in or receive funding from any company or organization that would benefit from this article, and has disclosed no relevant affiliations beyond their academic appointment.