Analyze millions of scientific papers all at once

August 4, 2017
With more than a million scientific papers produced each year, keeping on top of the latest research is becoming an impossible task. That’s why a growing number of scientists are having computers trawl through thousands of research papers at once for raw data and text. Now, in one of the largest text and data mining exercises ever conducted, scientists say they have identified the best way to do such searches, which could improve the hunt for everything from new drug targets to genes that have not been studied in detail.

There is a long-standing debate among text and data miners over whether sifting through full research papers, rather than their much shorter and simpler summaries, or abstracts, is worth the extra effort. Though it may seem obvious that full papers would give better results, some researchers say that much of the information they contain is redundant, and that abstracts contain all that’s needed. Given the challenges of obtaining and formatting full papers for mining, stick with abstracts, they say.

In an attempt to settle the debate, Søren Brunak, a bioinformatician at the Technical University of Denmark in Kongens Lyngby, and colleagues analyzed more than 15 million scientific articles published in English from 1823 to 2016. After creating two databases of those articles (one of full texts and one of abstracts), the researchers directly compared the results of mining each. The full texts were obtained from publishers Elsevier and Springer, as well as the open-access section of online repository PubMed Central. The abstracts from the same papers were collected from MEDLINE, a resource that, like PubMed Central, receives funding from the U.S. National Institutes of Health.

Text mining full research articles gave consistently better results than text mining abstracts, the team reports this month on the preprint server bioRxiv (which was not mined). In one example test, the authors identified far more associations between genes and a variety of diseases from the full-text articles than from the abstracts, potentially creating a treasure trove of ideas for future research targets.

The paper “convincingly shows that ideally text mining studies should use full-text,” says Daniel Himmelstein, a biodata scientist at the University of Pennsylvania who was not involved in the study.

For now, many researchers use only abstracts, says study co-author Lars Juhl Jensen, a bioinformatician at the University of Copenhagen. These summaries are typically much easier to get hold of than full research papers, have fewer legal restrictions on their use, and are much easier for computers to read because of their simple formatting.

Given those advantages, researchers using text mining may not switch from abstracts any time soon, Himmelstein says. An additional obstacle, he notes, is that because of restrictions publishers place on many full-text articles, researchers are often barred from sharing the databases of papers they download and prepare for text mining, making it extremely difficult for others to replicate their research.

Brunak admits that the process of negotiating permissions with publishers was challenging and took his colleagues in the library several months. But he says that arguably the most time-consuming and challenging step in the study was converting the full-text articles the publishers provided in the common PDF file format into a machine-readable text format.

“This is one of the big reasons why nobody did full-text mining at this scale before,” Jensen says. “We probably spent more computational resources teasing the text out of PDFs and beating it into shape than we spent on the actual text mining.” Jensen warns that if researchers aren’t familiar with this step, they may be “unpleasantly surprised” by how many errors they get when converting the files.

One solution, says Jensen, would be for publishers to ensure that full-text articles can be easily mined. He’s eager to see publishers work together to find “a consistent format” that could be used across the board, “rather than each journal just inventing their own.” The XML file format for sharing data used by the scholarly article repository PubMed Central could be a good model for this, Jensen notes.

By Lindsay McKenzie

Science