Okay, if someone asks, we will of course say, don’t do this:
Publishers withdraw more than 120 gibberish papers
Conference proceedings removed from subscription databases after scientist reveals that they were computer-generated.
But if you want to know how the offenders actually do it, here’s physicist Rob Sheldon with a lay-friendly explanation:
I finally had a chance to read the Labbe paper that found all the computer-generated papers. Here’s what they found:
The SCIgen fake paper generator is a rather long “Mad-Libs” template. The sentences are constructed by hand, but blanks are left for “scientific_adjective”, “noun-for-process”, etc. Glossaries of 50 or 100 words are supplied for these adjectives and nouns, and then the paper is constructed by filling the blanks randomly. So the grammar is correct, and even the logical structure looks correct; it is just that the content is made up. The code for this generator was written by three grad students at MIT in 2005, and originally the blanks were all “computer-science” words. The references are constructed likewise. The students did this as a prank, to demonstrate that many meetings don’t really care about “peer-reviewing” the entries; they care about making money off the registrants. What is scary is that many of the outlets that accepted the papers, including ones published by IEEE and Springer-Verlag, are respected and “peer reviewed”.
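For the curious, the fill-in-the-blank idea can be sketched in a few lines of Python. This is only an illustration of the technique described above, not SCIgen's actual code: the glossary words, the two templates, and the function names fill and generate are all made up for the example.

```python
import random

# Hypothetical glossaries; the real word lists are described as 50 to 100 words each.
GLOSSARY = {
    "scientific_adjective": ["scalable", "probabilistic", "decentralized"],
    "noun_for_process": ["synthesis", "emulation", "refinement"],
    "noun_for_thing": ["hash tables", "neural networks", "web browsers"],
}

# Hand-written sentence templates with named blanks (invented for this sketch).
TEMPLATES = [
    "The {noun_for_process} of {noun_for_thing} is a {scientific_adjective} problem.",
    "We present a {scientific_adjective} approach to the {noun_for_process} of {noun_for_thing}.",
]


def fill(template, rng):
    """Pick one random glossary word per blank type and fill the template."""
    choices = {slot: rng.choice(words) for slot, words in GLOSSARY.items()}
    return template.format(**choices)


def generate(n_sentences, seed=None):
    """Build a 'paper' by filling randomly chosen templates."""
    rng = random.Random(seed)
    return " ".join(fill(rng.choice(TEMPLATES), rng) for _ in range(n_sentences))


print(generate(3, seed=0))
```

Every sentence that comes out is grammatical because a human wrote the template; only the words dropped into the blanks are random.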
The students made their “mad lib” code available to everybody, and since 2005, other uses have been found for it. For example, both in China and in Eastern Europe, promotions are based on getting peer-reviewed articles into English-language journals. Many scientists there have next to no English writing ability, and this is a way for them to raise their publication count.
The Labbe paper showed that you could also fool the Google Scholar metrics with what is called a “quote farm”. Google tracks how many people quote you, and follows the chain of quotes a couple of references backward. So Labbe and his wife created 100 SCIgen papers, carefully putting the other 99 papers in the reference list of each. They didn’t need to publish them, because Google Scholar just pulls them off the internet. Thus a closed universe of self-quoting papers was created. When the Google metrics hit this, they ran around in circles finding out that “Dr Antkare” was being quoted by everyone!
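The arithmetic of such a closed universe is easy to check. The sketch below builds 100 hypothetical papers that each reference the other 99, then counts how often each one is referenced. The h-index calculation at the end is an assumption added for illustration: it is a common citation metric, but the text above does not say which metric Google Scholar reported.

```python
from collections import Counter


def build_citation_farm(n_papers):
    """Map each paper id to the ids it references: every paper except itself."""
    papers = [f"paper_{i}" for i in range(n_papers)]
    return {p: [q for q in papers if q != p] for p in papers}


def citation_counts(references):
    """Count how many times each paper appears in someone's reference list."""
    counts = Counter()
    for cited in references.values():
        counts.update(cited)
    return counts


def h_index(counts):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(counts, reverse=True)
    h = 0
    for rank, c in enumerate(ranked, start=1):
        if c >= rank:
            h = rank
    return h


farm = build_citation_farm(100)
counts = citation_counts(farm)
print(min(counts.values()))       # 99: every paper is cited by all the others
print(h_index(counts.values()))   # 99
```

No outside reader ever has to cite anything; the papers simply vouch for each other.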
Labbe also discovered that when they used the “More Like This One” feature on the search tools offered by Google or Nature or whoever, they could feed in SCIgen papers they had made and find numerous others in the literature! Over 120 papers were found this way, which they then dutifully reported to the publishers as machine generated.
Several variants of the SCIgen paper have also been constructed. One does for high-energy particle physics exactly what SCIgen did for computer science. (Lubos Motl has sarcastically referred to this version as a clone of Lee Smolin’s brain. I too know a colleague who could double for Smolin.) Labbe has offered his “find SCIgen” program to the public, but it simply recognizes that the sentences in SCIgen are “fill in the blank” and therefore never depart from a recognizable script. The opening sentence is always one of two forms, which is a dead giveaway.
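To see why a fixed script is so easy to spot, here is a rough sketch of script recognition. It is not Labbe's program, whose internals are not described here; it just turns two invented templates into regular expressions and reports what fraction of a text's sentences fit one of them. The templates and the names template_to_regex and scripted_fraction are hypothetical.

```python
import re

# Invented templates standing in for a generator's hand-written sentences.
TEMPLATES = [
    "The {noun_for_process} of {noun_for_thing} is a {scientific_adjective} problem.",
    "We present a {scientific_adjective} approach to the {noun_for_process} of {noun_for_thing}.",
]


def template_to_regex(template):
    """Escape the fixed text and let each blank match any non-empty phrase."""
    fixed_parts = re.split(r"\{[a-z_]+\}", template)
    pattern = "(.+?)".join(re.escape(part) for part in fixed_parts)
    return re.compile(pattern)


PATTERNS = [template_to_regex(t) for t in TEMPLATES]


def scripted_fraction(sentences):
    """Fraction of sentences that match one of the known templates exactly."""
    if not sentences:
        return 0.0
    hits = sum(any(p.fullmatch(s) for p in PATTERNS) for s in sentences)
    return hits / len(sentences)


sample = [
    "The synthesis of hash tables is a scalable problem.",
    "Shannon measured the redundancy of printed English.",
]
print(scripted_fraction(sample))  # 0.5
```

A detector like this only works while the templates are known, which is exactly the weakness raised in the next paragraph.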
Of course, now those MIT grads will probably respond to Labbe’s algorithm by making a “Mad Lib” template generator, so an infinite variety of templates can be produced. It will be a challenge detecting the second round of fake papers. But more importantly, the ease with which these papers can be created, combined with the apparent difficulty reviewers have in recognizing them, doesn’t bode well for any of these fields. It looks like there are far more charlatans than we thought, or, alternatively, the economic benefits of securing tenure far outweigh the punishment for getting caught.
Here’s a challenge for ID types: There must be a way to tell when a sentence is gibberish and when it is meaningful.
Rob offers a suggestion on the fly:
Claude Shannon went through a text and eliminated letters to see if humans could reconstruct the words from the letters remaining. He then had a metric for how much information was encoded in each letter. Has anyone done this for words? How about sentences? Surely we can apply this sort of metric to a mad libs paper, and detect its information content.
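As a starting point for the word-level question, here is a crude sketch that measures the entropy of a text in bits per word from word frequencies alone. It is a stand-in, not Shannon's procedure, which relied on human readers; the function name word_entropy is made up, and a frequency count like this is not claimed to separate gibberish from meaningful prose by itself.

```python
import math
from collections import Counter


def word_entropy(text):
    """Shannon entropy of the word-frequency distribution, in bits per word."""
    words = text.lower().split()
    if not words:
        return 0.0
    counts = Counter(words)
    total = len(words)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


print(word_entropy("a b a b"))  # 1.0
print(word_entropy("a b c d"))  # 2.0
```

A more faithful test would ask how well each word can be predicted from the words before it, which is where a paper filled in at random ought to look different from one a person actually meant.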