Markov Chain Essay Generator

In recent years, the field of academic publishing has ballooned to an estimated 30,000 peer-reviewed journals churning out some 2 million articles per year. While this growth has led to more scientific scholarship, critics argue that it has also spurred increasing numbers of low-quality “predatory publishers” who spam researchers with weekly “calls for papers” and charge steep fees for articles that they often don’t even read before accepting.

Ten years ago, a few students at MIT’s Computer Science and Artificial Intelligence Lab (CSAIL) noticed such unscrupulous practices and set out to have some mischievous fun with them. Jeremy Stribling MS ’05 PhD ’09, Dan Aguayo ’01 MEng ’02, and Max Krohn PhD ’08 spent a week or two between class projects developing “SCIgen,” a program that randomly generates nonsensical computer-science papers, complete with realistic-looking graphs, figures, and citations.

SCIgen emerged out of Krohn’s previous work as co-founder of the online study guide SparkNotes, which included a generator of high-school essays that was based on “context-free grammar.” SCIgen works like an academic “Mad Libs” of sorts, arbitrarily slotting in computer-science buzzwords like “distributed hash tables” and “Byzantine fault tolerance.”
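To make the “Mad Libs” idea concrete, here is a minimal Python sketch of text generation from a context-free grammar. The grammar, rule names, and vocabulary below are invented for illustration and are not taken from SCIgen; only the two buzzwords come from the description above.

```python
import random

# Toy grammar (hypothetical, not SCIgen's own rules). Each key is a symbol
# that can be expanded; each value lists the alternative expansions.
# Any string without a rule of its own is emitted as literal text.
GRAMMAR = {
    "TITLE": [["ADJ", "NOUN", "for", "TOPIC"], ["On the", "NOUN", "of", "TOPIC"]],
    "ADJ": [["Scalable"], ["Decentralized"], ["Robust"]],
    "NOUN": [["Methodology"], ["Unification"], ["Analysis"]],
    "TOPIC": [["distributed hash tables"], ["Byzantine fault tolerance"]],
}

def expand(symbol, rng):
    """Recursively replace a symbol with one randomly chosen expansion."""
    if symbol not in GRAMMAR:
        return symbol  # terminal: plain text, returned unchanged
    production = rng.choice(GRAMMAR[symbol])
    return " ".join(expand(part, rng) for part in production)

if __name__ == "__main__":
    rng = random.Random()
    for _ in range(3):
        print(expand("TITLE", rng))
```

Because every sentence is built by filling slots in a fixed template, the output is always grammatical on the surface even though the word choices are arbitrary.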

The program was crude, but it did the trick: In April of 2005 the team’s submission, “Rooter: A Methodology for the Typical Unification of Access Points and Redundancy,” was accepted as a non-reviewed paper to the World Multiconference on Systemics, Cybernetics and Informatics (WMSCI), a conference that Krohn says is known for “being spammy and having loose standards.”

When the researchers revealed their hoax, calls started coming in from the likes of The Boston Globe, CNN, and the BBC. Stribling’s phone was ringing off the hook thanks to his name being listed first on the paper. (“Randomly listed first,” he adds proudly.)

In the wake of the international media attention, WMSCI withdrew the team’s invitation to attend. Not to be deterred, the students raised $2,500 to travel to Orlando, Florida, where they rented out a room inside the conference space to hold their own “session” of randomly-generated talks, outfitted with fake names, fake business cards, and fake moustaches.

At the time the stunt may have seemed like nothing more than a silly “gotcha” moment in the tradition of the “Sokal affair,” in which an NYU physicist wrote a nonsense paper that was accepted by a journal of postmodern cultural studies. But SCIgen has actually had a surprisingly substantial impact, with many researchers using it to expose conferences with low submission standards. The team’s antics spurred the world’s largest organization of technical professionals, the Institute of Electrical and Electronics Engineers (IEEE), to pull its sponsorship of WMSCI; in 2013 IEEE and Springer Publishing removed more than 120 papers from their sites after a French researcher’s analysis determined that they were generated via SCIgen. (Just a few weeks ago Springer announced the release of “SciDetect,” an open-source tool that can automatically detect SCIgen papers.)

The trio of CSAIL alumni have since moved on to other things: Aguayo is a technical lead at Meraki; Krohn, who co-founded both SparkNotes and the dating site OKCupid, now runs Keybase, a startup aimed at making cryptography more accessible; and Stribling had stints at IBM, Google, and Nicira before joining Krohn’s team at Keybase this month.

But even a decade later, the team’s creation improbably lives on. Stribling says the generator still gets 600,000 annual pageviews that manage to crash their CSAIL research site every few months. The creators continue to get regular emails from computer science students proudly linking to papers they’ve snuck into conferences, as well as notes from researchers urging them to make versions for other disciplines.

“Our initial intention was simply to get back at these people who were spamming us and to maybe make people more cognizant of these practices,” says Stribling, before deadpanning: “We accomplished our goal way better than we expected to.”


For the 10-year anniversary, the team reconvened for a project that’s once again aimed at predatory publishers.

“SCIpher” lets you hide secret messages inside randomly-generated calls for papers (CFPs) that appear to be coming from (fictional) conferences with names like “the LYGNY Symposium on relational, software-defined technology.”

Entering a secret message into SCIpher creates the text of a ready-to-send CFP, which the recipient can throw back into the generator to recover the original message.
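The article does not describe how SCIpher encodes messages, so the following Python sketch is only one plausible way to get the same round trip, not the team’s method. It assumes a hypothetical 16-word vocabulary in which each buzzword stands for one 4-bit value, so every byte of the message becomes two buzzwords in a fake list of conference topics.

```python
# Hypothetical vocabulary: position in the list is the 4-bit value (0-15).
BUZZWORDS = [
    "relational", "software-defined", "Bayesian", "distributed",
    "modular", "scalable", "wireless", "probabilistic",
    "encrypted", "virtual", "adaptive", "homogeneous",
    "concurrent", "stochastic", "peer-to-peer", "autonomous",
]
INDEX = {word: i for i, word in enumerate(BUZZWORDS)}
PREFIX = "Topics of interest include: "

def hide(message):
    """Encode each UTF-8 byte as two buzzwords (high nibble, then low nibble)."""
    words = []
    for byte in message.encode("utf-8"):
        words.append(BUZZWORDS[byte >> 4])
        words.append(BUZZWORDS[byte & 0x0F])
    return PREFIX + ", ".join(words) + "."

def reveal(cfp):
    """Map the buzzwords back to nibbles and reassemble the original bytes."""
    body = cfp[len(PREFIX):].rstrip(".")
    nibbles = [INDEX[word] for word in body.split(", ") if word]
    pairs = zip(nibbles[::2], nibbles[1::2])
    return bytes((high << 4) | low for high, low in pairs).decode("utf-8")

if __name__ == "__main__":
    cfp = hide("meet at noon")
    print(cfp)
    print(reveal(cfp))
```

This hides the message but does not encrypt it: anyone who knows the vocabulary can read it, so real secrecy would still need encryption before the encoding step.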

Stribling says he views SCIpher as a cheeky way to trade secrets — not to mention, to poke fun at conferences’ ridiculous, jargon-filled names.

“We combined almost-pronounceable acronyms with random buzzwords cribbed from the SCIgen grammar to evoke the kind of niche specialization that results from thousands of concurrent conferences clamoring for authors,” says Stribling. “Plus, while an encrypted email would be a big red flag for some investigators, in our experience when you send out a call for papers, it's very unlikely that anyone will read it.”

Most physical scientists are probably aware of the arXiv website, which hosts preprints of scientific papers. The site is not peer-reviewed, but it does have some mysterious process for weeding out the crazies. In most cases, legitimate scientific articles get posted without issue. There are occasionally delays if something in your paper raises a flag, but very rarely, if ever, does real research get "rejected". Crackpot "research", on the other hand, is very effectively weeded out. Naturally, there is a competing website, viXra, that has no screening process, so all of the really out-there stuff gets posted there. I don't know of any real scientists who post to it or regularly check it, because it is full of crazy. Based on a discussion I had with some friends over beers, I decided to write a program to generate titles and abstracts for viXra articles.

Markov Chains

The natural choice for such a project is a Markov chain. This is a pseudo-random process in which the next value depends on the previous value, but not on any other values. In this project, my Markov chain generates a random word based on the previous word that was generated. For example, if the current word is "magnetic" then there is a good chance the next word will be "field." This property allows a Markov chain generator to make sentences that are vaguely readable while still being random nonsense. The fun part here is that I am training the Markov chain generator on viXra articles (read: nonsense), so much of what gets generated is scarily believable as a viXra entry.
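As a rough illustration of the idea (this is a from-scratch sketch, not the pymarkovchain code used for the actual project), a word-level Markov chain can be built in a few lines of Python. The function names and the tiny corpus are made up for the example.

```python
import random
from collections import defaultdict

def train(texts):
    """Map each word to the list of words observed immediately after it."""
    followers = defaultdict(list)
    for text in texts:
        words = text.split()
        for current, nxt in zip(words, words[1:]):
            # Repeats are kept, so frequent followers are picked more often.
            followers[current].append(nxt)
    return followers

def generate(followers, start, max_words=12, rng=random):
    """Walk the chain: each new word depends only on the previous word."""
    words = [start]
    # Stop at the length limit or when the last word has no known follower.
    while len(words) < max_words and followers.get(words[-1]):
        words.append(rng.choice(followers[words[-1]]))
    return " ".join(words)

if __name__ == "__main__":
    # Tiny made-up corpus standing in for real training text.
    corpus = [
        "the magnetic field of the galaxy",
        "a magnetic field in dark matter",
        "dark matter and dark energy",
    ]
    chain = train(corpus)
    print(generate(chain, "magnetic"))
```

In this toy corpus "magnetic" is always followed by "field", so every generated sentence that starts with "magnetic" continues with "field" before the chain branches.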


I used the excellent pymarkovchain code to do all the hard work of creating the Markov chain, and the even more excellent beautifulsoup package to parse the titles and abstracts from HTML pages. As with any machine learning technique, Markov chains perform best when they have lots of data, so I fed the Markov chain generator every viXra page from 2010 to the present. The source code I used is available as a gist here.

The only tricky part of this was trying to get the titles and abstracts to match. Since I trained separate Markov chains to generate titles and abstracts, they tend to have nothing to do with each other. I attempt to get them to match by generating a bunch of titles and abstracts, and finding the pair where the words in the title appear in the abstract as well.
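One way the matching step could look is sketched below in Python. The scoring rule (the fraction of distinct title words that also occur in the abstract) and the function names are assumptions made for this example, and the real gist may score pairs differently.

```python
PUNCTUATION = ".,:;()[]\"'"

def word_set(text):
    """Lowercase the text and return its distinct words without punctuation."""
    words = {word.lower().strip(PUNCTUATION) for word in text.split()}
    words.discard("")
    return words

def overlap(title, abstract):
    """Fraction of distinct title words that also occur in the abstract."""
    title_words = word_set(title)
    if not title_words:
        return 0.0
    return len(title_words & word_set(abstract)) / len(title_words)

def best_pair(titles, abstracts):
    """Try every combination and return the (title, abstract) with top overlap."""
    candidates = ((t, a) for t in titles for a in abstracts)
    return max(candidates, key=lambda pair: overlap(*pair))

if __name__ == "__main__":
    titles = ["Hypothesis for Dark Matter", "Repulsive Gravity"]
    abstracts = [
        "Rotation curves suggest dark matter is a hypothesis.",
        "The universe is consistent with the Moon.",
    ]
    print(best_pair(titles, abstracts))
```

Common words such as "for" and "the" count toward the score here. Filtering them out would probably give better matches, at the cost of a stop-word list.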

Generated viXra Entries

Now, for your viewing pleasure, I present a few of the random articles that my code produced:

Evidence for Dark Matter as a Standing Sound Nebula

An equally valid interpretation of a force attracting the theory to cube root of physics." the stationary Current Free Lunch” satisfied only demonstrate an apple to 13.82 billion years (Lamda cold and MICROSTRUCTURE of the motion of the wave particle to reduce the corresponding matter is transported to what Time Dilation [1], along the claims that fitting galaxy rotation curves is presented: The Big Bang caused acceleration created negative and Relativistic Quantum Theories. The universe is consistent with the Moon.

The title makes (at least grammatical) sense, while the abstract is pretty much nonsense. That is generally the case, which I assume is because the abstracts are generally much longer. I do like the 13.82 billion year old apple though!

Hypothesis for Dark Matter

Recent observations to obtain the rotation curves and nested parallel universes was proportional to that we don't know why the Accelerating Universe, flat rotation curves of three-body problem in agreement between dark matter and dark matter and Fermion particles. As a dual theorists. Hence, our real energy side of orbital period (which means a second. [6] This temperature dependent energy (ZPE) into gravitational force attracting the Super Universe’s parallel moving more than the annihilation of belt of spiral galaxies usually result of duality as a high center temperature dependent energy to be the TI field potential along the co-relation between dark matter in that at cosmological information inside.

Dark matter and dark energy are really common themes on viXra (see below). I guess the authors love to come up with ideas about them since actual astronomers don't know very much about them either.

Asymmetric Dark Energy or Repulsive Gravity

Described as elaborated in the same intensity level, it seems to the angular-diameter distance. This theory of duality and astrophysics, weakly interacting quantum field to be described as real energy and magneton's beeline speed in neutrons absence of the effective mass is in the universe's expansion. Despite of mass of Physics, used first order to study of 3-rd body or by the maximum intensity, where the universe's expansion. A hypothesis for symmetry reasons the effective mass of relativistic mass it returned to r3/2.

The sad part is this title is almost believable as a real scientific article. One last thing: What are the most common words in viXra titles?

As we saw from the generated entries, dark matter is a really common theme. One phrase that I had to look up was Titius-Bode, a rule which states that the semimajor axis (a) of each planet, in astronomical units, follows the relation

$$ a = 0.4 + 0.3\cdot 2^m$$

for \( m = -\infty, 0, 1, 2, \ldots \). That is approximately true for our solar system, and there was even a real arXiv paper on the subject recently claiming that it is true-ish enough to use for prioritizing exoplanet searches. However, it is basically just numerology, so it is not given very much serious thought. Crackpots love things like numerology though, so it makes sense that Titius-Bode would make the top ten title phrases!
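To see how "approximately true" the relation is, here is a short Python check of the formula against rounded textbook semimajor axes in AU. The observed values are approximate, and putting the dwarf planet Ceres in the m = 3 slot is the conventional assignment rather than something stated above.

```python
def titius_bode(m):
    """Predicted semimajor axis in AU: a = 0.4 + 0.3 * 2**m."""
    return 0.4 + 0.3 * 2.0 ** m

# (body, m, approximate observed semimajor axis in AU)
# Mercury uses m = -infinity, which makes the 2**m term vanish.
OBSERVED = [
    ("Mercury", float("-inf"), 0.39),
    ("Venus", 0, 0.72),
    ("Earth", 1, 1.00),
    ("Mars", 2, 1.52),
    ("Ceres", 3, 2.77),
    ("Jupiter", 4, 5.20),
    ("Saturn", 5, 9.58),
    ("Uranus", 6, 19.2),
    ("Neptune", 7, 30.1),
]

if __name__ == "__main__":
    for body, m, observed in OBSERVED:
        predicted = titius_bode(m)
        error = 100.0 * (predicted - observed) / observed
        print(f"{body:8s} predicted {predicted:5.1f} AU, "
              f"observed {observed:5.2f} AU ({error:+.0f}%)")
```

The predictions stay within a few percent out to Uranus, but Neptune comes out near 38.8 AU against an observed value of about 30 AU, which is a large part of why the rule is regarded as a coincidence rather than physics.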

Posted by Kevin Gullikson SillyScience, Machine Learning, web-scraping

