to arXiv or not to arXiv

I believe it is a universal phenomenon: when you’re swamped with work, you suddenly feel the irresistible urge to do something else. This post is one of those something elses.

Back in January (2016), right after the submission deadline of NAACL’16, Chris Dyer famously (?) posted on his Facebook wall, “to arxiv or not to arxiv, that is the increasingly annoying question.” This question of whether “to arxiv or not to arxiv” a conference submission that has not yet gone through peer review has indeed become a thorny issue in the field of machine learning and the wider research community around it, including natural language processing.

Perhaps one of the strongest proponents of “to arXiv” is Yann LeCun at NYU & Facebook. In his “Proposal for a new publishing model in Computer Science,” he argues that “[m]any computer [s]cience researchers are complaining that our emphasis on highly selective conference publications, and our double-blind reviewing system stifles innovation and slow the rate of progress of [s]cience and technology.” This is a valid concern, as the rate of progress in computer science has largely overtaken the speed of the publication process. Furthermore, as the focus (and assessment) has moved from journals to so-called top-tier conferences, more and more papers get stuck in the purgatory of submit-review-reject-resubmit. Although conferences almost always guarantee faster decisions, the decision is binary, with little possibility of revision. The only way to salvage a rejected paper is to wait for another conference in the same year, or for the same conference in a subsequent year. Throughout this process, the content and ideas of a submission quite often become stale, which slows down scientific progress.1

Of course, at the same time, there are many issues with this approach of “to arXiv,” as opposed to the more traditional double-blind peer reviewing system (“not to arXiv”). Nowadays we see a flood of conference submissions on arXiv a day or two after the submission deadline of a conference, at least in the field of machine learning, or more specifically deep learning. Unfortunately, I must say that quite a few of them are of low quality. Why are so many low-quality submissions being made public? After all, probably no author wants to be associated with a submission that is half-baked and incomplete.

One potential reason I see is the severe competition among researchers from all corners of the globe. Nobody wants to be scooped simply because they forgot to upload their submission to arXiv before their competitors did. Pushed by this anxiety over being scooped, authors often end up putting out a rather half-baked manuscript. Or maybe authors are simply being naive in thinking that one can always update her manuscript on arXiv with a newer version. Combined with open reviewing systems, such as that of ICLR, we see a surge of half-baked submissions on arXiv once or twice every year, and this has been spreading to other conferences as well as other fields.2

Why is it an issue at all? Because it wastes many people’s time. We see an interesting title popping up in our Google Scholar My Updates or in someone’s tweet and, as researchers, we cannot ignore that submission, be it accepted at some conference or not. And, after reading the paper for 10-30 minutes, we realize that “well, I should wait a few months for the next version!” Also, the frequent lack of thorough empirical validation may mislead readers into a wrong conclusion.

But, again, I’m not trying to either advocate or oppose the idea of “to arXiv” in this post.3 Instead, I’m here to share the result of an informal survey I ran right after reading Chris’ Facebook post. The goal of the survey was to see how many people follow the “to arXiv” or the “not to arXiv” paradigm, and to what degree. The poll was completely anonymous and was done using the Facebook App <Polls for Pages>.4 It was rather informal, and the questions were slightly changed once at the beginning of the survey. Also, it’s quite heavily biased, as most of the participants are people close to me, meaning that they work on either deep learning or (statistical) natural language processing. In other words, take the result of this poll with a grain of salt.

In total, 203 people participated, and they were either machine learning or natural language processing researchers. Among them, 64.5% said their major area of research is machine learning, and the rest natural language processing. 

The participants were asked first whether they upload their conference “submission” to arXiv. About two thirds of the participants answered that they do.

When I drew this pie chart, I noticed a striking resemblance to the chart showing the portion of machine learning researchers among the participants. Is it possible that all the machine learning researchers post their submissions to arXiv but no NLP researchers do? It turned out that the answer was “no.”

 

[Pie charts: share of respondents who upload their submissions to arXiv, among ML researchers and among NLP researchers]

Still, I was able to see a stark difference between the machine learning researchers and the NLP researchers. While 75.6% of the machine learning researchers said they upload their submissions to arXiv, fewer than 50% of the NLP researchers did so. I believe this reflects the fact that this model of “to arXiv” has recently been strongly advocated by some machine learning researchers such as Yann LeCun and Yoshua Bengio.
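
As a rough sanity check on these percentages, here is a minimal Python sketch that turns them back into head counts. It assumes that the reported shares are rounded to one decimal place and computed over all 203 participants, which the post does not state explicitly. The 50% used for NLP researchers is only the upper bound quoted above, not a reported figure, and the function names are made up for illustration.

```python
# Back-of-the-envelope check of the survey numbers quoted in the post.
# Assumption: every one of the 203 participants answered both questions.
TOTAL = 203        # participants (from the post)
ML_SHARE = 0.645   # share whose major area is machine learning (from the post)
ML_UPLOAD = 0.756  # share of ML researchers who upload to arXiv (from the post)


def implied_count(total, share):
    """Head count closest to a reported share of `total`."""
    return round(total * share)


ml = implied_count(TOTAL, ML_SHARE)          # implied number of ML researchers
nlp = TOTAL - ml                             # the rest are NLP researchers
ml_uploaders = implied_count(ml, ML_UPLOAD)  # ML researchers who upload


def overall_upload_share(nlp_upload_share):
    """Overall share of uploaders for a hypothetical NLP upload share."""
    return (ml_uploaders + nlp_upload_share * nlp) / TOTAL


print(ml, nlp, ml_uploaders)
print(overall_upload_share(0.5))  # upper bound, since NLP is below 50%
```

Under these assumptions, the overall share of uploaders stays just under two thirds, which agrees with the “about two thirds” reported for the first question.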

The second question was on “when” they uploaded their submissions to arXiv.5

 
The respondents were quite divided between “to arXiv right away”, “to arXiv after the deadline”, and “to arXiv after the paper’s accepted.” One lesson is that an absolute majority of the respondents want to put their papers online regardless of “official” publication (in proceedings).
 
Now, aren’t you curious how much this trend depends upon the field of research? First up, machine learning!
Whoa! More than half of the machine learning respondents said they upload their conference submissions to arXiv before receiving any formal feedback. Furthermore, more than 80% of the machine learning researchers make their papers available online way before the actual conference, meaning that if anyone’s determined enough, she can read most of the machine learning papers far in advance of the actual conferences (of course, you can’t drink beer with the authors, which is a kind of deal breaker for me..)
 
How much does it differ if we only consider NLPers?
Surprise, surprise! We see a radically different picture here. Only about a fifth of all the NLP respondents said they upload their submissions before any formal feedback. Nearly half of the NLPers wait until a decision is made on the submission before they arXiv it. Also, nearly a quarter of them do not actively use arXiv for conference submissions.
Now, what have we learned from this? What have I learned from this? What have you learned from this? I have learned quite a lot of interesting things from this survey, but my dinner time’s approaching too fast..
 
One thing is for sure: it’ll be extremely interesting to conduct this type of survey, in a much more rigorous way, at some point this year, and to do a follow-up study every year or every other year for the next decade. This would be an extremely valuable study that may help us build a better publication model for research.
 
So, my conclusion? It was $50 well spent.
 
The data (anonymized), along with the Python script I used to draw those pie charts (it was my first time and I don’t recommend it), are available at https://github.com/kyunghyuncho/toarxiv/blob/master/Analysis.ipynb.

1 There is also the issue of malicious reviewers or, more mildly put, subconscious bias working against some submissions, but I won’t try to touch this can of worms in this post.

2 I am guilty of this myself and do not in any sense intend to blame anyone. I view this as a systemic issue rather than an issue of any individual.

3 I will perhaps make another post on this some day, but not today, tomorrow, or this year.

4 Which was a pretty bad idea, because it turned out that I had to pay $50 in order to see the responses from more than 50 respondents.. 🙁

5 I assumed every researcher has a good intention of having their paper made public once it’s published regardless of whether to arXiv or not. Therefore, “probably not” should be understood as “probably not uploading a manuscript that was published in another medium/venue to a preprint server such as arXiv.” 
