I believe it is a universal phenomenon: when you’re swamped with work, you suddenly feel the irresistible urge to do something else. This post is one of those something elses.
Back in January (2016), right after the submission deadline of NAACL’16, Chris Dyer famously (?) posted on his Facebook wall, “to arxiv or not to arxiv, that is the increasingly annoying question.” This question of whether “to arxiv or not to arxiv” a conference submission that has not yet gone through peer review has indeed become a thorny issue in machine learning and the wider research community around it, including natural language processing.
Perhaps one of the strongest proponents of “to arXiv” is Yann LeCun at NYU & Facebook. In his “Proposal for a new publishing model in Computer Science,” he argues that “[m]any computer [s]cience researchers are complaining that our emphasis on highly selective conference publications, and our double-blind reviewing system stifles innovation and slow the rate of progress of [s]cience and technology.” This is a valid concern, as the rate of progress in computer science has largely outpaced the speed of the publication process. Furthermore, as the focus (and assessment) has moved from journals to so-called top-tier conferences, more and more papers get stuck in the purgatory of submit-review-reject-resubmit. Although conferences almost always guarantee a faster decision, that decision is binary, with little possibility of revision. The only way to salvage a rejected paper is to wait for another conference in the same year, or for the same conference in a subsequent year. Throughout this process, the content and idea of the submission often become stale, which slows down scientific progress.1
Of course, at the same time, there are many issues with this approach of “to arXiv,” as opposed to the more traditional double-blind peer reviewing system (“not to arXiv”). Nowadays we see a flood of conference submissions on arXiv a day or two after the submission deadline of a conference, at least in the field of machine learning, or more specifically deep learning. Unfortunately, I must say that quite a few of them are low-quality submissions. Why are so many low-quality submissions being made public? After all, probably no author wants to be associated with a submission that is half-baked and incomplete.
One potential reason I see is the severe competition among researchers from all corners of the globe. Nobody wants to be scooped simply because they forgot to upload their submission to arXiv before their competitors did. Pushed by this anxiety over being scooped, authors often end up putting out a rather half-baked manuscript. Or maybe authors are simply being naive in thinking that one can always update her manuscript on arXiv with a newer version. Combined with open reviewing systems, such as that of ICLR, this means we see a surge of half-baked submissions on arXiv once or twice every year, and this has been spreading to other conferences as well as other fields.2
Why is it an issue at all? Because it wastes many people’s time. We see an interesting title popping up in our Google Scholar My Updates or in someone’s tweet, and, as researchers, we cannot ignore that submission, be it accepted at some conference or not. And, after reading the paper for 10-30 minutes, we realize that “well, I should wait a few months for the next version!” Also, the frequent lack of thorough empirical validation may lead readers to a wrong conclusion.
But, again, I’m not trying to either advocate or oppose the idea of “to arXiv” in this post.3 Instead, I’m here to share the result of an informal survey I ran right after reading Chris’s Facebook post. The goal of the survey was to see how many people follow the “to arXiv” or the “not to arXiv” paradigm, and to what degree they do so. The poll was completely anonymous and was conducted using the Facebook app <Polls for Pages>.4 It was rather informal, and the questions were slightly changed once at the beginning of the survey. Also, it’s quite heavily biased, as most of the participants are people close to me, meaning that they work on either deep learning or (statistical) natural language processing. In other words, take the result of this poll with a grain of salt.
The participants were asked first whether they upload their conference “submission” to arXiv. About two thirds of the participants answered that they do.
When I drew this pie chart, I noticed a striking resemblance to the chart showing the portion of machine learning researchers among the participants. Is it possible that all the machine learning researchers post their submissions to arXiv but no NLP researchers do? It turned out that the answer was “no.”
Among ML researchers | Among NLP researchers |
The second question was on “when” they uploaded their submissions to arXiv.5
1 There is also the issue of malicious reviewers, or, more mildly put, subconscious bias working against some submissions, but I won’t try to touch this can of worms in this post.
2 I am guilty of this myself and do not in any sense intend to blame anyone. I view this as a systemic issue rather than an issue of any individual.
3 I will perhaps make another post on this some day, but not today, not tomorrow, and not this year.
4 Which was a pretty bad idea, because it turned out that I had to pay $50 in order to see the responses from more than 50 respondents... 🙁
5 I assumed that every researcher intends, in good faith, to make their paper public once it’s published, regardless of whether they are in the “to arXiv” camp or not. Therefore, “probably not” should be understood as “probably not uploading a manuscript that was published in another medium/venue to a preprint server such as arXiv.”