My opening statement at the ICML 2022 Debate

i was honoured to participate in the ICML Debate 2022 on the topic of “Progress towards achieving AI will be mostly driven by engineering not science”. the debate was in the British Parliamentary style, which i was not familiar with at all but found interesting. i was assigned to the opposition party and was designated as the “leader”, which meant i had to open the debate from the opposition side following the opening from the proposition.

the proposition party consisted of Sella Nevo, Maya R. Gupta and François Charton. Been Kim was unfortunately unable to participate, although she would’ve been a great addition to the proposition party. the proposition party argued that progress towards achieving AI will be mostly driven by engineering not science.

the opposition party (i guess … my party) consisted of Ida Momennejad, Pulkit Agrawal, Sujoy Ganguly and yours truly. the opposition party (perhaps obviously) opposed the proposition’s stance and argued that progress towards achieving AI will be mostly driven by science not engineering.

if you’re registered at ICML 2022, you can watch the recording of the debate at https://icml.cc/virtual/2022/social/20780. i don’t know whether it will be released publicly once the conference is over, but i will update this post if and when that happens.

the debate was fun and was full of interesting and thought-provoking ideas and points. i won’t try to summarize those points here, as that would require a huge amount of effort, and i shouldn’t have had that much beer over the past 4 days …

instead, i’ll share my opening statement here. a distinct advantage i had as the opposition leader was that i could prepare my statement in advance, and now i can share it here. my main goal was to leave enough room for the other members of my party to delve deeper into their own views and expertise, and also to expand on various aspects in response to the proposition’s follow-up arguments.

here you go!

Opposition opening statement by Kyunghyun Cho

The opposition believes that progress toward achieving AI will be mostly driven by science not engineering.

Recent progress in large-scale models, such as language models and language-conditional image generation models, easily gives the impression that these impressive results are largely the product of equally impressive engineering that has allowed us to scale up our systems effectively and efficiently. This impression is not what we oppose here.

Such impressive progress, however, has begun to give the incorrect impression that this stellar level of engineering is what drives (if not the only way to drive) progress in AI research toward building a truly intelligent system. This impression is what we oppose here.

Instead of arguing here that engineering alone would not be enough for future progress toward achieving AI, I’d like to focus on more concrete examples of how engineering alone has not been enough to arrive even at the current state of AI, which I believe most of us agree is not at all close to the ultimate goal of truly intelligent machines.

As the first and perhaps most salient example today, I would like to talk about these super-impressive large-scale language models, represented by GPT-3 and many even more impressive follow-up models, such as PaLM and BLOOM. Despite their differences, there are a few core concepts shared by all these models that are critical to their existence.

First, they all rely heavily on the concept of maximum likelihood combined with autoregressive modeling. These two concepts together amount to building a classifier that predicts the next token given all the preceding tokens (words in many cases, but the details do not matter much). Doing so corresponds to estimating an upper bound on the true entropy of the distribution underlying the gigantic amount of text we use.
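To make this concrete, here is a minimal sketch of maximum-likelihood autoregressive training. The tiny GRU model, the vocabulary size, and the random token data below are my own illustrative choices rather than any actual large-scale system; the point is only that the training loss is precisely the cross-entropy that upper-bounds the true entropy.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyARModel(nn.Module):
    """A toy autoregressive model: embed tokens, run a GRU, score the next token."""
    def __init__(self, vocab_size=100, dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, tokens):
        hidden, _ = self.rnn(self.embed(tokens))
        return self.out(hidden)  # logits for the next token at every position

model = TinyARModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

tokens = torch.randint(0, 100, (8, 65))          # a batch of toy token sequences
inputs, targets = tokens[:, :-1], tokens[:, 1:]  # predict token t from tokens < t

logits = model(inputs)
loss = F.cross_entropy(logits.reshape(-1, 100), targets.reshape(-1))
loss.backward()
optimizer.step()

# `loss` is the empirical cross-entropy H(p, q) in nats per token. Because
# H(p, q) = H(p) + KL(p || q) >= H(p), minimizing it tightens an upper bound
# on the true entropy of the distribution that generated the text.
```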

By building a machine that predicts the next word correctly, taking into account both short- and long-term dependencies (despite what many critics say), we approximate the text/language distribution very well and can sample, or generate, extremely well-formed text and images from these distributions.
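Sampling from such a model is equally simple in principle. Below is a minimal ancestral-sampling loop, again with toy, made-up components (an untrained model and an arbitrary start token): we repeatedly sample the next token from the model and condition on whatever was sampled.

```python
import torch
import torch.nn as nn

vocab_size, dim = 100, 32
embed = nn.Embedding(vocab_size, dim)
rnn = nn.GRU(dim, dim, batch_first=True)
readout = nn.Linear(dim, vocab_size)

tokens = torch.zeros(1, 1, dtype=torch.long)  # an arbitrary beginning-of-text token
for _ in range(20):
    hidden, _ = rnn(embed(tokens))
    probs = torch.softmax(readout(hidden[:, -1]), dim=-1)  # q(next | all preceding)
    nxt = torch.multinomial(probs, num_samples=1)          # draw one token from q
    tokens = torch.cat([tokens, nxt], dim=1)               # condition on it next step
print(tokens.squeeze().tolist())
```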

Where did this idea come from? Has it benefited from superb engineering? Yes, superb engineering, in both software and hardware, has dramatically pushed the boundary of this technique, but the birth and full formalization of next-word prediction can be traced back all the way to Claude Shannon’s paper from 1951.

This same idea was revived and pushed dramatically from the late ’80s, when folks at IBM, including Peter Brown and Bob Mercer, built the first statistical machine translation system, in which a large-scale (yes, it was already large then!) target-side language model was a critical component.

The very same idea was revived, or rejuvenated, multiple times even after that: in the late ’90s with Yoshua Bengio’s neural language models, around 2010 with Alex Graves’s and Tomas Mikolov’s recurrent language models, and now with attention-based models.

Better engineering, in terms of better software and better hardware, has indeed pushed the boundary of what we can do with next-word prediction, but the seed of what we see now was already planted by “science” in the ’50s.

Second, I’d like to talk about all the “techniques” or “tricks” that facilitate learning. Although it may look like faster hardware and better software frameworks are the main drivers of recent advances in large-scale language models, it is highly questionable whether we could have trained any reasonable model had we not found a series of techniques that enable us to do so.

For instance, non-saturating nonlinearities, such as rectified linear units (ReLUs), are the workhorses of modern neural networks, including large-scale language models. It is only natural to use a ReLU or one of its variants now, but it wasn’t so until around 2010, when two papers, one from U. Toronto and the other from U. Montreal, demonstrated the potential effectiveness of the ReLU from two different perspectives. The first of these, by Nair & Hinton, derived the ReLU for restricted Boltzmann machines by viewing it as an approximation to having infinitely many replicated binary hidden units that share a weight vector but differ in their biases.
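Here is a quick numerical check of that view (my own toy illustration, not code from either paper): the expected total activation of many replicated binary units sharing an input but with biases shifted by 0.5, 1.5, 2.5, … is a sum of shifted sigmoids, which closely tracks the softplus log(1 + e^x), itself a smooth version of the ReLU max(0, x).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5.0, 5.0, 11)

# Sum of replicated binary units with shared input x and biases -0.5, -1.5, ...
# (truncated at 1000 copies to stand in for the infinite sum).
replicated = sum(sigmoid(x - i + 0.5) for i in range(1, 1001))
softplus = np.log1p(np.exp(x))   # log(1 + e^x)
relu = np.maximum(0.0, x)

print(np.abs(replicated - softplus).max())  # small: the sum tracks softplus
print(np.abs(softplus - relu).max())        # largest near 0, where softplus smooths ReLU
```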

Furthermore, ReLU-like nonlinearities had been studied extensively in (computational) neuroscience, which over many decades inspired researchers to consider them in the context of artificial neural networks.

Would engineering alone have allowed us to jump from the much more widely used sigmoid nonlinearities to the ReLU? With exhaustive hyperparameter tuning and an excessive amount of resources, engineering might eventually have found a very particular parameter initialization and a very particular optimization setup that make sigmoid nonlinearities work, but it is unclear whether that would have happened at all, because the community might already have given up on investing further in this direction.

Of course, the last example I want to bring up, reflecting a bit of my personal preference, is shortcut connections. Shortcut connections, which include residual connections as well as the gated connections in LSTMs and GRUs, are what we, the research community, spent decades coming up with in order to address the issue of vanishing gradients, or long-range credit assignment. It started with mathematical analysis by Sepp Hochreiter and Yoshua Bengio in the early ’90s, followed by further empirical analysis by many people and by a series of proposals, such as leaky units, some of which were successful and others less so.

Eventually, the shortcut connection was identified as the way to propagate gradients properly across many nonlinear layers of both recurrent and feedforward networks, as is evident from the near-universal presence of residual blocks or connections in modern neural networks, including the large-scale language models built as transformers.
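To see the effect in miniature, here is a toy comparison (with made-up layer sizes, not any specific architecture): the identity shortcut in x + f(x) gives the backward pass a path that no nonlinearity can squash, so the gradient reaching the input does not vanish even through many layers.

```python
import torch
import torch.nn as nn

def plain_block(dim):
    # No shortcut: the gradient must pass through every layer's nonlinearity.
    return nn.Sequential(nn.Linear(dim, dim), nn.ReLU())

class ResidualBlock(nn.Module):
    # Identity shortcut around the same computation: output = x + f(x).
    def __init__(self, dim):
        super().__init__()
        self.body = plain_block(dim)

    def forward(self, x):
        return x + self.body(x)  # the "+ x" gives gradients an unimpeded path

def input_grad_norm(net, dim=64):
    x = torch.randn(1, dim, requires_grad=True)
    net(x).sum().backward()
    return x.grad.norm().item()

dim, depth = 64, 50
plain = nn.Sequential(*[plain_block(dim) for _ in range(depth)])
residual = nn.Sequential(*[ResidualBlock(dim) for _ in range(depth)])

print(input_grad_norm(plain))     # typically vanishingly small at this depth
print(input_grad_norm(residual))  # does not vanish: the identity path carries it
```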

However small they may seem, we could get to this point only because of all these science-driven (or perhaps mathematics-driven) innovations. Put more precisely, it was science that put us on this path, so that engineering could push us forward along it.

It may not look like this will happen anytime soon, but I can assure you that very soon the bandwagon driven by engineering along this path laid out by science will find itself at the next crossroads. Engineering won’t tell us which road to take next; it will be science that tells us which path we can and should take in order to move us closer to AI.
