Drug Discovery may be in the Cold War Era

this post expands on this tweet i sent out a month ago.

as scientists (yes, i identify as a scientist myself, although i can see how this can be debatable), we are often trained and encouraged to uncover mechanisms behind mysterious phenomena in the universe. depending on which of these mysterious phenomena we study, we are classified into different buckets; if you are working on biological phenomena, you’re a biologist. if you are working with languages, you’re a linguist (are linguists scientists? a good question, but i will leave it for another post in the future.) if you are working on problem solving itself, i think you’d be considered a computer scientist.

often, uncovering hidden mechanisms behind these mysterious phenomena enables us to solve problems that were seemingly unsolvable. by figuring out electromagnetic forces, we live in this wonderful world of social media that makes everyone stressed out and unhappy about themselves. by figuring out relativity, we live in this wonderful world where no one knows how to go from one place to another without Google Maps. by figuring out quantum mechanics, our computers are getting faster and faster with smaller and smaller chips that have more and more transistors. by figuring out molecular biology, we don’t need to butcher cows, horses, etc. for insulin. by figuring out thermodynamics, our streets are filled up with loud and fast cars. these are all amazing.

and, this is why it is tempting for us, scientists, to think that this is the right and perhaps only way to tackle challenging problems. first, figure out the precise mechanism by which the phenomena behind these challenging problems happen, and then, based on the uncovered mechanism, come up with solutions to these problems. surely once we know why and how problems manifest themselves, we can fix them.

this temptation however can easily lead us astray. this is because a solution to a problem almost never relies on the full mechanism by which the phenomenon behind the problem arises. rather, practical solutions often rely on partial knowledge of the underlying mechanism combined with approximation and simulation. a representative example is an aircraft. it is extremely important to accurately model how an aircraft takes off, flies at high altitude, maneuvers in the air and lands safely, but modeling any one of these aspects accurately is often impossible due to many factors, such as computational intractability and unobserved variables. instead, aircraft manufacturers rely on approximate solutions to a set of simplified governing equations with quite a few free parameters. these free parameters are estimated from an extensive set of computational simulations, existing data points and experiments run in wind tunnels. despite a few disastrous incidents, this solution, found with partial knowledge of the phenomenon of flying mixed with simulations, data and experiments, works remarkably well. if we had insisted on full knowledge of aerodynamics and every other aspect that matters for the phenomenon of flying before building a solution, that is, an aircraft, we would not have the extensive network of commercial air routes that many of us enjoy (and despise).
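to make this concrete, here is a minimal sketch of what i mean by fitting the free parameters of a simplified governing equation to wind-tunnel data. the linear lift model and every number below are made up purely for illustration; real aerodynamic models are far more involved.

```python
# a toy, made-up example: fit the free parameters of a simplified lift model,
#   C_L(alpha) ~ c0 + c1 * alpha,
# to hypothetical wind-tunnel measurements by least squares.
import numpy as np

# hypothetical measurements: angle of attack (radians) and measured lift coefficient
alpha = np.array([0.00, 0.05, 0.10, 0.15, 0.20])
c_lift = np.array([0.21, 0.52, 0.80, 1.11, 1.38])

# design matrix for the simplified (linear-in-parameters) model
A = np.stack([np.ones_like(alpha), alpha], axis=1)
(c0, c1), *_ = np.linalg.lstsq(A, c_lift, rcond=None)

# the fitted surrogate can now be queried at conditions we never measured
print(f"fitted model: C_L(alpha) = {c0:.2f} + {c1:.2f} * alpha")
print(f"predicted C_L at alpha = 0.12 rad: {c0 + c1 * 0.12:.2f}")
```

the point is that the surrogate, not the full physics, is what actually gets used to make decisions.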

another example, one that is closer to my heart, is machine translation. as we are all too well aware, machine translation has deep roots in the world wars and the cold war, when the demand and desire to automatically and rapidly translate intercepted enemy documents into our own languages peaked. of course, in a more modern context, machine translation is one of the only tools we have to lower the so-called digital divide and digital barrier created by the imbalance in the availability of content in different languages on the internet and beyond.

when machine translation was first introduced and formalized as a research & development problem, it was tackled as a problem that would be resolved once we knew more about how languages work. that is, if we knew how we understand and produce languages, we would be able to build a machine translation system that closely imitates and combines these two processes of understanding and production. on the understanding side, the stack began with parsing/understanding words, combining them into phrases and sentences syntactically, assigning semantics to these syntactically combined units and ultimately distilling out the meaning (interlingua) of the input text. the process of production is the reverse of understanding; start from the interlingua, fill in the semantics, map it to a syntactic structure and read out words. of course, it is important to directly transfer some information at each level of understanding to the corresponding level of production, as we progressively lose information going up the hierarchy in the understanding stage. this overly-simplified paradigm of building a machine translation system is called the rule-based paradigm, or rule-based machine translation.
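to make the pipeline slightly more concrete, here is a deliberately toy sketch of the understanding-then-production flow; the lexicon, the “interlingua” and the rules are all made up, and a real rule-based system would involve vastly richer grammars and transfer rules.

```python
# a deliberately toy sketch of the rule-based pipeline: parse words, distill a crude
# "interlingua", then generate in the target language by reversing the process.
# the lexicon, semantic frame and rules below are made up for illustration only.

LEXICON_FR = {"le": None, "chat": "CAT", "chien": "DOG", "dort": "SLEEP"}  # french word -> concept
REALIZE_EN = {"CAT": "the cat", "DOG": "the dog", "SLEEP": "sleeps"}       # concept -> english

def understand(sentence: str) -> dict:
    """understanding: words -> syntax -> (very crude) interlingua."""
    concepts = [LEXICON_FR[w] for w in sentence.lower().split()]
    concepts = [c for c in concepts if c is not None]        # drop function words
    return {"agent": concepts[0], "predicate": concepts[1]}  # toy semantic frame

def produce(interlingua: dict) -> str:
    """production: interlingua -> syntax -> words, the reverse of understanding."""
    return f"{REALIZE_EN[interlingua['agent']]} {REALIZE_EN[interlingua['predicate']]}."

print(produce(understand("le chat dort")))   # -> "the cat sleeps."
print(produce(understand("le chien dort")))  # -> "the dog sleeps."
```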

i’m quite certain that this is a reasonable picture of how i consciously translate text written in korean into english, but that it is not the actual mechanism by which i understand and write/speak any language. and, due to this imperfection (despite years, decades and centuries of ongoing efforts in linguistics) rule-based machine translation never saw widespread adoption beyond a set of niche, narrow-domain applications. but, as some of you reading this post in translation already know, machine translation works pretty well these days for many languages. how is this so?

the pivotal moment in machine translation (and similarly in speech recognition) came when a small group of people, including Brown et al. at IBM in the late 80’s, decided to revive the old idea, due to Shannon and Weaver in the 40’s and 50’s, of viewing language as an information-coding mechanism. in this view, a text snippet written in one language is nothing but a corrupted version of the text snippet written in another (original) language. the goal is then to denoise the given text snippet by first learning the pattern of noise and second learning the target (original) language itself. this can be done purely statistically, with (1) supervised learning to capture the pattern of distortion and (2) unsupervised learning to capture the overall distribution of the target language. we call this a noisy channel model, and all we need in this paradigm is a large amount of paired and unpaired data.
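concretely, the whole paradigm boils down to bayes’ rule; we pick the target-side text that best explains the observed source-side text while being plausible target-language text on its own:

```latex
% the noisy channel decomposition (bayes' rule, dropping the constant p(x)):
% x = observed source-language text, y = hypothesized target-language original
\hat{y} = \arg\max_y \, p(y \mid x)
        = \arg\max_y \, \underbrace{p(x \mid y)}_{\text{translation model (paired data)}}
          \; \underbrace{p(y)}_{\text{language model (unpaired data)}}
```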

this purely statistical paradigm is precisely what drives modern machine translation as well as large-scale language models. relative to the paradigm behind rule-based machine translation above, this paradigm has almost nothing that resembles human language understanding and production. rather, in this paradigm, the problem of machine translation (or, more generally, language understanding and production) is solved by extracting all possible correlations that exist in a large corpus and selectively using a subset of these correlations at test time to understand and produce text. this is probably not how we use language in our brains, but this is how we could and should solve the problem of machine translation and all adjacent problems in natural language processing, such as question answering, chit-chat, etc.

this statistical paradigm has its own issues, such as its susceptibility to spurious correlations, its fragility to shifts in the underlying distributions and its thirst for large amounts of data. it is nevertheless how we have made a significant jump in building quality, production-grade machine translation systems. after all, my mum will use machine translation to read this post, and i’m almost 100% certain she will understand my main message more or less perfectly.

i’m increasingly convinced that we may be stuck in the first paradigm, driven largely by science (or more precisely curiosity), when it comes to drug discovery, when we should move on to the next paradigm, where drug discovery is largely driven by high-throughput experiments, heterogeneous data, high-dimensional statistics and engineering. many of us have this deep desire to figure out how biology and physiology work, in order to understand in the most detailed manner how symptoms arise (after all, diseases are often nothing but the names we assign to clusters of symptoms) and to come up with a treatment that disrupts the biological processes leading to these symptoms, based on our understanding of physiology, biology, chemistry and physics. this reminds me so much of the early attempts at machine translation; we figure out how we understand and produce language, build a machinery that imitates this process closely, and eventually this machinery would begin to translate text between two languages extremely well. the fallacy here is that figuring out the minute details of the true mechanism by which symptoms arise may satisfy our appetite for scientific understanding but may be a more challenging problem than coming up with effective therapeutics. that is, we may be stuck with drug discovery because we are not solving drug discovery but are trying to solve a more difficult (albeit potentially more interesting and intellectually stimulating) problem of figuring out biology.

what would this new paradigm of drug discovery look like? it would be much more end-to-end than the current practice of drug discovery, which is effectively an exponentially shrinking funnel with many stages of decision making driven by our knowledge of biology and physiology, our gut feelings based on experience and intuition, and financial constraints. in the new paradigm, to which i will refer as the end-to-end paradigm, the goal is to detect and capture all possible correlations that exist between all symptoms of interest, demographic information about all people (both patients and non-patients), non-drug medical interventions (and their descriptions), drug-based medical interventions (and the drugs’ chemical and biological properties), all knowledge about biology, chemistry, physics and medicine (from scientific and clinical articles as well as non-academic articles), and more. some of these correlations may arise from random interventions, such as those in clinical trials, or from natural experiments (for instance, some states lose medicaid support for their residents overnight, some drugs are temporarily unavailable in some parts of the world due to logistical issues, etc.) these interventional data would enable us to use advanced statistical techniques to identify the subset of correlations that arise from causal mechanisms. once these correlations, both causal and spurious, are captured, we can estimate, in silico, the potential effect of new therapeutic options for a new cluster of symptoms in a previously unconsidered group of patients. if there is already abundant evidence supporting this estimated effect, we may go directly to clinical trials or even skip some parts of clinical trials (especially if the estimated effect was deemed causal.) if not, we may be able to devise a minimal set of experiments (in vitro, in vivo and/or clinical trials) to fill this gap.
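as a cartoonishly small sketch of what “capture the correlations, then query them in silico” might look like, consider the following; every feature, record and number is made up, and a real system would pool far richer data (molecular properties, literature, imaging, longitudinal records, etc.):

```python
# a crude, hypothetical sketch of the "capture all correlations, then query in silico" idea.
# every feature, label and number below is made up for illustration only.
import numpy as np
from sklearn.linear_model import LogisticRegression

# toy observational/interventional records:
# columns = [symptom_cluster_A, symptom_cluster_B, age_over_60, east_asian, drug_X, drug_Y]
X = np.array([
    [1, 0, 1, 0, 1, 0],
    [1, 0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0, 1],
    [0, 1, 0, 1, 0, 1],
    [1, 0, 1, 1, 0, 1],
    [0, 1, 1, 1, 1, 0],
])
y = np.array([1, 1, 1, 0, 0, 0])  # 1 = symptoms resolved, 0 = not (made up)

model = LogisticRegression().fit(X, y)

# "in silico experiment": a previously unconsidered combination of
# symptom cluster, subpopulation and therapeutic option.
new_case = np.array([[1, 0, 0, 1, 0, 1]])
print("estimated probability of benefit:", model.predict_proba(new_case)[0, 1])
```

the point is not this particular model (a logistic regression over made-up one-hot features is obviously a caricature) but that the query happens in silico, against everything already observed, before any new experiment is commissioned.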

in fact, such a large collection of (both causal and spurious) correlations would even guide us in determining which therapeutic options (new and old) should be considered for which sets of symptoms (again, new and old) and which sub-populations, in order to maximize our future ability to come up with better therapeutic options for more severe and critical conditions. in this end-to-end drug discovery paradigm, it is not about coming up with one new drug but about maximizing information, so that over time we can come up with an infinite series of drugs for an infinite variety of symptoms and an infinitely diverse population with a very high success rate. a new data point in oncology will improve our chance of success in creating new drugs for all the other therapeutic areas; neurodegenerative diseases, diabetes, autoimmune disorders, etc., you name it. a new data point for one subpopulation (say, middle-aged east asian males like myself) will help us come up with new drugs for all the other subpopulations. this will not only change drug discovery but healthcare overall.
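and a correspondingly crude sketch of “choose the next experiment to maximize information”: train a small bootstrap ensemble and propose the candidate (symptom, subpopulation, therapy) combination the ensemble disagrees about the most, a stand-in for expected information gain. again, everything here is made up for illustration.

```python
# a hypothetical sketch of experiment selection: propose the candidate on which a
# bootstrap ensemble disagrees the most, as a crude proxy for information gain.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(30, 6))          # toy past records (same features as above)
y = rng.integers(0, 2, size=30)               # toy outcomes

candidates = rng.integers(0, 2, size=(5, 6))  # candidate experiments we could run next

# bootstrap ensemble as an uncertainty proxy
preds = []
for _ in range(20):
    idx = rng.integers(0, len(X), size=len(X))
    m = LogisticRegression().fit(X[idx], y[idx])
    preds.append(m.predict_proba(candidates)[:, 1])
disagreement = np.std(np.stack(preds), axis=0)

print("run this experiment next:", candidates[np.argmax(disagreement)])
```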

one potential downside of this end-to-end paradigm is that we may not learn as much about biology and human physiology as we would have had we stuck to the old (or current, to be strict) paradigm of drug discovery. this isn’t because we do not want to learn about biology, but because drug discovery can progress without necessarily waiting for biology and medicine to make progress, just like we could build machine translation systems, and subsequently the amazing language models we all use these days, without waiting for linguists to perfectly figure out how we understand and produce language.

that said, i do not want to give you the impression that the current paradigm of drug discovery is completely stuck due to our desire to scientifically understand physics, chemistry, biology and physiology. in fact, many stages of drug discovery are based on statistics and experimental feedback, without necessarily knowing how things actually work. after all, we still don’t know how tylenol works but are perfectly fine with buying tylenol off the shelf and taking it without consulting doctors or pharmacists (okay, it’s always a good idea to consult medical professionals before taking any drug.) we are however spending a lot of effort on advancing our understanding of biology and physiology under the name of drug discovery, and i want to question the underlying assumption behind this practice.

in the original tweet (shown at the top of this post) i was being a bit more specific: “AI for drug discovery”. why was i so? it is because i feel like we are using this amazing new technology to solve and/or speed up individual problems that exist largely only under the current paradigm. this feels a lot like how neural nets were initially used to improve sub-components of a larger phrase-based machine translation system 15 or so years ago; we were replacing a phrase table with a tiny sequence-to-sequence model (that was my first foray into machine translation), we were replacing an n-gram language model with a tiny feedforward neural language model (that was holger schwenk’s rise to fame long ago), etc. but eventually (and we knew it from the beginning), we had to jump off of this older paradigm of phrase-based translation (or n-gram translation) and onto the next paradigm of neural machine translation (or, as i should have coined it instead, end-to-end machine translation.) i strongly suspect we are ready to do so for AI for drug discovery as well.
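for those curious, a tiny feedforward n-gram neural language model, the kind of component that got dropped into a larger phrase-based system, looks roughly like the sketch below; the vocabulary, data and hyperparameters are toy placeholders, not the actual models from back then.

```python
# a minimal sketch of a feedforward n-gram neural language model: embed the previous
# N words, pass them through a small feedforward net, and score the next word.
import torch
import torch.nn as nn

vocab = ["<s>", "the", "cat", "sleeps", "</s>"]
word2id = {w: i for i, w in enumerate(vocab)}
V, D, H, N = len(vocab), 16, 32, 2  # vocab size, embedding dim, hidden dim, context length

class FeedforwardLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(V, D)
        self.ff = nn.Sequential(nn.Linear(N * D, H), nn.Tanh(), nn.Linear(H, V))

    def forward(self, ctx):                        # ctx: (batch, N) ids of previous N words
        return self.ff(self.emb(ctx).flatten(1))   # unnormalized next-word scores

# toy training pairs: ("<s>", "the") -> "cat", ("the", "cat") -> "sleeps"
ctx = torch.tensor([[word2id["<s>"], word2id["the"]], [word2id["the"], word2id["cat"]]])
nxt = torch.tensor([word2id["cat"], word2id["sleeps"]])

model = FeedforwardLM()
opt = torch.optim.SGD(model.parameters(), lr=0.1)
for _ in range(200):
    opt.zero_grad()
    loss = nn.functional.cross_entropy(model(ctx), nxt)
    loss.backward()
    opt.step()

print("p(sleeps | the cat) =", torch.softmax(model(ctx), dim=-1)[1, word2id["sleeps"]].item())
```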

science is fascinating because we uncover nature’s mechanisms behind mysterious and awesome phenomena in the universe. science has rightfully helped us uncover some of these hidden mechanisms and gifted us with amazing technological advances. this is however not the only way forward, and sometimes uncovering the underlying mechanism itself can be even more challenging than solving the problems arising from such deeply-hidden mechanisms. that is, if you spoke Korean, 배꼽이 배보다 큰 상황이다 (the belly button is bigger than the belly; the side task has grown bigger than the main one.) machine translation was one such example that i was fortunate enough to see for myself, and drug discovery may be in the same boat as machine translation was at one point (that is, during the cold war and a couple of decades after the berlin wall went down.)

in order for drug discovery to make a significant jump with assistance from this amazing technology called artificial intelligence (AI), we must start to think end-to-end and steer away from obsessing over every minute detail of physics, chemistry, biology and physiology. such obsession will continue in parallel for the purpose of science, but should not be a prerequisite for making progress in drug discovery. rather, we must look at the bigger picture, capture all possible correlations, both causal and spurious, between every entity involved in healthcare overall, and let AI help us come up with new therapeutic options and propose sets of experiments to run, in order to maximize the information gain that can be shared across different symptoms, different therapeutic modalities, different population groups and beyond. that is, end-to-end drug discovery must and will simultaneously tackle every disease for everyone with every possible approach.

anyhow … it was Dr. Yuanqing Wang, who is both brilliant and thoughtful (and on the job market!), that prompted me to write this lengthy blog post (though, as always with my blog posts, it is a casual, hastily-written and naive one) explaining why i posted the original tweet “ai for drug discovery looks a lot like machine translation research/development during the cold war.” thanks, yuanqing!
