so, apparently, emergence has become a hot topic on twitter while i was away in Kigali: attending ICLR, riding moto-taxis, injuring myself and breaking my phone while running and tracking, seeing a majestic group of mountain gorillas and being back at AIMS Rwanda after 4 years.
i do not want to discuss any particular paper/tweet/blog, because this topic seems to attract a weird set of people arguing for weird things, when in fact there are just a couple of different views into a single phenomenon, which is only natural in science and engineering. that said, if anyone’s interested in this recent (non-)controversy, these two papers seem to be the ones to take a look at: Wei et al. [2022 TMLR] and Schaeffer et al. [2023 arXiv].
in this blog post, let me instead define emergence in my own words, so that i can point anyone to this post when i end up talking about emergence with them. as the first step, here are three variables we must keep in mind:
- $x \in \mathbb{R}$: the quantity that we vary ourselves to study emergence. some examples are # of parameters given a particular parametrization scheme, # of data points sampled from a particular distribution, etc. these are all discrete quantities, but we can imagine these as points sampled from the real line.
- $z \in \mathcal{Z}$: the quantity that we can’t/don’t control or sometimes don’t even observe while varying $x$. some examples include the random seed used for training and a bit flip caused by a cosmic ray. we often want to marginalize this out.
- because we often can’t control nor observe $z$, we assume $z$ follows a distribution $p_Z$.
- $y \in \mathbb{R}$: the quantity that we observe given $x$ and $z$. some examples are accuracy (average 0-1 loss), average negative log-probability (a tight upper bound on the average 0-1 loss), etc.
with these variables, i can think of the very first definition of emergence:
Definition 1 [Weak subjective emergence of $y$]
Given $y = \mathbb{E}_z f(x, z)$ and $\delta, \epsilon > 0$, there exists $x' \in \mathbb{R}$ such that $\left| \mathbb{E}_z \frac{\partial f}{\partial x}(x', z) \right| > \left| \mathbb{E}_z \frac{\partial f}{\partial x}(\tilde{x}, z) \right| + \delta$ for all $\tilde{x}$ with $\left| \tilde{x} - x' \right| > \epsilon$.
in words, this definition says that emergence is the existence of a point $x'$ at which the rate of change of $y$ exceeds that at any other point $\tilde{x}$ outside an $\epsilon$-neighbourhood of $x'$ by at least $\delta$. this can be further strengthened to include all higher-order derivatives instead of only the first-order derivative, but let me just stop here for now.
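to make this definition concrete, here is a minimal sketch of how one might check it numerically, assuming we only have $y$ estimated on a finite grid of $x$’s. the function name, the finite-difference approximation of the derivative and the mid-point convention are my own choices for illustration:

```python
import numpy as np

def weak_subjective_emergence(xs, ys, delta, eps):
    """check definition 1 on a grid: xs are sorted values of x and ys the
    corresponding estimates of E_z f(x, z), i.e., y already averaged over z.
    returns the candidate x' if the largest finite-difference slope exceeds
    every slope outside its eps-neighbourhood by at least delta, else None."""
    xs, ys = np.asarray(xs, dtype=float), np.asarray(ys, dtype=float)
    slopes = np.abs(np.diff(ys) / np.diff(xs))  # |dy/dx| between consecutive x's
    mids = (xs[:-1] + xs[1:]) / 2               # mid-points as candidate x' locations
    i = int(np.argmax(slopes))
    outside = np.abs(mids - mids[i]) > eps      # slopes outside the eps-neighbourhood
    if outside.any() and slopes[i] > slopes[outside].max() + delta:
        return mids[i]
    return None
```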
to measure whether this subjective emergence happens in a neural net of a particular architecture w.r.t. the number of parameters, we can follow the steps below (a code sketch follows the list):
- given the number of parameters $x$, train the neural net multiple times while varying random seeds in order to account for $z$. let the average validation accuracy be $y(x)$.
- $f$ then corresponds to training a neural net and measuring its accuracy on a held-out validation set.
- repeat this while varying the number of parameters.
- find a pair of consecutive $x$’s between which the validation accuracy changes most; call the mid-point $x’$.
- if this validation accuracy change is greater than that of any other consecutive pair in a meaningful amount $\delta$, we call it weak subjective emergence.
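putting these steps together, the protocol might look like the sketch below, where `train_and_evaluate` is a hypothetical stand-in for your actual training pipeline, and the parameter counts, the number of seeds and $\delta$ are illustrative choices:

```python
import numpy as np

def train_and_evaluate(n_params: int, seed: int) -> float:
    """hypothetical stand-in for f: train a net with n_params parameters
    under the given random seed and return its validation accuracy."""
    raise NotImplementedError  # plug in your own training pipeline here

param_counts = [10**k for k in range(4, 10)]  # values of x (illustrative)
n_seeds = 5                                   # how many z's we average over

# marginalize z by averaging the validation accuracy over random seeds
ys = [np.mean([train_and_evaluate(n, s) for s in range(n_seeds)])
      for n in param_counts]

# find the pair of consecutive x's with the largest change in accuracy
jumps = np.abs(np.diff(ys))
i = int(np.argmax(jumps))
x_prime = (param_counts[i] + param_counts[i + 1]) / 2

delta = 0.05  # the "meaningful amount", a subjective choice (see below)
if jumps[i] > np.delete(jumps, i).max() + delta:
    print(f"weak subjective emergence at x' ~ {x_prime}")
```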
this sounds reasonable, but it raises a lot of questions. some of those questions include:
- why is the particular choice of $f$ meaningful?
- why is the number of parameters a meaningful quantity to use? what if we use the number of bits after compressing all the parameters using e.g. gzip after each update? what makes the former more interesting than the latter?
- why is the accuracy a meaningful quantity to use? what if we use the margin loss since we care about the quality of decision boundary beyond mere accuracy? what makes the former more interesting than the latter?
- why is the particular resolution of $x$ and $y$ meaningful?
- how do we decide on the meaningful amount $\delta$?
- how do we decide on the neighbourhood size $\epsilon$?
there are a few more questions i had, such as whether marginalizing over $z$ is preferable to taking the max or min over $z$, but they seem rather minor compared to the questions above. though, i must emphasize that we have to take $z$ into account one way or another; it feels very weird to look at only one particular configuration of $z$.
these questions naturally answer why i called this particular notion of emergence subjective; it is subjective because we leave the answers to these critical questions to the one who declares emergence of a property. in other words, one can use their own subjective choices of $f$, $\delta$ and $\epsilon$. furthermore, this emergence is weak in that one merely needs to find a single choice of $f$, $\delta$ and $\epsilon$ under which emergence happens.
can we then define a stronger version of subjective emergence? i believe we can, but this requires us to introduce a few more concepts (a code sketch follows the list):
- $T_x: \mathbb{R} \to \mathbb{R} \in \mathcal{T}_x$: this is a transformation that can be applied to $x$ to change e.g. its scale, magnitude, etc.
- one example of $\mathcal{T}_x$ is the set of all monotonic transformations of $x$, although we can imagine many other types of transformations.
- in the case of neural net training, another example is to simply enumerate all the things that change as the number of updates (or the number of parameters) changes. for instance, $T_x$ may map the number of updates to the $L_2$-norm of the parameters.
- $T_y: \mathbb{R} \to \mathbb{R} \in \mathcal{T}_y$: this is a transformation that can be applied to $y$ to change e.g. its scale, magnitude, etc.
- for instance, $T_y$ can map the average accuracy to the logit of the true class.
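as a concrete illustration, here is a sketch of what small, finite versions of $\mathcal{T}_x$ and $\mathcal{T}_y$ might look like in code; the particular implementations (and the commented-out lookup of the $L_2$-norm) are my own guesses at how one might realize the transformations mentioned above:

```python
import numpy as np

# a small, finite stand-in for T_x: ways of re-expressing the number of updates
Ts_x = [
    lambda x: x,          # the raw number of updates
    lambda x: np.log(x),  # a monotonic rescaling
    # lambda x: l2_norm_at_update(x),  # hypothetical: ||theta||_2 logged at update x
]

# a small, finite stand-in for T_y: ways of re-expressing the average accuracy
Ts_y = [
    lambda y: y,                    # the raw average accuracy
    lambda y: np.log(y),            # log-accuracy
    lambda y: np.log(y / (1 - y)),  # logit rescaling (a rough proxy for the
                                    # logit of the true class, which would
                                    # need per-example logits)
]
```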
we can now define a stronger version of subjective emergence:
Definition 2. [Strong subjective emergence of $y$]
For all $T_x \in \mathcal{T}_x$ and $T_y \in \mathcal{T}_y$, let $T_y(y) = \mathbb{E}_z f(T_x(x), z)$. Then, given $\delta_{T_x,T_y}, \epsilon_{T_x,T_y} > 0$, there exists $T_x(x') \in \mathbb{R}$ such that $\left| \mathbb{E}_z \frac{\partial f}{\partial T_x(x)}(T_x(x'), z) \right| > \left| \mathbb{E}_z \frac{\partial f}{\partial T_x(x)}(T_x(\tilde{x}), z) \right| + \delta_{T_x,T_y}$ for all $\tilde{x}$ with $\left| T_x(\tilde{x}) - T_x(x') \right| > \epsilon_{T_x,T_y}$.
this is essentially identical to weak subjective emergence except that we now impose that emergence should hold over a set of possible transformations made to $x$ and $y$. that is, we cannot simply choose one particular choice of $x$ and $y$, observe emergence and declare that emergence happened. rather, we need to show that such emergence happens even if we transform $x$ and $y$ in many reasonable ways.
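to see what this buys us operationally, here is a sketch of the strong check over finite transformation sets, reusing `weak_subjective_emergence` from the first sketch and the `Ts_x`/`Ts_y` lists above; passing the per-combination tolerances as dictionaries is my own choice of interface:

```python
import numpy as np

def strong_subjective_emergence(xs, ys, Ts_x, Ts_y, deltas, epsilons):
    """check definition 2 over finite transformation sets. deltas[(i, j)]
    and epsilons[(i, j)] hold the tolerance and the neighbourhood radius
    for the combination of the i-th T_x and the j-th T_y."""
    for i, T_x in enumerate(Ts_x):
        for j, T_y in enumerate(Ts_y):
            txs = np.array([T_x(x) for x in xs])
            tys = np.array([T_y(y) for y in ys])
            order = np.argsort(txs)  # keep x sorted after the transformation
            if weak_subjective_emergence(txs[order], tys[order],
                                         deltas[(i, j)], epsilons[(i, j)]) is None:
                return False  # the jump does not survive this combination
    return True  # emergence holds under every transformation considered
```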
these two definitions collapse onto each other when $|\mathcal{T}_x|=1$ and $|\mathcal{T}_y|=1$; that is, if we only consider one particular combination of $x$ and $y$ without considering any other possible transformations of them.
this definition of emergence is still subjective, since it relies on the subjective choice of $\mathcal{T}_x$, $\mathcal{T}_y$, $\delta$ (for each combination $(T_x,T_y)$) and $\epsilon$ (again for each $(T_x,T_y)$). one may even say it is more subjective, as we need to decide on more things here: the transformations of $x$ and $y$, as well as the tolerance and the neighbourhood radius for each combination of transformations. nevertheless, because the notion of emergence must now hold over a larger set of ways to define $x$ and $y$, i find emergence observed according to this definition stronger and much more interesting.
so, we want these transformation sets to be neither too narrow, in which case the two definitions collapse onto each other, nor too broad, in which case we would never observe strong emergence. what would be some possible transformation sets that fall in the middle (since the answer is almost always somewhere in the middle)?
in my view, a good choice of the transformation set (for either $x$ or $y$) is the set of all (noisy) monotonic transformations. for instance, if we take $x$ to be the number of updates in neural net training, we should also consider the $L_2$-norm of the parameters, as it grows (almost) monotonically w.r.t. the number of updates. if the claimed weak emergence over the number of updates disappears when we transform it into the $L_2$-norm of the parameters, we cannot claim strong emergence. in the case of $y$, an interesting transformation is the repeated application of $\log$. how many $\log$-transformations of $y$ does the claimed emergence withstand? this would give us a sense of the strength of the observed emergence.
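as a sketch of this last test, one could repeatedly apply $\log$ to $y$ and count how many applications the jump survives, again reusing `weak_subjective_emergence` from the first sketch. the shift that keeps the values positive, and holding $\delta$ and $\epsilon$ fixed across depths rather than re-choosing them per transformation as definition 2 allows, are both simplifications of my own:

```python
import numpy as np

def log_depth_of_emergence(xs, ys, delta, eps, max_depth=5):
    """count how many repeated log-transformations of y the weak
    subjective emergence check survives."""
    xs, ys = np.asarray(xs, dtype=float), np.asarray(ys, dtype=float)
    depth = 0
    while depth < max_depth:
        if weak_subjective_emergence(xs, ys, delta, eps) is None:
            break
        ys = np.log(ys - ys.min() + 1e-8)  # shift so that log is defined
        depth += 1
    return depth
```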
finally, can there be objective emergence? i believe so, although such emergence would be very narrow, in that there is essentially no room for any choice or interpretation. for instance, earlier, together with Laura Graesser and Douwe Kiela, we demonstrated that a symmetric pair-wise communication protocol emerges among communicating agents only if there are at least three agents (it’s a bit obvious, though). in this case, the emergence is objective, in that there is no other transformation to choose (i.e., the number of agents is just the number of agents, and communication success is defined as 0-1 and in no other way), nor any other definition of the tolerance or the neighbourhood. in other words, objective emergence would be identical to subjective emergence except that the problem setup is so constrained that there is no room for subjective choice or interpretation, which makes it less interesting in general.
that wraps up yet another post of my random thoughts that would never make it to papers. have a nice day!
Acknowledgement:
- Thank you, Prof. Ernest Davis, for pointing out that emergence should be defined w.r.t. $y$. this comment has been reflected.
- Thanks to Daniel Paleka’s comment, i clarified in the second definition that $\delta$ and $\epsilon$ are dependent on the choice of transformations.