an interesting piece of urban legend, or folk wisdom, is that a classifier we train will work better on examples that appear more frequently in the training set than on those that are rare. that is, the existence of duplicates or near-duplicates in the training set affects the decision boundary learned by a classifier.
for instance, imagine training a face detector for your phone's camera in order to determine which filter to apply (one optimized for portraits and the other for other types of pictures). if most of the training examples for building such a face detector were taken in bright daylight, one would often say without hesitation that this face detector would work better on pictures taken under bright daylight than on pictures taken indoors. this sounds pretty reasonable until you start thinking through some simplified scenarios. and that thinking started for me a few years ago, which eventually led me to write this blog post.
so, let's consider a very simple binary classification setup. let $D=\{(x_n, y_n)\}_{n=1}^N$ be the training set, and let $f(x)$ return the number of occurrences of $x$ within $D$, that is,
$$f(x) = \sum_{n=1}^N I(sim(x_n, x) \geq \epsilon),$$
where $sim$ is a similarity metric, $\epsilon$ is a similarity threshold, and $I$ is an indicator function. if we set $\epsilon=1$ and $sim(a,b) = I(a=b)$, $f(x)$ literally counts the number of exact duplicates of $x$ within the training set.
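as a tiny sanity check of this definition, here is a minimal numpy sketch (the helper name `count_duplicates`, the exact-match similarity and the toy data are my own, purely for illustration):

```python
import numpy as np

def count_duplicates(x, X_train, sim, eps):
    """count training examples whose similarity to x is at least eps."""
    return sum(1 for x_n in X_train if sim(x_n, x) >= eps)

# exact-match similarity sim(a, b) = I(a == b) with eps = 1,
# so f(x) literally counts exact duplicates of x in the training set.
exact = lambda a, b: float(np.array_equal(a, b))

X_train = np.array([[0.0, 1.0], [0.0, 1.0], [2.0, 3.0]])
print(count_duplicates(np.array([0.0, 1.0]), X_train, exact, eps=1.0))  # -> 2
```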
we assume that the training set is separable, which makes everything much easier to picture in our heads and to reason through.
in this simple setup, what is really interesting (to me, at least) is that the number of duplicates $f(x_n)$ of any $x_n \in D$ does not affect a separating decision boundary. as soon as one of the duplicates is correctly classified (i.e., it is on the right side of the decision boundary), all the other duplicates are equally well classified and do not affect our choice of the decision boundary.
this is most clearly demonstrated by the perceptron learning rule, which is defined as
$$w^n = \begin{cases}
w^{n-1}, &\text{if } y_n (w^{n-1} \cdot x_n) > 0 \\
w^{n-1} + y_n x_n, &\text{otherwise}.
\end{cases}$$
that is, the decision boundary defined by $w^n$ is only updated if $x_n$ is incorrectly classified, i.e., if $y_n (w^{n-1} \cdot x_n) \leq 0$. once $x_n$ is correctly classified, none of its subsequent duplicates contribute to the decision boundary.
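to make this concrete, here is a minimal numpy sketch of the rule above (the toy data and the helper name `perceptron_step` are my own): after the perceptron has learned to separate the training set, any number of duplicates of an already correctly classified example leaves the weights untouched.

```python
import numpy as np

def perceptron_step(w, x, y):
    """single perceptron update: change w only if (x, y) is misclassified."""
    if y * np.dot(w, x) <= 0:
        return w + y * x
    return w

# a tiny separable problem (labels in {-1, +1}); a bias feature is appended.
X = np.array([[1.0, 2.0, 1.0], [2.0, 1.0, 1.0], [-1.0, -2.0, 1.0], [-2.0, -1.0, 1.0]])
y = np.array([1, 1, -1, -1])

# run the perceptron until it separates the training set
w = np.zeros(3)
for _ in range(100):
    for x_n, y_n in zip(X, y):
        w = perceptron_step(w, x_n, y_n)

# feed in many duplicates of an already correctly classified example:
# none of them triggers an update, so the decision boundary stays put.
w_before = w.copy()
for _ in range(1000):
    w = perceptron_step(w, X[0], y[0])
print(np.array_equal(w_before, w))  # -> True
```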
another example is a max-margin classifier, such as a support vector machine. in this case, we can think of how the margin of a (separating) decision boundary is defined: it is the sum of the distances from the boundary to the nearest correctly-classified examples of the two classes (positive and negative), respectively. in other words, the only examples that matter for determining the optimal decision boundary are the nearest correctly-classified ones (at least two; they are called support vectors), and all the other examples that are correctly classified and far from the decision boundary (recall the separability assumption) do not contribute to it. again, it really doesn't matter whether there are many duplicate copies of any particular example, as that group of examples either contributes equally to the margin or does not contribute at all.
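here is a small sketch of the same point for the SVM, assuming scikit-learn is available (the toy data are my own, and a large C is used to approximate the hard-margin case on this separable data): duplicating an example that is far from the boundary, i.e., not a support vector, leaves the learned hyperplane essentially unchanged.

```python
import numpy as np
from sklearn.svm import SVC

# a separable toy problem; the points at (3, 3) and (-3, -3) are far from the boundary.
X = np.array([[1.0, 1.0], [3.0, 3.0], [-1.0, -1.0], [-3.0, -3.0]])
y = np.array([1, 1, -1, -1])

# large C approximates a hard-margin SVM on this separable data
svm = SVC(kernel="linear", C=1e6).fit(X, y)

# duplicate a non-support-vector example (the far-away point (3, 3)) many times
X_dup = np.vstack([X] + [X[1:2]] * 50)
y_dup = np.concatenate([y, np.ones(50, dtype=int)])
svm_dup = SVC(kernel="linear", C=1e6).fit(X_dup, y_dup)

# the separating hyperplane (w, b) is essentially unchanged
print(svm.coef_, svm.intercept_)
print(svm_dup.coef_, svm_dup.intercept_)
```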
then, does this mean that the existence of duplicates of each training example does not matter when it comes to learning a classifier? or, better put, why do we believe the existence of duplicates changes how our classifiers work?
every now and then, i stumble upon discussions on the difference between parametric and non-parametric methods. every time, i believe i have found the answer to this question in a way that is explainable to my students and colleagues, but quite rapidly my belief in that answer fades away, and i start to doubt myself as a computer scientist. the last episode was pretty recent, and you can find people's responses and insightful answers at
it turned out that this seemingly naive and dumb question connects to this issue of whether/how duplicates of training examples impact classification. what do i mean by that?
instead of the perceptron and the support vector machine above, which can be thought of as parametric approaches since their learned decision boundaries are described on their own, without referring to the training examples, let us consider one of the simplest and perhaps most powerful non-parametric classifiers, whose decision boundary is a function of the training examples and whose complexity grows as we include more training examples: the $k$-nearest neighbour classifier ($k$NN).
given a new example $x$ we want to classify using our $k$NN classifier, let $(x_n,y_n)$ be the nearest neighbour of $x$. given the number of duplicates $f(x_n)$ in the training set, we can now tell how many other neighbours are considered by this $k$NN: the answer is $\max(0, k - f(x_n))$. that is, the probability of this new example $x$ belonging to $y_n$ can be written down as:
$$p(y=y_n| x) = \frac{\min(k, f(x_n))}{k} + \frac{1}{k} \sum_{(x',y') \in \mathcal{N}_k(x)} I(x' \neq x_n) I(y' = y_n),$$
where $\mathcal{N}_k(x)$ is the set of the $k$ nearest neighbours of $x$. as $f(x_n)$ grows, the first term dominates, and the chance of classifying $x$ into $y_n$ grows as well. in other words, the more duplicates of $x_n$ we have, the higher the probability assigned to $y_n$: the region corresponding to $(x_n,y_n)$ grows as the number of its duplicates increases, which is precisely what a non-parametric classifier does.
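here is a minimal numpy sketch of this effect (the toy data and the helper name `knn_posterior` are my own): duplicating the nearest neighbour inflates the first term above and pushes the posterior from 1/3 to 1.

```python
import numpy as np

def knn_posterior(x, X_train, y_train, k, label):
    """fraction of the k nearest neighbours of x that carry `label`."""
    dists = np.linalg.norm(X_train - x, axis=1)
    neighbours = np.argsort(dists)[:k]
    return np.mean(y_train[neighbours] == label)

# one positive example surrounded by negatives
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 2.0]])
y = np.array([1, 0, 0, 0, 0])

x_new = np.array([0.1, 0.1])                      # nearest neighbour is the positive example
print(knn_posterior(x_new, X, y, k=3, label=1))   # -> 0.333...

# duplicate the positive example twice: min(k, f(x_n))/k grows to 1
X_dup = np.vstack([X, X[:1], X[:1]])
y_dup = np.concatenate([y, [1, 1]])
print(knn_posterior(x_new, X_dup, y_dup, k=3, label=1))  # -> 1.0
```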
so, what does this tell us? the impact of duplicates in the training set differs between parametric and non-parametric approaches. this holds not only in classification but also in generative modeling, since much of generative modeling can be thought of as supervised learning in disguise. if we are dealing with non-parametric methods, we probably want to take duplicates in the training set into account and deliberately decide whether to keep them as they are or to de-duplicate them; this decision will have to be made for each problem separately. if we are working with parametric methods, we probably don't need to worry about these duplicates beyond the computational cost.
how does this observation connect with the urban legend/myth about the impact of duplicates? i believe it simply tells us that the classifiers we often use in practice are effectively non-parametric, including $k$NN, neural nets and random forests. in other words, it was never really about whether duplicates matter, but about what the common practice in modern machine learning is; that is, we tend to use non-parametric classifiers.
there's nothing particularly serious or insightful here, but i enjoyed this thought experiment!