Lecture note “Brief Introduction to Machine Learning without Deep Learning”

This past Spring (2017), I taught the undergrad <Intro to Machine Learning> course. This was not only my first time teaching <Intro to Machine Learning> but also my first time teaching an undergrad course (!). The course was taught a year earlier by David Sontag, who has since moved to MIT. Naturally, I thought about re-using David’s materials as they were, which you can find at http://cs.nyu.edu/~dsontag/courses/ml16/. These materials are really great, and the coverage of various topics in ML is simply amazing. I highly recommend all the materials on this web page. All the things you need to know in order to become a certified ML scientist can be found there.

I, however, felt that this great coverage might not be appropriate for an undergrad intro course, and also that I wasn’t qualified to talk about many of those topics without first spending a substantial amount of time studying them myself. So, what could/should I do? Yes, I decided to re-create the whole course with two questions in mind. First, what’s the minimal set of ML knowledge necessary for an undergrad to (1) grasp at least the high-level view of machine learning and (2) use ML in practice after they graduate? Second, what are the topics in ML that I could teach well, without having to pretend to know things I don’t know in depth? With these two questions in mind, as I did the previous year for the NLP course, I started to write a lecture note as the semester progressed. At the end of the day (or semester), I feel like I’ve taken a step in the right direction, though with much to be improved in the future.
I started with classification. The perceptron and logistic regression were introduced as examples showing the difference between traditional computer science (design an algorithm that solves a problem) and machine learning (design an algorithm that finds an algorithm for solving any given problem). I then moved on to defining the (linear) support vector machine as a way to introduce various loss functions and regularization. I gave up on teaching the kernel SVM due to time constraints, though. Logistic regression was then generalized to multi-class logistic regression with softmax.
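To make the "various loss functions" concrete, here is a minimal NumPy sketch (not taken from the lecture note) of the three losses that the perceptron, the linear SVM and logistic regression apply to the same margin y(w·x + b), plus the softmax used in the multi-class case. The function names and the label convention y in {-1, +1} are my own assumptions for illustration.

```python
import numpy as np

def margins(w, b, X, y):
    """Signed margins y * (w.x + b); X has one example per row, y is in {-1, +1}."""
    return y * (X @ w + b)

def perceptron_loss(m):
    # Zero as soon as the example is on the correct side of the boundary.
    return np.maximum(0.0, -m)

def hinge_loss(m):
    # Linear SVM: also penalizes correct predictions whose margin is below 1.
    return np.maximum(0.0, 1.0 - m)

def logistic_loss(m):
    # Logistic regression: log(1 + exp(-m)), computed in a numerically stable way.
    return np.logaddexp(0.0, -m)

def softmax(scores):
    # Multi-class generalization: turns a row of scores into class probabilities.
    z = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)
```

All three losses are functions of the same margin, so swapping one for another changes the classifier without changing the linear model underneath.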
For teaching how to deal with problems which are not linearly separable, I decided on an unorthodox approach. I started with a nearest-neighbour classifier, extended it into a radial basis function network with fixed basis vectors, and then into an adaptive basis function network, which I dubbed deep learning (which is true by the way.) At this point, I think I lost about half of the class, but the other half, I believe, was able to follow the logic, judging by their performance in the final exam. I should’ve talked about kernel methods here, but well, it’s not like I can spend the whole semester solely on classification.
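As a rough illustration of the middle step (a radial basis function network with fixed basis vectors), the sketch below uses the training points themselves as the basis vectors and fits a linear model on top of the resulting features. The XOR data, the Gaussian width `gamma`, and the least-squares fit are assumptions chosen to keep the example short, not the setup used in the course.

```python
import numpy as np

def rbf_features(X, centers, gamma=1.0):
    """phi_k(x) = exp(-gamma * ||x - c_k||^2) for every input x and basis vector c_k."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

# XOR: the classic problem that no linear classifier on the raw inputs can solve.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([-1.0, 1.0, 1.0, -1.0])

# Fixed basis vectors: simply the training points, as in a nearest-neighbour classifier.
centers = X.copy()
Phi = rbf_features(X, centers)

# A linear model on the RBF features (fit here by least squares for brevity).
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
pred = np.sign(Phi @ w)
```

An adaptive basis function network would additionally treat `centers` as parameters and learn them together with `w`.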
Then, I moved on to regression. Here I focused on introducing probabilistic ML. To do so, I had to spend 2 hours recapping probability itself. I introduced Bayesian linear regression and discussed how it corresponds to linear regression with a Gaussian prior on the weight vector. This naturally led to a discussion on how to do Bayesian supervised learning. I wanted to show them Gaussian process regression, but again, there wasn’t enough time.
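The correspondence can be checked numerically. The sketch below assumes a zero-mean isotropic Gaussian prior with precision `alpha` on the weights and Gaussian observation noise with precision `beta` (my notation, not necessarily that of the lecture note). Under these assumptions the posterior mean of the weights coincides with the ridge regression solution with regularization strength `alpha / beta`.

```python
import numpy as np

def posterior_mean(X, y, alpha, beta):
    """Posterior mean of w for prior N(0, I / alpha) and noise precision beta."""
    d = X.shape[1]
    S_inv = alpha * np.eye(d) + beta * X.T @ X  # posterior precision matrix
    return beta * np.linalg.solve(S_inv, X.T @ y)

def ridge(X, y, lam):
    """L2-regularized least squares: argmin_w ||Xw - y||^2 + lam * ||w||^2."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Synthetic data (one example per row), purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

alpha, beta = 2.0, 25.0
w_bayes = posterior_mean(X, y, alpha, beta)
w_ridge = ridge(X, y, alpha / beta)  # same vector as w_bayes, up to round-off
```

The Bayesian treatment gives more than this point estimate, namely a full posterior covariance (the inverse of `S_inv`), which is what makes predictive uncertainty available.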
For unsupervised learning, I again took an unorthodox route by putting (almost) everything under matrix factorization (X=WZ) with a reconstruction cost and varying constraints. PCA and NMF were discussed in depth under this framework, and sparse coding and ICA were briefly introduced. k-means clustering was also introduced as a variant of matrix factorization, and the hard EM algorithm was (informally) derived from minimizing a reconstruction error with the constraint that the code vectors (Z) be one-hot. This whole matrix factorization view was then extended to deep autoencoders and to (metric) multi-dimensional scaling. Surprisingly, students were much more engaged with unsupervised learning than with supervised learning, and at this point, I had regained the half of the class I lost when I was teaching them nonlinear classifiers.
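Here is a minimal sketch of k-means written as the factorization X ≈ WZ with one-hot columns in Z, alternating between the two steps of hard EM. The column-per-example layout, the initialization from random data points, and the toy two-blob data are assumptions made for this example only.

```python
import numpy as np

def kmeans_mf(X, k, n_iters=20, seed=0):
    """Minimize ||X - W Z||^2 where each column of Z is one-hot.

    X is (d, n) with one example per column, W is (d, k), Z is (k, n).
    """
    rng = np.random.default_rng(seed)
    d, n = X.shape
    W = X[:, rng.choice(n, size=k, replace=False)].copy()
    errors = []
    for _ in range(n_iters):
        # Hard E-step: with W fixed, the best one-hot code picks the nearest column of W.
        d2 = ((X[:, None, :] - W[:, :, None]) ** 2).sum(axis=0)  # (k, n)
        assign = d2.argmin(axis=0)
        Z = np.zeros((k, n))
        Z[assign, np.arange(n)] = 1.0
        # M-step: with Z fixed, the best W holds the mean of each non-empty cluster.
        counts = Z.sum(axis=1)
        nonempty = counts > 0
        W[:, nonempty] = (X @ Z.T)[:, nonempty] / counts[nonempty]
        errors.append(float(((X - W @ Z) ** 2).sum()))
    return W, Z, errors

# Two well-separated blobs in 2-D, stored column-wise.
rng = np.random.default_rng(1)
X = np.hstack([rng.normal(0.0, 0.3, size=(2, 30)),
               rng.normal(3.0, 0.3, size=(2, 30))])
W, Z, errors = kmeans_mf(X, k=2)
```

Neither step can increase the reconstruction error, so `errors` never goes up from one iteration to the next. Relaxing the one-hot constraint on Z in different ways leads to the other members of the family mentioned above.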
The course ended with the final lecture in which I briefly introduced policy gradient. This was again done in a rather unorthodox way by viewing RL as a sequence of classifiers. I’m quite sure RL researchers would cry over my atrocity here, but well, I thought this was a more intuitive way of introducing RL to a bunch of undergrad students who have highly varying backgrounds. Though, now that I think about it, it may have been better simply to play them the RL intro lecture by Joelle Pineau: http://videolectures.net/deeplearning2016_pineau_reinforcement_learning/.
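One way to read the classifier view (my own interpretation, not necessarily the derivation used in class): for a softmax policy, the policy gradient update for a sampled action is the usual cross-entropy classification update with the sampled action treated as the label, scaled by the return that followed. The sketch below shows this on a two-armed bandit with deterministic rewards; the bandit, the step size and the function names are illustrative assumptions.

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

def reinforce_grad(logits, action, ret):
    """Gradient of ret * log pi(action) with respect to the logits.

    (onehot - softmax) is the negative of the cross-entropy gradient for a
    classifier whose target label is `action`; `ret` scales that update.
    """
    onehot = np.zeros_like(logits)
    onehot[action] = 1.0
    return ret * (onehot - softmax(logits))

# Toy two-armed bandit: arm 0 always pays 0, arm 1 always pays 1.
rng = np.random.default_rng(0)
rewards = np.array([0.0, 1.0])
logits = np.zeros(2)
for _ in range(200):
    a = rng.choice(2, p=softmax(logits))                      # sample an action
    logits += 0.1 * reinforce_grad(logits, a, rewards[a])     # gradient ascent
probs = softmax(logits)
```

Actions followed by a high return are reinforced like correctly labelled examples, while actions with zero return leave the policy unchanged in this toy setting.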
Anyways, you can find a draft of my lecture note (which will forever be a draft until I retire from the university) at 
Any suggestion or PR is welcome at 
However, do not expect them to be incorporated quickly, as I’m only planning to revise it next Spring (2018).
During the course, I showed the students the following talks here and there to motivate them (and to give myself some time to breathe):
