Supporting female researchers and researchers from under-represented groups, together with CIFAR

if i had to pick organizations that have impacted my current career path most, CIFAR would be very near (if not at) the top of the list. there are a few reasons behind this. first, CIFAR started a program named “Neural Computation & Adaptive Perception” (NCAP) in 2004, supporting research in artificial neural networks, which has become a dominant paradigm in machine learning as well as, more broadly, artificial intelligence and all adjacent areas, including natural language processing and computer vision. i started my graduate study in 2009 with a focus on restricted Boltzmann machines and graduated in 2014 with a

Restricted Boltzmann machines or contrastive learning?

my inbox started to overflow with emails that urgently require my attention, and my TODO list (which doesn’t exist outside my own brain) started to randomly remove entries to avoid overflowing. of course, this is the perfect time for me to think of some random stuff. this time, this random stuff is contrastive learning. my thought on this stuff was sparked by Lerrel Pinto’s message on #random in our group’s Slack responding to the question “What is wrong with contrastive learning?” thrown by Andrew Gordon Wilson. Lerrel said, “My understanding is that getting negatives for contrastive learning is difficult.”

A few QA’s from the course F’20 <Deep Learning>

i’ve just finished teaching <Deep Learning> this semester together with Yann and Alfredo. the course was in a “blended mode”, meaning that lectures were given in person and live-streamed, with a limited subset of students allowed to join each week and all the other students joining remotely via Zoom. this has resulted in more active online discussion among students, instructors and assistants over the course, and indeed there were quite a few interesting questions posted on the course page, which was run on Campuswire. i enjoyed answering those questions, because they made me think quite a bit about them myself.

Scaling laws of recovering Bernoulli

[Initial posting on Nov 29 2020] [Updated on Nov 30 2020] added a section about the scaling law w.r.t. the model size, per request from Felix Hill. [Updated on Dec 1 2020] added a paragraph referring to Dauphin & Bengio’s “Big Neural Networks Waste Capacity”. [Updated on Feb 8 2021] see “Learning Curve Theory” by Marcus Hutter for a better exposition of the scaling law and where it might be coming from. this is a short post on why i thought (or more like imagined) the scaling laws from <scaling laws for autoregressive generative modeling> by Henighan et al. “[is] inevitable from

Creating an encyclopedia from GPT-3 using B̶a̶y̶e̶s̶’̶ ̶R̶u̶l̶e̶ Gibbs sampling

[WARNING: there is nothing “WOW” nor technical about this post, but a piece of thought i had about GPT-3 and few-shot learning.] Many aspects of OpenAI’s GPT-3 have fascinated and continue to fascinate people, including myself. these aspects include the sheer scale, in terms of the number of parameters, the amount of compute and the size of data, the amazing infrastructure technology that has enabled training this massive model, etc. of course, among all these fascinating aspects, meta-learning, or few-shot learning, seems to be the one that fascinates people most. the idea behind this observation of GPT-3 as a