this is a slightly expanded version of my fb post: https://www.facebook.com/cho.k.hyun/posts/10216267975445626.

i’ve lived in three countries (finland, canada and the US) over the past 12 years as an expat/immigrant myself, which makes me pretty well aware of the issues and challenges faced by immigrants, in particular east asian ones, in these countries. this made me *incorrectly* believe that i knew the challenges and issues faced by immigrants everywhere beyond these three countries, including korea, where i was born and raised as a korean national and lived for 20+ years. this was until i saw this post by Alice, where she shared a link to the homepage of “*Hanmaum Education Volunteer Corp, which helps children of immigrant families in challenging environments by providing free education*” (my own translation of an excerpt from the original post.)

how did i miss this? this glaringly obvious omission of immigrant kids from all those years i was growing up in korea. somehow i’ve never had a chance to even have a single peer in any of the schools i had attended who was a kid of an immigrant family. realizing this was and is still quite a shock, considering that the number of immigrants, immigrant families and their children has been only growing over the past decades.

then, i realized it’s because i was born and raised near the center of the society. this has made me pretty much blind to the corners of the society, and all these immigrant moms (it’s also a bit concerning that it’s disproportionately immigrant “moms”) and their children were and are in those corners of the society. it was this post by Alice and this effort by Emeritus professor Byung-Gyu Choi of KAIST that made me take even a glimpse at this corner. what a blind fool i have been, and what else am i being blind to..?

last november (2020), i was invited to give an opening talk at SK ICT Tech Summit 2020, perhaps unsurprisingly together with Alice (i’m a huge fan!), and talked about my on-going project on breast cancer screening (see the recording of the talk here). SKT generously paid me a $6,000 lecture fee (and yes, it was super-generous, and i rarely receive any lecture fee from my invited talks ever.) i’ve been thinking about how i was going to spend this, and have decided to donate the entire sum to the Hanmaum education volunteer corp.

it’s not a lot, and it doesn’t come anywhere close to the contribution of the students and other volunteers who are on the ground providing education to these moms and kids. i hope however that this small gesture of mine will help immigrant parents & kids receive the education they truly deserve.

p.s. i’m quite proud to see my former visiting student and current good friend, Keunwoo, following my lead and showing others what to do.

]]>NYU์์๋ ์ด๋ฒ ๊ฐ์์ blended insturction์ ํ๋ค. ๊ฐ ์์ ์ ๊ท๋ชจ (ํ์ ์ ๋ฐ ์ฃผ๋น ๊ฐ์ ์), ํน์ฑ (๋๋ฉด ํ์) ๋ฑ์ ๊ณ ๋ คํ์ฌ remote, in-person ๋๋ blended๋ก ํ๊ธฐ ์์ ์ ๊ตฌ๋ถ์ ์ง์๊ณ , ๋๋ blended mode์ ๊ฐ์๋ฅผ ์งํํ๋ค. blended mode ์์ ์ ๊ฐ์๋ in-person ๊ทธ๋ฆฌ๊ณ lab sessions์ ํ์์ 2-3๋ฐฐ๋ก ๊ฐฏ์๋ฅผ ๋๋ ค์ in-person๊ณผ remote๋ฅผ ๋ชจ๋ ๊ฐ์ก๋ค. ๋ชจ๋ ๊ฐ์์ lab์ zoom์ ํตํด livestreamํ๊ณ ์ด๋ฅผ ํตํด ๋ด์์ ์ค์ง ๋ชปํ, NYU์ global campus์ ๋์ ์งํํ ํ์๋ค์ด ๊ฐ์๋ฅผ ๋ฃ๋๋ฐ ๋ฌธ์ ์๋๋ก ํ๋ค. ๋งค ๊ฐ์ ๋ฐ lab session์ in-person์ผ๋ก ์ฐธ์ํ ํ์์ ํ๊ธฐ ์์ ์ ๋ฏธ๋ฆฌ ๋ฐฐ์ ๋ ์ฃผ์ ๋ฏธ๋ฆฌ ๋ฐฐ์ ๋ ์๋ฆฌ์ ์๋๋ก ํ๊ณ , NYU์ ๋ชจ๋ facility์์๋ ๋ง์คํฌ ์ฐฉ์ฉ์ด ํ์์๋ค. ๊ฐ ๊ฐ์์๋ ํด๋น ๊ฐ์์ค ์ต๋ ์์ฉ ์ธ์์ 1/4-1/3๋ง ๋ค์ด์ฌ ์์๋๋ก ํ๊ณ , ๊ต์๋ ์์ธ ์์ด ์ธ์ ๋ ๋ง์คํฌ๋ฅผ ์ฐฉ์ฉํ๊ณ ๊ฐ์๋ฅผ ์งํํ๋ค. ๋ด ๊ฐ์์๋ ํ๋ฒ์ ์ต๋ 25-30๋ช ์ด ๋ค์ด์ฌ ์ ์์์ผ๋ ์ค์ง์ ์ผ๋ก๋ 3-10๋ช ์ ๋๊ฐ ๋ค์ด์ค๊ณ ๋๋จธ์ง ํ์๋ค์ zoom์ ํตํด livestream์ผ๋ก ์ฐธ์ํ๋ค.

์ด์ ๋์์ ๊ฐ ํ๊ณผ๋ ๊ต์ ๋ฐ ํฌ์ค๋ฅ, PhD ํ์๋ค์ด ํ์์ ๋ฐ๋ผ ์ฐ๊ตฌ์ค์ ๋์์ฌ ์ ์๋๋ก ์ฐ๊ตฌ์ค ๋ฐฐ์ ๋ฐ ์ฑ ์ ๋ฐฐ์น๋ฅผ ๋ชจ๋ ๋ฐ๊ฟจ๋ค. NYU Center for Data Science์ ๊ฒฝ์ฐ, ์ฐ๊ตฌ์ค ์ฌ๋ฐฐ์ ๋ฐ ๋ฏธํ ๋ฃธ ์ฌ๋ฐฐ์ ์ ํตํด ๋ชจ๋ ๊ต์, ํฌ์ค๋ฅ, PhD ํ์์ด 1์ธ1์ค์ ์ฐ๋๋ก ํ๊ณ , ์ด๋ฅผ ํตํด ์ฃผ๊ฑฐ ํ๊ฒฝ์ด ์๋์ ์ผ๋ก ์ด์ ํ ํ์๋ค์ด ๋ง ํธํ ์ฐ๊ตฌ์ ์ง์คํ ์ ์๋ ํ๊ฒฝ์ ์ ๊ณตํ๋๋ก ๋ ธ๋ ฅํ๋ค.

ํ๋ถ์๋ค๋ ์ํ๋ ๊ฒฝ์ฐ residence hall๋ก ๋ค์ด์์ ํ๊ธฐ๋ฅผ ์ง๋๊ณ , ์ด๋ฐ ๊ฒฝ์ฐ์๋ residence hall reconfiguration์ ํตํด์ ํ์๋ค ๊ฐ์ ๋ถํ์ํ ์ ์ด์ ์ต์ํํ๋๋ก ํ๋ค. ํ๊ต๋ด ์๋น (๋๋ถ๋ถ ํ๋ถ์๋ค ์ด์ฉ) ์ ๋ชจ๋ pick up์ผ๋ก ๋ณ๊ฒฝํ๊ณ , ํ๊ต ๋ด ๋ชจ๋ ์ฑ
์ ๋ฐ ๊ณต๋ถํ ์ ์๋ ๊ณต๊ฐ์ ๋ฏธ๋ฆฌ ์์ฝ์ ํ์ง ์์ผ๋ฉด ์ฐ์ง ๋ชป ํ๋๋ก ์์คํ
์ ๊ฐ์ถ์๋ค.

์ ์งํ ๊ต์๋ ๊ธ์ ์ฐ์ ๊ฒ์ฒ๋ผ ์ด๋ฐ ํ๊ฒฝ์ ๋ด์ ๋งจํํผ ํ๊ฐ์ด๋ฐ์ ๊ตฌ์ถํ๊ณ covid-19 outbreak์ ํผํ๊ธฐ ์ํด NYU๋ ๋ด์ ์บ ํผ์ค์ ์จ ๋ชจ๋ ํ์๊ณผ ๊ต์ง์์๊ฒ ํ๊ธฐ ์์ ์ 2์ฃผ ๋์ 1-2 ๋ฒ์ฉ PCR ํ
์คํธ๋ฅผ ๋ฐ๊ฒ ํ๋ค. ๋ด์์ ์ฌ๋ ๊ต์ง์๋ค์ ๋ด์๋ํ๊ต Langone ๋ํ๋ณ์์์, ๊ทธ๋ฆฌ๊ณ ํ์๋ค์ residence hall๋ค ๊ทผ์ฒ์ ํ
ํธ๋ฅผ ์น๊ณ ๋๋์ ์ผ๋ก ๊ฒ์ฌ๋ฅผ ์งํํ๋ค.

ํ๊ธฐ๊ฐ ์์ํ ํ ๋ชจ๋ ๊ตฌ์ฑ์์ ์๋ฌด์ ์ผ๋ก 2์ฃผ์ ํ ๋ฒ์ฉ ์นจ์ ์ด์ฉํ ํ ์คํธ๋ฅผ ๋ฐ์๋ค (https://www.nyu.edu/life/safety-health-wellness/coronavirus-information/safety-and-health/coronavirus-testing/ongoing-testing.html) ๋งค 2์ฃผ์ ํ ๋ฒ์ฉ reminder ์ด๋ฉ์ผ์ด ์ค๊ณ ํด๋น ์ฃผ์ ํ๊ต ๋ด์ ๊ตฌ์ถ๋ 4-5๊ตฐ๋ฐ์ ํ ์คํธ collection point์ ์ง์ ์ฐพ์๊ฐ test kit์ ๋ฐ์ ํ, ์ง ๋๋ ์ฌ๋ฌด์ค์์ ์นจ์ ๋ฑ์ ํ ๋ค์ collection point์ ๋๋ ค์ค๋ค. ๊ทธ ํ 1-3์ผ ํ ์จ๋ผ์ธ์ผ๋ก ๊ฒฐ๊ณผ๋ฅผ ํ์ธํ ์ ์๊ณ , ํด๋น ๊ฒฐ๊ณผ๋ ์ ๋ ฅ๋์ง ์์ ๊ฒฝ์ฐ ์นด๋ํค๋ฅผ ํตํ NYU ์ถ์ ์ด ์ ํ๋๋ค.

์ด๋ฅผ ํตํด ์์ฑ ํ์ ์ด ๋๋ฉด ํด๋น ๊ตฌ์ฑ์์ ๋ฐ๋ก ๊ฒฉ๋ฆฌ์ ๋ค์ด๊ฐ๊ณ ํ๊ต์์๋ contact tracing์ ๋ค์ด๊ฐ๋ค. ์์ฝ๊ฒ๋ contact tracing์ ํ๊ต ๋ด๋ก ํ์ ์ด ๋๊ณ , NY์ฃผ์์ ํ๊ต ๋ฐ contract tracing์ ์งํํ๋ค. ๋ฌผ๋ก ํ์๊ฐ ์ ํ ์๋๋ค๋ ๊ฑด ๋ชจ๋๊ฐ ์๋ ๋น๋ฐ์ด๋ค. ํ๊ธฐ์ด๋ฐ residence hall ๋ฑ์์ outbreak์ ๊ธฐ๋ฏธ๊ฐ ์์ด์ 2-3 ์ธต์ ํต์งธ๋ก ๊ฒฉ๋ฆฌํ๊ณ ์ ์ ๊ฒ์ฌ๋ฅผ ๋ ๋ฒ ์งํํ ๊ฒฝ์ฐ๊ฐ ์์๊ณ , ์ด๋ฅผ ํตํด ๋ ํฐ outbreak์ ํผํ๋ค.

์ด๋ฐ ๊ณผ์ ์ ํตํด ์ด 6๋ง ์ฌ๋ช ๊ตฌ์ฑ์ ์ค 15,000๋ช ์ ๋๊ฐ ์ด๋ฒ ํ๊ธฐ์ ์บ ํผ์ค์ ๋์์๊ณ ์ง๋์ฃผ๋ฅผ ๋ง์ง๋ง์ผ๋ก ํ๊ธฐ๊ฐ ์ค๋จ ์์ด ๋๋ฌ๋ค. ์ค์๊ฐ์ผ๋ก ์ ๋ฐ์ดํธ ๋์ด์จ ๋์๋ณด๋ (https://www.nyu.edu/life/safety-health-wellness/coronavirus-information/nyc-covid-19-testing-data.html)๋ฅผ ๋ณด๋ 8์ 1์ผ ์ดํ ์ด 19๋ง 9870๋ฒ ํ ์คํธ๋ฅผ ์งํํ๊ณ , 758 ์ผ์ด์ค๊ฐ ์์ฑ์ผ๋ก ํ์ ๋์๊ณ , ๋ด์ ์ด์ธ์ ์ง์ญ๊น์ง ํ์ฅํ๋ฉด ์ฝ 1000์ฌ ์ผ์ด์ค๊ฐ ์์ฑ์ด์๋ค. ์์ฑ์จ 0.38%๋ก ์ค์ ๋ด์์์ ๋นํด ํ์ ํ๊ฒ ๋ฎ๋ค.

์ฌ ๋ด ๋ด์์๋.. ํฐ ๋ณ์๋ค์ ์ง์ฅ์ด์๊ณ , ๋ณ์ ๋ฐ์ ์ ๋ น ๋์์๋ค. ์ง๊ธ๋ ์ฌ์ ํ ๋ด์์ฃผ๋ ๋งค์ผ ๋ง๋ช ์ด์ ํ์ง๋๊ณ ์๊ณ , 100๋ช ์ด์ ์ฌ๋งํ๊ณ ์๋ค. ๊ทธ๋ผ์๋ ๋ถ๊ตฌํ๊ณ NYU์์ ํ๊ธฐ ์ค๋จ ์์ด ํ์๋ค ๊ต์ก์ ์์ผฐ๊ณ , ๋ด์์์ public school ๋ํ ์ค๊ฐ์ ์ผ์์ ์ธ 2์ฃผ ์ค๋จ ์ธ์ ํ๊ธฐ๋ฅผ ์งํํ๋ค๋ ๊ฒ์ ํํธ์ผ๋ก๋ ๋ง์์ด ์กฐ๊ธ ํธํด์ง๋ค. ์ด๋ฒ ๋ด์๋, ๊ทธ๋ฆฌ๊ณ ๋ค์ ๊ฐ์์๋ ํ์ํ ๋ชจ๋ ๊ฒ์ ๋ค ํด์๋ผ๋ ํ๊ต๊ฐ ์ด๊ณ , ํ์ ์ง๋๊ฐ ์ ๋๋ก ์งํ๋๊ธธ ๋ฐ๋ผ๊ณ ํ๊ต ๊ตฌ์ฑ์ ์ค ํ๋๋ก์จ ์ต์ ์ ๋คํ ์์ ์ด๋ค.

๋ด์๋, ๋ฏธ๊ตญ ๋ค๋ฅธ์ฃผ๋, ํ๊ตญ๋, ์บ๋๋ค๋, ์ ๋ฝ๋, ๋ด๊ฐ ๋ด์ค๋ฅผ ์ด๋ ์ ๋ ๋ฐ๋ผ๊ฐ๋ ๋ง์ ์ง์ญ๋ค์ด ๋๊ธฐ์ , ๊ฑด๋ฌผ์ฃผ, ๋ถ๋์ฐ ๊ทธ๋ฆฌ๊ณ ๋ถ์๋ค ๊ฑฑ์ ์ ๋ง์ด ํ๋ค. ์ด๋ฐ ๊ฒฝ์ ์ ์ธ ๊ณ ๋ ค์ ๊ทธ์ ๋ํด ๋ฏธ๊ตญ์์ ๋ณด๋ค์ํผ ์ ์น์ ์ธ ๊ณ์ฐ์ด ์ด๋ฒ pandemic์ ์ผ๋ง๋ ์, ๋๋ ์ผ๋ง๋ ์๋ชป ๋ฒํ จ๋ด๋๋์ ๋ง์ ์ํฅ์ ๋ฏธ์น๊ณ ์๋ค. ์ํ๊น๊ฒ๋ ์ด๋ฐ ๋ณต์กํ ๊ณ ๋ ค ํ์์ ๊ต์ก์ด ์ฝ๊ฒ ๋ฌปํ ๋ฒ๋ฆฐ๋ค. pandemic์ ๋์ด ๋๊ฒ ์ง๋ง ์ด ๊ธฐ๊ฐ 1-3๋ ๋์ ๋ค๋ฅธ ์ธ๋๋ค์ ๋นํด ๊ต์ก์ ์ ๋๋ก ๋ชป ๋ฐ์ ์ธ๋์ ๋ํ ์ํฅ์ ์ผ๋ง๋ ์ค๋๊ฐ๊น?

ํน์ฌ๋ ์ง๋์ฃผ ๋ฐ์์ ์์ํ๋ค๋ฉฐ ๋ง์ ๋งฅ์ฃผ ํ ์ ๋๋ฌธ์ ์ฌํ์ ๋ฏธ๋๋ฅผ ํฌ์ํ๊ฑด ์๋๋ฐ์งโฆ

]]>Aalto University (in particular School of Science within) and Finland just keep on giving, and I feel like I continue to receive without giving anything back. I will have to think of some way for me to pay back all that I have received from them.

Kiitos paljon!

Of course, the whole event was virtual, and due to the time difference, I could not attend myself. Instead, I sent the video recording of my greetings. You can watch it at https://youtu.be/074nhA9SQvA. I’m also attaching the script I used for recording this video below.

hi,

i received admission to the international master’s program in machine learning and data mining, which was then called Macadamia, from Helsinki University of Technology in the spring of 2009.

although i applied to the program myself, finland was largely a land of mystery to me. perhaps this mysterious nature of the country may have been one of the major motivations for me to apply for this program in the first place. in my mind back then, finland was associated with just a couple of things, such as Nokia and Helsinki Olympics. i must confess that i wasn’t even aware that finland shared a border with russia. unsurprisingly, going to finland to study was definitely not what i had in my mind until one of my friends then handed me the brochure of the Macadamia program in the winter of 2008.

the very first lecture i attended in helsinki university of technology, which was about to be merged with the other two universities to form aalto university back then, was of the course “Machine Learning: Basic Principles”. this course was taught by Tapani Raiko, who had advised and mentored me for the next five years and who i still continue to admire and keep in touch with. in the very first lecture, i could immediately tell that i made the right choice to be there to study machine learning. and, to this day, i still believe i made the right choice to be there at Aalto University to study machine learning and data mining.

as a part of the Macadamia program, some students were assigned to some of the labs within the department, which was back then information and computer science (ICS), to assist in research one day a week with a small stipend. the master’s program was still free to anyone from anywhere in the world back then in finland, which, i sadly learned recently, is not the case anymore. without tuition-free education, my decision to come to finland to study at Aalto University might have taken a very different course.

anyways, i was assigned to the Bayes group, which i do not believe exists anymore and which, despite its name, had a longer history of research in neural networks. the group back then was led by Prof. Juha Karhunen, who i believe has recently retired, together with Tapani and Prof. Alexander Ilin, who recently made a comeback to Aalto to re-build the Bayes group, however with a new name, “Deep Learning”. this part-time research gig at the then-Bayes group, which started in September 2009, was the beginning of my research career that is still on-going.

i often wonder what i would’ve become had it not been for this program, called the “honours program” then, if i remember correctly, had i not been assigned to the Bayes group, or had i not been advised by Tapani and Alexander. it’s simply unimaginable. five years later in March 2014, i defended my doctoral dissertation against my “opponent” Prof. Nando de Freitas, in front of my friends, colleagues and supervisors from the then-newly-formed Department of Computer Science of Aalto University School of Science.

over those five years, i spent many days and nights in Maarintalo, studying for exams and working on projects. over those five years, i spent many days and nights in the computer science building, working toward my dissertation. over those five years, i had an uncountable number of lunches at the cafeterias in the computer science building as well as the main building. over those five years, i met so many friends and colleagues, many of whom i still keep in touch with.

Aalto University gave me an enormous opportunity by bringing me to Finland and giving me rigorous education on machine learning. Furthermore, Aalto University had successfully created an international environment in which I could immerse myself among talents from all over the world and be inspired by them. These were just the beginning of the series of opportunities Aalto University School of Science had given me over those five years.

my phd years were generously supported by FICS (the finnish doctoral programme in computational sciences), which has since been discontinued and, i believe, has been replaced by HICT. near the end of my phd programme, i was given a chance, supported by FICS and Prof. Erkki Oja, to spend six months visiting the University of Montreal to broaden my view and to further learn from the very best in the world.

this research visit opened my eyes to a broader set of topics in machine learning, and in particular this visit was how and when i began to seriously delve into studying how machine learning and more broadly AI could be used for and improve natural language processing and machine translation. this research visit led me to join the University of Montreal as a postdoc in a lab which was called Lisa back then and is now called Mila, immediately after i defended my dissertation.

And, now, i am an associate professor of computer science & data science at New York University, running my own research lab and teaching machine learning to aspiring students from all over the world.

in my opinion, one of the most important roles served by higher education is to bring the best out of each student. what this implies is that higher education cannot simply shove down knowledge into students, and education cannot simply show easy, comfortable and convenient ways forward to students. education must strive to provide as diverse and broad a set of opportunities and perspectives to students as possible in order to ensure each and every student has a chance to discover their way forward.

What i experienced during my years at Helsinki University of Technology, which had become Aalto University School of Science and Technology and has eventually become Aalto University School of Science, was precisely this: rigorous and thorough education, and a string of educational and extra-curricular opportunities within and beyond the walls of the university and even the country’s border.

It is truly my honour to be named the alumnus of the year, and to be frank I am quite unsure whether i deserve it. off the top of my head, i can think of Prof. Alexander Ilin, who is now back at Aalto University. Dr. Tapani Raiko, who is now at Apple, is another obvious candidate. and, no, it’s a totally objective list. they just happened to have mentored me throughout my years at Aalto.

let me wrap it up by dusting off my finnish: Kiitos paljon!

]]>i enjoyed answering those questions, because they made me think quite a bit about them myself. of course, as usual i ended up leaving only a short answer to each, but i thought i’d share them here in case any students in the future run into the same questions. although my answers are all quite speculative and based on experience rather than rigorously justified, what’s fun in rigorously proven and well-known answers?

of course, there were so many more questions asked and answered during live lectures and in the chatrooms, but i just cannot recall all of them easily nor am i energetic enough after this unprecedented semester to go through the whole chat log to dig out interesting questions. i just ask you to trust me that the list of questions below is a tiny subset of interesting questions.

i will paraphrase/shorten the answers below and remove any identifying information (if any):

- Why was backprop controversial? Yann mentioned that one of the big things that made the use of ConvNets in various applications controversial was the use of backpropagation. backprop is just an application of the chain rule, so why would anyone be suspicious of using it?
- Professor LeCun said that mini-batch has no advantage over single-batch SGD besides being easier to parallelize, and online SGD is actually superior. Is there any other theoretical reason why single-batch is preferable?
- Why we would do batch normalization instead of normalizing the whole dataset all at once at first? Is it for when normalizing the whole dataset is too computationally expensive? I understood that normalization makes the optimization process easier through making the eigenvalues equal. However, if you’re only normalizing over the batch, your normalization for each batch is subject to noise and might still lead to bad learning rates for each dimension.
- Batch normalization in VAE: While implementing the convolutional VAE model, I noticed that removing these BatchNorm layers enabled the model to train as expected. I was wondering why does BatchNorm cause this issue in the VAE model?
- In semi-supervised VAE, how do we decide the embedding dimensions for the class? Also, BERT used position embedding to represent the position, so how do we determine the position embedding dimensions in BERT?
- Why do we divide the input to the softmax in dot product attention by the square root of the dimensionality?
- DL appears to add double descent as a caveat in addition to bias-variance tradeoff learned earlier. Do you have any insights on how we should think about double-descent?
- In your opinion, will we achieve AGI?

**1. Why was backprop controversial? Yann mentioned that one of the big things that made the use of ConvNets in various applications controversial was the use of backpropagation. backprop is just an application of the chain rule, so why would anyone be suspect of using it?**

when yann said it was controversial to use backprop earlier, i believe he meant it in two different ways: (1) backprop itself to compute the gradient of the loss function w.r.t. the parameters and (2) backprop to refer to gradient-based optimization. i’ll explain a bit of each below, but neither of them is considered a serious argument against using backprop anymore.

(1) backprop was controversial and is under great scrutiny when artificial neural nets (what we learn) are compared against biological neural nets (what we have). it’s quite clear due to biological constraints that backprop is not implemented in brains as it is in our deep learning toolkits (see e.g., https://openreview.net/forum?id=HJgPEXtIUS for some of the interesting biological constraints/properties that should be satisfied by any biologically plausible learning algorithm.) to some people, this is a make-or-break kind of issue, because there seems to exist a learning algorithm that results in a superior neural net (human brains!). of course, this could just mean that a biological brain is approximating the gradient computation as well as it can under the constraints, but it’s not easy to verify this (see, e.g., https://www.youtube.com/watch?v=VIRCybGgHts for how a brain might implement backprop.)

another criticism or objection along this line is that biological brains seem to have either zero or multiple objectives that are being optimized simultaneously. this is unlike our usual practice in deep learning where we start by defining one clear objective function to minimize.

(2) gradient-based optimization often refers to a set of techniques developed for (constrained/unconstrained) convex optimization. when such a technique is used for a non-convex problem, we are often working with the local quadratic approximation, that is, given any point in the space, the underlying non-convex objective function can be approximated by a convex quadratic function ($\theta^\top H \theta + g^\top \theta + c$.) under this assumption, gradient-based optimization would be attracted toward the minimum of this local quadratic approximation, regardless of whether there exists a better minimum far away from the current point in the space. this is often used as a reason for criticizing the use of gradient-based optimization with a non-convex objective function, thereby for criticizing the use of backprop. see e.g. http://leon.bottou.org/publications/pdf/online-1998.pdf for extensive study on the convergence properties of SGD.
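to make the criticism in (2) concrete, here is a minimal sketch, assuming a hypothetical one-dimensional non-convex function `f` that i picked only for illustration (it has nothing to do with any actual neural net loss). plain gradient descent ends up in whichever minimum is nearby, regardless of whether a better one exists elsewhere.

```python
def f(x):
    # hypothetical non-convex objective with one shallow and one deep minimum
    return x**4 - 3 * x**2 + x


def grad_f(x):
    # analytic derivative of f
    return 4 * x**3 - 6 * x + 1


def gradient_descent(x0, lr=0.01, steps=2000):
    # plain gradient descent: repeatedly step against the gradient
    x = x0
    for _ in range(steps):
        x -= lr * grad_f(x)
    return x


x_right = gradient_descent(2.0)   # starts in the basin of the shallower minimum
x_left = gradient_descent(-2.0)   # starts in the basin of the deeper minimum

print(f"from x0= 2.0: x={x_right:.4f}, f(x)={f(x_right):.4f}")
print(f"from x0=-2.0: x={x_left:.4f}, f(x)={f(x_left):.4f}")
```

the two runs differ only in the starting point, and the run that starts on the right settles at a minimum with a visibly higher objective value than the one found from the left.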

this criticism however requires one big assumption: that there is a big gap in quality between a nearby local minimum (we’ll talk about it in a few weeks in the course) and the global minimum. if there is a big gap, this would indeed be trouble, but what if there isn’t?

it turned out that we’ve known for a few decades already that most local minima are of reasonable quality (in terms of both training and test accuracies) as long as we make neural nets larger than necessary. let me quote Rumelhart, Hinton & Williams (1986):

“The most obvious drawback of the learning procedure is that the error-surface may contain local minima so that gradient descent is not guaranteed to find a global minimum. However, experience with many tasks shows that the network very rarely gets stuck in poor local minima that are significantly worse than the global minimum. We have only encountered this undesirable behaviour in networks that have just enough connections to perform the task. Adding a few more connections creates extra dimensions in weight-space and these dimensions provide paths around the barriers that create poor local minima in the lower dimensional subspaces.”

<Learning representations by back-propagating errors> by Rumelhart, Hinton & Williams (1986)

this phenomenon has been and is being studied quite extensively from various angles. if you’re interested in this topic, see e.g. http://papers.nips.cc/paper/5486-identifying-and-attacking-the-saddle-point-problem-in-high-dimensional-non-convex-optimization and https://arxiv.org/abs/1803.03635 for some recent directions. or, if you feel lazy, you can see my slides at https://drive.google.com/file/d/1YxHbQ0NeSaAANaFEmlo9H5fUsZRsiGJK/view which i prepared recently.

**2. Professor LeCun said that mini-batch has no advantage over single-batch SGD besides being easier to parallelize, and SGD is actually superior. Is there any other theoretical reason why single-batch is preferable?**

this is an interesting & important question, and the answer to this varies from one expert to another, including Yann and myself as well, based on what are being implicitly assumed and what are being used as criteria to tell which is preferred (computational efficiency, generalization accuracy, etc.)

Yann’s view is that noise in SGD greatly helps generalization, because it prevents learning from being stuck at a sharp local minimum and drives learning toward a flatter local minimum. a flatter minimum implies that the final neural net is more robust to perturbation to the parameters, which naturally translates to robustness to perturbation to the input, implying that it would generalize better. under this perspective, you want to maximize the level of noise, as long as the noise roughly cancels out on average across all the stochastic gradients computed from the training examples. that would correspond to using just one training example for computing each stochastic gradient.

of course, the amount of noise, which is proportional to the variance of the stochastic gradient, does impact the speed at which learning happens. in recent years, we (as the community of deep learning researchers) have found that certain network architectures require stochastic gradients computed using large minibatches (though, it’s unclear what large means, as it’s quite relative to the size of the training set) to be trained at all. in these cases, it looks like high level of noise sometimes prevents any progress in learning especially in the early stage.
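as a rough illustration of how the minibatch size controls this noise, here is a small sketch. it assumes a hypothetical 1-d least-squares problem (the data, the point `w` and the batch sizes are all made up for this example) and simply measures the variance of the minibatch gradient for a few batch sizes.

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical 1-d least-squares problem: y = 2x + noise, loss_i(w) = (w*x_i - y_i)^2
n = 10_000
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)

w = 0.0  # evaluate the gradients at an arbitrary point
per_example_grad = 2.0 * (w * x - y) * x  # d loss_i / d w for every example


def minibatch_grad_variance(batch_size, num_batches=2000):
    # sample minibatches (with replacement) and measure the variance of
    # the resulting stochastic gradients
    idx = rng.integers(0, n, size=(num_batches, batch_size))
    return per_example_grad[idx].mean(axis=1).var()


variances = {b: minibatch_grad_variance(b) for b in (1, 16, 256)}
for b, v in variances.items():
    print(f"batch size {b:3d}: variance of stochastic gradient = {v:.4f}")
```

the variance shrinks roughly in proportion to the batch size, so a single-example gradient is the noisiest choice and a large minibatch the quietest. which end is preferable is exactly the open question above.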

so, in short, it’s still an open question. yann’s perspective may turn out to be the correct one (and that wouldn’t be the first time this happened,) or we may find a completely different explanation in the future.

**3. Why we would do batch normalization instead of normalizing the whole dataset all at once at first? Is it for when normalizing the whole dataset is too computationally expensive?** **I understood that normalization makes the optimization process easier through making the eigenvalues equal. However, if you’re only normalizing over the batch, your normalization for each batch is subject to noise and might still lead to bad learning rates for each dimension.**

there are three questions/points here. let me address each separately below:

“*normalization makes the optimization process easier through making the eigenvalues equal*“

we need to specify what kind of normalization you refer to, but in general, it’s not possible to make the hessian the identity by simply normalizing the input. this is only possible when we are considering a linear network with a specific loss function (e.g., l2 loss for regression and cross-entropy for classification.) however, it is known empirically, and in some cases rigorously as well, that normalizing the input variables to be zero-mean and unit-variance brings the condition number (the ratio between the largest and smallest real eigenvalues of the hessian matrix) closer to 1 (which is good.)
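here is a small sketch of the linear case, assuming a linear model without a bias term and the mean squared error, for which the hessian w.r.t. the weights is $X^\top X / n$. the two input features and their scales are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# two hypothetical, independent input features with very different means and scales
n = 10_000
X = np.column_stack([
    rng.normal(loc=5.0, scale=1.0, size=n),
    rng.normal(loc=-3.0, scale=100.0, size=n),
])


def condition_number(inputs):
    # hessian of the mean squared error of a linear model w.r.t. its weights
    hessian = inputs.T @ inputs / len(inputs)
    eigvals = np.linalg.eigvalsh(hessian)  # ascending order
    return eigvals[-1] / eigvals[0]


# normalize each input variable to be zero-mean and unit-variance
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

cond_raw = condition_number(X)
cond_std = condition_number(X_std)
print(f"condition number before normalization: {cond_raw:.1f}")
print(f"condition number after normalization:  {cond_std:.3f}")
```

because the two features here are independent, per-dimension normalization is enough to bring the condition number close to 1. with correlated features it would only help partially.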

“*why we would do batch normalization instead of normalizing the whole dataset all at once at first?*“

now, in the case of a network with multiple layers, it turned out that we can maximize the benefit of normalization by normalizing the input to each layer to be zero-mean and unit-variance. unfortunately, this is not trivial, because the input to each layer changes as the lower layers’ weights and biases evolve. in other words, if we wanted to normalize the input to each layer, we would need to sweep through the entire dataset every time we update the weight matrices and bias vectors, which would make it intolerably expensive. furthermore, renormalizing the input at a lower layer changes the input to the upper layers, ultimately causing the loss function to change dramatically each time we renormalize all the layers, likely making learning impossible. this is, however, addressable up to a certain degree (see http://www.jmlr.org/proceedings/papers/v22/raiko12/raiko12.pdf by Tapani Raiko, my phd advisor, and Yann LeCun.)

“*your normalization for each batch is subject to noise*“

this is indeed true, and that’s precisely why it’s a customary practice to keep the running averages of the mean and variance of each dimension in batch normalization. assuming that the parameters of the network evolve slowly, such practice ultimately converges to the population mean and variance.
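a minimal sketch of this running-average practice, assuming the activations of one dimension come from a fixed distribution (i.e., the network parameters are frozen, which is the limiting case of "evolve slowly"). the `momentum` value and the update convention below are one common choice, not the only one.

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical population statistics of one activation dimension
true_mean, true_std = 3.0, 2.0

momentum = 0.01          # small momentum = long averaging window
running_mean, running_var = 0.0, 1.0

for _ in range(2000):
    batch = rng.normal(true_mean, true_std, size=32)  # one noisy minibatch
    # exponential moving averages of the per-batch statistics
    running_mean = (1 - momentum) * running_mean + momentum * batch.mean()
    running_var = (1 - momentum) * running_var + momentum * batch.var(ddof=1)

print(f"running mean: {running_mean:.3f} (population mean: {true_mean})")
print(f"running var:  {running_var:.3f} (population var:  {true_std**2})")
```

each individual batch statistic is noisy, but the running averages settle near the population values, which is what gets used at test time.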

**4. Batch normalization in VAE: While implementing the convolutional VAE model, I noticed that removing these BatchNorm layers enabled the model to train as expected. I was wondering why does BatchNorm cause this issue in the VAE model?**

i don’t have a clear answer unfortunately, but can speculate a bit on why this is the case. my answer will depend on where batchnorm was used. of course, before reading the answer below, make sure your implementation of batchnorm doesn’t have a bug.

if batchnorm was used in the approximate posterior (encoder), it shouldn’t really matter, since the approximate posterior can be anything by definition. it can depend not only on the current observation $x$, but also on anything else that helps minimize the KL divergence from this approximate posterior to the true posterior. so, i wouldn’t be surprised if it’s totally fine leaving batchnorm in the encoder.

if batchnorm was used in the decoder, it may matter, as the likelihood distribution (generative distribution) is over the observation space $\mathcal{X}$ conditioned on the latent variable configuration $z$. with batchnorm, instead, the decoder is conditioned on the entire minibatch of latent variable configurations, that is, the latent variable configurations of the other examples. this may hinder optimization in the early stage of learning (in the later stage of learning, it shouldn’t really matter much, though.)

in general, batchnorm is a tricky technique and makes it difficult to analyze SGD, because it introduces correlation across per-example stochastic gradients within each minibatch.
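the dependence on the other examples is easy to see in a few lines. the sketch below assumes a bare-bones training-mode batchnorm (no learned scale and shift) applied directly to hypothetical latent vectors: the same `z0` gets a different output depending on which other examples share its minibatch.

```python
import numpy as np


def batchnorm(h, eps=1e-5):
    # training-mode batch normalization over the minibatch axis,
    # without the learned scale and shift
    return (h - h.mean(axis=0)) / np.sqrt(h.var(axis=0) + eps)


rng = np.random.default_rng(0)
z0 = rng.normal(size=(1, 4))        # the latent configuration we care about
others_a = rng.normal(size=(7, 4))  # one set of minibatch companions
others_b = rng.normal(size=(7, 4))  # a different set of companions

out_a = batchnorm(np.concatenate([z0, others_a]))[0]
out_b = batchnorm(np.concatenate([z0, others_b]))[0]

print("output for z0 with companions a:", out_a)
print("output for z0 with companions b:", out_b)
```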

**5. In semi-supervised VAE, how do we decide the embedding dimensions for the class? Also, BERT used position embedding to represent the position, so how do we determine the position embedding dimensions in BERT?**

this question can be answered from two angles.

a. network size

the embedding dimensionality is a part of a neural net, and it can be thought of as a part of determining the size of your neural network. it’s a good rule of thumb to use as large a neural net as you can within your computational and financial budget to maximize your gain in terms of generalization. this might sound counter-intuitive, if you have learned from earlier courses that we want to choose the most succinct model (according to the principle of occam’s razor,) but in neural nets, it’s not simply the size of the model, but the choice of optimization and regularization that matters perhaps even more. in particular, as we will learn next week, because SGD inherently works in a low-dimensional subspace of the parameter space and cannot explore the whole space of the parameters, a larger network does not imply that it’s more prone to overfitting.

b. why more than one dimension?

let’s think of the class embedding (though, the same argument applies to positional embedding.) take as an example handwritten digit classification, where our classes consist of 0, 1, 2, .., 9. it seems quite natural that there’s a clear one-dimensional structure behind these classes, and that we would only need a one-dimensional embedding. why then do we need a multi-dimensional class embedding?

it turned out that there are multiple degrees of similarity among these classes, and that the similarity among these classes is context-dependent. that is, depending on what we see as an input, the class similarity changes. for instance, when the input is a slanted 3 (3 significantly rotated clock-wise), it looks like either 3 or 2 but not 8 nor 0. when the input is a straight-standing 3, it looks like either 3 or 8 but not 2. in other words, the classes 3 and 2 are similar to each other when the input is a slanted 3, while the classes 3 and 8 are similar to each other when the input is an upright 3.

having multiple dimensions to represent each class allows us to capture these different degrees of similarity among classes. a few dimensions in the class embeddings of 3 and 2 will point toward a similar direction, while a few other dimensions in the class embeddings of 3 and 8 will point toward another similar direction. when the input is a slanted 3, the feature extractor (a convolutional net) will output a vector that will emphasize the first few dimensions and suppress the other dimensions to exploit the similarity between 3 and 2. a similar mechanism would lead to a feature vector of an upright 3 that would suppress the first few dimensions and emphasize the latter few to exploit the similarity between 3 and 8.
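a toy sketch of this mechanism, with hand-picked (entirely hypothetical) 4-dimensional class embeddings and feature vectors. in a real model both would be learned; here the first two dimensions are shared by 3 and 2, and the last two by 3 and 8.

```python
import numpy as np

# hypothetical class embeddings: dims 0-1 encode the "3 vs 2" similarity,
# dims 2-3 encode the "3 vs 8" similarity
class_embedding = {
    2: np.array([1.0, 1.0, 0.0, 0.0]),
    3: np.array([1.0, 1.0, 1.0, 1.0]),
    8: np.array([0.0, 0.0, 1.0, 1.0]),
}

# hypothetical outputs of the feature extractor for two kinds of input
features = {
    "slanted 3": np.array([1.0, 1.0, 0.0, 0.0]),  # emphasizes dims 0-1
    "upright 3": np.array([0.0, 0.0, 1.0, 1.0]),  # emphasizes dims 2-3
}


def class_scores(feature):
    # dot product between the feature vector and each class embedding
    return {c: float(e @ feature) for c, e in class_embedding.items()}


scores = {name: class_scores(h) for name, h in features.items()}
for name, s in scores.items():
    print(name, s)
```

with the slanted input, classes 3 and 2 score equally high and 8 scores zero. with the upright input, 3 and 8 score high and 2 scores zero. a single embedding dimension could not place 3 next to 2 in one context and next to 8 in another.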

it’s impossible to tell in advance how many such degrees of similarity exist and how to encode them. that’s why we need to use as high-dimensional an embedding as possible for encoding any discrete, one-hot input.

**6. Why do we divide the input to the softmax in dot product attention by the square root of the dimensionality?**

This question was asked at one of the office hours, and Richard Pang (one of the TA’s) and i attempted to reverse-engineer the motivations behind the scaled dot-product attention from the transformers.

assume each key vector $k \in \mathbb{R}^d$ is a sample drawn from a multivariate, standard Normal distribution, i.e., $k_i \sim \mathcal{N}(0, 1^2).$ given a query vector $q \in \mathbb{R}^d$, we can now compute the variance of the dot product between the query and key vectors as $\mathbb{V}[q^\top k] = \mathbb{V}[\sum_{i=1}^d q_i k_i] = \sum_{i=1}^d q_i^2 \mathbb{V}[k_i] = \sum_{i=1}^d q_i^2$. in other words, the variance of each logit is the squared norm of the query vector.

assume the query vector $q$ is also a sample drawn from a multivariate, standard Normal distribution, i.e., $q_i \sim \mathcal{N}(0, 1^2)$. in other words, $\mathbb{E}[q_i]=0$ and $\mathbb{V}[q_i]=\mathbb{E}_{q_i} \left[(q_i - \mathbb{E}[q_i])^2\right] = \mathbb{E}_{q_i} \left[ q_i^2 \right] = 1$. then, the expected variance of the logit ends up being $\mathbb{E}_{q} \left[ \mathbb{V}[q^\top k] \right] = \mathbb{E}_{q} \sum_{i=1}^d q_i^2 = \sum_{i=1}^d \mathbb{E}_{q_i} q_i^2 = \sum_{i=1}^d 1 = d.$

we can now standardize the logit to be $0$-mean and unit-variance (or more precisely, we make the logit’s scale to be invariant to the dimensionality of the key and query vectors) by dividing it with the standard deviation $\sqrt{\mathbb{E}_q \mathbb{V}[q^\top k]}=\sqrt{d}.$
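the derivation above can be checked numerically. this is a quick monte carlo sketch under the same Normality assumptions. the dimensionality `d` and the number of samples are arbitrary choices for the check.

```python
import numpy as np

rng = np.random.default_rng(0)
d, num_samples = 64, 20_000

# queries and keys drawn from a standard Normal distribution, as assumed above
q = rng.standard_normal((num_samples, d))
k = rng.standard_normal((num_samples, d))

logits = np.sum(q * k, axis=1)        # one dot product q^T k per sample
scaled_logits = logits / np.sqrt(d)   # the scaled dot-product attention logit

var_raw = logits.var()
var_scaled = scaled_logits.var()
print(f"variance of q^T k:           {var_raw:.2f} (d = {d})")
print(f"variance of q^T k / sqrt(d): {var_scaled:.3f}")
```

the unscaled variance comes out close to `d`, and dividing by `sqrt(d)` brings it back to roughly 1 regardless of the dimensionality.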

these assumptions of Normality do not hold in reality, but as we talked about it earlier, Normality is one of the safest things to assume when we don’t know much about the underlying process.

As Ilya Kulikov kindly pointed out, this explanation doesn’t answer “why” and instead answers “what” scaling does. “why” is a bit more difficult to answer (perhaps unsurprisingly,) but one answer is that softmax saturates as the logits (the input to softmax) grow in their magnitudes, which may slow down learning due to the vanishing gradient. though, it’s unclear what’s the right way to quantify it.
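to make the saturation argument a bit more concrete, here is a rough sketch under the same Normality assumption. the proxy used here, the sum of the diagonal entries $p_i (1 - p_i)$ of the softmax Jacobian, is just one crude choice made for illustration (it is not a claim about the right way to quantify saturation), and all function names are hypothetical:

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def jacobian_diag_sum(probs):
    """Sum of d p_i / d z_i = p_i (1 - p_i); it approaches 0 as softmax saturates."""
    return sum(p * (1.0 - p) for p in probs)


def average_sensitivity(d, n_keys, n_trials, rng):
    """Average the proxy over random queries/keys, without and with 1/sqrt(d) scaling."""
    unscaled, scaled = 0.0, 0.0
    for _ in range(n_trials):
        q = [rng.gauss(0.0, 1.0) for _ in range(d)]
        logits = []
        for _ in range(n_keys):
            k = [rng.gauss(0.0, 1.0) for _ in range(d)]
            logits.append(sum(qi * ki for qi, ki in zip(q, k)))
        unscaled += jacobian_diag_sum(softmax(logits))
        scaled += jacobian_diag_sum(softmax([z / math.sqrt(d) for z in logits]))
    return unscaled / n_trials, scaled / n_trials


rng = random.Random(0)
unscaled_avg, scaled_avg = average_sensitivity(d=256, n_keys=8, n_trials=200, rng=rng)
print(f"unscaled: {unscaled_avg:.3f}  scaled: {scaled_avg:.3f}")
```

without scaling, the softmax output is close to one-hot most of the time and the proxy is close to zero; with the $\sqrt{d}$ scaling it stays well away from zero.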

**7. DL appears to add double descent as a caveat in addition to bias-variance tradeoff learned early on. Do you have any insights about how we should think about double-descent?**

The so-called double descent phenomenon is a relatively recently popularized concept that’s still being studied heavily (though it was observed and reported by Yann already in the early 90s; see, e.g., https://journals.aps.org/prl/abstract/10.1103/PhysRevLett.66.2396 and also https://iopscience.iop.org/article/10.1088/0305-4470/25/5/020 by Krogh and Hertz.) The issue I have with double descent in deep neural nets is that it’s unclear how we should define model capacity. the # of parameters is certainly not the best proxy, because the parameters are all heavily correlated and redundant. perhaps it should be the number of SGD steps, because we learned that the size of the hypothesis space is in fact a function of the number of SGD steps.

One particular proxy I find interesting and convincing is the fraction of positive eigenvalues of the Hessian at a solution. With this proxy, it looks like the apparent double descent phenomenon often lessens. see e.g. https://arxiv.org/abs/2003.02139.

So, in short, the model capacity is a key to understanding the bias-variance trade-off or more generally generalization in machine learning, but is not a simple concept to grasp with deep neural networks.

**8. In your opinion, will we achieve AGI?**

Of course, I’m far from being qualified to answer this question well. Instead, let me quote Yann:

<An executive primer on artificial general intelligence> by Federico Berruti, Pieter Nel, and Rob Whiteman

Yann LeCun, a professor at the Courant Institute of Mathematical Sciences at New York University (NYU), is much more direct: “It’s hard to explain to non-specialists that AGI is not a ‘thing’, and that most venues that have AGI in their name deal in highly speculative and theoretical issues…

[Updated on Nov 30 2020] added a section about the scaling law w.r.t. the model size, per request from Felix Hill.

[Updated on Dec 1 2020] added a paragraph referring to Dauphin & Bengio’s “Big Neural Networks Waste Capacity”.

[Updated on Feb 8 2021] see “Learning Curve Theory” by Marcus Hutter for a better exposition of the scaling law and where it might be coming from.

this is a short post on why i **thought** (or more like imagined) the scaling laws from <scaling laws for autoregressive generative modeling> by Henighan et al. “[is] inevitable from using log loss (the reducible part of KL(p||q))” when “the log loss [was used] with a max entropy model”, which was my response to Tim Dettmers’s tweet on “why people are not talking more about the OpenAI scaling law papers”. thanks to João Guilherme for bringing this to my attention. it’s given me a chance to run some fun thought experiments over the weekend, although most, if not all, of them failed, as usual with any ideas and experiments i have. anyhow, i thought i’d leave here why i thought so, particularly from the perspective of dataset size.

- The scaling law for Bernoulli w.r.t. the dataset size
- The scaling law for Bernoulli w.r.t. the model size
- The scaling law for Bernoulli w.r.t. the compute amount
- Final thoughts

instead of considering a grand neural autoregressive model, i’ll simply consider estimating the mean of a Bernoulli variable after $N$ trials, and compare the log loss at this point against the log loss computed after $N+\Delta$ trials. let’s start by writing down the loss value after $N$ trials:

$$
-L(N) = p^* \log \frac{N_1}{N} + (1-p^*) \log \frac{N-N_1}{N} =
p^* \log N_1 + (1-p^*) \log (N-N_1) - \log N,
$$

where $p^*$ is the true ratio of heads and $N_1 < N$ is the number of heads from the $N$ trials.

let’s now consider tossing the coin $\Delta$ more times. i will use $\Delta_1 < \Delta$ as the number of additional heads after these additional trials. what’s the loss after $N+\Delta$ trials?

$$
-L(N+\Delta) = p^* \log (N_1 + \Delta_1) + (1-p^*) \log (N+\Delta - N_1 - \Delta_1) - \log (N+\Delta).
$$

so far so good. now, what kind of relationship between these two quantities $L(N)$ and $L(N+\Delta)$ do i want to get? in my mind, one way to say there’s a power law like structure behind $L$ is to show that the amount of improvement i get by running $\Delta$ more trials decreases as the number of existing trials $N$ increases. that is, there’s a diminishing return from a unit effort as more effort has been put in.*

then, let’s look at their difference by starting from the loss at $N+\Delta$, while assuming that $\Delta \ll N$ (and naturally $\Delta_1 \ll N_1$ as well) so that i can use $\log (1+x) \approx x$ when $x$ is small:

$$
\begin{align*}
-L(N+\Delta) =& p^* \log (N_1 + \Delta_1) + (1-p^*)\log(N+\Delta - N_1 - \Delta_1) - \log (N+\Delta)
\\
=&
p^* \log \left[ N_1 \left(1+ \frac{\Delta_1}{N_1}\right) \right] + (1-p^*) \log \left[ (N-N_1)\left(1 + \frac{\Delta - \Delta_1}{N-N_1}\right) \right] - \log \left[ N\left(1+ \frac{\Delta}{N}\right) \right]
\\
\approx
&
\underbrace{p^* \log N_1 + (1-p^*) \log (N-N_1) - \log N}_{=-L(N)} + p^* \frac{\Delta_1}{N_1} + (1-p^*)\frac{\Delta - \Delta_1}{N-N_1} - \frac{\Delta}{N}.
\end{align*}
$$

The decrease in the loss by running $\Delta$ more trials can now be written as

$$
L(N) - L (N+\Delta) = p^* \frac{\Delta_1}{N_1} + (1-p^*)\frac{\Delta - \Delta_1}{N-N_1} - \frac{\Delta}{N}.
$$

since $\Delta_1 < \Delta$ and $N_1 < N$, let’s rewrite them as $\Delta_1 = \beta \Delta$ and $N_1 = \alpha N$, where $\alpha \in [0,1]$ and $\beta \in [0,1]$. then,

$$
L(N) - L (N+\Delta) = p^* \frac{\beta \Delta}{\alpha N} + (1-p^*) \frac{(1-\beta)\Delta}{(1-\alpha)N} -\frac{\Delta}{N} = \frac{\Delta}{N} \left(p^* \frac{\beta}{\alpha} + (1-p^*)\frac{1-\beta}{1-\alpha} - 1\right)
$$

this says that the change from the loss at $N$ to the loss at $N+\Delta$ is inversely proportional to $N$ itself, which is what i wanted to see from the beginning. although there were a few leaps of faith along the way, it looks like the more tosses i have made (i.e., the larger $N$ is), the smaller the change i can make to my loss with a constant number of extra tosses.
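a quick numerical sanity check of this $1/N$ behaviour, as a sketch only: it compares the exact difference $L(N) - L(N+\Delta)$ against the first-order approximation above for fixed ratios $\alpha$ and $\beta$. the particular values of $p^*$, $\alpha$, $\beta$ and $\Delta$ are arbitrary choices for illustration, and the counts are treated as real numbers rather than integers:

```python
import math


def neg_loss(p_star, n_heads, n):
    """-L(N) = p* log(N_1 / N) + (1 - p*) log((N - N_1) / N)."""
    return p_star * math.log(n_heads / n) + (1 - p_star) * math.log((n - n_heads) / n)


def exact_improvement(p_star, alpha, beta, n, delta):
    """Exact L(N) - L(N + Delta) with N_1 = alpha * N and Delta_1 = beta * Delta."""
    n1, d1 = alpha * n, beta * delta
    return neg_loss(p_star, n1 + d1, n + delta) - neg_loss(p_star, n1, n)


def approx_improvement(p_star, alpha, beta, n, delta):
    """First-order approximation obtained with log(1 + x) ~ x."""
    return (delta / n) * (p_star * beta / alpha + (1 - p_star) * (1 - beta) / (1 - alpha) - 1)


p_star, alpha, beta, delta = 0.7, 0.6, 0.7, 10  # arbitrary illustrative values
rows = []
for n in (10**3, 10**4, 10**5):
    exact = exact_improvement(p_star, alpha, beta, n, delta)
    approx = approx_improvement(p_star, alpha, beta, n, delta)
    rows.append((n, exact, approx))
    print(f"N={n:6d}  exact={exact:.3e}  approx={approx:.3e}  N*exact={n * exact:.4f}")
```

if the approximation is reasonable, `N*exact` should stay roughly constant as $N$ grows, i.e., the improvement from $\Delta$ extra tosses shrinks like $1/N$.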

the second (multiplicative) term is more complicated, and i find it easier to think of two extreme cases; $p^*=1$ and $p^*=0$. these cases are reasonable if we think of this exercise as a proxy to studying classification, where it’s often assumed that a given input either belongs to one (positive) or the other (negative) class in an ideal world. when $p^*=1$, the second term reduces to

$$
\frac{\beta}{\alpha} - 1~~
\begin{cases}
> 0, & \text{if } \beta > \alpha \\
< 0, & \text{if } \beta < \alpha \\
= 0, & \text{if } \beta = \alpha
\end{cases}
$$

in other words, if the extra tosses reflected the true distribution better ($\beta > \alpha$, because the true positive rate is $1$,) the loss drops. otherwise, the loss increases ($\alpha > \beta$) or stays the same (i.e., no additional information has been added.) the other extreme case of $p^* = 0$ works similarly.

what’s important is that this second term largely dictates the sign of how the loss changes with the extra $\Delta$ tosses. since we are considering only the ratios of the heads within sets of trials and (suddenly!) assume both $N$ and $\Delta$ are reasonably large, the magnitude of change is instead largely determined by the ratio between $\Delta$ and $N$, with $N$ in the denominator.

so, this is how i arrived at my shallow take on twitter that these scaling laws may not have too much to do with whether we use neural net parameterization or not, whether we are solving language modeling, machine translation, etc., nor whether we are working with text, image or both. “i think” it arises naturally from the maximum entropy formulation (you can think of estimating the log-frequency of the heads above with sigmoid/softmax to turn it into the Bernoulli distribution) and the log loss.

of course, because i had to make a number of leaps of faith (or to put it another way, a few unreasonable assumptions,) it’s possible that this actually doesn’t make much sense at the end of the day. furthermore, i’m super insecure about my math in general, and i’m about 99.9% sure there’s something wrong in the derivation above. hence the “i think” in front of the claim that the scaling law arises from log loss (cross-entropy) and maximum entropy models.

it’s important for me to point out at this point that Henighan et al. did much more than what i’ve discussed in this post and provided a much more extensive set of very interesting findings. they looked not only at the effect of the data size, but also the compute budget $C$ and model size $|\theta|$. in fact, they focus much more on the latter two than the former which was my focus here.

in the case of the model size, it’s quite trivial to map it to the argument i made above regarding the number $N$ of observations. let’s consider the model size $|\theta|$ in this context of recovering Bernoulli as the number of bits (with an arbitrary base, including $e$) allowed to represent $N$ and $N_1$ (and consequently, $\Delta$ and $\Delta_1$.) then, the maximum $N$ a model can count up to is $\exp(|\theta|)$, and by increasing the model size by $\delta$ (i.e., $|\theta|+\delta$,) we can toss the coin

$$
\exp(|\theta|) \exp(\delta) - \exp(|\theta|) = \exp(|\theta|) (\exp(\delta) - 1)
$$

more. in other words, increasing the size of the model, while assuming that we can run as many tosses as we can to saturate the model capacity, is equivalent to setting $\Delta$ above to $\exp(|\theta|) (\exp(\delta) - 1)$.

in this case, the first term in the change in the loss above reduces to

$$
\frac{\Delta}{N} = \frac{\exp(|\theta|) (\exp(\delta) - 1)}{\exp(|\theta|)} = \exp(\delta) - 1,
$$

which is weird, because the dependence on $N = \exp(|\theta|)$ disappeared. that is, the change in the loss w.r.t. the increase in the model size (the number of bits) is not dependent on the number of existing bits used by the model.

what is happening here? in my opinion, this implies that the # of parameters in a neural net, or increasing it, is **not** optimally done in terms of compression.

what if we instead assume that only a polynomial number of trials can be compressed, i.e., $N=|\theta|^c$? in particular, for the sake of simplicity, let’s assume $c=2$. in this case,

$$
\frac{\Delta}{N} = \frac{(|\theta|+\delta)^2 - |\theta|^2}{|\theta|^2} = 2\frac{\delta}{|\theta|} + \left(\frac{\delta}{|\theta|}\right)^2,
$$

and voila! we recovered the dependence on the model size $|\theta|$, and this dependence is inversely proportional, as expected. by further assuming that $\delta \ll |\theta|$, we end up with

$$
\frac{\Delta}{N} \approx 2 \frac{\delta}{|\theta|}.
$$
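the contrast between the two capacity assumptions can be checked with a few lines of arithmetic. this is only a sketch: the function names are made up, and $\Delta$ is taken to be the number of extra tosses the enlarged model can count, i.e., $N(|\theta|+\delta) - N(|\theta|)$:

```python
import math


def relative_extra_tosses_exponential(theta, delta):
    """Delta / N when the model can count up to N = exp(theta)."""
    n = math.exp(theta)
    return (math.exp(theta + delta) - n) / n


def relative_extra_tosses_quadratic(theta, delta):
    """Delta / N when the model can only count up to N = theta ** 2."""
    n = theta ** 2
    return ((theta + delta) ** 2 - n) / n


delta = 1.0
for theta in (10.0, 100.0, 500.0):
    exp_ratio = relative_extra_tosses_exponential(theta, delta)
    quad_ratio = relative_extra_tosses_quadratic(theta, delta)
    print(f"theta={theta:6.1f}  exponential={exp_ratio:.6f}  "
          f"quadratic={quad_ratio:.6f}  2*delta/theta={2 * delta / theta:.6f}")
```

under the exponential assumption the ratio does not move with $|\theta|$ at all, while under the quadratic one it shrinks roughly like $2\delta/|\theta|$.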

so, what does it say about the observation by Henighan et al. that there is a scaling law w.r.t. the model size? i suspect that their observation is telling us that deep nets we use are far from optimal in the sense of compressing data. it could be due to the choice of architectures, due to our choice of learning algorithms or even due to regularization techniques we use. it’ll be interesting to pinpoint what’s behind this sub-optimality.

as i was writing the last paragraph, i was reminded of this earlier workshop paper by Yann Dauphin & Yoshua Bengio from the workshop track of ICLR’13, titled “Big Neural Networks Waste Capacity.” in this work, they observed the “rapidly decreasing return on investment for capacity in big networks” and conjectured this is due to the “failure of first order gradient descent.” perhaps, Yann was onto something, although i don’t think he’s followed up on this.

in the case of the compute budget, i have absolutely no idea, but i wonder if a similar argument as the model size could be made. the number of SGD steps largely dictates the maximum magnitude of the weights in a neural net. the resolution (?) of the computed probability is largely determined by the maximum magnitude of (or the variance of individual weights in) the final weight matrix (that feeds into the final softmax). perhaps we can connect these two to show that more SGD updates allow our neural net to more precisely identify the target probability. of course, this suggests that different optimization strategies may result in radically different scaling laws.

assuming what i wrote above makes even the slightest bit of sense, this raises two interesting questions, in my opinion. first, is all a sophisticated neural net does counting examples? the strict answer is no, because it both counts and compresses. it however looks as if it’s compression without any interesting emergent property (such as systematic generalization). second, how does this property change when we move away from the maximum entropy formulation and log-loss? i’ve pointed out two directions that look promising in a tweet earlier: margin ranking loss by Collobert & Weston and entmax series by Martins and co. if so, will it be the change in a desirable direction?

let me wrap up by thanking Henighan et al. and Kaplan & McCandlish et al. for thought-provoking pieces that have made me think of these models and problems i’ve been working with all along from a very different angle.

(*) of course the other (more positive) way to look at it is that there’s always more to be learned if we are ready to invest as much as we have invested already.

Earlier this month (Nov 2020) at the Samsung AI Forum 2020 I was one of the five recipients of the inaugural Samsung AI Researcher of the Year Award by the Samsung Advanced Institute of Technology (SAIT). Samsung has been supporting my research ever since I was a postdoc at Mila in Montreal, and without their support I wouldn’t have been able to support all my PhD students (NSF, i’m looking at you!) Because of this prolonged support, I had already been grateful to Samsung even before this award, and I am even more thankful now. It was also a humbling experience for me because of my fellow awardees, Seth Flaxman, Chelsea Finn, Cho-Jui Hsieh, and Jiajun Wu, who are so much more awesome than I am. Thanks to Seth’s suggestion, we are now all on each other’s whatsapp, which is another perk I got out of this award.

**Detour**: Before I continue to talk about this award, let me just briefly share with you my experience of living abroad in three different places (Helsinki, Montreal and NYC) that speak three different languages (Finnish, French and English) as an expat and in particular as a student expat, over the past ten years or so. In short, it’s not easy. It’s not easy in many ways, but the one I found most challenging was this feeling I had whenever I moved to a new place that I had to stay alert, watch my account balance and prepare for the worst until I fully settled down and got used to this new city and country. Even then, there’s a nagging feeling that I am only a temporary resident here and that I must be prepared to leave immediately without any hesitation if I’m forced to or decide to.

You can literally see this stress from newly arriving students or more broadly expats who are not financially well off. They have a difficult time appreciating beauty and joy in a new place, not to mention enjoying them. Even if this new town is filled up with awesome restaurants, they wouldn’t fancy the idea of dining at those restaurants. Even if the city is surrounded by amazing tourist destinations, they wouldn’t spare their time to visit them unless their parents come visit them. Their places are often light on furniture, and even the furniture they get is on the cheapest end of the spectrum: in fact, a lot of them don’t even buy a full bed but just a cheap mattress placed on their floor.

Even in my case, where I have been relatively well off financially for a newly arriving student/postdoc, i’ve never bought a couch ever since i left my parents’ place (don’t worry i’m planning to do so shortly,) and i bought a bed with a box spring for the first time only when I moved to NYC as a new faculty member. It took me my parents’ visit after my second year in Finland to travel to Rovaniemi and other touristic destinations in Finland and neighbouring countries (and let me tell you: there aren’t so many.) It took me a workshop at NRC Canada to visit Ottawa when I was in Montreal, and took me an invitation by Hugo Larochelle to visit U. Sherbrooke to visit Quebec City (I know.. it’s not on the way to Sherbrooke, but I took a detour.) Even when I could afford it, it took several walk-by’s before I could mentally prepare myself to decide to dine in at this reasonably fancy (but not that much…) place, and it still does.

That’s the weirdest thing: most of these I could afford back then and can certainly afford now. However, even if I could afford it, even if I knew it would improve how I live, and even if I knew that would make my days more comfortable, a lot of things felt much less accessible and looked overly and unnecessarily luxurious. I’ve experienced this stress, although I’ve thoroughly enjoyed and never regretted moving to and living in these places, been financially stable for most of my expat years and haven’t had any dependent to support. One begins to wonder how challenging it must be for others (and you!) who may be in worse situations.

**Back to the award**: this award comes with a generous $30,000 USD monetary prize^{1} (!) And, no, it’s not paid to the university for me to use to support my research, but it is the prize paid directly to me. In other words, I’m free to do whatever i want with this $30,000 that sprang out of nowhere. should i finally buy a couch? well, i could, but i can buy it without this prize money. should i buy a car? well, i live in manhattan. should i go on a luxury vacation? well, pandemic…

After a brief period of pondering, i’ve decided to donate the prize money^{2} to Mila where I was a postdoc for 1.5y + a visiting student for 0.5y. More specifically, i’ve decided to donate the prize money to Mila on the condition that it is used to provide a *one-time cash supplement* of up to $1,500 CAD to each incoming *female* student/postdoc, arriving from *Latin America*, *Africa*, *South Asia*, *South East Asia* or *Korea*, until the donation runs out. I hope this supplement gives students, who have just arrived in Montreal to start the new chapter of their lives, a bit of room for breathing. Perhaps they can use it to go enjoy a dinner at a nice restaurant in Montreal. Perhaps they can go out with their new friends and family for beer. Perhaps they can buy not just a mattress but a proper bed. it’s not for me to determine what lets them relax a bit in the midst of settling down in a new environment, and I just hope this to be helpful in whatever way suits them best.

I thoroughly enjoyed my time at Mila (which was, to be precise, called Lisa back then,) and have greatly benefited from spending my time there as a postdoc. i cannot imagine where i would be had i not been a postdoc at Mila. And, I hope this small gesture of mine could help a diverse group of incoming students/postdocs from all corners of the world have a more enjoyable time at Mila and benefit from their time at Mila as much as, if not more than, i have.

**Why female students from these regions (Latin America, Africa, South Asia, South East Asia and Korea)?** our field has an issue of representation in many aspects. we have an issue of gender representation. we have an issue of geographical representation. we have an issue of educational background/discipline representation. we have many more issues of representation in different aspects. All these issues of representation are equally important and critical, and I know that these are not just pipeline issues, based on my experiences of meeting amazing talents while teaching at Deep Learning Indaba 2018, Khipu.AI 2019, SEAML 2019, Deep Learning UB 2019 and the African Master’s Programme in Machine Intelligence (AMMI). these issues are often of opportunities and support. I believe we need to take even a little action at a time rather than waiting to address all of them simultaneously. in this particular case, I decided to give a minuscule shot at addressing a couple of these issues; the lack of female representation and the limited representation of researchers and students from Latin America, Africa, South Asia and South East Asia (I added Korea because the prize came from a Korean company :))

Also, perhaps a bit selfishly, i want to make sure there’ll be a role model my niece can look up to in the field of AI when she’s older.

(1) they also sent me this awesome plaque, but i don’t think Mila would appreciate it as donation.

(2) i’ve decided to donate $35,000 CAD after setting aside a bit for tax. after all, i’ve been paying more federal tax than the president for quite some time already and am expecting to pay some more this coming tax season.

**Background:** Right before COVID-19 struck NY heavily this past Spring, K-12 teachers from Busan, Korea stopped by NYC on their trip to the US to study various AI education strategies in the US, and asked me for a short meeting. Frankly i was quite skeptical about this meeting, and was assuming it was their vacation in disguise. This skepticism of mine completely melted away when I met them in their hotel’s meeting room and began to hear what they’ve done and are doing at their schools, covering primary (1-6y), middle (7-9y) and high schools (10-12y), to teach their students what AI is, what these students can already do with it, and what they would be able to do with it in the future. it was eye-opening and has since made me realize how outdated my view of K-12 education (be it in Korea or elsewhere) is and how much K-12 education can be updated to keep up with the latest developments in society when teachers are enthusiastic and given opportunities.

This trip was a part of their effort in creating teaching material for AI education aimed at K-12 teachers. I heard back from them a few months later that this material was ready to be published as a series of four books and was asked to write an opening remark. I was of course more than glad to write one for them. Because I’m not too comfortable writing about AI in Korean (i mean.. when have i ever written anything about AI in Korean?) i went ahead with English, and one of the participating teachers translated it into Korean.

Today (Nov 21 2020), i received the pdf copies of these four books and was able to take a more careful look at the content. it’s filled up with fun activities teachers can help students go through to learn about AI by experiencing a diverse set of sub-disciplines, including robotics, computer vision, natural language processing, machine learning, data science, etc. i’m so envious of these kids who will get to experience and have fun with all these activities and projects and ultimately become AI-native, unlike any of us.

And, without further ado, here it is.

**Foreword:** Intelligence is one of the last remaining mysteries of this universe and of ourselves that has evaded our collective attempt at uncovering its underlying mechanisms. We think every day, every hour, every minute, if not every second, effortlessly, without realizing that there are 86 billion neurons that are interacting with each other in both highly coordinated and highly chaotic manners behind this process of thinking. We perceive the surrounding world, which consists of our family, our friends and everything you can imagine and interact with each day, effortlessly, when the surrounding world never stays idle but dynamically changes its appearance non-stop. Based on our perception and pondering, we act in the surrounding environment effortlessly, although there are infinitely many possible ways in which our action could go wrong. Intelligence is behind these seemingly facile activities, driving each and every one of us from one moment to another, but intelligence has largely evaded our interrogation and investigation even until now.

Despite “artificial” in artificial intelligence, artificial intelligence (AI) is a scientific discipline in which intelligence in general, not necessarily an artificial one, is studied. As the first step in this direction, AI scientists ask what intelligence is. To answer this question, some are inspired by biological intelligence. To answer this question, some look into psychology. To answer this question, some look into philosophy. To answer this question, some look into mathematics. To answer this question, some, like myself, look into computer science which has a good track record of rigorously defining and understanding traditionally illusory concepts, such as information and computation, thanks to Claude Shannon, Alan Turing, who originally “propose[d] to consider the question, ‘Can machines think?’” in 1950, and the like.

In this scientific pursuit of (artificial) intelligence, “learning” has been found to be a central concept to intelligence. Intelligence is not merely a bag of algorithms and knowledge for solving a fixed set of problems, but it is rather the process of learning to solve a new problem by creating a new algorithm. Every time a new problem or a variant of a known problem is given, a machine, either biological or not, must “learn” to solve it and acquire a set of sophisticated skills in this process. The question of “what is intelligence?” has suddenly morphed itself into the question of whether we can build a machine that can learn to solve any problem. If we could build one, that machine would be intelligent, and this machine itself would be our answer to the ultimate question of “what is intelligence”. Machine learning is a sub-discipline in computer science that has pursued this direction of building a learning machine to figure out what intelligence is.

Machine learning has made rapid progress in recent years, thanks to theoretical and empirical advances in learning algorithms, increased availability of data, wide adoption of open-source software and incredible advances in computing systems. A few years ago, a deep neural network learned to listen to speech in a quiet room and transcribe it almost as well as an average person could. This was quickly followed by a deep convolutional network which could detect an incredible number of different objects in a picture, rivaling humans in object recognition. A couple of years later, a deep recurrent neural network was trained to translate news articles between English and Chinese and ended up translating almost as well as average bilingual speakers could. All these results were openly shared in forms of open-access publications and open-source software packages, which led to an unprecedented level of adoption of these new technologies. Industry has rapidly implemented and deployed these AI systems in various products, including voice assistants, real-time machine translators, automatic image tagging, content recommendation, driving assistance and even automated tutoring. These AI technologies are being deployed in increasingly more challenging domains, such as healthcare, medicine and automation.

Unfortunately positive is not the only way to describe this rapid advance and wide adoption of machine learning and thereby artificial intelligence in recent years. These AI systems have been silently tested and deployed in the society, touching many, if not most, of us often without our realization. These silent, and often premature, tests have sadly revealed negative sides of AI.

Billions of people use social media regularly, and social media companies extensively use AI technology to personalize individual users’ experience, effectively censoring the flow of information. Billions of people use video streaming services and news aggregation services every day, and the providers of these services use AI to decide not only what to but also what not to recommend and display to individual users, effectively shaping the users’ opinions without their own realization. This mass adoption of AI-based content filtering has unintentionally but unmistakably resulted in deepening polarization in many societies all over the world, sometimes resulting in fatal incidents and destabilization of otherwise stable, democratic societies.

Hastily developed and prematurely deployed AI systems, such as face recognition, automated exam proctoring and automated interviewer assessment, have been found to amplify undesirable societal biases and inequalities, such as racial bias, gender bias, income inequality and geographical inequality. For instance, an incorrect identification by a face recognition system used by police in the US has recently led to the wrongful arrest of an innocent black male; such systems have repeatedly been found to disproportionately label black people and people of colour as threatening. The world’s largest e-commerce company recently had to drop an AI-based recruiting system, because it was giving female candidates unjustifiable disadvantages for software engineering roles. A recent study has uncovered that commercial object recognition systems’ accuracies significantly drop when presented with pictures taken from poorer countries.

For AI to truly benefit us and the society, these shortcomings must be addressed and addressed fully. Technical advances alone, often made by a small group of elite scientists, will not be enough to make AI safe, fair and beneficial for all. Safe, fair and beneficial AI will only be possible when the whole society, consisting of both AI scientists and others, is aware of AI’s capability, adoption and deployment. The society must continue to carefully watch and monitor AI’s impact on the society, and be ready to rise and intervene against unsafe, unfair and unjust use of AI. This awareness of capability, limitations and underlying technology of AI is necessary for the society to benefit from AI.

Such awareness in the society of a new technology, in particular when it is an enabling technology, does not happen overnight. It must happen carefully and patiently over many years, if not decades, to ensure the whole society possesses a rational and coherent view of AI technology and its use. For this to happen, we must go beyond the status quo in which discourse on AI happens within and across universities and industry. We must start discourse and education on AI already with K-12 students who will be the first generation in the history of humanity to grow to live in a society where AI is not a novelty but an everyday reality. As the first step toward this goal, we must educate teachers of all levels to be familiar with and comfortable with the technologies and implications of AI, and must immediately start preparing educational materials and systems for teaching AI.

I thus applaud this effort by the Busan Metropolitan City Office of Education preparing a new curriculum and accompanying educational materials on AI for both students and teachers. In doing so, the team from the Office of Education has struck perfect balance between theory and application, between history and modern practices, and between technology and ethics. I am envious of students in Busan who will learn to be native in AI according to this curriculum, and am now hopeful rather than worried about the future of AI and its impact on society.

Many aspects of OpenAI’s GPT-3 have fascinated and continue to fascinate people, including myself. these aspects include the sheer scale, both in terms of the number of parameters, the amount of compute and the size of data, the amazing infrastructure technology that has enabled training this massive model, etc. of course, among all these fascinating aspects, meta-learning, or few-shot learning, seems to be the one that fascinates people most.

the idea behind this observation of GPT-3 as a meta-learner is relatively straightforward. GPT-3 in its essence computes the conditional distribution over all possible next tokens (from a predefined vocabulary) given a prefix: $p(x' | x_1, \ldots, x_t)$. this conditional distribution can be chained to form a conditional distribution over sequences given the prefix: $p(x'_1, \ldots, x'_{t'} | x_1, \ldots, x_t) = \prod_{t''=1}^{t'} p(x'_{t''} | x'_{<t''}, x_{\leq t})$. this makes GPT-3 subsume a so-called sequence-to-sequence or encoder-decoder model, allowing one to use GPT-3 to find an answer $(x'_1, \ldots, x'_{t'})$ given a question (often referred to as a “prompt”, which comes together with a couple of known examples) $(x_1, \ldots, x_t)$ by solving

\[
\arg\max_{x'_1, \ldots, x'_{t'}} \log p(x'_1, \ldots, x'_{t'} | x_1, \ldots, x_t).
\]

This problem is intractable in general, and people have been using an approximate decoding algorithm, such as greedy search or top-$k$ sampling, to find an answer given a prompt. In the GPT-3 paper, the authors present an impressive set of experimental results highlighting this meta-learning aspect of GPT-3.
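
to make the gap between the exact $\arg\max$ and these approximate decoders a bit more concrete, here is a minimal Python sketch. everything in it is my own assumption for illustration: `TOY_MODEL` is a hand-written table of next-token probabilities that only looks at the last token (nothing like GPT-3), and the function names are hypothetical. it shows greedy search, top-$k$ sampling, and a helper that scores a full answer with the chain rule.

```python
import random

# Hypothetical toy "language model": next-token probabilities that depend only
# on the previous token. A real model would condition on the entire prefix.
TOY_MODEL = {
    "<s>": {"the": 0.55, "a": 0.45},
    "the": {"cat": 0.5, "dog": 0.3, "</s>": 0.2},
    "a": {"dog": 0.7, "cat": 0.2, "</s>": 0.1},
    "cat": {"</s>": 0.9, "the": 0.1},
    "dog": {"</s>": 0.8, "a": 0.2},
}


def next_token_probs(prefix):
    """p(x' | x_1, ..., x_t) under the toy model (uses only the last token)."""
    return TOY_MODEL[prefix[-1]]


def sequence_prob(prompt, answer):
    """Chain rule: multiply the next-token probabilities along the answer."""
    seq = list(prompt)
    prob = 1.0
    for token in answer:
        prob *= next_token_probs(seq).get(token, 0.0)
        seq.append(token)
    return prob


def greedy_decode(prompt, max_len=10):
    """Pick the single most likely next token at every step."""
    seq = list(prompt)
    while len(seq) - len(prompt) < max_len:
        probs = next_token_probs(seq)
        token = max(probs, key=probs.get)
        seq.append(token)
        if token == "</s>":
            break
    return seq[len(prompt):]


def top_k_sample(prompt, k=2, max_len=10, rng=None):
    """Sample each next token from the k most likely ones, renormalized."""
    rng = rng or random.Random(0)
    seq = list(prompt)
    while len(seq) - len(prompt) < max_len:
        probs = next_token_probs(seq)
        top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
        tokens, weights = zip(*top)
        token = rng.choices(tokens, weights=weights, k=1)[0]
        seq.append(token)
        if token == "</s>":
            break
    return seq[len(prompt):]


greedy_answer = greedy_decode(["<s>"])
sampled_answer = top_k_sample(["<s>"], k=2, rng=random.Random(1))
```

in this toy table greedy search returns `the cat </s>`, while `a dog </s>` has a slightly higher probability under the same model, which is one small way to see that greedy search only approximates the $\arg\max$.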

But then you start to wonder. in particular, i began to wonder about this just today at our research group’s weekly meeting, when Elman Mansimov presented a few recent papers that have followed up on this meta-learning aspect of a language model, of which GPT-3 greatly raised awareness. What do i wonder? I wonder if it’s meta-learning, as we think of meta-learning conceptually, that drives this phenomenon, or if there is actually a simpler mechanism behind this observation.

let’s imagine a wonderful hypothetical world in which I can train another GPT-3 on the same data myself at NYU, but this time i will make one slight tweak. that is, i will train this new GPT-3, to which i refer as GPT-E, after reversing the order of all documents in the original dataset. that is, GPT-E computes the conditional distribution over all possible previous tokens given a suffix: $p(x | x'_t, x'_{t-1}, \ldots)$. since OpenAI has successfully trained GPT-3, you’d trust that i would be able to train this model in this hypothetical, but happy world. I will also assume that in this happy parallel universe, i can hire all the amazing talents who worked on GPT-3 at NYU perhaps as postdocs or even as PhD students so that the quality of GPT-E rivals that of GPT-3.
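
to make “reversing the order of all documents” concrete, here is a tiny Python sketch in which a count-based bigram model stands in for both GPT-3 and GPT-E. the corpus and the helper `train_bigram` are made up for illustration; the only point is that the very same training code, fed reversed documents, ends up estimating the distribution over the *previous* token.

```python
from collections import Counter, defaultdict


def train_bigram(documents):
    """Count-based estimate of p(next token | current token)."""
    counts = defaultdict(Counter)
    for doc in documents:
        for current, following in zip(doc, doc[1:]):
            counts[current][following] += 1
    return {
        current: {tok: c / sum(ctr.values()) for tok, c in ctr.items()}
        for current, ctr in counts.items()
    }


# Made-up corpus of tokenized "documents".
corpus = [
    ["q", ":", "2+2", "a", ":", "4"],
    ["q", ":", "3+3", "a", ":", "6"],
]

# Forward direction (GPT-3-like): p(next | current).
forward = train_bigram(corpus)

# Backward direction (GPT-E-like): same code, every document reversed,
# so the model now estimates p(previous | current).
backward = train_bigram([doc[::-1] for doc in corpus])
```

here `forward[":"]` spreads its mass over whatever follows a colon, while `backward["4"]` tells us what tends to come right before the answer `4`.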

but, then, something weird happens. if we believe GPT-3’s meta-learning capability, GPT-E does something as amazing as (if not more amazing than) what GPT-3 can do. It takes as input a test question-answer pair and can output the prompt, which contains both a few training examples and a test question (!), assuming of course that the amounts of information on both sides are comparable (which should be the case for zero-shot or few-shot learning.)

Do you see what I am getting at? yes, we can now alternate between GPT-3 and GPT-E to sequentially create an encyclopedia of all the knowledge in the world (well, at least the knowledge that was represented in the training set.) We start from a random factoid and call it $(Q_0,A_0)$. We can find a reasonable “prompt” by feeding GPT-E with $(r(A_0), r(Q_0))$, where $r$ reverses a string, and sampling $P_0 \sim p(x_1, \ldots, x_t | A_0, Q_0)$, preferably using top-$k$ sampling to reduce noise while maintaining some stochasticity. this prompt $P_0$ would consist of a (noisy) description of the task that corresponds to this factoid and a few noisy examples that are not exactly $(Q_0,A_0)$, in addition to the next question $Q_1$. We switch to GPT-3 and now sample another factoid $(Q_1, A_1)$ based on $P_0$. We alternate between these two steps, or more like between GPT-3 (real) and GPT-E (hypothetical), as long as we want and accumulate $(Q_n, A_n)$ to create the encyclopedia of world knowledge. Beautiful, isn’t it?
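
since GPT-E does not exist, the best i can offer is a sketch of the control flow of this alternation in Python. all names here are hypothetical: `toy_backward` and `toy_forward` are trivial stand-ins for sampling from GPT-E and GPT-3, `TOY_FACTS` is a made-up lookup table, and the toy backward step lazily reuses the current pair as its single “example” rather than producing noisy new ones.

```python
import random

# Made-up question -> answer table standing in for "world knowledge".
TOY_FACTS = {
    "capital of France?": "Paris",
    "2 + 2?": "4",
    "colour of a clear daytime sky?": "blue",
}


def last_question(prompt):
    """Extract the final, still unanswered question from a prompt."""
    return prompt.splitlines()[-2][len("Q: "):]


def toy_backward(rev_answer, rev_question, rng):
    """Stand-in for GPT-E: reads the reversed pair and emits a prompt written
    right to left. The prompt ends with a new question."""
    question, answer = rev_question[::-1], rev_answer[::-1]
    next_question = rng.choice([q for q in TOY_FACTS if q != question])
    prompt = f"Q: {question}\nA: {answer}\nQ: {next_question}\nA:"
    return prompt[::-1]


def toy_forward(prompt, rng):
    """Stand-in for GPT-3: answers the last question in the prompt.
    (A real sampler would use rng; this lookup is deterministic.)"""
    return TOY_FACTS[last_question(prompt)]


def alternate(seed_pair, rounds, backward, forward, rng):
    """Alternate backward (GPT-E-like) and forward (GPT-3-like) steps,
    accumulating the (Q_n, A_n) pairs visited along the way."""
    pairs = [seed_pair]
    question, answer = seed_pair
    for _ in range(rounds):
        # Backward step: feed (r(A_n), r(Q_n)), then un-reverse the output.
        prompt = backward(answer[::-1], question[::-1], rng)[::-1]
        question = last_question(prompt)
        # Forward step: continue the prompt to obtain A_{n+1}.
        answer = forward(prompt, rng)
        pairs.append((question, answer))
    return pairs


pairs = alternate(("2 + 2?", "4"), rounds=5, backward=toy_backward,
                  forward=toy_forward, rng=random.Random(0))
```

with real models plugged in for the two stand-ins, `pairs` would be the accumulated “encyclopedia”; here it just hops around the three toy facts.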

But, hold on. Where did meta-learning go? where is meta-learning in this Gibbs-like sampling procedure? is meta-learning just “noise” injected in each round of alternating between GPT-3 and GPT-E, for this Gibbs-like procedure to explore the space of knowledge effectively? If i wanted to put some positive, promising spin: is meta-learning how such noise is shaped by a large neural net so that it only spans relevant directions in this high-dimensional space corresponding to the knowledge manifold?

as I warned you at the beginning, there’s no “wow” moment nor “wow” conclusion in this post. this is just one piece of thought i had about GPT-3 that got me even more confused about all things machine learning (meta-learning, generative modeling, denoising, gibbs sampling, etc.)

P.S. i’m waiting for big tech firms with deep pockets (Amazon, Google, FB, etc. i’m looking at you) to train GPT-E for me to test this idea

P.P.S. you see why it was called GPT-E?

]]>There have been a series of news articles in Korea about AI and its applications that have been worrying me for some time. I’ve often ranted about them on social media, but I was told that my rant alone is not enough, because it does not tell others why I ranted about those news articles. Indeed that is true. Why would anyone trust my judgement delivered without even a grain of supporting evidence? So, I’ve decided to write a short post on Facebook (shared on Twitter) and perhaps surprisingly in Korean (!) This may have been the first AI/ML-related (though, very casual) post I’ve ever written in Korean, and is definitely not the best written piece from me, although I hope this post will clarify why I’ve been fuming about those news articles.

This post is quite casual and not academic. If I’m missing any important references for the general public that you want me to include here, please drop me a line. As I’m not in any way an expert in this topic, I’m sure I’ve missed many important references, discussions and points.

That said, I realized that it’s not only Korean speakers who engage with this post (via Google Translate, etc.) and that the automatic translation of this post into English is awful (thanks to the hat tip by my colleague Ernest Davis at NYU.) Since it’s a pretty short post, I’ve decided to put its English version here in my blog.

Although it’s a topic that’s actively discussed both in academic settings and on social media, such as Twitter and FB, I haven’t seen much discussion on the Social Impact & Bias of AI in Korean. To contribute even minimally to addressing this lack of discussion, here’s a list of a few points that are relevant to this topic. It’s possible that I simply have failed to find discussions surrounding this topic in Korean, and if there are any, please kindly point me to them.

[My apologies for the unprofessional writing. It’s not every day that I write anything in Korean.]

*Amplification*

It is true that technology reflects the society. It is however also true that such technology is used within the society it reflects and that it inevitably amplifies what it reflects. It’s illuminating to read <Automating Inequality> by Virginia Eubanks and <Race after Technology> by Ruha Benjamin to see how such amplification harms people. (https://www.nytimes.com/2018/05/04/books/review/automating-inequality-virginia-eubanks.html, https://us.macmillan.com/books/9781250074317, https://www.ruhabenjamin.com/race-after-technology) This amplification of negative aspects of the society is precisely why I fumed over the recent news articles on the wide adoption of AI interviews in Korea. You may think you’re not the one who’ll suffer from such amplification, but without any intervention it eventually gets to everyone. Have you ever considered the possibility that your kid may not have received the job offer because they didn’t attend a primary school in Gangnam when they were small?

Even if one imagines a perfect AI system, the issue of amplification still exists. Consider a hypothetically perfect AI system that has determined a candidate to be a 60% fit for the company, and suppose that this 60% is perfectly calibrated. As soon as a user of this system simply thresholds at 50% to make a hiring decision, it ends up with the same issue of amplification, because in practice users of such an AI system inevitably overrule the supposedly perfect uncertainty estimated by the system.
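
As a rough illustration of this thresholding argument, here is a small Python sketch. The two candidate pools and their calibrated scores (0.6 and 0.4) are hypothetical numbers of mine, not taken from any real system. The sketch only shows that a hard threshold at 0.5 turns a modest gap in calibrated probability into an all-or-nothing gap in hiring rate.

```python
def hire_rate_with_threshold(score, threshold=0.5):
    """Fraction of a pool hired when everyone at or above the threshold is hired."""
    return 1.0 if score >= threshold else 0.0


def hire_rate_proportional(score):
    """Fraction hired if decisions followed the calibrated probability itself."""
    return score


# Hypothetical calibrated scores for two pools of candidates.
scores = {"pool A": 0.6, "pool B": 0.4}

# Gap between the pools as estimated by the (assumed perfect) system.
gap_in_scores = scores["pool A"] - scores["pool B"]

# Gap between the pools after a user thresholds at 0.5.
gap_after_threshold = (
    hire_rate_with_threshold(scores["pool A"])
    - hire_rate_with_threshold(scores["pool B"])
)
```

Under these assumed numbers, the system itself sees a gap of about 0.2 between the pools, while the thresholded decision hires all of one pool and none of the other.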

*Opaqueness* of a model

Although it has been quite some time since so-called AI/ML systems were put into practice, it’s only relatively recently that their complexity has greatly increased. When a system in practice exhibits such a high level of complexity, it is important for its providers, its users and those who are influenced by it to be aware of the principles behind it. Unfortunately there’s a trend of ignoring this need and request for awareness based on a variety of excuses: that it is difficult to know the full details of the working principles, that the working principles are still under active research, or that they are a corporate secret. Of course it is a difficult scientific issue on its own, but what is needed in terms of transparency is not every single scientific and engineering detail but a high-level description of the working principle behind such systems and an understanding of their impact on the society (think of how ridiculous it would be if a car manufacturer didn’t tell you the horsepower of a car you are considering because there’s no way you could know all the details of the car, such as the minute details of internal combustion engines.) Unless these (even high-level) details are provided together with these AI systems, the negative impact of such systems on the society will only be discovered once the (potentially irreversible) damage has been done.

One promising direction I have observed in recent years is the proposal for model cards and datasheets for datasets: https://dl.acm.org/doi/abs/10.1145/3287560.3287596 and https://arxiv.org/abs/1803.09010. I wonder how many CEOs/CTOs and developers can answer the questions suggested for the model cards and datasheets about the AI systems they tout, as well as about the data used for those systems. I’m not particularly a good example myself, but I believe the bar is even higher for those who tout and deploy AI systems in the society.

*Selection bias* of data

It’s quite related to the previous point. It is important to think of how data used for building an AI system was collected and created. Unfortunately and perhaps surprisingly this aspect of data has received relatively little attention compared to other adjacent areas, but the research community has begun to pay more attention to data itself and notice various issues behind widely used datasets. For instance, Prabhu & Birhane (https://arxiv.org/abs/2006.16923) identified serious flaws and issues behind one of the most widely used image datasets, called TinyImages, from which the widely used CIFAR-10 was created. This has led to the removal of the TinyImages dataset 10 years after the original dataset was created and released. Although it’s now removed, you must wonder how many AI systems have been built using this data and been deployed in practice. Gururangan et al. (https://arxiv.org/abs/1803.02324) found various issues (or artifacts, as they called them) in the Stanford natural language inference (SNLI) data, stemming from the process of data collection. These findings are the result of the combination of both state-of-the-art AI/ML techniques and individual researchers’ manual efforts.

It’s not difficult to find news articles and academic papers bragging about the awesomeness of their AI systems. It is however more important for users and people who are being (either intentionally or unintentionally) judged by such systems to know the properties and characteristics of such systems and to be able to trust the quality of data and its collection process. It is thus imperative to invest more in this aspect of quality assurance than in the actual development of AI systems, in addition to continued research.

A recent work from FB demonstrates well the impact and importance of data and its collection: https://openaccess.thecvf.com/content_CVPRW_2019/html/cv4gc/de_Vries_Does_Object_Recognition_Work_for_Everyone_CVPRW_2019_paper.html. In this paper, the authors demonstrated that the accuracies of commercial object recognition systems correlate with the income levels of the regions in which pictures were taken. Hopefully, it doesn’t mean that the OCR service from Naver is less accurate for those who live in Jeollanam-do (which has the lowest per-capita GDP in Korea according to http://kosis.kr/statHtml/statHtml.do?orgId=101&tblId=DT_1C65) because the OCR system was trained mainly using data from Seoul and its metropolitan area (to be honest, I have no idea how Naver OCR is implemented, but I’m quite sure the majority of data used for building the system were collected from Seoul and its surrounding regions.)

To me, the human-and-machine-in-the-loop paradigm looks quite promising: https://arxiv.org/abs/1909.12434, https://arxiv.org/abs/1910.14599 and https://openreview.net/forum?id=H1g8p1BYvS. Although promising, it’s important to keep in mind that the outcome of such a paradigm heavily depends on how it’s implemented, not to mention that some may suffer from its implementation. See for instance https://www.theverge.com/2019/2/25/18229714/cognizant-facebook-content-moderator-interviews-trauma-working-conditions-arizona.

*Correlation vs. Causation* & *systematic generalization*

Often we see people who claim this is *not* a problem of technology. Such a claim often arises from a lack of understanding of the fundamental goal of AI/ML. In particular, some equate the goal of AI/ML with estimating sufficient statistics from given data, which is simply not true.

In general, the goal of AI/ML is inductive inference, and according to Vapnik (https://www.wiley.com/en-us/Statistical+Learning+Theory-p-9780471030034), it’s “an informal act [with] technical assistance from statisticians” (paraphrase). More recently, Arjovsky et al. (https://arxiv.org/abs/1907.02893) explicitly stated that “minimizing training error leads machines into recklessly absorbing all the correlations found in training data” and this makes “machine learning [fail] to fulfill the promises of artificial intelligence.” In short, the goal of AI is to identify an underlying mechanism (often but not always causal) that is independent of (or invariant to) changing environments and to successfully generalize to a new environment, which is often referred to as out-of-domain (or systematic) generalization.

Sadly, most of the existing (widely used) ML algorithms fall short in this aspect. See the first part of my recent talk for an example: https://drive.google.com/file/d/1CrkxcaQs5sD8K2HL2AWCMnrMRpFoquij/view. In order to overcome this inability, new paradigms have been proposed, such as meta-learning and invariant risk minimization, and there is an on-going effort in marrying causal inference from observational data with machine learning. See e.g. https://arxiv.org/abs/1911.10500, https://arxiv.org/abs/1901.10912 and https://arxiv.org/abs/1805.06826.

If you still insist that it is not an issue of the algorithm, which has faithfully captured correlations that exist in data, I suggest you think once more about what AI/ML is and what its goal is.

]]>TL;DR: after all, isn’t $k$-NN all we do?

in my course, i use $k$-NN as a bridge between a linear softmax classifier and a deep neural net via an adaptive radial basis function network. until this year, i’ve been considering the special case of $k=1$, i.e., 1-NN, only and from there on moved to the adaptive radial basis function network. i decided however to show them how $k$-NN with $k > 1$ could be implemented as a sequence of computational layers this year, hoping that this would facilitate students understanding the spectrum spanning between linear softmax classification and deep learning.

we are given $D=\left\{ (x_1, y_1), \ldots, (x_N, y_N) \right\}$, where $x_n \in \mathbb{R}^d$ and $y_n$ is an associated label represented as a one-hot vector. let us construct a layer that computes the nearest neighbour of a new input $x$. this can be implemented by first computing the activation of each training instance:

\begin{align*}
h^1_n =
\frac{\exp(-\beta | x_n - x |^2)}
{\sum_{n'=1}^N \exp(-\beta | x_{n'} - x |^2)}.
\end{align*}

in the limit of $\beta \to \infty$, we notice that this activation saturates to either $0$ or $1$:

\begin{align*}
h^1_n {\to}_{\beta \to \infty}
\begin{cases}
1, &\text{if $x_n$ is the nearest neighbour of $x$} \\
0, &\text{otherwise}
\end{cases}
\end{align*}

the output from this 1-NN is then computed as

\begin{align*}
\hat{y}^1 = \sum_{n=1}^N h^1_n y_n = Y^\top h^1,
\end{align*}

where $h^1$ is a vector stacking $h^1_n$’s and

\begin{align*}
Y=\left[
\begin{array}{c}
y_1^\top \\
\vdots \\
y_N^\top
\end{array}
\right].
\end{align*}
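
as a sanity check, here is a minimal sketch of this 1-NN layer in plain Python. the toy data, the helper names and the particular values of $\beta$ are all my own assumptions for illustration. it computes $h^1$ as a softmax over negative scaled squared distances and then the readout $Y^\top h^1$.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sq_dist(a, b):
    """Squared Euclidean distance |a - b|^2."""
    return sum((a_i - b_i) ** 2 for a_i, b_i in zip(a, b))


def first_layer(X, x, beta):
    """h^1_n: softmax over training instances of -beta * |x_n - x|^2."""
    return softmax([-beta * sq_dist(x_n, x) for x_n in X])


def readout(Y, h):
    """Y^T h: sum of the one-hot labels weighted by the activations."""
    n_classes = len(Y[0])
    return [sum(h_n * y_n[c] for h_n, y_n in zip(h, Y)) for c in range(n_classes)]


# Toy data (made up for illustration): four 2-d points, two classes.
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
Y = [[1, 0], [0, 1], [0, 1], [1, 0]]
x = [0.1, 0.0]

h_soft = first_layer(X, x, beta=1.0)   # mass spread over several neighbours
h_hard = first_layer(X, x, beta=50.0)  # close to one-hot on the nearest one
y_hat = readout(Y, h_hard)             # close to the label of X[0]
```

with a small $\beta$ the activation is shared among nearby training instances, and as $\beta$ grows it concentrates on the nearest one, as in the limit above.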

this was relatively straightforward with 1-NN. how do we extend it to 2-NN? to do so, we define a new computational layer that computes the following activation for each training instance:

\begin{align*}
h^2_n =
\frac{\exp(-\beta (| x_n - x |^2 + \gamma h^1_n))}
{\sum_{n'=1}^N \exp(-\beta (| x_{n'} - x |^2 + \gamma h^1_{n'}))}.
\end{align*}

now we consider the limit of both $\beta\to \infty$ and $\gamma \to \infty$, at which this new activation also saturates to either 0 or 1:

\begin{align*}
h^2_n \to_{\beta, \gamma \to \infty}
\begin{cases}
1, &\text{if $x_n$ is the second nearest neighbour of $x$} \\
0, &\text{otherwise}
\end{cases}
\end{align*}

this magical property comes from the fact that $\gamma h_n^1$ effectively kills the *first* nearest neighbour’s activation when $\gamma \to \infty$. this term does not affect any non-nearest neighbour instances, because $h_n^1=0$ for those instances.

the output from this 2-NN is then

\begin{align*}
\hat{y}^2 = \frac{1}{2} \sum_{k=1}^2 \sum_{n=1}^N h^k_n y_n.
\end{align*}
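
to see the first nearest neighbour actually being “killed”, here is a small Python sketch of the second layer. as before, the toy data, the function names and the finite values $\beta=50$ and $\gamma=100$ are assumptions of mine standing in for the infinite limits.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sq_dist(a, b):
    """Squared Euclidean distance |a - b|^2."""
    return sum((a_i - b_i) ** 2 for a_i, b_i in zip(a, b))


def layer(X, x, beta, gamma, penalty):
    """Softmax of -beta * (|x_n - x|^2 + gamma * penalty_n) over n."""
    return softmax([
        -beta * (sq_dist(x_n, x) + gamma * p_n)
        for x_n, p_n in zip(X, penalty)
    ])


# Toy data (made up for illustration): four 2-d points, two classes.
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
Y = [[1, 0], [0, 1], [0, 1], [1, 0]]
x = [0.1, 0.0]
beta, gamma = 50.0, 100.0

# First layer: no penalty, so it peaks at the nearest neighbour X[0].
h1 = layer(X, x, beta, gamma, penalty=[0.0] * len(X))

# Second layer: gamma * h1 pushes X[0] far away, so X[1] wins instead.
h2 = layer(X, x, beta, gamma, penalty=h1)

# 2-NN output: average of the two layers' readouts.
y_hat_2 = [
    0.5 * sum((h1_n + h2_n) * y_n[c] for h1_n, h2_n, y_n in zip(h1, h2, Y))
    for c in range(len(Y[0]))
]
```

in this toy case the two nearest neighbours carry different labels, so the 2-NN output ends up split roughly evenly between the two classes.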

now you see what i’m getting at, right? let me generalize this to the $k$-th nearest neighbour:

\begin{align*}
h^k_n = \frac{
\exp(-\beta (| x_n - x |^2 + \gamma \sum_{k'=1}^{k-1} h^{k'}_n))
} {
\sum_{n'=1}^N \exp(-\beta (| x_{n'} - x |^2 + \gamma \sum_{k'=1}^{k-1} h^{k'}_{n'}))
},
\end{align*}

where we see some resemblance to residual connections (add the previous layers’ activations directly.)

In the limit of $\beta\to\infty$ and $\gamma \to \infty$,

\begin{align*}
h^k_n \to_{\beta, \gamma \to \infty}
\begin{cases}
1, &\text{if $x_n$ is the $k$-th nearest neighbour of $x$} \\
0, &\text{otherwise}
\end{cases}
\end{align*}

the output from this $K$-NN is then

\begin{align*}
\hat{y}^K = \frac{1}{K} \sum_{k=1}^K \sum_{n=1}^N h_n^k y_n,
\end{align*}

which is reminiscent of so-called deeply supervised nets from a few years back.

it is not difficult to imagine not taking the infinite limits of $\beta$ and $\gamma$, which leads to soft $k$-NN.

In summary, soft $k$-NN consists of $k$ nonlinear layers. Each nonlinear layer consists of radial basis functions with training instances as bases (nonlinear activation), and further takes as input the sum of the previous layers’ activations (residual connection.) each layer’s activation is used to compute the softmax output (self-normalized) using the one-hot label vectors associated with the training instances, and we average the predictions from all the layers (deeply supervised).
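
putting the pieces together, here is a hedged Python sketch of the whole $K$-layer construction, compared against a brute-force $k$-NN. the function names (`soft_knn`, `hard_knn`), the toy data and the finite $\beta$, $\gamma$ are my own choices for illustration, and the layers here use the training instances themselves as fixed bases.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sq_dist(a, b):
    """Squared Euclidean distance |a - b|^2."""
    return sum((a_i - b_i) ** 2 for a_i, b_i in zip(a, b))


def soft_knn(X, Y, x, K, beta, gamma):
    """Stack K layers; layer k is penalized by the sum of earlier activations."""
    dists = [sq_dist(x_n, x) for x_n in X]
    penalty = [0.0] * len(X)      # running sum of h^{k'}_n over earlier layers
    y_hat = [0.0] * len(Y[0])
    for _ in range(K):
        h = softmax([-beta * (d + gamma * p) for d, p in zip(dists, penalty)])
        # Residual-like connection: add this layer's activation to the sum.
        penalty = [p + h_n for p, h_n in zip(penalty, h)]
        # Every layer contributes to the averaged prediction.
        for h_n, y_n in zip(h, Y):
            for c, y_nc in enumerate(y_n):
                y_hat[c] += h_n * y_nc / K
    return y_hat


def hard_knn(X, Y, x, K):
    """Brute-force k-NN: average the labels of the K closest instances."""
    nearest = sorted(range(len(X)), key=lambda n: sq_dist(X[n], x))[:K]
    return [sum(Y[n][c] for n in nearest) / K for c in range(len(Y[0]))]


# Toy data (made up for illustration): four 2-d points, two classes.
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
Y = [[1, 0], [0, 1], [0, 1], [1, 0]]
x = [0.1, 0.0]

sharp = soft_knn(X, Y, x, K=3, beta=50.0, gamma=100.0)  # near hard 3-NN
smooth = soft_knn(X, Y, x, K=3, beta=1.0, gamma=1.0)    # a softer variant
reference = hard_knn(X, Y, x, K=3)
```

with large $\beta$ and $\gamma$ the soft version should agree with the brute-force one on this toy data, while small values give a smoother prediction that still sums to one.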

of course, this perspective naturally leads us to think of a generalization in which we replace training instances with learnable bases across all $k$ layers and learn them using backpropagation. this is what we call *deep learning*.

[NOTE: I became aware that an extremely similar approach (however with some differences in how 1-NN is generalized to k-NN) was proposed in 2018 by Plötz and Roth at NeurIPS’18: https://papers.nips.cc/paper/7386-neural-nearest-neighbors-networks]

]]>