What Is a Generative Model and How to Learn It

Discriminative model

First, maybe we are more familiar with discriminative models, which we usually call classification models. Here are some examples of classification models:

| Input | Output (label) |
| --- | --- |
| Image | Whether it's an animal (1 or 0) |
| Email | Whether it's spam (1 or 0) |
| A sentence | The next word in the sentence (a word) |
| Voice recording | Text (word sequence) |

In machine learning, a discriminative model does not directly output the label, but rather the probability of the label, $p(y\vert x)$, where $x$ is the input data and $y$ is the label. In the case of spam detection, we can say an email is spam when $p(y=1\vert x) > 0.5$. We can also be more conservative and only flag an email as spam when $p(y=1\vert x) > 0.9$. So you see, we have a recall and precision trade-off.
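This trade-off is easy to see with a few lines of code. The spam scores below are hypothetical numbers, just for illustration:

```python
# Hypothetical spam probabilities p(y=1|x) produced by some model,
# with the corresponding ground-truth labels (1 = spam).
scores = [0.95, 0.80, 0.55, 0.30, 0.10]
labels = [1, 1, 1, 0, 0]

def classify(scores, threshold):
    # Flag an email as spam when its score exceeds the threshold.
    return [1 if s > threshold else 0 for s in scores]

# A permissive threshold catches every spam email (high recall) ...
print(classify(scores, 0.5))  # -> [1, 1, 1, 0, 0]
# ... while a conservative one flags fewer emails (higher precision,
# but the spam emails scored 0.80 and 0.55 slip through).
print(classify(scores, 0.9))  # -> [1, 0, 0, 0, 0]
```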

The training objective given a datapoint $(x_d,y_d)$ is that we want to maximize the probability of $y_d$ when we observe $x_d$, which here we call the likelihood. It can be written as

$$\max_\theta \log p(y=y_d \vert x=x_d; \theta)$$

Here, we use the logarithm because, given multiple datapoints, we can sum the log-likelihoods instead of taking a product. $\theta$ denotes the parameters of our model; it may be the parameters of a neural network.
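As a quick numeric check, the sum of log-likelihoods equals the logarithm of the product of likelihoods; the per-datapoint likelihood values below are made up:

```python
import math

# Hypothetical per-datapoint likelihoods p(y_d | x_d; theta).
likelihoods = [0.9, 0.8, 0.95]

# Multiplying many probabilities quickly underflows toward zero,
# so we sum the logarithms instead; both give the same answer.
product = math.prod(likelihoods)
log_sum = sum(math.log(p) for p in likelihoods)
assert abs(log_sum - math.log(product)) < 1e-12
```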

Generative model

Now that we know what a discriminative model is and how to train it, we can move on to the generative model. In a generative model, instead of the output label, we want to know the probability of the input data, $p(x)$. Does this feel awkward? Consider the email spam detection example again: suppose we know a distribution that generates spam emails, $p_{\mathrm{spam}}(x;\theta)$. Please note that $\theta$ is still the parameters of this probability model. Then, given a new email $x_d$, we can just plug the data into $p_{\mathrm{spam}}(x;\theta)$ and flag the mail as spam when $p_{\mathrm{spam}}(x=x_d;\theta)$ exceeds a chosen threshold.

So, because the probability distribution $p(x)$ tries to capture the generation process of $x$, we call it a generative model. Then, similar to the discriminative case, we train the generative model by maximizing the log-likelihood:

$$\max_\theta \log p(x=x_d; \theta)$$

Wait, there is a problem: the model now only has the output $x$, and no input is provided. For example, if $x$ is an image, then we are basically predicting all the pixels in the image. So are we going to create a model that maps nothing to $x$? Hmm, we have no idea how to compute such a model and its log-likelihood.

But the good thing is that we know how to compute a lower bound of the log-likelihood by introducing another variable $z$. We call it a latent variable, and the probability of $x$ depends on $z$, as illustrated in the diagram.

When we translate this assumption into an equation, it becomes

$$p(x, z) = p(x \vert z)\, p(z)$$

Simple, right? Then, let's do some surgery on the log-likelihood. We first marginalize w.r.t. $z$:

$$\log p(x) = \log \int p(x, z)\, dz = \log \int p(x \vert z)\, p(z)\, dz$$

This is easy to understand. Let's say our mail has the title “Amazing discount”, and the latent variable has only two cases: $z=\text{“spam”}$ and $z=\text{“not spam”}$. So the probability of this mail is just

$$p(x) = p(x \vert z=\text{“spam”})\, p(z=\text{“spam”}) + p(x \vert z=\text{“not spam”})\, p(z=\text{“not spam”})$$
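With some made-up numbers for this two-case example, the marginalization is a one-liner:

```python
# Hypothetical prior over the latent variable, and likelihoods of
# seeing the title "Amazing discount" under each case.
p_z = {"spam": 0.4, "not spam": 0.6}
p_x_given_z = {"spam": 0.7, "not spam": 0.05}

# Marginalize the latent variable: p(x) = sum_z p(x|z) p(z)
p_x = sum(p_x_given_z[z] * p_z[z] for z in p_z)
print(p_x)  # 0.7 * 0.4 + 0.05 * 0.6 = 0.31
```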

Next, we introduce a distribution $q(z \vert x)$, which makes the equation

$$\log p(x) = \log \int q(z \vert x)\, \frac{p(x, z)}{q(z \vert x)}\, dz$$

The template $\int q(z\vert x) \ldots dz$ is actually an expectation $\mathbb{E}_{z \sim q(z\vert x)}[\ldots]$. So the equation can be written as

$$\log p(x) = \log \mathbb{E}_{z \sim q(z \vert x)}\left[ \frac{p(x, z)}{q(z \vert x)} \right]$$
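This expectation can be estimated by plain Monte Carlo sampling. Here is a sketch for a toy model where every distribution is a one-dimensional Gaussian chosen so that the exact evidence is known; all of these distributions are assumptions for illustration only:

```python
import math
import random

def normal_pdf(v, mu, var):
    return math.exp(-0.5 * (v - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def log_px_estimate(x, n=200000, seed=0):
    # Estimate log p(x) = log E_{z~q(z|x)}[ p(x,z) / q(z|x) ] for a toy
    # model with p(z) = N(0,1), p(x|z) = N(z,1), and q(z|x) = N(x/2, 1).
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        z = rng.gauss(x / 2.0, 1.0)                              # z ~ q(z|x)
        p_xz = normal_pdf(x, z, 1.0) * normal_pdf(z, 0.0, 1.0)   # p(x, z)
        total += p_xz / normal_pdf(z, x / 2.0, 1.0)              # / q(z|x)
    return math.log(total / n)

# For this toy model the exact evidence is p(x) = N(x; 0, 2),
# so the two numbers printed below should be close.
x = 1.3
print(log_px_estimate(x), math.log(normal_pdf(x, 0.0, 2.0)))
```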

Hmm, well, we still don't know how to compute this equation. Wait, can we use Jensen's inequality here? Remember that Jensen's inequality tells us

$$\log \mathbb{E}[f(z)] \geq \mathbb{E}[\log f(z)]$$

The reason is that the logarithm is a concave function. And now we have found a lower bound of the log-likelihood:

$$\log p(x) \geq \mathbb{E}_{z \sim q(z \vert x)}\left[ \log \frac{p(x, z)}{q(z \vert x)} \right]$$
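We can check Jensen's inequality numerically: for any positive function and any batch of samples, the log of the mean is at least the mean of the log.

```python
import math
import random

random.seed(0)

# Draw samples and apply an arbitrary positive function f.
samples = [random.uniform(0.5, 2.0) for _ in range(100000)]
f = lambda z: z * z + 1.0

# Jensen: log E[f(z)] >= E[log f(z)], because log is concave.
log_of_mean = math.log(sum(f(z) for z in samples) / len(samples))
mean_of_log = sum(math.log(f(z)) for z in samples) / len(samples)
assert log_of_mean >= mean_of_log
```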

Remember our assumption $p(x,z) = p(x\vert z)p(z)$; we just plug it into the equation to get

$$\log p(x) \geq \mathbb{E}_{z \sim q(z \vert x)}\left[ \log \frac{p(x \vert z)\, p(z)}{q(z \vert x)} \right]$$

Oh, multiplication and division inside the logarithm; let's decompose them:

$$\log p(x) \geq \mathbb{E}_{z \sim q(z \vert x)}[\log p(x \vert z)] + \mathbb{E}_{z \sim q(z \vert x)}[\log p(z)] - \mathbb{E}_{z \sim q(z \vert x)}[\log q(z \vert x)]$$

OMG, we just found that the two probability distributions $q(z\vert x)$ and $p(x\vert z)$ map something to something, just like our discriminative model, and we certainly know how to create such models. $p(z)$ is just the prior distribution of the latent variable; let's set it to a standard Gaussian, $p(z) = N(0,1)$.

Cool. Let's further clean up the equation:

$$\log p(x) \geq \mathbb{E}_{z \sim q(z \vert x)}[\log p(x \vert z)] - \mathbb{E}_{z \sim q(z \vert x)}\left[ \log \frac{q(z \vert x)}{p(z)} \right]$$

The second half is what we call the Kullback–Leibler divergence, or KL divergence, or KL for short. It is also called relative entropy. Therefore, we usually write the lower-bound equation as

$$\log p(x) \geq \mathbb{E}_{z \sim q(z \vert x)}[\log p(x \vert z)] - \mathrm{KL}(q(z \vert x) \,\Vert\, p(z))$$

Anyway, the good news is that when both $q(z\vert x)$ and $p(z)$ are Gaussians, the KL divergence can be solved analytically. If you are interested, take a look at this stackexchange post: https://stats.stackexchange.com/a/7449 .
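As a sketch, the closed form for $\mathrm{KL}(N(\mu,\sigma^2) \,\Vert\, N(0,1))$ can be cross-checked against a plain Monte Carlo estimate of $\mathbb{E}_{q}[\log q(z) - \log p(z)]$; the values of $\mu$ and $\sigma$ below are arbitrary:

```python
import math
import random

def kl_gauss_to_std_normal(mu, sigma):
    # Closed form for KL( N(mu, sigma^2) || N(0, 1) ).
    return 0.5 * (mu * mu + sigma * sigma - 1.0) - math.log(sigma)

def log_normal_pdf(z, mu, sigma):
    return -0.5 * math.log(2 * math.pi) - math.log(sigma) \
           - 0.5 * ((z - mu) / sigma) ** 2

random.seed(0)
mu, sigma = 1.0, 0.5
mc = sum(
    log_normal_pdf(z, mu, sigma) - log_normal_pdf(z, 0.0, 1.0)
    for z in (random.gauss(mu, sigma) for _ in range(200000))
) / 200000

# The two estimates should agree up to Monte Carlo noise.
print(kl_gauss_to_std_normal(mu, sigma), mc)
```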

So, to compute the lower bound, we just need to sample a $z$ from $q(z\vert x)$, then compute the left part $\mathbb{E}_{z \sim q(z\vert x)}[\log p(x\vert z)]$ and the right part, the KL divergence. Because this conclusion is so important, we give the equation a name: the evidence lower bound, or ELBO for short. Suppose the two conditional distributions are parameterized by $\theta$ and $\phi$; then the ELBO is defined as

$$\mathrm{ELBO}(\theta, \phi) = \mathbb{E}_{z \sim q(z \vert x; \phi)}[\log p(x \vert z; \theta)] - \mathrm{KL}(q(z \vert x; \phi) \,\Vert\, p(z))$$
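To make the definition concrete, here is a minimal sketch that estimates the ELBO for a toy one-dimensional Gaussian model where the exact evidence is known; every distribution here is an assumption chosen for illustration:

```python
import math
import random

# Toy latent-variable model (all one-dimensional Gaussians):
#   prior       p(z)     = N(0, 1)
#   likelihood  p(x | z) = N(z, 1)
# The exact evidence is p(x) = N(x; 0, 2), and the true posterior is
# q(z | x) = N(x/2, 1/2), for which the ELBO equals log p(x) exactly.

def log_normal_pdf(v, mu, var):
    return -0.5 * (math.log(2 * math.pi * var) + (v - mu) ** 2 / var)

def elbo_estimate(x, n=100000, seed=0):
    rng = random.Random(seed)
    mu_q, var_q = x / 2.0, 0.5
    # Reconstruction term E_q[log p(x|z)], estimated by sampling z ~ q.
    recon = sum(
        log_normal_pdf(x, rng.gauss(mu_q, math.sqrt(var_q)), 1.0)
        for _ in range(n)
    ) / n
    # KL( N(mu_q, var_q) || N(0, 1) ) in closed form.
    kl = 0.5 * (mu_q ** 2 + var_q - 1.0 - math.log(var_q))
    return recon - kl

x = 1.3
# The two numbers should match up to Monte Carlo noise.
print(elbo_estimate(x), log_normal_pdf(x, 0.0, 2.0))
```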

Instead of maximizing the log-likelihood, we maximize the lower bound:

$$\max_{\theta, \phi} \mathrm{ELBO}(\theta, \phi)$$

Understanding ELBO

The left part of the ELBO, $\mathbb{E}_{z \sim q(z\vert x;\phi)}[\log p(x\vert z;\theta)]$, basically asks: if we sample a $z$ from $q(z\vert x)$, can we really reconstruct the original $x$ with $p(x\vert z)$? So this is a reconstruction objective. Of course, when $q(z\vert x)$ is a complicated distribution, $z$ can carry more information from $x$, so the reconstruction objective can be higher.

But when $q(z\vert x)$ is complicated, it can't have a shape close to the prior. Remember, the prior $p(z)$ is just a standard Gaussian. So the right part, $\mathrm{KL}(q(z\vert x;\phi) \,\Vert\, p(z))$, will output a high value and punish the ELBO when $q$ is complicated.

Oh, I see. The left part and the right part are basically fighting each other. To summarize the two scenarios, check this figure:

In the next posts, we are going to discuss how to train generative models with neural networks and back-propagation. We will also discuss the situation where $z$ is a discrete latent variable.

Book Review | Zero to One: Notes on Startups, or How to Build the Future

Zero to One: Notes on Startups, or How to Build the Future

Authors: Peter Thiel, Blake Masters

Peter Thiel co-founded PayPal and served as its CEO, provided early funding for LinkedIn and Yelp, and is a partner at Founders Fund, which backed SpaceX and Airbnb.

TL;DR Book Review

Key Message:

To successfully run a startup, the founders have to consider seven factors: 1) a significant tech breakthrough, 2) timing, 3) monopoly in small markets, 4) the right management team, 5) effective distribution of products, 6) durability of the monopoly, and 7) uniqueness.

Main Points

Seven keys

  • Tech breakthrough
    • Customers don’t buy incremental improvements
    • Make sure you get 10X improvement
  • Timing
    • Choose the right time to join the competition
    • Last-mover advantage: being the last to productize the tech matters more than being the first to develop it
  • Monopoly
    • Gaining a monopoly in a small market with a big share is important for high profit
    • Facebook started as a social website at Harvard
  • Right management team
    • Choosing co-founders is like getting married
    • Balanced ownership, position and control
    • Question who owns the company, who runs it, who governs it
  • Distribution
    • Think how to sell to customers
  • Durability
    • How the business can grow and endure, generating long-term cash flow
    • Companies are valued by future cash flow
  • Uniqueness
    • Don’t be just another internet company, like those in the dot-com bubble

Thoughts on competition

  • Competition drives down the growth and profit space
  • So, better to build monopoly in a small market
    • With much better tech
    • With network effects
    • Then scale it
    • Then do some branding

Thoughts on lean startups

  • Lean startup is a methodology that takes you to a local optimum
  • However, the power law focuses on exponential growth
  • Plan the blueprint of growth. Companies sell shares and merge because they don’t know what to do.

How to recruit people

  • Make sure people commit full-time
  • Stock options are better than cash
  • Either on bus or off the bus
  • Great people join because of the mission and team (not cash and stocks)

How to build the board

  • A board of three is healthy for private companies

Thoughts on clean tech bubble

  • They failed to answer the seven key questions
  • Tesla wins as it has the monopoly

Install horovod for PyTorch to work with slurm on ABCI

Horovod is a great tool for multi-node, multi-GPU gradient synchronization. The official ABCI documentation gives a guide on running TensorFlow with multiple nodes, which can be found here: https://docs.abci.ai/en/apps/tensorflow/. However, making Horovod run with PyTorch across multiple nodes is not that straightforward.

First, it turns out that Anaconda and Miniconda are not well supported in this case. The PyTorch plugin requires a GCC version higher than 4.9. So in the end, maybe the only solution is to stick with the Python provided in the module list and use venv for package management.

Load modules and create a python environment:

module load cuda/10.0/10.0.130 cudnn/7.6/7.6.4 nccl/2.4/2.4.8-1 python/3.6/3.6.5 openmpi/2.1.6
python3 -m venv $HOME/base
source $HOME/base/bin/activate

Install PyTorch

pip install numpy
pip install torch

Install horovod with NCCL support and a higher version of g++

CC=/apps/gcc/7.3.0/bin/gcc CXX=/apps/gcc/7.3.0/bin/g++ HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_NCCL_HOME=/apps/nccl/2.4.8-1/cuda10.0/lib pip install horovod

To test whether Horovod works, let’s make a simple script named hvd_test.py.

import horovod.torch as hvd
import torch
hvd.init()
print(
	"size=", hvd.size(),
	"global_rank=", hvd.rank(),
	"local_rank=", hvd.local_rank(),
	"device=", torch.cuda.get_device_name(hvd.local_rank())
)

To launch the job, make a script hvd_job.sh like this:

#!/bin/bash

source /etc/profile.d/modules.sh
module load cuda/10.0/10.0.130 cudnn/7.6/7.6.4 nccl/2.4/2.4.8-1 python/3.6/3.6.5 openmpi/2.1.6
source $HOME/base/bin/activate

NUM_NODES=${NHOSTS}
NUM_GPUS_PER_NODE=4
NUM_GPUS_PER_SOCKET=$(expr ${NUM_GPUS_PER_NODE} / 2)
NUM_PROCS=$(expr ${NUM_NODES} \* ${NUM_GPUS_PER_NODE})

LD_LIBRARY_PATH=/apps/gcc/7.3.0/lib64:$LD_LIBRARY_PATH

mpirun -np $NUM_PROCS --map-by ppr:${NUM_GPUS_PER_SOCKET}:socket \
--mca mpi_warn_on_fork 0 -x PATH -x LD_LIBRARY_PATH \
python hvd_test.py > hvd_test.stdout

Launch the job with

qsub -g group_name -l rt_F=2 hvd_job.sh

After the job is executed, you can check the output by

cat hvd_test.stdout

Thoughts on PhD research

At the end of my 3-year PhD, I want to record some of my thoughts on research in natural language processing and machine learning. In particular, I want to help new students who have just started their voyage in the (red) ocean of machine learning: help them efficiently find their research goal, and figure out how to reach it.

In these three years, I have talked to many master’s and PhD students in this field, and I found they can generally be categorized into three groups:

  1. Interested in creating something cool with artificial intelligence
  2. Want to publish a good paper in a good conference
  3. Care more about how the PhD can benefit their careers

I believe that all these motivations are good, and the best students combine them all. However, sadly, some students I worry about a lot are extremely biased toward only one motivation. In the long run, such a biased motivation can make the PhD a really painful rather than joyful process.

Group 1. I just want to do something new and cool with AI These students are actually the best to work with, because they are highly motivated and passionate about learning new things. But one pitfall of this mindset is that you can get lost in your own fantasy. The academic world does not work like that. Given the exponentially growing researcher population, it’s highly unlikely that you are the first one to think of your idea. In the worst scenario, your idea may have been developed for multiple years by others. And you are going to find them anyway if you eventually want to publish your paper and compare your method with others. So the best thing to do is not to jump straight into experimenting with your own idea, but to do a survey, at least briefly, on how other people are solving the same or similar problems. In some cases, you may think you are tackling a brand-new problem, but that’s not true: many machine learning problems share the same intrinsic properties. For example, suppose you want to generate a future frame in a video clip and you can’t find a paper on this exact subject; it really is still a conditional image generation problem.

Another possible scenario is that you are setting a goal way too big for a PhD student. In this case, you may be stuck in the same problem for years without a publication. This is stressful and you may finally give up. So the best thing is to solve a small but crucial problem.

Group 2. My goal is to publish papers in good conferences All PhD students need to publish papers in order to graduate. However, it’s dangerous to chase the trend and attempt to publish a paper just by beating benchmark scores, which seemingly is a shortcut to publication. It is the same reason restaurants find it so difficult to survive: there are so many competitors. You will have a very limited amount of time to develop your approach, and you will be disappointed a lot if someone else publishes a paper with the same idea as yours. It’s better to find a sweet spot: propose an impactful model and evaluate it thoroughly rather than crunching the numbers. Competing on scores is also stressful anyway.

Group 3. Care more about how the PhD can benefit their careers Internships can drain your willpower, especially those not contributing to your PhD research. Try to set a time budget for the internship, then forget it completely when you are back at school. Don’t try to do the internship in the morning and research in the afternoon; it will not work. Some students care too much about their careers and finally lose the willpower to do their PhD research.

I hope these thoughts can help new students in this field, and I’m willing to write more based on my experience.