Contrastive Predictive Coding for Representation Learning

An exploration of Noise Contrastive Estimation (NCE) and how it enables Contrastive Predictive Coding (CPC) for learning useful representations from high-dimensional sequential data like speech.

What is Noise Contrastive Estimation (NCE)?

Let $x \in \mathbb{R}^n$ be a random vector with an unknown probability density $p_d(\cdot)$. We wish to model the unknown data density with a parameterised family of functions $\{p_m(\cdot;\theta)\}_{\theta}$, where $\theta$ is a vector of parameters. This assumes that $p_d(\cdot)$ is within our chosen parameterised family, i.e. $p_d(\cdot) = p_m(\cdot;\theta^\star)$ for some $\theta^\star$. The question we are concerned with is how to select $\hat{\theta}$ such that $p_m(\cdot;\hat{\theta}) \approx p_d(\cdot)$.

Since the density has to be normalised (integrate to 1), we can redefine the p.d.f. in a way that means this constraint is always satisfied:

$$p_m(\cdot;\theta) = \frac{p_m^0(\cdot;\alpha)}{Z(\alpha)}, \qquad Z(\alpha) = \int p_m^0(u;\alpha)\,du,$$

where $p_m^0(\cdot;\alpha)$ specifies the functional form of the density and does not need to integrate to 1.

The issue is that computing the normalising coefficient $Z(\alpha)$ is generally intractable. NCE is a method to optimise the unnormalised model $p_m^0(\cdot;\alpha)$ and the normalisation parameter by maximising the same objective. The normalisation constant is treated as another parameter $c$ of the model, i.e.

$$\ln p_m(\cdot;\theta) = \ln p_m^0(\cdot;\alpha) + c, \qquad \theta = (\alpha, c),$$

where $c$ plays the role of $-\ln Z(\alpha)$.

Let $X = (x_1, \ldots, x_T)$ be the observed data set consisting of $T$ observations of the data $x$. Let $Y = (y_1, \ldots, y_T)$ be an artificially generated data set of noise with distribution $p_n(\cdot)$. The objective function for $\theta$ is

$$J_T(\theta) = \frac{1}{2T} \sum_{t=1}^{T} \Big[ \ln h(x_t;\theta) + \ln\big(1 - h(y_t;\theta)\big) \Big],$$

where $h(u;\theta) = \sigma\big(G(u;\theta)\big)$, $\sigma(\cdot)$ is the logistic sigmoid, and $G(u;\theta) = \ln p_m(u;\theta) - \ln p_n(u)$. This can be seen as the log-likelihood of a logistic regression model which discriminates the data $X$ from the noise $Y$.
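
To make the objective concrete, here is a minimal PyTorch sketch of (the negative of) $J_T$ for a batch of data and noise samples. The callables `log_unnorm_model` and `log_noise_density` are placeholders for whichever unnormalised model $\ln p_m^0(\cdot;\alpha)$ and noise density $\ln p_n(\cdot)$ one chooses; they are my own naming, not from the paper.

```python
import torch
import torch.nn.functional as F

def nce_loss(x, y, log_unnorm_model, log_noise_density, c):
    """Negative NCE objective J_T for equally sized batches of data x and noise y.

    log_unnorm_model(u) : log p_m^0(u; alpha), the unnormalised model
    log_noise_density(u): log p_n(u), the (known) noise density
    c                   : learnable scalar standing in for -log Z(alpha)
    """
    # G(u; theta) = log p_m(u; theta) - log p_n(u), with log p_m = log p_m^0 + c
    g_x = log_unnorm_model(x) + c - log_noise_density(x)
    g_y = log_unnorm_model(y) + c - log_noise_density(y)
    # h(u; theta) = sigmoid(G); use log-sigmoid for numerical stability
    log_h_x = F.logsigmoid(g_x)       # log h(x_t; theta)
    log_1m_h_y = F.logsigmoid(-g_y)   # log(1 - h(y_t; theta))
    return -0.5 * (log_h_x.mean() + log_1m_h_y.mean())
```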

NCE Connection to Supervised Learning

Let $U = (u_1, \ldots, u_{2T})$ be the union of the two sets $X$ and $Y$. Each $u_t$ has a class label $C_t$, where $C_t = 1$ if $u_t \in X$ and $C_t = 0$ if $u_t \in Y$. For logistic regression, the posterior probabilities of the classes given the data are estimated. We don't know the p.d.f. of the data, therefore the class-conditional probability $p(u \mid C=1)$ is modelled with $p_m(u;\theta)$, while $p(u \mid C=0) = p_n(u)$. Then we have:

$$P(C=1 \mid u;\theta) = \frac{p_m(u;\theta)}{p_m(u;\theta) + p_n(u)} = h(u;\theta),$$

which is exactly the $h(u;\theta)$ appearing in the objective above.

Basic Idea: Estimate the parameters by learning to discriminate between data x and some artificially generated noise y.

Theorem 1: The objective attains its maximum at $\hat{\theta}$ such that $p_m(\cdot;\hat{\theta}) = p_d(\cdot)$. This optimum is global if the noise density $p_n(\cdot)$ is selected such that it is non-zero whenever $p_d(\cdot)$ is non-zero.

The key benefit of this method is that we can optimise the objective without ever computing the normalising coefficient $Z(\alpha)$.
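
As a toy illustration of this benefit (my own example, not from the paper), we can use the `nce_loss` sketch above to fit the mean of an unnormalised 1-D Gaussian, treating the log-normaliser $c$ as just another learnable parameter:

```python
import torch

# Unnormalised model: log p_m^0(u; mu) = -0.5 * (u - mu)^2, no normaliser
mu = torch.zeros(1, requires_grad=True)
c = torch.zeros(1, requires_grad=True)            # stands in for -log Z
opt = torch.optim.Adam([mu, c], lr=1e-2)

data = torch.randn(10_000) + 2.0                  # samples from p_d = N(2, 1)
noise = torch.distributions.Normal(0.0, 3.0)      # p_n, chosen to cover p_d

for _ in range(2_000):
    x = data[torch.randint(len(data), (256,))]    # mini-batch of data
    y = noise.sample((256,))                      # mini-batch of noise
    loss = nce_loss(
        x, y,
        log_unnorm_model=lambda u: -0.5 * (u - mu) ** 2,
        log_noise_density=noise.log_prob,
        c=c,
    )
    opt.zero_grad(); loss.backward(); opt.step()

# mu converges towards 2.0 and c towards -log(sqrt(2 * pi)), about -0.92
```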

Representation Learning with Contrastive Predictive Coding

The objective is to learn representations that encode information shared between different parts of a high-dimensional signal (e.g. speaker information) while discarding low-level information and noise that is more local to each distinct region.

What is Contrastive Predictive Coding?

An encoder $g_{\text{enc}}$ maps an input sequence of observations $x_t$ to a sequence of latent representations $z_t = g_{\text{enc}}(x_t)$, potentially with a lower temporal resolution.

An autoregressive model $g_{\text{ar}}$ summarises $z_{\leq t}$, all the latents up to time $t$, producing a context latent representation $c_t = g_{\text{ar}}(z_{\leq t})$. A figure of the CPC structure is given below (Figure 1 from the CPC paper [2]); a code sketch of both components follows the figure:

CPC Structure diagram showing encoder and autoregressive model
CPC Structure taken from [2]
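
For concreteness, here is a minimal PyTorch sketch of the two components for raw audio. The layer sizes roughly follow the speech setup described in [2] (five strided 1-D convolutions with a total downsampling factor of 160, followed by a GRU), but the exact configuration here is my own choice rather than the authors' code.

```python
import torch
import torch.nn as nn

class CPCEncoder(nn.Module):
    """g_enc: maps a raw waveform to latents z at a lower temporal resolution."""
    def __init__(self, z_dim=512):
        super().__init__()
        layers, in_ch = [], 1
        # (out_channels, kernel, stride): strides 5*4*2*2*2 give 160x downsampling
        for out_ch, k, s in [(512, 10, 5), (512, 8, 4), (512, 4, 2), (512, 4, 2), (z_dim, 4, 2)]:
            layers += [nn.Conv1d(in_ch, out_ch, kernel_size=k, stride=s, padding=k // 2), nn.ReLU()]
            in_ch = out_ch
        self.net = nn.Sequential(*layers)

    def forward(self, x):                       # x: (batch, 1, samples)
        return self.net(x).transpose(1, 2)      # z: (batch, steps, z_dim)


class CPCAutoregressive(nn.Module):
    """g_ar: summarises z_{<=t} into a context vector c_t."""
    def __init__(self, z_dim=512, c_dim=256):
        super().__init__()
        self.gru = nn.GRU(z_dim, c_dim, batch_first=True)

    def forward(self, z):                       # z: (batch, steps, z_dim)
        c, _ = self.gru(z)
        return c                                # c: (batch, steps, c_dim)
```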

We do not want to predict future observations $x_{t+k}$ directly with a generative model $p_k(x_{t+k} \mid c_t)$; this is a difficult and expensive task. Moreover, the objective is representation learning, and hence learning a full generative model is unnecessary. NCE provides a framework for solving this problem: instead we model a density ratio that preserves the mutual information between $x_{t+k}$ and $c_t$,

$$f_k(x_{t+k}, c_t) \propto \frac{p(x_{t+k} \mid c_t)}{p(x_{t+k})}.$$

Since $f_k$ is a density ratio, it does not need to be normalised. In this case, $f_k$ is implemented as a log-bilinear model:

$$f_k(x_{t+k}, c_t) = \exp\!\big(z_{t+k}^{\top} W_k\, c_t\big),$$

where a separate linear transformation $W_k$ is used for each prediction step $k$.

I note that I found the order in which this was presented a little confusing. In practice, the above formulation for $f_k$ is chosen first, and then the InfoNCE objective described below attains its optimum when $f_k(x_{t+k}, c_t)$ is proportional to the density ratio $p(x_{t+k} \mid c_t)/p(x_{t+k})$.

Using a density ratio $f_k$ and inferring $z_{t+k}$ with an encoder means we do not have to model the high-dimensional distribution of $x_{t+k}$. We cannot evaluate $p(x_{t+k})$ or $p(x_{t+k} \mid c_t)$ directly; however, samples can be drawn from these distributions, allowing the use of NCE-based objectives.
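
Because $\log f_k$ is just a bilinear form, each prediction step can be implemented as a plain linear layer applied to $c_t$. The sketch below (my own PyTorch rendering, with dimensions matching the encoder sketch above and 12 prediction steps as in the speech experiments of [2]) makes this explicit:

```python
import torch
import torch.nn as nn

class CPCPredictors(nn.Module):
    """One linear map W_k per prediction step; log f_k(x_{t+k}, c_t) = z_{t+k}^T W_k c_t."""
    def __init__(self, z_dim=512, c_dim=256, n_steps=12):
        super().__init__()
        self.W = nn.ModuleList(nn.Linear(c_dim, z_dim, bias=False) for _ in range(n_steps))

    def log_f(self, z_future, c_t, k):
        # z_future: (batch, z_dim) holds z_{t+k}; c_t: (batch, c_dim)
        pred = self.W[k](c_t)                   # W_k c_t: (batch, z_dim)
        return (z_future * pred).sum(dim=-1)    # z_{t+k}^T W_k c_t: (batch,)
```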

Either $z_t$ or $c_t$ could be used for downstream tasks: $c_t$ if context from the past is useful, and $z_t$ if no such context is needed. Lastly, if the downstream task requires a representation of the whole sequence, one can pool the representations from either $z_t$ or $c_t$ over all locations.

What is InfoNCE?

Given a set $X = \{x_1, \ldots, x_N\}$ of $N$ random samples containing one positive sample from $p(x_{t+k} \mid c_t)$ and $N-1$ negative samples from the ‘proposal’ distribution $p(x_{t+k})$, we optimise:

$$\mathcal{L}_N = -\mathbb{E}_X \left[ \log \frac{f_k(x_{t+k}, c_t)}{\sum_{x_j \in X} f_k(x_j, c_t)} \right].$$

In other words, this is the categorical cross-entropy of classifying the positive sample correctly, with $\frac{f_k(x_{t+k}, c_t)}{\sum_{x_j \in X} f_k(x_j, c_t)}$ being the prediction of the model.

A distinction between NCE and InfoNCE: In NCE, we consider binary classification between the true distribution and noise distribution. Here, we are optimising a multiclass classification problem.
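
In code, a common way to realise this multiclass objective (a sketch under my own assumptions, not the authors' implementation) is to use the other sequences in a mini-batch as the $N-1$ negatives, compute all pairwise scores, and apply a softmax cross-entropy with the matching pairs on the diagonal as the targets:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z_future, c_t, W_k):
    """InfoNCE loss for one prediction step k.

    z_future: (N, z_dim)  encoder outputs z_{t+k}, one per sequence in the batch
    c_t:      (N, c_dim)  context vectors at time t
    W_k:      a step-k predictor, e.g. nn.Linear(c_dim, z_dim, bias=False)

    For each context c_t[i], its own z_future[i] is the positive sample and the
    other N-1 entries of z_future act as negatives from the proposal distribution.
    """
    pred = W_k(c_t)                               # (N, z_dim), rows are W_k c_t
    logits = pred @ z_future.t()                  # (N, N), logits[i, j] = z_j^T W_k c_i
    targets = torch.arange(len(logits), device=logits.device)
    return F.cross_entropy(logits, targets)       # -log softmax over the positives
```

The cross-entropy over each row is exactly $-\log \frac{f_k(x_{t+k}, c_t)}{\sum_{x_j} f_k(x_j, c_t)}$, averaged over the batch.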

Implementation

As part of my research into CPC, I thought it would be a good idea to implement the proposed algorithm, since no sample implementation is provided with the paper and I was not able to find many online implementations. I was interested in a speech application and decided on learning a representation of speech signals from the “SpeechMNIST” dataset (people speaking the digits zero through nine, from the Google Speech Commands Dataset). The implementation can be found as a GitHub Gist.
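
To give a flavour of how the pieces sketched throughout this post fit together, here is a condensed (and heavily assumption-laden) single training step on a batch of raw waveforms. It reuses the `CPCEncoder`, `CPCAutoregressive`, `CPCPredictors`, and `info_nce_loss` sketches above and is not the Gist itself:

```python
import torch

# Hypothetical configuration, reusing the sketches defined earlier in this post
encoder = CPCEncoder(z_dim=512)
ar_model = CPCAutoregressive(z_dim=512, c_dim=256)
predictors = CPCPredictors(z_dim=512, c_dim=256, n_steps=12)
params = list(encoder.parameters()) + list(ar_model.parameters()) + list(predictors.parameters())
opt = torch.optim.Adam(params, lr=2e-4)

def training_step(waveforms, t=64, n_steps=12):
    # waveforms: (batch, 1, samples) of raw audio, e.g. one-second spoken digits,
    # which after 160x downsampling gives roughly 100 latent steps at 16 kHz
    z = encoder(waveforms)              # (batch, steps, 512)
    c = ar_model(z)                     # (batch, steps, 256)
    loss = 0.0
    for k in range(n_steps):
        # positive pair: (z_{t+k+1}, c_t); negatives come from the rest of the batch
        loss = loss + info_nce_loss(z[:, t + k + 1], c[:, t], predictors.W[k])
    loss = loss / n_steps
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```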

Disclaimer: I am not 100% confident this implementation is entirely correct.

References

  1. Gutmann, M. and Hyvärinen, A., 2010. “Noise-contrastive estimation: A new estimation principle for unnormalized statistical models.” In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 297-304.

  2. van den Oord, A., Li, Y. and Vinyals, O., 2018. “Representation learning with contrastive predictive coding.” arXiv preprint arXiv:1807.03748.