In this post I am going to talk about a statistical tool called importance sampling. In statistics, importance sampling is a technique for creating an unbiased estimate of an expectation. An estimator is unbiased when its mean is equal to the quantity being estimated, and constructing unbiased estimators is essential for a large number of purposes. Let me give an example. Suppose we want to calculate $E_{p}[f(x)]$ where sampling directly from $p$ is inefficient. The most common approach would be a Monte Carlo simulation. That is all very well, but consider a function $f(x)$ that takes non-zero but very large values only in a very small region. A Monte Carlo simulation will hardly ever sample values from that small region; just consider the worst case of a delta function and you will understand why plain Monte Carlo won't work. So what is the solution in this case? Well, one solution is definitely importance sampling, which means that we take another probability distribution $q(x)$, designed to give importance to the region where $f(x)p(x) \neq 0$, and calculate $E_{q}\big[\dfrac{f(x)p(x)}{q(x)}\big]$. This imposes a restriction on $q(x)$: it cannot be $0$ where $f(x)p(x) \neq 0$. Let us define $Q=\{x \mid q(x)>0\}$ and $D=\{x \mid p(x)>0\}$. By the restriction, $f(x)p(x)=0$ on $Q^{c}\cap D$, and $p(x)=0$ (hence $f(x)p(x)=0$) on $Q\cap D^{c}$. So we can write $E_{q}\big[\dfrac{f(x)p(x)}{q(x)}\big]=\int_{Q}\dfrac{f(x)p(x)}{q(x)}q(x)\,dx=\int_{D}f(x)p(x)\,dx-\int_{Q^{c}\cap D}f(x)p(x)\,dx+\int_{Q\cap D^{c}}f(x)p(x)\,dx$
$=\int_{D}f(x)p(x)\,dx=E_{p}[f(x)]=\mu$, which is what we want. So this gives an unbiased estimator of $\mu$. Note that we also need to be able to calculate the likelihood ratio $\dfrac{p(x)}{q(x)}$ at every point. The estimator itself is $$\hat{\mu}_{q}=\dfrac{1}{n}\sum_{i=1}^{n} \dfrac{f(X_{i})p(X_{i})}{q(X_{i})} \qquad X_{i} \sim q$$
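Here is a minimal sketch of this estimator in Python. The particular choices of $p$, $q$ and the "spiky" $f$ are illustrative assumptions on my part, not anything fixed by the theory:

```python
import numpy as np

rng = np.random.default_rng(0)

# Target: mu = E_p[f(X)] with p = N(0, 1) and a spiky f(x) = 1{x > 4},
# so mu = P(X > 4), roughly 3.17e-5.  Plain Monte Carlo almost never
# samples that far into the tail.
def f(x):
    return (x > 4).astype(float)

def p(x):  # standard normal density
    return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

def q(x):  # proposal N(4, 1), shifted to put its mass where f is non-zero
    return np.exp(-(x - 4)**2 / 2) / np.sqrt(2 * np.pi)

n = 100_000

# Plain Monte Carlo: sample from p directly and average f.
mc_estimate = f(rng.standard_normal(n)).mean()

# Importance sampling: sample X_i ~ q and average f(X_i) p(X_i) / q(X_i).
x = rng.normal(4.0, 1.0, n)
w = f(x) * p(x) / q(x)
is_estimate = w.mean()
half_width = 2.58 * w.std(ddof=1) / np.sqrt(n)  # 99% normal-approx interval

print(f"plain MC: {mc_estimate:.2e}")
print(f"importance sampling: {is_estimate:.2e} +/- {half_width:.1e}")
```

With these choices the plain Monte Carlo estimate is built from a handful of rare hits (often zero of them), while the importance sampling estimate is sharp because every draw from $q$ lands near the spike.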
But there must be a caveat, right? Of course there is, and it is the variance of the estimate. We have to choose a probability distribution $q$ such that the variance is minimized, i.e. we want
$$q^{*}=\arg\min_{q}\,\mathrm{Var}_{q}\Big(\dfrac{f(x)p(x)}{q(x)}\Big)$$
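To see how much the choice of $q$ matters, here is a small comparison of the estimator's variance under two proposals. The target, the indicator $f$, and the two candidate proposals are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)

# Target: mu = E_p[f(X)] with p = N(0, 1) and f(x) = 1{x > 3}.
def f(x):
    return (x > 3).astype(float)

def normal_pdf(x, mean, sd):
    return np.exp(-((x - mean) / sd)**2 / 2) / (sd * np.sqrt(2 * np.pi))

n = 100_000
results = {}
for mean in (0.0, 3.0):  # q = N(mean, 1); mean=3 sits on the spike of f
    x = rng.normal(mean, 1.0, n)
    w = f(x) * normal_pdf(x, 0.0, 1.0) / normal_pdf(x, mean, 1.0)
    results[mean] = (w.mean(), w.var())
    print(f"q = N({mean}, 1): estimate {w.mean():.2e}, variance {w.var():.2e}")
```

The proposal centred on the spike gives the same unbiased estimate with a variance a couple of orders of magnitude smaller, which is exactly the minimization above done by hand over a two-element family.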
We can form a 99% confidence interval as $\hat{\mu}_{q}\pm 2.58 \dfrac{\hat{\sigma}_{q}}{\sqrt{n}}$. We can write out the variance of the estimator as
$$\sigma_{q}^{2}=\int_{D}\dfrac{(f(x)p(x))^{2}}{q(x)}\,dx-\mu^{2}$$
We can easily see that we get zero variance by taking $q=\tfrac{fp}{\mu}$, provided $f,p,\mu>0$. But that is ironic, because the optimal $q$ requires the very $\mu$ we are trying to estimate. Even so, it helps us understand what a good $q$ must look like and what information we should have in order to make this work. From the expression, we can see that having $q$ in the denominator means a light tail of $q$ is extremely risky: we can even end up with a variance of $\infty$. So as a rule we must pick a distribution whose tail is at least as heavy as that of $p$. Also, we must have some educated guess about where the spikes of $f(x)$ lie, so that we can design the distribution $q(x)$ accordingly.

Lastly, I would say that this technique is used in a number of areas. The most recent one I am delving into is Multi-Armed Bandits, where in the adversarial case EXP3, the bandit counterpart of the Hedge algorithm, uses importance sampling to create unbiased estimates of the rewards.
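As a closing illustration, here is the importance-weighted reward estimate that EXP3 relies on, stripped down to a single round repeated many times. The arm probabilities and rewards below are made-up numbers, and this is only the estimation step, not the full algorithm:

```python
import numpy as np

rng = np.random.default_rng(1)

probs = np.array([0.6, 0.3, 0.1])    # current arm-sampling distribution
rewards = np.array([0.2, 0.5, 0.9])  # true (hidden) per-arm rewards

# Each round only the pulled arm's reward is observed.  The
# importance-weighted estimate  reward_i * 1{pulled i} / probs_i
# is unbiased for reward_i, even for rarely pulled arms.
n_rounds = 200_000
pulls = rng.choice(3, size=n_rounds, p=probs)
est = np.array([rewards[i] * (pulls == i).mean() / probs[i]
                for i in range(3)])

print(est)  # close to rewards = [0.2, 0.5, 0.9]
```

Note the variance pattern from the discussion above at work: the rarely pulled arm (probability 0.1) has the noisiest estimate, because its small sampling probability sits in the denominator.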
THANKS FOR READING