Let \(P_n\) denote the set of all real polynomials of degree \(\leq n\). For any function \(f : [0, 1] \to \mathbb{R}\), we define the \(\ell_\infty\)-norm as
$$\|f\|_\infty = \max\{|f(x)|: x \in [0, 1]\}$$
The Weierstrass approximation theorem says: for any continuous real function \(f\) on \([0, 1]\), that
$$\lim_{n\to\infty} \min_{p \in P_n} \|p − f\|_\infty = 0$$
This means that any function can be approximated arbitrarily well by an infinite-degree polynomial. In 1912 Bernstein gave an elegant probabilistic proof of this theorem. In this post, I'm going to prove a variant of the approximation theorem in which we have a Lipschitz condition on \(f\): that there is a \(C < \infty\) such that
$$\sup_{x\neq y \in [0,1]} \frac{|f(x) - f(y)|}{|x − y|} \leq C$$
The goal of this post is to show that \(\|f - b_f\|_\infty \leq O(C/\sqrt{n})\). This means that I'm going to bound the distance of the furthest deviation from the approximation by a term that tends to \(0\) (by the inverse square root of \(n\)) as \(n\to\infty\). I'm also going to define an explicit polynomial with which we can create such an approximation: the Bernstein polynomial.
The Bernstein polynomials of degree \(n\) are defined for \(0 \leq k \leq n\) as:
$$b_{n,k}(x) = {n\choose k}x^k(1−x)^{n−k}$$
Also, the Bernstein approximation of a function \(f\) is given by
$$b_f(x)=\sum_{k=1}^n f(k/n) b_{n,k}(x)$$
We then show that \(\|f-b_f\|_\infty \leq O(C/\sqrt{n})\), where \(f\) is a \(C\)-Lipschitz function such that \(\sup_{x\neq y \in [0,1]} |f(x) − f(y)|/|x − y| \leq C\), where \(C < \infty\). For \(x\in[0,1]\), let \(\{K_i\}_{i=1}^n\) be independently and identically distributed Bernoulli random variables with expectation \(x\), and let \(K = \sum_{i=1}^n K_i\). Then, \(\Pr(K_i = 1) = x\) and \(\Pr(K_i = 0) = 1−x\).
We observe that \(b_{n,i}(x) = {n\choose i}x^i (n-x)^{n-i} = \Pr[K = i]\) as \(K \sim \mathrm{Bin}(n,x)\) since \(K = \sum_{i=1}^n K_i\), where \(K_i \sim \mathrm{Bernoulli}(x)\). Then, \(b_f(x) = \mathbb{E}[f(K/n)]\) since \(b_f(x) = \sum_{i=1}^n f(i/n) b_{n,i}(x) = \sum_{i=1}^n f(i/n) \Pr[K = i] = \mathbb{E}[f(K/n)]\).
\(\DeclareMathOperator*{\EE}{\mathbb{E}}\)
Note that for all \(x\in[0,1]\), if \(|f(x) -b_f(x)| \leq \beta\), then \(\|f-b_f\|_\infty \leq \beta\) as well. So, it suffices to show that for all \(x\in[0,1]\), we have that \(|f(x) - b_f(x)|\leq O(C/\sqrt{n})\). That is what we aim to first show. First, since \(b_f(x) = \EE[f(K/n)]\) from earlier, we must have that:
\begin{align*}
|f(x)-b_f(x)| &= |f(x) - \EE[f(K/n)]| = |\EE[f(x) - f(K/n)]|
\end{align*}
The last equality follows by linearity of expectations: \(\EE[\alpha X + \beta Y] = \alpha \EE[X] + \beta\EE[Y]\) for constants \(\alpha,\beta\) and random variables \(X,Y\). Then, to simplify this further, we note that the absolute value operator is a convex function since \(|\alpha X + \beta Y| \leq \alpha |X| + \beta|Y|\) for positive \(\alpha,\beta\). Therefore, Jensen's inequality must hold for it. Jensen's inequality states that for a convex function \(\zeta\), we must have that \(\zeta(\EE[X]) \leq \EE[\zeta(X)]\). Therefore, setting \(\zeta = |\cdot|\), we get that:
\begin{align*}
|\EE[f(x) - f(K/n)]|&\leq \EE[|f(x) - f(K/n)|]
\end{align*}
Since \(f\) is \(C\)-Lipschitz continuous, we have that \(|f(x) - f(K/n)|/|x-K/n| \leq C \implies |f(x) - f(K/n)|\leq C|x-K/n|\). So,
\begin{align*}
|f(x)-b_f(x)| &\leq \EE[|f(x)-f(K/n)|] \leq \EE[C|x-K/n|] \\
&= C\EE[|x-K/n|]
\end{align*}
In the last line, we used the linearity of expectations. To compute the resulting expectation, we then define a family of events \(\{D_i\}_{i=1}^n\):
$$D_i = \bigg[\frac{i}{\sqrt{n}}<\bigg|\frac{K}{n}-x\bigg|\leq \frac{i+1}{\sqrt{n}}\bigg]$$
So, we then use the law of total expectation, where we condition on \(D_i\), which gives us:
\begin{align*}
|f(x) - b_f(x)| \leq C\EE[|x-K/n|] = C\cdot \sum_{i=0}^n \EE[|x-K/n| \text{ $|$ }D_i]\Pr(D_i)
\end{align*}
Note that since \(D_i = [\frac{i}{\sqrt{n}}<|\frac{K}{n}-x|\leq \frac{i+1}{\sqrt{n}}]\), we can write that \(\EE[|x-K/n|\text{ | }D_i]\geq \frac{i}{\sqrt{n}}\). Additionally, we upper bound the two-tailed probability scenario by only considering the limit of the lower-tail:
\begin{align*}
\Pr(D_i) &= \Pr\bigg(\frac{i}{\sqrt{n}}<\bigg|\frac{K}{n}-x\bigg|\leq\frac{i+1}{\sqrt{n}}\bigg) \leq \Pr\bigg(\frac{i}{\sqrt{n}}<\bigg|\frac{K}{n}-x\bigg|\bigg) \\
\end{align*}
Simplifying this term by considering each case of the absolute value yields that:
\begin{align*}
\Pr\bigg(\frac{i}{\sqrt{n}}<\bigg|\frac{K}{n}-x\bigg|\bigg) &= \Pr\bigg(\frac{i}{\sqrt{n}}<\frac{K}{n}-x\bigg) + \Pr\bigg(\frac{i}{\sqrt{n}} < x - \frac{K}{n}\bigg) \\
&= \Pr(i\sqrt{n}< K-x n) + \Pr(i\sqrt{n}< x n - K)
\end{align*}
Note that \(\EE[xn - K] = nx - \EE[K] = nx - nx = 0\), and \(\EE[K-xn]=\EE[K]-xn = xn - xn = 0\).
Before I continue, I'm quickly going to state the Chernoff bound: Let \(X_i\) for \(1\leq i\leq n\) be Bernoulli with \(\EE[X_i]=p_i\). Set \(Y_i = X_i - p_i\) so \(\EE[Y_i]=0\). Set \(Y = \sum Y_i\). Then, for \(\lambda>0\), we must have that \(\Pr[Y>\lambda\sqrt{n}]< e^{-2\lambda^2}\).
Therefore, the Chernoff bound tells us that \(\Pr(i\sqrt{n}< K-xn) < e^{-2i^2}\) and \(\Pr(i\sqrt{n}< xn-K) < e^{-2i^2}\). So, \(\Pr(D_i)< 2e^{-2i^2}\). Hence, we get that:
\begin{align*}
|f(x) - b_f(x)| &\leq C\cdot \sum_{i=0}^n \EE[|x-K/n| \text{ $|$ }D_i]\Pr(D_i) \leq C\sum_{i=0}^n \frac{2ie^{-2i^2}}{\sqrt{n}} = \frac{2C}{\sqrt{n}} \sum_{i=0}^n i e^{-2i^2}
\end{align*}
We now bound the resulting finite sum with its infinite sum: \(\sum_{i=0}^n ie^{-2i^2} \leq \sum_{i=0}^\infty ie^{-2i^2}\). Then, \(e^{2i^2}>i^3\) for large enough \(i\) since taking natural logs on both sides tells us that \(2i^2 > 3\ln i\). Thus, we have that \(\sum_{i=0}^n ie^{-2i^2}\leq \sum_{i=0}^\infty i/i^3 = \sum_{i=0}^\infty \frac{1}{i^2} = \frac{\pi^2}{6}\). Therefore:
\begin{align*}
|f(x) - b_f(x)| &\leq \frac{2C}{\sqrt{n}} \sum_{i=0}^n i e^{-2i^2} \\
&\leq \frac{2C}{\sqrt{n}} \cdot \frac{\pi^2}{6} = \frac{\pi^2}{3} \cdot \frac{C}{\sqrt{n}} \\
&= O(C/\sqrt{n})
\end{align*}
Hence, \(|f(x) - b_f(x)| \leq O(C/\sqrt{n})\). But, from our earlier assertion, this then implies that
\begin{align*}
\boxed{\|f - b_f\|_\infty \leq O(C/\sqrt{n})}
\end{align*}
This proves the claim. So, all \(C\)-Lipschitz functions can be approximated arbitrarily well by a linear combination of Bernstein polynomials, where the furthest deviation errs by atmost \(O(C/\sqrt{n})\), where \(n\) is the degree of the Bernstein polynomial.