Separating Gaussian data

[math] \newcommand{\smallfrac}[2]{\frac{#1}{#2}} \newcommand{\medfrac}[2]{\frac{#1}{#2}} \newcommand{\textfrac}[2]{\frac{#1}{#2}} \newcommand{\tr}{\operatorname{tr}} \newcommand{\e}{\operatorname{e}} \newcommand{\B}{\operatorname{B}} \newcommand{\Bbar}{\overline{\operatorname{B}}} \newcommand{\pr}{\operatorname{pr}} \newcommand{\dd}{\operatorname{d}\hspace{-1pt}} \newcommand{\E}{\operatorname{E}} \newcommand{\V}{\operatorname{V}} \newcommand{\Cov}{\operatorname{Cov}} \newcommand{\Bigsum}[2]{\mathop{\textstyle\sum}_{#1}^{#2}} \newcommand{\ran}{\operatorname{ran}} \newcommand{\card}{\#} \renewcommand{\P}{\operatorname{P}} \renewcommand{\L}{\operatorname{L}} \newcommand{\mathds}{\mathbb}[/math]

After having studied in previous chapters the properties of high-dimensional data that comes from a Gaussian distribution, we now want to use this knowledge to understand under which conditions it is possible to separate (or ‘disentangle’) the data that stems from two ‘Gaussians’ with variance one and different means. Let us first look at low dimensions, for example at the 2-dimensional data set in Figure, which we generated by drawing samples of equal size from two Gaussians with different centers.

2-dimensional dataset arising from two Gaussian distributions.
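A data set as in Figure can be generated, for instance, as in the following minimal sketch (assuming numpy; the concrete centers, the sample size per Gaussian and the seed are our own illustrative choices and are not taken from the figure).

```python
import numpy as np

rng = np.random.default_rng(0)            # seed fixed only for reproducibility (assumption)
mu_1 = np.array([0.0, 0.0])               # hypothetical centers; the text only requires
mu_2 = np.array([3.0, 3.0])               # that the two means differ
n = 100                                   # sample size per Gaussian (assumption)

sample_1 = rng.normal(size=(n, 2)) + mu_1   # sample of N(mu_1, 1, R^2)
sample_2 = rng.normal(size=(n, 2)) + mu_2   # sample of N(mu_2, 1, R^2)
data = np.vstack([sample_1, sample_2])      # combined 2-dimensional data set
```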


Separation (or disentanglement) in Figure would mean to determine for each point whether it came from the lower left Gaussian or from the upper right Gaussian. Firstly, we observe that this cannot be done in a deterministic way but only ‘with high probability for almost all points’. Secondly, we see that for this to be possible it is necessary that the distance between the two means is sufficiently large. If this is the case, we might have a good chance that pairs of data points with small distances belong to the same Gaussian whereas pairs with large distances belong to different Gaussians. Moreover, there will be only a few points located halfway between the centers, which means that if we start by picking an arbitrary data point and then compute the other points' distances from the fixed one, we may expect that two clusters, corresponding to the two Gaussians, appear.

In high dimensions the picture looks different. We know that here most data points of a Gaussian are located in an annulus with radius [math]\sqrt{d}[/math] around the mean of the Gaussian. An analogue of the assumption in the 2-dimensional case would thus require that the distance [math]\Delta:=\|\mu_1-\mu_2\|[/math] between the means is larger than [math]2\sqrt{d}[/math], see Figure, implying that the data sets will be almost disjoint.

Two high-dimensional Gaussians with distance [math]\|\mu_1-\mu_2\| \gt 2\sqrt{d}[/math].

If this is the case, then we may expect that separation can be done analogously to the low-dimensional situation. There are, however, two principal differences. Firstly, our asymptotic approach to high dimensions means that we do not consider [math]d[/math] as a constant but always ask what happens for [math]d\rightarrow\infty[/math]. Therefore, we need to understand the distance of the means [math]\Delta=\Delta(d)[/math] as a function of the dimension as well. Otherwise, i.e., for constant [math]\Delta[/math] and dimension tending to infinity, the two annuli would overlap more and more and eventually become almost coincident. From this we see that we have to require [math]\Delta\rightarrow\infty[/math]. Our low-dimensional arguments suggest that [math]\Delta \gt 2\sqrt{d}[/math] yields one scenario in which disentanglement can be achieved. On the other hand, the picture in Figure suggests that the overlap might also be small, even if [math]\Delta\leqslant2\sqrt{d}[/math].

We emphasize that Figure and in particular Figure constitute rather imperfect 2-dimensional abstract illustrations of high-dimensional facts. Notice, e.g., that any two points in one of the annuli have, with high probability, distance close to [math]\sqrt{2d}[/math]. The pictures, however, do not represent this at all!

Two high-dimensional Gaussians with distance [math]\|\mu_1-\mu_2\|\leqslant2\sqrt{d}[/math].

Notice that also in this second scenario [math]\Delta\rightarrow\infty[/math] must hold in order to avoid that for large [math]d[/math] the two annuli become almost coincident. The picture suggests nevertheless that this can be achieved also for [math]\Delta[/math]'s that grow more slowly than [math]2\sqrt{d}[/math], and experiments corroborate this idea. For Figure below we considered two Gaussians in dimension [math]d=1000[/math], sampled 100 points from each of them, and plotted the distribution of the sample's mutual distances.

Mutual distances of points in a set consisting of 100 points sampled from [math]\mathcal{N}(0,1,\mathbb{R}^{1000})[/math] and 100 points sampled from [math]\mathcal{N}(\mathds{1},1,\mathbb{R}^{1000})[/math], where [math]\mathds{1}=(1,1,\dots)[/math].

Indeed, the distances of points that belong to the same Gaussian accumulate around [math]\sqrt{2d}\approx44.72[/math]. This suggests that the other ‘bump’ in the picture belongs to those pairs of points which come from different Gaussians. In our example we have a distance of [math]\Delta\approx31.62[/math] between the Gaussians' centers and [math]2\sqrt{d}\approx63.25[/math]. This shows that we are in the situation of Figure, i.e., the Gaussians overlap. The distribution of mutual distances suggests that it should nevertheless be possible to disentangle them with high accuracy. Indeed, if we pick one point at random to start with and then take its 100 nearest neighbors, Figure suggests that our odds are good that these 100 points will belong to one of the Gaussians and the rest of the points to the other.
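A minimal sketch of the experiment behind Figure could look as follows (assuming numpy; the seed and the Gram-matrix computation of the pairwise distances are our own choices). The two reference values printed at the end are [math]\sqrt{2d}[/math] and [math]\sqrt{\Delta^2+2d}[/math].

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 1000, 100

# 100 points from N(0, 1, R^d) and 100 points from N((1,...,1), 1, R^d)
data = np.vstack([rng.normal(size=(n, d)),
                  rng.normal(size=(n, d)) + np.ones(d)])

# all mutual distances via ||x - y||^2 = ||x||^2 + ||y||^2 - 2<x, y>
sq = (data ** 2).sum(axis=1)
dist_sq = sq[:, None] + sq[None, :] - 2 * data @ data.T
dists = np.sqrt(np.maximum(dist_sq, 0.0))
mutual = dists[np.triu_indices(2 * n, k=1)]   # distances of distinct pairs

print(np.sqrt(2 * d), np.sqrt(d + 2 * d))     # approx. 44.72 and 54.77 (here Delta^2 = d)
# A histogram of `mutual` (e.g. with matplotlib) reproduces the two bumps of Figure.
```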

Our aim now is to make the above formal. We start by computing the expected values for the distance between points that belong to different Gaussians.

Theorem

Let [math]\mu_1[/math], [math]\mu_2\in\mathbb{R}^d[/math] and [math]\Delta:=\|\mu_1-\mu_2\|[/math]. Let [math]X_1\sim\mathcal{N}(\mu_1,1,\mathbb{R}^d)[/math] and [math]X_2\sim\mathcal{N}(\mu_2,1,\mathbb{R}^d)[/math] be random vectors.

  • [math]\forall\:d\geqslant1\colon\bigl|\E\bigl(\|X_1-X_2\|-\sqrt{\Delta^2+2d}\bigr)\bigr|\leqslant \medfrac{3d+8\Delta^2+\sqrt{24}\Delta d^{1/2}}{2(\Delta^2+2d)^{3/2}}[/math].
  • [math]\forall\:d\geqslant1\colon\V(\|X_1-X_2\|)\leqslant \Bigl(\medfrac{3d+8\Delta^2+\sqrt{24}\Delta d^{1/2}}{2(\Delta^2+2d)^{3/2}}\Bigr)^2 + \medfrac{3d+8\Delta^2+\sqrt{24}\Delta d^{1/2}}{\Delta^2+2d}[/math].


Show Proof

(i) Let [math]X_1[/math], [math]X_2\colon\Omega\rightarrow\mathbb{R}^d[/math] be as above. We define the auxiliary random vector [math]\tilde{X}_1:=\mu_2+X_1-\mu_1[/math]. Then [math]\tilde{X}_1\sim\mathcal{N}(\mu_2,1,\mathbb{R}^d)[/math] holds. Moreover, [math]\tilde{X}_1[/math] and [math]X_2[/math] are independent. For a fixed [math]\omega\in\Omega[/math] we get the following picture for the values [math]x_1:=X_1(\omega)[/math], [math]x_2:=X_2(\omega)[/math] and [math]\tilde{x}_1:=\tilde{X}_1(\omega)\in\mathbb{R}^d[/math]. Firstly, in view of Theorem we may assume that [math]x_1[/math] will be located near the surface of a ball with radius [math]\sqrt{d}[/math] centered at [math]\mu_1[/math] and that [math]x_2[/math] and [math]\tilde{x}_1[/math] will both be located near the surface of a ball with radius [math]\sqrt{d}[/math] centered at [math]\mu_2[/math]. By definition of [math]\tilde{x}_1[/math] the lines connecting [math]\mu_1[/math] with [math]x_1[/math] and [math]\mu_2[/math] with [math]\tilde{x}_1[/math], respectively, will be parallel. Moreover, if we think of a coordinate system in which [math]\mu_2[/math] is the origin, then we may expect that [math]\tilde{x}_1[/math] and [math]x_2[/math] have a distance of approximately [math]\sqrt{2d}[/math], compare Theorem. Finally, we think of the line going from [math]\mu_2[/math] to [math]\mu_1[/math] marking the north pole of [math]\B(\mu_2,\sqrt{d})[/math]. Then Theorem suggests that [math]\tilde{x}_1[/math] and [math]x_2[/math] will be located close to the equator of [math]\B(\mu_2,\sqrt{d})[/math], since the scalar products [math]\langle{}\tilde{x}_1-\mu_2,\mu_1-\mu_2\rangle{}[/math] and [math]\langle{}x_2-\mu_2,\mu_1-\mu_2\rangle{}[/math] both will be close to zero. In the triangle given by all three points we thus may expect that the angle [math]\theta[/math] will be close to [math]90^{\circ}[/math] which implies that [math]\|x_1-x_2\|^2\approx\Delta^2+2d[/math] should hold.

Points sampled from different unit Gaussians lead to an almost right-angled triangle with hypotenuse between [math]x_1[/math] and [math]x_2[/math].

Following the above intuition, we will now formally use the cosine law and our results from Chapter The curse of high dimensions to compute the expectation. We put [math]\xi:=\mu_1-\mu_2[/math] and start with

[[math]] \begin{equation}\label{COS-LAW} \begin{array}{rl} \|X_1-X_2\|^2\hspace{-7pt}\,& = \|X_2-\tilde{X}_1\|^2+\|X_1-\tilde{X}_1\|^2 - 2\|X_2-\tilde{X}_1\|\|\tilde{X}_1-X_1\|\cos(\theta)\\ & = \|X_2-\tilde{X}_1\|^2+\|\mu_1-\mu_2\|^2 - 2\langle{}X_2-\tilde{X}_1,X_1-\tilde{X}_1\rangle{}\\ & = \|X_2-\tilde{X}_1\|^2+\Delta^2 -2(\langle{}X_2-\mu_2,\xi\rangle{}+ \langle{}X_1-\mu_1,-\xi\rangle{})=:(\circ) \end{array} \end{equation} [[/math]]
where we know [math]\E(\|X_2-\tilde{X}_1\|^2)=2d[/math] since [math]\tilde{X}_1\sim\mathcal{N}(\mu_2,1,\mathbb{R}^d)[/math] by EXP-NORM-SQUARED. Moreover, by Theorem we get [math]\E(\langle{}X_2-\mu_2,\xi\rangle{})=\E(\langle{}X_1-\mu_1,-\xi\rangle{})=0[/math] which leads to

[[math]] \E(\|X_1-X_2\|^2)=2d + \Delta^2 - 2(0 + 0) = \Delta^2+2d. [[/math]]
Now we estimate the variance of [math]\|X_1-X_2\|^2[/math] but we have to be careful as the three non-constant expressions in \eqref{COS-LAW} are not independent. Indeed, from the second line of \eqref{COS-LAW} we get

[[math]] \begin{align*} \V(\|X_1-X_2\|^2) & = \V(\|X_2-\tilde{X}_1\|^2 + \Delta^2 - 2\langle{}X_2-\tilde{X}_1,X_1-\tilde{X}_1\rangle{})\\ & = \V(\|X_2-\tilde{X}_1\|^2) + 4\V(\langle{}X_2-\tilde{X}_1,X_1-\tilde{X}_1\rangle{})\\ & + 2\Cov(\|X_2-\tilde{X}_1\|^2,-2\langle{}X_2-\tilde{X}_1,X_1-\tilde{X}_1\rangle{})\\ & \leqslant \V(\|X_2-\tilde{X}_1\|^2) + 4\V(\langle{}X_2-\tilde{X}_1,X_1-\tilde{X}_1\rangle{})\\ & + \bigl(\V(\|X_2-\tilde{X}_1\|^2)\cdot\V(-2\langle{}X_2-\tilde{X}_1,X_1-\tilde{X}_1\rangle{})\bigr)^{1/2}=:(*) \end{align*} [[/math]]
where the covariance term actually vanishes, since [math]X_2-\tilde{X}_1[/math] is a centered Gaussian vector and all of its third moments are zero; in particular the last estimate holds. Moreover, we know, see Problem, that [math]\V(\|X_2-\tilde{X}_1\|^2)\leqslant 3d[/math] holds. For the second term under the root we compute

[[math]] \begin{align*} \V(\langle{}X_2-\tilde{X}_1,X_1-\tilde{X}_1\rangle{}) & = \V(\langle{}X_2-\mu_2,\mu_1-\mu_2\rangle{} + \langle{}X_1-\mu_1,\mu_2-\mu_1\rangle{} )\\ & = \V(\langle{}X_2-\mu_2,\mu_1-\mu_2\rangle{}) + \V(\langle{}X_1-\mu_1,\mu_2-\mu_1\rangle{} )\\ & \leqslant 2\|\mu_1-\mu_2\|^2 \end{align*} [[/math]]
where we use [math]Z:=X_i-\mu_i\sim\mathcal{N}(0,1,\mathbb{R}^d)[/math], [math]\xi:=\pm(\mu_1-\mu_2)[/math] together with Theorem to get [math]\V(\langle{}Z,\xi\rangle{}) = \|\xi\|^2[/math]. This allows us to continue the estimate of the variance as follows

[[math]] (*) \leqslant 3d+8\|\mu_1-\mu_2\|^2 + (3d\cdot8\|\mu_1-\mu_2\|^2)^{1/2}= 3d+8\Delta^2+\sqrt{24}\Delta d^{1/2}. [[/math]]
Now we adapt the method used in the proof of Theorem and split

[[math]] \|X_1-X_2\|-\sqrt{\Delta^2+2d} = \medfrac{\|X_1-X_2\|^2-(\Delta^2+2d)}{2\sqrt{\Delta^2+2d}} - \medfrac{(\|X_1-X_2\|^2-(\Delta^2+2d))^2}{2\sqrt{\Delta^2+2d}(\|X_1-X_2\|+\sqrt{\Delta^2+2d})^2} [[/math]]
where we know that the expectation of the first term is zero. The expectation of the second term we estimate by

[[math]] 0\leqslant \E\Bigl(\medfrac{(\|X_1-X_2\|^2-(\Delta^2+2d))^2}{2\sqrt{\Delta^2+2d}(\|X_1-X_2\|+\sqrt{\Delta^2+2d})^2}\Bigr)\leqslant \medfrac{\V(\|X_1-X_2\|^2)}{2(\Delta^2+2d)^{3/2} }\leqslant \medfrac{3d+8\Delta^2+\sqrt{24}\Delta d^{1/2}}{2(\Delta^2+2d)^{3/2} } [[/math]]
where we used our previous estimate for [math]\V(\|X_1-X_2\|^2)[/math]. This establishes (i).

(ii) Finally we get

[[math]] \begin{align*} \V(\|X_1-X_2\|) &= \bigl|\E\bigl((\|X_1-X_2\|)^2\bigr) - \bigl(\E(\|X_1-X_2\|)\bigr)^2\bigr|\\ & = \bigl|(\Delta^2+2d) - \bigl[\E\bigl(\|X_1-X_2\|-\sqrt{\Delta^2+2d}\bigr)+\sqrt{\Delta^2+2d}\bigr]^2\bigr|\\ & = \bigl|(\Delta^2+2d) -\bigl[\bigl(\E\bigl(\|X_1-X_2\|-\sqrt{\Delta^2+2d}\bigr)\bigr)^2\\ &+2\E\bigl(\|X_1-X_2\|-\sqrt{\Delta^2+2d}\bigr)\sqrt{\Delta^2+2d}+(\Delta^2+2d)\bigr]\bigr|\\ & \leqslant \bigl|\E\bigl(\|X_1-X_2\|-\sqrt{\Delta^2+2d}\bigr)\bigr|^2\\ & + 2\bigr|\E\bigl(\|X_1-X_2\|-\sqrt{\Delta^2+2d}\bigr)\bigr|\sqrt{\Delta^2+2d} \end{align*} [[/math]]
which shows (ii) by employing twice the estimate from (i).

In the following results we understand [math]\Delta=\Delta(d)[/math] as a function of the dimension. To keep the notation short we will, however, suppress the argument [math]d[/math] most of the time.

Corollary

Let [math]\mu_1[/math], [math]\mu_2\in\mathbb{R}^d[/math] and [math]\Delta:=\|\mu_1-\mu_2\|[/math]. Let [math]X_1\sim\mathcal{N}(\mu_1,1,\mathbb{R}^d)[/math] and [math]X_2\sim\mathcal{N}(\mu_2,1,\mathbb{R}^d)[/math] be random vectors. Assume either that there exists [math]d_0\geqslant1[/math] such that [math]\Delta\geqslant2\sqrt{d}[/math] holds for all [math]d\geqslant d_0[/math] or that there exists [math]d_0\geqslant1[/math] such that [math]\Delta \lt 2\sqrt{d}[/math] holds for all [math]d\geqslant d_0[/math]. Then

[[math]] \lim_{d\rightarrow\infty}\E\bigl(\|X_1-X_2\|-\sqrt{\Delta^2+2d}\bigr)=0 \;\;\text{ and }\;\; \limsup_{d\rightarrow\infty}\V\bigl(\|X_1-X_2\|\bigr)\leqslant23. [[/math]]


Show Proof

We start with the first scenario, assume [math]\Delta\geqslant2\sqrt{d}[/math], i.e., [math]d\leqslant\Delta^2/4[/math], and estimate

[[math]] \medfrac{3d+8\Delta^2+\sqrt{24}\Delta d^{1/2}}{2(\Delta^2+2d)^{3/2}} \leqslant \medfrac{(3/4)\Delta^2+8\Delta^2+\sqrt{6}\Delta^2}{2(\Delta^2)^{3/2}}\leqslant\medfrac{3/4+8+\sqrt{6}}{4\sqrt{d}}\xrightarrow{d\rightarrow\infty}0. [[/math]]
For the second scenario assume [math]\Delta \lt 2\sqrt{d}[/math]. Then

[[math]] \medfrac{3d+8\Delta^2+\sqrt{24}\Delta d^{1/2}}{2(\Delta^2+2d)^{3/2}} \leqslant \medfrac{3d+32d+2\sqrt{24}d}{2(2d)^{3/2}} = \medfrac{35+2\sqrt{24}}{4\sqrt{2d}}\xrightarrow{d\rightarrow\infty}0 [[/math]]
holds. This shows that the expectation tends to zero. For the variance we get from the above that the first term will tend to zero and it thus remains to bound the second. In the first scenario we estimate

[[math]] \medfrac{3d+8\Delta^2+\sqrt{24}\Delta d^{1/2}}{\Delta^2+2d}\leqslant\medfrac{(3/4+8+\sqrt{6})\Delta^2}{\Delta^2+2d}\leqslant\medfrac{12}{1+(2d/\Delta^2)}\leqslant 12 [[/math]]
and in the second

[[math]] \medfrac{3d+8\Delta^2+\sqrt{24}\Delta d^{1/2}}{\Delta^2+2d} \leqslant\medfrac{45d}{\Delta^2+2d}= \medfrac{45}{(\Delta^2/d)+2}\leqslant \medfrac{45}{2} [[/math]]
which finishes the proof.
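The statements of Theorem and Corollary can also be probed numerically. The following sketch (our own, assuming numpy; the choice [math]\Delta=2d^{1/4}[/math], the seed and the number of trials are illustrative assumptions) compares the empirical mean of [math]\|x_1-x_2\|[/math] with [math]\sqrt{\Delta^2+2d}[/math] and checks that the empirical variance stays bounded as [math]d[/math] grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def distance_stats(d, delta, trials=500):
    """Empirical mean and variance of ||x1 - x2|| for x1 ~ N(mu_1, 1, R^d) and
    x2 ~ N(mu_2, 1, R^d) with ||mu_1 - mu_2|| = delta."""
    mu_1 = np.zeros(d)
    mu_1[0] = delta                        # any pair of means at distance delta will do
    x1 = rng.normal(size=(trials, d)) + mu_1
    x2 = rng.normal(size=(trials, d))      # second mean is the origin
    norms = np.linalg.norm(x1 - x2, axis=1)
    return norms.mean(), norms.var()

for d in (100, 1000, 10000):
    delta = 2 * d ** 0.25
    mean, var = distance_stats(d, delta)
    # the mean is close to sqrt(delta^2 + 2d) and the variance does not grow with d
    print(d, mean, np.sqrt(delta ** 2 + 2 * d), var)
```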

The above suggests that if we pick two points [math]x[/math] and [math]y[/math] at random from Gaussians with different centers, then we have to expect that [math]\|x-y\|\approx\sqrt{\Delta^2+2d}[/math] holds for large [math]d[/math] and that the distribution does not spread out when [math]d[/math] increases. On the other hand we know already from Chapter The curse of high dimensions that [math]\|x-y\|\approx\sqrt{2d}[/math] for large [math]d[/math] provided that [math]x[/math] and [math]y[/math] are sampled from the same Gaussian. Separating the two Gaussians therefore requires that the gap [math]\sqrt{\Delta^2+2d}-\sqrt{2d}[/math] between these two expected values is large enough. For the two scenarios covered by Corollary we get the following.

Lemma

Let [math]\Delta\colon\mathbb{N}\rightarrow(0,\infty)[/math] and [math]\delta\colon\mathbb{N}\rightarrow(0,\infty)[/math], [math]\delta(d):=\sqrt{\Delta^2+2d}-\sqrt{2d}[/math].

  • If there exists [math]d_0\geqslant1[/math] such that [math]\Delta\geqslant2\sqrt{d}[/math] holds for all [math]d\geqslant d_0[/math], then [math]\delta(d)\geqslant\sqrt{d}[/math] holds for [math]d\geqslant d_0[/math].
  • If there exists [math]d_0\geqslant1[/math], [math]c \gt 0[/math] and [math]\alpha\geqslant0[/math] such that [math]cd^{1/4+\alpha}\leqslant\Delta \lt 2\sqrt{d}[/math] holds for all [math]d\geqslant d_0[/math], then [math]\delta(d)\geqslant (c^2/4)d^{2\alpha}[/math] holds for [math]d\geqslant d_0[/math].


Show Proof

(i) In this case we compute

[[math]] \delta(d)=\sqrt{\Delta^2+2d}-\sqrt{2d} \geqslant \sqrt{4d+2d}-\sqrt{2d} = (\sqrt{6}-\sqrt{2})\sqrt{d}\geqslant\sqrt{d}. [[/math]]
(ii) Using [math]\Delta^2\leqslant4d[/math] and [math]\Delta\geqslant cd^{1/4+\alpha}[/math] we get

[[math]] \delta(d)=\medfrac{\Delta^2}{\sqrt{\Delta^2+2d}+\sqrt{2d}}\geqslant\medfrac{(cd^{1/4+\alpha})^2}{\sqrt{6d}+\sqrt{2d}}\geqslant \medfrac{c^2d^{2\alpha}}{(\sqrt{6}+\sqrt{2})}\geqslant \medfrac{c^2}{4}\cdot d^{2\alpha} [[/math]]
as claimed.


We will see later that Lemma indicates, for the situation sketched in Figure, how fast [math]\Delta[/math] needs to grow in order to separate two Gaussians. Before we discuss this, we need, however, the following explicit probability estimate complementing our results on expectation and variance from Theorem and Corollary.

Theorem

Let [math]\mu_1[/math], [math]\mu_2\in\mathbb{R}^d[/math] and let [math]\Delta:=\|\mu_1-\mu_2\| \gt 0[/math]. Let [math]x_1[/math] and [math]x_2[/math] be drawn at random with respect to Gaussian distributions with unit variance and means [math]\mu_1[/math] and [math]\mu_2[/math], respectively. Then

[[math]] \P\bigl[\bigl|\|x_1-x_2\|-\sqrt{\Delta^2+2d}\bigr|\geqslant\epsilon\bigr] \leqslant \medfrac{54}{\epsilon}+\medfrac{8}{\sqrt{d}}+\medfrac{20}{\epsilon\sqrt{d}} [[/math]]
holds for [math]\epsilon \gt 54[/math] and [math]d\geqslant\epsilon^2/9[/math].


Show Proof

Formally, we consider random variables [math]X_i\sim\mathcal{N}(\mu_i,1,\mathbb{R}^d)[/math] for [math]i=1,2[/math] and we estimate

[[math]] P:=\P\bigl[\bigl|\|X_1-X_2\|-\sqrt{\Delta^2+2d}\bigr|\geqslant\epsilon\bigr] [[/math]]
by the same trick that we used already in Theorem and Theorem, namely by multiplying the left side of the estimate in the square brackets by [math]\|X_1-X_2\|+\sqrt{\Delta^2+2d}[/math] and the right side by [math]\sqrt{\Delta^2+2d}[/math]. This way we get

[[math]] P \leqslant \P\bigl[\bigl|\|X_1-X_2\|^2-(\Delta^2+2d)\bigr|\geqslant\epsilon\sqrt{\Delta^2+2d}\bigr]=:(\circ). [[/math]]
Now we introduce the auxiliary random variable [math]\tilde{X}_1:=\mu_2+X_1-\mu_1\sim\mathcal{N}(\mu_2,1,\mathbb{R}^d)[/math] and employ formula \eqref{COS-LAW}, which we derived from the cosine law, to get

[[math]] \begin{align*} (\circ)& = \P\bigl[\bigl|\|X_1-X_2\|^2-(\Delta^2+2d)\bigr|\geqslant\epsilon\sqrt{\Delta^2+2d}\bigr]\\ & = 1 - \P\bigl[-\epsilon\sqrt{\Delta^2+2d}\leqslant\|X_2-\tilde{X}_1\|^2+\Delta^2-2\bigl(\langle{}X_2-\mu_2,\mu_1-\mu_2\rangle{}\\ & +\langle{}X_1-\mu_1,\mu_2-\mu_1\rangle{}\bigr)-\Delta^2-2d\leqslant\epsilon\sqrt{\Delta^2+2d}\bigr]\\ & \leqslant 1 - \P\bigl[-\medfrac{\epsilon\sqrt{\Delta^2+2d}}{3}\leqslant\|X_2-\tilde{X}_1\|^2-2d\leqslant\medfrac{\epsilon\sqrt{\Delta^2+2d}}{3}\bigr]\\ & \cdot\P\bigl[-\medfrac{\epsilon\sqrt{\Delta^2+2d}}{3}\leqslant-2\langle{}X_2-\mu_2,\mu_1-\mu_2\rangle{}\leqslant\medfrac{\epsilon\sqrt{\Delta^2+2d}}{3}\bigr]\\ & \cdot\P\bigl[-\medfrac{\epsilon\sqrt{\Delta^2+2d}}{3}\leqslant-2\langle{}X_1-\mu_1,\mu_2-\mu_1\rangle{}\leqslant\medfrac{\epsilon\sqrt{\Delta^2+2d}}{3}\bigr]\\ & \leqslant 1 - \P\bigl[\bigl|\|X_2-\tilde{X}_1\|^2-2d\bigr|\leqslant\medfrac{\epsilon\sqrt{2d}}{3}\bigr]\cdot\P\bigl[|\langle{}Z,\xi\rangle{}|\leqslant\medfrac{\epsilon\sqrt{\Delta^2}}{6}\bigr]^2 =:(*) \end{align*} [[/math]]
where we use [math]Z:=X_i-\mu_i\sim\mathcal{N}(0,1,\mathbb{R}^d)[/math] and [math]\xi:=\mu_1-\mu_2[/math]. In order to continue we need to estimate both probabilities in [math](*)[/math] from below. From Corollary we get for [math]\epsilon/3 \gt 18[/math] and [math]d\geqslant(\epsilon/3)^2[/math] the estimate

[[math]] \P\bigl[\bigl|\|X_2-\tilde{X}_1\|^2-2d\bigr|\leqslant\medfrac{\epsilon\sqrt{2d}}{3}\bigr]\geqslant1-\bigl(\medfrac{18}{\epsilon/3}+\medfrac{8}{\sqrt{d}}\bigr)=1-\bigl(\medfrac{54}{\epsilon}+\medfrac{8}{\sqrt{d}}\bigr). [[/math]]
Using Corollary we get

[[math]] \P\bigl[|\langle{}Z,\xi\rangle{}|\leqslant\medfrac{\epsilon\sqrt{\Delta^2}}{6}\bigr]\geqslant1-\medfrac{4}{\sqrt{2\pi}}\medfrac{\|\xi\|}{(\epsilon\Delta/6)\sqrt{d}} = 1-\medfrac{24}{\sqrt{2\pi}\epsilon\sqrt{d}}\geqslant1-\medfrac{10}{\epsilon\sqrt{d}}. [[/math]]
Combining both leads to

[[math]] \begin{align*} (*) & = 1 - \P\bigl[\bigl|\|X_2-\tilde{X}_1\|^2-2d\bigr|\leqslant\medfrac{\epsilon\sqrt{2d}}{3}\bigr]\cdot\P\bigl[\bigl|\langle{}Z,\xi\rangle{}\bigr|\leqslant\medfrac{\epsilon\sqrt{\Delta^2}}{6}\bigr]^2\\ &\leqslant 1-\Bigl[1-\bigl(\medfrac{54}{\epsilon}+\medfrac{8}{\sqrt{d}}\bigr)\Bigr]\Bigl[1-2\cdot\medfrac{10}{\epsilon\sqrt{d}}+\bigl(\medfrac{10}{\epsilon\sqrt{d}}\bigr)^2\Bigr]\\ &\leqslant 1-\Bigl[1-\bigl(\medfrac{54}{\epsilon}+\medfrac{8}{\sqrt{d}}\bigr)\Bigr]\Bigl[1-\medfrac{20}{\epsilon\sqrt{d}}\Bigr]\\ &\leqslant 1- \Bigl[1-\bigl(\medfrac{54}{\epsilon}+\medfrac{8}{\sqrt{d}}\bigr)-\medfrac{20}{\epsilon\sqrt{d}}\Bigr]\\ &=\medfrac{54}{\epsilon}+\medfrac{8}{\sqrt{d}}+\medfrac{20}{\epsilon\sqrt{d}} \end{align*} [[/math]]
which finishes the proof.

The following is the core result on separability as we can use it to see that points which have ‘small’ mutual distance belong to the same Gaussian with high probability.

Theorem

(Separation Theorem) Let [math]\mu_1[/math], [math]\mu_2\in\mathbb{R}^d[/math] and put [math]\Delta:=\|\mu_1-\mu_2\|[/math]. Let [math]\epsilon_1 \gt 18[/math] and [math]\epsilon_2 \gt 54[/math] be constant and assume that there is [math]d_0\geqslant1[/math] such that [math]\sqrt{\Delta^2+2d}-\sqrt{2d} \gt \epsilon_1+\epsilon_2[/math] holds for [math]d\geqslant d_0[/math]. Let [math]X_i\sim\mathcal{N}(\mu_i,1,\mathbb{R}^d)[/math] and [math]S_i:=\{x_i^{\scriptscriptstyle(1)},\dots,x_i^{\scriptscriptstyle(n)}\}[/math] be samples of [math]X_i[/math] for [math]i=1,2[/math] such that [math]S_1\cap S_2=\emptyset[/math]. Let [math]x,y\in S_1\cup S_2[/math]. Then

[[math]] \liminf_{d\rightarrow\infty}\;\P\bigl[x,\,y \text{ come from same Gaussian}\:\big|\:|\|x-y\|-\sqrt{2d}| \lt \epsilon_1 \bigr]\geqslant\medfrac{1-18/\epsilon_1}{1+54/\epsilon_2} [[/math]]
holds.


Show Proof

On the basis of our prior experiments, see Figure, and in view of Theorem and Corollary we expect that the mutual distances will accumulate around the values [math]\sqrt{2d}[/math] and [math]\sqrt{\Delta^2+2d}[/math]. Our assumptions imply [math]\sqrt{2d}+\epsilon_1 \lt \sqrt{\Delta^2+2d}-\epsilon_2[/math] which guarantees that [math]\Bbar(\sqrt{2d},\epsilon_1)[/math] and [math]\Bbar(\sqrt{\Delta^2+2d},\epsilon_2)[/math] have empty intersection, i.e., there is some space between the two ‘bumps’ sketched in Figure below.

Picture of the distribution of mutual distances accumulating around [math]\sqrt{2d}[/math] and [math]\sqrt{\Delta^2+2d}[/math].

To keep notation short let us abbreviate [math]x\sim y[/math] for the case that [math]x[/math] and [math]y[/math] come from the same Gaussian, i.e., if [math]x,\,y\in S_1[/math] or [math]x,\,y\in S_2[/math]. Consequently, [math]x\not\sim y[/math] means [math]x\in S_1, y\in S_2[/math] or [math]x\in S_2,\,y\in S_1[/math]. We use Bayes' Theorem to compute the conditional probability

[[math]] \P\bigl[x\sim y\:\big|\:\bigl|\|x-y\|-\sqrt{2d}\bigr| \lt \epsilon_1\bigr] = \frac{\P\bigl[\bigl|\|x-y\|-\sqrt{2d}\bigr| \lt \epsilon_1 \:\big|\:x\sim y\bigr]\cdot\P[x\sim y]}{\P\bigl[\bigl|\|x-y\|-\sqrt{2d}\bigr| \lt \epsilon_1\bigr]}=:(\circ) [[/math]]
and continue with the denominator as follows

[[math]] \begin{align*} \P\bigl[\bigl|\|x-y\|-\sqrt{2d}\bigr| \lt \epsilon_1\bigr] & = \P\bigl[\bigl|\|x-y\|-\sqrt{2d}\bigr| \lt \epsilon_1\:\big|\:x\sim y\bigr]\cdot\P[x\sim y]\\ & + \P\bigl[\bigl|\|x-y\|-\sqrt{2d}\bigr| \lt \epsilon_1\:\big|\:x\not\sim y\bigr]\cdot\P[x\not\sim y]\\ & \leqslant 1\cdot\medfrac{1}{2}+\bigl(1-\P\bigl[\bigl|\|x-y\|-\sqrt{2d}\bigr|\geqslant\epsilon_1\:\big|\:x\not\sim y\bigr]\bigr)\cdot\medfrac{1}{2}\\ & \leqslant\medfrac{1}{2}\cdot\Bigl(1+1-\P\bigl[\bigl|\|x-y\|-\sqrt{\Delta^2+2d}\bigr| \lt \epsilon_2\:\big|\:x\not\sim y\bigr]\Bigr)\\ & \leqslant\medfrac{1}{2}\cdot\Bigl(1+\P\bigl[\bigl|\|x-y\|-\sqrt{\Delta^2+2d}\bigr|\geqslant\epsilon_2\:\big|\:x\not\sim y\bigr]\Bigr)\\ & \leqslant\medfrac{1}{2}\cdot\Bigl(1+\medfrac{54}{\epsilon_2}+\medfrac{8}{\sqrt{d}}+\medfrac{20}{\epsilon_2\sqrt{d}}\Bigr)\\ \end{align*} [[/math]]
where we employed [math]\P[x\not\sim y]=\medfrac{1}{2}[/math], the fact that [math]\Bbar(\sqrt{2d},\epsilon_1)\cap\Bbar(\sqrt{\Delta^2+2d},\epsilon_2)=\emptyset[/math] and the estimate from Theorem. Using [math]\P[x\sim y] =1/2[/math] and the estimate from Theorem to treat the numerator of [math](\circ)[/math] we get

[[math]] \begin{equation}\label{SEP-EQ-1} \P\bigl[x\sim y\;\big|\;\bigl|\|x-y\|-\sqrt{2d}\bigr| \lt \epsilon_1\bigr] \geqslant \frac{1-\frac{18}{\epsilon_1}-\frac{8}{\sqrt{d}}}{1+\frac{54}{\epsilon_2}+\frac{8}{\sqrt{d}}+\frac{20}{\epsilon_2\sqrt{d}}} \end{equation} [[/math]]
and letting [math]d\rightarrow\infty[/math] finishes the proof.

Notice that in order to apply Theorem we need that [math]\delta=\sqrt{\Delta^2+2d}-\sqrt{2d}\geqslant\epsilon_1+\epsilon_2 \gt 72[/math] holds. Here, [math]\delta=\delta(d)[/math] can for example be large and bounded or even tend to infinity. If the latter is the case, we may in \eqref{SEP-EQ-1} consider [math]\epsilon_2\rightarrow\infty[/math] as well. We state concrete examples for this below.

Corollary

Let [math]\epsilon_1 \gt 18[/math] be constant, [math]\epsilon_2=d^{\alpha}[/math] and [math]\Delta=2d^{1/4+\alpha}[/math] with [math]\alpha \gt 0[/math]. Then the assumptions of Theorem are met and we have

[[math]] \liminf_{d\rightarrow\infty}\;\P\bigl[x,\,y \text{ come from same Gaussian}\:\big|\:|\|x-y\|-\sqrt{2d}| \lt \epsilon_1 \bigr]\geqslant1-\medfrac{18}{\epsilon_1}. [[/math]]


Show Proof

We first consider the case [math]\alpha\geqslant1/4[/math]. Then [math]\Delta=2d^{1/4+\alpha}\geqslant2\sqrt{d}[/math], so [math]2d\leqslant\Delta^2/2[/math] and, estimating as in the proof of Lemma, [math]\delta=\Delta^2/(\sqrt{\Delta^2+2d}+\sqrt{2d})\geqslant\Delta^2/\bigl((\sqrt{3/2}+\sqrt{1/2})\Delta\bigr)\geqslant\Delta/2=d^{1/4+\alpha} \gt \epsilon_1+d^{\alpha}[/math] for large [math]d[/math]. We may thus proceed as in the proof of the previous theorem and at its end we see, since [math]\epsilon_2\rightarrow\infty[/math] for [math]d\rightarrow\infty[/math], that the limit inferior is bounded from below by [math]1-18/\epsilon_1[/math].

Let now [math]\alpha \lt 1/4[/math]. Then we apply Lemma(ii) and get that [math]\delta\geqslant (2^2/4)d^{2\alpha} \gt \epsilon_1+d^{\alpha}[/math] for large [math]d[/math]. From this point on we can proceed as in the first part.

We conclude this chapter by running the separation algorithm suggested by our previous results with the parameters outlined in Corollary. That is, we consider two [math]d[/math]-dimensional Gaussians whose centers have distance [math]\Delta=\Delta(d)[/math] and sample from each of them 100 points. Then we pick one data point at random and label its 100 nearest neighbors with [math]0[/math], the rest with [math]1[/math]. Finally we check how many of the points with the same label came from the same Gaussian. The charts below show the rate of correctly classified points for different choices of [math]\Delta[/math].


Rate of correctly classified data points for [math]\Delta\equiv{}10[/math], [math]\Delta=2d^{1/4}[/math], and [math]\Delta=2d^{0.3}[/math]. The pictures show the means over 100 experiments for each dimension.
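A single run of this experiment, for one fixed dimension, could be sketched as follows (assuming numpy; the seed, the helper name classification_rate and the handling of the 0/1 label swap are our own choices and not the notes' reference implementation).

```python
import numpy as np

rng = np.random.default_rng(0)

def classification_rate(d, delta, n=100):
    """Sample n points from each of two unit Gaussians whose means are delta apart,
    label the n nearest neighbours of a randomly picked point with 0 and the rest
    with 1, and return the fraction of correctly separated points."""
    mu = np.zeros(d)
    mu[0] = delta
    data = np.vstack([rng.normal(size=(n, d)),           # true label 0
                      rng.normal(size=(n, d)) + mu])     # true label 1
    true = np.repeat([0, 1], n)
    pivot = data[rng.integers(2 * n)]                     # data point picked at random
    order = np.argsort(np.linalg.norm(data - pivot, axis=1))
    pred = np.ones(2 * n, dtype=int)
    pred[order[:n]] = 0                                   # n nearest neighbours get label 0
    rate = (pred == true).mean()
    return max(rate, 1 - rate)                            # labels are determined only up to a swap

for d in (100, 1000, 10000):
    print(d, classification_rate(d, 2 * d ** 0.25))       # Delta = 2 d^{1/4} as in Corollary
```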


We leave it as Problem to verify the above and we mention that for [math]\Delta=2\sqrt{d}[/math] the rate of correctly classified points turned out in experiments to be one for all dimensions. Finally, let us look at the distribution of mutual distances in a case where [math]\Delta[/math] is (too) small.


Mutual distances of points in a set consisting of 100 points sampled from [math]\mathcal{N}(0,1,\mathbb{R}^{10000})[/math] and 100 points sampled from [math]\mathcal{N}((\Delta,0,0,\dots),1,\mathbb{R}^{10000})[/math] with [math]\Delta=25,\,30[/math].

The pictures in Figure show the distribution of mutual distances between 200 points sampled from two different Gaussians in [math]\mathbb{R}^{10000}[/math]. In the left picture the distance between their means is 25 and in the right picture it is 30. In the experiment that led to the right picture our algorithm still correctly classified 98% of the points, whereas in the experiment underlying the left picture we only got 85% right. Notice that the threshold identified in Corollary is [math]2d^{1/4}=20[/math].

General references

Wegner, Sven-Ake (2024). "Lecture Notes on High-Dimensional Data". arXiv:2101.05841 [math.FA].

Notes