In the present paper
we study statistics of the so-called first nearest neighbors graphs (next
FNNG). The main goal of this investigation is to analyze the adequacy of the
nearest neighbor heuristics as a machine learning model. Usually, such a
problem arises in many tasks of classification, where we need to recognize some
attributes of the system, which presumably determine the state of this system.
Machine learning methods for FNNG models are described in sufficient detail,
for example, in works [1, 2]. Practical heuristics using the first (as well as
second, third, etc.) nearest neighbors for clustering problems are described in
[3]. In the work of [4], the dependence of the number of nearest neighbors on
the dimension of the embedding space is studied. Various algorithms for
searching nearest neighbors are described in [5]. In the work [6] some
approximation methods for FNNG searching were discussed. Thus, work in this
area continues to develop actively, and a number of tasks remain unresolved,
despite the long period of study.
The essence of the
problem we are working on is that for reliable machine learning it is required
to have sufficient number of samples from the same attribute distribution.
However, the structures of the nearest neighbors corresponding to various
samples from such a distribution may differ significantly. With regard to FNNG,
the question arises as to how the structure of connections between elements of
a set observed in practice is characteristic of sets of the type being studied.
To carry out such an analysis, it is necessary to formalize the concept of
sample statistics for FNNG. This problem is not so simple as it may seem. On
the one hand, if it is necessary to find the probability that a certain family
of graphs has a given property, then the problem is solved methodically simply:
graphs of this family are generated and their proportion for which the studied
property is fulfilled is determined. After this determination, the convergence
can be analyzed in the sense of the central limit theorem. We emphasize that
both the family and the property are defined in terms of graph parameters: it
can be the number of edges, the average density of connections, the number of
disconnected subgraphs, etc. But if we are not interested in the graph itself,
but in its interpretation as a structure of a physical system, then the
situation becomes much more complicated. For example, let's take a certain
number of fiction texts belonging to certain authors and build a graph of nearest
neighbor connections for them. The question arises: how should the structure of
the resulting graph be treated? Let us suppose, that we have three disconnected
subgraphs with node amount equals
. Is this structure characteristic of this
particular group of authors and texts, or is it in some sense typical of
writers in general? If a graph with five disconnected fragments is obtained for
other texts by the same authors, should it be considered as a significantly
different system or as a random variation of the first one?
Since in practice,
as a rule, one specific set of objects for which data were collected is
analyzed, the question of the variability of established empirical patterns
becomes of practical importance. Different samples of the same type of objects
will generate, generally speaking, different connection graphs. In this sense,
the relationship graph is random, since the selection of objects into the
analyzed group is random. The theory of random graphs is developed in great
detail in the direction of "probability theory". Starting from the
classical work [7], random graphs have been studied in the context of solving
many applied problems: see, for example, [8 - 11].
The asymptotic
properties of random graphs were also analyzed within the framework of specific
probabilistic models in relation to various ontologies, social and transport
networks and similar objects (see [12 - 14]) However, in the direction of
"mathematical statistics" there are significantly fewer constructive
models and results in graph theory. This is due to the fact that "graph
sampling" does not take place by itself, because the graph simply
visualizes some property of the studied group of objects. So, the concept of
sampling is addressed specifically to a group of objects that should be
homogeneous in the studied property in the statistical sense. But since the
studied property is revealed just as a result of analyzing the graph structure,
it is impossible to form groups of objects with the desired property in
advance. Then, having considered the distribution of parameters that are
considered key for a particular sample in a particular model of the system, it
is possible to multiply samples with the found empirical distribution of
parameters, and then study the statistics of graphs corresponding (according to
the assumption of the model) to various states of the system under study. Since
the sample of parameters is finite, the corresponding sample distribution
function fluctuates from sample to sample. Then it is necessary to study the
variability of the graph family structure when changing the distribution of
system parameters.
The main problem is
that the analysis of one particular graph does not allow us to estimate the
probability of its realization as an element from a certain set of random
graphs, since the desired set is not formalized. A rare exception is the
nearest neighbor graph, which is based on a matrix of pairwise distances
between points. In this case, the distance distribution function is known,
which makes it possible to find out which graphs correspond to it.
We formulate the
main tasks of statistical analysis of FNNG, which are of interest for the
classification and identification of elements from some finite set.
First, it is necessary
to determine the realization probability of a certain graph structure with
nodes in the form of a set
of disconnected fragments. In the case of nearest neighbors,
the number
can be varied from 1 (the graph is
connected) to
(the graph consists of pairs of
mutually closest nodes or, according to the parity of
, consists of pairs and one triple).
Secondly, since the
distribution of a graph by degrees of nodes depends on the number of these
nodes themselves, it is necessary to construct a distribution of a connected
fragment
by the number of nodes depending on the
total amount of nodes
in the graph and the number of
fragments
.
Thirdly, it is
interesting to study distribution of nodes in the fragment by their node
degrees
with known values
,
and
up to isomorphism.
Thus, for a given
total number of nodes
, we will be interested in the
distribution
of graphs by the number
of disconnected fragments, the distribution
of these fragments by the number of nodes
, and the distribution
of a given graph by nodes degree
. It is quite natural that independent samples of distances
from the same distribution will generate, generally speaking, non-isomorphic
graphs. Then we can find out which structures are most likely. And as a result,
it is possible to separate by studying the constructed distributions whether we
are dealing in each case with a typical situation or with a rare deviation,
which may be caused by a special reason that was not considered when forming
the similarity model of objects.
The asymptotic
behavior of sample distributions for high-dimensional graphs is also of our
interest. So we suggest a new method of FNNG generation for high dimension
metric space. Instead of generation of random coordinates of multidimensional
vector we generate a random symmetric matrix of distances between objects in
our space. Thus we can construct a lot of samples of such matrices and
investigate the distribution function of corresponding graph structure. The
matrix order is equal to the number of points N. If N is sufficiently large,
only the graph analysis can visualize a typical mutual structure of random
objects. So the statistical properties of the distributions, mentioned above,
enable us to get an idea of typical FNNG structure and to estimate the corresponding
probability of realization of such a graph.
We will assume that
for each point of the set under study there is a unique nearest neighbor. In
our study we follow to a standard definition of FNNG as an oriented graph,
connecting two points
A
and
B
in a metric space if and only if
, where
is a distance function between
two given points.
If the number of
elements in the set is relatively small, then the problem of calculation of
graph distribution according to the number of disconnected fragments and
distribution of nodes according to the number of incoming edges can be solved
by direct enumeration. For a large graphs, such enumeration process may have a
significant computational complexity, and it is not obvious in advance how
accurately the problem of collecting statistics should be solved.
If we proceed from
the fact that the distances between the elements of the set are known, then we
can assume that the physical system for which the FNNG structure is analyzed is
determined by the distribution function of these distances. Then it would be
possible to study FNNG families corresponding to given distributions of
distances between points of the set. The question arises: how much does the
probability of realizing one or another FNNG structure depend on the type
distribution of distances between points?
Proposition 1.
The probability of realizing
the FNNG structure does not depend on the distribution of distances between
points.
Proof. As it is known from the
general course of probability theory, if
is a distribution function of random variable
, then a random variable
has a uniform distribution on
. Based on this statement, algorithms for generating a sample
with a given continuous density of the distribution function
are constructed, for which
is sample distribution function. To do this we can generate
uniform distributed set
on
, after that from the equation
the set of solutions
can be numerically found. This series is a sample subset from
by construction. Let
be a uniformly distributed set of distances, where
, and
is an amount of numbers in the
set. This set corresponds to a structure of certain FNNG. Due to the monotony
of the function
the inverse function is also
monotonic with the same direction of monotony as the forward mapping. Hence,
the ordering of the quantities in the samples
and
is the same. Therefore, the
graphs of the corresponding nearest neighbors also coincide. So, the
probabilities of FNNG structure realization do not depend on the distribution
of distances between nodes.
Proposition 1 is proved.
From this it follows
that in numerical analysis it is enough to collect statistics of FNNG for a
uniform distribution of distances, without discussing yet whether it is
possible to implement such realization for points of Euclidean space at all (on
a straight line, for example, it is impossible). This will allow tabulating the
probabilities of the realization of a particular structure.
Suppose that for any
N, all available structure realizations of each nearest neighbor graph
are listed up to isomorphism. A structure is a specific collection of fragments
with a given number of nodes.
Let the amount of
such structures be equal to
, the probability of realization
of each of them is known (for example, from a computational experiment) and is
equal to
. The value of
is a number of ways, which the connected set of
points can be divided into
subsets,
, with a number of nodes
and with the condition
. Ideally, this is the solution to the problem of a particular
graph realization. The distribution of degrees of nodes for each structure is
known by construction, we denote it
,
. This distribution is
considered precisely within the particular structure, so that
. Then the distribution of a random graph by nodes degree is
.
The problem is that
the determination of
is associated with a complete
search and has exponential complexity, which becomes for large
an insurmountable obstacle. It follows from the above that
the number of possible FNNG structures up to isomorphism is related to the
number
, ways of splitting a given natural number
into a sum of natural numbers. An asymptotic estimate of the
number of partitions is given by the Hardy – Ramanujan formula [15], which
approximately has the form:
.
The number of disconnected
fragments is determined by the following sentence.
Proposition 2.
Amount
of different fragments, in the form of which the graph of the
first nearest neighbors of
points can be realized,
asymptotically equal to the principal part of the derivative of the number of
partitions
by the number of nodes:
.
Proof. Number of structures
is an amount
of the representation of the
number
in form of sum of natural numbers minus
the number of impossible graph structures in case of this problem. Isolated
points, of which there may be one, two, etc., are impossible. Let's denote this
number of invalid structures by
. Then
. Note now that the number of different structures is taken
into account up to isomorphism. This means that there is
options when there is one isolated node,
options for two isolated vertices, etc. Hence, the number
is presented as sum
. Then from here and from equality
it follows that
.
Thus,
.
From formula (1) for
it follows that
, where
. Let's obtain the main part by
N
in this expression. We have the following transformations:
from which it follows that the
main part
is equal to
. Since here
, we get the result (2).
Proposition
2 is proved.
The illustration of
this proposition is presented in Fig. 1 (a, b). As it is known [16], the
problem of natural numbers decomposition is closely connected with asymptotical
semi-circle Wigner distribution. These figures correspond to situation, where
the amount [N/2] of nodes is presented as a set of disconnected graphs.
One can see from one million numerical experiments for generating FNNG
structures with uniform distribution distances for cases
and
, that the distribution of
probability of the fragment numbers tends to Wigner distribution.
The exponential
growth in the number of ways to divide a connected graph into fragments leads
us to the fact that it is necessary to abandon a detailed consideration of the
graph structure and move on to its more enlarged characteristics. Let
be a probability of the nearest neighbor graph realization in
the form of
fragments and
is a distribution of these fragments by the number of nodes,
i.e. it is the probability that in the graph of
nodes, which make up into
fragments, there is a structure with a given amount of nodes
.
Fig. 1. The distribution of FNNG
structures on subgraphs and nodes:
(a-left) N = 200; (b-right) N =
1000
To describe this
graph-family we will need the key object – the number of non-isomorphic
connected graphs with a given number of nodes. Let 's denote this value
. It can be used to express the number of other non-isomorphic
structures. If, for example, the graph consists of two fragments with the
number of nodes in them equal to
and
, then the number of
non-isomorphic graphs in such a family is equal to
. Due to the independence of FNNG from the type of distribution
of distances between nodes, the probability of each graph within this family is
the same and equals
.
We emphasize that
the distribution of nodes by degrees depends on the configuration, which,
although equally probable within a given partition on fragments with a given
number of nodes, is naturally different for different partitions. Therefore,
the distribution of nodes by their degree corresponding to a certain
configuration of
fragments (regardless of the
specific realization of these fragments) in the form of a certain node
distribution, is not uniquely determined. It itself has some distribution
: this is the probability that in this configuration the
degree
has the frequency of realization
. In practice, the probability
is estimated based on the results of experiments during which
the realization of
fragments was obtained. In each
experiment the number of
nodes, that have a degree
, can be found, so that the corresponding empirical frequency
is equal to
. The set
forms an empirical distribution
of node number
, with a degree
in a given configuration
.
If we now consider
random realization of FNNG on
nodes, then the distribution of
its vertices by degrees has the form
.
Each realization is
a selection of pairs
with conditions
: the number of nodes is equal to
N, the sum of the
degrees of the nodes is also equal to
N.
To build a benchmark
of FNNG statistics the set of graphs with the node number from 100 to 1500 with
step 100 was modeled. Statistics were collected based on the results of one
million graph realization of each size.
Further, based on
the results of the simulation, we can answer the question, what is the
probability
that a random FNNG contains a given number
of nodes with given degree
by incoming edges. For example, the distribution
with
is presented in Fig. 2 for
.
Fig. 2. Distribution of the
number of nodes with degree
for
N
= 300
Similarly, it is
possible to construct distributions nodes by the first degree, second, etc. If
we arrange these distributions on the same plane in the form of a “surface”
, which is a set of distributions
, then we get the phase space of possible realizations FNNG.
At the same time, not every trajectory is possible
, but only one for which the conditions are satisfied
.
Example of
distribution
is shown in Fig. 3 for
.
Fig. 3. Distribution of nodes by
degrees for
N
= 1000
Surfaces
represent a numerically obtained reference base for analysis
of FNNG. For example, for
the most likely realization of
a random FNNG has a probability of approximately
. On the other hand, analyzed in [17] task of recognizing the
authors of literary texts of particular one hundred authors leads to the line
"degree of nodes - number of nodes" in the form
. It follows from the benchmark data that the probability of
such a random graph is
, it is ten thousand times less
than the probability of a typical nearest neighbor graph. This indicates that
this particular sample is formed, in any case, by objects with a high degree of
correlation, and a similar distribution should not be expected with another
sample of this type. If the graph was typical, it would not make sense to look
for the reasons of such connections. For this task, it is possible to discuss
the reasons of the appearance of specific clusters. Note also that the number
of disconnected FNNG fragments in the recognition task from [16] turned out to
be equal 23, which is close to the most likely
according to the Wigner limit distribution.
An important result
of the numerical analysis is that the empirical distributions
turned out to be very close to normal. For example, for the
density of the distribution of the number of nodes of degree 0 in a graph from
nodes with the determination above 0,999, we get an
approximation (normalized by the number of nodes)
.
For nodes with the
first degree, the distribution has the form
.
Distributions for
other degrees of nodes have a similar form. Note now that these distributions
are bound by the condition
, which mean that they are
correlated. For example, the joint distribution of degrees 0 and 1 has the form
(see Fig. 4):
Fig. 4 – Joint distribution of
nodes by degrees 0 and 1 for N = 300
The main part of
this distribution has the form of an ellipse stretched along the diagonal. This
means that the empirical joint distribution of the number of nodes by degrees
can be represented by a multidimensional normal distribution:
,
Where
is a determinant of the covariance matrix, and
is an element of the inverse covariance matrix.
The fact that the
distribution of degrees of nodes is Gaussian follows from the local limit
theorem of Moivre-Laplace applied to the polynomial distribution of vertices by
degrees. Indeed, if
is a probability that the node
in FNNG has the degree
, then the probability that
nodes from
have a degree
, equal to
, what has Gaussian asymptotics
in the limit of large
. So, the distribution of the
form (8) is quite expected for correlated normally distributed random variables
and confirms the correctness of the computational experiment. A nontrivial
computational fact turned out to be that the probabilities
, appearing in (8), themselves have a normal distribution for
sufficiently large values
. According to the results of
the analysis of graphs with the number of nodes
from 300 to 1500, based on a million experiments, the
following values were obtained for each variant of the number of nodes
, which are independent on
:
These quantities have a Gaussian
approximation with determination 0,999:
.
As a result, it
turns out that both the normalization of the distribution (8) and the sum of
the average values of the degrees equal to
N.
This study is
devoted to the development of a method for statistical analysis of graph
structures using the example of nearest neighbor graphs. The aim of the work,
which is still far from complete, is to construct a theory of the sampling
method applying to graphic structures. In particular, here we consider the
problem of estimating the probability that the observed graph has a typical
structure in terms of the number of disconnected subgraphs and the distribution
of nodes degree. In case of FNNG, these results will allow more correct
interpretation of pattern recognition using this heuristic method.
These results are
very useful for practical analysis of any concrete structure of objects system
under the base supposition, that these objects are non-correlated with each
other. If the number of nodes in each separate sub-graph corresponds to the
mode of distribution function of the nodes and also the distribution of nodes
powers is approximately equal to theoretical Gaussian distribution with the
given accuracy, then we may confirm, that these objects are independent. But in
the opposite case we can estimate the probability of this graph realization and
conclude, that the observed structure corresponds to certain connections
between objects.
In addition, the
study of the frequency of occurrence of certain FNNG structures is closely
related to the classical problems of combinatorial geometry. For example, an
important task is to estimate the dimension of the embedding space of problem
parameters by the matrix of distances between them. The maximum number of
nearest neighbors for any of the points under study can be used to estimate the
space dimension from below. Note that this problem has not been solved for
spaces of arbitrary dimension. We also point out the connection between FNNG
statistics and set splitting and Voronoi diagrams, which are encountered in
various applications of discrete mathematics. It seems that numerical
experiments in these directions could lead to the appearance of practically
useful heuristics.
Another task area
arising in the statistical analysis of FNNG structures - is the construction of
asymptotic formulas for dependent distributions. Such solutions would make it
possible in some cases to circumvent the practical difficulties associated with
a complete enumeration of options that have exponential complexity.
1.
Hastie T., Tibshirani R., Friedman J.
The Elements of
Statistical Learning. – Springer, 2001. – 764 p.
2.
Chen
G. H., Shah D.
Explaining the success of nearest neighbor methods in prediction. Foundations
and Trends in Machine Learning, 10(5-6):337–588, 2018.
3.
Miller
G. L., Teng S., Thurston W. and Vavasis S.A.
Separators for Sphere-Packings and
Nearest Neighbor Graphs. Journal of the ACM, Vol. 44, No. 1, 1997, pp.
1–29.
4.
Eppstein
D., Paterson M., Yao F.
On Nearest-Neighbor Graphs. Discrete & Computational
Geometry, 1992, pp. 1-20.
5.
Baranchuk
D., Persiyanov D., Sinitsin A. and BabenkoA.
Learning to routein similarity
graphs. Proc. of the InternationalConference on Machine Learning, pp.
475–484, 2019.
6.
Kiana
Hajebi, Yasin Abbasi-Yadkori, Hossein Shahbazi and Hong Zhang.
Fast Approximate Nearest
Neighbor Search with k-Nearest Neighbor Graph. Proc. of 22-d International
Joint Conference on Artificial Intelligence, 2009. P. 1312-1317.
7.
Erdos
P., Renyi A.
On random graphs. Publ. Math. Debrecen, 1959. V. 6. P. 290-297.
8.
Kolkin
V. F.
Sluchaynie graphi (in Russian). – Ì.: FizMatLin, 2004. – 256 c.
9.
Bollobas
B.
Random
Graphs. – Cambridge University Press, 2001. – 520 p.
10.
Janson
S., Luczak T., Rucinski A.
Random graphs. – New York: Wiley, 2000. – 333 p.
11.
Raygorodski
A. M.
Modelirovaniesluchaynihgrafoviihprimenenie. TrudiMFTI, 2010. Ò. 2. ¹ 4. Ñ.
130-140.(in Russian)
12.
Raygorodski
A. M.
Modeli Interneta (in Russian). – Dolgoprudniy, «Intellekt», 2013. – 64 p.
13.
Barabasi
L.-A., Albert R.
Emergence of scaling in random networks. Science, 1999. V. 286. P.
509-512.
14.
Leskovec
J., Chakrabarti D., Kleinberg J., Faloutsos C., Gharamani Z.
Kronecker graphs: an
approach to modeling networks. J. Machine Learning Research, 2010. V. 11.
P. 985-1042.
15.
Hardy
G.H., Ramanujan S.
Asymptotic Formulae in Combinatory Analysis. Proc. London Math. Soc.,
1918, vol. 17, pp. 75-115.
16.
Vershik
A. M., Kerov S. V.
Asymptotika mery Plansherelia symmetricheskoi gruppy predelnaia forma
tablits Yunga (in Russian). Doklady AN USSR, 1977. V. 223 ¹ 6. P.
1024-1027.
17.
Kislitsyn
A. A., Orlov Yu.N.
Investigation of the nearest neighbor graph statistics (in Russian).
Keldysh Institute Preprints. 2021. ¹ 85. 23 p.