ISSN 2079-3537      

 
 
 
                                                                                                                                                                                                                                                                                                                                                                                                                                                                             

Scientific Visualization, 2025, volume 17, number 2, pages 110 - 122, DOI: 10.26583/sv.17.2.08

Analysis of the Error Structure in Identifying the Author of a Text Using the Nearest Neighbor Graphs

Author: M.Yu. Kislitsyna¹

Keldysh Institute of Applied Mathematics RAS

¹ ORCID: 0000-0002-2542-8914, voronina.miu@yandex.ru

 

Abstract

In this paper, the nearest neighbor graph method is used to analyze the relationships among a large number of multidimensional vectors, each of which represents the distribution of letter combinations (n-grams) in a text, with n = 3. The task considered is the authorship attribution problem, which belongs to the field of natural language processing. The nearest neighbor graph is built from the distribution of author patterns and visualizes regions of concentration and sparsity, which makes it possible to identify the structure of text classification errors. The corpus consists of more than 8 thousand authors and more than 100 thousand literary texts in Russian, including translations, which makes this one of the most extensive experiments with literary texts in Russian. Every author has at least five works in the corpus, each containing more than 10 thousand letters. An author's pattern is calculated by averaging the 3-gram statistics of that author's texts. Graphs are used to determine the error structure associated with the proximity of texts and authors to the average lexicon pattern. It is shown that the densest centers of the graph lie close to the average lexicon pattern, with varying degrees of proximity. The text recognition error for such authors is about twice as high as the error for authors who are far from the average lexicon pattern. Some literary genres, such as philosophical ones, are localized at specific distances.
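The pipeline summarized in the abstract (3-gram statistics per text, an author pattern obtained by averaging, and a nearest neighbor graph over the patterns) can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the L1 distance, the rule of dropping non-letter characters before forming 3-grams, and the toy Latin-alphabet strings are all assumptions made for illustration, since the abstract does not specify these details.

```python
from collections import Counter


def trigram_profile(text):
    """Relative frequencies of letter 3-grams in a text.

    Assumption: the text is lower-cased and non-letter characters are
    dropped before 3-grams are formed.
    """
    letters = [ch for ch in text.lower() if ch.isalpha()]
    grams = ["".join(letters[i:i + 3]) for i in range(len(letters) - 2)]
    counts = Counter(grams)
    total = len(grams)
    return {g: c / total for g, c in counts.items()} if total else {}


def average_profile(profiles):
    """Author pattern: the mean of the 3-gram profiles of the author's texts."""
    acc = Counter()
    for profile in profiles:
        for gram, freq in profile.items():
            acc[gram] += freq
    n = len(profiles)
    return {gram: value / n for gram, value in acc.items()}


def l1_distance(p, q):
    """L1 distance between two sparse profiles (an assumed metric)."""
    keys = set(p) | set(q)
    return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def nearest_neighbor_graph(patterns, dist=l1_distance):
    """Directed graph mapping each author to the nearest other author."""
    edges = {}
    for a, pa in patterns.items():
        candidates = [(dist(pa, pb), b) for b, pb in patterns.items() if b != a]
        edges[a] = min(candidates)[1]
    return edges


# Toy corpus with hypothetical authors; real texts would be far longer.
texts = {
    "A": ["abcabcabc", "abcabc"],
    "B": ["abcabcabd"],
    "C": ["xyzxyzxyz"],
}
patterns = {
    author: average_profile([trigram_profile(t) for t in works])
    for author, works in texts.items()
}
graph = nearest_neighbor_graph(patterns)

# In-degree per author: nodes with many incoming edges are one simple
# proxy for the "dense centers" of the graph (an assumed reading).
in_degree = Counter(graph.values())
```

In this sketch every author has exactly one outgoing edge, so concentration shows up as a high in-degree. A brute-force search over all pairs is quadratic in the number of authors; for a corpus of the size described in the abstract, an indexed nearest neighbor search would likely be needed.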

 

Keywords: authorship attribution problem, nearest neighbor graph, author.