Research Article: Graphemic-phonetic diachronic linguistic invariance of the frequency and of the Index of Coincidence as cryptanalytic tools

Date Published: March 19, 2019

Publisher: Public Library of Science

Author(s): Vicente Jara Vera, Carmen Sánchez Ávila, Lixiang Li.


Languages have inherent characteristics that make them distinct entities within their phyla and families. Even messages written in a given language and later encrypted by cryptographic systems do not lose all of these characteristics: some aspects remain that help the cryptanalyst recover the messages without knowing the decryption keys. To characterize the languages we will consider the frequencies of their graphemic and phonetic units and the Index of Coincidence, tools of fundamental utility in the field of Cryptography. Their diachronic invariance, that is, their survival over time within one language, and their ability to discriminate between languages will be analyzed. To do so, we will examine 261 texts drawn from a total of 101 languages. The texts are very diverse in style and period, spanning a wide linguistic and temporal spectrum that covers the period from the 6th century BC to the present day.

Partial Text

Cryptography is the applied science that designs and implements information protection systems by transforming original messages in readable language into encrypted messages that are meant to be impossible to decipher without the decryption key, although they remain susceptible to cryptanalytic attacks.
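As a hedged illustration of why encrypted messages remain open to such attacks, the sketch below uses a Caesar shift, chosen here only as a simple stand-in for a monoalphabetic substitution (the article does not name a specific cipher at this point). The function names `caesar_encrypt` and `frequency_profile` and the sample sentence are assumptions made for the example. The cipher relabels the letters but leaves the sorted list of letter counts unchanged, which is the kind of surviving characteristic a cryptanalyst can exploit.

```python
import string
from collections import Counter


def caesar_encrypt(text, shift):
    """Shift each letter A-Z by `shift` positions; drop every other character."""
    letters = [c for c in text.upper() if c in string.ascii_uppercase]
    return "".join(chr((ord(c) - 65 + shift) % 26 + 65) for c in letters)


def frequency_profile(text):
    """Letter counts sorted from most to least frequent, without the labels."""
    return sorted(Counter(text).values(), reverse=True)


# Illustrative sample text (an assumption, not taken from the article's corpus).
plain = "THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG"
plain_letters = caesar_encrypt(plain, 0)  # letters only, no shift
cipher = caesar_encrypt(plain, 3)

# The letters change, but the shape of the frequency distribution does not.
same_profile = frequency_profile(plain_letters) == frequency_profile(cipher)
```

Polyalphabetic and modern ciphers flatten this profile to varying degrees, so the sketch covers only the simplest case.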

From the total of phyla and languages [16] we will consider the Indo-European, Uralic, Altaic and Caucasian phyla, which are present on the great continent of Eurasia. In addition, we will study the Afro-Asiatic, Nilo-Saharan, Niger-Congo and Khoisan phyla from the African continent and the Arabian peninsula. Alongside them we must also consider the Austronesian phylum, which extends from Oceania, Southeast Asia, Polynesia and the Pacific Islands to the island of Madagascar, and from which we will take only the Malagasy language as a sample.

Among the 101 languages, we will take Latin and Spanish (Castilian) as cases for which we describe the selection of texts in greater detail and study how the graphemic and phonetic values of their Frequency and their Index of Coincidence vary over time.

We do not intend to deal with the variation of languages over time in this study, as it is a very complex subject with many more aspects to take into account than those we can collect and analyze here. Even the study of a few languages, or of a single language in particular, is rich and complex enough to fall outside the aim of this paper [76]. We will, however, make some notes on various aspects of certain tools used in Cryptology, namely the Frequency and the Index of Coincidence, applied to both graphemes and phonemes.
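For readers unfamiliar with these two tools, the following is a minimal sketch of how they can be computed over any sequence of units, whether graphemes (characters of a string) or phonemes (a list of symbols). It assumes the common unnormalized definition of the Index of Coincidence, the sum of n_i(n_i - 1) over all distinct units divided by N(N - 1), where n_i is the count of unit i and N the total number of units. The article may use a normalized variant, and the function names here are hypothetical.

```python
from collections import Counter


def relative_frequencies(units):
    """Map each distinct unit (grapheme or phoneme) to its share of the sequence."""
    counts = Counter(units)
    total = sum(counts.values())
    return {unit: n / total for unit, n in counts.items()}


def index_of_coincidence(units):
    """Probability that two units drawn without replacement are identical.

    Assumed (unnormalized) definition: sum(n_i * (n_i - 1)) / (N * (N - 1)).
    """
    counts = Counter(units)
    total = sum(counts.values())
    if total < 2:
        raise ValueError("need at least two units")
    return sum(n * (n - 1) for n in counts.values()) / (total * (total - 1))


# Graphemes: iterate over the characters of a string.
ic_graphemes = index_of_coincidence("AABB")

# Phonemes: pass a list, so multi-character symbols count as single units.
ic_phonemes = index_of_coincidence(["tʃ", "a", "tʃ", "a"])
```

Passing phonemes as a list rather than a string keeps symbols such as "tʃ" from being split into separate characters, so the same functions serve both the graphemic and the phonetic analysis.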

Throughout history, cryptanalysts have usually faced ciphertexts during periods of conflict, so the languages used by the opposing sides, and thus the base language of the original text, were known. Moreover, working with a particular language meant studying its synchronic properties and setting aside temporal, diachronic analysis, which was of little use in those circumstances.



