Posted on 14 June 2016 by Anton Alipov

Categories: Art, History

Contextual Analysis of Voynich Objects - Part II

NOTE: This article is currently in the state of public draft. Please feel free to add your comments below.

1. Introduction

This article continues research initiated in the previous post (hereinafter also referred to as "Part I") and moves on to discuss Voynich plants following the same contextual analysis methodology. As noted in the cited article, there are 129 botanical folios in the VMS, which are: f1v, f2r - f11v, f13r - f56v, f57r, f65r, f65v, f66v, f87r, f87v, f90r1, f90r2, f90v1, f90v2, f93r - f94v, f95r1, f95r2, f95v1, f95v2, f96r, f96v. Each of those seems to discuss a certain plant (f42r and f87v seem to discuss two different plants each), and none of the plants has been reliably identified to date. Generally, the purpose of this post is to pick out a subset of plants to focus the identification effort on - in other words, those plants which (probably) have a better chance of being identified than others (which may be rare or even imaginary). Let us call this target subset of plants the "focal set".

As in Part I, we use the folio notation scheme adopted in the Voynich Query Processor (VQP) for the sake of consistency, all the more so because we also use this tool extensively in the present article.

We use the term "vord" as a synonym for "Voynichese word".


2. Hypotheses

For the sake of this post we adopt the following hypotheses from Part I:

H1. The VMS is a meaningful text and not a random text hoax

H3. A given vord represents the same plain text on any page

H4. The Voynich "botanical section" folios discuss plants

The reader is encouraged to refer to Part I for detailed discussion of these hypotheses.


3. General idea

The idea of detecting the focal set of plants relies upon the concepts of "useful" plants and "rare" plants. The concept of "usefulness" was already introduced in Part I in regard to the Voynich "stars" (labeled objects in f68r1 and f68r2). In particular, a star is considered "useful" if it occurs anywhere beyond the diagrams of f68r1 or f68r2. Likewise, let us consider a plant "useful" if it occurs anywhere beyond the folio discussing that plant. (The reader may inquire how we know where the plant's name is located in the botanical folio. We will touch on this problem below).

Recall that in Part I we introduced the notion of "applications" of Voynich objects. Using this terminology, useful plants may have applications other than botanical (e.g. if they are mentioned in balneological or recipe sections), while useless plants have only botanical application by definition. Of course, a certain plant may be mentioned in more than one botanical folio - thus being useful but, at the same time, having no other applications except the botanical one.

In Part I, we listed which stars are mentioned in which botanical folios, assuming that if a star is mentioned in a botanical folio, then the respective plant is "associated" with this star in some way. So let us call the plant "rare" if it is not associated with any Voynich star. The rationale behind this terminology is that if a star's occurrence in a botanical folio stands for association indeed, then failure to associate the plant with any star probably means that the plant is very rare (and thus not used in practice) or imaginary. Anyway, for now this terminology is formal, and only later developments may prove or disprove its lexical validity.

Moving from form to substance to test the aforementioned (in)validity, we suggest that rare or useless plants are not the best candidates for identification, while those plants which are both useful and not rare should comprise the focal set in question. In this way we narrow the scope - instead of spreading botanists' attention over 129 plants, we will focus it (and wait for what happens) on... how many plants?.. let us see.


4. Where is the plant's name?

4.1. Pharmaceutical section labels as potential plant names

Since plants depicted in the "pharmaceutical" section of the VMS have labels, it is a natural assumption that those labels may stand for plant names. Indeed, this is the point of view that some researchers adopt. E.g., Koen Gheuens argues [1] that the pharmaceutical section labels represent plant names, or phrases derivative from those, in a certain "local" language - as opposed to the non-local language presumably used to name plants in the botanical section of the VMS.

However, there are a number of considerations which speak against the pharmaceutical section labels being plant names, as follows.

1) Many of the pharmaceutical section labels are unique vords. E.g. in f88r, even if we exclude the labels that might be attributed to jars, 38% of labels are unique; in f88v, using the same principle, 50% of labels are unique.

2) Furthermore, not all non-unique labels are mentioned in the botanical folios. In f88r, 25% of non-unique labels, and in f88v, 40% of non-unique labels, are those which are mentioned only outside of the botanical folios.

Considering 1) and 2), we can state that for a book whose opening part is an extensive herbal, it is strange to have so many herbs not mentioned in the application section. This might not be that strange, however, if the pharmaceutical section is not a derivative one, but a section on its own, especially one originally belonging to another author. Marco Ponzi provides [2] an example of a manuscript where the overlap between two sections dealing with plants is quite small.

But here is where yet another consideration comes in:

3) Some labels are re-used through the pharmaceutical section. For example, otoldy is used in f89r1, then in f89r2 (but here it can be attributed to the jar), but also in f99r and f99v, where it labels two roots of entirely different appearance.

Given these three considerations together, one needs a rather far-fetched explanation to defend the proposition that the labels stand for plant names. Note, though, that the proposition that they may stand for derivative phrases does not suffer. However, we are interested in plant names, not in the derivatives. In other words, we are interested in "teak", and not in "teakwood".

This is not to deny the usefulness of the contextual analysis of the pharmaceutical section labels. Probably we will perform that in another, dedicated, post of this series.

4.2. First vords of botanical folios as potential plant names

For now, we turn to the old assumption (adopted e.g. by Stephen Bax in his analysis) that the first vord of a botanical folio represents the name of the respective plant. While this assumption is not unreasonable in itself, it may be true or false of course, so some kind of test is required to check it. We propose a frequency count as such a test. If first vords of botanical folios tend to be frequent in the corpus, then it is less probable that they stand for plants' names, because the latter would be some special terms which ought not to occur tens or hundreds of times. Of course some practically important plants may be mentioned quite often in the manuscript, but this picture would not be observed for the majority of plants. So, a large count of the first vord would disprove our assumption. On the other hand, if the first vord is generally rare or even unique, then this, at least, does not disprove the "plant name" assumption.

The question arises which count should be considered small or large. To resolve it, let us compare the first vord's count with counts of vords occupying other positions in folios. For example, if the first vord is the plant's name (and thus is infrequent), then the second vord would, most probably, not be the plant's name (or any other specific term) and we can expect it to be more frequent. However, we should provide for cases when the plant's name is a composite one. For an English language example, we have "Rose" and we may have "Rose Mary" (rosemary), "White Rose", "Black Rose" etc. The third vord of a folio is even less likely to be part of the plant's name (if the latter opens the folio indeed), so the count of the third vord is also quite useful for our purpose. To complete the picture, we observe counts of the last vords of folios, which are even more likely to be general words.

Before we move to the results of this test, let us explain the methodology that we use.


5. Methodology

As noted in Section 1 above, VQP was used to obtain frequency counts and to track words' occurrences. Unless explicitly specified otherwise, only exact matches were accounted for. The limitations of VQP were explained in Part I, and most of them apply to the present investigation as well.

The raw data, unless explicitly noted otherwise, are represented in the attached dataset file. A certain degree of care was dedicated to checking the correctness of the vord readings suggested by the VQP. Where the VQP's reading was deemed to be wrong, the correct reading was recorded and/or counted; these cases are marked in red font in the dataset file. Where the VQP's reading was deemed to be ambiguous, no reading was recorded and the respective position was excluded from the total count; these cases are marked in yellow fill in the dataset file.

In no case did the number of ambiguous cases exceed a few percent of the overall counts, so those ambiguities have no effect on the conclusions derived from the counts.

When detecting paragraph starts in the recipe section, marginal stars were considered as markers of those in the cases of ambiguity.

In the dataset file, the word "or" formatted in italics stands for the English conjunction "or", not for the Voynichese "or".


6. Vord statistics with respect to the position in the folio

For an introductory reference, using the Voynich Reader software, choosing the Takahashi transcription and excluding dubious words from the count, one finds that the corpus contains 34432 vords, with a dictionary size of 6818 vords. Of course, the former figure is much higher because many vords are non-unique, i.e. they occur more than once in the corpus. 4564 vords of the 6818 vord dictionary are unique. Thus only 4564/34432 = 13,3% of total vord occurrences are occupied by unique vords.
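The three figures quoted above (total occurrences, dictionary size, number of unique vords) can be recomputed from any tokenized transcription. The following is a minimal sketch, assuming the transcription has already been split into a flat list of vords. The function name corpus_stats and the toy input are hypothetical and are not part of the Voynich Reader software.

```python
from collections import Counter

def corpus_stats(vords):
    """Return (total occurrences, dictionary size, number of unique vords)."""
    counts = Counter(vords)
    total = sum(counts.values())          # every occurrence is counted
    dictionary_size = len(counts)         # distinct vords
    unique = sum(1 for c in counts.values() if c == 1)  # vords seen once
    return total, dictionary_size, unique

# Toy input, invented for illustration; not real transcription data.
toy = ["daiin", "chol", "daiin", "tchedy", "tor", "chol", "cphy"]
print(corpus_stats(toy))

# Share of occurrences taken up by unique vords, from the figures quoted above.
share = 4564 / 34432
print(f"{share:.1%}")
```

Note that the share is taken against the number of occurrences (34432), not against the dictionary size; against the dictionary, unique vords would make up 4564 of 6818 entries.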

Let us now turn to the subject statistics.

The sheet "Botanical folios" in the dataset file contains information as to the vord counts in the botanical folios. First of all, it specifies whether the first, the second, the third and the last vords of the folio are unique or not. As it appears, of 128 non-ambiguous first vords 85 (or 66,4%) are unique. In contrast to that, only 33,3% (42 of 126) of second vords, 33,6% (42 of 125) of third vords, and 34,6% (44 of 127) of last vords are unique. (The slight difference in the total counts is explained by the different number of ambiguous readings - which, as explained above, are excluded from consideration).

Not only is a much higher percentage of first vords of botanical folios unique (as compared with second, third and last vords), but also those first vords which are not unique exhibit much lower frequency counts than non-first vords. Non-unique first vords have an average count of 5,8. For second vords this figure equals 81, for third vords - 84, and for last vords - 169.

The contrast is apparent. First vords of botanical folios exhibit behaviour different from that of the vords in other positions. But is it really an indicator that those vords stand for specific terms (and not commonplace words)? May we not attribute this behaviour to the specific position of those vords in a folio rather than to their semantics? Perhaps the encoding process is such that the first vord of a folio generally exhibits a lower degree of repeatability? To check this, let us look at the behaviour of first vords of folios of other sections. Is it consistently the same? If it is, then that may mean one of two things:

a) first vords of folios of other sections are also some specific terms, relevant to those sections;
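The two measures used above (the share of unique vords at a given position, and the average corpus count of the non-unique ones) can be sketched as follows. This is an illustration under stated assumptions, not the procedure actually run in VQP: the function position_stats and the toy folios are hypothetical, an ambiguous reading is represented by None, and corpus counts are supplied as a plain mapping.

```python
from collections import Counter

def position_stats(folios, corpus_counts, position):
    """Share of unique vords and mean count of non-unique vords at a position.

    folios: list of vord lists, one per folio; None marks an ambiguous reading.
    corpus_counts: mapping vord -> occurrences in the whole corpus.
    position: 0, 1, 2 for the first, second, third vord; -1 for the last.
    """
    counts = []
    for vords in folios:
        if not -len(vords) <= position < len(vords):
            continue  # folio too short to have this position
        vord = vords[position]
        if vord is not None:  # ambiguous readings are excluded from the total
            counts.append(corpus_counts[vord])
    if not counts:
        return None, None
    repeated = [c for c in counts if c > 1]
    share_unique = counts.count(1) / len(counts)
    mean_repeated = sum(repeated) / len(repeated) if repeated else None
    return share_unique, mean_repeated

# Toy folios, invented for illustration only; the toy "corpus" is just these folios.
toy_folios = [
    ["tchedy", "daiin", "chol", "dar"],
    ["kshody", "chol", "daiin", "daiin"],
    [None, "daiin", "shol", "dar"],
    ["chol", "dar", "daiin", "tor"],
]
toy_counts = Counter(v for f in toy_folios for v in f if v is not None)

for label, pos in [("first", 0), ("second", 1), ("last", -1)]:
    print(label, position_stats(toy_folios, toy_counts, pos))
```

Excluding ambiguous positions from the denominator is what makes the totals differ slightly between positions (128, 126, 125 and 127 in the figures above).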


b) first vords of folios owe their uniqueness not to their meaning, but to their position in the folio.

The sheet "Balneo folios" provides statistics for the balneological section. It appears that 60,0% of first vords of balneological folios are unique, and the average frequency count of non-unique first vords is 19.

The sheet "Recipe folios" provides statistics for the recipe section. 72,7% of first vords of recipe folios are unique, and the average frequency count of non-unique first vords is 3.

As to the astrological section, as reported in [3], only 12,5% of those folios which contain standalone text paragraphs have their first vord unique.

The pharmaceutical section was not analyzed.

So while the behaviour is not consistent across all sections of the VMS, the botanical section is not alone in exhibiting this behaviour.

To investigate the potential option b) further, we need to distinguish between folios and paragraphs. Indeed, the first vord of a folio is, at the same time, the first vord of the first paragraph in a folio. So which position matters (if any) - to be the first vord in a paragraph, or only to be the first vord in a folio?

To answer this question, first vords of second paragraphs of recipe folios were analyzed (refer to the sheet "Recipe folios 2nd paragraphs"). 60,9% of them turned out to be unique, and the average count of non-unique vords is 125.

First vords of third paragraphs of recipe folios were also analyzed (refer to the sheet "Recipe folios 3d paragraphs"). 59,1% of them turned out to be unique, and the average count of non-unique vords is 110.

One may note high counts of non-unique vords here, as compared to the respective figures above. This is caused by occurrences of dain and daiin in two second paragraphs and an occurrence of daiin in one third paragraph. daiin is the most frequent vord (count of 864), and dain also boasts a high count of 211. Excluding daiin and dain from the calculation, one observes average counts of non-unique first vords of 7,4 for second paragraphs and 15,9 for third paragraphs of recipe folios.
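The effect described above, where one or two very frequent vords dominate an average, can be illustrated with a small sketch. Only the counts of daiin (864) and dain (211) are taken from the text; the other vords and their counts are invented placeholders, and the function mean_count is hypothetical.

```python
def mean_count(first_vords, corpus_counts, excluded=()):
    """Mean corpus count of non-unique first vords, optionally skipping some."""
    values = [corpus_counts[v] for v in first_vords
              if corpus_counts[v] > 1 and v not in excluded]
    return sum(values) / len(values)

# daiin and dain counts as quoted above; vord_a, vord_b, vord_c are invented.
corpus_counts = {"daiin": 864, "dain": 211, "vord_a": 4, "vord_b": 9, "vord_c": 1}
first_vords = ["daiin", "dain", "vord_a", "vord_b", "vord_c"]

print(mean_count(first_vords, corpus_counts))                              # 272.0
print(mean_count(first_vords, corpus_counts, excluded={"daiin", "dain"}))  # 6.5
```

With so few non-unique vords per sheet, a median would be less sensitive to such outliers than the mean; the figures in this article are means.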

First vords of fourth etc. paragraphs of recipe folios were not analyzed.

Additionally, first vords of second paragraphs of balneological folios were briefly analysed (refer to the sheet "Balneo folios"). 50% of them are unique.

50-61% is still a comparatively high figure. So, with respect to the recipe and balneological sections, if position matters, it is the position in the paragraph, not the position in the folio.

Let us now return to the botanical section again (refer to the sheet "Botanical by paragraphs"). The statistics of first vords of paragraphs are as follows:

  • 49% of first vords of all paragraphs are unique;
  • as already indicated above, 66% of first vords of first paragraphs (or, which is the same thing, first vords of folios) are unique;
  • only 31% of first vords of non-first (second, third etc.) paragraphs are unique.

Therefore, if position matters, then the high degree of uniqueness of first vords of botanical folios (as compared to the second, third or last vords of folios) is not due to their position in the paragraph, but rather due to their position in the folio.

Admittedly, fourth and subsequent paragraphs of the recipe section were not analyzed, and neither were most non-first paragraphs of other sections. Even so, the assumption "position matters" already leads to contradictory results: for the recipe and balneological sections it would be the position in the paragraph that matters, while for the botanical section it would be the position in the folio. Assuming a consistent set of encoding rules across the entire VMS, the assumption "position matters" is thus probably wrong. Hence the comparatively high degree of uniqueness of first vords of botanical folios is probably attributable to their specific meaning - which does not disprove the hypothesis that they represent plant names. So we shall explore this hypothesis further.

To emphasize the fact that we have no solid proof of first vords of botanical folios representing plant names, but we rather have a working hypothesis, we shall speak of those first vords not as of "plant names", but as of "potential plant names" (PPN's).


7. Usefulness of Voynich plants

Assuming that the first vord of a botanical folio represents the respective plant's name, let us estimate the "usefulness" of Voynich plants (in the sense of the word introduced in Section 3 above). As indicated in Section 6 above, 43 of 128 non-ambiguous PPN's are non-unique and are mentioned elsewhere beyond their position. It is worth noting that a PPN holding the first position in a given folio is never repeated again in the same folio. Further, some of the non-unique PPN's are mentioned in the botanical section exclusively. Only 28 PPN's are mentioned beyond the botanical section and thus only those 28 plants are termed "useful". They are the plants of f3v, f5r, f8r, f11r, f13r, f14r, f19r, f21r, f26v, f28r, f31v, f32v, f37v, f40v, f41v, f42v, f44v, f52v, f54r, f54v, f55r, f56r, f65r, f65v, f87v, f94r, f95r2 and f96r.

Particularly "useful" are the plants of f94r (PPN tchedy) with its 28 occurrences outside the botanical section, f65v (PPN cphy) with its 15 occurrences outside the botanical section, and f96r (PPN tor) with 10 occurrences outside the botanical section.
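The three-way split used in this section (unique PPN's, PPN's repeated only within the botanical section, and "useful" PPN's found beyond it) can be sketched as below. This is an illustration of the classification rule as applied in this section, not the actual VQP workflow; the function classify_ppn, the shortened folio set and the occurrence lists are hypothetical.

```python
def classify_ppn(home_folio, occurrences, botanical_folios):
    """Classify a potential plant name by where else the vord occurs.

    home_folio: the botanical folio that the vord opens.
    occurrences: folio ids of every occurrence of the vord (one per hit);
                 assumed to include the opening occurrence itself.
    Returns "unique", "botanical only" or "useful".
    """
    elsewhere = list(occurrences)
    elsewhere.remove(home_folio)  # drop the opening occurrence itself
    if not elsewhere:
        return "unique"
    if all(folio in botanical_folios for folio in elsewhere):
        return "botanical only"
    return "useful"

# Shortened, illustrative folio set and invented occurrence lists.
botanical = {"f1v", "f2r", "f3v"}
print(classify_ppn("f3v", ["f3v"], botanical))           # unique
print(classify_ppn("f3v", ["f3v", "f2r"], botanical))    # botanical only
print(classify_ppn("f3v", ["f3v", "f103r"], botanical))  # useful
```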


8. Rarity of Voynich plants

As suggested in Section 3 above, we call a Voynich plant "rare" if the respective botanical folio does not mention any of the Voynich "stars". Given this, of 129 botanical folios only 37 represent rare plants, hence 92 folios, or 71,3% of the total, introduce "common" (as opposed to "rare") plants. (Refer to Part I for raw data of Voynich stars occurrences in botanical folios).

If we consider rarity of useful plants, then 20 of 28, or 71,4%, are common. The percentage is essentially the same, which suggests that there is no correlation between "usefulness" and "rarity" of Voynich plants.
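The comparison above can be restated as a quick arithmetic check: if "usefulness" and "rarity" were unrelated, the number of common plants expected among the 28 useful ones would be 28 times the overall share of common plants. The sketch below uses only the counts quoted in this section; the variable names are ours.

```python
total_folios, common_folios = 129, 92   # all botanical folios
useful, useful_common = 28, 20          # "useful" plants only

share_all = common_folios / total_folios
share_useful = useful_common / useful
# Common plants expected among the useful ones if the two properties
# were unrelated.
expected = useful * share_all

print(f"{share_all:.1%} {share_useful:.1%} {expected:.1f}")
```

The expected value of about 20 matches the observed 20, which is what "no correlation" amounts to here. With only 28 useful plants this is a coarse check rather than a statistical test.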

It is curious, though, that the most useful plant (PPN tchedy) is rare.


9. The focal set of Voynich plants

Selecting Voynich plants that are useful and common at the same time, we get the following list of 20 plants (folio - PPN):

  • f3v - koaiin
  • f5r - kshody
  • f8r - pshol
  • f11r - tshol
  • f13r - torshor
  • f14r - pchodaiin
  • f21r - pchor
  • f28r - pchodar
  • f31v - podair
  • f32v - kcheodaiin
  • f40v - pchedain
  • f42v - tcho
  • f44v - tsho
  • f54r - podaiin
  • f54v - pcheodar
  • f55r - podaiin
  • f65v - cphy
  • f87v - pcheey
  • f95r2 - kshedy
  • f96r - tor

Note that the PPN for f54r and f55r is the same, namely, podaiin. This is one of the few cases when PPN's repeat themselves. As suggested above, this might stand for the cases when a plant's name consists of more than one word. In f54r the name may thus be podaiin shodal, while in f55r - podaiin shekchy.

Now that we have defined the focal set of the Voynich plants, it is interesting to look at past attempts at identifying them. If these are really practically important and well known plants, then it is natural to expect various researchers to be more or less in agreement as to what those plants really are. The sheet "Identification" summarizes some past identification attempts:
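The two lists given in Sections 7 and 9 can be cross-checked with plain set operations. The sketch below uses only the folios and PPN's listed in this article; the full list of "common" folios is in Part I and is not reproduced here, so the useful-but-rare plants are derived as the difference between the two lists. Variable names are ours.

```python
from collections import Counter

# The 28 "useful" folios from Section 7.
useful = set(
    "f3v f5r f8r f11r f13r f14r f19r f21r f26v f28r f31v f32v f37v f40v "
    "f41v f42v f44v f52v f54r f54v f55r f56r f65r f65v f87v f94r f95r2 f96r".split()
)

# The focal set from Section 9 (folio -> PPN).
focal = {
    "f3v": "koaiin", "f5r": "kshody", "f8r": "pshol", "f11r": "tshol",
    "f13r": "torshor", "f14r": "pchodaiin", "f21r": "pchor", "f28r": "pchodar",
    "f31v": "podair", "f32v": "kcheodaiin", "f40v": "pchedain", "f42v": "tcho",
    "f44v": "tsho", "f54r": "podaiin", "f54v": "pcheodar", "f55r": "podaiin",
    "f65v": "cphy", "f87v": "pcheey", "f95r2": "kshedy", "f96r": "tor",
}

# Every focal plant must be useful; the remaining useful plants are the rare ones.
focal_is_subset = set(focal) <= useful
useful_but_rare = sorted(useful - set(focal))
# PPN's that open more than one focal folio.
repeated_ppns = [ppn for ppn, n in Counter(focal.values()).items() if n > 1]

print(len(useful), len(focal), focal_is_subset)
print(useful_but_rare)
print(repeated_ppns)
```

The difference contains eight folios, f94r among them, which agrees with the 20 of 28 figure in Section 8.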

  • anonymous Finnish biologist [4];
  • O'Neil and Holm as put down by Petersen [5];
  • Sherwood [6];
  • Steve D [7].

The lists by Sherwood and Steve D provide interpretations for all focal set folios, while the first two lists, being outputs from professionals in the field, contain fewer interpretations. The Finnish biologist makes only four suggestions out of twenty. This might be attributed, though, not to the researcher's inability to recognize the respective plants, but to his/her having paid no attention to these folios, for whatever reason. It is notable, nonetheless, that the four lists are seldom in agreement about identifications. Only two folios exhibit a 75% match ratio: f5r (Paris as identified by the Finnish biologist, Petersen and Steve D) and f54r (Cabbage thistle as identified by Sherwood, Steve D and possibly Petersen). No plant from the list is identified unanimously.
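The "match ratio" used above can be read as the size of the largest group of sources naming the same plant, divided by the number of sources compared (four). A minimal sketch under that reading follows; the function match_ratio is hypothetical, and the Sherwood entry for f5r is a placeholder, since the differing identification is not quoted in this article.

```python
from collections import Counter

def match_ratio(identifications):
    """Largest group of agreeing sources divided by the number of sources.

    identifications: mapping source -> plant name, or None when the source
    offers no identification (None never counts as agreement).
    """
    named = [name for name in identifications.values() if name is not None]
    if not named:
        return 0.0
    largest_group = Counter(named).most_common(1)[0][1]
    return largest_group / len(identifications)

f5r = {
    "Finnish biologist": "Paris",
    "Petersen": "Paris",
    "Steve D": "Paris",
    "Sherwood": "(a different plant)",  # placeholder, not the actual entry
}
print(match_ratio(f5r))  # 0.75
```

Dividing by the number of sources rather than by the number of sources that ventured an identification is a design choice: it keeps a folio named by only one list from scoring 100%.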


10. Summary and further directions

In this article, we performed some checks aimed at disproving the hypothesis that first vords of botanical folios represent plant names. Those checks failed to disprove this hypothesis. First vords of botanical folios exhibit behaviour different from vords occupying other positions, demonstrating a high degree of uniqueness and a low count of those vords which are non-unique. Comparison with vords occupying other specific positions in the VMS, such as first vords of non-first paragraphs of the botanical section, first vords of paragraphs of the recipe and balneological sections, or first vords of folios of the astrological section, suggests that the peculiar behaviour of first vords of botanical folios is not attributable to their position, but rather to some other reason - such as them representing plant names.

Using the introduced concepts of "usefulness" and "rarity" of Voynich plants, we put forward the "focal set" of Voynich plants - a subset of plants with potentially better chances for successful identification.

A couple of valuable side observations were made:

  • a PPN holding the first position in a given folio is never repeated again in the same folio - which supports the idea of a certain narration (or encoding) template;
  • first vords of balneological and recipe folios also exhibit a high degree of uniqueness.

Some results opposing the proposals of this article include:

  • the most "useful" plant of f94r strangely appears to be rare - i.e., not associated with any of the Voynich stars;
  • there has been very poor agreement in plant identification by different researchers across the proposed focal set - which is not the result expected from the very purpose of the focal set.

The former negative result may be explained if we suppose that not all first vords of botanical folios stand for plant names, but only some of those - albeit (possibly) most of them do. Indeed, it is only natural for a number of deviations from the general template to exist. This reasonable consideration warns a researcher not to treat PPN's too mechanically and, of course, introduces additional uncertainty.

The latter negative result may be explained by the circumstance that past identification attempts may have been performed on the assumption that the Voynich plants are depicted "as is", which is probably not the case. There are the following possibilities, as previously suggested by the Voynich research community:

  • Voynich plants are depicted from memory or from dried specimens;
  • Voynich plants are depicted with certain portions thereof over-emphasized to reflect the practical importance of those portions;
  • Voynich plants are depicted with certain portions thereof over-emphasized for mnemonic purposes.

It is suggested that a community identification attempt be performed across the focal set, with the following considerations in mind:

  • the above possibilities are taken into account and the plant images are not considered "as is", but rather in comparison with plant images in contemporary herbals;
  • the focal set is attempted to be matched to the set of most important plants mentioned in contemporary herbals or medical books.

The latter consideration is believed to be the key one for successful identification of the Voynich plants, as it potentially allows the search field to be reduced significantly. This is where the comparative analysis element of the contextual analysis paradigm comes into play.


11. Acknowledgements

The author would like to thank Koen Gheuens and Marco Ponzi for their valuable comments made during occasional discussions of advance portions of this article on the Voynich Ninja forum.


Comments

  1. Koen Gheuens · 14 June 2016, 17:16

    I like the idea behind this paper, Anton. Kind of marking a limited area for a focused attack on the plants.

    Personally, I believe the VM represents an older botanical tradition, mostly separate from the 15thC herbal tradition. But people who specialize in this field should be able to make use of your findings.

    I did find it informative to read your take on whether or not the first words can be assumed to be plant names.

    One question/remark: in comparing words between the stars folios and the plant folios, don’t you think homonyms could influence the stats? I think that in a relatively limited alphabet like Voynichese, with short words as well, you’ll end up relatively quickly with two words that are spelled the same but mean something else. Would this have an impact on your conclusions?

  2. Anton Alipov · 14 June 2016, 17:43

    Hi Koen,

    Thx for your comment.

    You are right that it’s not an easy problem to choose a pack of herbs which the Voynich plants should be tried to be matched against. It will be necessarily larger than the focal set, even if we stay within one herbal tradition. One may start with some candidates though, such as Auslasser’s herbal or “Conservatio Sanitatis” by Miechowsky and see where it leads, if anywhere.

    Regarding the homonyms – yes you are essentially right. I touched this problem in Part I, end of Section 13, when trying to trace the narration structure of botanical folios. Homonyms would introduce noise masking the narration structure.

    Applied to the task of Part II, homonyms are not a big problem though, because potential “mistaking” an homonym for a Voynich star only expands the focal set, not narrows it.

  3. D.N. O'Donovan · 27 June 2016, 14:28


    I have withheld from “voynichimagery” most of the work which I did on the botanical section, but am so impressed with your study that I should like to send the material to you, as possible comparative data set for you to consider.

    I did a full analysis of about forty of the drawings and was able to draw certain conclusions about them which might, conceivably, be of interest.


    btw – your spell-check seems to default to the German “vord” for English “word”

  4. Anton Alipov · 28 June 2016, 13:54

    Hi Diane,

    Thx for your comment and for your suggestion to provide the data. If any of your identifications happen to relate to the folios of the “focal set” described in my post, I will include them there. I may be contacted at my surname at IEEE dot org.

    The word “vord” is a shortcut for “Voynichese word” ( I’ve been actively promoting this :-)

  5. D.N.O'Donovan · 16 July 2016, 00:12

    Hi Anton,

    Sorry – I’d forgotten about this until now.

    As regards your data-range, I might just note that identifications by Sherwood were taken as the basis for subsequent collaboration between Sherwood and ‘Steve D’. I am not certain if it was so, but I’m told that Torasella also worked with that pair, as did Rene.

    In effect, members of that team, and especially Steve D and Sherwood represent one opinion, not several.

  6. Nikolaj · 27 May 2017, 15:56

    To the question about the Voynich manuscript. The text is written signs. Signs are used instead of letters of the alphabet one of the ancient languages. Moreover, in the text there are 2 levels of encryption. I found the key with which the first section I could read the following words: hemp, wearing hemp; food, food (sheet 20 at the numbering on the Internet); to clean (gut), knowledge, perhaps the desire, to drink, sweet beverage (nectar), maturation (maturity), to consider, to believe (sheet 107); to drink; six; flourishing; increasing; intense; peas; sweet drink, nectar, etc. Is just the short words, 2-3 sign. To translate words with more than 2-3 characters requires knowledge of this ancient language. The fact that some signs correspond to two letters. Thus, for example, a word consisting of three characters can fit up to six letters of which three. In the end, you need six characters to define the semantic word of three letters. Without knowledge of this language make it very difficult even with a dictionary.
    If you are interested, I am ready to send more detailed information, including scans of pages showing the translated words.
