Sunday, February 23, 2014

Murder and NLP: The Taman Shud Case, Part 1

Computational Linguistics Murder Mystery Theatre

Cracking the code found near the body of a dead man could help reveal the identity of his killer.

On December 1, 1948, the body of a man was found on Somerton Beach near Adelaide, South Australia. Nothing among his belongings identified the man (henceforth, Somerton Man), and no one ever came forward to identify him. He seemed otherwise healthy, and the cause of death was possibly a deliberate poisoning. The identity of his killer, if there was one, has not been established. It is possible that he was a Soviet spy, and possible that he was killed for reasons pertaining to espionage, but that is speculative. The case remains one of the strangest unsolved crimes in Australia's history.

A slip of paper was found in the dead man's pocket, containing only the words "Tamam Shud." This was identified as the final line of Omar Khayyam's Rubaiyat, and a man came forward saying that he found a copy of the book in his car near the beach the same day the body was found. It turned out to be the same copy the slip of paper was torn from. Written in the book was the phone number of a nurse who lived nearby. She denied knowing the man, but various claims have been made that she did know him and was lying.

What does this strange murder story have to do with Natural Language Processing?! Also written in the book was a sequence of letters, in five lines. There is some ambiguity in the handwriting, but one reasonable reading of four of them is as follows:

WRGOABABD
WTBIMPANETP
MLIABOAIAQC
ITTMTSAMSTGAB

In addition, a line which begins like the third line above was written and crossed out between the first and second.

These letters are not in any obvious way comprehensible, and it has been supposed that the letters might represent a cipher that, if broken, would shed some light on the case. From here forward, I will call this the Tamam Shud Cipher, or TSC, although it is not clear that it actually is a cipher, that is, an intentionally coded message, in the literal sense of the term. The TSC has not been decoded in a highly convincing way in whole or in part.

Following many hypotheses regarding the nature of the cipher, previous work, including that of the students of Derek Abbott at the University of Adelaide, has suggested that the letters may be initials from an English text.

In previous work, the frequency of letters in the TSC was compared to letter frequencies in other collections of text: first with samples of text in several languages, and next with the initial letters of words in those languages. The second is not the same as the first, because letters occur with different frequencies in different positions in words; for example, 'e' is the most common letter in English, but 's' and 't' begin more words than 'e'. Abbott’s students found that the TSC letter frequencies match those of English initials significantly better than those of English text overall, and also better than initials or text in any of several other languages. I have performed similar tests using different source texts and reached the same conclusion.
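To make the comparison concrete, here is a minimal sketch of that kind of test. It is not the method used by Abbott's students or by me; the distance measure (a plain sum of absolute differences), the helper names, and the tiny sample text are all assumptions for illustration. A real test would use a large corpus for each language.

```python
import re
from collections import Counter

# One reading of the four uncrossed lines, joined into a single string.
TSC = "WRGOABABD" + "WTBIMPANETP" + "MLIABOAIAQC" + "ITTMTSAMSTGAB"

# A word starts with a letter and may contain apostrophes ("don't").
WORD = re.compile(r"[A-Za-z][A-Za-z']*")

def normalize(counts):
    """Turn raw counts into relative frequencies."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def letter_freqs(text):
    """Relative frequency of every letter A-Z in the text."""
    return normalize(Counter(c for c in text.upper() if "A" <= c <= "Z"))

def initial_freqs(text):
    """Relative frequency of the first letter of every word."""
    return normalize(Counter(w[0].upper() for w in WORD.findall(text)))

def l1_distance(p, q):
    """Sum of absolute differences; an assumed, illustrative measure."""
    return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))

# Hypothetical stand-in for a corpus; far too small to conclude anything.
sample = "It was the best of times, it was the worst of times"

tsc = letter_freqs(TSC)
print("distance to initials:   ", l1_distance(tsc, initial_freqs(sample)))
print("distance to all letters:", l1_distance(tsc, letter_freqs(sample)))
```

The smaller distance marks the better fit; repeating this over several languages gives the kind of comparison described above.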

This provides evidence that the TSC is, in fact, an initialism, a sequence of initials from some specific text, in this case a short English text. However, this evidence falls short of proof. It shows that an English initialism is the best of the possibilities that were tested, and quite a few were tested, but it leaves open that an untested possibility would fit the TSC letter frequency just as well or better. It also leaves unexamined the possibility that the letters in the TSC are initials taken from English text but arranged in some other order.

A more definitive test of whether the TSC is an initialism from a specific English text is to examine short sequences of letters in the TSC and measure how they rank compared to sequences of initials from English text in general. Grammatical patterns in English make some sequences of initials more common than the same initials in another order, so by performing this study with the TSC and with variations of the TSC whose letters are scrambled in random order, we should see how likely it is that the TSC preserves the expected sequences.

We can use an arbitrarily large corpus of English to generate the initial-letter ngrams for English, but the TSC itself is short, and therefore it samples the space of ngrams very sparsely. This means that many kinds of statistical metrics will show a mismatch between TSC ngrams and corpus ngrams even if they are initialisms from the same language. For example, Pearson correlations of bigrams from a known, but short, English initialism and the corpus will come up negative due to the sparseness of the short string’s ngram matrix.
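As a rough illustration of what "initial-letter ngrams" means and how thinly a 44-letter string samples them, here is a small sketch. The function names and the sample sentence are assumptions for illustration, not the code or corpus behind the results in this post.

```python
import re
from collections import Counter

# A word starts with a letter and may contain apostrophes ("don't").
WORD = re.compile(r"[A-Za-z][A-Za-z']*")

def initials(text):
    """Reduce a text to the string of its words' first letters."""
    return "".join(w[0].upper() for w in WORD.findall(text))

def ngram_counts(s, n):
    """Count every run of n consecutive characters in s."""
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

# Hypothetical stand-in for a corpus.
sample = "It was the best of times, it was the worst of times"
seq = initials(sample)
print(seq)                                  # IWTBOTIWTWOT
print(ngram_counts(seq, 2).most_common(3))

# Sparseness: the TSC has 44 letters, so it holds at most 43 bigrams,
# while there are 26 * 26 = 676 possible bigrams of initials.
TSC = "WRGOABABD" + "WTBIMPANETP" + "MLIABOAIAQC" + "ITTMTSAMSTGAB"
print(len(ngram_counts(TSC, 2)), "distinct bigrams out of", 26 ** 2)
```

Even for n=2 most cells of the ngram table are necessarily empty for the TSC, which is why a correlation against the full corpus table is a poor fit for the job.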

A useful metric that is more sensitive when the string we are testing against a corpus is short is to generate the ngrams within the string, and calculate the mean of how high those ngrams rank among the corpus ngrams. This, in effect, gives the string credit for containing common initial ngrams, but doesn’t punish it for lacking other initial ngrams because it is simply too short to “get around to” them.
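One way to sketch this metric is below. Several details are my assumptions for the example rather than a record of the original computation: rank 1 is the most common corpus ngram (so a lower mean rank means a more English-like string), ties are broken by first appearance, and an ngram never seen in the corpus is given a rank one past the end of the table.

```python
from collections import Counter

def ngrams(s, n):
    """All runs of n consecutive characters in s, in order."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def rank_table(corpus_initials, n):
    """Map each corpus ngram to its frequency rank (1 = most common)."""
    counts = Counter(ngrams(corpus_initials, n))
    return {g: r for r, (g, _) in enumerate(counts.most_common(), start=1)}

def mean_rank(s, ranks, n):
    """Average corpus rank of the ngrams that occur in s.

    Only ngrams present in s contribute, so a short string is not
    penalized for the common ngrams it never had room to contain.
    """
    grams = ngrams(s, n)
    if not grams:
        raise ValueError("string is shorter than n")
    unseen = len(ranks) + 1  # assumed penalty rank for unseen ngrams
    return sum(ranks.get(g, unseen) for g in grams) / len(grams)

# Toy corpus of initials: AB and BA occur twice, AC once.
ranks = rank_table("ABABAC", 2)
print(mean_rank("ABZ", ranks, 2))  # AB has rank 1, BZ is unseen (rank 4)
```

With a real corpus, `corpus_initials` would be the string of first letters of every word in the corpus, in order.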

I generated 1,000 random shuffles of the TSC, and for the TSC and each shuffle, I calculated the mean rank of the string’s initial ngrams in terms of those generated from a corpus of 5 million words of English literature, for n from 2 to 5. If the letters in the TSC are initials selected at random from English texts, or if they were generated by some other means altogether, then the mean ranks for the TSC should fall at about the 50th percentile of the mean ngram ranks for the 1,000 random shuffles of the TSC. If, however, the TSC was generated as an initialism from a specific English text, then its mean ngram ranks should be significantly higher than the 50th percentile. The results follow:

N         TSC Percentile
2          85.6
3          92.3
4          96.4
5          93.6

We see that the results are convincing, particularly when n=4 (the matrices begin to become sparse for n=5, weakening the result). For many scientific purposes, the 95th percentile is offered as a standard of proof, and these results make a fairly convincing case that the TSC is an initialism, in correct order, of English text.
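The shuffle test described above can be sketched as follows. This is an illustration under stated assumptions, not the code that produced the table: the ranking convention (rank 1 is most common, unseen ngrams rank one past the end), the definition of percentile as the share of shuffles that score strictly worse, the fixed seed, and the toy corpus are all mine.

```python
import random
from collections import Counter

def ngrams(s, n):
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def rank_table(corpus_initials, n):
    """Map each corpus ngram to its frequency rank (1 = most common)."""
    counts = Counter(ngrams(corpus_initials, n))
    return {g: r for r, (g, _) in enumerate(counts.most_common(), start=1)}

def mean_rank(s, ranks, n):
    grams = ngrams(s, n)
    unseen = len(ranks) + 1  # assumed penalty rank for unseen ngrams
    return sum(ranks.get(g, unseen) for g in grams) / len(grams)

def shuffle_percentile(s, corpus_initials, n, trials=1000, seed=0):
    """Percent of random shuffles of s whose mean rank is worse than s's."""
    rng = random.Random(seed)
    ranks = rank_table(corpus_initials, n)
    observed = mean_rank(s, ranks, n)
    letters = list(s)
    worse = 0
    for _ in range(trials):
        rng.shuffle(letters)
        if mean_rank("".join(letters), ranks, n) > observed:
            worse += 1
    return 100.0 * worse / trials

# Toy check: a string cut from a strictly alternating "corpus" should
# beat nearly all of its own shuffles.
toy_corpus = "AB" * 50
print(shuffle_percentile("ABABABAB", toy_corpus, n=2))
```

To run it on the cipher, pass the TSC as `s` and the string of word initials from a large English corpus as `corpus_initials`, once for each n from 2 to 5.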

However, note also that the TSC is written as a series of lines which may be linguistically unrelated to one another. If the four lines of the TSC are lines of poetry, separate sentences, or in any way excerpts of a longer text, then the ngrams that are generated across the boundaries of lines potentially introduce noise. The idea that the lines are separate entities is further supported by the fact that the crossed-out line occurs at a different position in the sequence than the similar line which is not crossed out.

So we can repeat the analysis, comparing TSC’s ngrams to those generated from random shuffles of TSC, but excluding the ngrams of TSC that start on one line and end on another. If TSC is not derived from sequences of English initials, we would, again, expect to see it rank about 50th percentile in mean ngram rank among the random shuffles. What we see, instead, greatly strengthens the previous result.

N         TSC Percentile
2          96.9
3          99.2
4          99.2
5          99.2
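A sketch of the line-aware variant follows. How the shuffles should be cut into lines is not specified above, so this example assumes one plausible choice: pool all the letters, shuffle them, and re-cut them into lines of the original lengths. The function names are hypothetical.

```python
import random

TSC_LINES = ["WRGOABABD", "WTBIMPANETP", "MLIABOAIAQC", "ITTMTSAMSTGAB"]

def ngrams(s, n):
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def within_line_ngrams(lines, n):
    """Ngrams that start and end on the same line."""
    return [g for line in lines for g in ngrams(line, n)]

def shuffled_lines(lines, rng):
    """Shuffle all letters together, then re-cut to the original line
    lengths (an assumed way of building comparable shuffles)."""
    letters = list("".join(lines))
    rng.shuffle(letters)
    out, pos = [], 0
    for line in lines:
        out.append("".join(letters[pos:pos + len(line)]))
        pos += len(line)
    return out

# Dropping the ngrams that straddle a line break removes 3 of the 43
# bigrams, and a larger share as n grows.
print(len(ngrams("".join(TSC_LINES), 2)), len(within_line_ngrams(TSC_LINES, 2)))
print(len(ngrams("".join(TSC_LINES), 5)), len(within_line_ngrams(TSC_LINES, 5)))
print(shuffled_lines(TSC_LINES, random.Random(0)))
```

The mean-rank score is then computed over `within_line_ngrams(...)` for the TSC and for each shuffle, instead of over the ngrams of the joined string.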

It is exceedingly unlikely that any other method of generating lines of letters would show this regularity if these lines were not initialisms corresponding to one or more short English texts.


This turns the focus to a deeper question: given that the Tamam Shud cipher very likely is an initialism of some short English text(s), what does it say, and what does that say about the case? The story goes on in my next post.
