On quantile dictionary words

Draft of 2018.03.16

May include: puzzle ↘ &c.

I’ve been working through some deep changes in the Klapaucius codebase for the last few days, and so things have been quiet here.

But a puzzle presented itself yesterday. Maybe you’ll enjoy it.

Imagine you have a sufficiently large printed English language dictionary. Perhaps it’s unabridged, but it doesn’t have to be. It’s big, though.

Knowing what you do about the English language, and about dictionaries as cultural objects, what word will be defined at the median point in the dictionary? That is, if there are \(N\) words, what definition do you expect to fall at position \(\frac{N}{2}\)?

Now of course you could be the sort of person who misunderstands what the word “puzzle” means, and you could try to find some compiled list of English words and look at that, but it’s not what I’m asking for. I’m asking you to predict what the word will be, before looking at the physical object itself. What letter will it begin with? How many letters at the beginning of the word can you predict with any certainty, given the size of the dictionary itself?

You might think it will start with the letter half-way through the alphabet, but is it the case that every letter begins an equal number of English words? You might think that it will start with the most common word-starting letter in the alphabet, but the dictionary doesn’t arrange words at random, and so there is some bias in the position of those words in the collection. You might think that by about the third or fourth letter, the letters of words are better “mixed”, but is it the case that words’ \(k\)th letters are independent samples?
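To make the first of those objections concrete, here is a minimal Python sketch of how unequal first-letter counts move the median away from the middle of the alphabet. The function name `letter_at_quantile` and the four-letter "alphabet" with its counts are invented for illustration; they are not measurements from any real dictionary, so the puzzle is left unspoiled.

```python
def letter_at_quantile(counts, q):
    """Return the initial letter of the entry at fraction q of an
    alphabetized list, given how many entries begin with each letter."""
    total = sum(counts.values())
    target = q * total          # position of the entry we want
    running = 0                 # entries seen so far, in alphabetical order
    for letter in sorted(counts):
        running += counts[letter]
        if counts[letter] > 0 and running >= target:
            return letter
    return max(counts)          # only reached if q > 1


# Invented toy counts: NOT real dictionary data.
uniform = {"a": 25, "b": 25, "c": 25, "d": 25}
skewed = {"a": 10, "b": 30, "c": 50, "d": 10}

print(letter_at_quantile(uniform, 0.5))
print(letter_at_quantile(skewed, 0.5))
```

With equal counts the median entry sits in the second of four letters; with the skewed counts it moves into the third. The same cumulative sum, applied again to second letters within the chosen first letter, is one way you might reason about how many leading letters can be predicted.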

I’m also asking you to think about “dictionaries” as cultural objects. Not every word defined in a dictionary is presented in alphabetical order, nor is it true that every word defined in a dictionary has only one definition.

Further questions:

How different might the word defined mid-way through the list of all words be, compared to the word defined at the middle of the middle page of the bound book?
If you can predict the median defined word with any accuracy, can you do so for the words exactly \(\frac{1}{5}\) or \(\frac{4}{5}\) of the way through?
What might happen to your prediction of the median word, if you sorted all variations of words that appear in the dictionary? Many words have alternate spellings and plural forms listed in situ in their definitions, and it may or may not be the case that these have their own independent line entries. What fraction of words do have independent entries (in the dictionary you’ve chosen)?
Many dictionaries are illustrated. Are the illustrations equally spaced throughout the book? Are they all the same size? Does the presence of illustrations and plates substantially affect the outcome when I ask what words are likely to appear on page \(p\) of an \(N\)-page volume? How large an effect might you expect it to have?
There may be something strange about dictionaries as such. Do you think it would make any difference if I asked you to predict the median entry title in an encyclopedia? Can you find (after predicting) the median wiki page in Wikipedia? In that case, be careful of redirects, because they may affect your prediction.
While we’re here, what is the distribution of redirects among Wikipedia pages like? How does it scale? (This is not another puzzle question; it’s just something I’m curious about right now.)
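Once you have committed to a prediction, checking it against a list of headwords could be sketched as below. This is only a sketch under stated assumptions: `word_at_quantile` is a hypothetical helper, the ten-word list is made up, and "position \(q\)" is taken to mean the entry at index \(\lfloor qN \rfloor\) of the case-folded, de-duplicated, sorted list, which is just one of several reasonable conventions.

```python
from fractions import Fraction


def word_at_quantile(words, q):
    """Return the word at fraction q of the sorted, de-duplicated list."""
    ordered = sorted({w.lower() for w in words})
    n = len(ordered)
    index = min(n - 1, int(q * n))   # floor(q * n), clamped to the last entry
    return ordered[index]


# Made-up toy headwords: NOT taken from any real dictionary.
headwords = ["apple", "bread", "candle", "cat", "dog",
             "door", "egg", "sand", "stone", "tree"]

for q in (Fraction(1, 5), Fraction(1, 2), Fraction(4, 5)):
    print(q, word_at_quantile(headwords, q))

# Giving plural forms their own entries shifts the median.
with_variants = headwords + ["cats", "dogs", "eggs"]
print(word_at_quantile(with_variants, Fraction(1, 2)))
```

In this toy list the median moves from "door" to "dogs" once three plurals get their own lines, which is the kind of shift the question about variant spellings and plural forms is asking you to estimate.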

I can’t imagine this whole pile of questions hasn’t been asked before, but it’s damnably difficult to find search terms that work for finding things about searchable objects themselves. So I’d be happy to hear about references to other people’s puzzles, research papers, or blog posts if they’re out there.