John Green uploaded a video in which one of his fans had taken his book “Looking for Alaska” and sorted all of its words alphabetically. The title therefore became “Alaska for Looking”.
I decided to do the same with my favorite book, Three Men in a Boat (To Say Nothing of the Dog).
To start with, I recreated what Green’s fan had done. The results look quite similar!
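If you want to try this on your own favorite book, here is a minimal Python sketch of the idea. The function name `sorted_words` and the definition of a "word" (a run of letters, optionally with apostrophes inside) are my assumptions for illustration, not necessarily what Green’s fan did.

```python
import re

def sorted_words(text: str) -> list[str]:
    """Return every word in `text`, sorted alphabetically, ignoring case."""
    # Assumption: a word is a run of ASCII letters, optionally joined by
    # straight or curly apostrophes (so "don't" stays one word).
    words = re.findall(r"[A-Za-z]+(?:['’][A-Za-z]+)*", text)
    # Sort case-insensitively so "Alaska" and "and" end up near each other.
    return sorted(words, key=str.lower)

if __name__ == "__main__":
    print(" ".join(sorted_words("Looking for Alaska")))
```

Repetitions are kept on purpose, so a frequent word shows up as a long run in the output.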
Going through the file you’ll definitely appreciate the importance of “boat” in the plot.
You’ll also gather that this is probably a story narrated in the first person.
It won’t be difficult to appreciate that male characters outnumber female ones in the plot.
While observing these, maybe you wished you could see all the unique words with no repetitions. Well, I’ve got you covered!
This file is great for learning new words. Like, here are all the words that begin with un. Not gonna lie, the list is unnervingly big!
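A rough sketch of how the unique-word list and the "un" list could be produced. The helper names `unique_words` and `with_prefix` are made up for this example, and lowercasing everything is an assumption that merges "The" with "the".

```python
import re

def unique_words(text: str) -> list[str]:
    """Return the distinct words in `text`, lowercased and sorted."""
    # Assumption: words are runs of letters; case is folded before deduplicating.
    words = re.findall(r"[a-z]+", text.lower())
    return sorted(set(words))

def with_prefix(words: list[str], prefix: str) -> list[str]:
    """Keep only the words that start with `prefix`."""
    return [w for w in words if w.startswith(prefix)]

if __name__ == "__main__":
    vocab = unique_words("Under the boat, the unusual dog")
    print(vocab)
    print(with_prefix(vocab, "un"))
```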
Btw, if you want to get an idea of what a book is about from the least amount of information, you can check out the proper nouns used in it.
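One crude way to fish out proper nouns without any language library is to keep the words that are capitalised every single time they appear. This is only a heuristic I am assuming for illustration (the name `likely_proper_nouns` is hypothetical), and it will let through false positives such as "I".

```python
import re

def likely_proper_nouns(text: str) -> list[str]:
    """Return capitalised words that never occur in all-lowercase form."""
    words = re.findall(r"[A-Za-z]+", text)
    # Every word that shows up fully lowercase somewhere in the text.
    lowercase_seen = {w for w in words if w.islower()}
    # Keep capitalised words whose lowercase form was never seen, which drops
    # ordinary words that are capitalised only at the start of a sentence.
    return sorted({
        w for w in words
        if w[0].isupper() and w.lower() not in lowercase_seen
    })

if __name__ == "__main__":
    print(likely_proper_nouns("George saw the boat. The dog saw Harris."))
```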
Now, with the unique words out of the way, a look at the most used words won’t be a bad thing. The stop words haven’t been removed from the list, though.
I won’t say much, but the following chart will give you a good idea of the Pareto Principle and how only a few words account for most of the text.
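Counting the most used words takes only a few lines with Python’s `collections.Counter`. This is a sketch under the assumption that words are lowercased runs of letters; `top_words` is a name I picked for the example, and stop words are left in, as above.

```python
import re
from collections import Counter

def top_words(text: str, n: int = 10) -> list[tuple[str, int]]:
    """Return the `n` most frequent words with their counts."""
    # Assumption: fold case and treat any run of letters as a word.
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words).most_common(n)

if __name__ == "__main__":
    for word, count in top_words("the boat and the dog and the river", 3):
        print(word, count)
```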
In case the graph was too steep for ya, here’s the log-scaled version.
It looks absolutely beautiful: just as we have a few words being used a large number of times, we also have a large number of words used very few times. This tells us that word count is actually not a very good measure of how much content there is in a paragraph, as most of the words are stop words which just help us connect the nouns and verbs to each other.
Now that we’ve checked out many interesting facts about the words, we may concern ourselves with letters and paragraphs. Technically we could also deal with lines and pages, but those depend on the font properties used, so the analysis wouldn’t be universal.
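To put a number on the Pareto-style picture, you can measure what share of all word occurrences comes from the most frequent slice of the vocabulary. The function `coverage_of_top` and the default 20% cut-off are my own assumptions for this sketch, not something taken from the charts.

```python
import re
from collections import Counter

def coverage_of_top(text: str, fraction: float = 0.2) -> float:
    """Share of all word occurrences covered by the top `fraction` of distinct words."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    if not counts:
        return 0.0
    # Frequencies from most to least common (the same order as the chart's x-axis).
    ranked = sorted(counts.values(), reverse=True)
    # Number of distinct words in the top slice, at least one.
    k = max(1, round(len(ranked) * fraction))
    return sum(ranked[:k]) / sum(ranked)

if __name__ == "__main__":
    print(coverage_of_top("a a a a a a b c d e"))
```

In the toy string, one word out of five distinct words (the top 20%) makes up six of the ten occurrences.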
Analysing the letters was a simple task. Also, not much information comes out of it. For example, here are the most common letters.
I have removed the grid lines and log-scaled the y-axis. As you can see, the x-axis is case sensitive. As usual though, ‘e’ leads the chart. You can see the steepest jump at the point where most of the lowercase letters end and the uppercase ones begin. Keeping in mind that this is a log graph, the difference is appreciable.
There are 51 letters here. Only capital Z is missing. Capital J shows up as often as it does because the narrator is addressed as “J.” throughout the book. That’s why J performs better among the uppercase letters than j does among the lowercase ones.
The frequency analysis doesn’t differ much from standard English, although we do see a bump at H, W and G and a drop at I, S and C. Given that two of the characters are George and Harris and the whole story is based around (rather upon) water, we may pretend that we have an answer. But the more observant among us would rather navigate the word list for G and H to uncover the true reason (especially as G and H are often used in succession in English).
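A small sketch of the case-sensitive letter count, including a check for which letters never appear. I am assuming only the 52 ASCII letters matter here; `letter_counts` and `missing_letters` are names invented for the example.

```python
import string
from collections import Counter

def letter_counts(text: str) -> Counter:
    """Count ASCII letters in `text`, keeping upper and lower case separate."""
    return Counter(ch for ch in text if ch in string.ascii_letters)

def missing_letters(text: str) -> list[str]:
    """Return the ASCII letters (both cases) that never occur in `text`."""
    seen = set(letter_counts(text))
    return [ch for ch in string.ascii_letters if ch not in seen]

if __name__ == "__main__":
    counts = letter_counts("George and Harris and J.")
    print(counts.most_common(5))
    print(len(counts), "distinct letters")
```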
With all the letter analysis, just seeing all the letters would be a breath of fresh air. And here’s that file that’ll bring us the much needed respite.
Now, it’s time for the paragraphs. The most logical thing to do would be to sort them according to their length. Which is exactly why I didn’t do that! I instead sorted them alphabetically according to the first letter of their first word. A ghastly thing to do, you might say. At the beginning you get all the quotes, as they begin with ” and not a letter.
However, this isn’t that bad after all. You can observe all the paragraphs that begin with “So,” and so on!
I also did sort them according to their length.
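Both paragraph orderings can be sketched in a few lines. I am assuming paragraphs in the text file are separated by blank lines and that quotes are plain `"` characters, which sort before letters in a default code-point sort; the function names are hypothetical.

```python
import re

def split_paragraphs(text: str) -> list[str]:
    """Split `text` on blank lines and collapse each paragraph onto one line."""
    # Assumption: one or more blank lines separate paragraphs.
    parts = re.split(r"\n\s*\n", text)
    return [" ".join(p.split()) for p in parts if p.strip()]

def by_alphabet(paragraphs: list[str]) -> list[str]:
    """Plain code-point sort, so paragraphs opening with '"' come before letters."""
    return sorted(paragraphs)

def by_length(paragraphs: list[str]) -> list[str]:
    """Shortest paragraphs first, measured in characters."""
    return sorted(paragraphs, key=len)

if __name__ == "__main__":
    sample = 'So, we went.\n\n"Oh!" said Harris.\n\nAh!'
    paras = split_paragraphs(sample)
    print(by_alphabet(paras))
    print(by_length(paras))
```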
You gotta appreciate how the initial ones are just single-word exclamations and how they go on to become longer and longer. Screenshotting the minimap looks low-key kinda awesome!
Finally, with all that said and analysed, you have earned a little refreshment. So, here are all the quotes and punctuation marks in individual files. Enjoy!
Last but not least, don’t forget to check out the original Three Men in a Boat Gutenberg text file and the GitHub repository where I am hosting all these files.
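For the curious, here is one way such files could be generated. It assumes quotes are wrapped in straight double quotes and that "punctuation" means the ASCII set in Python’s `string.punctuation`; `quotes` and `punctuation_only` are illustrative names of my own.

```python
import re
import string

def quotes(text: str) -> list[str]:
    """Return the text found between pairs of straight double quotes."""
    # Assumption: quotes are not nested and use the plain '"' character.
    return re.findall(r'"([^"]+)"', text)

def punctuation_only(text: str) -> str:
    """Strip everything except ASCII punctuation marks, keeping their order."""
    return "".join(ch for ch in text if ch in string.punctuation)

if __name__ == "__main__":
    sample = '"Oh!" said Harris. "Come on," said George.'
    print(quotes(sample))
    print(punctuation_only(sample))
```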