# Wordlist information

This readme file contains information about word lists commonly used to generate passphrases.

Copies of the word lists are included in this repository, but when possible, please get word lists from their original sources, as the ones included here are likely outdated.

The information in the CSV file in this repo was generated using a tool called Word List Auditor (version 0.2.1).

(As an alternative, if need be, you can try using Tidy by running `tidy -AAAA --samples <wordlist-file>`.)

See below for an explanation of some of the attributes listed.

## Word list comparison

See `wordlists-stats-level-4.csv` for a comparison of many word lists.

## Word list sources


## Licensing

Refer to the LICENSE file for how this readme file and all CSV files are licensed.

Please refer to the license of each word list before use.

Any and all Orchard Street Wordlists are licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License.

## Explanation of some of the list attributes in the spreadsheet

Note: See Word List Auditor's documentation for more information.

### Understanding the relationship between word list length and passphrase entropy

The more words a word list has, the "stronger" the passphrases created from it will be.

For example, each word from a 7,776-word list adds 12.925 bits of entropy to a passphrase. A 3-word passphrase from such a list will have 38.775 bits of entropy (3 * 12.925). A 6-word passphrase from the same list will have 77.55 bits of entropy (6 * 12.925).
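
To make the arithmetic concrete, here is a small sketch in plain Python (just reproducing the numbers above, not the output of any particular tool):

```python
from math import log2

list_length = 7776                       # e.g. a diceware-style list
entropy_per_word = log2(list_length)     # ~12.925 bits per word

print(round(3 * entropy_per_word, 2))    # ~38.77 bits for a 3-word passphrase
print(round(6 * entropy_per_word, 2))    # ~77.55 bits for a 6-word passphrase
```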

The table below shows how many words from lists of different lengths are required to meet minimum entropy requirements (starting at 55 bits of entropy).

| Min entropy | 4,000 words | 7,776 words | 8,000 words | 17,576 words |
|-------------|-------------|-------------|-------------|--------------|
| 55 bits     | 5 words     | 5 words     | 5 words     | 4 words      |
| 60 bits     | 6 words     | 5 words     | 5 words     | 5 words      |
| 65 bits     | 6 words     | 6 words     | 6 words     | 5 words      |
| 70 bits     | 6 words     | 6 words     | 6 words     | 5 words      |
| 75 bits     | 7 words     | 6 words     | 6 words     | 6 words      |
| 80 bits     | 7 words     | 7 words     | 7 words     | 6 words      |
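
The word counts above follow from dividing the minimum entropy by the entropy per word and rounding up to the next whole word. A short sketch (plain Python, with the list lengths taken from the column headers) reproduces the table:

```python
from math import ceil, log2

def words_needed(list_length: int, minimum_bits: float) -> int:
    """Fewest words whose combined entropy meets the minimum."""
    return ceil(minimum_bits / log2(list_length))

for bits in (55, 60, 65, 70, 75, 80):
    print(bits, [words_needed(n, bits) for n in (4000, 7776, 8000, 17576)])
```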

### Prefix codes, suffix codes, and uniquely decodable codes

If a word list is uniquely decodable, that means that words from the list can be safely combined without a delimiter between each word, e.g. `enticingneurosistriflecubeshiningdupe`.

As a brief example, if a list has "boy", "hood", and "boyhood" on it, users who specified they wanted two words' worth of randomness (entropy) might end up with "boyhood", which an attacker guessing single words would try. Removing the word "boy", which makes the remaining list uniquely decodable, prevents this possibility from occurring.

My understanding is that a good way to determine whether a given word list is uniquely decodable is to use the Sardinas–Patterson algorithm. This is how the Word List Auditor tool determines if a word list is uniquely decodable, and the result of that check is printed next to the label "Uniquely decodable?" in the word list information above. (For more on the Sardinas–Patterson algorithm and implementing it in Rust, see this project.)
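
Word List Auditor implements this check in Rust. Purely to illustrate the idea (this is a rough sketch, not the tool's actual code), here is the Sardinas–Patterson test in Python:

```python
def is_uniquely_decodable(words) -> bool:
    """Rough sketch of the Sardinas-Patterson test for unique decodability."""
    code = set(words)

    def dangling_suffixes(prefixes, fulls):
        # Non-empty w such that some p in `prefixes` is a proper prefix
        # of some f in `fulls` (that is, f == p + w)
        return {f[len(p):] for p in prefixes for f in fulls
                if f.startswith(p) and len(f) > len(p)}

    current = dangling_suffixes(code, code)  # S_1
    seen = set()
    while current:
        if current & code:
            return False                     # a dangling suffix is itself a code word
        frozen = frozenset(current)
        if frozen in seen:
            return True                      # suffix sets repeat; no collision is possible
        seen.add(frozen)
        current = dangling_suffixes(code, current) | dangling_suffixes(current, code)
    return True                              # no dangling suffixes left

print(is_uniquely_decodable(["boy", "hood", "boyhood"]))  # False
print(is_uniquely_decodable(["hood", "boyhood"]))         # True
```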

Removing all prefix words (or all suffix words) is one way to make a list uniquely decodable, but I contend it is not the only way, nor usually the most efficient. I adapted the Sardinas–Patterson algorithm to create what I believe is a more efficient method for making a word list uniquely decodable. I used this method to make all of the Orchard Street Wordlists uniquely decodable. You can learn more about uniquely decodable codes and Schlinkert pruning from this blog post.

### On maximum shared prefix length

If a word list's maximum shared prefix length is 4, that means that knowing the first 4 characters of any word on the generated list is sufficient to know which word it is.

This is useful if you intend the list to be used by software that uses auto-complete. For example, a user will only have to type the first 4 characters of any word before a program could successfully auto-complete the entire word.
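
Following the description above (which may differ from Word List Auditor's exact definition), such a value could be computed as the smallest prefix length that still distinguishes every word on the list. A quick Python sketch:

```python
def max_shared_prefix_length(words) -> int:
    """Smallest k such that the first k characters uniquely identify every word."""
    words = set(words)
    longest = max(len(w) for w in words)
    for k in range(1, longest + 1):
        if len({w[:k] for w in words}) == len(words):
            return k
    return longest

print(max_shared_prefix_length(["apple", "apricot", "banana"]))  # 3: "app", "apr", "ban"
```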

What is "Efficiency per character" and "Assumed entropy per char" and what's the difference?

If we take the entropy per word from a list (log2(list_length)) and divide it by the average word length of words on the list, we get a value we might call "efficiency per character". This just means that, on average, you get E bits of entropy per character typed, where E is that efficiency value.

If we take the entropy per word from a list (log2(list_length)) and divide it by the length of the shortest word on the list, we get a value we might call "assumed entropy per char" (or character).

For example, if we're looking at the EFF long list, we see that it is 7,776 words long, so we'd assume an entropy of log2(7776), or 12.925 bits per word. The average word length is 7.0, so the efficiency is 1.8 bits per character. (I got this definition of "efficiency" from an EFF blog post about their list.) And lastly, the shortest word on the list is three letters long, so we'd divide 12.925 by 3 and get an "assumed entropy per character" of about 4.31 bits per character.
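
In code form (a plain-Python sketch, not the tool's actual implementation), the two values are:

```python
from math import log2

def efficiency_per_character(words) -> float:
    """Entropy per word divided by the average word length."""
    average_length = sum(len(w) for w in words) / len(words)
    return log2(len(words)) / average_length

def assumed_entropy_per_character(words) -> float:
    """Entropy per word divided by the length of the shortest word."""
    return log2(len(words)) / min(len(w) for w in words)

# For a 7,776-word list with an average word length of 7.0 and a shortest word
# of 3 letters, these come out to roughly 1.8 and 4.31 bits per character.
```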

I contend that this "assumed entropy per character" value in particular may be useful when we ask the more theoretical question of "how short should the shortest word on a good word list be?" There may be an established method for determining this minimum word length, but if there is, I don't know about it yet! Here's the math I've worked out on my own.

The "brute force line"

Assuming the words on the list are made up of 26 unique characters, if the shortest word on a word list is shorter than log26(list_length) characters, there's a possibility that a user generates a passphrase such that the formula entropy_per_word = log2(list_length) will overestimate the entropy per word. This is because a brute-force character attack would have fewer guesses to run through than the number of guesses we'd assume given the word list we used to create the passphrase.

As an example, let's say we had a 10,000-word list that contained the one-character word "a" on it. Given that it's 10,000 words, we'd expect each word to add an additional ~13.28 bits of entropy. That would mean a three-word passphrase would give users 39.86 bits of entropy. However! If a user happened to get "a-a-a" as their passphrase, a brute-force calculation shows that entropy to be only 14.10 bits (4.7 bits per character * 3 characters). Thus we can say that it falls below the "brute force line", a phrase I made up.

To see if a given generated list falls above or below this line, use the `-A`/`--attributes` flag.
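
For a quick check without the tool (a sketch only, not necessarily what the `-A`/`--attributes` flag computes), the comparison needs just the list length and the length of the shortest word:

```python
from math import log2

def clears_brute_force_line(list_length: int, shortest_word_length: int,
                            alphabet_size: int = 26) -> bool:
    """True if the word-count entropy estimate never overestimates: brute-forcing
    the characters of the shortest words takes at least as many guesses."""
    return log2(list_length) <= shortest_word_length * log2(alphabet_size)

print(clears_brute_force_line(10_000, 1))  # False: the "a" example above
print(clears_brute_force_line(7_776, 3))   # True: e.g. the EFF long list
```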

### Maximum word list lengths to clear the Brute Force Line

Formula:

Where S is the length of the shortest word on the list, 26 is the number of letters in the English alphabet, and M is the maximum list length: M = 2^(S * log2(26)). Conveniently, this simplifies rather nicely to M = 26^S.

(or in Python: `max_word_list_length = 26**shortest_word_length`)

| Shortest word length | Maximum list length |
|----------------------|---------------------|
| 2                    | 676                 |
| 3                    | 17,576              |
| 4                    | 456,976             |
| 5                    | 11,881,376          |
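
This table can be reproduced directly from the formula above:

```python
from math import log2

for s in (2, 3, 4, 5):
    max_length = 26 ** s                             # same as 2 ** (s * log2(26))
    assert max_length == round(2 ** (s * log2(26)))
    print(s, max_length)                             # 676, 17576, 456976, 11881376
```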