This readme file contains information about word lists commonly used to generate passphrases.
Copies of the word lists are included in this repository, but when possible, please get word lists from their original sources, as the ones included here are likely outdated.
The information in the CSV file in this repo was generated using a tool called Word List Auditor (version 0.2.1).
(As an alternative, if need be, you can try using Tidy by running `tidy -AAAA --samples <wordlist-file>`.)
See below for an explanation of some of the attributes listed.
See wordlists-stats-level-4.csv for a comparison of many word lists.
- 1Password's word list
- BIP-0039 English list
- Diceware (Beale)
- EFF Short list with unique prefixes
- EFF general short list
- EFF long list
- Eyeware
- Google Corpus, 20k
- Jack Singleton Diceware
- Jakob Mandula list
- KeePassXC's word list
- Mnemonicode word list (v 0.7)
- Monero English word list
- New Fandom lists by Aaron Toponce
- Niceware
- NSA RandPassGen list
- Orchard Street Long List (v0.1.4)
- Orchard Street Medium List (v0.1.4)
- Peerio English wordlist
- PGP Wordlist. See also: https://en.wikipedia.org/wiki/PGP_Words.
- Passplum Proposed
- Pokerware Formal
- Pokerware Slang
- Reinhold 8k
- Reinhold's original diceware list
- S/KEY
- SecureDrop
- Uli Fouquet Diceware
- Webplaces 10k Short
- Webplaces 4k Hexadecimal
- Webplaces Combined Diceware
- Webplaces Improved Diceware
Refer to the LICENSE file for how this readme file and all CSV files are licensed.
Please refer to the license of each word list before use.
Any and all Orchard Street Wordlists are licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License.
Note: See Word List Auditor's documentation for more information.
The more words a word list has, the "stronger" the passphrases created from it will be.
For example, each word from a 7,776-word list adds 12.925 bits of entropy to a passphrase. A 3-word passphrase from such a list will have 38.775 bits of entropy (3 * 12.925). A 6-word passphrase from the same list will have 77.55 bits of entropy (6 * 12.925).
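This arithmetic can be checked with a quick Python sketch (not part of Word List Auditor): the entropy of a passphrase is just the number of words multiplied by log2 of the list length.

```python
import math

def passphrase_entropy(list_length: int, num_words: int) -> float:
    """Total bits of entropy for a passphrase of num_words words drawn
    uniformly at random from a list of list_length words."""
    return num_words * math.log2(list_length)

print(round(math.log2(7776), 3))              # 12.925 bits per word
print(round(passphrase_entropy(7776, 6), 2))  # 77.55 bits for 6 words
```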
The chart below shows how many words from different list-lengths are required to hit minimum entropy requirements (starting at 55 bits of entropy).
Min entropy | 4,000 words | 7,776 words | 8,000 words | 17,576 words |
---|---|---|---|---|
55 bits | 5 words | 5 words | 5 words | 4 words |
60 bits | 6 words | 5 words | 5 words | 5 words |
65 bits | 6 words | 6 words | 6 words | 5 words |
70 bits | 6 words | 6 words | 6 words | 5 words |
75 bits | 7 words | 6 words | 6 words | 6 words |
80 bits | 7 words | 7 words | 7 words | 6 words |
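The chart above can be reproduced with a one-line formula: divide the entropy target by the per-word entropy and round up (a sketch, assuming words are chosen uniformly at random).

```python
import math

def words_needed(min_bits: float, list_length: int) -> int:
    """Fewest words from a list of list_length words whose combined
    entropy meets or exceeds min_bits."""
    return math.ceil(min_bits / math.log2(list_length))

for bits in (55, 60, 65, 70, 75, 80):
    print(bits, [words_needed(bits, n) for n in (4000, 7776, 8000, 17576)])
```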
If a word list is uniquely decodable, that means words from the list can be safely combined without a delimiter between each word, e.g. enticingneurosistriflecubeshiningdupe.
As a brief example, if a list has "boy", "hood", and "boyhood" on it, users who specified they wanted two words' worth of randomness (entropy) might end up with "boyhood", which an attacker guessing single words would try. Removing the word "boy", which makes the remaining list uniquely decodable, prevents this possibility.
My understanding is that a good way to determine whether a given word list is uniquely decodable is to use the Sardinas–Patterson algorithm. This is how the Word List Auditor tool determines if a word list is uniquely decodable, and the result of that code is printed next to the label "Uniquely decodable?" in the word list information above. (For more on the Sardinas–Patterson algorithm and implementing it in Rust, see this project.)
Removing all prefix words (or all suffix words) is one way to make a list uniquely decodable, but I contend it is not the only way, nor usually the most efficient. I adapted the Sardinas–Patterson algorithm to create what I believe is a more efficient method for making a word list uniquely decodable. I used this method to make all of the Orchard Street Wordlists uniquely decodable. You can learn more about uniquely decodable codes and Schlinkert pruning from this blog post.
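For illustration only, here is a minimal Python sketch of the Sardinas–Patterson test (Word List Auditor's actual implementation is in Rust and may differ): a code is uniquely decodable if and only if no round of "dangling suffixes" ever contains a codeword.

```python
def dangling_suffixes(a, b):
    """Suffixes w such that u + w = v for some u in a, v in b."""
    return {v[len(u):] for u in a for v in b
            if v.startswith(u) and len(v) > len(u)}

def is_uniquely_decodable(words):
    """Sardinas–Patterson test."""
    code = set(words)
    suffixes = dangling_suffixes(code, code)
    seen = set()
    while suffixes:
        if suffixes & code:
            return False  # a dangling suffix is itself a codeword
        if suffixes <= seen:
            return True   # no new suffixes; the iteration has cycled
        seen |= suffixes
        suffixes = (dangling_suffixes(suffixes, code)
                    | dangling_suffixes(code, suffixes))
    return True           # suffix set emptied out

print(is_uniquely_decodable(["boy", "hood", "boyhood"]))  # False
print(is_uniquely_decodable(["hood", "boyhood"]))         # True
```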
If a word list's maximum shared prefix length is 4, that means that knowing the first 4 characters of any word on the generated list is sufficient to know which word it is.
This is useful if you intend the list to be used by software that uses auto-complete. For example, a user will only have to type the first 4 characters of any word before a program could successfully auto-complete the entire word.
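One way to compute this (a sketch, not the tool's actual code): after sorting the list, the longest prefix any word shares with any other word is always shared with an alphabetical neighbor, so comparing neighbors is enough. If the longest shared prefix is N characters long, typing N + 1 characters is always enough to identify a word uniquely.

```python
def longest_shared_prefix(words):
    """Length of the longest prefix shared by any two distinct words.
    Sorting guarantees each word's closest prefix-match is a neighbor."""
    longest = 0
    ordered = sorted(words)
    for a, b in zip(ordered, ordered[1:]):
        shared = 0
        for x, y in zip(a, b):
            if x != y:
                break
            shared += 1
        longest = max(longest, shared)
    return longest

words = ["aardvark", "abacus", "abandon", "zebra"]
print(longest_shared_prefix(words))  # 3: "abacus"/"abandon" share "aba"
```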
If we take the entropy per word of a list (log2(list_length)) and divide it by the average length of the words on the list, we get a value we might call "efficiency per character". This just means that, on average, you get that many bits of entropy per character typed.
If we take the entropy per word from a list (log2(list_length)) and divide it by the length of the shortest word on the list, we get a value we might call "assumed entropy per char" (or character).
For example, if we're looking at the EFF long list, we see that it is 7,776 words long, so we'd assume an entropy of log2(7776), or 12.925 bits per word. The average word length is 7.0, so the efficiency is 1.8 bits per character. (I got this definition of "efficiency" from an EFF blog post about their list.) And lastly, the shortest word on the list is three letters long, so we'd divide 12.925 by 3 and get an "assumed entropy per character" of about 4.31 bits per character.
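Both values are simple ratios; here is a sketch using the EFF long list numbers quoted above.

```python
import math

def efficiency_per_character(list_length: int, avg_word_length: float) -> float:
    """Average bits of entropy per character typed."""
    return math.log2(list_length) / avg_word_length

def assumed_entropy_per_character(list_length: int, shortest_word_length: int) -> float:
    """Worst-case bits of entropy per character typed."""
    return math.log2(list_length) / shortest_word_length

# EFF long list: 7,776 words, average length 7.0, shortest word 3 letters
print(round(efficiency_per_character(7776, 7.0), 2))     # 1.85 (~1.8)
print(round(assumed_entropy_per_character(7776, 3), 2))  # 4.31
```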
I contend that this "assumed entropy per character" value in particular may be useful when we ask the more theoretical question of "how short should the shortest word on a good word list be?" There may be an established method for determining what this minimum word length should be, but if there is I don't know about it yet! Here's the math I've worked out on my own.
Assuming the list is composed of words made up of 26 unique characters, if the shortest word on a word list is shorter than log26(list_length), there's a possibility that a user generates a passphrase for which the formula entropy_per_word = log2(list_length) will overestimate the entropy per word. This is because a brute-force character attack would have fewer guesses to run through than the number of guesses we'd assume given the word list we used to create the passphrase.
As an example, let's say we had a 10,000-word list that contained the one-character word "a" on it. Given that it's 10,000 words, we'd expect each word to add an additional ~13.28 bits of entropy. That would mean a three-word passphrase would give users 39.86 bits of entropy. However! If a user happened to get "a-a-a" as their passphrase, a brute force method shows that entropy to be only 14.10 bits (4.7 * 3 words). Thus we can say that it falls below the "brute force line", a phrase I made up.
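The "a-a-a" example works out like this (a sketch; 4.7 is log2(26), the per-letter entropy of a lowercase brute-force search):

```python
import math

list_length = 10_000
assumed = 3 * math.log2(list_length)  # entropy promised by the word list
brute = 3 * math.log2(26)             # guessing the 3 letters one by one

print(round(assumed, 2))  # 39.86
print(round(brute, 2))    # 14.1
print(brute < assumed)    # True: "a-a-a" falls below the brute force line
```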
To see if a given generated list falls above or below this line, use the `-A`/`--attributes` flag.
Formula:
Where S is the length of the shortest word on the list, 26 is the number of letters in the English alphabet, and M is the maximum list length: M = 2^(S * log2(26)). Conveniently, this simplifies rather nicely to M = 26^S.
(Or in Python: `max_word_list_length = 26**shortest_word_length`.)
shortest word length | max list length |
---|---|
2 | 676 |
3 | 17,576 |
4 | 456,976 |
5 | 11,881,376 |
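The table values follow directly from the formula; here is a sketch that reproduces them and checks a real list against the line (assuming 26 lowercase letters).

```python
def max_list_length(shortest_word_length: int) -> int:
    """Largest list whose per-word entropy a word of this length can
    'carry' against a 26-letter brute-force attack: 26 ** S."""
    return 26 ** shortest_word_length

def above_brute_force_line(list_length: int, shortest_word_length: int) -> bool:
    """True if the list's entropy claim survives a brute-force character attack."""
    return list_length <= max_list_length(shortest_word_length)

for s in (2, 3, 4, 5):
    print(s, max_list_length(s))

# The EFF long list (7,776 words, shortest word 3 letters) passes:
print(above_brute_force_line(7776, 3))  # True
```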