Working with Python's list
Thomas J. Kennedy
1 Overview & Basic Operations
The list is one of the most commonly used data structures in Python. The basic operations include:
- append to add an entry to the end of a list
- extend to add multiple values from another list (or other Iterable)
- insert to add an entry in the middle of a list
- count to find the number of occurrences of a specific value within a list
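The four operations listed above can be tried in a few lines. This is a minimal sketch; the languages list and its contents are made up for illustration.

```python
# Hypothetical data, used only to demonstrate the four operations.
languages = ["Python", "Java"]

languages.append("Rust")           # add one entry to the end
languages.extend(["Lua", "Perl"])  # add every entry from another iterable
languages.insert(1, "C++")         # add an entry before index 1
languages.append("Java")           # add a duplicate so count has work to do

java_count = languages.count("Java")

print(languages)
print(java_count)
```

Note that append, extend, and insert all modify the list in place and return None, while count leaves the list untouched.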
1.1 List Slicing & Indices
The list provides a slicing syntax that allows not just one value, but a range of values, to be retrieved. If we needed to print the first ten (10) values of a list…
print(some_list[:10])
would print the items from index zero (0) through nine (9), inclusive. Even more powerful is negative indexing. We can grab the last element with…
print(some_list[-1])
We could even grab every third element between indices seven (7) and fifteen (15), not including fifteen (15).
print(some_list[7:15:3])
Of course… some of this seems like a solution in search of a problem, at least in isolation.
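To see the slices side by side, here is a small sketch. The contents of some_list (the numbers 0 through 19) are an assumption chosen so that each value matches its index.

```python
# Hypothetical data: each value equals its own index.
some_list = list(range(20))

first_ten = some_list[:10]       # indices 0 through 9
last_item = some_list[-1]        # a single element: the last one
all_but_last = some_list[:-1]    # a slice: everything except the last element
every_third = some_list[7:15:3]  # indices 7, 10, and 13

print(first_ten)
print(last_item)
print(every_third)
```

The difference between some_list[-1] (one element) and some_list[:-1] (a new list without the last element) is easy to miss, so it is worth checking in the interpreter.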
1.2 Common Operations
Python provides the min, max, sum, and sorted functions, along with the list's own sort method. (While there are more functions, e.g., filter, we will focus on the basics.)
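A quick sketch of these operations on a small list follows. The word_lengths values are made up for illustration.

```python
# Hypothetical data for illustration.
word_lengths = [6, 4, 3, 4, 9]

shortest = min(word_lengths)
longest = max(word_lengths)
total = sum(word_lengths)

# sorted returns a new list and leaves the original alone.
ascending = sorted(word_lengths)

# list.sort reorders the existing list in place (and returns None).
word_lengths.sort(reverse=True)

print(shortest, longest, total)
print(ascending)
print(word_lengths)
```

The case study below uses sorted rather than list.sort, since sorted accepts any iterable and does not change its input.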
2 Case Study
While the mechanics, syntax, and flexibility of lists can be interesting… we actually want to solve a problem. Suppose that we have a list of words in a file named words.txt.
Example 1: words.txt
C++
Java
Python
loop
Lua
Perl
PHP
Wikipedia
.NET
Rust
Minecraft
Persona
Scratch
HTML
ZFlip 3
RaspberryPi
Steam Deck
CSS
JavaScript
A few of these words are noun phrases (i.e., more than one word). The file also contains one blank line (not shown in the listing).
2.1 A Simple Start
Let us start by outputting a sorted list of words.
Example 2: process_words.py - First Attempt
The full code can be found in the course GitHub Repository.
import sys


def main():
    word_filename = sys.argv[1]

    with open(word_filename, "r") as word_file:
        words = [line.strip() for line in word_file]

    for word in sorted(words):
        print(f"|{word}|")


if __name__ == "__main__":
    main()
When the program is run using…
python3.11 process_words.py words.txt
it generates the following output…
||
|.NET|
|C++|
|CSS|
|HTML|
|Java|
|JavaScript|
|Lua|
|Minecraft|
|PHP|
|Perl|
|Persona|
|Python|
|RaspberryPi|
|Rust|
|Scratch|
|Steam Deck|
|Wikipedia|
|ZFlip 3|
|loop|
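The vertical pipe characters (i.e., ‘|’) are used to confirm that all leading and trailing whitespace has been removed. Do you notice a problem? The words are sorted lexicographically: every uppercase letter sorts before every lowercase letter, so PHP comes before Perl and loop lands at the very end.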
A lexicographical sort uses a defined ordering to arrange (sort) data. Unlike alphabetical order, which is restricted to an alphabet, a lexicographical sort can apply any ordering, which allows uppercase letters, lowercase letters, numbers, and symbols to be included.
We could even apply a lexicographical sort to the Unicode character set, which includes letters from (almost) every language (and emoji).
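Python compares strings character by character using Unicode code points, which explains the ordering in the output. The sketch below uses a handful of words from words.txt and the built-in ord function to show the code points involved.

```python
# A few entries taken from words.txt.
words = ["loop", "PHP", "Perl", ".NET", "Java"]

# ord returns the Unicode code point of a single character.
dot, upper_h, upper_p, lower_e, lower_l = (ord(ch) for ch in ".HPel")

# "." < uppercase letters < lowercase letters, so ".NET" sorts first,
# "PHP" sorts before "Perl" (H < e), and "loop" sorts last.
default_order = sorted(words)

print(dot, upper_h, upper_p, lower_e, lower_l)
print(default_order)
```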
In most cases… our intention would be for words to be sorted with case being ignored.
We just need to add a key argument to the sorted function call.
for word in sorted(words, key=lambda word: word.lower()):
    print(f"|{word}|")
Take note of the lambda function. This is a one-off function that takes a single string (word) and converts it to lowercase (word.lower()) before it is compared to other words.
While this will work for English words… the notion of case in other languages can be interesting. The Pythonic approach is to use casefold.
for word in sorted(words, key=lambda word: word.casefold()):
    print(f"|{word}|")
However, the code can actually be written more succinctly as
for word in sorted(words, key=str.casefold):
    print(f"|{word}|")
We do not need a full lambda function. Since Python treats functions as first-class objects… we can pass in a reference to the function we wish to have invoked on each string. (In C++ this would usually be achieved with function pointers and/or templates.)
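The sketch below checks two of the claims above: that the lambda and the bare str.casefold reference produce the same ordering, and that casefold can differ from lower outside of English. The German word is an illustrative example that does not appear in words.txt.

```python
# A few entries taken from words.txt.
words = ["loop", "PHP", "Perl", "Java", "CSS"]

by_lambda = sorted(words, key=lambda word: word.casefold())
by_reference = sorted(words, key=str.casefold)

# lower keeps the German sharp s, while casefold expands it to "ss".
lowered = "Straße".lower()
folded = "Straße".casefold()

print(by_reference)
print(lowered, folded)
```

Note that the key function only decides the order. The original strings (e.g., PHP with its uppercase letters) are what end up in the sorted list.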
The updated process_words.py can be found in the course GitHub Repository.
2.2 Filtering Words
Now that we can sort words… suppose that we want to
- skip the blank word (i.e., ||) in the output.
- skip noun phrases (i.e., any line that contains a space between two or more words, numbers, or symbols, e.g., Steam Deck).
- ignore any word that contains a ., +, or -
We might be tempted to rewrite…
# Grab all tokens that are not an empty string
words = [word for word in words if word]
# Grab all words that do not contain a space
words = [word for word in words if " " not in word]
# Grab all words that do not contain a ., +, or -
words = [word for word in words if "." not in word and "+" not in word and "-" not in word]
A smart alec might even rewrite the last list comprehension so that a generator expression is nested inside of it.
# Grab all words that do not contain a ., +, or -
symbol_blacklist = [".", "+", "-"]
words = [word for word in words if all(symbol not in word for symbol in symbol_blacklist)]
But this is not r/iamverysmart or r/programmerhumour. We are writing Python code and PEP 20 says…
Readability counts.
Simple is better than complex.
If the implementation is hard to explain, it’s a bad idea.
Let us rethink the overuse of list comprehensions.
2.3 Refactoring
Let us rewrite the code so that main calls an apply_word_filters function.
Example 3: Updated main
def main():
    word_filename = sys.argv[1]

    with open(word_filename, "r") as word_file:
        words = [line.strip() for line in word_file]

    filtered_words = apply_word_filters(words)

    for word in sorted(filtered_words, key=str.casefold):
        print(f"|{word}|")
Example 4: apply_word_filters
from typing import Iterable


def apply_word_filters(words: Iterable[str]) -> Iterable[str]:
    """
    Take a collection of words and apply the following filters:

      1. skip empty strings
      2. skip noun phrases (i.e., any token that contains a space between
         two or more words, numbers, or symbols, e.g., Steam Deck).
      3. ignore any word that contains a ., +, or -
    """

    # Grab all tokens that are not an empty string
    words = [word for word in words if word]

    # Grab all words that do not contain a space
    words = [word for word in words if " " not in word]

    # Grab all words that do not contain a ., +, or -
    symbol_blacklist = [".", "+", "-"]
    words = [word for word in words if all(symbol not in word for symbol in symbol_blacklist)]

    return words
The first list comprehension follows the Python convention for an empty string check.
words = [word for word in words if word]
While C++ or Java programmers might write something along the lines of…
words = [word for word in words if len(word) > 0]
most code reviews and linting tools would flag it. The second comprehension
# Grab all words that do not contain a space
words = [word for word in words if " " not in word]
is good “as is.” While one might be tempted to rewrite it as…
# Grab all words that do not contain a space
words = [word for word in words if word.find(" ") == -1]
the official str.find documentation indicates that in is the accepted approach when we only need to know whether a substring is present.
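A short sketch shows why find is easy to get wrong in a membership check. It returns a position (or -1 when nothing is found), not a boolean. The string "-flag" is a made-up example.

```python
word = "Steam Deck"

position = word.find(" ")      # index of the first space
missing = "Python".find(" ")   # -1 means "not found"

# A match at index 0 is falsy, so "if word.find(...)" would be a bug.
leading_match = "-flag".find("-")

# The in operator answers the yes/no question directly.
has_space = " " in word
no_space = " " not in "Python"

print(position, missing, leading_match)
print(has_space, no_space)
```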
That just leaves the final list comprehension.
symbol_blacklist = [".", "+", "-"]
words = [word for word in words if all(symbol not in word for symbol in symbol_blacklist)]
I think a not-any-in approach would be more readable.
words = [
    word for word in words if not any(symbol in word for symbol in symbol_blacklist)
]
That leaves us with a (more-or-less) final filter function.
Example 5: apply_word_filters
DEFAULT_SYMBOL_BLACKLIST = (".", "+", "-")


def apply_word_filters(
    words: Iterable[str], symbol_blacklist: tuple[str, ...] = DEFAULT_SYMBOL_BLACKLIST
) -> Iterable[str]:
    """
    Take a collection of words and apply the following filters:

      1. skip empty strings
      2. skip noun phrases (i.e., any token that contains a space between
         two or more words, numbers, or symbols, e.g., Steam Deck).
      3. ignore any word that contains a ., +, or -
    """

    words = [word for word in words if word]
    words = [word for word in words if " " not in word]
    words = [
        word for word in words if not any(symbol in word for symbol in symbol_blacklist)
    ]

    return words
Take note of the subtle change to symbol_blacklist. If a function uses a blacklist…
- That blacklist should be passed in as an argument.
- If there is a default (or fallback) blacklist it should be provided as a default argument.
- A tuple should be used for the default since constants should be immutable.
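The three guidelines above can be seen in action with a trimmed-down filter. This is only a sketch: drop_words_with_symbols is a hypothetical helper that performs just the symbol check, and the word C# is made up for the example.

```python
DEFAULT_SYMBOL_BLACKLIST = (".", "+", "-")


def drop_words_with_symbols(words, symbol_blacklist=DEFAULT_SYMBOL_BLACKLIST):
    """Hypothetical helper: keep only words with no blacklisted symbol."""
    return [
        word for word in words if not any(symbol in word for symbol in symbol_blacklist)
    ]


words = ["C++", ".NET", "Python", "C#"]

# The caller can rely on the default...
with_default = drop_words_with_symbols(words)

# ...or pass in a different blacklist without touching the function.
with_custom = drop_words_with_symbols(words, symbol_blacklist=("#",))

# A tuple refuses in-place changes, so the default cannot be altered by accident.
try:
    DEFAULT_SYMBOL_BLACKLIST[0] = "#"
    tuple_is_immutable = False
except TypeError:
    tuple_is_immutable = True

print(with_default)
print(with_custom)
print(tuple_is_immutable)
```

Had the default been a list, any code that appended to it would have silently changed the behavior of every later call.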
The updated process_words.py can be found in the course GitHub Repository.
2.4 Are We Done with Filtering?
Yes… we are done with filtering words. Let us actually do some analysis (i.e., work in main). Suppose that we want to output the following table…
Example 6: Desired Table
| CSS                | Lua                |
| HTML               | PHP                |
| Java               | CSS                |
| JavaScript         | Java               |
| loop               | loop               |
| Lua                | Perl               |
| Minecraft          | Rust               |
| Perl               | HTML               |
| Persona            | Python             |
| PHP                | Persona            |
| Python             | Scratch            |
| RaspberryPi        | Wikipedia          |
| Rust               | Minecraft          |
| Scratch            | JavaScript         |
| Wikipedia          | RaspberryPi        |
We would take our existing loop…
for word in sorted(filtered_words, key=str.casefold):
    print(f"|{word}|")
and separate the sort operation.
words_sorted_lex = sorted(filtered_words, key=str.casefold)
words_sorted_len = sorted(filtered_words, key=len)
Note how the sorted list from the loop is now stored in a variable (words_sorted_lex, where lex is an abbreviation for lexicographical). There is now a second list (words_sorted_len) where words are sorted by length.
The loop becomes…
for word_lhs, word_rhs in zip(words_sorted_lex, words_sorted_len):
    print(f"| {word_lhs:<18} | {word_rhs:<18} |")
Take note of zip. The zip function takes two (or more) Iterable collections and retrieves corresponding pairs of items (e.g., words_sorted_lex[0], words_sorted_len[0]). And… we happen to know that our two word lists have equal length. That matters because zip stops as soon as the shorter collection runs out.
The output is a little f-string formatting magic.
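To take some of the magic out of the formatting: the :<18 inside the braces is a format specification that left-aligns the value and pads it with spaces to a width of eighteen (18) characters. A small sketch, using one row of the table:

```python
word_lhs, word_rhs = "CSS", "Lua"

# "<" left-aligns and 18 is the minimum field width.
row = f"| {word_lhs:<18} | {word_rhs:<18} |"

# ">" right-aligns and "^" centers, using the same width syntax.
right_aligned = f"{word_lhs:>6}"
centered = f"{word_lhs:^7}"

print(row)
print(repr(right_aligned), repr(centered))
```

Because every cell is padded to the same width, the pipe characters line up no matter how long each word is (as long as no word exceeds 18 characters).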
2.5 What About min and max?
Suppose that we wanted to grab the three longest words from words_sorted_len. Since the list is sorted by length… we can simply grab the last three (3) entries.
print()
print("Top 3 Longest Words:")
print()
for word in words_sorted_len[-1:-4:-1]:
    print(f"  {word}")
Take note of the indices in words_sorted_len[-1:-4:-1]. We are…
- grabbing all elements from the last index ([-1)
- down to, but not including, the fourth from last index (:-4:)
- going backwards (:-1])
That is about as clear as mud. I would prefer…
for word in reversed(words_sorted_len[-3:]):
    print(f"  {word}")
This is a bit more clear. We are taking the last three entries (starting at the third from last) and reversing their order, i.e., $-3, -2, -1$ becomes $-1, -2, -3$.
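A quick check confirms that the two spellings select the same words in the same order. The list below is a shortened, hypothetical stand-in for words_sorted_len.

```python
# A shortened stand-in for the real list, already sorted by length.
words_sorted_len = ["Lua", "Java", "Python", "Persona", "Wikipedia", "JavaScript", "RaspberryPi"]

via_negative_step = words_sorted_len[-1:-4:-1]
via_reversed = list(reversed(words_sorted_len[-3:]))

# Both forms also cope with a list that has fewer than three entries.
short_list = ["a", "bb"]
short_negative_step = short_list[-1:-4:-1]
short_reversed = list(reversed(short_list[-3:]))

print(via_negative_step)
print(via_reversed)
```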
Now suppose that we want the longest word and shortest word that start with each letter.
Keep in mind that min and max return the first value encountered (e.g., if two words with equal length are both the shortest, min would return the first one encountered).
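The tie-breaking rule, and what happens when there is nothing to compare, can both be checked with a short sketch. The word list is a small sample from words.txt. The default argument shown at the end is an alternative to catching the exception, not something the examples in these notes use.

```python
# A small sample from words.txt, in file order.
words = ["Lua", "PHP", "CSS", "Java", "loop"]

# Three words tie for shortest and two tie for longest.
# In both cases the first one encountered wins.
shortest = min(words, key=len)
longest = max(words, key=len)

# With nothing to compare, min and max raise ValueError...
try:
    max([], key=len)
    empty_raises = False
except ValueError:
    empty_raises = True

# ...unless a default value is supplied.
fallback = max([], key=len, default=None)

print(shortest, longest)
print(empty_raises, fallback)
```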
Let us start with finding the longest words.
Example 7: Finding the Longest Words
from string import ascii_lowercase

for letter in ascii_lowercase:
    try:
        longest_word = max(
            (word for word in filtered_words if word.lower().startswith(letter)),
            key=len,
        )

    except ValueError as _err:
        # There were no words that started with "letter"
        # There is nothing to output
        continue

    print(f"  ({letter}) - {longest_word}")
The string module provides the set of lowercase letters in ascii_lowercase. We start by grabbing each letter.
for letter in ascii_lowercase:
We attempt to look for the longest word that starts with letter.
try:
    longest_word = max(
        (word for word in filtered_words if word.lower().startswith(letter)),
        key=len,
    )
If there are no words that start with letter… max raises a ValueError and we skip to the next loop iteration.
except ValueError as _err:
    # There were no words that started with "letter"
    # There is nothing to output
    continue
If a word is found… we output both letter and the word.
print(f" ({letter}) - {longest_word}")
Identifying the shortest words is nearly identical (using min instead of max).
Example 8: Finding the Shortest Words
for letter in ascii_lowercase:
    try:
        shortest_word = min(
            (word for word in filtered_words if word.lower().startswith(letter)),
            key=len,
        )

    except ValueError as _err:
        # There were no words that started with "letter"
        # There is nothing to output
        continue

    print(f"  ({letter}) - {shortest_word}")
The two loops are very similar. With some care…
- we could refactor the two loops into a reusable function.
- we could refactor the two loops into a single loop that reuses the list containing words that start with letter.
However, I will leave both options as exercises to the reader (i.e., you).
3 The Finished Code
The final version of process_words.py can be found in the course GitHub Repository.