defaultdict

defaultdict is a subclass of dict (see the previous post), so it inherits most of its behavior from dict and adds a few features of its own. To understand how those features make it different, and more convenient in some cases, we'll first need to run into an error.

If we try to count words in a document, the general approach is to create a dictionary where the dictionary keys are words and the dictionary values are counts of those words.

Let's try to do this with a regular dictionary.

First, to set up, we'll take a paragraph stored in a list and split() it into individual words. I took this paragraph from another project I'm working on and artificially added some extra words so that certain words appear more than once (it'll be apparent why soon).


# paragraph
lines = ["This table highlights 538's new NBA statistic, RAPTOR, in addition to the more established Wins Above Replacement (WAR). An extra column, Playoff (P/O) War, is provided to highlight stars performers in the post-season, when the stakes are higher. The table is limited to the top-100 players who have played at least 1,000 minutes minutes the table Wins NBA NBA RAPTOR more players"]

# split paragraph into individual words
lines = " ".join(lines).split()

type(lines) # list

Now that we have our lines list, we'll create an empty dict called word_counts, with each word as a key and the count of that word as its value.

# empty dict
word_counts = {}

# loop through lines to count each word
for word in lines:
    word_counts[word] += 1

# KeyError: 'This'

We received a KeyError for the very first word in lines (i.e. 'This') because we tried to increment the value of a key that didn't exist in the dict yet. We've learned to handle exceptions, so we can use try and except.

Here, we loop through lines as before, but now we anticipate the KeyError: when a word isn't in the dict yet, we set its count to 1. The loop then continues, and the next time that word appears the key exists, so its count can be incremented.

# empty dict
word_counts = {}

# exception handling
for word in lines:
    try:
        word_counts[word] += 1
    except KeyError:
        word_counts[word] = 1

# call word_counts
# abbreviated for space
word_counts

{'This': 1,
 'table': 3,
 'highlights': 1,
 "538's": 1,
 'new': 1,
 'NBA': 3,
 'statistic,': 1,
 'RAPTOR,': 1,
 'in': 2,
 'addition': 1,
 'to': 3,
 'the': 5,
 'more': 2,
 ...
 'top-100': 1,
 'players': 2,
 'who': 1,
 'have': 1,
 'played': 1,
 'at': 1,
 'least': 1,
 '1,000': 1,
 'minutes': 2,
 'RAPTOR': 1}

Now, there are other ways to achieve the above:

# use conditional flow
word_counts = {}

for word in lines:
    if word in word_counts:
        word_counts[word] += 1
    else:
        word_counts[word] = 1

# use get
# (reset the dict first, otherwise the counts from the previous loop are doubled)
word_counts = {}

for word in lines:
    previous_count = word_counts.get(word, 0)
    word_counts[word] = previous_count + 1

Here's where the author makes the case for defaultdict, arguing that the approaches above are unwieldy. We'll come full circle and retry our first approach, using defaultdict instead of a regular dict.

defaultdict is a subclass of dict and must be imported from collections:

from collections import defaultdict

word_counts = defaultdict(int)

for word in lines:
    word_counts[word] += 1

# we no longer get a KeyError
# abbreviated for space
defaultdict(int,
            {'This': 1,
             'table': 3,
             'highlights': 1,
             "538's": 1,
             'new': 1,
             'NBA': 3,
             'statistic,': 1,
             'RAPTOR,': 1,
             'in': 2,
             'addition': 1,
             'to': 3,
             'the': 5,
             'more': 2,
             ...
             'top-100': 1,
             'players': 2,
             'who': 1,
             'have': 1,
             'played': 1,
             'at': 1,
             'least': 1,
             '1,000': 1,
             'minutes': 2,
             'RAPTOR': 1})

Unlike a regular dictionary, when you look up a key that a defaultdict doesn't contain, it automatically adds a value for that key by calling the argument we provided when we created the defaultdict. Above, we passed int as the argument; calling int() returns 0, so every new word starts at a count of 0 and is then incremented.
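To make that lookup behavior concrete, here is a minimal sketch. The name counts and the sample keys are illustrative only and are not part of the word-count example:

```python
from collections import defaultdict

# int called with no arguments returns 0, which becomes the default value
assert int() == 0

counts = defaultdict(int)

# looking up a missing key with [] calls int() and stores the result
value = counts["missing"]
print(value)         # 0
print(dict(counts))  # {'missing': 0}

# membership tests and .get() do not trigger the default
print("other" in counts)    # False
print(counts.get("other"))  # None
print(len(counts))          # 1
```

Note that simply reading a missing key with square brackets adds it to the defaultdict, which is easy to overlook when you only meant to check a value.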

If you want the values of your defaultdict to be lists, pass list as the argument. Each missing key then starts out as an empty list, so you can append to it right away.

dd_list = defaultdict(list) # defaultdict(list, {})

dd_list[2].append(1)        # defaultdict(list, {2: [1]})

dd_list[4].append('string') # defaultdict(list, {2: [1], 4: ['string']})

You can also pass dict to defaultdict, so that any missing key starts out as an empty dict. Values assigned directly, like the strings below, are stored as-is:


dd_dict = defaultdict(dict) # defaultdict(dict, {})

# assign a value directly to a key
dd_dict['first_name'] = 'lebron' # defaultdict(dict, {'first_name': 'lebron'})
dd_dict['last_name'] = 'james'

# 'team' is missing, so it starts as an empty dict that we can assign into
dd_dict['team']['city'] = 'Los Angeles'

# defaultdict(dict,
#            {'first_name': 'lebron',
#             'last_name': 'james',
#             'team': {'city': 'Los Angeles'}})
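The argument does not have to be a built-in type: any callable that takes no arguments can serve as the factory. As a small sketch, the lambda below gives every new key a two-element list. The name record, the team keys and the [wins, losses] layout are made up for illustration:

```python
from collections import defaultdict

# hypothetical factory: every new key starts as a [wins, losses] pair
record = defaultdict(lambda: [0, 0])

record["Lakers"][0] += 1    # one win
record["Lakers"][1] += 2    # two losses
record["Clippers"][0] += 3  # three wins, no losses yet

print(dict(record))  # {'Lakers': [1, 2], 'Clippers': [3, 0]}
```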

Application: Grouping with defaultdict

The following example is from Real Python, a fantastic resource for all things Python.

It is common to use defaultdict to group items in a sequence or collection by setting its first argument (stored as the .default_factory attribute) to list.

dep = [('Sales', 'John Doe'),
       ('Sales', 'Martin Smith'),
       ('Accounting', 'Jane Doe'),
       ('Marketing', 'Elizabeth Smith'),
       ('Marketing', 'Adam Doe')]

from collections import defaultdict

dep_dd = defaultdict(list)

for department, employee in dep:
    dep_dd[department].append(employee)

dep_dd
#defaultdict(list,
#            {'Sales': ['John Doe', 'Martin Smith'],
#             'Accounting': ['Jane Doe'],
#             'Marketing': ['Elizabeth Smith', 'Adam Doe']})
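For comparison, here is a sketch of the same grouping with a regular dict, using the built-in dict.setdefault method. The name dep_plain is only for illustration:

```python
dep = [('Sales', 'John Doe'),
       ('Sales', 'Martin Smith'),
       ('Accounting', 'Jane Doe'),
       ('Marketing', 'Elizabeth Smith'),
       ('Marketing', 'Adam Doe')]

dep_plain = {}

for department, employee in dep:
    # setdefault returns the existing list, or stores and returns a new empty one
    dep_plain.setdefault(department, []).append(employee)

print(dep_plain)
# {'Sales': ['John Doe', 'Martin Smith'], 'Accounting': ['Jane Doe'], 'Marketing': ['Elizabeth Smith', 'Adam Doe']}
```

This works, but a new empty list is built on every iteration whether or not it is needed, and the intent is less obvious than with defaultdict(list).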

What happens when you have duplicate entries? We're jumping ahead slightly to use a set to handle duplicates and group only unique entries:


# departments with duplicate entries
dep = [('Sales', 'John Doe'),
       ('Sales', 'Martin Smith'),
       ('Accounting', 'Jane Doe'),
       ('Marketing', 'Elizabeth Smith'),
       ('Marketing', 'Elizabeth Smith'),
       ('Marketing', 'Adam Doe'),
       ('Marketing', 'Adam Doe'),
       ('Marketing', 'Adam Doe')]

# use defaultdict with set
dep_dd = defaultdict(set)

# set object has no attribute 'append'
# so use 'add' to achieve the same effect
for department, employee in dep:
    dep_dd[department].add(employee)

dep_dd
#defaultdict(set,
#            {'Sales': {'John Doe', 'Martin Smith'},
#             'Accounting': {'Jane Doe'},
#             'Marketing': {'Adam Doe', 'Elizabeth Smith'}})
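Because sets are unordered, the order of names inside each set may differ from what is shown above. One way to get a stable display, sketched here with the illustrative name dep_sorted, is to convert each set to a sorted list:

```python
from collections import defaultdict

dep = [('Sales', 'John Doe'),
       ('Sales', 'Martin Smith'),
       ('Accounting', 'Jane Doe'),
       ('Marketing', 'Elizabeth Smith'),
       ('Marketing', 'Elizabeth Smith'),
       ('Marketing', 'Adam Doe'),
       ('Marketing', 'Adam Doe'),
       ('Marketing', 'Adam Doe')]

dep_dd = defaultdict(set)

for department, employee in dep:
    dep_dd[department].add(employee)

# convert each set to a sorted list for a stable, readable display
dep_sorted = {department: sorted(names) for department, names in dep_dd.items()}

print(dep_sorted)
# {'Sales': ['John Doe', 'Martin Smith'], 'Accounting': ['Jane Doe'], 'Marketing': ['Adam Doe', 'Elizabeth Smith']}
```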

Application: Accumulating with defaultdict

Finally, we'll use defaultdict to accumulate values:

incomes = [('Books', 1250.00),
           ('Books', 1300.00),
           ('Books', 1420.00),
           ('Tutorials', 560.00),
           ('Tutorials', 630.00),
           ('Tutorials', 750.00),
           ('Courses', 2500.00),
           ('Courses', 2430.00),
           ('Courses', 2750.00),]

# pass float as the argument; missing keys start at 0.0
dd = defaultdict(float)

for product, income in incomes:
    dd[product] += income

# defaultdict(float, {'Books': 3970.0, 'Tutorials': 1940.0, 'Courses': 7680.0})

for product, income in dd.items():
    print(f"Total income for {product}: ${income:,.2f}")

# Total income for Books: $3,970.00
# Total income for Tutorials: $1,940.00
# Total income for Courses: $7,680.00

I can see that defaultdict and dictionaries can be handy for grouping, counting and accumulating values in a column. We'll come back to revisit these foundational concepts once the data science applications are clearer.

In summary, dictionaries and defaultdict can be used to group items, accumulate items and count items. Both can be used even when the key doesn't (yet) exist, but defaultdict handles this more succinctly. For now, we'll stop here and proceed to the next topic: counters.

Paul Apivat Data Journey
