One way to answer this question is to use text mining to **tokenize** the dish names by word and count word frequency as one measure of popularity.

In the bar chart below, we see the frequency of words across all Thai dishes. **Mu**, which means pork in Thai, appears most frequently across all dish types and sub-groupings. Next we have **kaeng**, which means curry. **Phat** comes in third, suggesting that stir-frying is a popular cooking mode.

As we can see, **not** all words refer to raw materials, so we may not be able to answer this question directly.

```
library(tidyverse) # for read_csv(), dplyr verbs and ggplot2
library(tidytext)
library(scales)

# new csv file after data cleaning (see above)
df <- read_csv("../web_scraping/edit_thai_dishes.csv")

df %>%
    select(Thai_name, Thai_script) %>%
    # can substitute 'word' for ngrams, sentences, lines
    unnest_tokens(ngrams, Thai_name) %>%
    # to reference thai spelling: group_by(Thai_script)
    group_by(ngrams) %>%
    tally(sort = TRUE) %>% # alt: count(sort = TRUE)
    filter(n > 9) %>%
    # visualize
    # pipe directly into ggplot2, because using tidytools
    ggplot(aes(x = n, y = reorder(ngrams, n))) +
    geom_col(aes(fill = ngrams)) +
    scale_fill_manual(values = c(
        "#c3d66b", "#70290a", "#2f1c0b", "#ba9d8f", "#dda37b",
        "#8f5e23", "#96b224", "#dbcac9", "#626817", "#a67e5f",
        "#be7825", "#446206", "#c8910b", "#88821b", "#313d5f",
        "#73869a", "#6f370f", "#c0580d", "#e0d639", "#c9d0ce",
        "#ebf1f0", "#50607b"
    )) +
    theme_minimal() +
    theme(legend.position = "none") +
    labs(
        x = "Frequency",
        y = "Words",
        title = "Frequency of Words in Thai Cuisine",
        subtitle = "Words appearing at least 10 times in Individual or Shared Dishes",
        caption = "Data: Wikipedia | Graphic: @paulapivat"
    )
```
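The tokenize-and-count step above can be sketched in plain Python with `collections.Counter` (the dish names here are hypothetical stand-ins for the scraped `Thai_name` column):

```python
from collections import Counter

# hypothetical dish names standing in for the scraped Thai_name column
dish_names = ["khao phat mu", "phat thai", "kaeng khiao wan", "mu krop", "khao man kai"]

# tokenize by splitting on whitespace, then count word frequency
words = [w for name in dish_names for w in name.split()]
freq = Counter(words)

print(freq.most_common(3))  # [('khao', 2), ('phat', 2), ('mu', 2)]
```

`unnest_tokens()` does the same thing (plus lowercasing and punctuation stripping) inside the data frame.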

We can also see words common to both Individual and Shared Dishes. We see other words like **nuea** (beef), **phrik** (chili) and **kaphrao** (basil leaves).

```
# frequency for Thai_dishes (Major Grouping) ----
# comparing Individual and Shared Dishes (Major Grouping)
thai_name_freq <- df %>%
    select(Thai_name, Thai_script, major_grouping) %>%
    unnest_tokens(ngrams, Thai_name) %>%
    count(ngrams, major_grouping) %>%
    group_by(major_grouping) %>%
    mutate(proportion = n / sum(n)) %>%
    select(major_grouping, ngrams, proportion) %>%
    spread(major_grouping, proportion) %>%
    gather(major_grouping, proportion, c(`Shared dishes`)) %>%
    select(ngrams, `Individual dishes`, major_grouping, proportion)

# Expect a warning message about missing values
ggplot(thai_name_freq, aes(x = proportion, y = `Individual dishes`,
                           color = abs(`Individual dishes` - proportion))) +
    geom_abline(color = 'gray40', lty = 2) +
    geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
    geom_text(aes(label = ngrams), check_overlap = TRUE, vjust = 1.5) +
    scale_x_log10(labels = percent_format()) +
    scale_y_log10(labels = percent_format()) +
    scale_color_gradient(limits = c(0, 0.01),
                         low = "red", high = "blue") + # alt: low = "darkslategray4", high = "gray75"
    theme_minimal() +
    theme(legend.position = "none",
          legend.text = element_text(angle = 45, hjust = 1)) +
    labs(y = "Individual Dishes",
         x = "Shared Dishes",
         color = NULL,
         title = "Comparing Word Frequencies in the Names of Thai Dishes",
         subtitle = "Individual and Shared Dishes",
         caption = "Data: Wikipedia | Graphics: @paulapivat")
```

We can only learn so much from raw frequency, so text mining practitioners created **term frequency - inverse document frequency** (tf-idf) to better reflect how important a word is in a document or corpus (further details here).

Again, the words don't necessarily refer to raw materials, so this question can't be fully answered directly here.
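To make the tf-idf idea concrete, here is a minimal Python sketch over two toy "documents" (the groupings and word lists are hypothetical; in the post itself this is handled by `tidytext`):

```python
import math

# toy "documents": word lists for two dish groupings
docs = {
    "individual": ["khao", "phat", "mu", "khao"],
    "shared": ["kaeng", "mu", "phrik"],
}

def tf_idf(word, doc_name):
    words = docs[doc_name]
    tf = words.count(word) / len(words)                     # term frequency
    n_containing = sum(word in ws for ws in docs.values())  # document frequency
    idf = math.log(len(docs) / n_containing)                # inverse document frequency
    return tf * idf

# "mu" appears in both documents, so its idf (and tf-idf) is zero
print(tf_idf("mu", "individual"))  # 0.0
# "khao" is distinctive to the individual grouping, so it scores higher
print(tf_idf("khao", "individual"))
```

Because "mu" appears in every document, its idf is zero: tf-idf rewards words that are frequent in one document but rare across the corpus.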

The short answer is "yes".

From frequency and "term frequency - inverse document frequency" we learned not only the most frequent words, but also their relative importance within the set of words we tokenized with `tidytext`. This informs us not only of popular raw materials (pork), but also dish types (curries) and popular modes of preparation (stir-fry).

We can even examine the **network of relationships** between words. Darker arrows suggest a stronger relationship between pairs of words; for example, "nam phrik" is a strong pairing. It means "chili sauce" in Thai, suggesting the important role chili sauce plays across many types of dishes.

We learned above that "mu" (pork) appears frequently. Now we see that "mu" and "krop" are more related than other pairings (note: "mu krop" means "crispy pork"). We also saw above that "khao" appears frequently in rice dishes. This alone is not surprising, as "khao" means rice in Thai, but here we see that "khao" and "phat" are strongly related, suggesting that fried rice ("khao phat") is quite popular.

```
# Visualizing a network of bi-grams with {ggraph} ----
library(igraph)
library(ggraph)

set.seed(2021)

thai_dish_bigram_counts <- df %>%
    select(Thai_name, minor_grouping) %>%
    unnest_tokens(bigram, Thai_name, token = "ngrams", n = 2) %>%
    separate(bigram, c("word1", "word2"), sep = " ") %>%
    count(word1, word2, sort = TRUE)

# filter for relatively common combinations (n > 2)
thai_dish_bigram_graph <- thai_dish_bigram_counts %>%
    filter(n > 2) %>%
    graph_from_data_frame()

# polishing operations to make a better looking graph
a <- grid::arrow(type = "closed", length = unit(.15, "inches"))

set.seed(2021)
ggraph(thai_dish_bigram_graph, layout = "fr") +
    geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
                   arrow = a, end_cap = circle(.07, 'inches')) +
    geom_node_point(color = "dodgerblue", size = 5, alpha = 0.7) +
    geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
    labs(
        title = "Network of Relations between Word Pairs",
        subtitle = "{ggraph}: common nodes in Thai food",
        caption = "Data: Wikipedia | Graphics: @paulapivat"
    ) +
    theme_void()
```
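The bigram-counting step in the block above can be sketched in plain Python with `zip` (hypothetical names again):

```python
from collections import Counter

dish_names = ["khao phat mu", "khao phat kai", "mu krop", "nam phrik kha"]

# build word pairs within each name, mirroring unnest_tokens(..., token = "ngrams", n = 2)
bigrams = []
for name in dish_names:
    words = name.split()
    bigrams.extend(zip(words, words[1:]))

counts = Counter(bigrams)
print(counts.most_common(1))  # [(('khao', 'phat'), 2)]
```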

Finally, we may be interested in word relationships *within* individual dishes.

The graph below shows a network of word pairs with moderate-to-high correlations. We can see certain words clustered near each other with relatively dark lines: kaeng (curry), pet (spicy), wan (sweet), khiao (green curry), phrik (chili) and mu (pork). These words represent a collection of ingredients, cooking modes and descriptors that are often combined.

```
library(widyr) # for pairwise_cor()

set.seed(2021)

# Individual Dishes
individual_dish_words <- df %>%
    select(major_grouping, Thai_name) %>%
    filter(major_grouping == 'Individual dishes') %>%
    mutate(section = row_number() %/% 10) %>%
    filter(section > 0) %>%
    unnest_tokens(word, Thai_name) # assume no stop words

individual_dish_cors <- individual_dish_words %>%
    group_by(word) %>%
    filter(n() >= 2) %>% # looking for co-occurring words, so must appear 2 or more times
    pairwise_cor(word, section, sort = TRUE)

individual_dish_cors %>%
    filter(correlation < -0.40) %>%
    graph_from_data_frame() %>%
    ggraph(layout = "fr") +
    geom_edge_link(aes(edge_alpha = correlation, size = correlation), show.legend = TRUE) +
    geom_node_point(color = "green", size = 5, alpha = 0.5) +
    geom_node_text(aes(label = name), repel = TRUE) +
    labs(
        title = "Word Pairs in Individual Dishes",
        subtitle = "{ggraph}: Negatively correlated (r = -0.4)",
        caption = "Data: Wikipedia | Graphics: @paulapivat"
    ) +
    theme_void()
```
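For intuition, `pairwise_cor()` from {widyr} computes the phi coefficient from word co-occurrence across sections; here is a hand-rolled Python sketch of that formula (toy section sets, and the mapping to widyr is my reading of its docs):

```python
import math

def phi(sections_a, sections_b, all_sections):
    """Phi coefficient between two words, given the sets of sections each appears in."""
    n11 = len(sections_a & sections_b)         # both words present
    n10 = len(sections_a - sections_b)         # only word A
    n01 = len(sections_b - sections_a)         # only word B
    n00 = len(all_sections) - n11 - n10 - n01  # neither word
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return (n11 * n00 - n10 * n01) / denom if denom else 0.0

sections = set(range(10))
kaeng = {0, 1, 2, 3}  # hypothetical sections containing "kaeng"
khiao = {0, 1, 2, 4}  # hypothetical sections containing "khiao"
print(round(phi(kaeng, khiao, sections), 2))  # 0.58: the words co-occur often
```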

We have completed an exploratory data project in which we scraped, cleaned, manipulated and visualized data using a combination of Python and R. We also used the `tidytext` package for basic text mining tasks to see if we could gain some insights into Thai cuisine using words from dish names scraped off Wikipedia.

For more content on data science, R, Python, SQL and more, find me on Twitter.

Data cleaning is typically non-linear.

We'll manipulate the data to explore, learn *about* the data and see that certain things need cleaning or, in some cases, require going back to Python to re-scrape. The columns `a1` and `a6` were scraped differently from the other columns due to **missing data** found during exploration and cleaning. For certain links, using `.find(text=True)` did not work as intended, so a slight adjustment was made.

For this post, `R` is the tool of choice for cleaning the data.

Here are other data cleaning tasks:

- Changing column names (snake case)

```
# read data
df <- read_csv("thai_dishes.csv")

# change column names
df <- df %>%
    rename(
        Thai_name = `Thai name`,
        Thai_name_2 = `Thai name 2`,
        Thai_script = `Thai script`,
        English_name = `English name`
    )
```

- Remove newline escape sequences (`\n`)

```
# remove \n from all columns ----
df$Thai_name <- gsub("[\n]", "", df$Thai_name)
df$Thai_name_2 <- gsub("[\n]", "", df$Thai_name_2)
df$Thai_script <- gsub("[\n]", "", df$Thai_script)
df$English_name <- gsub("[\n]", "", df$English_name)
df$Image <- gsub("[\n]", "", df$Image)
df$Region <- gsub("[\n]", "", df$Region)
df$Description <- gsub("[\n]", "", df$Description)
df$Description2 <- gsub("[\n]", "", df$Description2)
```
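The same newline cleanup could be sketched in Python with `re.sub` before the data ever reaches R:

```python
import re

# toy strings standing in for one scraped column
raw = ["Khao phat\n", "Kaeng khiao wan\n", "Mu krop"]

# strip newline characters, mirroring gsub("[\n]", "", x) in R
clean = [re.sub(r"\n", "", s) for s in raw]
print(clean)  # ['Khao phat', 'Kaeng khiao wan', 'Mu krop']
```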

- Add/mutate new columns (major_grouping, minor_grouping):

```
# Add Major AND Minor Groupings ----
df <- df %>%
    mutate(
        major_grouping = as.character(NA),
        minor_grouping = as.character(NA)
    )
```

- Edit rows for missing data in the Thai_name column: 26, 110, 157, 234-238, 240, 241, 246

**Note**: This was only necessary the first time around; after the changes were made to how I scraped `a1` and `a6`, this step is **no longer necessary**:

```
# If necessary; may not need to do this after scraping a1 and a6 - see above
# Edit Rows for missing Thai_name
df[26,]$Thai_name <- "Khanom chin nam ngiao"
df[110,]$Thai_name <- "Lap Lanna"
df[157,]$Thai_name <- "Kai phat khing"
df[234,]$Thai_name <- "Nam chim chaeo"
df[235,]$Thai_name <- "Nam chim kai"
df[236,]$Thai_name <- "Nam chim paesa"
df[237,]$Thai_name <- "Nam chim sate"
df[238,]$Thai_name <- "Nam phrik i-ke"
df[240,]$Thai_name <- "Nam phrik kha"
df[241,]$Thai_name <- "Nam phrik khaep mu"
df[246,]$Thai_name <- "Nam phrik pla chi"
```

- save to "edit_thai_dishes.csv"

```
# Write new csv to save edits made to data frame
write_csv(df, "edit_thai_dishes.csv")
```

There are several ways to visualize the data. Because we want to communicate the diversity of Thai dishes, *aside* from Pad Thai, we want a visualization that captures the many, many options.

I opted for a **dendrogram**. This graph assumes hierarchy within the data, which fits our project because we can organize the dishes into groupings and sub-groupings.

We first make a distinction between **individual** and **shared** dishes to show that Pad Thai is not even close to being the best *individual* dish. And, in fact, more dishes fall under the **shared** grouping.

To avoid cramming too much data into one visual, we'll create two separate visualizations for individual vs. shared dishes.

Here is the first **dendrogram** representing 52 individual dish alternatives to Pad Thai.

Creating a dendrogram requires the `ggraph` and `igraph` libraries. First, we'll load the libraries and subset our data frame by filtering for Individual Dishes:

```
df <- read_csv("edit_thai_dishes.csv")

library(ggraph)
library(igraph)

df %>%
    select(major_grouping, minor_grouping, Thai_name, Thai_script) %>%
    filter(major_grouping == 'Individual dishes') %>%
    group_by(minor_grouping) %>%
    count()
```

We create edges (i.e., from and to pairs) to define the sub-groupings within Individual Dishes (i.e., Rice, Noodles and Misc):

```
# Individual Dishes ----
# data: edge list
d1 <- data.frame(from = "Individual dishes", to = c("Misc Indiv", "Noodle dishes", "Rice dishes"))

d2 <- df %>%
    select(minor_grouping, Thai_name) %>%
    slice(1:53) %>%
    rename(
        from = minor_grouping,
        to = Thai_name
    )

edges <- rbind(d1, d2)

# plot dendrogram (individual dishes)
indiv_dishes_graph <- graph_from_data_frame(edges)

ggraph(indiv_dishes_graph, layout = "dendrogram", circular = FALSE) +
    geom_edge_diagonal(aes(edge_colour = edges$from), label_dodge = NULL) +
    geom_node_text(aes(label = name, filter = leaf, color = 'red'), hjust = 1.1, size = 3) +
    geom_node_point(color = "whitesmoke") +
    theme(
        plot.background = element_rect(fill = '#343d46'),
        panel.background = element_rect(fill = '#343d46'),
        legend.position = 'none',
        plot.title = element_text(colour = 'whitesmoke', face = 'bold', size = 25),
        plot.subtitle = element_text(colour = 'whitesmoke', face = 'bold'),
        plot.caption = element_text(color = 'whitesmoke', face = 'italic')
    ) +
    labs(
        title = '52 Alternatives to Pad Thai',
        subtitle = 'Individual Thai Dishes',
        caption = 'Data: Wikipedia | Graphic: @paulapivat'
    ) +
    expand_limits(x = c(-1.5, 1.5), y = c(-0.8, 0.8)) +
    coord_flip() +
    annotate("text", x = 47, y = 1, label = "Miscellaneous (7)", color = "#7CAE00") +
    annotate("text", x = 31, y = 1, label = "Noodle Dishes (24)", color = "#00C08B") +
    annotate("text", x = 8, y = 1, label = "Rice Dishes (22)", color = "#C77CFF") +
    annotate("text", x = 26, y = 2, label = "Individual\nDishes", color = "#F8766D")
```

There are approximately **4X** as many *shared* dishes as individual dishes, so the dendrogram should be **circular** to fit the names of all dishes in one graphic.

A wonderful resource I use regularly for these types of visuals is the R Graph Gallery. There was a slight issue in how the **text angles** were calculated, so I submitted a PR to fix it.

Perhaps distinguishing between individual and shared dishes is too crude. Within the dendrogram for 201 shared Thai dishes, we can see further sub-groupings, including Curries, Sauces/Pastes, Steamed, Grilled, Deep-Fried, Fried & Stir-Fried, Salads, Soups and other Misc:

```
# Shared Dishes ----
df %>%
    select(major_grouping, minor_grouping, Thai_name, Thai_script) %>%
    filter(major_grouping == 'Shared dishes') %>%
    group_by(minor_grouping) %>%
    count() %>%
    arrange(desc(n))

d3 <- data.frame(from = "Shared dishes",
                 to = c("Curries", "Soups", "Salads",
                        "Fried and stir-fried dishes", "Deep-fried dishes",
                        "Grilled dishes", "Steamed or blanched dishes",
                        "Stewed dishes", "Dipping sauces and pastes", "Misc Shared"))

d4 <- df %>%
    select(minor_grouping, Thai_name) %>%
    slice(54:254) %>%
    rename(
        from = minor_grouping,
        to = Thai_name
    )

edges2 <- rbind(d3, d4)

# create a vertices data frame, one line per object of the hierarchy
vertices <- data.frame(
    name = unique(c(as.character(edges2$from), as.character(edges2$to)))
)

# add a column with the group of each name, used later to color points
vertices$group <- edges2$from[match(vertices$name, edges2$to)]

# add label information: angle, horizontal adjustment and potential flip
# calculate the ANGLE of the labels
vertices$id <- NA
myleaves <- which(is.na(match(vertices$name, edges2$from)))
nleaves <- length(myleaves)
vertices$id[myleaves] <- seq(1:nleaves)
vertices$angle <- 360 / nleaves * vertices$id + 90

# calculate the alignment of labels: right or left
vertices$hjust <- ifelse(vertices$angle < 275, 1, 0)

# flip angles to make the labels readable
vertices$angle <- ifelse(vertices$angle < 275, vertices$angle + 180, vertices$angle)

# plot dendrogram (shared dishes)
shared_dishes_graph <- graph_from_data_frame(edges2)

ggraph(shared_dishes_graph, layout = "dendrogram", circular = TRUE) +
    geom_edge_diagonal(aes(edge_colour = edges2$from), label_dodge = NULL) +
    geom_node_text(aes(x = x * 1.15, y = y * 1.15, filter = leaf, label = name,
                       angle = vertices$angle, hjust = vertices$hjust,
                       colour = vertices$group), size = 2.7, alpha = 1) +
    geom_node_point(color = "whitesmoke") +
    theme(
        plot.background = element_rect(fill = '#343d46'),
        panel.background = element_rect(fill = '#343d46'),
        legend.position = 'none',
        plot.title = element_text(colour = 'whitesmoke', face = 'bold', size = 25),
        plot.subtitle = element_text(colour = 'whitesmoke', margin = margin(0, 0, 30, 0), size = 20),
        plot.caption = element_text(color = 'whitesmoke', face = 'italic')
    ) +
    labs(
        title = 'Thai Food is Best Shared',
        subtitle = '201 Ways to Make Friends',
        caption = 'Data: Wikipedia | Graphic: @paulapivat'
    ) +
    expand_limits(x = c(-1.5, 1.5), y = c(-1.5, 1.5)) +
    coord_flip() +
    annotate("text", x = 0.4, y = 0.45, label = "Steamed", color = "#F564E3") +
    annotate("text", x = 0.2, y = 0.5, label = "Grilled", color = "#00BA38") +
    annotate("text", x = -0.2, y = 0.5, label = "Deep-Fried", color = "#DE8C00") +
    annotate("text", x = -0.4, y = 0.1, label = "Fried &\n Stir-Fried", color = "#7CAE00") +
    annotate("text", x = -0.3, y = -0.4, label = "Salads", color = "#00B4F0") +
    annotate("text", x = -0.05, y = -0.5, label = "Soups", color = "#C77CFF") +
    annotate("text", x = 0.3, y = -0.5, label = "Curries", color = "#F8766D") +
    annotate("text", x = 0.5, y = -0.1, label = "Misc", color = "#00BFC4") +
    annotate("text", x = 0.5, y = 0.1, label = "Sauces\nPastes", color = "#B79F00")
```
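The label-angle arithmetic in the block above (spread the leaf labels over 360°, then flip and re-justify the ones past the threshold so the text reads outward) can be sketched in Python with a hypothetical leaf count:

```python
# assume 8 leaf labels with ids 1..8, mirroring the vertices$angle calculation
nleaves = 8
angles = [360 / nleaves * i + 90 for i in range(1, nleaves + 1)]

# right-align labels on one side of the circle, left-align on the other (vertices$hjust)
hjust = [1 if a < 275 else 0 for a in angles]

# flip angles below the threshold by 180 degrees so text is not upside down
angles = [a + 180 if a < 275 else a for a in angles]

print(list(zip(angles, hjust)))
```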

"Let's order Thai."

"Great, what's your go-to dish?"

"Pad Thai."

This has bugged me for years and is the genesis for this project.

People need to know they have other choices aside from Pad Thai. Pad Thai is one of 53 individual dishes, and stopping there risks missing out on at least 201 shared Thai dishes (source: Wikipedia).

This project is an opportunity to build a data set of Thai dishes by scraping tables off Wikipedia. We will use Python for web scraping and R for visualization. Web scraping is done with `Beautiful Soup` (Python); the data is then pre-processed with `dplyr` and visualized with `ggplot2`.

Furthermore, we'll use the `tidytext` package in R to explore the names of Thai dishes (in English) to see if we can learn some interesting things from text data.

Finally, there is an opportunity to make an open source contribution.

The project repo is here.

The purpose of this analysis is to generate questions.

Because **exploratory analysis** is iterative, these questions were generated in the process of manipulating and visualizing data. We can use these questions to structure the rest of the post:

- How might we organize Thai dishes?
- What is the best way to organize the different dishes?
- Which raw material(s) are most popular?
- Which raw materials are most important?
- Could you learn about Thai food just from the names of the dishes?

We scraped over 300 Thai dishes. For each dish, we got:

- Thai name
- Thai script
- English name
- Region
- Description

First, we'll use the following Python libraries/modules:

```
import requests
from bs4 import BeautifulSoup
import urllib.request
import urllib.parse
import urllib.error
import ssl
import pandas as pd
```

We'll use `requests` to send an HTTP request to the Wikipedia URL we need. We'll access network sockets using 'secure sockets layer' (SSL). Then we'll read in the HTML data and parse it with **Beautiful Soup**.

Before using **Beautiful Soup**, we want to understand the structure of the page (and tables) we want to scrape using **inspect element** in the browser (note: I used Chrome). We can see that we want the `table` tag, along with the `class` of `wikitable sortable`.

The main method we'll use from **Beautiful Soup** is `findAll()`, and the three tags we'll pass to it are `th` (header cell in an HTML table), `tr` (row in an HTML table) and `td` (standard data cell).

First, we'll save the table headers in a list, which we'll use when creating an empty `dictionary` to store the data we need.

```
# grab all 16 tables by tag and class (assumes 'soup' was parsed above)
all_tables = soup.findAll('table', {'class': 'wikitable sortable'})

# save table headers in a list, then build an empty dictionary keyed by them
header = [item.text.rstrip() for item in all_tables[0].findAll('th')]
table = dict([(x, 0) for x in header])
```

Initially, we want to scrape one table, knowing that we'll need to repeat the process for all 16 tables. Therefore we'll use a *nested loop*. Because all tables have 6 columns, we'll want to create 6 empty lists.

We'll scrape through all table rows `tr`, check that each row has 6 cells (one per column), then *append* the data to the corresponding empty list.

```
# loop through all 16 tables
a = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]

# 6 empty lists (for 6 columns) to store data
a1 = []
a2 = []
a3 = []
a4 = []
a5 = []
a6 = []

# nested loop: iterate over all 16 tables, then over each table's rows
for i in a:
    for row in all_tables[i].findAll('tr'):
        cells = row.findAll('td')
        if len(cells) == 6:
            a1.append([string for string in cells[0].strings])
            a2.append(cells[1].find(text=True))
            a3.append(cells[2].find(text=True))
            a4.append(cells[3].find(text=True))
            a5.append(cells[4].find(text=True))
            a6.append([string for string in cells[5].strings])
```
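Why collecting *all* nested strings matters for `a1` and `a6` can be illustrated with a stdlib-only parser (a sketch, not the Beautiful Soup API): text wrapped in a link is nested one level below the cell, so grabbing only the first text node would miss it, while accumulating every text node inside the `td` does not.

```python
from html.parser import HTMLParser

class CellText(HTMLParser):
    """Collect the full text of each <td>, including text nested inside links."""
    def __init__(self):
        super().__init__()
        self.cells, self._in_td, self._buf = [], False, []
    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_td, self._buf = True, []
    def handle_endtag(self, tag):
        if tag == "td":
            self.cells.append("".join(self._buf).strip())
            self._in_td = False
    def handle_data(self, data):
        if self._in_td:  # text inside <a> still counts, since we're inside the <td>
            self._buf.append(data)

row = "<tr><td>Pad thai</td><td><a href='#'>ผัดไทย</a></td></tr>"
p = CellText()
p.feed(row)
print(p.cells)  # ['Pad thai', 'ผัดไทย'] -- the link text is captured too
```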

You'll note the code for `a1` and `a6` is slightly different. In retrospect, I found that `cells[0].find(text=True)` did **not** yield certain texts, particularly if they were links, so a slight adjustment was made. The `strings` attribute returns `NavigableString` objects, while text returns a `unicode` object (see the Stack Overflow explanation).

After we've scraped the data, we'll store it in a `dictionary` before converting to a `data frame`:

```
# create dictionary
table = dict([(x, 0) for x in header])
# append dictionary with corresponding data list
table['Thai name'] = a1
table['Thai script'] = a2
table['English name'] = a3
table['Image'] = a4
table['Region'] = a5
table['Description'] = a6
# turn dict into dataframe
df_table = pd.DataFrame(table)
```

For `a1` and `a6`, we need an extra step of joining the strings together, so I've created two additional corresponding columns, `Thai name 2` and `Description2`:

```
# Need to flatten two columns: 'Thai name' and 'Description'
# Create two new columns
df_table['Thai name 2'] = ""
df_table['Description2'] = ""

# join all words in the list for each of the 328 rows,
# automatically flattening the list
df_table['Description2'] = [
    ' '.join(cell) for cell in df_table['Description']]
df_table['Thai name 2'] = [
    ' '.join(cell) for cell in df_table['Thai name']]
```

After we've scraped all the data and converted from `dictionary` to `data frame`, we'll write to CSV to prepare for data cleaning in R (**note**: I saved the CSV as thai_dishes.csv, but you can choose a different name).

NLP is a subfield of linguistics, computer science and artificial intelligence (wiki), and you could spend years studying it.

However, I wanted a quick dive to get an intuition for how NLP works, and we'll do that via **sentiment analysis**: categorizing text by its polarity.

We can't help but feel motivated by insights about our *own* social media posts, so we'll turn to a well-known platform.

To find out, I downloaded 14 years of posts to apply **text** and **sentiment** analysis. We'll use `Python` to read and parse `json` data from Facebook.

We'll perform tasks such as tokenization and normalization aided by Python's **Natural Language Toolkit** (`NLTK`). Then, we'll use the `Vader` module (Hutto & Gilbert, 2014) for rule-based (lexicon) **sentiment analysis**.

Finally, we'll transition our workflow to `R` and the `tidyverse` for **data manipulation** and **visualization**.

First, you'll need to download your own Facebook data by following: Settings & Privacy > Settings > Your Facebook Information > Download Your Information > (select) Posts.

Below, I named my file `your_posts_1.json`, but you can change this. We'll use Python's `json` module to read in the data. We can get a feel for the data with `type` and `len`.

```
import json

# load json into python, assign to 'data'
with open('your_posts_1.json') as file:
    data = json.load(file)

type(data)     # a list
type(data[0])  # first object in the list: a dictionary
len(data)      # my list contains 2166 dictionaries
```

Here are the Python libraries we use in this post:

```
import pandas as pd
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.stem import LancasterStemmer, WordNetLemmatizer # OPTIONAL (more relevant for individual words)
from nltk.corpus import stopwords
from nltk.probability import FreqDist
import re
import unicodedata
import nltk
import json
import inflect
import matplotlib.pyplot as plt
```

The Natural Language Toolkit is a popular Python platform for working with human language data. While it has over 50 lexical resources, we'll use the Vader Sentiment Lexicon, which is *specifically* attuned to sentiments expressed in social media.

Regex (regular expressions) will be used to remove punctuation.

Unicode Database will be used to remove non-ASCII characters.

JSON module helps us to read in json from Facebook.

Inflect helps us to convert numbers to words.

Pandas is a powerful data manipulation and data analysis tool for when we save our text data into a data frame and write to csv.

After we have our data, we'll dig through to get actual **text data** (our posts).

We'll store this in a list.

**Note**: the `data` key occasionally returns an empty array, and we want to skip those entries by checking `if len(v) > 0`.

```
# create empty list
empty_lst = []

# nested loops to store all posts in the empty list
for dct in data:
    for k, v in dct.items():
        if k == 'data':
            if len(v) > 0:
                for k_i, v_i in v[0].items():
                    if k_i == 'post':
                        empty_lst.append(v_i)

print("This is the list of posts: ", empty_lst)
print("\nLength of list: ", len(empty_lst))
```

We now have a list of strings.

We'll loop through our list of strings (`empty_lst`) to tokenize each *sentence* with `nltk.sent_tokenize()`. We want to split the text into individual sentences.

This yields a list of lists, which we'll flatten:

```
# list of lists, len: 1762 (each inner list contains sentences)
nested_sent_token = [nltk.sent_tokenize(lst) for lst in empty_lst]

# flatten list, len: 3241
flat_sent_token = [item for sublist in nested_sent_token for item in sublist]
print("Flattened sentence tokens: ", len(flat_sent_token))
```

For context on the functions used in this section, check out this article by Matthew Mayo on Text Data Preprocessing.

First, we'll remove non-ASCII characters (`remove_non_ascii(words)`), including `#`, `-`, `'` and `?`, among many others. Then we'll lowercase (`to_lowercase(words)`), remove punctuation (`remove_punctuation(words)`), replace numbers with words (`replace_numbers(words)`), and remove stopwords (`remove_stopwords(words)`).

Example stopwords are: your, yours, yourself, yourselves, he, him, his, himself, etc. Removing them puts each sentence on an equal footing.

```
# Remove non-ASCII characters
def remove_non_ascii(words):
    """Remove non-ASCII characters from list of tokenized words"""
    new_words = []
    for word in words:
        new_word = unicodedata.normalize('NFKD', word).encode(
            'ascii', 'ignore').decode('utf-8', 'ignore')
        new_words.append(new_word)
    return new_words

# To lowercase
def to_lowercase(words):
    """Convert all characters to lowercase from list of tokenized words"""
    new_words = []
    for word in words:
        new_words.append(word.lower())
    return new_words

# Remove punctuation
def remove_punctuation(words):
    """Remove punctuation from list of tokenized words"""
    new_words = []
    for word in words:
        new_word = re.sub(r'[^\w\s]', '', word)
        if new_word != '':
            new_words.append(new_word)
    return new_words

# Replace numbers with textual representations
def replace_numbers(words):
    """Replace all integer occurrences in list of tokenized words with textual representation"""
    p = inflect.engine()
    new_words = []
    for word in words:
        if word.isdigit():
            new_words.append(p.number_to_words(word))
        else:
            new_words.append(word)
    return new_words

# Remove stopwords
def remove_stopwords(words):
    """Remove stop words from list of tokenized words"""
    new_words = []
    for word in words:
        if word not in stopwords.words('english'):
            new_words.append(word)
    return new_words

# Combine all functions into a single normalize() function
def normalize(words):
    words = remove_non_ascii(words)
    words = to_lowercase(words)
    words = remove_punctuation(words)
    words = replace_numbers(words)
    words = remove_stopwords(words)
    return words
```

The below screen cap gives us an idea of the difference between sentence **normalization** vs **non-normalization**.

```
sents = normalize(flat_sent_token)
print("Length of sentences list: ", len(sents)) # 3194
```

**NOTE**: The process of stemming and lemmatization makes more sense for individual words (rather than sentences), so we won't use them here.

You can use the `FreqDist()`

function to get the most common sentences. Then, you could plot a line chart for a visual comparison of the most frequent sentences.

Although simple, counting frequencies can yield some insights.

```
from nltk.probability import FreqDist
# Find frequency of sentence
fdist_sent = FreqDist(sents)
fdist_sent.most_common(10)
# Plot
fdist_sent.plot(10)
```

We'll use the `Vader` module from `NLTK`. Vader stands for Valence Aware Dictionary and sEntiment Reasoner.

We are taking a **Rule-based/Lexicon** approach to sentiment analysis because we have a fairly large dataset, but lack labeled data to build a robust training set. Thus, Machine Learning would **not** be ideal for this task.

To get an intuition for how the `Vader` module works, we can visit the GitHub repo to view `vader_lexicon.txt` (source). This is a **dictionary** that has been empirically validated. Sentiment ratings are provided by 10 independent human raters (pre-screened, trained and checked for inter-rater reliability).

Scores range from -4 (extremely negative) to 4 (extremely positive), with 0 as neutral. For example, "die" is rated -2.9, while "dignified" has a 2.2 rating. For more details, visit the repo.
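A toy illustration of the lexicon idea (deliberately *not* the real Vader algorithm, which also handles negation, punctuation and capitalization):

```python
# tiny hand-made lexicon with valence scores in Vader's -4..4 range
# ("die" at -2.9 matches the example above; the other entries are made up)
lexicon = {"love": 3.2, "great": 3.1, "die": -2.9, "terrible": -2.1}

def naive_score(sentence):
    """Average the valence of known words; 0.0 when no lexicon word appears."""
    hits = [lexicon[w] for w in sentence.lower().split() if w in lexicon]
    return sum(hits) / len(hits) if hits else 0.0

print(naive_score("This food is great"))  # 3.1
print(naive_score("A terrible day"))      # -2.1
print(naive_score("Nothing to report"))   # 0.0
```

The real Vader aggregates and normalizes these word-level valences into the polarity scores we use below.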

We'll create two empty lists to store the sentences and the polarity scores separately.

`sentiment` captures each sentence along with its polarity scores (negative, neutral, positive), calculated by `nltk.sentiment.vader.SentimentIntensityAnalyzer`. `sentiment2` captures each polarity label and value in a list of tuples.

The below screen cap should give you a sense of what we have:

After we have appended each sentence (`sentiment`) and its polarity scores (`sentiment2`: negative, neutral, positive), we'll **create data frames** to store these values.

Then, we'll write the data frames to **CSV** to transition to `R`. Note that we set index to false when saving the CSV. Python starts counting at 0, while `R` starts at 1, so we're better off re-creating the index as a separate column in `R`.

**NOTE**: There are more efficient ways to do what I'm doing here. My solution is to save two CSV files and move the workflow over to `R` for further data manipulation and visualization. This is primarily a personal preference for handling data frames and visualizations in `R`, but I should point out this *can* be done with `pandas` and `matplotlib`.

```
# nltk.download('vader_lexicon')
sid = SentimentIntensityAnalyzer()

sentiment = []
sentiment2 = []

for sent in sents:
    sent1 = sent
    sent_scores = sid.polarity_scores(sent1)
    for x, y in sent_scores.items():
        sentiment2.append((x, y))
    sentiment.append((sent1, sent_scores))
# print(sentiment)

# sentiment
cols = ['sentence', 'numbers']
result = pd.DataFrame(sentiment, columns=cols)
print("First five rows of results: ", result.head())

# sentiment2
cols2 = ['label', 'values']
result2 = pd.DataFrame(sentiment2, columns=cols2)
print("First five rows of results2: ", result2.head())

# save to CSV
result.to_csv('sent_sentiment.csv', index=False)
result2.to_csv('sent_sentiment_2.csv', index=False)
```

From this point forward, we'll be using `R` and the `tidyverse` for data manipulation and visualization. `RStudio` is the IDE of choice here. We'll create an `R Script` to store our data transformation and visualization steps. We should be in the same directory in which the above CSV files were created with `pandas`.

We'll load the two CSV files we saved, along with the `tidyverse` library:

```
library(tidyverse)
# load data
df <- read_csv("sent_sentiment.csv")
df2 <- read_csv('sent_sentiment_2.csv')
```

We'll create another column that matches the index for the first data frame (sent_sentiment.csv). I save it as `df1`, but you could overwrite the original `df` if you wanted.

```
# create a unique identifier for each sentence
df1 <- df %>%
    mutate(row = row_number())
```

Then, for the second data frame (sent_sentiment_2.csv), we'll create another column matching the index, but also use `pivot_wider` from the `tidyr` package. **NOTE**: You'll want to `group_by` label first, then use `mutate` to create a unique identifier.

We'll then use `pivot_wider` to ensure that all polarity values (negative, neutral, positive) have their own columns.

By creating a unique identifier using `mutate` and `row_number()`, we'll be able to join (`left_join`) by row.

Finally, I save the operation to `df3`, which gives me a fresh data frame to work from for visualization.

```
# long-to-wide for df2
# note: first, group by label; then, create a unique identifier for each label then use pivot_wider
df3 <- df2 %>%
group_by(label) %>%
mutate(row = row_number()) %>%
pivot_wider(names_from = label, values_from = values) %>%
left_join(df1, by = 'row') %>%
select(row, sentence, neg:compound, numbers)
```

First, we'll visualize the positive and negative polarity scores separately, across all 3194 sentences (your numbers will vary).

Here are positivity scores:

Here are negativity scores:

When I sum positivity and negativity scores to get a ratio, it's approximately 568:97, or 5.8x more positive than negative, according to `Vader` (Valence Aware Dictionary and sEntiment Reasoner).

The `Vader` module takes in every sentence and assigns a valence score from -1 (most negative) to 1 (most positive). We can classify sentences as `pos` (positive), `neu` (neutral) and `neg` (negative), or use the composite (`compound`) score (i.e., a normalized, weighted composite score). For more details, see the vader-sentiment documentation.

Here is a chart to see *both* positive and negative scores together (positive = blue, negative = red, neutral = black).

Finally, we can also use `histograms` to see the distribution of negative and positive sentiment among the sentences:

It turns out the `Vader` module is fully capable of analyzing sentences with punctuation, word-shape (capitalization for emphasis), slang and even UTF-8 encoded emojis.

To see whether there would be any difference if we implemented sentiment analysis **without normalization**, I re-ran all the analyses above.

Here are the two versions of the data for comparison: the top is normalized, the bottom non-normalized.

As expected, there are differences, but only slight ones.

I downloaded 14 years' worth of Facebook posts to run a rule-based sentiment analysis and visualize the results, using a combination of `Python` and `R`.

I enjoyed using both for this project and sought to play to their strengths. I found parsing JSON straightforward with Python, but once we transitioned to data frames, I was itching to get back to R.

Because we lacked labeled data, using a rule-based/lexicon-approach to sentiment analysis made sense. Now that we have a label for valence scores, it may be possible to take a machine learning approach to predict the valence of future posts.

- Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. Eighth International Conference on Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014.

For more content on data science, machine learning, R, Python, SQL and more, find me on Twitter.

This post uses various R libraries and functions to help you explore your Twitter Analytics data. The first thing to do is download data from analytics.twitter.com. The assumption here is that you're already a Twitter user and have been using it for at least six months.

Once there, you'll click on the `Tweets` tab, which should bring you to your Tweet activity with the option to **Export data**:

Once you click on **Export data**, you'll choose "By day", which provides your impressions and engagements metrics for everyday (you'll also select the time period, in the drop down menu right next to Export data - the default is "Last 28 Days").

**Note**: The other option is to choose "By Tweet" and that will download the text of each Tweet along with associated metrics. We could potentially do fun text analysis with this, but we'll save that for another post.

For this post, I downloaded all *available* data, which goes five months back.

After downloading, you'll want to **read** in the data and, in our case, **combine** all five months into one data frame. We'll use the `read_csv()` function from the `readr` package (contained in `tidyverse`), then `rbind()` to combine the five data frames by rows:

```
library(tidyverse)
# load data from September to mid-January
df1 <- read_csv("./daily_tweet_activity/daily_tweet_activity_metrics_paulapivat_20200901_20201001_en.csv")
df2 <- read_csv("./daily_tweet_activity/daily_tweet_activity_metrics_paulapivat_20201001_20201101_en.csv")
df3 <- read_csv("./daily_tweet_activity/daily_tweet_activity_metrics_paulapivat_20201101_20201201_en.csv")
df4 <- read_csv("./daily_tweet_activity/daily_tweet_activity_metrics_paulapivat_20201201_20210101_en.csv")
df5 <- read_csv("./daily_tweet_activity/daily_tweet_activity_metrics_paulapivat_20210101_20210112_en.csv")
# combining ALL five dataframes into ONE, by rows
df <- rbind(df1, df2, df3, df4, df5)
```

Twitter analytics tracks several metrics that are broadly grouped under Engagements, including: retweets, replies, likes, user profile clicks, url clicks, hashtag clicks, detail expands, media views and media engagements.

There are other metrics like "app opens" and "promoted engagements", which are services I have not used and so do not have any data available.

It's useful to have a guiding question as it helps focus your exploration. Let's say I was interested in whether one of my tweets prompted a reader to click on my profile. The metric for this is `user profile clicks`.

My initial guiding question for this post is:

Which metrics are most strongly correlated with User Profile Clicks?

You could simply use the `cor.test()` function, which comes with base R, to go one by one between *each* metric and `user profile clicks`. For example, below we calculate the correlation between three pairs of variables: `user profile clicks` paired with `retweets`, `replies` and `likes`, separately. After a while, this can get tedious.

```
cor.test(x = df$`user profile clicks`, y = df$retweets)
cor.test(x = df$`user profile clicks`, y = df$replies)
cor.test(x = df$`user profile clicks`, y = df$likes)
```

A quicker way to explore the relationship between pairs of metrics throughout a dataset is to use a **correlogram**.

We'll start with base R. You'll want to limit the number of variables you visualize so the correlogram doesn't become too cluttered. Here are four variables that correlate the highest with `User Profile Clicks`:

```
# four columns are selected along with user profile clicks to plot
df %>%
select(8, 12, 19:20, `user profile clicks`) %>%
plot(pch = 20, cex = 1.5, col="#69b3a2")
```

Here's a visual:

Here are another four metrics with *moderate* relationships:

```
df %>%
select(6:7, 10:11, `user profile clicks`) %>%
plot(pch = 20, cex = 1.5, col="#69b3a2")
```

Visually, you can see the moderate relationship scatter plots are more dispersed, with a less identifiable direction.

While base R is dependable, we can get more informative plots with the `GGally` package. Here are the four variables most highly correlated with `User Profile Clicks`:

```
library(GGally)
# GGally, Strongest Related
df %>%
select(8, 12, 19:20, `user profile clicks`) %>%
ggpairs(
diag = NULL,
title = "Strongest Relationships with User Profile Clicks: Sep 2020 - Jan 2021",
axisLabels = c("internal"),
xlab = "Value"
)
```

Here's the correlogram between the four variables most highly correlated with `user profile clicks`:

Here are the moderately correlated variables with `User Profile Clicks`:

As you can see, not only do these provide scatter plots, but they also show the numerical values of the correlation between each pair of variables, which is much more informative than base R.

Now, it's entirely possible that the pattern of correlations in your data is different; the initial patterns we're seeing here are not meant to generalize to a different dataset.

For more content on data science, machine learning, R, Python, SQL and more, find me on Twitter.

In this post, we'll explore Gradient Descent from the ground up, starting conceptually, then using code to build up our intuition brick by brick.

While this post is part of an ongoing series where I document my progress through Data Science from Scratch by Joel Grus, for this post I am drawing on external sources, including Aurélien Géron's Hands-On Machine Learning, to provide a context for why and when gradient descent is used.

We'll also be using external libraries such as `numpy`, which are generally avoided in Data Science from Scratch, to help highlight concepts.

While the book introduces gradient descent as a standalone topic, I find it more intuitive to reason about it within the context of a regression problem.

In any modeling task there is error, and our objective is to minimize the errors so that when we develop models from our training data, we'll have some confidence that the predictions will hold on test data and on completely new data.

We'll train a *linear regression model*. Our dataset will only have three data points. To create the model, we'll set up parameters (slope and intercept) that best "fit" the data (i.e., the best-fitting line), for example:

We know the values for both `x` and `y`, so we can calculate the slope and intercept directly through the **normal equation**, the analytical approach to finding regression coefficients (slope and intercept):

```
# Normal Equation
import numpy as np
import matplotlib.pyplot as plt
x = np.array([2, 4, 5])
y = np.array([45, 85, 105])
# computing Normal Equation
x_b = np.c_[np.ones((3, 1)), x] # add x0 = 1 to each of three instances
theta = np.linalg.inv(x_b.T.dot(x_b)).dot(x_b.T).dot(y)
# array([ 5., 20.])
theta
```

The key line is `np.linalg.inv()`, which computes the multiplicative inverse of a matrix.

Our slope is 20 and intercept is 5 (i.e., `theta`).

We could also have used the more familiar "rise over run" ((85 - 45) / (4 - 2)) or (40/2) or 20, but we want to illustrate the **normal equation** which should come in handy when we go beyond the simplistic three data point example.
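For the toy case, the "rise over run" arithmetic can be checked in two lines (a sketch of the mental math, not the book's code):

```python
# slope from two of the three points, then intercept solved from y = slope * x + b
slope = (85 - 45) / (4 - 2)   # rise over run
intercept = 45 - slope * 2    # plug the point (2, 45) back in
print(slope, intercept)       # 20.0 5.0
```

This matches the normal equation's `theta` of intercept 5 and slope 20.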

We could also have used the `LinearRegression` class from `sklearn`; here, we call the least squares function (`np.linalg.lstsq()`) that it relies on directly:

```
# Least Squares
from sklearn.linear_model import LinearRegression
import numpy as np
x = np.array([2, 4, 5])
y = np.array([45, 85, 105])
x = x.reshape(-1, 1) # reshape because sklearn expect 2D array
x_b = np.c_[np.ones((3, 1)), x] # add x0 = 1 to each of three instances
theta, residuals, rank, s = np.linalg.lstsq(x_b, y, rcond=1e-6)
# array([ 5., 20.])
print("theta:", theta)
```

This approach also yields the slope (20) and intercept (5) directly.

We know the parameters of `x` and `y` in our example, but we want to see how **learning from data** would work. Here's the equation we're working with:

```
y = 20 * x + 5
```

And here's what it looks like (intercept = 5, slope = 20)

The **normal equation** and the **least squares** approach can handle large training sets efficiently, but when your model has a large number of features or too many training instances to fit into memory, **gradient descent** is an often used alternative.

Moreover, linear least squares assumes the errors have a normal distribution and that the relationship in the data is linear (this is where closed-form solutions like the normal equation excel). When the data is non-linear, an iterative solution (gradient descent) can be used.

With linear regression, we seek to minimize the sum-of-squared differences between the observed data and the predicted values (aka the error) in a **non-iterative** fashion.

Alternatively, we can use gradient descent to find the slope and intercept that minimize the average squared error, in an **iterative** fashion.

The process for gradient descent is to start with a **random** slope and intercept, then compute the gradient of the mean squared error while adjusting the slope/intercept (`theta`) in the direction that continues to minimize the error. This is repeated iteratively until we find a point where the errors are *most* minimized.
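That loop can be condensed into a short pure-Python sketch using the same toy data, y = 20x + 5 (variable names and the 5,000-iteration count are my own choices, not the book's):

```python
import random

random.seed(0)
points = [(x, 20 * x + 5) for x in range(2, 6)]

# 1. random starting slope/intercept, small learning rate
slope, intercept = random.uniform(-1, 1), random.uniform(-1, 1)
lr = 0.001

for _ in range(5000):
    # 2. mean gradient of the squared error over all points
    g_slope = sum(2 * (slope * x + intercept - y) * x for x, y in points) / len(points)
    g_intercept = sum(2 * (slope * x + intercept - y) for x, y in points) / len(points)
    # 3. step against the gradient
    slope -= lr * g_slope
    intercept -= lr * g_intercept

print(round(slope, 2), round(intercept, 2))  # approaches 20 and 5
```

The book-style version below does the same thing, but built from reusable vector functions.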

**NOTE**: This section builds heavily on a previous post on linear algebra. You'll want to read this post to get a feel for the functions used to construct the functions we see in this post.

```
from typing import TypeVar, List, Iterator
import math
import random
import matplotlib.pyplot as plt
from typing import Callable
from typing import List
import numpy as np
x = np.array([2, 4, 5])
# instead of putting y directly, we'll use the equation: 20 * x + 5, which is a direct representation of its relationship to x
# y = np.array([45, 85, 105])
# both x and y are represented in inputs
inputs = [(x, 20 * x + 5) for x in range(2, 6)]
```

First, we'll start with random values for the slope and intercept; we'll also establish a learning rate, which controls how much a change in the model is warranted in response to the estimated error each time the model parameters (slope and intercept) are updated.

```
# 1. start with a random value for slope and intercept
theta = [random.uniform(-1, 1), random.uniform(-1, 1)]
learning_rate = 0.001
```

Next, we'll compute the mean of the gradients, then adjust the slope/intercept in the direction of minimizing the gradient, which is based on the error.

You'll note that this for-loop has 100 iterations. The more iterations we go through, the more that errors are minimized and the more we approach a slope/intercept where the model "fits" the data better.

You can see in this list comprehension, `[linear_gradient(x, y, theta) for x, y in inputs]`, that our `linear_gradient` function is applied to the known `x` and `y` values in the list of tuples, `inputs`, along with random values for slope/intercept (`theta`).

We multiply each `x` value by a random value for the slope, then add a random value for the intercept. This yields the initial prediction. The error is the gap between the initial prediction and the *actual* `y` values. We minimize the squared error by using its gradient.

```
from typing import List

Vector = List[float]  # type alias from the book's linear algebra chapter

# start with a function that determines the gradient based on the error from a single data point
def linear_gradient(x: float, y: float, theta: Vector) -> Vector:
    slope, intercept = theta
    predicted = slope * x + intercept    # model prediction
    error = (predicted - y)              # error is (predicted - actual)
    squared_error = error ** 2           # minimize squared error
    grad = [2 * error * x, 2 * error]    # using its gradient
    return grad
```

The `linear_gradient` function, along with the initial parameters, is then passed to `vector_mean`, which utilizes `scalar_multiply` and `vector_sum`:

```
def vector_mean(vectors: List[Vector]) -> Vector:
    """Computes the element-wise average"""
    n = len(vectors)
    return scalar_multiply(1/n, vector_sum(vectors))

def scalar_multiply(c: float, v: Vector) -> Vector:
    """Multiplies every element by c"""
    return [c * v_i for v_i in v]

def vector_sum(vectors: List[Vector]) -> Vector:
    """Sums all corresponding elements (componentwise sum)"""
    # check that vectors is not empty
    assert vectors, "no vectors provided!"
    # check that the vectors are all the same size
    num_elements = len(vectors[0])
    assert all(len(v) == num_elements for v in vectors), "different sizes!"
    # the i-th element of the result is the sum of every vector[i]
    return [sum(vector[i] for vector in vectors)
            for i in range(num_elements)]
```

This yields the gradient. Then, each `gradient_step` adjusts the initial random `theta` values (slope/intercept) in the direction that minimizes the error.

```
def gradient_step(v: Vector, gradient: Vector, step_size: float) -> Vector:
    """Moves `step_size` in the `gradient` direction from `v`"""
    assert len(v) == len(gradient)
    step = scalar_multiply(step_size, gradient)
    return add(v, step)

def add(v: Vector, w: Vector) -> Vector:
    """Adds corresponding elements"""
    assert len(v) == len(w), "vectors must be the same length"
    return [v_i + w_i for v_i, w_i in zip(v, w)]
```

All this comes together in this **for-loop** to print out how the slope and intercept change with each iteration (we start with 100):

```
for epoch in range(100):  # start with 100 <--- change this figure to try different iteration counts
    # compute the mean of the gradients
    grad = vector_mean([linear_gradient(x, y, theta) for x, y in inputs])
    # take a step in that direction
    theta = gradient_step(theta, grad, -learning_rate)
    print(epoch, grad, theta)

slope, intercept = theta
# assert 19.9 < slope < 20.1, "slope should be about 20"
# assert 4.9 < intercept < 5.1, "intercept should be about 5"
print("slope", slope)
print("intercept", intercept)
```

At 100 iterations, the slope is 18.87 and intercept is 4.87 and the gradient is -32.48 (error for the slope) and -8.45 (error for the intercept). These numbers suggest that we need to decrease the slope and intercept from our random starting point, but our emphasis needs to be on decreasing the slope.

At 200 iterations, the slope is 19.97 and intercept is 4.86 and the gradient is -1.76 (error for the slope) and -0.48 (error for the intercept). Our errors have been reduced significantly.

At 1000 iterations, the slope is 19.97 (not much difference from 200 iterations) and intercept is 5.09 and the gradients are markedly lower at -0.004 (error for the slope) and 0.02 (error for the intercept). Here the errors may not be much different from zero and we are near our optimal point.

In summary, the **normal equation** and **least squares** approaches gave us a slope of 20 and an intercept of 5. With gradient descent, we approached these values with each successive iteration; 1,000 iterations yielded **less error** than 100 or 200 iterations.

As mentioned above, the functions used to compute the gradients and adjust the slope/intercept build on functions we explored in a previous post. Here's a visual showing how the functions we used to iteratively arrive at the slope and intercept through gradient descent were built:

Gradient descent is an optimization technique often used in machine learning and in this post, we built some intuition around how it works by applying it to a simple linear regression problem, favoring code over math (which we'll return to in a later post). Gradient Descent is useful if you are expecting computational complexity due to the number of features or training instances.

We placed gradient descent in context, in comparison to a more analytical approach, normal equation and the least squares method, both of which are non-iterative.

Furthermore, we saw how the functions used in this post can be traced back to a previous post on linear algebra, giving us a big-picture view of how the building blocks of data science fit together, and an intuition for areas we'll need to explore at a deeper, perhaps more mathematical, level.

This post is part of an ongoing series where I document my progress through Data Science from Scratch by Joel Grus.

For more content on data science, machine learning, R, Python, SQL and more, find me on Twitter.

This is a quick walkthrough of using the `sunburstR` package to create sunburst plots in R. The original document is written in `RMarkdown`, an interactive version of markdown.

The following code can be run in **RMarkdown** or an **R script**. For interactive visuals, you'll want to use RMarkdown.

The two main libraries are `tidyverse` (mostly `dplyr`, so you can just load that if you want) and `sunburstR`. There are other packages for sunburst plots, including plotly and ggsunburst (off ggplot), but we'll explore sunburstR in this post.

```
library(tidyverse)
library(sunburstR)
```

The data is from week 50 of TidyTuesday, exploring the BBC's top 100 influential women of 2020.

The `head()` function presents the first six rows of a dataframe.

```
women <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-12-08/women.csv')
head(women)
```

The original dataset organized 100 women by category, country, role and description. I found that for employing the sunburst plot, I would want to group countries together by continents.

I manually added country names to continent vectors, then added a new column to the `women` dataframe to conditionally add the continent name.

We could then focus on six continents rather than 65 separate countries.

```
# add continent as character vector
asia <- c('Afghanistan', 'Bangladesh', 'China', 'Exiled Uighur from Ghulja (in Chinese, Yining)', 'Hong Kong', 'India', 'Indonesia', 'Iran', 'Iraq/UK', 'Japan', 'Kyrgyzstan', 'Lebanon', 'Malaysia', 'Myanmar', 'Nepal', 'Pakistan', 'Singapore', 'South Korea', 'Syria', 'Thailand', 'UAE', 'Vietnam', 'Yemen')
south_america <- c('Argentina', 'Brazil', 'Colombia', 'Ecuador', 'Peru', 'Venezuela')
oceania <- c('Australia')
europe <- c('Belarus', 'Finland', 'France', 'Germany', 'Italy', 'Netherlands', 'Northern Ireland', 'Norway', 'Republic of Ireland', 'Russia', 'Turkey', 'UK', 'Ukraine', 'Wales, UK')
africa <- c('Benin', 'DR Congo', 'Egypt', 'Ethiopia', 'Kenya', 'Morocco', 'Mozambique', 'Nigeria', 'Sierra Leone', 'Somalia', 'Somaliland', 'South Africa', 'Tanzania', 'Uganda', 'Zambia', 'Zimbabwe')
north_america <- c('El Salvador', 'Jamaica', 'Mexico', 'US')
# add new column for continent
women <- women %>%
mutate(continent = NA)
# add continents to women dataframe
women$continent <- ifelse(women$country %in% asia, 'Asia', women$continent)
women$continent <- ifelse(women$country %in% south_america, 'South America', women$continent)
women$continent <- ifelse(women$country %in% oceania, 'Oceania', women$continent)
women$continent <- ifelse(women$country %in% europe, 'Europe', women$continent)
women$continent <- ifelse(women$country %in% africa, 'Africa', women$continent)
women$continent <- ifelse(women$country %in% north_america, 'North America', women$continent)
women
```

The key to using the `sunburstR` package with this specific dataset is the wrangling that happens to filter by the continents we created above. We'll also want to get rid of dashes with `mutate_at`, as dashes are structurally needed to render the sunburst plots.

Below, I've filtered the `women` data frame into Africa and Asia (the same could be done for North and South America and Europe as well).

The **two most important** operations here are the creation of the `path` and `V2` columns, which will later be parameters for rendering the sunburst plots.

```
# Filter for Africa
africa_name <- women %>%
select(continent, category, role, name) %>%
# remove dash within dplyr pipe
mutate_at(vars(3, 4), funs(gsub("-", "", .))) %>%
filter(continent=='Africa') %>%
mutate(
path = paste(continent, category, role, name, sep = "-")
) %>%
slice(2:100) %>%
mutate(
V2 = 1
)
# Filter for Asia
asia_name <- women %>%
select(continent, category, role, name) %>%
# remove dash within dplyr pipe
mutate_at(vars(3, 4), funs(gsub("-", "", .))) %>%
filter(continent=='Asia') %>%
mutate(
path = paste(continent, category, role, name, sep = "-")
) %>%
slice(2:100) %>%
mutate(
V2 = 1
)
```

Ultimately, I found the information best presented by continent as the *base* of the sunburst plot, followed by category, specific roles and the names of each of the 100 women honored by the BBC.

Moreover, by presenting the data by continent, you can focus on just five specific colors as you decide on a palette.

I wouldn't recommend trying to pick a color for each role or name; it becomes too unwieldy. Just pick five colors for the two innermost rings of the sunburst plot and it'll shuffle the rest of the colors.

```
# Africa
sunburst(data = data.frame(xtabs(V2~path, africa_name)), legend = FALSE,
colors = c("#D99527", "#6F7239", "#CE4B3C", "#C8AC70", "#018A9D"))
```

```
# Asia
sunburst(data = data.frame(xtabs(V2~path, asia_name)), legend = FALSE,
colors = c("#e6e0ae", "#dfbc5e", "#ee6146", "#d73c37", "#b51f09"))
```

Here's what the plot would look like on **RMarkdown** as you hover over it:

And that's it for visualizing the BBC's top 100 influential women in 2020 with the `sunburstR` package.

For more content on data science, visualization, in R and Python, find me on Twitter.

This is a continuation of my progress through Data Science from Scratch by Joel Grus. We'll use a classic coin-flipping example in this post because it is simple to illustrate with both **concept** and **code**. The goal of this post is to connect the dots between several concepts including the Central Limit Theorem, hypothesis testing, p-Values and confidence intervals, using python to build our intuition.

Terms like "null" and "alternative" hypothesis are used quite frequently, so let's set some context. The "null" is the **default** position. The "alternative", alt for short, is something we're *comparing to* the default (null).

The classic coin-flipping exercise is to test the *fairness* of a coin. If a coin is fair, it'll land on heads 50% of the time (and tails 50% of the time). Let's translate into hypothesis testing language:

**Null Hypothesis**: Probability of landing on Heads = 0.5.

**Alt Hypothesis**: Probability of landing on Heads != 0.5.

Each coin flip is a **Bernoulli trial**, which is an experiment with two outcomes - outcome 1, "success" (probability *p*), and outcome 0, "fail" (probability *1 - p*). The reason it's a Bernoulli trial is that there are only two outcomes with a coin flip (heads or tails). Read more about Bernoulli here.

Here's the code for a single Bernoulli Trial:

```
def bernoulli_trial(p: float) -> int:
    """Returns 1 with probability p and 0 with probability 1-p"""
    return 1 if random.random() < p else 0
```

When you **sum independent Bernoulli trials**, you get a **Binomial(n, p)** random variable, a variable whose *possible* values have a probability distribution. The **central limit theorem** says that as **n**, the *number* of independent Bernoulli trials, gets large, the binomial distribution approaches a normal distribution.

Here's the code for when you sum all the Bernoulli Trials to get a Binomial random variable:

```
def binomial(n: int, p: float) -> int:
    """Returns the sum of n bernoulli(p) trials"""
    return sum(bernoulli_trial(p) for _ in range(n))
```
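Putting those two functions together, a reproducible run looks like this (seeding the random generator is my addition, for repeatability):

```python
import random

def bernoulli_trial(p: float) -> int:
    """Returns 1 with probability p and 0 with probability 1-p"""
    return 1 if random.random() < p else 0

def binomial(n: int, p: float) -> int:
    """Returns the sum of n bernoulli(p) trials"""
    return sum(bernoulli_trial(p) for _ in range(n))

random.seed(42)  # seeded only for reproducibility
flips = binomial(1000, 0.5)
print(flips)     # a single draw; lands near 500 for a fair coin
```

Each call to `binomial(1000, 0.5)` is one draw of the random variable; repeating it many times gives the distribution plotted below.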

**Note**: A single 'success' in a Bernoulli trial is 'x'. Summing up all those x's into X, is a Binomial random variable. Success doesn't imply desirability, nor does "failure" imply undesirability. They're just terms to count the cases we're looking for (i.e., number of heads in multiple coin flips to assess a coin's fairness).

Given that our **null** is (p = 0.5) and **alt** is (p != 0.5), we can run some independent bernoulli trials, then sum them up to get a binomial random variable.

Each `bernoulli_trial` is an experiment with either 0 or 1 as outcomes. The `binomial` function sums up **n** bernoulli(0.5) trials. We ran both twice and got different results. Each bernoulli experiment can be a success (1) or fail (0); summing up into a binomial random variable means we're taking the probability p (0.5) *that a coin flips heads* and running the experiment 1,000 times to get a binomial random variable.

The first 1,000 flips we got 510. The second 1,000 flips we got 495. We can repeat this process many times to get a *distribution*. We can plot this distribution to reinforce our understanding. To do this, we'll use the `binomial_histogram` function, which picks points from a Binomial(n, p) random variable and plots their histogram.

```
from collections import Counter
import math
import matplotlib.pyplot as plt

def normal_cdf(x: float, mu: float = 0, sigma: float = 1) -> float:
    return (1 + math.erf((x - mu) / math.sqrt(2) / sigma)) / 2

def binomial_histogram(p: float, n: int, num_points: int) -> None:
    """Picks points from a Binomial(n, p) and plots their histogram"""
    data = [binomial(n, p) for _ in range(num_points)]

    # use a bar chart to show the actual binomial samples
    histogram = Counter(data)
    plt.bar([x - 0.4 for x in histogram.keys()],
            [v / num_points for v in histogram.values()],
            0.8,
            color='0.75')

    mu = p * n
    sigma = math.sqrt(n * p * (1 - p))

    # use a line chart to show the normal approximation
    xs = range(min(data), max(data) + 1)
    ys = [normal_cdf(i + 0.5, mu, sigma) -
          normal_cdf(i - 0.5, mu, sigma) for i in xs]
    plt.plot(xs, ys)
    plt.title("Binomial Distribution vs. Normal Approximation")
    plt.show()

# call function
binomial_histogram(0.5, 1000, 10000)
```

This plot is then rendered:

What we did was sum up independent `bernoulli_trial`(s) of 1,000 coin flips, where the probability of heads is p = 0.5, to create a `binomial` random variable. We then repeated this a large number of times (N = 10,000) and plotted a histogram of the distribution of all the binomial random variables. And because we did it so many times, it approximates the standard normal distribution (a smooth bell-shaped curve).

Just to demonstrate how this works, we can generate several `binomial` random variables:

If we do this 10,000 times, we'll generate the above histogram. You'll notice that because we are testing whether the coin is fair, the probability of heads (success) *should* be 0.5 and, from 1,000 coin flips, the **mean** (`mu`) should be 500.

We have another function, `normal_approximation_to_binomial`, that can help us calculate this:

```
import random
from typing import Tuple
import math

def normal_approximation_to_binomial(n: int, p: float) -> Tuple[float, float]:
    """Returns mu and sigma corresponding to a Binomial(n, p)"""
    mu = p * n
    sigma = math.sqrt(p * (1 - p) * n)
    return mu, sigma

# call function
# (500.0, 15.811388300841896)
normal_approximation_to_binomial(1000, 0.5)
```

When calling the function with our parameters, we get a mean `mu` of 500 (from 1,000 coin flips) and a standard deviation `sigma` of 15.8114. This means that 68% of the time the binomial random variable will be 500 +/- 15.8114, and 95% of the time it'll be 500 +/- 31.6228 (see the 68-95-99.7 rule).
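Those intervals are a direct computation from `mu` and `sigma`; a quick check of the 68-95-99.7 arithmetic:

```python
import math

n, p = 1000, 0.5
mu = n * p
sigma = math.sqrt(n * p * (1 - p))  # binomial standard deviation

# ~68% of draws fall within one sigma, ~95% within two
one_sigma = (mu - sigma, mu + sigma)
two_sigma = (mu - 2 * sigma, mu + 2 * sigma)
print(one_sigma)  # roughly (484.19, 515.81)
print(two_sigma)  # roughly (468.38, 531.62)
```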

Now that we have seen the results of our "coin fairness" experiment plotted on a binomial distribution (approximately normal), we will be interested, for the purpose of testing our hypothesis, in the probability that its realized value (the binomial random variable) lies **within or outside a particular interval**.

This means we'll be interested in questions like:

- What's the probability that the binomial(n,p) is below a threshold?
- Above a threshold?
- Between an interval?
- Outside an interval?

First, the `normal_cdf` (normal cumulative distribution function), which we learned about in a previous post, *is* the probability of a variable being *below* a certain threshold.

Here, the probability of X (success, or heads for a 'fair coin') is 0.5 (`mu` = 500, `sigma` = 15.8113), and we want to find the probability that X falls below 490, which comes out to roughly 26%:

```
def normal_cdf(x: float, mu: float = 0, sigma: float = 1) -> float:
    return (1 + math.erf((x - mu) / math.sqrt(2) / sigma)) / 2

normal_probability_below = normal_cdf

# probability that a binomial random variable, mu = 500, sigma = 15.8113, is below 490
# 0.26354347477247553
normal_probability_below(490, 500, 15.8113)
```

On the other hand, `normal_probability_above`, the probability that X falls *above* 490, would be 1 - 0.2635 = 0.7365, or roughly 74%.

```
def normal_probability_above(lo: float,
                             mu: float = 0,
                             sigma: float = 1) -> float:
    """The probability that an N(mu, sigma) is greater than lo."""
    return 1 - normal_cdf(lo, mu, sigma)

# 0.7364565252275245
normal_probability_above(490, 500, 15.8113)
```

To make sense of this, we need to recall the binomial distribution that approximates the normal distribution, but we'll draw a vertical line at 490.

We're asking: given the binomial distribution with `mu` of 500 and `sigma` of 15.8113, what is the probability that a binomial random variable falls below the threshold (left of the line)? The answer is approximately 26%; correspondingly, the probability of falling above the threshold (right of the line) is approximately 74%.

We may also wonder what the probability is of a binomial random variable **falling between 490 and 520**:

Here is the function to calculate this probability, and it comes out to approximately 63%. *Note*: bear in mind that the full area under the curve is 1.0, or 100%.

```
def normal_probability_between(lo: float,
                               hi: float,
                               mu: float = 0,
                               sigma: float = 1) -> float:
    """The probability that an N(mu, sigma) is between lo and hi."""
    return normal_cdf(hi, mu, sigma) - normal_cdf(lo, mu, sigma)

# 0.6335061861416337
normal_probability_between(490, 520, 500, 15.8113)
```

Finally, the area outside of the interval should be 1 - 0.6335 = 0.3665:

```
def normal_probability_outside(lo: float,
                               hi: float,
                               mu: float = 0,
                               sigma: float = 1) -> float:
    """The probability that an N(mu, sigma) is not between lo and hi."""
    return 1 - normal_probability_between(lo, hi, mu, sigma)

# 0.3664938138583663
normal_probability_outside(490, 520, 500, 15.8113)
```

In addition to the above, we may also be interested in finding (symmetric) intervals around the mean that account for a *certain level of likelihood*, for example, 60% probability centered around the mean.

For this operation we would use the `inverse_normal_cdf`:

```
def inverse_normal_cdf(p: float,
                       mu: float = 0,
                       sigma: float = 1,
                       tolerance: float = 0.00001) -> float:
    """Find approximate inverse using binary search"""
    # if not standard, compute standard and rescale
    if mu != 0 or sigma != 1:
        return mu + sigma * inverse_normal_cdf(p, tolerance=tolerance)
    low_z = -10.0                  # normal_cdf(-10) is (very close to) 0
    hi_z = 10.0                    # normal_cdf(10) is (very close to) 1
    while hi_z - low_z > tolerance:
        mid_z = (low_z + hi_z) / 2  # Consider the midpoint
        mid_p = normal_cdf(mid_z)   # and the CDF's value there
        if mid_p < p:
            low_z = mid_z           # Midpoint too low, search above it
        else:
            hi_z = mid_z            # Midpoint too high, search below it
    return mid_z
```

First we'd have to find the cutoffs where the upper and lower tails each contain 20% of the probability. We calculate `normal_upper_bound` and `normal_lower_bound` and use those to calculate the `normal_two_sided_bounds`.

```
from typing import Tuple

def normal_upper_bound(probability: float,
                       mu: float = 0,
                       sigma: float = 1) -> float:
    """Returns the z for which P(Z <= z) = probability"""
    return inverse_normal_cdf(probability, mu, sigma)

def normal_lower_bound(probability: float,
                       mu: float = 0,
                       sigma: float = 1) -> float:
    """Returns the z for which P(Z >= z) = probability"""
    return inverse_normal_cdf(1 - probability, mu, sigma)

def normal_two_sided_bounds(probability: float,
                            mu: float = 0,
                            sigma: float = 1) -> Tuple[float, float]:
    """
    Returns the symmetric (about the mean) bounds
    that contain the specified probability
    """
    tail_probability = (1 - probability) / 2
    # upper bound should have tail_probability above it
    upper_bound = normal_lower_bound(tail_probability, mu, sigma)
    # lower bound should have tail_probability below it
    lower_bound = normal_upper_bound(tail_probability, mu, sigma)
    return lower_bound, upper_bound
```

So if we wanted to know the cutoff points for a 60% probability around the mean and standard deviation (`mu` = 500, `sigma` = 15.8113), it would be between **486.69 and 513.31**.

Said differently, this means that roughly 60% of the time we can expect the binomial random variable to fall between 486.69 and 513.31.

```
# (486.6927811021805, 513.3072188978196)
normal_two_sided_bounds(0.60, 500, 15.8113)
```

Now that we have a handle on the binomial distribution (and its normal approximation), thresholds (left and right of the mean), and cut-off points, we want to make a **decision about significance**. Perhaps the most important point about *statistical significance* is that it is a decision to be made, not a standard that is externally set.

Significance is a decision about how willing we are to make a *type 1* error (false positive), which we explored in a previous post. The convention is to set it to a 5% or 1% willingness to make a type 1 error. Suppose we say 5%.

We would say that out of 1,000 coin flips, 95% of the time, we'd get between 469 and 531 heads on a "fair coin" and 5% of the time, outside of this 469-531 range.

```
# (469.0104394712448, 530.9895605287552)
normal_two_sided_bounds(0.95, 500, 15.8113)
```

If we recall our hypotheses:

**Null Hypothesis**: Probability of landing on Heads = 0.5 (fair coin)

**Alt Hypothesis**: Probability of landing on Heads != 0.5 (biased coin)

For each *test*, consisting of 1,000 Bernoulli trials, where the number of heads falls outside the range of 469-531, we'll **reject the null** that the coin is fair. And we'll be wrong (false positive) 5% of the time. It's a false positive when we **incorrectly reject** the null hypothesis when it's actually true.
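Stated as code, the decision rule is simple. This is a sketch; the `reject_fair_coin` name and the hard-coded 95% bounds are ours, not the book's:

```python
def reject_fair_coin(num_heads: int,
                     lo: float = 469.01,
                     hi: float = 530.99) -> bool:
    """Reject the null hypothesis (fair coin) when the head count
    from 1,000 flips falls outside the 95% bounds computed above."""
    return num_heads < lo or num_heads > hi

print(reject_fair_coin(500))  # within bounds: do not reject
print(reject_fair_coin(545))  # outside bounds: reject
```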

We also want to avoid making a type-2 error (false negative), where we **fail to reject** the null hypothesis, when it's actually false.

**Note**: It's important to keep in mind that terms like *significance* and *power* describe **tests**; in our case, the test of whether a coin is fair or not. Each test is the sum of 1,000 independent Bernoulli trials.

For a "test" that has a 95% significance, we'll assume that out of a 1,000 coin flips, it'll land on heads between 469-531 times and we'll determine the coin is fair. For the 5% of the time it lands outside of this range, we'll determine the coin to be "unfair", but we'll be wrong because it actually is fair.

To calculate the power of the test, we'll take the assumed `mu` and `sigma` with 95% bounds (based on the assumption that the probability of the coin landing on heads is 0.5, a fair coin). We'll determine the lower and upper bounds:

```
# assuming mu_0, sigma_0 = normal_approximation_to_binomial(1000, 0.5)
lo, hi = normal_two_sided_bounds(0.95, mu_0, sigma_0)
lo  # 469.01026640487555
hi  # 530.9897335951244
```

And if the coin were *actually biased*, we should reject the null, but sometimes we will fail to. Let's suppose the actual probability that the coin lands on heads is 55% (**biased** towards heads):

```
mu_1, sigma_1 = normal_approximation_to_binomial(1000, 0.55)
mu_1 # 550.0
sigma_1 # 15.732132722552274
```

Using the same range, 469-531, where the coin is assumed 'fair' with `mu` at 500 and `sigma` at 15.8113:

If the coin, in fact, had a bias towards heads (p = 0.55), the distribution would shift right, but if our 95% significance test remains the same, we get:

The probability of making a type-2 error is 11.345%. This is the probability that we see the coin's distribution fall within the previous interval, 469-531, and conclude that we should accept the null hypothesis (that the coin is fair), failing to see that the distribution has shifted because the coin actually has a *bias* towards heads.

```
# 0.11345199870463285
type_2_probability = normal_probability_between(lo, hi, mu_1, sigma_1)
```

The other way to arrive at this is to find the probability, under the *new* `mu` and `sigma` (new distribution), that X (number of successes) will fall *below* 531.

```
# 0.11357762975476304
normal_probability_below(531, mu_1, sigma_1)
```

So the probability of making a type-2 error or the probability that the *new* distribution falls below 531 is approximately 11.3%.

The **power** of the test (the probability of avoiding a type-2 error) is 1.00 minus the probability of a type-2 error (1 - 0.113 = 0.887), or 88.7%.

```
power = 1 - type_2_probability # 0.8865480012953671
```

Finally, we may be interested in **increasing the power** of the test. Instead of using the `normal_two_sided_bounds` function to find the cut-off points (i.e., 469 and 531), we could use a *one-sided test* that rejects the null hypothesis ('fair coin') when X (number of heads) is much larger than 500.

Here's the code, using `normal_upper_bound`:

```
# 526.0073585242053
hi = normal_upper_bound(0.95, mu_0, sigma_0)
```

This means shifting the upper bound from 531 to 526, putting more probability in the rejection region of the upper tail. The probability of a type-2 error goes down from 11.3% to 6.3%.

```
# previous probability of type-2 error
# 0.11357762975476304
normal_probability_below(531, mu_1, sigma_1)
# new probability of type-2 error
# 0.06356221447122662
normal_probability_below(526, mu_1, sigma_1)
```

And the new (stronger) **power** of the test is 1.0 - 0.064 = 0.936, or 93.6% (up from 88.7% above).

p-values represent *another way* of deciding whether to accept or reject the null hypothesis. Instead of choosing bounds, thresholds, or cut-off points, we compute the probability, assuming the null hypothesis is true, that we would see a value *at least as extreme as* the one we actually observed.

Here is the code:

```
def two_sided_p_values(x: float, mu: float = 0, sigma: float = 1) -> float:
    """
    How likely are we to see a value at least as extreme as x (in either
    direction) if our values are from an N(mu, sigma)?
    """
    if x >= mu:
        # x is greater than the mean, so the tail is everything greater than x
        return 2 * normal_probability_above(x, mu, sigma)
    else:
        # x is less than the mean, so the tail is everything less than x
        return 2 * normal_probability_below(x, mu, sigma)
```

If we wanted to compute, assuming we have a "fair coin" (`mu` = 500, `sigma` = 15.8113), the probability of seeing a value like 530 (**note**: we use 529.5 instead of 530 below due to the continuity correction):

Answer: approximately 6.2%

```
# 0.06207721579598835
two_sided_p_values(529.5, mu_0, sigma_0)
```

The p-value, 6.2%, is higher than our (hypothetical) 5% significance level, so we don't reject the null. On the other hand, if X were slightly more extreme, 532, the probability of seeing that value would be approximately 4.3%, which is less than the 5% significance level, so we would reject the null.

```
# 0.04298479507085862
two_sided_p_values(532, mu_0, sigma_0)
```

For one-sided tests, we would use the `normal_probability_above` and `normal_probability_below` functions created above:

```
upper_p_value = normal_probability_above
lower_p_value = normal_probability_below
```

Under the `two_sided_p_values` test, the extreme value of 529.5 had a 6.2% probability of showing up, which was not low enough to reject the null hypothesis.

However, with a one-sided test, the `upper_p_value` for the same threshold is now 3.1%, and we would reject the null hypothesis.

```
# 0.031038607897994175
upper_p_value(529.5, mu_0, sigma_0)
```

A *third* approach to deciding whether to accept or reject the null is to use confidence intervals. We'll use 530 as we did in the p-value example.

```
import math

p_hat = 530 / 1000
mu = p_hat
sigma = math.sqrt(p_hat * (1 - p_hat) / 1000)  # 0.015782902141241326

# (0.4990660982192851, 0.560933901780715)
normal_two_sided_bounds(0.95, mu, sigma)
```

The confidence interval for a coin flipping heads 530 (out of 1,000) times is (0.4991, 0.5609). Since this interval **contains** p = 0.5 (heads 50% of the time, assuming a fair coin), we do not reject the null.

If the extreme value were *more* extreme at 540, we would arrive at a different conclusion:

```
p_hat = 540 / 1000
mu = p_hat
sigma = math.sqrt(p_hat * (1 - p_hat) / 1000)

# (0.5091095927295919, 0.5708904072704082)
normal_two_sided_bounds(0.95, mu, sigma)
```

Here we would be 95% confident that the mean of this distribution is contained between 0.5091 and 0.5709, and this **does not** contain 0.500 (albeit by a slim margin), so we reject the null hypothesis that this is a fair coin.

**note**: Confidence intervals are about the *interval*, not the probability p. We interpret the confidence interval as follows: if you were to repeat the experiment many times, 95% of the time the "true" parameter, in our example p = 0.5, would lie within the observed confidence interval.
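That interpretation can be checked by simulation: repeat the 1,000-flip experiment many times and count how often the computed interval captures the true p = 0.5. This is a sketch; the function name, seed, and z = 1.96 approximation for 95% bounds are our choices:

```python
import math
import random

def interval_coverage(trials: int = 1000, n: int = 1000,
                      p: float = 0.5, z: float = 1.96) -> float:
    """Fraction of simulated experiments whose 95% interval contains the true p."""
    random.seed(0)
    covered = 0
    for _ in range(trials):
        heads = sum(1 for _ in range(n) if random.random() < p)
        p_hat = heads / n
        se = math.sqrt(p_hat * (1 - p_hat) / n)
        if p_hat - z * se <= p <= p_hat + z * se:
            covered += 1
    return covered / trials

print(interval_coverage())  # should land close to 0.95
```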

We used several Python functions to build intuition around statistical hypothesis testing. To highlight the "from scratch" aspect of the book, here is a diagram tying together the various Python functions used in this post:

This post is part of an ongoing series where I document my progress through Data Science from Scratch by Joel Grus.

For more content on data science, machine learning, R, Python, SQL and more, find me on Twitter.

`Itertools` are a core set of fast, memory-efficient tools for creating iterators for efficient looping (read the documentation here).

One (of many) uses for `itertools` is to create a `permutations()` function that will return all possible orderings of the items in a list.

I was working on a project that involved user funnels with different stages and we were wondering how many different "paths" a user *could* take, so this was naturally a good fit for using **permutations**.

*Sample Funnel*

In our hypothetical example, we're looking at a funnel with three stages for a total of 6 permutations. Here is the formula:
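The formula referenced here is n!/(n-r)!; with n = r = 3 that's 3! = 6. A quick check (the helper name is ours):

```python
import math

def n_permutations(n: int, r: int) -> int:
    """Number of ways to order r items drawn from n: n! / (n - r)!"""
    return math.factorial(n) // math.factorial(n - r)

print(n_permutations(3, 3))  # 6 paths through a three-stage funnel
print(n_permutations(4, 4))  # 24 paths if a fourth stage is added
```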

If you're using a sales/marketing funnel, you'll have in mind what your funnel would look like so you may **not** want all possible paths, but if you're interested in exploring potentially *overlooked* paths, read on.

Here's the Python documentation for `itertools`, and `permutations` specifically. We'll break down the code to better understand what's going on in this function.

**note:** I found a clearer alternative after the fact. Feel free to skip to the final section below, although there is value in comparing the two versions.

We'll start off with the `iterable`, which is a `list` with three strings. The `permutations` function takes two parameters: the `iterable` and `r`, the number of items from the list that we're interested in permuting. If we have three items in the list, we generally want to find *all possible* orderings of those three items.

Here is the code, and subsequent breakdown:

```
# list of length 3
list1 = ['stage 1', 'stage 2', 'stage 3']

# iterable is the list
# r = number of items from the list to find permutations of
def permutations(iterable, r=None):
    """Find all possible orderings of a list of elements"""
    # permutations('ABCD', 2) --> AB AC AD BA BC BD CA CB CD DA DB DC
    # permutations(range(3)) --> 012 021 102 120 201 210
    # permutations(list1, 3) --> 6 permutations
    pool = tuple(iterable)
    n = len(pool)
    r = n if r is None else r
    if r > n:
        return
    indices = list(range(n))          # [0, 1, 2]
    cycles = list(range(n, n-r, -1))  # [3, 2, 1]
    yield tuple(pool[i] for i in indices[:r])
    while n:
        for i in reversed(range(r)):
            cycles[i] -= 1
            if cycles[i] == 0:
                indices[i:] = indices[i+1:] + indices[i:i+1]
                cycles[i] = n - i
            else:
                j = cycles[i]
                indices[i], indices[-j] = indices[-j], indices[i]
                yield tuple(pool[i] for i in indices[:r])
                break
        else:
            return

perm = permutations(list1, 3)
count = 0
for p in perm:
    count += 1
    print(p)
print("there are:", count, "permutations.")
```

The first thing we do is take the `iterable` input parameter and turn it from a `list` into a `tuple`.

```
pool = tuple(iterable)
```

There are several reasons to do this. First, `tuples` are *faster* than `lists`; the `permutations()` function performs several operations on the input, so changing it to a `tuple` allows faster operations. And because `tuples` are *immutable*, we can do a number of different operations without fear that we might *inadvertently* change the list.

We then create `n` from the length of `pool` (in our case 3), and the additional `r` parameter, which defaults to `None`, is also 3, as we're interested in seeing **all orderings** of a list of three elements.

We also have a guard ensuring that `r` can never be greater than the number of elements in the `iterable` (list).

```
if r > n:
    return
```

Next, we create `indices` and `cycles`. Indices are simply the index of each item, 0 through 2 for three items. Cycles uses `range(n, n-r, -1)`, which in our case is `range(3, 3-3, -1)`; this means **start** at three and **stop** before zero, in steps of -1, yielding [3, 2, 1].
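To see the starting values concretely:

```python
n, r = 3, 3
indices = list(range(n))            # [0, 1, 2]
cycles = list(range(n, n - r, -1))  # [3, 2, 1]
print(indices, cycles)
```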

The next chunk of code is a `while` loop that continues for the length of the list, `n` (note the `break` at the bottom to exit the loop).

After each `if-else` cycle, a new set of `indices` is created, which then gets looped through with `pool`, the iterable input parameter, changing the order of the elements in the list.

You'll note in the commented code above that `cycles` starts off at [3, 2, 1] and `indices` starts off at [0, 1, 2]. Each pass through the loop changes the `indices`, rotating the slice `indices[i:]`, while `cycles` trends toward [1, 1, 1], at which point the code breaks out of the loop.

```
while n:
    for i in reversed(range(r)):
        cycles[i] -= 1
        if cycles[i] == 0:
            indices[i:] = indices[i+1:] + indices[i:i+1]
            cycles[i] = n - i
        else:
            j = cycles[i]
            indices[i], indices[-j] = indices[-j], indices[i]
            yield tuple(pool[i] for i in indices[:r])
            break
    else:
        return
```

The `permutations(iterable, r)` function actually creates a `generator`, so we need to loop through it to print out all the permutations of the list.

```
<generator object permutations at 0x7fe19400fdd0>
```

We add another for-loop at the bottom to print out all the permutations:

```
perm = permutations(list1, 3)
count = 0
for p in perm:
    count += 1
    print(p)
print("there are:", count, "permutations.")
```

Here is our result:
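The same result can be confirmed with the standard library's `itertools.permutations`, which the hand-rolled version above mirrors:

```python
from itertools import permutations

list1 = ['stage 1', 'stage 2', 'stage 3']
paths = list(permutations(list1, 3))
for p in paths:
    print(p)
print("there are:", len(paths), "permutations.")
```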

As is often the case, there is a better way, which I found in retrospect in this Stack Overflow answer (h/t Eric O. Lebigot):

```
def all_perms(elements):
    if len(elements) <= 1:
        yield elements  # Only permutation possible = no permutation
    else:
        # Iteration over the first element in the result permutation:
        for (index, first_elmt) in enumerate(elements):
            other_elmts = elements[:index] + elements[index+1:]
            for permutation in all_perms(other_elmts):
                yield [first_elmt] + permutation
```

The `enumerate` built-in obviates the need to separately create `cycles` and `indices`. The local variable `other_elmts` separates the remaining elements of the list from `first_elmt`; the inner for-loop then recursively finds the permutations of those other elements before prepending `first_elmt` on the final line, yielding all possible permutations of the list. As with the previous case, the result of this function is a `generator`, which requires looping through to print the permutations.

I found this much easier to digest than the documentation version.

Permutations can be useful when you have varied user journeys through your product and you want to figure out all the possible paths. With this short python script, you can easily print out all options for consideration.

From the perspective of a user funnel, **permutations** allow us to explore all possible *paths* a user might take. For our hypothetical example, a three-step funnel yields six possible paths a user could navigate from start to finish.

Knowing permutations should also **give us pause** when deciding whether to add another "step" to a funnel. Going from a three-step funnel to a four-step funnel increases the number of possible paths from six to 24, a fourfold increase.

Not only does this increase **friction** between your user and the 'end goal' (conversion), whatever that may be for your product, but it also increases complexity (and potentially confusion) in the user experience.

For more content on data science, machine learning, R, Python, SQL and more, find me on Twitter.

There are several posts that could serve as context (as needed) for the concepts discussed in this post, including these posts on:

In this post, we'll cover probability distributions. This is a broad topic so we'll sample a few concepts to get a feel for it. Borrowing from the previous post, we'll chart our medical diagnostic outcomes.

You'll recall that each outcome is the combination of whether someone has a disease, `P(D)`, or not, `P(not D)`. Then they're given a diagnostic test that returns positive, `P(P)`, or negative, `P(not P)`.

These are discrete outcomes, so they can be represented with a **probability mass function**, as opposed to a **probability density function**, which represents a continuous distribution.

Let's take another *hypothetical* scenario: a city where 1 in 10 people have a disease, and a diagnostic test with a true positive rate of 95% and a true negative rate of 90%. The probability that a test-positive person *actually* has the disease comes out to 46.50%.
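As a cross-check, the exact Bayes calculation for these rates gives roughly 51.4%; the simulated 46.50% below differs because a population of 1,000 is small enough for sampling noise to matter. The variable names here are ours:

```python
p_d = 0.10            # prior: 1 in 10 have the disease
tpr, tnr = 0.95, 0.90  # true positive / true negative rates

# P(P) = P(P|D) P(D) + P(P|not D) P(not D) = 0.095 + 0.090 = 0.185
p_pos = tpr * p_d + (1 - tnr) * (1 - p_d)
p_d_given_pos = tpr * p_d / p_pos
print(p_d_given_pos)  # ≈ 0.5135
```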

Here's the code:

```
from random import random, seed
seed(0)

pop = 1000  # 1000 people
counts = {}

for i in range(pop):
    has_disease = i % 10 == 0  # one in 10 people have disease
    # assuming that every person gets tested regardless of any symptoms
    if has_disease:
        tests_positive = True       # True Positive 95%
        if random() < 0.05:
            tests_positive = False  # False Negative 5%
    else:
        tests_positive = False      # True Negative 90%
        if random() < 0.1:
            tests_positive = True   # False Positive 10%
    outcome = (has_disease, tests_positive)
    counts[outcome] = counts.get(outcome, 0) + 1

for (has_disease, tested_positive), n in counts.items():
    print('Has Disease: %6s, Test Positive: %6s, count: %d' %
          (has_disease, tested_positive, n))

n_positive = counts[(True, True)] + counts[(False, True)]
print('Number of people who tested positive:', n_positive)
print('Probability that a test-positive person actually has disease: %.2f' %
      (100.0 * counts[(True, True)] / n_positive),)
```

Given the probability that someone has the disease (1 in 10), also called the 'prior' in Bayesian terms, we modeled four scenarios where people were given a diagnostic test. Again, the big assumption here is that people get tested at random. With the true positive and true negative rates stated above, here are the outcomes:

Given these discrete events, we can chart a **probability mass function**, also known as a discrete density function. We'll import `pandas` to help us create `DataFrames` and `matplotlib` to chart the **probability mass function**.

We first need to turn the counts of events into a `DataFrame` and rename the column to `item_counts`. Then we'll calculate the probability of each event by dividing its count by the total number of people in our hypothetical city (i.e., population: 1000).

**Optional**: create another column with abbreviations for each test outcome (i.e., "True True" becomes "TT"). We'll call this column `item2`.

```
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame.from_dict(counts, orient='index')
df = df.rename(columns={0: 'item_counts'})
df['probability'] = df['item_counts']/1000
df['item2'] = ['TT', 'FF', 'FT', 'TF']
```

Here is the `DataFrame` we have so far:

You'll note that the numbers in the `probability` column add up to 1.0 and that the `item_counts` numbers are the same as the counts above, when we calculated the probability of a test-positive person actually having the disease.

We'll use a simple bar chart to plot the diagnostic probabilities; this is how we'd visually represent the probability mass function: the probability of each discrete event, where each 'discrete event' is a conditional (e.g., the probability of a positive test given that someone *has* the disease, TT, or a negative test given that they *don't have* the disease, FF, and so on).

Here's the code:

```
df = pd.DataFrame.from_dict(counts, orient='index')
df = df.rename(columns={0: 'item_counts'})
df['probability'] = df['item_counts']/1000
df['item2'] = ['TT', 'FF', 'FT', 'TF']
plt.bar(df['item2'], df['probability'])
plt.title("Probability Mass Function")
plt.show()
```

While the probability mass function tells us the probability of each discrete event (i.e., TT, FF, FT, and TF), we can also represent the same information as a **cumulative distribution function**, which allows us to see how the probability changes as we add events together.

The cumulative distribution function simply adds the probability from the previous row of the `DataFrame` in a cumulative fashion, as in the column `probability2`:

We use the `cumsum()` function to create the `cumsum` column, which simply adds the `item_counts` with each successive row. When we create the corresponding probability column, `probability2`, it grows until we reach 1.0.
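A sketch of the cumulative columns described above; the counts here are illustrative stand-ins for the simulation output (which varies with the seed), chosen to sum to the population of 1,000:

```python
import pandas as pd

# illustrative counts for (has_disease, tests_positive) outcomes
counts = {('T', 'T'): 95, ('F', 'F'): 800, ('F', 'T'): 100, ('T', 'F'): 5}
df = pd.DataFrame.from_dict(counts, orient='index')
df = df.rename(columns={0: 'item_counts'})

# running total of counts, and the cumulative probability
df['cumsum'] = df['item_counts'].cumsum()
df['probability2'] = df['cumsum'] / 1000
print(df)
```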

Here's the chart:

This chart tells us that the probability of getting both TT and FF (True, True = True Positive, and False, False = True Negative) is 88.6% which indicates that 11.4% (100 - 88.6) of the time, the diagnostic test will let us down.

More often than not, you'll be interested in *continuous* distributions, where you can better see how the **cumulative distribution function** works.

You're probably familiar with the bell-shaped curve, or the *normal distribution*, defined solely by its mean (`mu`) and standard deviation (`sigma`). The **standard normal distribution** has a mean of 0 and a standard deviation of 1.

Code:

```
import math
import matplotlib.pyplot as plt

SQRT_TWO_PI = math.sqrt(2 * math.pi)

def normal_pdf(x: float, mu: float = 0, sigma: float = 1) -> float:
    return (math.exp(-(x-mu) ** 2 / 2 / sigma ** 2) / (SQRT_TWO_PI * sigma))

# plot
xs = [x / 10.0 for x in range(-50, 50)]
plt.plot(xs, [normal_pdf(x, sigma=1) for x in xs], '-', label='mu=0, sigma=1')
plt.show()
```

With the **standard normal distribution** curve, you see that the density peaks at around 0.4 at the mean. But if you add up the area under the curve (i.e., across all possible outcomes), you get 1.0, just as with the medical diagnostic example.
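We can sanity-check that claim numerically with a crude Riemann sum (the step size and integration range are our choices):

```python
import math

SQRT_TWO_PI = math.sqrt(2 * math.pi)

def normal_pdf(x: float, mu: float = 0, sigma: float = 1) -> float:
    return math.exp(-(x - mu) ** 2 / 2 / sigma ** 2) / (SQRT_TWO_PI * sigma)

# approximate the total area under the standard normal curve over [-10, 10]
step = 0.01
area = sum(normal_pdf(x * step) * step for x in range(-1000, 1000))
print(area)  # ≈ 1.0
```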

And if you split the bell in half, then flip over the left half, you'll (visually) get the **cumulative distribution function**:

Code:

```
import math

def normal_cdf(x: float, mu: float = 0, sigma: float = 1) -> float:
    return (1 + math.erf((x - mu) / math.sqrt(2) / sigma)) / 2

# plot
xs = [x / 10.0 for x in range(-50, 50)]
plt.plot(xs, [normal_cdf(x, sigma=1) for x in xs], '-', label='mu=0,sigma=1')
```

In both cases, the probabilities of all events sum to one: the area under the **standard normal distribution** curve is 1.0, and the **cumulative distribution function** correspondingly approaches 1.0.

This post is part of my ongoing progress through Data Science from Scratch by Joel Grus:

For more content on data science, machine learning, R, Python, SQL and more, find me on Twitter.

note: This article presents a hypothetical situation and is not intended as medical advice.

Now that we have a basic understanding of Bayes' Theorem (please refer to these posts on conditional probability and Bayes' Theorem for context), let's extend the application to a slightly more complex example. This section was inspired by this tweet from Grant Sanderson (of 3Blue1Brown fame):

This is a classic application of Bayes' Theorem: the **medical diagnostic scenario**. The above tweet can be re-stated:

What is the probability of you actually having the disease, given that you tested positive?

This happens to be even more relevant as we're living through a generational pandemic.

Let's start off with a conceptual understanding, using the tools we learned previously. First, we have to keep in mind **testing** and **actually having the disease** are **not independent** events. Therefore, we will use **conditional probability** to express their joint outcomes.

The intuitive visual to illustrate this is the **tree diagram**:

The initial information provided is as follows:

- P(D): Probability of having the disease (covid-19)
- P(P): Probability of testing positive
- P(D|P): Our objective is to find the probability of having the disease, given a positive test
- 1 in 1,000 actively have covid-19, P(D); this implies 999 in 1,000 do **not** actively have covid-19, P(not D)
- 1% or 0.01 false positive (given)
- 10% or 0.1 false negative (given)

The **false positive** is when you *don't* have the disease, but your test (in error) shows up positive. **False negative** is when you *have* the disease, but your test (in error) shows up negative. We are provided this information and have to calculate other values to fill in the tree.

We know that all possible events have to add up to 1, so if 1 in 1,000 actively have the disease, we know that 999 in 1,000 do not have it. If the false negative is 10%, then the **true positive** is 90%. If the false positive is 1%, then the **true negative** is 99%. From our calculations, the tree can be updated:

Now that we've filled out the tree, we can use **Bayes' Theorem** to find `P(D|P)`. Here is Bayes' Theorem as discussed in the previous section. The denominator is the probability of testing positive, `P(P)`; the *second* version of Bayes' Theorem is used in cases where we *do not know* the probability of testing positive (as in the present case):

Then we can plug the denominator in to get the alternative version of Bayes' Theorem:

Here's how the numbers add up:

- P(D|P) = (P(P|D) x P(D)) / (P(P|D) x P(D) + P(P|not D) x P(not D))
- P(D|P) = (0.9 x 0.001) / ((0.9 x 0.001) + (0.01 x 0.999))
- P(D|P) = 0.0009 / (0.0009 + 0.00999)
- P(D|P) = 0.0009 / 0.01089
- P(D|P) ~ 0.08264 or 8.26%
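The arithmetic above can be verified in a couple of lines (the variable names are ours):

```python
p_d = 0.001               # prior: 1 in 1,000
p_pos_given_d = 0.9       # true positive rate (1 - false negative)
p_pos_given_not_d = 0.01  # false positive rate

numerator = p_pos_given_d * p_d                          # 0.0009
denominator = numerator + p_pos_given_not_d * (1 - p_d)  # 0.01089
p_d_given_pos = numerator / denominator
print(p_d_given_pos)  # ≈ 0.08264
```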

Interestingly, Andrej Karpathy actually responded in the thread and provided an intuitive way to arrive at the same result using Python.

Here's his code (with added comments):

```
from random import random, seed
seed(0)

pop = 10000000  # 10M people
counts = {}

for i in range(pop):
    has_covid = i % 1000 == 0  # one in 1,000 people have covid (prior, or prevalence of disease)
    # The major assumption is that every person gets tested regardless of any symptoms
    if has_covid:                   # Has disease
        tests_positive = True       # True positive
        if random() < 0.1:
            tests_positive = False  # False negative
    else:                           # Does not have disease
        tests_positive = False      # True negative
        if random() < 0.01:
            tests_positive = True   # False positive
    outcome = (has_covid, tests_positive)
    counts[outcome] = counts.get(outcome, 0) + 1

for (has_covid, tested_positive), n in counts.items():
    print('has covid: %6s, tests positive: %6s, count: %d' % (has_covid, tested_positive, n))

n_positive = counts[(True, True)] + counts[(False, True)]
print('number of people who tested positive:', n_positive)
print('probability that a test-positive person actually has covid: %.2f' % (100.0 * counts[(True, True)] / n_positive), )
```

We first build a hypothetical population of 10 million. If the **prior** of disease is 1 in 1,000, a population of 10 million should contain 10,000 people with covid. You can see how this works with this short snippet:

```
pop = 10000000
counts = 0
for i in range(pop):
    has_covid = i % 1000 == 0
    if has_covid:
        counts = counts + 1
print(counts, "people have the disease in a population of 10 million")
```

Nested in the `for` loop are `if` statements that segment the population (10M) into one of four categories: True Positive, False Negative, True Negative, or False Positive. Each category is counted and stored in a `dict` called `counts`. Then another `for` loop is used to iterate through this dictionary and print out all the categories:

```
has covid: True, tests positive: True, count: 9033
has covid: False, tests positive: False, count: 9890133
has covid: False, tests positive: True, count: 99867
has covid: True, tests positive: False, count: 967
number of people who tested positive: 108900
probability that a test-positive person actually has covid: 8.29
```

Finally, we want the number of people who *have* the disease *and* tested positive (True Positive, 9033) divided by the number of people who tested positive regardless of whether they actually have the disease (True Positive (9033) + False Positive (99867) = 108,900), and this comes out to approximately 8.29%.

Although the code was billed as "simple code to build intuition", I found that Bayes' Theorem *is* the intuition.

The key to Bayes' Theorem is that it encourages us to update our beliefs when presented with new evidence. But what if there's evidence we missed in the first place?

If you look back at the original tweet, there are important details about symptoms that, if we wanted to be more realistic, should be accounted for.

You feel fatigued and have a slight sore throat.

Here, instead of assuming that the prevalence of the disease (1 in 1,000 people have covid-19) is the prior, we might ask: what is the probability that someone who is symptomatic has the disease?

Let's suppose we change from 1 in 1,000 to 1 in 100. We could change just one line of code (while everything else remains the same):

```
for i in range(pop):
    has_covid = i % 100 == 0 # updated info: 1/1000 have covid, but 1/100 with symptoms have covid
```

The probability that someone with a positive test actually has the disease jumps from 8.29% to 47.61%:

```
has covid: True, tests positive: True, count: 180224
has covid: False, tests positive: False, count: 19601715
has covid: False, tests positive: True, count: 198285
has covid: True, tests positive: False, count: 19776
number of people who tested positive: 378509
probability that a test-positive person with symptoms actually has covid: 47.61
```

Thus, being symptomatic means our **priors** should be adjusted, and our **beliefs** about the likelihood that a positive test means we have the disease (`P(D|P)`) should be updated accordingly (in this case, it goes way up).

Hypothetically, if we have family or friends living in an area where 1 in 1,000 people have covid-19 and they (god forbid) got tested and got a positive result, we could tell them that their probability of actually having the disease, given a positive test, is around 8.29%.

However, what's useful about the Bayesian approach is that it encourages us to incorporate new information and update our beliefs accordingly. So if we find out our family member or friend is also *symptomatic*, we could advise them of the higher probability (~47.61%).

Finally, we may also advise our family and friends to get tested **again**, because as much as a test-positive person would hope they got a false positive, the chances of that are low, and the chance of getting a false positive *twice* is even lower.
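A sketch of why a second test helps: treat the posterior from the first positive test as the new prior. This assumes the same rates as before (sensitivity ~90%, false-positive rate ~1%, both inferred from the simulation) and that the two tests are independent:

```python
def update(prior, sensitivity=0.90, fp_rate=0.01):
    """One application of Bayes' Theorem for a positive test result."""
    return sensitivity * prior / (sensitivity * prior + fp_rate * (1 - prior))

first = update(1 / 1000)  # posterior after one positive test, ~8.3%
second = update(first)    # posterior after a second positive test, ~89%
print('after one positive: %.1f%%, after two: %.1f%%' % (100 * first, 100 * second))
```

Under these assumptions, a second positive test pushes the probability of actually having the disease from roughly 8% to roughly 89%.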

For more content on data science, machine learning, R, Python, SQL and more, find me on Twitter.

This post is a continuation of my coverage of Data Science from Scratch by Joel Grus.

It picks up from the previous post, so be sure to check that out for proper context.

Building on our understanding of conditional probability we'll get into Bayes' Theorem. We'll spend some time understanding the concept before we implement an example in code.

Previously, we established an understanding of **conditional** probability, but building up with **marginal** and **joint** probabilities. We explored the conditional probabilities of two outcomes:

Outcome 1: What is the probability of the event "both children are girls" (B) conditional on the event "the older child is a girl" (G)?

The probability for outcome one is roughly 50% or (1/2).

Outcome 2: What is the probability of the event "both children are girls" (B) conditional on the event "at least one of the children is a girl" (L)?

The probability for outcome two is roughly 33% or (1/3).

**Bayes' Theorem** is simply *an alternate* way of calculating conditional probability.

Previously, we used the **joint** probability to calculate the **conditional** probability.

Here's the conditional probability for outcome 1, using a joint probability:

- P(G) = probability that the older child is a girl = 1/2
- P(B) = probability that both children are girls = 1/4
- P(B|G) = P(B,G) / P(G)
- P(B|G) = (1/4) / (1/2) = **1/2** or roughly **50%**

Technically, we *can't* compute the joint probability as a simple product here because the two events are *not independent*. To clarify: the gender of the older child and the gender of the younger child *are* independent, but in `P(B|G)`, the event 'both children are girls' and the event 'the older child is a girl' are *not* independent, which is why we express it as a *conditional* probability. And since 'both children are girls' implies 'the older child is a girl' (B is a subset of G), the joint probability `P(B,G)` is just `P(B)`.

Here's an alternate way to calculate the conditional probability (**without** joint probability):

`P(B|G) = P(G|B) * P(B) / P(G)`

**This is Bayes' Theorem.**

- P(B|G) = 1 * (1/4) / (1/2)
- P(B|G) = (1/4) * (2/1)
- P(B|G) = 1/2 = **50%**

**note**: P(G|B), the probability that the older child is a girl given that **both** children are girls, is a certainty (1.0)

The **reverse** conditional probability can also be calculated without joint probability:

What is the probability of the older child being a girl, given that both children are girls?

`P(G|B) = P(B|G) * P(G) / P(B)`

**This is Bayes' Theorem (reverse case).**

- P(G|B) = (1/2) * (1/2) / (1/4)
- P(G|B) = (1/4) / (1/4)
- P(G|B) = 1 = **100%**

This is consistent with what we already derived above, namely that P(G|B) is a **certainty** (probability = 1.0): the older child is certainly a girl, **given that** both children are girls.

We can point out two additional observations / rules:

- Joint probabilities are **symmetrical**: P(B,G) == P(G,B)
- Conditional probabilities are **not symmetrical**: P(B|G) != P(G|B)
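A quick numeric check of these two observations, using the values derived above (P(G) = 1/2, P(B) = 1/4); the variable names are mine:

```python
# P(B,G) = P(B), since 'both girls' implies 'older child is a girl'
p_g = 1/2
p_b = 1/4
p_joint = p_b  # P(B,G) == P(G,B): joint probabilities are symmetrical

p_b_given_g = p_joint / p_g  # 0.5
p_g_given_b = p_joint / p_b  # 1.0
print(p_b_given_g != p_g_given_b)  # True: conditionals are not symmetrical
```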

**Bayes Theorem** is a way of calculating conditional probability *without* the joint probability, summarized here:

`P(B|G) = P(G|B) * P(B) / P(G)` (**Bayes' Theorem**)

`P(G|B) = P(B|G) * P(G) / P(B)` (**Bayes' Theorem, reverse case**)

You'll note that `P(G)` is the denominator in the former, and `P(B)` is the denominator in the latter.

What if, for some reason, we don't have access to the denominator? We could derive both `P(G)` and `P(B)` another way, using the `NOT` operator:

- P(G) = P(G,B) + P(G,not B) = P(G|B) * P(B) + P(G|not B) * P(not B)
- P(B) = P(B,G) + P(B,not G) = P(B|G) * P(G) + P(B|not G) * P(not G)

Therefore, the alternative expression of Bayes Theorem for the probability of *both* children being girls, given that the first child is a girl ( P(B|G) ) is:

- P(B|G) = P(G|B) * P(B) / ( P(G|B) * P(B) + P(G|not B) * P(not B) )
- P(B|G) = (1 * 1/4) / (1 * 1/4 + 1/3 * 3/4)
- P(B|G) = (1/4) / (1/4 + 3/12)
- P(B|G) = (1/4) / (2/4) = (1/4) * (4/2)
- P(B|G) = 1/2 or roughly **50%**

We can check the result in code:

```
def bayes_theorem(p_b, p_g_given_b, p_g_given_not_b):
    # calculate P(not B)
    not_b = 1 - p_b
    # calculate P(G)
    p_g = p_g_given_b * p_b + p_g_given_not_b * not_b
    # calculate P(B|G)
    p_b_given_g = (p_g_given_b * p_b) / p_g
    return p_b_given_g

# P(B)
p_b = 1/4
# P(G|B)
p_g_given_b = 1
# P(G|not B)
p_g_given_not_b = 1/3

# calculate and print P(B|G)
result = bayes_theorem(p_b, p_g_given_b, p_g_given_not_b)
print('P(B|G) = %.2f%%' % (result * 100))
```

The probability that the older child is a girl, given that *both* children are girls ( P(G|B) ), is:

- P(G|B) = P(B|G) * P(G) / ( P(B|G) * P(G) + P(B|not G) * P(not G) )
- P(G|B) = (1/2 * 1/2) / ((1/2 * 1/2) + (0 * 1/2))
- P(G|B) = (1/4) / (1/4)
- P(G|B) = 1
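The reverse case can be checked with the same style of code. This is a sketch mirroring `bayes_theorem` above; the variable names are mine:

```python
# P(G|B) = P(B|G) * P(G) / ( P(B|G) * P(G) + P(B|not G) * P(not G) )
p_g = 1/2            # P(G): older child is a girl
p_b_given_g = 1/2    # P(B|G)
p_b_given_not_g = 0  # P(B|not G): both girls is impossible if the older child is a boy

# total probability gives the denominator, P(B) = 1/4
p_b = p_b_given_g * p_g + p_b_given_not_g * (1 - p_g)
p_g_given_b = (p_b_given_g * p_g) / p_b
print('P(G|B) = %.0f%%' % (p_g_given_b * 100))
```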

Let's unpack Outcome 2.

Outcome 2: What is the probability of the event "both children are girls" (B) conditional on the event "at least one of the children is a girl" (L)?

The probability for outcome two is roughly 33% or (1/3).

We'll go through the same process as above.

We could use **joint** probability to calculate the **conditional** probability. As with the previous outcome, the joint probability `P(B,L)` is just event B, `P(B)`:

- P(B|L) = P(B,L) / P(L) = (1/4) / (3/4) = 1/3

Or, we could use Bayes' Theorem to figure out the **conditional** probability **without joint** probability:

- P(B|L) = P(L|B) * P(B) / P(L)
- P(B|L) = (1 * 1/4) / (3/4)
- P(B|L) = 1/3

And if there's no `P(L)`, we can calculate it indirectly, also using Bayes' Theorem:

- P(L) = P(L|B) * P(B) + P(L|not B) * P(not B)
- P(L) = 1 * (1/4) + (2/3) * (3/4)
- P(L) = (1/4) + (2/4)
- P(L) = 3/4

Then we can use `P(L)` in the way Bayes' Theorem is commonly expressed when we don't have the denominator:

- P(B|L) = P(L|B) * P(B) / ( P(L|B) * P(B) + P(L|not B) * P(not B) )
- P(B|L) = (1 * 1/4) / (3/4)
- P(B|L) = 1/3
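We can verify P(B|L) the same way in code (a sketch with my own variable names):

```python
p_b = 1/4              # P(B): both children are girls
p_l_given_b = 1        # P(L|B): at least one girl is certain if both are girls
p_l_given_not_b = 2/3  # P(L|not B): 2 of the 3 remaining outcomes include a girl

# total probability: P(L) = 3/4
p_l = p_l_given_b * p_b + p_l_given_not_b * (1 - p_b)
p_b_given_l = (p_l_given_b * p_b) / p_l
print('P(B|L) = %.2f' % p_b_given_l)
```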

Now that we've gone through the calculations for two conditional probabilities, `P(B|G)` and `P(B|L)`, using Bayes' Theorem, and implemented code for one of the scenarios, let's take a step back and assess what this *means*.

I think it's useful to understand that probability in general shines when we want to describe uncertainty, and that Bayes' Theorem allows us to quantify how much the data we observe should change our beliefs.

We have two **posteriors**, `P(B|G)` and `P(B|L)`, both with equal **priors** and **likelihoods**, but with *different* **evidence**.

Said differently, we want to know the 'probability that both children are girls', given *different* conditions.

In the first case, our condition is 'the older child is a girl' and in the second case, our condition is '*at least one* of the children is a girl'. The question is: which condition increases the probability that **both** children are girls more?

Bayes' Theorem allows us to update our belief about the probability in these two cases, as we incorporate varied data into our framework.

What the calculations tell us is that the **evidence** that 'the older child is a girl' increases the probability that **both** children are girls (to 1/2) *more than* the **evidence** that 'at least one child is a girl' does (to 1/3).

And our beliefs should be updated accordingly.

At the end of the day, understanding conditional probability (and Bayes Theorem) comes down to **counting**. For our hypothetical scenarios, we only need one hand:

When we look at the probability table for outcome one, `P(B|G)`, we can see how the posterior probability comes out to 1/2:

When we look at the probability table for outcome two, `P(B|L)`, we can see how the posterior probability comes out to 1/3:
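The counting view can be made literal by enumerating the four equally likely (older, younger) outcomes (a sketch):

```python
from itertools import product

# all four equally likely (older, younger) gender combinations
outcomes = list(product(['Boy', 'Girl'], repeat=2))

both = [o for o in outcomes if o == ('Girl', 'Girl')]  # B: both girls
older = [o for o in outcomes if o[0] == 'Girl']        # G: older is a girl
at_least_one = [o for o in outcomes if 'Girl' in o]    # L: at least one girl

print(len(both) / len(older))         # P(B|G) = 0.5
print(len(both) / len(at_least_one))  # P(B|L) = 1/3
```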

This is part of an ongoing series documenting my progress through Data Science from Scratch by Joel Grus:

For more content on data science, machine learning, R, Python, SQL and more, find me on Twitter.

This post covers chapter 6, in continuation of my coverage of Data Science from Scratch by Joel Grus. We will work our way toward understanding conditional probability by understanding preceding concepts like marginal and joint probabilities.

At the end, we'll tie all concepts together through code. For those inclined, you can jump to the code towards the bottom of this post.

The first challenge in this section is distinguishing between **two** conditional probability statements.

Here's the setup. We have a family with two (unknown) children with two assumptions. First, each child is equally likely to be a boy or a girl. Second, the gender of the second child is *independent* of the gender of the first child.

Outcome 1: What is the probability of the event "both children are girls" (B) conditional on the event "the older child is a girl" (G)?

The probability for statement one is roughly 50% or (1/2).

Outcome 2: What is the probability of the event "both children are girls" (B) conditional on the event "at least one of the children is a girl" (L)?

The probability for statement two is roughly 33% or (1/3).

But at first glance, they look similar.

The book jumps straight to conditional probabilities, but first, we'll have to look at **marginal** and **joint** probabilities. Then we'll create a **joint probabilities table** and **sum** probabilities to help us figure out the differences. We'll then *resume* with **conditional probabilities**.

Before anything, we need to realize the situation we have is one of **independence**. The gender of one child is **independent** of a second child.

The intuition for this scenario will be different from a **dependent** situation. For example, if we draw two cards from a deck (without replacement), the probabilities are different. The probability of drawing one King is (4/52) and the probability of drawing a second King is now (3/51); the probability of the second event (a second King) is *dependent* on the result of the first draw.

Back to the two unknown children.

We can say the probability of the first child being either a boy or a girl is 50/50. Moreover, the probability of the second child, which is **independent** of the first, is *also* 50/50. Remember, our first assumption is that *each child is equally likely to be a boy or a girl*.

Let's put these numbers in a table. The (1/2) probabilities shown here are called **marginal** probabilities (note how they're at the margins of the table).

Since we have two genders (much like two sides of a flipped coin), we can intuitively enumerate *all* possible outcomes:

- first child (Boy), second child (Boy)
- first child (Boy), second child (Girl)
- first child (Girl), second child (Boy)
- first child (Girl), second child (Girl)

There are *4 possible outcomes* so the probability of getting any one of the four outcomes is (1/4). We can actually write these probabilities in the middle of the table, the **joint probabilities**:

To recap, the probability of the first child being either boy or girl is 50/50, simple enough. The probability of the second child being either boy or girl is also 50/50. When put in a table, this yielded the **marginal probability**.

Now we want to know the probability of, say, 'first child being a boy and second child being a girl'. This is a **joint probability** because it is the probability that the first child takes a specific gender (boy) **AND** the second child takes a specific gender (girl).

If two events are **independent**, and in this case they are, their **joint probability** is the *product* of the probabilities of **each one happening**.

Take the probability of the first child being a Boy (1/2) **and** the second child being a Girl (1/2); the product of the two marginal probabilities is the joint probability (1/2 * 1/2 = 1/4).

This can be repeated for the other three joint probabilities.
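As a sketch, the four joint probabilities can be enumerated in a few lines:

```python
from itertools import product

p = {'Boy': 0.5, 'Girl': 0.5}  # marginal probability for each child
for first, second in product(p, p):
    # independence: joint probability is the product of the marginals
    print('P(1st = %s, 2nd = %s) = %.2f' % (first, second, p[first] * p[second]))
```

Each of the four combinations comes out to 0.25, matching the table.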

Now we get into **conditional probability**, which is the probability of one event happening (i.e., second child being a Boy or Girl) **given that** or **on condition that** another event happened (i.e., first child being a Boy).

At this point, it might be a good idea to begin writing probability statements *similar* to how it is expressed in mathematics.

A joint probability is the product of each individual event happening (assuming they are independent events). For example we might have two individual events:

- P(1st Child = Boy): 1/2
- P(2nd Child = Boy): 1/2

Here is their **joint probability**:

- P(1st Child = Boy, 2nd Child = Boy)
- P(1st Child = Boy) * P(2nd Child = Boy)
- (1/2 * 1/2 = 1/4)

There is a relationship between **conditional** probabilities and **joint** probabilities.

Here is their **conditional probability**:

- P(2nd Child = Boy | 1st Child = Boy)
- P(1st Child = Boy, 2nd Child = Boy) / P(1st Child = Boy)

This works out to:

- (1/4) / (1/2) = 1/2 or
- (1/4) * (2/1) = 1/2

In other words, the probability that the second child is a boy, given that the first child is a boy, is still 50% (when the events are **independent**, conditioning on one does not change the probability of the other).

Now we're ready to tackle the two outcomes posed at the beginning of this post.

Outcome 1: What is the probability of the event "both children are girls" (B) conditional on the event "the older child is a girl" (G)?

Let's break it down. First we want the probability of the event that "both children are girls". We'll take the product of two events: the probability that the first child is a girl (1/2) and the probability that the second child is a girl (1/2). So for **both** children to be girls, 1/2 * 1/2 = 1/4.

- P(1st Child = Girl, 2nd Child = Girl) = 1/4

Second, we want that to be **given that** the "older child is a girl".

- P(1st Child = Girl) = 1/2

**Conditional probability**:

- P(1st Child = Girl, 2nd Child = Girl) / P(1st Child = Girl)
- (1/4) / (1/2) = (1/4) * (2/1) = (2/4) = **1/2** or roughly **50%**

Now let's break down the second outcome:

Again, we start with "both children are girls":

- P(1st Child = Girl, 2nd Child = Girl) = 1/4

Then, we have "on condition that at least one of the children is a girl". We'll reference a **joint probability table**. We see that when trying to figure out the probability that "at least one of the children is a girl", we rule out the scenario where **both** children are boys; that scenario is the complement of *at least one child is a girl*. The remaining 3 out of 4 outcomes fit the condition.

The probability of at least one child being a girl is:

- (1/4) + (1/4) + (1/4) = 3/4

So:

- P(1st Child = Girl, 2nd Child = Girl) / P("at least one child is a girl")
- (1/4) / (3/4) = (1/4) * (4/3) = (4/12) = **1/3** or roughly **33%**

When two events are **independent**, their **joint probability** is the product of each event:

- P(E,F) = P(E) * P(F)

Their **conditional** probability is the **joint probability** divided by the probability of the conditioning event (i.e., P(F)).

- P(E|F) = P(E,F) / P(F)
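These two rules can be wrapped in small helpers (a sketch; the function names are mine, not from the book):

```python
def joint(p_e: float, p_f: float) -> float:
    """P(E,F) for independent events E and F."""
    return p_e * p_f

def conditional(p_ef: float, p_f: float) -> float:
    """P(E|F) = P(E,F) / P(F)."""
    return p_ef / p_f

# P(2nd = Boy | 1st = Boy) for independent children: still 1/2
print(conditional(joint(0.5, 0.5), 0.5))  # 0.5
```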

And so for our two challenge scenarios, we have:

Challenge 1:

- B = the event that both children are girls
- G = the event that the *older* child is a girl

This can be stated as: P(B|G) = P(B,G) / P(G)

Challenge 2:

- B = the event that both children are girls
- L = the event that *at least one* child is a girl

This can be stated as: P(B|L) = P(B,L) / P(L)

Now that we have an intuition and have worked out the problem on paper, we can use code to express conditional probability:

```
import enum, random

class Kid(enum.Enum):
    BOY = 0
    GIRL = 1

def random_kid() -> Kid:
    return random.choice([Kid.BOY, Kid.GIRL])

both_girls = 0
older_girl = 0
either_girl = 0

random.seed(0)
for _ in range(10000):
    younger = random_kid()
    older = random_kid()
    if older == Kid.GIRL:
        older_girl += 1
    if older == Kid.GIRL and younger == Kid.GIRL:
        both_girls += 1
    if older == Kid.GIRL or younger == Kid.GIRL:
        either_girl += 1

print("P(both | older):", both_girls / older_girl)   # 0.5007089325501317
print("P(both | either):", both_girls / either_girl) # 0.3311897106109325
```

We can see that the code confirms our intuition by looking at each of the **joint probabilities**:

```
either_girl #7,464 / 10,000 ~ roughly 75% or 3/4 probability that there is at least one girl
both_girls #2,472 / 10,000 ~ roughly 25% or 1/4 probability that both children are girls
older_girl #4,937 / 10,000 ~ roughly 50% or 1/2 probability that the first child is a girl
```

Challenge 1:

- P(B|G) = P(B,G) / P(G) or more explicitly:
- P(both_girls | older_girl) = P(both_girls) / P(older_girl)

Challenge 2:

- P(B|L) = P(B,L) / P(L) or more explicitly:
- P(both_girls | either_girl) = P(both_girls) / P(either_girl)

**Conditional** probabilities are **conditional** statements in code.

First, we define a `random_kid` function that uses the `random.choice` method to assign gender, such that each child (i.e., `Kid` class instance) *is equally likely to be a boy or a girl*. This is the first assumption of our scenario.

```
import enum, random

class Kid(enum.Enum):
    BOY = 0
    GIRL = 1

def random_kid() -> Kid:
    return random.choice([Kid.BOY, Kid.GIRL])
```

Next we create counter variables representing the **joint distributions**: one for both children being girls (`both_girls`), one for the *older child* being a girl (`older_girl`), and one for *at least one* child being a girl (`either_girl`).

Since the probability of any one child being a girl is (1/2), consistent with our assumption, we'd expect:

```
older_girl #4,937 / 10,000 ~ roughly 50% or 1/2 probability that the first child is a girl
```

Recall that when we take the **product** of the probabilities of *each* child being a girl (1/2), we get the joint probability of *both* children being girls (1/4). Thus, we'd expect:

```
both_girls #2,472 / 10,000 ~ roughly 25% or 1/4 probability that both children are girls
```

Finally, recall that to calculate the probability that *at least one (of two) children is a girl*, we can rule out the (1/4) probability that both children are boys, leaving (1/4 + 1/4 + 1/4 = 3/4) (see table above). Thus, we'd expect:

```
either_girl #7,464 / 10,000 ~ roughly 75% or 3/4 probability that there is at least one girl
```

To arrive at the numbers we see above, we run 10,000 *simulations* in which the 1st child and 2nd child (see table above) are randomly assigned a gender, and conditional statements count whether certain outcomes are `True`.

```
random.seed(0)
for _ in range(10000):
    younger = random_kid()
    older = random_kid()
    if older == Kid.GIRL:
        older_girl += 1
    if older == Kid.GIRL and younger == Kid.GIRL:
        both_girls += 1
    if older == Kid.GIRL or younger == Kid.GIRL:
        either_girl += 1
```

This simulation yields the joint probabilities which are then used to find the conditional probabilities of the two outcomes above:

```
print("P(both | older):", both_girls / older_girl) # 0.5007089325501317
print("P(both | either):", both_girls / either_girl) # 0.3311897106109325
```

For more content on data science, machine learning, R, Python, SQL and more, find me on Twitter.

This post covers chapter 5, in continuation of my coverage of Data Science from Scratch by Joel Grus.

It should be noted upfront that everything covered in this post can be done more expediently and efficiently in libraries like NumPy as well as the statistics module in Python.

The primary value of this book, and by extension this post, in my opinion, is the emphasis on **learning** how Python primitives can be used to build tools from the ground up. Here's a visual preview:

Specifically, we'll examine how specific features of the Python language as well as functions we built in a previous post on Vectors in Python (see also Matrices) can be used to build tools used to *describe* data and relationships within data (aka **statistics**).

I think this is pretty cool. Hopefully you agree.

This chapter continues the narrative of you as a newly hired data scientist at DataScienster, the social network for data scientists, and your job is to *describe* how many friends members of this social network have. We have two `lists` of `float` values to work with. We'll work with `num_friends` first, then `daily_minutes` later.

I wanted this post to be self-contained, and in order to do that we'll have to read in a larger-than-average `list` of `floats`. The alternative would be to get the data directly from the book's github repo (statistics.py):

```
num_friends = [100.0,49,41,40,25,21,21,19,19,18,18,16,15,15,15,15,14,14,13,13,13,13,12,12,11,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,8,8,8,8,8,8,8,8,8,8,8,8,8,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]
daily_minutes = [1,68.77,51.25,52.08,38.36,44.54,57.13,51.4,41.42,31.22,34.76,54.01,38.79,47.59,49.1,27.66,41.03,36.73,48.65,28.12,46.62,35.57,32.98,35,26.07,23.77,39.73,40.57,31.65,31.21,36.32,20.45,21.93,26.02,27.34,23.49,46.94,30.5,33.8,24.23,21.4,27.94,32.24,40.57,25.07,19.42,22.39,18.42,46.96,23.72,26.41,26.97,36.76,40.32,35.02,29.47,30.2,31,38.11,38.18,36.31,21.03,30.86,36.07,28.66,29.08,37.28,15.28,24.17,22.31,30.17,25.53,19.85,35.37,44.6,17.23,13.47,26.33,35.02,32.09,24.81,19.33,28.77,24.26,31.98,25.73,24.86,16.28,34.51,15.23,39.72,40.8,26.06,35.76,34.76,16.13,44.04,18.03,19.65,32.62,35.59,39.43,14.18,35.24,40.13,41.82,35.45,36.07,43.67,24.61,20.9,21.9,18.79,27.61,27.21,26.61,29.77,20.59,27.53,13.82,33.2,25,33.1,36.65,18.63,14.87,22.2,36.81,25.53,24.62,26.25,18.21,28.08,19.42,29.79,32.8,35.99,28.32,27.79,35.88,29.06,36.28,14.1,36.63,37.49,26.9,18.58,38.48,24.48,18.95,33.55,14.24,29.04,32.51,25.63,22.22,19,32.73,15.16,13.9,27.2,32.01,29.27,33,13.74,20.42,27.32,18.23,35.35,28.48,9.08,24.62,20.12,35.26,19.92,31.02,16.49,12.16,30.7,31.22,34.65,13.13,27.51,33.2,31.57,14.1,33.42,17.44,10.12,24.42,9.82,23.39,30.93,15.03,21.67,31.09,33.29,22.61,26.89,23.48,8.38,27.81,32.35,23.84]
daily_hours = [dm / 60 for dm in daily_minutes]
```

The `num_friends` list is a list of numbers representing the "number of friends" a person has; for example, one person has 100 friends. The first thing we do to describe the data is create a bar chart plotting the number of people who have 100 friends, 49 friends, 41 friends, and so on.

We'll import `Counter` from `collections` and import `matplotlib.pyplot`.

We'll use `Counter` to turn the `num_friends` list into a `defaultdict(int)`-like object mapping keys to counts. For more info, please refer to this previous post on Counters.

Once we use the `Counter` collection, a high-performance container datatype, we can use methods like `most_common` to find the keys with the most common values. Here we see that the five most common *numbers of friends* are 6, 1, 4, 3 and 9, respectively.

```
from collections import Counter
import matplotlib.pyplot as plt
friend_counts = Counter(num_friends)
# the five most common values are: 6, 1, 4, 3 and 9 friends
# [(6, 22), (1, 22), (4, 20), (3, 20), (9, 18)]
friend_counts.most_common(5)
```

To proceed with plotting, we'll use a `list comprehension` that loops through the **keys** 0-100 (xs) and looks up the corresponding **value** in `friend_counts` (a `Counter` returns 0 for missing keys). These counts become the y-axis; the number of friends is the x-axis:

```
xs = range(101) # x-axis: largest num_friend value is 100
ys = [friend_counts[x] for x in xs] # y-axis
plt.bar(xs, ys)
plt.axis([0, 101, 0, 25])
plt.title("Histogram of Friend Counts")
plt.xlabel("# of friends")
plt.ylabel("# of people")
plt.show()
```

Here is the plot below. You can see one person with 100 friends.

You can also read more about data visualization here.

Alternatively, we could generate simple statistics to describe the data using built-in Python functions: `len`, `min`, `max` and `sorted`.

```
num_points = len(num_friends) # number of data points in num_friends: 204
largest_value = max(num_friends) # largest value in num_friends: 100
smallest_value = min(num_friends) # smallest value in num_friends: 1
sorted_values = sorted(num_friends) # sort the values in ascending order
second_largest_value = sorted_values[-2] # second largest value from the back: 49
```

The most common way of describing a set of data is to find its **mean**: the sum of all the values divided by the number of values. *note*: we'll continue to use type annotations; in my opinion, it helps you be a more deliberate and mindful Python programmer.

```
from typing import List

def mean(xs: List[float]) -> float:
    return sum(xs) / len(xs)

assert 7.3333 < mean(num_friends) < 7.3334
```

However, the mean is **notoriously sensitive to outliers** so statisticians often supplement with other measures of central tendencies like **median**. Because the median is the *middle-most value*, it matters whether there is an *even* or *odd* number of data points.

Here, we'll create two private functions covering both situations (an even or odd number of data points) when calculating the median. First, we sort the data values. Then, for an *even number* of values, we find the two middle values and average them. For an *odd number* of values, we take the element at index `len(xs) // 2` of the sorted list, i.e., the middle-most value.

Our `median` function will return either private function, `_median_even` or `_median_odd`, depending on whether the length of the list is divisible by 2 (`% 2 == 0`).

```
def _median_even(xs: List[float]) -> float:
    """If len(xs) is even, it's the average of the middle two elements"""
    sorted_xs = sorted(xs)
    hi_midpoint = len(xs) // 2  # e.g. length 4 => hi_midpoint 2
    return (sorted_xs[hi_midpoint - 1] + sorted_xs[hi_midpoint]) / 2

def _median_odd(xs: List[float]) -> float:
    """If len(xs) is odd, it's the middle element"""
    return sorted(xs)[len(xs) // 2]

def median(v: List[float]) -> float:
    """Finds the 'middle-most' value of v"""
    return _median_even(v) if len(v) % 2 == 0 else _median_odd(v)

assert median([1, 10, 2, 9, 5]) == 5
assert median([1, 9, 2, 10]) == (2 + 9) / 2
```

Because the median is the *middle-most value*, it does not fully depend on every value in the data. For illustration: hypothetically, if we had another list `num_friends2` where one person had 10,000 friends, the **mean** would be much more sensitive to that change than the **median** would be.

```
num_friends2 = [10000.0,49,41,40,25,21,21,19,19,18,18,16,15,15,15,15,14,14
,13,13,13,13,12,12,11,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,9,9,9,9
,9,9,9,9,9,9,9,9,9,9,9,9,9,9,8,8,8,8,8,8,8,8,8,8,8,8,8,7,7,7,7,7,7,7,7,7,7
,7,7,7,7,7,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,5,5,5,5,5,5,5,5,5,5
,5,5,5,5,5,5,5,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,3,3,3,3,3,3,3,3,3,3
,3,3,3,3,3,3,3,3,3,3,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1,1
,1,1,1,1,1,1,1,1,1,1,1,1]
mean(num_friends2) # more sensitive to outliers: 7.333 => 55.86274509803921
median(num_friends2) # less sensitive to outliers: 6.0 => 6.0
```

You may also use `quantiles` to describe your data. Whenever you've heard "Xth percentile", that is a description of quantiles relative to 100. In fact, the median is the 50th percentile (where 50% of the data lies below that point and 50% lies above).

Because a `quantile` is a position from 0-100, the second argument is a float from 0.0 to 1.0. We multiply that float by the length of the list, wrap it in `int` to create an integer index, and use that index on the sorted xs to find the quantile.

```
def quantile(xs: List[float], p: float) -> float:
    """Returns the pth-percentile value in x"""
    p_index = int(p * len(xs))
    return sorted(xs)[p_index]

assert quantile(num_friends, 0.10) == 1
assert quantile(num_friends, 0.25) == 3
assert quantile(num_friends, 0.75) == 9
assert quantile(num_friends, 0.90) == 13
```

Finally, we have the **mode**, which looks at the most common values. First, we call `Counter` on our list parameter, and since `Counter` is a subclass of `dict`, we have access to methods like `values()` to find all the values and `items()` to find key-value pairs.

We define `max_count` to find the max count (22); the function then returns a list comprehension that loops through `counts.items()` to find the keys associated with `max_count`. Those keys are 1 and 6 (the **modes**), meaning twenty-two people had one friend and twenty-two people had six friends.

```
def mode(x: List[float]) -> List[float]:
    """Returns a list, since there might be more than one mode"""
    counts = Counter(x)
    max_count = max(counts.values())
    return [x_i for x_i, count in counts.items() if count == max_count]

assert set(mode(num_friends)) == {1, 6}
```

Because we had already used `Counter` on `num_friends` previously (see `friend_counts`), we could have just called the `most_common(2)` method to get the same result:

```
mode(num_friends) # [6, 1]
friend_counts.most_common(2) # [(6, 22), (1, 22)]
```

Aside from our data's central tendencies, we'll also want to understand its spread or dispersion. The tools for this are `data_range`, `variance`, `standard deviation` and `interquartile range`.

Range is a straightforward max value minus min value.
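Both range and interquartile range can be sketched from the definitions above. This follows the naming in the tool list (`data_range`, `interquartile range`); the `quantile` function is repeated from earlier in this post so the block is self-contained, and the toy data list is my own:

```python
from typing import List

def quantile(xs: List[float], p: float) -> float:
    """Returns the pth-percentile value in xs (as defined earlier)"""
    p_index = int(p * len(xs))
    return sorted(xs)[p_index]

def data_range(xs: List[float]) -> float:
    """Spread: largest value minus smallest value"""
    return max(xs) - min(xs)

def interquartile_range(xs: List[float]) -> float:
    """Spread of the middle 50%: 75th percentile minus 25th percentile"""
    return quantile(xs, 0.75) - quantile(xs, 0.25)

xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # toy data with one outlier
print(data_range(xs))           # 99: dominated by the outlier
print(interquartile_range(xs))  # 5: robust to the outlier
```

Note how the interquartile range, like the median, is far less sensitive to the single outlier than the plain range.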

Variance measures how far a set of numbers is from their average value. What's more interesting, for our purpose, is how we need to borrow the functions we had previously built in the vectors and matrices posts to create the variance function.

If you look at its Wikipedia page, **variance** is the *squared deviation* of a variable from its mean.

First, we'll create the `de_mean` function, which takes a list of numbers and subtracts the mean from every number in the list (this gives us the deviations from the mean). Then we'll take the `sum_of_squares` of those deviations, i.e., multiply each value by itself and add them all up, and divide by the length of the list minus one to get the variance.

Recall that the `sum_of_squares`

is a special case of the `dot`

product function.

```
# variance
from typing import List

Vector = List[float]

# see vectors.py in chapter 4 for dot and sum_of_squares
def dot(v: Vector, w: Vector) -> float:
    """Computes v_1 * w_1 + ... + v_n * w_n"""
    assert len(v) == len(w), "vectors must be the same length"
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

def sum_of_squares(v: Vector) -> float:
    """Returns v_1 * v_1 + ... + v_n * v_n"""
    return dot(v, v)

def de_mean(xs: List[float]) -> List[float]:
    """Translate xs by subtracting its mean (so the result has mean 0)"""
    x_bar = mean(xs)
    return [x - x_bar for x in xs]

def variance(xs: List[float]) -> float:
    """Almost the average squared deviation from the mean"""
    assert len(xs) >= 2, "variance requires at least two elements"
    n = len(xs)
    deviations = de_mean(xs)
    return sum_of_squares(deviations) / (n - 1)

assert 81.54 < variance(num_friends) < 81.55
```

The **variance** is built from the `sum_of_squares` of deviations, which can be tricky to interpret because its units are the original units *squared*. For example, `num_friends` has values ranging from 0 to 100. What does a variance of 81.54 mean?

A more common alternative is the **standard deviation**: the square root of the variance, which we take using Python's `math` module.

With a standard deviation of 9.03 and a mean of 7.3 for `num_friends`, anything between 7.3 - 9.03 (effectively 0 friends) and 7.3 + 9.03 (about 16 friends) is *within one standard deviation of the mean*. And we can check by running `friend_counts` that most people are within one standard deviation of the mean.

On the other hand, we know that someone with 20 friends is **more than one standard deviation** above the mean.

```
import math

def standard_deviation(xs: List[float]) -> float:
    """The standard deviation is the square root of the variance"""
    return math.sqrt(variance(xs))

assert 9.02 < standard_deviation(num_friends) < 9.04
```
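To make the "within one standard deviation" check concrete, here's a small sketch on a hypothetical sample, using the standard library's `statistics` module (whose `stdev`, like the `variance` above, divides by n - 1):

```python
import statistics

# hypothetical sample of friend counts
data = [2, 4, 4, 4, 5, 5, 7, 9]
mu = statistics.mean(data)       # 5
sigma = statistics.stdev(data)   # sample standard deviation, about 2.14

# keep only the values within one standard deviation of the mean
within = [x for x in data if abs(x - mu) <= sigma]
print(len(within), "of", len(data))  # 6 of 8
```

Six of the eight values fall inside the one-sigma band; only the extremes (2 and 9) fall outside.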

However, because the **standard deviation** builds on the **variance**, which depends on the **mean**, it is, just like the mean, sensitive to outliers. An alternative is the **interquartile range**, which is based on quantiles (like the **median**) and is less sensitive to outliers.

Specifically, the interquartile range measures the spread of `num_friends` between the 25th and 75th percentiles. The middle 50% of people have between 3 and 9 friends, a range of 6.

```
def interquartile_range(xs: List[float]) -> float:
    """Returns the difference between the 75%-ile and the 25%-ile"""
    return quantile(xs, 0.75) - quantile(xs, 0.25)

assert interquartile_range(num_friends) == 6
```
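The `quantile` helper comes from earlier in the chapter. As a sanity check on the same idea, the standard library can compute quartiles too; note that `statistics.quantiles` with `method='inclusive'` uses linear interpolation, which won't necessarily match an index-based quantile on every dataset:

```python
import statistics

# hypothetical data where the quartiles are unambiguous
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

q1, q2, q3 = statistics.quantiles(data, n=4, method='inclusive')
print(q1, q2, q3)   # 3.0 5.0 7.0
print(q3 - q1)      # interquartile range: 4.0
```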

Now that we've described a single list of data, we'll also want to look at potential relationships between two sets of data. For example, we may have a hypothesis that the amount of time spent on the DataScienster social network is somehow related to the number of friends someone has.

We'll examine covariance and correlations next.

If variance is how much a *single* set of numbers deviates from its mean (i.e., see `de_mean` above), then **covariance** measures how two sets of numbers vary from *their* means, with the idea that if they co-vary, they could be related.

Here we'll borrow the `dot` product function we developed in the vectors in python post. Then we'll examine whether there's a relationship between `num_friends` and `daily_minutes`, and between `num_friends` and `daily_hours` (see above).

```
def covariance(xs: List[float], ys: List[float]) -> float:
    assert len(xs) == len(ys), "xs and ys must have same number of elements"
    return dot(de_mean(xs), de_mean(ys)) / (len(xs) - 1)

assert 22.42 < covariance(num_friends, daily_minutes) < 22.43
assert 22.42 / 60 < covariance(num_friends, daily_hours) < 22.43 / 60
```

As with variance, a similar critique can be made of **covariance**: it takes extra work to interpret. For example, the covariance of `num_friends` and `daily_minutes` is about 22.42. What does that mean? Is that considered a strong relationship?

A more intuitive measure would be a **correlation**:

```
def correlation(xs: List[float], ys: List[float]) -> float:
    """Measures how much xs and ys vary in tandem about their means"""
    stdev_x = standard_deviation(xs)
    stdev_y = standard_deviation(ys)
    if stdev_x > 0 and stdev_y > 0:
        return covariance(xs, ys) / stdev_x / stdev_y
    else:
        return 0    # if no variation, correlation is zero

assert 0.24 < correlation(num_friends, daily_minutes) < 0.25
assert 0.24 < correlation(num_friends, daily_hours) < 0.25
```

By dividing out the standard deviations of both input variables, correlation always lies between -1 (perfect anti-correlation) and 1 (perfect correlation). A correlation of 0.24 is relatively weak (although what counts as weak, moderate, or strong depends on the context of the data).
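As a quick illustration on toy data (not the book's dataset), a perfectly linear relationship pins the correlation to the endpoints of that range. Here's a minimal self-contained Pearson correlation:

```python
import math

def pearson(xs, ys):
    """Minimal Pearson correlation: covariance over the product of standard deviations"""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / (n - 1))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / (n - 1))
    return cov / (sx * sy)

assert abs(pearson([1, 2, 3, 4], [2, 4, 6, 8]) - 1.0) < 1e-9   # perfectly linear: +1
assert abs(pearson([1, 2, 3, 4], [8, 6, 4, 2]) + 1.0) < 1e-9   # perfectly inverse: -1
```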

One thing to keep in mind is **Simpson's paradox**: the relationship between two variables can change when accounting for a third, **confounding** variable. Moreover, we should keep this cliché in mind (it's a cliché for a reason): **correlation does not imply causation**.
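A tiny made-up example of Simpson's paradox: within each group the relationship is negative, but pooling the groups flips the sign.

```python
import math

def pearson(xs, ys):
    """Minimal Pearson correlation (toy helper for this example)"""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# two hypothetical groups: within each, y falls as x rises
group_a = ([1, 2, 3], [10, 9, 8])
group_b = ([11, 12, 13], [20, 19, 18])

assert pearson(*group_a) < 0   # negative within group A
assert pearson(*group_b) < 0   # negative within group B

# pooled together, the group-level offset (the confounder) makes the trend positive
xs = group_a[0] + group_b[0]
ys = group_a[1] + group_b[1]
assert pearson(xs, ys) > 0
```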

We are just five chapters in and we can begin to see how we're building the tools *now* that we'll use later on. Here's a visual summary of what we've covered in this post and how it connects to previous posts, namely vectors, matrices and the python crash course: see setup, functions, lists, dictionaries and more.

This post continues my coverage of this book:

For more content on data science, machine learning, R, Python, SQL and more, find me on Twitter.

**note**: This post uses R and RMarkdown, along with packages from the Tidyverse and other R packages.

I recently came across the **Datasaurus** dataset by Alberto Cairo via #TidyTuesday and wanted to create a series of charts illustrating the lesson associated with this dataset: never trust summary statistics alone.

First, some context. Here's Alberto's original tweet from years ago when he created this dataset:

This tweet alone doesn't communicate why we shouldn't trust summary statistics alone, so let's unpack this. First we'll load the various packages and data we'll use.

```
library(tidyverse)
library(ggcorrplot)
library(ggridges)
```

**note**: `datasaurus` and `datasaurus_dozen` are identical. The former is provided via #TidyTuesday; the latter comes from this research paper, which discusses more advanced concepts beyond the scope of this document (i.e., simulated annealing).

You'll also note that `datasaurus_dozen` and `datasaurus_wide` are the same data organized differently: the former in *long* format and the latter in *wide* format - see here for details.

For the most part, we'll use `datasaurus_dozen` throughout this document. We'll use `datasaurus_wide` when we get to the correlation section.
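The R code in this post stays in the tidyverse, but as a language-neutral illustration of long vs. wide, here's a small hypothetical sketch in pandas (the column names and values are invented):

```python
import pandas as pd

# wide format: one column per dataset
wide = pd.DataFrame({'away': [32.3, 53.4],
                     'dino': [55.3, 51.5]})

# long format: a 'dataset' key column plus a single 'x' value column
long = wide.melt(var_name='dataset', value_name='x')
# 4 rows: away 32.3, away 53.4, dino 55.3, dino 51.5
```

Same data, two shapes: wide is convenient for column-wise operations like `cor()`, while long is what `ggplot2` and `group_by` pipelines expect.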

```
datasaurus <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-10-13/datasaurus.csv')
datasaurus_dozen <- read_tsv('DatasaurusDozen.tsv')
datasaurus_wide <- read_tsv('DatasaurusDozen-wide.tsv')
```

First, we'll note that if we just look at summary statistics (i.e., **mean** and **standard deviation**), we might conclude that these variables are all the *same*. Moreover, within each variable, the `x` and `y` values have very **similarly low correlations**, ranging from -0.06 to -0.07.

```
datasaurus_dozen %>%
  group_by(dataset) %>%
  summarize(
    x_mean = mean(x),
    x_sd = sd(x),
    y_mean = mean(y),
    y_sd = sd(y),
    corr = cor(x, y)
  )
```

There are 13 variables, each with x- and y-values. Here are the summary statistics. You'll note that all 13 variables have the same **mean** and **standard deviation** for their `x` and `y` values.

You could use `boxplots` to show *slight* variation in the distribution and **median** values of these 13 variables. However, the **mean** values, indicated with the red circles, are identical.

```
datasaurus_dozen %>%
  ggplot(aes(x = dataset, y = x, fill = dataset)) +
  geom_boxplot(alpha = 0.6) +
  stat_summary(fun = mean, geom = "point", shape = 20, size = 6, color = "red", fill = "red") +
  scale_fill_brewer(palette = "Set3") +
  theme_classic() +
  theme(legend.position = 'none') +
  labs(
    x = '13 variables',
    y = 'X-values',
    title = "Boxplots: Slight differences in the distribution and median values (X-axis)",
    subtitle = "Identical mean values"
  )
```

Here's the same plot for `y` values:

```
datasaurus_dozen %>%
  ggplot(aes(x = dataset, y = y, fill = dataset)) +
  geom_boxplot(alpha = 0.6) +
  stat_summary(fun = mean, geom = "point", shape = 20, size = 6, color = "red", fill = "red") +
  scale_fill_brewer(palette = "Paired") +
  theme_classic() +
  theme(legend.position = 'none') +
  labs(
    x = '13 variables',
    y = 'Y-values',
    title = "Boxplots: Slight differences in the distribution and median values (Y-axis)",
    subtitle = "Identical mean values"
  )
```

We can begin to get a sense for how these variables are different if we plot the distribution in different ways. The ridgeline plot begins to reveal aspects of the data that were hidden before.

We can begin to see that certain variables have markedly different distribution shapes (i.e., `v_lines`, `dots`, `x_shape`, `wide_lines`), while having the same **mean** value.

```
datasaurus_dozen %>%
  ggplot(aes(x = x, y = dataset, fill = dataset)) +
  geom_density_ridges_gradient(scale = 3, quantile_lines = T, quantile_fun = mean) +
  scale_fill_manual(values = c('#a6cee3', '#1f78b4', '#b2df8a', '#33a02c', '#fb9a99', '#e31a1c', '#fdbf6f', '#ff7f00', '#cab2d6', '#6a3d9a', '#ffff99', '#b15928', 'grey')) +
  theme_classic() +
  theme(legend.position = 'none') +
  labs(
    x = "X-values",
    y = "13 variables",
    title = "Ridgeline Plot: More variation in the distribution (X-axis)",
    subtitle = "Identical mean values"
  )
```

For `y` values, `high_lines`, `dots`, `circle` and `star` have obviously different distributions from the rest. Again, the **mean** values are identical across variables.

```
datasaurus_dozen %>%
  ggplot(aes(x = y, y = dataset, fill = dataset)) +
  geom_density_ridges_gradient(scale = 3, quantile_lines = T, quantile_fun = mean) +
  scale_fill_manual(values = c('#a6cee3', '#1f78b4', '#b2df8a', '#33a02c', '#fb9a99', '#e31a1c', '#fdbf6f', '#ff7f00', '#cab2d6', '#6a3d9a', '#ffff99', '#b15928', 'grey')) +
  theme_classic() +
  theme(legend.position = 'none') +
  labs(
    x = "Y-values",
    y = "13 variables",
    title = "Ridgeline Plot: More variation in the distribution (Y-axis)",
    subtitle = "Identical mean values"
  )
```

If you skip visualizing the distribution and central tendencies and go straight to seeing how the variables correlate with each other, you could also miss some fundamental differences in the data.

In particular, the `x` values (and likewise the `y` values) are *highly correlated* across the 13 variables. With just knowledge of the summary statistics, one could be led to believe that these variables are *highly similar*.

Below is the code to create an (abbreviated) **correlation matrix**.

```
library(ggcorrplot)

# X-values: select rows 2-143, convert character columns to numeric
datasaurus_wide_x <- datasaurus_wide %>%
  slice(2:143) %>%
  select(away, bullseye, circle, dino, dots, h_lines, high_lines, slant_down, slant_up, star, v_lines, wide_lines, x_shape) %>%
  mutate_if(is.character, as.numeric)

# Y-values: select rows 2-143, convert character columns to numeric
datasaurus_wide_y <- datasaurus_wide %>%
  slice(2:143) %>%
  select(away_1, bullseye_1, circle_1, dino_1, dots_1, h_lines_1, high_lines_1, slant_down_1, slant_up_1, star_1, v_lines_1, wide_lines_1, x_shape_1) %>%
  mutate_if(is.character, as.numeric)

# correlation matrix for X values
corr_x <- round(cor(datasaurus_wide_x), 1)

# correlation matrix for Y values
corr_y <- round(cor(datasaurus_wide_y), 1)

head(corr_x[, 1:6])
```

Here is the correlation between the `x` values of all 13 variables. You can see that all variables, aside from `away`, are highly correlated with each other.

```
# correlation between X-values
ggcorrplot(corr_x, hc.order = TRUE,
           type = "lower",
           outline.color = "white",
           ggtheme = ggplot2::theme_gray,
           colors = c("#d8b365", "#f5f5f5", "#5ab4ac"),
           lab = TRUE)
```

Here is the correlation between the `y` values of all 13 variables. Again, aside from `away`, all the variables are highly correlated with each other.

```
# correlation between Y-values
ggcorrplot(corr_y, hc.order = TRUE,
           type = "lower",
           outline.color = "white",
           ggtheme = ggplot2::theme_gray,
           colors = c("#ef8a62", "#f7f7f7", "#67a9cf"),
           lab = TRUE)
```

At this point, the **boxplots** show us variables with *similar median* and *identical mean*; the **ridgelines** begin to show us that some variables have different distributions. And the **correlation matrix** suggests the variables are more similar than not.

To really see their differences, we'll need to use `facet_wrap`.

Here we'll use `facet_wrap` to examine the histograms of the `x` and `y` values for all 13 variables. We started to see the differences in distribution between variables from the `ridgeline` plots, but overlapping histograms provide another perspective.

```
# faceted histogram (both axes)
datasaurus_dozen %>%
  group_by(dataset) %>%
  ggplot() +
  geom_histogram(aes(x = x, fill = 'red'), alpha = 0.5, bins = 30) +
  geom_histogram(aes(x = y, fill = 'green'), alpha = 0.5, bins = 30) +
  facet_wrap(~dataset) +
  scale_fill_discrete(labels = c('y', 'x')) +
  theme_classic() +
  labs(
    fill = 'Axes',
    x = '',
    y = 'Count',
    title = 'Faceted Histogram: x- and y-values'
  )
```

However, if there's one thing this dataset is trying to communicate, it's that there's no substitute for plotting the actual data points. No amount of summary statistics, central tendency, or distribution plotting can replace **plotting the actual data points**.

Once we create the scatter plot with `geom_point`, we see this dataset's big reveal: despite similar central measures, mostly similar distributions, and high correlations, the 13 variables are **wildly different** from each other.

```
datasaurus_dozen %>%
  group_by(dataset) %>%
  ggplot(aes(x = x, y = y, color = dataset)) +
  geom_point(alpha = 0.5) +
  facet_wrap(~dataset) +
  scale_color_manual(values = c('#a6cee3', '#1f78b4', '#b2df8a', '#33a02c', '#fb9a99', '#e31a1c', '#fdbf6f', '#ff7f00', '#cab2d6', '#6a3d9a', '#ffff99', '#b15928', 'grey')) +
  theme_classic() +
  theme(legend.position = "none") +
  labs(
    x = 'X-axis',
    y = 'Y-axis',
    title = 'Faceted Scatter Plot'
  )
```

There are other less common alternatives to the **scatter plot**.

While not as clear as the **scatter plot**, plotting the **contours** of a 2D density estimate does show how very different the variables are from each other, despite similar summary statistics.

```
# contours of a 2D density estimate
datasaurus_dozen %>%
  ggplot(aes(x = x, y = y)) +
  geom_density_2d() +
  theme_classic() +
  facet_wrap(~dataset) +
  labs(
    x = 'X-axis',
    y = 'Y-axis',
    title = 'Contours of a 2D density estimate'
  )
```

This is a slight variation using `stat_density_2d`:

```
# stat density 2d
datasaurus_dozen %>%
  ggplot(aes(x = x, y = y)) +
  stat_density_2d(aes(fill = y), geom = "polygon", colour = 'white') +
  theme_classic() +
  facet_wrap(~dataset) +
  labs(
    x = 'X-axis',
    y = 'Y-axis',
    title = 'Stat Density 2D estimate'
  )
```

The `density_2d` plots are quite effective in showing how different the variables are, and they serve as a nice alternative to the more familiar scatter plot.

Hopefully this vignette illustrates the importance of never trusting summary statistics alone. Moreover, when visualizing, we should go beyond the data's distribution and central tendency and plot the actual data points.

The first thing to note is that `matrices` are represented as `lists` of `lists`, which is made explicit with type annotation:

```
from typing import List
Matrix = List[List[float]]
```

You might be wondering whether a `list` of `lists` is somehow different from the `list` of `vectors` we saw previously with the `vector_sum` function. To find out, I used **type annotation** to try to define the arguments *differently*.

Here's the `vector_sum` function we defined previously:

```
def vector_sum(vectors: List[Vector]) -> Vector:
    """Sum all corresponding elements (componentwise sum)"""
    # Check that vectors is not empty
    assert vectors, "no vectors provided!"
    # Check that the vectors are all the same size
    num_elements = len(vectors[0])
    assert all(len(v) == num_elements for v in vectors), "different sizes!"
    # the i-th element of the result is the sum of every vector[i]
    return [sum(vector[i] for vector in vectors)
            for i in range(num_elements)]

assert vector_sum([[1,2], [3,4], [5,6], [7,8]]) == [16,20]
```

Here's a **new** function, `vector_sum2`, defined differently with **type annotation**:

```
def vector_sum2(lists: List[List[float]]) -> List:
    """Sum all corresponding lists (componentwise sum?)"""
    assert lists, "this list is empty!"
    # check that the lists are all the same size
    num_elements = len(lists[0])
    assert all(len(l) == num_elements for l in lists), "different sizes!"
    # the i-th element of the result is the sum of every list[i]
    return [sum(l[i] for l in lists)
            for i in range(num_elements)]

assert vector_sum2([[1,2], [3,4], [5,6], [7,8]]) == [16,20]
```

I did a variety of things to see if `vector_sum` and `vector_sum2` behaved differently, but they appear to be identical:

```
# both are functions
assert callable(vector_sum) == True
assert callable(vector_sum2) == True
# when taking the same argument, they both return a list
type(vector_sum([[1,2], [3,4], [5,6], [7,8]])) #list
type(vector_sum2([[1,2], [3,4], [5,6], [7,8]])) #list
# the same input yields the same output
vector_sum([[1,2],[3,4]]) # [4,6]
vector_sum2([[1,2],[3,4]]) # [4,6]
```

To keep it simple, in the context of **matrices**, you can think of **vectors** as the *rows of the matrix*.

For example, if we represent the small dataset below as a **matrix**, we can think of *columns* as variables like: height, weight, age; and *each row* as a person:

```
sample_data = [[70, 170, 40],
[65, 120, 26],
[77, 250, 19]]
```

By extension of **rows** and **columns**, we can write a function for the shape of a matrix. The `shape` function below takes a matrix and returns a `tuple` of two integers: the number of rows and the number of columns:

```
from typing import Tuple

def shape(A: Matrix) -> Tuple[int, int]:
    """Returns (# of rows of A, # of columns of A)"""
    num_rows = len(A)
    num_cols = len(A[0]) if A else 0   # number of elements in first row
    return num_rows, num_cols

assert shape([[1,2,3], [4,5,6]]) == (2,3)   # 2 rows, 3 columns
assert shape(sample_data) == (3,3)
```

We can also write functions to grab a specific *row* or a specific *column*:

```
Vector = List[float]

# rows
def get_row(A: Matrix, i: int) -> Vector:
    """Returns the i-th row of A (as a Vector)"""
    return A[i]             # A[i] is already the ith row

# columns
def get_column(A: Matrix, j: int) -> Vector:
    """Returns the j-th column of A (as a Vector)"""
    return [A_i[j]          # jth element of row A_i
            for A_i in A]   # for each row A_i
```
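A quick check of both helpers on a small matrix (the definitions are repeated here so the snippet stands alone):

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]

def get_row(A: Matrix, i: int) -> Vector:
    """Returns the i-th row of A (as a Vector)"""
    return A[i]

def get_column(A: Matrix, j: int) -> Vector:
    """Returns the j-th column of A (as a Vector)"""
    return [A_i[j] for A_i in A]

M = [[1, 2, 3],
     [4, 5, 6]]

assert get_row(M, 1) == [4, 5, 6]   # second row
assert get_column(M, 0) == [1, 4]   # first column
```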

Now, going beyond finding the shape, rows and columns of an existing matrix, we'll also want to **create** matrices and we'll do that using **nested list comprehensions**:

```
from typing import Callable

def make_matrix(num_rows: int,
                num_cols: int,
                entry_fn: Callable[[int, int], float]) -> Matrix:
    """
    Returns a num_rows x num_cols matrix
    whose (i,j)-th entry is entry_fn(i, j)
    """
    return [[entry_fn(i, j)             # given i, create a list
             for j in range(num_cols)]  # [entry_fn(i, 0), ...]
            for i in range(num_rows)]   # create one list for each i
```

Then we'll actually *use* the `make_matrix` function to create a special type of matrix called the `identity matrix`:

```
def identity_matrix(n: int) -> Matrix:
    """Returns the n x n identity matrix"""
    return make_matrix(n, n, lambda i, j: 1 if i == j else 0)

assert identity_matrix(5) == [[1, 0, 0, 0, 0],
                              [0, 1, 0, 0, 0],
                              [0, 0, 1, 0, 0],
                              [0, 0, 0, 1, 0],
                              [0, 0, 0, 0, 1]]
```

To be sure, there are other types of matrices, but in this chapter we're only briefly exploring their construction to prime us for later chapters.

We know matrices can be used to represent data, with each *row* in the dataset being a **vector**. Because a matrix also has *columns*, we can use one to represent a linear function that **maps k-dimensional vectors to n-dimensional vectors**.

Finally, matrices can also be used to map *binary relationships*.

On our first day at DataScienster we were given `friendship_pairs` data:

```
friendship_pairs = [(0,1), (0,2), (1,2), (1,3), (2,3), (3,4),
(4,5), (5,6), (5,7), (6,8), (7,8), (8,9)]
```

These `friendship_pairs` can also be represented in matrix form:

```
# user 0 1 2 3 4 5 6 7 8 9
friend_matrix = [[0, 1, 1, 0, 0, 0, 0, 0, 0, 0], # user 0
[1, 0, 1, 1, 0, 0, 0, 0, 0, 0], # user 1
[1, 1, 0, 1, 0, 0, 0, 0, 0, 0], # user 2
[0, 1, 1, 0, 1, 0, 0, 0, 0, 0], # user 3
[0, 0, 0, 1, 0, 1, 0, 0, 0, 0], # user 4
[0, 0, 0, 0, 1, 0, 1, 1, 0, 0], # user 5
[0, 0, 0, 0, 0, 1, 0, 0, 1, 0], # user 6
[0, 0, 0, 0, 0, 1, 0, 0, 1, 0], # user 7
[0, 0, 0, 0, 0, 0, 1, 1, 0, 1], # user 8
[0, 0, 0, 0, 0, 0, 0, 0, 1, 0]] # user 9
```

This allows us to check very quickly whether two users are friends or not:

```
assert friend_matrix[0][2] == 1, "0 and 2 are friends"
assert friend_matrix[0][8] == 0, "0 and 8 are not friends"
```

And if we wanted to check each user's friends, we could:

```
# checking the friends of the user at index five (Clive)
friends_of_five = [i
                   for i, is_friend in enumerate(friend_matrix[5])
                   if is_friend]

# checking the friends of the user at index zero (Hero)
friends_of_zero = [i
                   for i, is_friend in enumerate(friend_matrix[0])
                   if is_friend]

assert friends_of_five == [4, 6, 7]
assert friends_of_zero == [1, 2]
```
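Since each row of a friend matrix is a 0/1 vector, summing a row gives that user's number of friends. A small hypothetical 4-user matrix (not the network above) shows the idea:

```python
# hypothetical 4-user friend matrix (symmetric, zero diagonal)
small_matrix = [[0, 1, 1, 0],
                [1, 0, 0, 1],
                [1, 0, 0, 0],
                [0, 1, 0, 0]]

# each user's friend count is just that row's sum
num_friends = [sum(row) for row in small_matrix]
print(num_friends)  # [2, 2, 1, 1]
```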

For more content on data science, machine learning, R, Python, SQL and more, find me on Twitter.

We'll see the **from scratch** aspect of the book play out as we implement several building-block functions that work towards defining the *Euclidean distance* in code:

**note**: This is chapter 4, Linear Algebra, of Data Science from Scratch by Joel Grus.

While we don't see its application immediately, we can expect to see the **Euclidean distance** used for k-nearest neighbors (classification) or k-means (clustering) to find the "k closest points" (source). (*note*: there are other types of distance formulas as well.)

En route towards implementing the **Euclidean Distance**, we also implement the **sum of squares** which is a crucial piece for how **regression** works.

Thus, the **from scratch** aspect of this book works on two levels. *Within* this chapter, we're building piece by piece up to an important **distance** and **sum of squares** formula. But we're also building tools we'll use in subsequent chapters.

We start off with implementing functions to **add** and **subtract** two vectors. We also create a function for *component wise sum* of a list of vectors, where a new vector is created whose first element is the sum of all the first elements in the list and so on.

We then create a function to **multiply** a vector by scalar, which we use to compute the *component wise mean* of a list of vectors.

We also create the **dot product** of two vectors, the *sum of their componentwise products*, which is the generalized version of the **sum of squares**. At this point, we have enough to implement the **Euclidean distance**. Let's take a look at the code:

Vectors are simply a list of numbers:

```
height_weight_age = [70,170,40]
grades = [95,80,75,62]
```

You'll *note* that we do **type annotation** on our code throughout. This is a convention advocated by the author (and as a newcomer to Python, I like the idea of being explicit about data type for a function's input and output).

```
from typing import List

Vector = List[float]

def add(v: Vector, w: Vector) -> Vector:
    """Adds corresponding elements"""
    assert len(v) == len(w), "vectors must be the same length"
    return [v_i + w_i for v_i, w_i in zip(v, w)]

assert add([1,2,3], [4,5,6]) == [5,7,9]
```

Here's another view of what's going on with the `add` function:

```
def subtract(v: Vector, w: Vector) -> Vector:
    """Subtracts corresponding elements"""
    assert len(v) == len(w), "vectors must be the same length"
    return [v_i - w_i for v_i, w_i in zip(v, w)]

assert subtract([5,7,9], [4,5,6]) == [1,2,3]
```

This is pretty much the same as the previous:

```
def vector_sum(vectors: List[Vector]) -> Vector:
    """Sum all corresponding elements (componentwise sum)"""
    # Check that vectors is not empty
    assert vectors, "no vectors provided!"
    # Check that the vectors are all the same size
    num_elements = len(vectors[0])
    assert all(len(v) == num_elements for v in vectors), "different sizes!"
    # the i-th element of the result is the sum of every vector[i]
    return [sum(vector[i] for vector in vectors)
            for i in range(num_elements)]

assert vector_sum([[1,2], [3,4], [5,6], [7,8]]) == [16,20]
```

Here, a `list` of vectors becomes *one* vector. If you go back to the `add` function, it takes **two** vectors, so if we tried to give it four vectors, we'd get a `TypeError`. Instead, we wrap the four vectors in a `list` and pass *that* as the argument to `vector_sum`:

```
def scalar_multiply(c: float, v: Vector) -> Vector:
    """Multiplies every element by c"""
    return [c * v_i for v_i in v]

assert scalar_multiply(2, [2,4,6]) == [4,8,12]
```

One number is multiplied with *all* numbers in the vector, with the vector retaining its length:

This is similar to componentwise sum (see above); a list of vectors becomes one vector.

```
def vector_mean(vectors: List[Vector]) -> Vector:
    """Computes the element-wise average"""
    n = len(vectors)
    return scalar_multiply(1/n, vector_sum(vectors))

assert vector_mean([ [1,2], [3,4], [5,6] ]) == [3,4]
```

```
def dot(v: Vector, w: Vector) -> float:
    """Computes v_1 * w_1 + ... + v_n * w_n"""
    assert len(v) == len(w), "vectors must be the same length"
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

assert dot([1,2,3], [4,5,6]) == 32
```

Here we multiply the elements, then sum the results. Two vectors become a single number (a `float`):

```
def sum_of_squares(v: Vector) -> float:
    """Returns v_1 * v_1 + ... + v_n * v_n"""
    return dot(v, v)

assert sum_of_squares([1,2,3]) == 14
```

In fact, `sum_of_squares` is a special case of the **dot product**:

```
import math

def magnitude(v: Vector) -> float:
    """Returns the magnitude (or length) of v"""
    return math.sqrt(sum_of_squares(v))   # math.sqrt is the square root function

assert magnitude([3,4]) == 5
```

With `magnitude` we take the square root of the `sum_of_squares`. This is none other than the Pythagorean theorem.

```
def squared_distance(v: Vector, w: Vector) -> float:
    """Computes (v_1 - w_1) ** 2 + ... + (v_n - w_n) ** 2"""
    return sum_of_squares(subtract(v, w))
```

This is the distance *between* two vectors, *squared*.

```
import math

def distance(v: Vector, w: Vector) -> float:
    """Also computes the distance between v and w"""
    return math.sqrt(squared_distance(v, w))
```

Finally, we take the square root of the `squared_distance` to get the (Euclidean) distance:
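As a cross-check, since Python 3.8 the standard library's `math.dist` computes the same Euclidean distance:

```python
import math

# the classic 3-4-5 right triangle
assert math.dist([0, 0], [3, 4]) == 5.0

# agrees with the from-scratch pipeline: sqrt of the sum of squared differences
assert math.dist([1, 2, 3], [4, 6, 3]) == 5.0   # sqrt(9 + 16 + 0)
```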

We literally built from scratch, albeit with some help from Python's `math` module, the building blocks for essential functions we'll use later, namely `sum_of_squares` and `distance`.

It's pretty cool to see these foundational concepts set us up to understand more complex machine learning algorithms like **regression**, **k-nearest neighbors (classification)**, and **k-means (clustering)**, and even touch on the **Pythagorean theorem**.

We'll examine matrices next.

For more content on data science, machine learning, R, Python, SQL and more, find me on Twitter.

A couple of days back I wrote a post summarizing how much collections and comprehensions were used. Data was provided in the form of `lists`, either lists of `dictionaries` or lists of `tuples`. And to answer questions *about* the data, the author often used `list comprehensions` - iterating through lists with a for-loop. I am beginning to see this as a very Python-centric way of approaching problems.

While **not** all data is tabular, so much of it *is* that it's reasonable to assume that, more often than not, you'll be dealing with spreadsheet-like tabular data (**note**: I'm open to other perspectives here; feel free to leave a comment below!).

In any case, I had this **itch** to go back to that chapter and ask:

How would I approach the same problem using data frames?

So that's what this post is about.

You can reference these previous posts for context; also keep in mind, this is a slight deviation from Joel Grus' book (for example, I'll be using pandas and a jupyter notebook here, both of which are not covered in the book).

For review, here's the data you are given as a newly hired data scientist at DataScienster:

```
# users in the network
# stored as a list of dictionaries
users = [
{"id": 0, "name": "Hero"},
{"id": 1, "name": "Dunn"},
{"id": 2, "name": "Sue"},
{"id": 3, "name": "Chi"},
{"id": 4, "name": "Thor"},
{"id": 5, "name": "Clive"},
{"id": 6, "name": "Hicks"},
{"id": 7, "name": "Devin"},
{"id": 8, "name": "Kate"},
{"id": 9, "name": "Klein"}
]
# friendship pairings in the network
# stored as a list of tuples
friendship_pairs = [(0,1), (0,2), (1,2), (1,3), (2,3), (3,4),
(4,5), (5,6), (5,7), (6,8), (7,8), (8,9)]
# interests data
# stored as another list of tuples
interests = [
(0, "Hadoop"), (0, "Big Data"), (0, "HBase"), (0, "Java"),
(0, "Spark"), (0, "Storm"), (0, "Cassandra"),
(1, "NoSQL"), (1, "MongoDB"), (1, "Cassandra"), (1, "HBase"),
(1, "Postgres"), (2, "Python"), (2, "scikit-learn"), (2, "scipy"),
(2, "numpy"), (2, "statsmodels"), (2, "pandas"), (3, "R"), (3, "Python"),
(3, "statistics"), (3, "regression"), (3, "probability"),
(4, "machine learning"), (4, "regression"), (4, "decision trees"),
(4, "libsvm"), (5, "Python"), (5, "R"), (5, "Java"), (5, "C++"),
(5, "Haskell"), (5, "programming langauges"), (6, "statistics"),
(6, "probability"), (6, "mathematics"), (6, "theory"),
(7, "machine learning"), (7, "scikit-learn"), (7, "Mahout"),
(7, "neural networks"), (8, "neural networks"), (8, "deep learning"),
(8, "Big Data"), (8, "artificial intelligence"), (9, "Hadoop"),
(9, "Java"), (9, "MapReduce"), (9, "Big Data")
]
```

Given just these pieces of data, we can create **functions**, use **for-loops** and **list comprehensions** to answer some questions like:

- Who are each user friends with?
- What are the total and average number of connections?
- Which users share the same interest?
- What are the most popular topics in this network?

However, the chapter ends with lists, functions and comprehension. What about **storing data in data frames?**

First we'll store `users` as a data frame:

```
import pandas as pd
# convert list of dict into dataframe
users_df = pd.DataFrame(users)
users_df
```

Just visually, a `data frame` looks different from a `list of dictionaries`:

Your mileage may vary, but *I make sense of the data* very differently when I'm looking at a list vs. a data frame. **Rows and columns** are ingrained in how I think about data.

Next, we're given a `list of tuples` representing friendship pairs, which we turn into a `dictionary` using a `dictionary comprehension`:

```
# list of tuples
friendship_pairs = [(0,1), (0,2), (1,2), (1,3), (2,3), (3,4),
                    (4,5), (5,6), (5,7), (6,8), (7,8), (8,9)]

# create a dict where keys are user ids and values are lists of friend ids
# (dictionary comprehension)
friendships = {user["id"]: [] for user in users}

for i, j in friendship_pairs:
    friendships[i].append(j)
    friendships[j].append(i)
```
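The post doesn't show how `friendships_df` itself was created. One plausible way (an assumption on my part, not necessarily the author's code) is to build it straight from the `friendships` dict, with one column per friend slot:

```python
import pandas as pd

# same friendships dict as above, rebuilt here so the snippet stands alone
friendship_pairs = [(0,1), (0,2), (1,2), (1,3), (2,3), (3,4),
                    (4,5), (5,6), (5,7), (6,8), (7,8), (8,9)]
friendships = {i: [] for i in range(10)}
for i, j in friendship_pairs:
    friendships[i].append(j)
    friendships[j].append(i)

# orient='index' makes each user id a row; shorter friend lists are padded with NaN
friendships_df = pd.DataFrame.from_dict(friendships, orient='index')
print(friendships_df.shape)  # (10, 3) -- no one has more than 3 direct connections
```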

Similar to the previous example, I find that viewing the data as a `data frame` is *different* from viewing it as a `dictionary`.

From this point, I'm doing several operations in pandas to **join** the first two tables, such that I have a column with the user's id, the user's name, and the ids of their first, second and, in some cases, third friends (people in this network have at most 3 direct connections).

If you want to know the specific pandas operations, here's the code:

```
# The users_df is fine as is with two columns: id and name (see above)
# We'll transform friendships_df
# reset_index() moves the id index into a regular column (named 'index')
friendships_df.reset_index(inplace=True)
# rename that column to 'id' so we can join on it
friendships_df = friendships_df.rename(columns={'index': 'id'})
# join with users_df so we get each person's name
users_friendships = pd.merge(users_df, friendships_df, on='id')
```
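One step the code above glosses over is where `friendships_df` came from in the first place. The post doesn't show it, but a plausible construction from the `friendships` dict (one row per user, one column per friend slot, padded with NaN) might look like this:

```python
import pandas as pd

# the friendships dict built earlier (keys: user id, values: friend ids)
friendships = {
    0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 3], 3: [1, 2, 4], 4: [3, 5],
    5: [4, 6, 7], 6: [5, 8], 7: [5, 8], 8: [6, 7, 9], 9: [8],
}

# each dict entry becomes a row; ragged friend lists are padded with NaN
friendships_df = pd.DataFrame.from_dict(friendships, orient="index")
friendships_df.reset_index(inplace=True)
friendships_df = friendships_df.rename(columns={"index": "id"})
print(friendships_df.head())
```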

Once we've joined `users_df` and `friendships_df`, we have:

Since we have `users` and `friendships` data, we could write a function to help us answer "how many friends does each user have?". In addition, we'll use a `list comprehension` to loop through each `user` within `users`:

```
# function to count how many friends each user has
def number_of_friends(user):
    """How many friends does _user_ have?"""
    user_id = user["id"]
    friend_ids = friendships[user_id]
    return len(friend_ids)

# list comprehension to apply the function to each user
num_friends_by_id = [(user["id"], number_of_friends(user)) for user in users]
# this gives us a list of tuples
num_friends_by_id
[(0, 2),
 (1, 3),
 (2, 3),
 (3, 3),
 (4, 2),
 (5, 3),
 (6, 2),
 (7, 2),
 (8, 3),
 (9, 1)]
```

Again, viewing the data as a `list of tuples` is different from a `data frame`, so let's go ahead and turn that into a pandas data frame:

```
# when converting to data frame, we can set the name of the columns to id and num_friends; this sets us up for another join
num_friends_by_id = pd.DataFrame(num_friends_by_id, columns = ['id', 'num_friends'])
```

Because we have an 'id' column, we can join this with our previously created `users_friendships` data frame:
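The post shows this join as a screenshot; here's a self-contained sketch of the merge (with `users_friendships` trimmed to just `id` and `name` for brevity):

```python
import pandas as pd

# trimmed stand-in for the users_friendships frame built earlier
users_friendships = pd.DataFrame({
    "id": range(10),
    "name": ["Hero", "Dunn", "Sue", "Chi", "Thor",
             "Clive", "Hicks", "Devin", "Kate", "Klein"],
})
num_friends_by_id = pd.DataFrame(
    [(0, 2), (1, 3), (2, 3), (3, 3), (4, 2),
     (5, 3), (6, 2), (7, 2), (8, 3), (9, 1)],
    columns=["id", "num_friends"],
)
# join on the shared 'id' column
users_friendships2 = pd.merge(users_friendships, num_friends_by_id, on="id")
```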

Once joined with `users_friendships` using the `merge` function, we get (`users_friendships2`):

By now you're familiar with the process. We have a Python **collection**, generally a `list` of `dictionaries` or `tuples`, and we want to convert it to a `data frame`.

We'll repeat this process for the `interests` variable, which is a long `list of tuples` (see above). We'll convert it to a data frame, then join with `users_friendships2` to get a longer data frame with `interests` as one of the columns (*note*: picture is cut for space):

The nice thing about **pandas** is that once you have all your data **joined** together in a data frame, you can **query** the data.

For example, I may want to see all users who have an interest in "Big Data":
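The post shows this query as a screenshot. A minimal sketch of the idea, using a trimmed stand-in for the `user_friendship_topics` frame (the rows here are illustrative):

```python
import pandas as pd

# trimmed stand-in for the user_friendship_topics frame (id, name, topic)
user_friendship_topics = pd.DataFrame({
    "id":    [0, 0, 8, 9, 9],
    "name":  ["Hero", "Hero", "Kate", "Klein", "Klein"],
    "topic": ["Hadoop", "Big Data", "Big Data", "Hadoop", "Big Data"],
})
# boolean filter: keep only rows whose topic is "Big Data"
big_data_fans = user_friendship_topics[user_friendship_topics["topic"] == "Big Data"]
print(big_data_fans["name"].tolist())
```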

Previously, we would have had to create a function that returns a `list comprehension`:

```
def data_scientists_who_like(target_interest):
    """Find the ids of all users who like the target interest."""
    return [user_id
            for user_id, user_interest in interests
            if user_interest == target_interest]

data_scientists_who_like("Big Data")
```

The data frame has other advantages: you can also query columns on multiple conditions. Here are two ways to query multiple topics:

```
# Option One: Use .query()
user_friendship_topics.query('topic == "machine learning" | topic == "regression" | topic == "decision trees" | topic == "libsvm"')
# Option Two: Use .isin()
user_friendship_topics[user_friendship_topics['topic'].isin(["machine learning", "regression", "decision trees", "libsvm"])]
```

Both options return this data frame:

By querying the data frame, we learned:

- all users interested in these four topics
- users that have interests in common with Thor
- (if needed) the `num_friends` that each user has

You can also find out the most popular topics within this network:

```
# groupby topic, tally(count), then reset_index(), then sort
user_friendship_topics.groupby(['topic']).count().reset_index().sort_values('id', ascending=False)
```

You can even `groupby` two columns (name & topic) to see each user's topics of interest:

```
user_friendship_topics.groupby(['name', 'topic']).count()
```

Hopefully you're convinced that **data frames** are a powerful supplement to the more familiar operations in Python like **for-loops** and **list comprehensions**, and that both are worth knowing well so you can manipulate data in a variety of formats (e.g., for JSON data, Python dictionaries are a better fit than data frames).

Chapter 3 of Data Science from Scratch introduces us to visualizing data using matplotlib. It is widely used in the Python ecosystem, although my sense is that people are *just as happy, if not happier*, to use other libraries like seaborn, Altair and bokeh (**note**: seaborn is built on top of matplotlib).

This chapter is fairly brief and is meant as a quick introduction to matplotlib - to get readers familiar with basic charts. Whole books can be written on **data visualization** alone, so this is meant more as an appetizer, rather than a full-course.

There's a fair amount of detail involved in using matplotlib, so we'll break it down to demystify it.

This chapter goes through the main basic charts, including Line, Bar, Histograms, and Scatter Plots. **At first glance**, they follow a similar pattern. Data is provided as a `list` of numbers (usually more than one list). `pyplot` is imported from `matplotlib` as `plt`. The `plt` module has several **functions** which are called to create the plot.

Here's an example line chart visualizing growth in GDP over time:

Here's the code:

```
from matplotlib import pyplot as plt
# the data
years = [1950, 1960, 1970, 1980, 1990, 2000, 2010]
gdp = [300.2, 543.3, 1075.9, 2862.5, 5979.6, 10289.7, 14958.3]
# the plot
plt.plot(years, gdp, color="green", marker='o', linestyle='solid')
plt.title("Nominal GDP")
plt.ylabel("Billions of $")
plt.xlabel("Years")
plt.show()
```

You can somewhat get by with just knowing this. Briefly consulting the documentation will let you see some other *chart types* like so:

Let's say we want to convert our **line chart** into a **stacked area chart**; we only need to change one line:

```
from matplotlib import pyplot as plt
years = [1950, 1960, 1970, 1980, 1990, 2000, 2010]
gdp = [300.2, 543.3, 1075.9, 2862.5, 5979.6, 10289.7, 14958.3]
plt.stackplot(years, gdp, color="green") # this is the only line we changed
plt.title("Nominal GDP")
plt.ylabel("Billions of $")
plt.xlabel("Years")
plt.show()
```

Here's what the **stacked area chart** version of the previous graph looks like:

To keep things simple, we can change the chart type with just one line; we just need to remember that, when converting from chart to chart, we have to be mindful of the parameters each chart type takes. For example, a **stacked area chart** takes different parameters than a **line chart** (you'll get an `AttributeError` if you try to use `marker` in a stacked area chart).

Here's an example bar chart comparing movies by the number of Academy awards they've won:

Here's a **stem plot** version:

As with the previous example, changing just one **function** call, from `plt.bar` to `plt.stem`, gives us a different plot:

```
# ---- Original Bar Chart ---- #
movies = ["Annie Hall", "Ben-Hur", "Casablanca", "Gandhi", "West Side Story"]
num_oscars = [5, 11, 3, 8, 10]
plt.bar(range(len(movies)), num_oscars)
plt.title("My Favorite Movies")
plt.ylabel("# of Academy Awards")
plt.xticks(range(len(movies)), movies)
plt.show()

# ---- Stem Chart ---- #
movies = ["Annie Hall", "Ben-Hur", "Casablanca", "Gandhi", "West Side Story"]
num_oscars = [5, 11, 3, 8, 10]
plt.stem(range(len(movies)), num_oscars)  # the only change
plt.title("My Favorite Movies")
plt.ylabel("# of Academy Awards")
plt.xticks(range(len(movies)), movies)
plt.show()
```

I'm all for keeping **matplotlib** as simple as possible but one thing the above examples gloss over is the **matplotlib object hierarchy**, which is something worth understanding to get a feel for how the various functions operate.

This next figure is borrowed from Real Python and it nicely highlights the hierarchy inherent in *every* plot:

You'll note the levels: Figure, Axes and Axis. When digging into matplotlib documentation on axes, these levels are brought to the foreground.

To really see this in action, we'll need to code our plot *slightly* differently. The last chart in this chapter examines the **bias-variance tradeoff**, something we'll learn more about in future chapters; it highlights the trade-off in trying to simultaneously minimize two sources of error so that our algorithm generalizes to new situations.

Here's the code:

```
# BOOK version
variance = [1, 2, 4, 8, 16, 32, 64, 128, 256]
bias_squared = [256, 128, 64, 32, 16, 8, 4, 2, 1]
total_error = [x + y for x, y in zip(variance, bias_squared)]
xs = [i for i, _ in enumerate(variance)]
plt.plot(xs, variance, 'g-', label='variance')
plt.plot(xs, bias_squared, 'r-', label='bias^2')
plt.plot(xs, total_error, 'b:', label='total error')
plt.legend(loc=9)
plt.xlabel("model complexity")
plt.xticks([])
plt.title("The Bias-Variance Tradeoff")
plt.show()

# ALTERNATE version, using the Figure/Axes object hierarchy
variance = [1, 2, 4, 8, 16, 32, 64, 128, 256]
bias_squared = [256, 128, 64, 32, 16, 8, 4, 2, 1]
total_error = [x + y for x, y in zip(variance, bias_squared)]
xs = [i for i, _ in enumerate(variance)]
fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(xs, variance, 'g-', label='variance')
ax.plot(xs, bias_squared, 'r-', label='bias^2')
ax.plot(xs, total_error, 'b:', label='total error')
ax.legend(loc='upper center')
ax.set_xlabel("model complexity")
ax.set_title("The Bias-Variance Tradeoff: Alt Version")
fig.tight_layout()
fig.show()
```

Instead of using the `plt` module directly, we use `fig` and `ax`; here are their data types:

```
type(fig) # matplotlib.figure.Figure
type(ax) # matplotlib.axes._subplots.AxesSubplot
```

This makes the **matplotlib object hierarchy** *explicit*, particularly as we see how functions are accessed at the `axes._subplots.AxesSubplot` level (the documentation has much more detail).

Here's the chart:

In summary, we learned that matplotlib *can* be fairly simple to use for static, simple plots, but we're better served having *some* understanding of **matplotlib's object hierarchy**. We'll examine more chart types as we proceed with the rest of the chapters.

For more content on data science, machine learning, R, Python, SQL and more, find me on Twitter.

What stands out is the use of collections and comprehensions. We'll see this trend continue as data is given to us in the form of a `list` of `dict` or `tuples`.

Often, we're manipulating the data to make it faster and more efficient to iterate through. A tool that comes up quite often is **defaultdict**, used to initialize an empty `list`, followed by `list comprehensions` to iterate through the data.

Indeed, either we're seeing how the author, specifically, approaches problems, or how problems are approached in Python in general.

What I'm keeping in mind is that there is more than one way to approach data science problems, and this is *one* of them.

With that said, let's pick up where the previous post left off.

We have a sense of the *total number of connections* and a sorting of the *most connected* individuals. Now, we may want to design a "people you may know" suggester.

Quick recap: here's what the friendship dictionary looks like:

```
friendships
{
0: [1, 2],
1: [0, 2, 3],
2: [0, 1, 3],
3: [1, 2, 4],
4: [3, 5],
5: [4, 6, 7],
6: [5, 8],
7: [5, 8],
8: [6, 7, 9],
9: [8]
}
```

Again, the first step is to *iterate* over friends and collect the friends of friends. The following function returns a **list comprehension**: for a given user, it looks up that user's friends and then, for each friend, grabs the ids of *their* friends (foaf is short for "friend of a friend"). It's "bad" because the result can include the user's own id and duplicate ids.

We'll break it down in code below this function:

```
def foaf_ids_bad(user):
    """foaf is short for 'friend of a friend'"""
    return [foaf_id
            for friend_id in friendships[user["id"]]
            for foaf_id in friendships[friend_id]]

# Let's take Hero, to see Hero's friends
# we'll call the first key of the friendships dict
# Hero has two friends, with ids 1 and 2
friendships[0]  # [1, 2]
# then we'll loop over *each* of those friends' friend lists
friendships[1]  # [0, 2, 3]
friendships[2]  # [0, 1, 3]
# assert that the function works (note Hero's own id, 0, appears twice)
assert foaf_ids_bad(users[0]) == [0, 2, 3, 0, 1, 3]
```

To improve on this, we'll use a `Counter`, which we learned is a `dict` subclass. The function `friends_of_friends(user)` counts mutual friends while excluding the user and their direct friends:

```
from collections import Counter

def friends_of_friends(user):
    user_id = user["id"]
    return Counter(
        foaf_id
        for friend_id in friendships[user_id]    # for each of my friends,
        for foaf_id in friendships[friend_id]    # find their friends
        if foaf_id != user_id                    # who aren't me
        and foaf_id not in friendships[user_id]  # and aren't my friends
    )

# let's look at Hero: he has two friends in common with Chi (id 3),
# who is neither Hero nor one of his direct friends
friends_of_friends(users[0])  # Counter({3: 2})
```

In addition to friendship data, we also have **interest** data. Here we see a `list` of `tuples`, each containing a user_id and a string representing a specific interest:

```
interests = [
(0, "Hadoop"), (0, "Big Data"), (0, "HBase"), (0, "Java"),
(0, "Spark"), (0, "Storm"), (0, "Cassandra"),
(1, "NoSQL"), (1, "MongoDB"), (1, "Cassandra"), (1, "HBase"),
(1, "Postgres"), (2, "Python"), (2, "scikit-learn"), (2, "scipy"),
(2, "numpy"), (2, "statsmodels"), (2, "pandas"), (3, "R"), (3, "Python"),
(3, "statistics"), (3, "regression"), (3, "probability"),
(4, "machine learning"), (4, "regression"), (4, "decision trees"),
(4, "libsvm"), (5, "Python"), (5, "R"), (5, "Java"), (5, "C++"),
(5, "Haskell"), (5, "programming langauges"), (6, "statistics"),
(6, "probability"), (6, "mathematics"), (6, "theory"),
(7, "machine learning"), (7, "scikit-learn"), (7, "Mahout"),
(7, "neural networks"), (8, "neural networks"), (8, "deep learning"),
(8, "Big Data"), (8, "artificial intelligence"), (9, "Hadoop"),
(9, "Java"), (9, "MapReduce"), (9, "Big Data")
]
```

The first thing we'll do is find users with a specific interest. This function returns a **list comprehension**: it splits each `tuple` into `user_id` (integer) and `user_interest` (string), then checks whether the `string` in the `tuple` matches the input parameter.

```
def data_scientists_who_like(target_interest):
    """Find the ids of all users who like the target interest."""
    return [user_id
            for user_id, user_interest in interests
            if user_interest == target_interest]

# let's see all user_ids who like "statistics"
data_scientists_who_like("statistics")  # [3, 6]
```

We may also want to count the number of times a specific interest comes up. Here's a function for that: we use a basic for-loop and an if-statement to check whether `user_interest == target_interest`.

```
def num_user_with_interest_in(target_interest):
    interest_count = 0
    for user_id, user_interest in interests:
        if user_interest == target_interest:
            interest_count += 1
    return interest_count
```
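A quick check of the function on a trimmed subset of the `interests` list (for these two topics, the counts come out the same as on the full list):

```python
# a trimmed subset of the interests list above
interests = [(2, "Python"), (3, "Python"), (3, "statistics"),
             (5, "Python"), (6, "statistics")]

def num_user_with_interest_in(target_interest):
    # tally every (user_id, interest) pair that matches
    interest_count = 0
    for user_id, user_interest in interests:
        if user_interest == target_interest:
            interest_count += 1
    return interest_count

print(num_user_with_interest_in("Python"))      # 3
print(num_user_with_interest_in("statistics"))  # 2
```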

A concern is having to examine the whole list of interests for every search. The author proposes building an index from interests to users. Here, a `defaultdict` is imported, then populated with user_ids:

```
from collections import defaultdict

# user_ids matched to a specific interest
user_ids_by_interest = defaultdict(list)
for user_id, interest in interests:
    user_ids_by_interest[interest].append(user_id)

# three users are interested in Python
assert user_ids_by_interest["Python"] == [2, 3, 5]

# list of interests by user_id
interests_by_user_id = defaultdict(list)
for user_id, interest in interests:
    interests_by_user_id[user_id].append(interest)

# check all of Hero's interests
assert interests_by_user_id[0] == ['Hadoop', 'Big Data', 'HBase', 'Java',
                                   'Spark', 'Storm', 'Cassandra']
```

We can find who has the most interests in common with a given user. It looks like Klein (#9) has the most interests in common with Hero (#0). Here we return a `Counter` built from for-loops and an if-statement.

```
def most_common_interests_with(user):
    return Counter(
        interested_user_id
        for interest in interests_by_user_id[user["id"]]
        for interested_user_id in user_ids_by_interest[interest]
        if interested_user_id != user["id"]
    )

# let's check who has the most interests in common with Hero
most_common_interests_with(users[0])  # Counter({9: 3, 1: 2, 8: 1, 5: 1})
```

Finally, we can also find which topics are most popular among the network. Previously, we calculated the number of users interested in a particular topic, but now we want to compare the whole list.

```
words_and_counts = Counter(word
                           for user, interest in interests
                           for word in interest.lower().split())
```
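One way to surface the popular words from that `Counter` is to walk `most_common()` and keep the words seen more than once (shown here with a trimmed `interests` list so the block is self-contained):

```python
from collections import Counter

# trimmed subset of the interests list above
interests = [(0, "Hadoop"), (0, "Big Data"), (8, "Big Data"),
             (8, "deep learning"), (9, "Hadoop"), (9, "Big Data")]

# lowercase each interest and split it into individual words
words_and_counts = Counter(word
                           for user, interest in interests
                           for word in interest.lower().split())

# print each word that appears more than once, most frequent first
for word, count in words_and_counts.most_common():
    if count > 1:
        print(word, count)
```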

We're also given anonymous salary and tenure (years of work experience) data; let's see what we can do with that information. First, we'll find the average salary. Again, we'll start by creating a `defaultdict` of lists, then loop through `salaries_and_tenures`.

```
salaries_and_tenures = [(83000, 8.7), (88000, 8.1),
                        (48000, 0.7), (76000, 6),
                        (69000, 6.5), (76000, 7.5),
                        (60000, 2.5), (83000, 10),
                        (48000, 1.9), (63000, 4.2)]

salary_by_tenure = defaultdict(list)
for salary, tenure in salaries_and_tenures:
    salary_by_tenure[tenure].append(salary)

# find average salary by tenure
average_salary_by_tenure = {
    tenure: sum(salaries) / len(salaries)
    for tenure, salaries in salary_by_tenure.items()
}
```

The problem is that this is not terribly informative: since no two users share the same tenure, each "average" is just one user's salary. `average_salary_by_tenure` tells us nothing new, so our next move is to group similar tenure values together.

First, we'll create the groupings/categories using control flow, then we'll create a `defaultdict` of lists and loop through `salaries_and_tenures` to populate the newly created `salary_by_tenure_bucket`. Finally, we calculate the average.

```
def tenure_bucket(tenure):
    if tenure < 2:
        return "less than two"
    elif tenure < 5:
        return "between two and five"
    else:
        return "more than five"

salary_by_tenure_bucket = defaultdict(list)
for salary, tenure in salaries_and_tenures:
    bucket = tenure_bucket(tenure)
    salary_by_tenure_bucket[bucket].append(salary)

# finally, calculate the average per bucket
average_salary_by_bucket = {
    tenure_bucket: sum(salaries) / len(salaries)
    for tenure_bucket, salaries in salary_by_tenure_bucket.items()
}
```
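The post doesn't show the resulting averages. Running the bucketing end to end (same data and logic as above, condensed so the block stands alone) gives:

```python
from collections import defaultdict

salaries_and_tenures = [(83000, 8.7), (88000, 8.1), (48000, 0.7),
                        (76000, 6), (69000, 6.5), (76000, 7.5),
                        (60000, 2.5), (83000, 10), (48000, 1.9), (63000, 4.2)]

def tenure_bucket(tenure):
    if tenure < 2:
        return "less than two"
    elif tenure < 5:
        return "between two and five"
    else:
        return "more than five"

# group salaries by bucket, then average each group
buckets = defaultdict(list)
for salary, tenure in salaries_and_tenures:
    buckets[tenure_bucket(tenure)].append(salary)

averages = {b: round(sum(s) / len(s), 2) for b, s in buckets.items()}
print(averages)
# {'more than five': 79166.67, 'less than two': 48000.0,
#  'between two and five': 61500.0}
```

As expected, average salary rises with the tenure bucket.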

One thing to **note** is that the "given" data in this hypothetical toy example is either a `list` of `dictionaries` or of `tuples`, which may feel atypical if we're used to working with tabular data in a pandas `DataFrame` or a native `data.frame` in R.

Again, we are reminded of the higher purpose of this book, Data Science from Scratch (by Joel Grus, 2nd ed.): to eschew libraries in favor of plain Python and build everything from the ground up.

Should your goal be to learn how various algorithms work by building them up from scratch, and *in the process* learn how data problems can be solved with python and minimal libraries, this is your book.

Joel Grus does make clear that you would use libraries and frameworks (pandas, scikit-learn, matplotlib, etc.) rather than coded-from-scratch algorithms when working in production environments, and he points out resources for further reading at the end of each chapter.

In the next post, we'll get into visualizing data.

The book opens with a narrative motivating example where you, dear reader, are newly hired to lead data science at *DataSciencester*, a social network exclusively for data scientists.

Joel Grus, the author, explains:

Throughout the book, we'll be learning about data science concepts by solving problems that you encounter at work. Sometimes we'll look at data explicitly supplied by users, sometimes we'll look at data generated through their interactions with the site, and sometimes we'll even look at data from experiments that we'll design...we'll be building our tools from scratch.

You may be wondering why chapter 2 precedes chapter 1. This chapter is meant as a teaser for the rest of the book and its code is not meant for implementation, but I wanted to revisit this first chapter with the Python crash course fresh in our minds to highlight some *frequently* used concepts we can expect to see for the rest of the book.

You've just been hired as "VP of Networking" and are tasked with finding out which data scientist is the most well connected in the DataSciencester network; you're given a data dump 👇. It's a list of users, each with a unique id.

```
users = [
    {"id": 0, "name": "Hero"},
    {"id": 1, "name": "Dunn"},
    {"id": 2, "name": "Sue"},
    {"id": 3, "name": "Chi"},
    {"id": 4, "name": "Thor"},
    {"id": 5, "name": "Clive"},
    {"id": 6, "name": "Hicks"},
    {"id": 7, "name": "Devin"},
    {"id": 8, "name": "Kate"},
    {"id": 9, "name": "Klein"}
]
```

Of **note** here is that the `users` variable is a `list` of `dict` (dictionaries).

Moving along, we also receive "friendship" data. Of **note** here: this is a `list` of `tuples`:

```
friendship_pairs = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4),
                    (4, 5), (5, 6), (5, 7), (6, 8), (7, 8), (8, 9)]
```

I had initially (and erroneously) thought of `list`, `dict` and `tuple` as **data types** (like `int64`, `float64`, `string`).

Rather, they're **collections**, somewhat unique to Python, and more importantly, they *inform the way Pythonistas approach and solve problems*.

You may feel that having "friendship" data in a `list` of `tuples` is not the easiest way to work with data (nor may it be the best way to represent data, but we'll suspend those thoughts for now). Our first task is to convert this `list` of `tuples` into a form that's more workable; the author proposes we turn it into a `dict` where the `keys` are user_ids and the `values` are `lists` of friends.

The argument is that it's faster to look things up in a `dict` than in a `list` of `tuples` (where we'd have to iterate over every `tuple`). Here's how we'd do that:

```
# Initialize the dict with an empty list for each user id
friendships = {user["id"]: [] for user in users}

# Loop over the friendship pairs.
# This grabs the first, then the second integer in each tuple
# and appends each one to the other's list in the friendships dict
for i, j in friendship_pairs:
    friendships[i].append(j)
    friendships[j].append(i)
```

We're *initializing* a `dict` (called `friendships`), then looping over `friendship_pairs` to populate it. This is the outcome:

```
friendships
{
0: [1, 2],
1: [0, 2, 3],
2: [0, 1, 3],
3: [1, 2, 4],
4: [3, 5],
5: [4, 6, 7],
6: [5, 8],
7: [5, 8],
8: [6, 7, 9],
9: [8]
}
```

Each `key` in friendships is matched with a `value` that starts as an empty list, which then gets populated as we loop over `friendship_pairs` and systematically append the user_ids that are paired together.

To understand how the looping happens and, specifically, how each **pair** of user_ids gets connected, I created my own mini toy example. Let's say we're just going to focus on looping through `friendship_pairs` for the user **Hero**, whose id is 0.

```
# we'll set hero to an empty list
hero = []
# for every friendship pair, if the first integer is 0 (Hero's id),
# append the second integer
for x, y in friendship_pairs:
    if x == 0:
        hero.append(y)
# outcome: we can confirm that Hero is connected to Dunn and Sue
hero  # [1, 2]
```

The above gave me better intuition for how this works:

```
for i, j in friendship_pairs:
    friendships[i].append(j)  # Add j as a friend of user i
    friendships[j].append(i)  # Add i as a friend of user j
```

Here are some other questions we may be interested in:

Look at how the problem is solved. What's notable to me is how we first define a function, `number_of_friends(user)`, that returns the number of friends for a particular user. Then `total_connections` is calculated using a **comprehension** (strictly speaking, a generator expression):

```
def number_of_friends(user):
    """How many friends does _user_ have?"""
    user_id = user["id"]
    friend_ids = friendships[user_id]
    return len(friend_ids)

total_connections = sum(number_of_friends(user) for user in users)
```

To be clear, the **comprehension** here is a pattern where a function is applied over a for-loop in one line (technically it's a *generator expression*; wrapping it in `tuple()` materializes the values):

```
# (2, 3, 3, 3, 2, 3, 2, 2, 3, 1)
tuple(number_of_friends(user) for user in users)
# you can double-check by calling the friendships dict and
# counting the number of friends each user has
friendships
{
0: [1, 2],
1: [0, 2, 3],
2: [0, 1, 3],
3: [1, 2, 4],
4: [3, 5],
5: [4, 6, 7],
6: [5, 8],
7: [5, 8],
8: [6, 7, 9],
9: [8]
}
```

This pattern of using a one-line for-loop (aka comprehension) will come up often. If we add up all the connections, we get 24, and to find the average we simply divide by the number of users (10) to get 2.4; this part is straightforward.

To answer this question, again, a **list comprehension** is used. The cool thing is that we reuse a function we had previously created (`number_of_friends(user)`):

```
# Create a list by looping over users, applying a previously defined function
num_friends_by_id = [(user["id"], number_of_friends(user)) for user in users]

# Then sort
num_friends_by_id.sort(                            # Sort the list
    key=lambda id_and_friends: id_and_friends[1],  # by number of friends
    reverse=True)                                  # descending order
```
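Python's sort is *stable*, so users tied at the same number of friends keep their original order. With the `num_friends_by_id` list computed earlier (reproduced here so the block stands alone), the sort comes out as:

```python
num_friends_by_id = [(0, 2), (1, 3), (2, 3), (3, 3), (4, 2),
                     (5, 3), (6, 2), (7, 2), (8, 3), (9, 1)]

# sort by the friend count (second element), highest first;
# ties keep their original (id) order because the sort is stable
num_friends_by_id.sort(key=lambda id_and_friends: id_and_friends[1],
                       reverse=True)
print(num_friends_by_id)
# [(1, 3), (2, 3), (3, 3), (5, 3), (8, 3),
#  (0, 2), (4, 2), (6, 2), (7, 2), (9, 1)]
```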

We have just identified how *central* an individual is to the network, and we can expect to explore **degree centrality** and **networks** more in future chapters, but for the purposes of *this* post, we have identified the central role that **collections** (lists, dictionaries, tuples) as well as **comprehensions** play in Python operations.

In the next post, we'll examine how friendship connections may or may not overlap with interests.

If you'd like a Python crash course, with an eye towards data science, you might check out these posts:

The `random` module is used extensively in data science, particularly when random numbers need to be generated and we want **reproducible** results the next time we run our model (in Python it's `random.seed(x)`; in R it's `set.seed(x)`), where x is any integer we decide (we just need to be consistent when we revisit our model).

Technically, the module produces **deterministic** results, hence it's *pseudo*random. Here's an example to highlight how the randomness is deterministic:

```
import random
random.seed(10)  # say we use 10

# this variable is from the book
four_randoms = [random.random() for _ in range(4)]
# call four_randoms - same result as in Data Science from Scratch,
# because the book also uses random.seed(10)
[0.5714025946899135,
 0.4288890546751146,
 0.5780913011344704,
 0.20609823213950174]

# calling random.random() again (without resetting the seed) continues
# the same pseudorandom stream, so we get four *different* numbers -
# the loop variable name (x vs _) has nothing to do with it
another_four_randoms = [random.random() for x in range(4)]
[0.81332125135732,
 0.8235888725334455,
 0.6534725339011758,
 0.16022955651881965]
```

Reading around other sources suggests that the underscore `_` is used in a for-loop when we don't care about the variable (it's a throwaway) and have no plans to use it, for example:

```
# prints 'hello' five times
for _ in range(5):
    print("hello")

# we could use x as well
for x in range(5):
    print("hello")
```

In the above example, either `_` or `x` could have been used and there isn't much difference. We could technically *call* `_`, but it's considered bad practice:

```
# bad practice, but prints 0, 1, 2, 3, 4
for _ in range(5):
    print(_)
```

Nevertheless, note that in the example below the two comprehensions yield *different* results — not because one uses `_` and the other `x`, but because the generator's internal state has advanced since the seed was set:

```
import random
random.seed(10)
# the second comprehension differs from the first because the
# pseudorandom stream has advanced, not because we swapped _ for x
four_randoms = [random.random() for _ in range(4)]
another_four_randoms = [random.random() for x in range(4)]
```
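To confirm the loop-variable name isn't what matters, reset the seed before each comprehension; the two lists then come out identical:

```python
import random

random.seed(10)
with_underscore = [random.random() for _ in range(4)]

random.seed(10)  # reset to the same seed
with_x = [random.random() for x in range(4)]

# identical: the stream restarts from the same state either way
assert with_underscore == with_x
```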

But back to determinism, or pseudorandomness: we can *change* the seed to `random.seed(11)`, then back to `random.seed(10)`, to see this play out:

```
# new random.seed()
random.seed(11)
# reset four_randoms
four_randoms = [random.random() for _ in range(4)]
[0.4523795535098186,
0.559772386080496,
0.9242105840237294,
0.4656500700997733]
# change to previous random.seed()
random.seed(10)
# reset four_randoms (again)
four_randoms = [random.random() for _ in range(4)]
# get previous result (see above)
[0.5714025946899135,
0.4288890546751146,
0.5780913011344704,
0.20609823213950174]
```

Other features of the `random` module include `random.randrange`, `random.shuffle`, `random.choice` and `random.sample`:

```
random.randrange(3, 6)  # choose randomly from [3, 4, 5]

# random shuffle
one_to_ten = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
random.shuffle(one_to_ten)
print(one_to_ten)  # example: [8, 7, 9, 3, 5, 2, 10, 1, 6, 4]
random.shuffle(one_to_ten)  # again
print(one_to_ten)  # example: [3, 10, 8, 6, 9, 2, 7, 1, 4, 5]

# random choice
list_of_people = ["Bush", "Clinton", "Obama", "Biden", "Trump"]
random.choice(list_of_people)  # first time, 'Clinton'
random.choice(list_of_people)  # second time, 'Biden'

# random sample (without replacement)
lottery_numbers = range(60)  # a range of 60 numbers
winning_numbers = random.sample(lottery_numbers, 6)  # a random sample of 6
winning_numbers  # example: [39, 24, 2, 37, 0, 15]
# sampling again draws a different set of 6 numbers
winning_numbers = random.sample(lottery_numbers, 6)
winning_numbers  # a different set, e.g. [8, 12, 19, 34, 23, 49]
```

Whole books can be written about `regular expressions`, so the author briefly highlights a couple of features that may come in handy: `re.match`, `re.search`, `re.split` and `re.sub`:

```
import re

re_examples = [
    not re.match("a", "cat"),               # 'cat' doesn't *start* with 'a'
    re.search("a", "cat"),                  # but 'cat' does *contain* an 'a'
    not re.search("c", "dog"),              # 'dog' does not contain a 'c'
    3 == len(re.split("[ab]", "carbs")),    # split on a or b -> ['c', 'r', 's']
    "R-D-" == re.sub("[0-9]", "-", "R2D2")  # replace digits in 'R2D2' with "-"
]
# test that all examples are truthy
assert all(re_examples), "all the regex examples should be True"
```

The final line reviews our understanding of testing (`assert`) and truthiness (`all`), applied to our regular expression examples. Pretty neat.

You may be interested in these other topics in your python crash course:

A key concept introduced when discussing the creation of "generators" is using `for` and `in` to **iterate** over generators (like lists), but **lazily, on demand**. This is formally called lazy evaluation, or 'call-by-need', which delays the evaluation of an expression until the value is needed. We can think of this as a form of optimization: avoiding repeated function calls when they're not needed.

Here's a graphic borrowed from Xiaoxu Gao, check out her post here:

We'll create a `generator` (a customized function), but bear in mind it's redundant with `range()`; both illustrate lazy evaluation.

```
# Example 1: create a natural_numbers() function that counts up incrementally
def natural_numbers():
    """returns 1, 2, 3, ..."""
    n = 1
    while True:
        yield n
        n += 1

# check its type
type(natural_numbers())  # generator
# call it, you get: <generator object natural_numbers at 0x7fb4d787b2e0>
natural_numbers()
# the point of lazy evaluation is that it won't do anything
# until you iterate over it (but avoid an infinite loop with a logic break)
for i in natural_numbers():
    print(i)
    if i == 37:
        break
print("exit loop")
# result: 1 ... 37, then "exit loop"
```

Here's another example built on `range`, a built-in Python function that is also lazy. Even when you call this `generator`, it **won't do anything until you iterate over it**.

```
evens_below_30 = (i for i in range(30) if i % 2 == 0)
# check its type - generator
type(evens_below_30)
# call it, you get: <generator object <genexpr> at 0x7fb4d70ef580>
# calling it does nothing
evens_below_30
# now iterate over it with for and in - now it does something
# prints: 0, 2, 4, 6 ... 28
for i in evens_below_30:
    print(i)
```
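One caveat worth adding (my own note, not from the book): a generator can only be consumed once. After the first pass it is exhausted, and iterating again yields nothing.

```python
# a generator expression can only be consumed once
evens = (i for i in range(10) if i % 2 == 0)

first_pass = list(evens)   # drains the generator
second_pass = list(evens)  # nothing left

print(first_pass)   # [0, 2, 4, 6, 8]
print(second_pass)  # []

# next() on an exhausted generator raises StopIteration
try:
    next(evens)
except StopIteration:
    print("generator exhausted")
```

So if you need to iterate over the same values twice, materialize them into a `list` first.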

Finally, this section brings up another important keyword, **enumerate**, for when we want to iterate over a `generator` or `list` and get both `values` and `indices`:

```
# create list of names
names = ['Alice', 'Lebron', 'Kobe', 'Bob', 'Charles', 'Shaq', 'Kenny']

# Pythonic way
for i, name in enumerate(names):
    print(f"index: {i}, name: {name}")

# NOT pythonic
for i in range(len(names)):
    print(f"index: {i}, name: {names[i]}")

# Also NOT pythonic
i = 0
for name in names:
    print(f"index {i} is {names[i]}")
    i += 1
```

In my view, the *pythonic way* is much more readable here.
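As a small extension of the pythonic version (my own example, not from the book): `enumerate` also takes a `start` argument when you want numbering to begin somewhere other than 0.

```python
names = ['Alice', 'Lebron', 'Kobe']

# start=1 gives human-friendly numbering
numbered = [(i, name) for i, name in enumerate(names, start=1)]
print(numbered)  # [(1, 'Alice'), (2, 'Lebron'), (3, 'Kobe')]
```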


One of the many cool things about Data Science from Scratch (by Joel Grus) is his use of assertions as a way to "test" code. This is a software engineering practice (see test-driven development) that may not be as pervasive in data science but, I suspect, will see growing usage and soon become best practice, if we're not already there.

While there are testing frameworks that deserve their own chapters, throughout *this* book the author fortunately provides a simple way to test by way of the `assert` keyword. Here's an example:

```
# create a function to return the largest value in a list
def largest_item(x):
    return max(x)

# assert that our function is working properly
# we will see 'nothing' if things are working properly
assert largest_item([10, 20, 5, 40, 99]) == 99

# an AssertionError will pop up if any other value is used
assert largest_item([10, 20, 5, 40, 99]) == 40
# ---------------------------------------------------------------------------
# AssertionError                            Traceback (most recent call last)
# <ipython-input-21-12dc291d091e> in <module>
# ----> 1 assert largest_item([10, 20, 5, 40, 99]) == 40

# we can also create an assertion for input values
def largest_item(x):
    assert x, "empty list has no largest value"
    return max(x)
```

Object-oriented programming could be its own chapter, so we won't shoot for comprehensiveness here. Instead, we'll try to understand its basics, and the `assert` keyword is going to help us understand it even better.

First, we'll create a **class**, `CountingClicker`, that initializes at count 0 and has several methods, including a `click` method to increment the count, a `read` method to read the current count, and a `reset` method to set the count back to 0.

Then we'll write some `assert` statements to test that our class methods are working as intended.

You'll **note** that there are private methods and public methods. Private methods have the **double underscore** (aka dunder methods); they're generally not called directly, but Python won't stop you. Then we have the more familiar *public* methods. Also, all the methods have to be written **within** the scope of the class `CountingClicker`.

```
class CountingClicker:
    """A class can/should have a docstring, just like a function"""
    def __init__(self, count = 0):
        self.count = count

    def __repr__(self):
        return f"CountingClicker(count = {self.count})"

    def click(self, num_times = 1):
        """Click the clicker some number of times."""
        self.count += num_times

    def read(self):
        return self.count

    def reset(self):
        self.count = 0
```

After we've written the class and associated methods, we can write `assert` statements to test them. You'll want to write the statements below **in this order** because we're testing the *behavior* of our `CountingClicker` class.

```
clicker = CountingClicker()
assert clicker.read() == 0, "clicker should start with count 0"
clicker.click()
clicker.click()
assert clicker.read() == 2, "after two clicks, clicker should have count of 2"
clicker.reset()
assert clicker.read() == 0, "after reset, clicker should be back to 0"
```

In summary, we created a class `CountingClicker` whose methods allow it to display as text (`__repr__`), `click`, `read` and `reset`.

All these methods belong to the `class` CountingClicker and will be passed along to new instances of the class. We have yet to see what this looks like as it relates to tasks in data science, so we'll revisit this post when we have updates on the applied end.

Then, we tested our class `CountingClicker` with various `assert` statements to see if it behaves as intended.
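Instances also inherit behavior from parent classes. Here's a hedged sketch of subclassing (the book uses a similar `NoResetClicker` example); the class is reproduced so the snippet stands alone:

```python
class CountingClicker:
    """Same clicker as above, reproduced so this sketch is self-contained."""
    def __init__(self, count=0):
        self.count = count

    def click(self, num_times=1):
        self.count += num_times

    def read(self):
        return self.count

    def reset(self):
        self.count = 0


class NoResetClicker(CountingClicker):
    """Inherits click and read, but overrides reset to do nothing."""
    def reset(self):
        pass


clicker2 = NoResetClicker()
clicker2.click()
clicker2.reset()
assert clicker2.read() == 1, "reset shouldn't do anything"
```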

This is one section of a quick Python crash course as conveyed from Data Science from Scratch (by Joel Grus); you may be interested in these other posts on:

This section also shows how `list comprehensions` can replace the `map` and `filter` functions.

Previously, we saw **if-statements** expressed in one line, for example:

```
y = []
# Falsy
print("Truthy") if y else print("Falsy")
```

We can also write **for-loops** in one line, and that's a way to think about `list comprehensions`.

```
# traditional for-loop; [0, 2, 4]
num = []
for x in range(5):
    if x % 2 == 0:
        num.append(x)
num  # call num

# list comprehension, provides the same thing
# [0, 2, 4]
[x for x in range(5) if x % 2 == 0]
```

Here are some examples from Data Science from Scratch:

```
# [0, 2, 4]
even_numbers = [x for x in range(5) if x % 2 == 0]
# [0, 1, 4, 9, 16]
squares = [x * x for x in range(5)]
# [0, 4, 16]
even_squares = [x * x for x in even_numbers]
```

Dan Bader provides a helpful way to conceptualize `list comprehensions`:

```
(values) = [ (expression) for (item) in (collections) ]
```

A good way to understand `list comprehensions` is to de-construct them back to a regular for-loop:

```
# recreation of even_numbers; [0, 2, 4]
even_bracket = []
for x in range(5):
    if x % 2 == 0:
        even_bracket.append(x)

# recreation of squares; [0, 1, 4, 9, 16]
square_bracket = []
for x in range(5):
    square_bracket.append(x * x)

# recreate even_squares; [0, 4, 16]
square_even_bracket = []
for x in even_bracket:
    square_even_bracket.append(x * x)
```

Moreover, list comprehensions also allow for **filtering with conditions**. Again, we can understand this with a brief comparison with the for-loop.

```
# traditional for-loop
filtered_bracket = []
for x in range(10):
    if x > 5:
        filtered_bracket.append(x * x)

# list comprehension
filtered_comprehension = [x * x
                          for x in range(10)
                          if x > 5]
```

The key take-away here is that `list comprehensions` follow a pattern. Knowing this allows us to better understand how they work.

```
values = [expression
          for item in collection
          if condition]
```

Python also supports dictionary and set comprehensions, although we'll have to revisit this post as to **why** we would want to use them in a data wrangling, transformation or analysis context.

```
# {0: 0, 1: 1, 2: 4, 3: 9, 4: 16}
square_dict = {x: x * x for x in range(5)}
# {1}
square_set = {x * x for x in [1,-1]}
```
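As a hedged illustration of why a dict comprehension might matter in a wrangling context (my own example, not the book's): it builds a lookup table in a single pass.

```python
# build a word -> length lookup table in one pass
words = ["kaeng", "mu", "phat", "nuea"]
word_lengths = {w: len(w) for w in words}
print(word_lengths)  # {'kaeng': 5, 'mu': 2, 'phat': 4, 'nuea': 4}

# invert a dictionary (later duplicates of a value overwrite earlier ones)
lengths_to_words = {v: k for k, v in word_lengths.items()}
print(lengths_to_words)  # {5: 'kaeng', 2: 'mu', 4: 'nuea'}
```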

Finally, comprehensions can include nested for-loops:

```
pairs = [(x, y)
         for x in range(10)
         for y in range(10)]
```
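To make the nesting order concrete (my own deconstruction, in the same spirit as the earlier ones): the comprehension unrolls into nested for-loops read left to right.

```python
# nested comprehension (smaller range to keep the output short)
pairs = [(x, y)
         for x in range(3)
         for y in range(3)]

# equivalent nested for-loop, in the same left-to-right order
pairs_loop = []
for x in range(3):
    for y in range(3):
        pairs_loop.append((x, y))

print(pairs == pairs_loop)  # True
print(len(pairs))           # 9
```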

We expect to use `list comprehensions` often, so we'll revisit this section as we see more applications in context.

In the first edition of this book the author introduced these functions, but he has since reached enlightenment 🧘. He states:

"On my journey toward enlightenment I have realized that these functions (i.e., map, filter, reduce, partial) are best avoided, and their uses in the book have been replaced with list comprehensions, for loops and other, more Pythonic constructs." (p.36)

He's being facetious, but I was intrigued anyway. So here's an example replacing **map** with **list comprehensions**.

```
# create list of names
names = ['Russel', 'Kareem', 'Jordan', 'James']

# use the map function to loop over names and apply an anonymous function
greeted = map(lambda x: 'Hi ' + x, names)

# map returns an iterator (see also lazy evaluation)
print(greeted)  # <map object at 0x7fc667c81f40>

# because of lazy evaluation, it won't do anything unless you iterate over it
for name in greeted:
    print(name)
# Hi Russel
# Hi Kareem
# Hi Jordan
# Hi James

## List Comprehension way to do this operation
greeted2 = ['Hi ' + name for name in names]

# non-lazy (eager) evaluation
print(greeted2)  # ['Hi Russel', 'Hi Kareem', 'Hi Jordan', 'Hi James']
```

Here's another example replacing **filter** with **list comprehensions**:

```
# create a list of integers
numbers = [13, 4, 18, 35]

# filter creates an iterator
div_by_5 = filter(lambda num: num % 5 == 0, numbers)
print(div_by_5)        # <filter object at 0x7fc667c9ad30>
print(list(div_by_5))  # must convert the iterator into a list - [35]

# using a list comprehension to achieve the same thing
another_div_by_5 = [num for num in numbers if num % 5 == 0]

# lists do not use lazy evaluation, so it prints out immediately
print(another_div_by_5)  # [35]
```

In both cases, `list comprehensions` seem more pythonic and efficient.

For more content on data science, machine learning, R, Python, SQL and more, find me on Twitter.

I believe the main take-away from this section is to briefly highlight the various control flows possible.

Here's a traditional if-else statement:

```
x = 5
if x % 2 == 0:
    parity = "even"
else:
    parity = "odd"
parity  # 'odd'
```

The author may, from time to time, opt to use a shorter *ternary* if-else one-liner, like so:

```
parity = "even" if x % 2 == 0 else "odd"
```

The author points out that while **while-loops** exist:

```
x = 0
while x < 10:
    print(f"{x} is less than 10")
    x += 1
```

**for** and **in** will be used more often (the code below is both shorter and more readable):

```
for x in range(10):
print(f"{x} is less than 10")
```

We'll also **note** that `range(x)` only goes up to `x - 1`.
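A quick sketch of that boundary behavior (my own example):

```python
# range(x) yields 0 .. x-1, excluding x
print(list(range(5)))         # [0, 1, 2, 3, 4]
print(list(range(2, 5)))      # [2, 3, 4]
print(list(range(0, 10, 3)))  # [0, 3, 6, 9] - start, stop, step
```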

Finally, more complex logic *is* possible, although we'll have to revisit exactly when more complex logic is used in a data science context.

```
for x in range(10):
    if x == 3:
        continue  # skip 3, go immediately to the next iteration
    if x == 5:
        break     # quit the loop entirely
    print(x)
# prints 0, 1, 2, 4
```

Booleans in Python, `True` and `False`, have only the first letter capitalized. And Python uses `None` to indicate a nonexistent value. We'll handle the exception below:

```
1 < 2  # True (not TRUE)
1 > 2  # False (not FALSE)

x = 1
try:
    assert x is None
except AssertionError:
    print("There was an AssertionError because x is not 'None'")
```

A major takeaway for me is the concept of "truthy" and "falsy". The first thing to note is that anything *after* `if` implies "is true", which is why if-statements can be used to **check** if a list, string or dictionary is empty:

```
x = [1]
y = []

# if x...is true
# Truthy
if x:
    print("Truthy")
else:
    print("Falsy")

# if y...is true
# Falsy
print("Truthy") if y else print("Falsy")
```

You'll note the *ternary* version here is slightly less readable. Here are more examples to understand "truthiness".

```
## Truthy example
# create a function that returns a string
def some_func():
    return "a string"

# set s to the result of some_func
s = some_func()

# use an if-statement to check truthiness - first_char is 'a'
if s:
    first_char = s[0]
else:
    first_char = ""

## Falsy example
# another function returns an empty string
def another_func():
    return ""

# set y to the result of another_func (falsy example)
y = another_func()

# 'and' returns the second value when the first is 'truthy',
# and the first value when it is 'falsy'
first_character = y and y[0]
```

Finally, the author brings up the **all** and **any** functions. The former returns `True` when *every* element is truthy; the latter returns `True` when *at least one* element is truthy:

```
all([True, 1, {3}]) # True
all([True, 1, {}]) # False
any([True, 1, {}]) # True
all([]) # True
any([]) # False
```

You'll note that the truthiness **within** the list is being evaluated. So `all([])` evaluates to `True`: there are no 'falsy' elements within the list, because it's empty.

On the other hand, `any([])` evaluates to `False`: not even one element is 'truthy', because the list is empty.

`Counter` is a `dict` **subclass** for counting hashable objects (see the docs).
Back to our example in the previous section: we can use `Counter` instead of `dict`, specifically for counting:

```
from collections import Counter
# we can count the letters in this paragraph
count_letters = Counter("This table highlights 538's new NBA statistic, RAPTOR, in addition to the more established Wins Above Replacement (WAR). An extra column, Playoff (P/O) War, is provided to highlight stars performers in the post-season, when the stakes are higher. The table is limited to the top-100 players who have played at least 1,000 minutes minutes the table Wins NBA NBA RAPTOR more players")
# call count_letters
count_letters
# returns
Counter({'T': 4,
         'h': 19,
         'i': 22,
         's': 24,
         ' ': 61,
         't': 29,
         'a': 20,
         'b': 5,
         'l': 14,
         'e': 35,
         'g': 5,
         '5': 1,
         '3': 1,
         '8': 1,
         "'": 1,
         'n': 13,
         'w': 3,
         'N': 3,
         'B': 3,
         'A': 8,
         'c': 3,
         ',': 6,
         'R': 6,
         'P': 4,
         'O': 3,
         'd': 7,
         'o': 15,
         'm': 8,
         'r': 13,
         'W': 4,
         'v': 3,
         'p': 8,
         '(': 2,
         ')': 2,
         '.': 2,
         'x': 1,
         'u': 3,
         'y': 4,
         'f': 3,
         '/': 1,
         '-': 2,
         'k': 1,
         '1': 2,
         '0': 5})
```

`Counter` very easily did what `defaultdict(int)` did previously. We can even call the `most_common` method to get the most common letters:

```
# get the thirteen most common letters
for letter, count in count_letters.most_common(13):
    print(letter, count)

# returns 13 items (the first line is the space character)
  61
e 35
t 29
s 24
i 22
a 20
h 19
o 15
l 14
n 13
r 13
A 8
m 8
```

We had a glimpse of `set` previously. There are two things the author emphasizes with `set`. First, they're faster than lists for checking membership:

```
lines_list = ["This table highlights 538's new NBA statistic, RAPTOR, in addition to the more established Wins Above Replacement (WAR). An extra column, Playoff (P/O) War, is provided to highlight stars performers in the post-season, when the stakes are higher. The table is limited to the top-100 players who have played at least 1,000 minutes minutes the table Wins NBA NBA RAPTOR more players"]
"zip" in lines_list # False, but have to check every element
lines_set = set(lines_list)
type(lines_set) # set
"zip" in lines_set # Very fast to check
```

Because this was an arbitrary (and small) example, it's not obvious that checking membership in a `set` is faster than in a `list`, so we'll take the author's word for it.
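Rather than take it entirely on faith, here's a hedged sketch (my own, not from the book) using the standard library's `timeit` on a larger collection, where the gap becomes measurable: a `list` scans element by element (O(n)), while a `set` hashes its elements (roughly O(1) lookups).

```python
import timeit

big_list = list(range(100_000))
big_set = set(big_list)

# membership check for a worst-case element (the last one in the list)
t_list = timeit.timeit(lambda: 99_999 in big_list, number=1_000)
t_set = timeit.timeit(lambda: 99_999 in big_set, number=1_000)

print(f"list: {t_list:.4f}s, set: {t_set:.4f}s")
assert t_set < t_list  # the set should win comfortably
```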

The second highlight for `set` is finding **distinct** items in a collection:

```
number_list = [1,2,3,1,2,3] # list with six items
item_set = set(number_list) # turn it into a set
item_set # now has three items {1, 2, 3}
turn_into_list = list(item_set) # turn into distinct item list
```

Here's a more applied example of using `set` to handle duplicate entries. We'll import `defaultdict` and pass `set` as the `default_factory`. This example is inspired by Real Python:

```
from collections import defaultdict

# departments with duplicate entries
dep = [('Sales', 'John Doe'),
       ('Sales', 'Martin Smith'),
       ('Accounting', 'Jane Doe'),
       ('HR', 'Elizabeth Smith'),
       ('HR', 'Elizabeth Smith'),
       ('HR', 'Adam Doe'),
       ('HR', 'Adam Doe'),
       ('HR', 'Adam Doe')]

# use defaultdict with set
dep_dd = defaultdict(set)

# set objects have no 'append' attribute,
# so use 'add' to achieve the same effect
for department, employee in dep:
    dep_dd[department].add(employee)

dep_dd
# defaultdict(set,
#             {'Sales': {'John Doe', 'Martin Smith'},
#              'Accounting': {'Jane Doe'},
#              'HR': {'Adam Doe', 'Elizabeth Smith'}})
```


`defaultdict` is a subclass of `dict` (see the previous post), so it is a `dict` with additional features. To understand how those features make it different, and more convenient in some cases, we'll need to run into some errors.
If we try to count words in a document, the general approach is to create a dictionary where the dictionary `keys` are words and the dictionary `values` are counts of those words.

Let's try to do this with a regular dictionary.

First, to set up, we'll take a list of words and `split()` it into individual words. I took this paragraph from another project I'm working on and artificially added some extra words to ensure that certain words appear more than once (it'll be apparent why soon).

```
# paragraph
lines = ["This table highlights 538's new NBA statistic, RAPTOR, in addition to the more established Wins Above Replacement (WAR). An extra column, Playoff (P/O) War, is provided to highlight stars performers in the post-season, when the stakes are higher. The table is limited to the top-100 players who have played at least 1,000 minutes minutes the table Wins NBA NBA RAPTOR more players"]

# split the paragraph into individual words
lines = " ".join(lines).split()
type(lines)  # list
```

Now that we have our `lines` list, we'll create an empty `dict` called `word_counts` and have each word be a `key` and each `value` be the count of that word.

```
# empty dict
word_counts = {}

# loop through lines to count each word
for word in lines:
    word_counts[word] += 1
# KeyError: 'This'
```

We received a `KeyError` for the very first word in `lines` (i.e., 'This') because the **dictionary tried to increment a key that didn't exist**. We've learned to handle exceptions, so we can use `try` and `except`.

Here, we're looping through `lines`, and when we try to count a key that doesn't exist, we're *now* anticipating a `KeyError` and will set the initial count to 1. The loop can then continue to count the word, which now exists, so it can be incremented.

```
# empty dict
word_counts = {}

# exception handling
for word in lines:
    try:
        word_counts[word] += 1
    except KeyError:
        word_counts[word] = 1

# call word_counts
# abbreviated for space
word_counts
{'This': 1,
 'table': 3,
 'highlights': 1,
 "538's": 1,
 'new': 1,
 'NBA': 3,
 'statistic,': 1,
 'RAPTOR,': 1,
 'in': 2,
 'addition': 1,
 'to': 3,
 'the': 5,
 'more': 2,
 ...
 'top-100': 1,
 'players': 2,
 'who': 1,
 'have': 1,
 'played': 1,
 'at': 1,
 'least': 1,
 '1,000': 1,
 'minutes': 2,
 'RAPTOR': 1}
```

Now, there are other ways to achieve the above:

```
# use conditional flow
word_counts = {}
for word in lines:
    if word in word_counts:
        word_counts[word] += 1
    else:
        word_counts[word] = 1

# use get (start from an empty dict again)
word_counts = {}
for word in lines:
    previous_count = word_counts.get(word, 0)
    word_counts[word] = previous_count + 1
```

Here's where the author makes the case for `defaultdict`, arguing that the two aforementioned approaches are unwieldy. We'll come back full circle to try our first approach, using `defaultdict` instead of the traditional `dict`.

`defaultdict` is a subclass of `dict` and must be imported from `collections`:

```
from collections import defaultdict

word_counts = defaultdict(int)
for word in lines:
    word_counts[word] += 1

# we no longer get a KeyError
# abbreviated for space
defaultdict(int,
            {'This': 1,
             'table': 3,
             'highlights': 1,
             "538's": 1,
             'new': 1,
             'NBA': 3,
             'statistic,': 1,
             'RAPTOR,': 1,
             'in': 2,
             'addition': 1,
             'to': 3,
             'the': 5,
             'more': 2,
             ...
             'top-100': 1,
             'players': 2,
             'who': 1,
             'have': 1,
             'played': 1,
             'at': 1,
             'least': 1,
             '1,000': 1,
             'minutes': 2,
             'RAPTOR': 1})
```

Unlike a regular dictionary, when a `defaultdict` tries to look up a key it doesn't contain, it automatically adds a value for it using the argument we provided when we first created the `defaultdict`. Above, we entered `int` as the argument, so missing keys automatically get an integer value (calling `int()` yields 0).
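One hedged aside (my own example): the argument to `defaultdict` (its `default_factory`) can be any zero-argument callable, not just a type like `int` or `list`. A `lambda` lets you pick an arbitrary default.

```python
from collections import defaultdict

# any zero-argument callable works as the default_factory
scores = defaultdict(lambda: 'N/A')
scores['math'] = 90

print(scores['math'])     # 90
print(scores['history'])  # N/A  (missing key gets the default)
```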

If you want your `defaultdict` to have `values` be `lists`, you can pass `list` as the argument. Then, when you `append` a value, it is automatically contained in a `list`.

```
dd_list = defaultdict(list) # defaultdict(list, {})
dd_list[2].append(1) # defaultdict(list, {2: [1]})
dd_list[4].append('string') # defaultdict(list, {2: [1], 4: ['string']})
```

You can also pass `dict` into `defaultdict`, ensuring that all appended values are contained in a `dict`:

```
dd_dict = defaultdict(dict) # defaultdict(dict, {})
# match key-with-value
dd_dict['first_name'] = 'lebron' # defaultdict(dict, {'first_name': 'lebron'})
dd_dict['last_name'] = 'james'
# match key with dictionary containing another key-value pair
dd_dict['team']['city'] = 'Los Angeles'
# defaultdict(dict,
# {'first_name': 'lebron',
# 'last_name': 'james',
# 'team': {'city': 'Los Angeles'}})
```

The following example is from Real Python, a fantastic resource for all things Python.

It is common to use `defaultdict` to group items in a sequence or collection, with the initial parameter (aka `default_factory`) set to `list`.

```
from collections import defaultdict

dep = [('Sales', 'John Doe'),
       ('Sales', 'Martin Smith'),
       ('Accounting', 'Jane Doe'),
       ('Marketing', 'Elizabeth Smith'),
       ('Marketing', 'Adam Doe')]

dep_dd = defaultdict(list)
for department, employee in dep:
    dep_dd[department].append(employee)

dep_dd
# defaultdict(list,
#             {'Sales': ['John Doe', 'Martin Smith'],
#              'Accounting': ['Jane Doe'],
#              'Marketing': ['Elizabeth Smith', 'Adam Doe']})
```

What happens when you have **duplicate** entries? We're jumping ahead slightly to use `set` to handle duplicates and only group unique entries:

```
# departments with duplicate entries
dep = [('Sales', 'John Doe'),
       ('Sales', 'Martin Smith'),
       ('Accounting', 'Jane Doe'),
       ('Marketing', 'Elizabeth Smith'),
       ('Marketing', 'Elizabeth Smith'),
       ('Marketing', 'Adam Doe'),
       ('Marketing', 'Adam Doe'),
       ('Marketing', 'Adam Doe')]

# use defaultdict with set
dep_dd = defaultdict(set)

# set objects have no 'append' attribute,
# so use 'add' to achieve the same effect
for department, employee in dep:
    dep_dd[department].add(employee)

dep_dd
# defaultdict(set,
#             {'Sales': {'John Doe', 'Martin Smith'},
#              'Accounting': {'Jane Doe'},
#              'Marketing': {'Adam Doe', 'Elizabeth Smith'}})
```

Finally, we'll use `defaultdict` to accumulate values:

```
incomes = [('Books', 1250.00),
           ('Books', 1300.00),
           ('Books', 1420.00),
           ('Tutorials', 560.00),
           ('Tutorials', 630.00),
           ('Tutorials', 750.00),
           ('Courses', 2500.00),
           ('Courses', 2430.00),
           ('Courses', 2750.00)]

# enter float as the argument
dd = defaultdict(float)  # collections.defaultdict

for product, income in incomes:
    dd[product] += income
# defaultdict(float, {'Books': 3970.0, 'Tutorials': 1940.0, 'Courses': 7680.0})

for product, income in dd.items():
    print(f"Total income for {product}: ${income:,.2f}")
# Total income for Books: $3,970.00
# Total income for Tutorials: $1,940.00
# Total income for Courses: $7,680.00
```

I can see that `defaultdict` and `dictionaries` can be handy for grouping, counting and accumulating values in a column. We'll come back to revisit these foundational concepts once the data science applications are clearer.

In summary, `dictionaries` and `defaultdict` can be used to group, accumulate and count items. Both can be used even when a `key` doesn't (yet) exist, but `defaultdict` handles this more succinctly. For now, we'll stop here and proceed to the next topic: counters.

Some of the things you can do with `dictionaries` are query keys and values, assign new key/value pairs, check for the existence of keys, and retrieve certain values.

```
empty_dict = {}       # most pythonic
empty_dict2 = dict()  # less pythonic
grades = {"Joel": 80, "Grus": 99}  # dictionary literal
type(grades)  # type check, dict

# use brackets to look up values
grades["Grus"]  # 99
grades["Joel"]  # 80

# KeyError for looking up non-existent keys
try:
    kate_grades = grades["Kate"]
except KeyError:
    print("That key doesn't exist")

# use the in operator to check the existence of a key
joe_has_grade = "Joel" in grades
joe_has_grade   # True
kate_does_not = "Kate" in grades
kate_does_not   # False

# use the 'get' method to get values in dictionaries
grades.get("Joel")  # 80
grades.get("Grus")  # 99
grades.get("Kate")  # default: None

# assign a new key/value pair using brackets
grades["Tim"] = 93
grades  # {'Joel': 80, 'Grus': 99, 'Tim': 93}
```

Dictionaries are good for representing structured data that can be queried. The key take-away here is that in order to iterate through `dictionaries` to get either `keys`, `values` or both, we'll need to use specific methods like `keys()`, `values()` or `items()`.

```
tweet = {
    "user": "paulapivat",
    "text": "Reading Data Science from Scratch",
    "retweet_count": 100,
    "hashtags": ["#66daysofdata", "datascience", "machinelearning", "python", "R"]
}

# query specific values
tweet["retweet_count"]  # 100

# query values within a list
tweet["hashtags"]     # ['#66daysofdata', 'datascience', 'machinelearning', 'python', 'R']
tweet["hashtags"][2]  # 'machinelearning'

# retrieve ALL keys
tweet_keys = tweet.keys()
tweet_keys  # dict_keys(['user', 'text', 'retweet_count', 'hashtags'])
type(tweet_keys)  # different data type: dict != dict_keys

# retrieve ALL values
tweet_values = tweet.values()
tweet_values  # dict_values(['paulapivat', 'Reading Data Science from Scratch', 100, ['#66daysofdata', 'datascience', 'machinelearning', 'python', 'R']])
type(tweet_values)  # different data type: dict != dict_values

# create an iterable of key-value pairs (as tuples)
tweet_items = tweet.items()

# iterate through tweet_items
for key, value in tweet_items:
    print("These are the keys:", key)
    print("These are the values:", value)

# cannot unpack key-value pairs from the dictionary directly -
# iterating a dict yields only keys, so unpacking each key (a string) fails
# ValueError: too many values to unpack (expected 2)
for key, value in tweet:
    print(key)

# cannot use 'enumerate' because that only provides index and key (no value)
for key, value in enumerate(tweet):
    print(key)    # prints 0 1 2 3 - index values
    print(value)  # user text retweet_count hashtags (these are keys, not values)
```
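For completeness (my own note, not from the book): iterating a dictionary directly yields its keys, so you *can* get values without `items()` by indexing back into the dict, though `items()` remains the idiomatic choice.

```python
tweet = {"user": "paulapivat", "retweet_count": 100}

# iterating a dict directly yields its keys
collected = []
for key in tweet:
    collected.append((key, tweet[key]))

print(collected)  # [('user', 'paulapivat'), ('retweet_count', 100)]
```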

Just like with `lists` and `tuples`, you can use the `in` operator to check membership. The one caveat is that you cannot look up *values* that are nested in `lists`, unless you use bracket notation to help.

```
# search keys
"user" in tweet   # True
"bball" in tweet  # False
"paulapivat" in tweet_values  # True
'python' in tweet_values      # False ('python' is nested in 'hashtags')
"hashtags" in tweet  # True

# finding values inside a list requires brackets to help
'python' in tweet['hashtags']  # True
```

**What is or is not hashable?**

`Dictionary` keys must be hashable. `Strings` are hashable, so we can use `strings` as dictionary keys, but we **cannot** use `lists` because they are not hashable.

```
paul = "paul"
type(paul)  # check type, str
hash(paul)       # -3897810863245179227 ; strings are hashable
paul.__hash__()  # -3897810863245179227 ; another way to find the hash

jake = ['jake']  # this is a list
type(jake)  # check type, list

# lists are not hashable - they cannot be used as dictionary keys
try:
    hash(jake)
except TypeError:
    print('lists are not hashable')
```
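A hedged addition (my own example): `tuples` are immutable and therefore hashable, which makes them useful as compound dictionary keys, unless they contain something unhashable like a list.

```python
# tuples are hashable, so they can serve as compound dict keys
coordinates = {(13.75, 100.50): 'Bangkok', (40.71, -74.01): 'New York'}
print(coordinates[(13.75, 100.50)])  # Bangkok

# but a tuple containing a list is NOT hashable
try:
    hash((1, [2, 3]))
except TypeError:
    print('tuples containing lists are not hashable')
```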


Python `lists` and NumPy `arrays` have much overlap, so I think it would be useful to use this section as an opportunity to compare and contrast them. Lists are fundamental to Python, so I'm going to spend some time exploring their features. For data science, `NumPy arrays` are used frequently, so I thought it'd be good to implement all the `list` operations covered in this section with `NumPy arrays` to *tease apart their similarities and differences*.

Below are the similarities.

This implies that whatever can be done with Python `lists` can also be done with NumPy `arrays`, including: getting the *nth* element with square brackets, slicing, stepping through with *start, stop, step*, using the `in` operator to check membership, checking length, and unpacking.

```
# setup
import numpy as np
# create comparables
python_list = [1,2,3,4,5,6,7,8,9]
numpy_array = np.array([1,2,3,4,5,6,7,8,9])
# bracket operations
# get nth element with square bracket
python_list[0] # 1
numpy_array[0] # 1
python_list[8] # 9
numpy_array[8] # 9
python_list[-1] # 9
numpy_array[-1] # 9
# square bracket to slice
python_list[:3] # [1, 2, 3]
numpy_array[:3] # array([1, 2, 3])
python_list[1:5] # [2, 3, 4, 5]
numpy_array[1:5] # array([2, 3, 4, 5])
# start, stop, step
python_list[1:8:2] # [2, 4, 6, 8]
numpy_array[1:8:2] # array([2, 4, 6, 8])
# use in operator to check membership
1 in python_list  # True
1 in numpy_array  # True
0 in python_list  # False
0 in numpy_array  # False
# finding length
len(python_list) # 9
len(numpy_array) # 9
# unpacking
x,y = [1,2] # now x is 1, y is 2
w,z = np.array([1,2]) # now w is 1, z is 2
```

Now, here are the differences.

These tasks can be done with Python `lists` but require a different approach for NumPy `arrays`, including modification (`extend` for lists, `np.append` for arrays). Finally, lists can store mixed data types, while a NumPy array will convert every element to a string.

```
# python lists can store mixed data types
heterogeneous_list = ['string', 0.1, True]
type(heterogeneous_list[0])  # str
type(heterogeneous_list[1])  # float
type(heterogeneous_list[2])  # bool

# numpy arrays cannot store mixed data types;
# numpy converts all the elements to strings
homogeneous_numpy_array = np.array(['string', 0.1, True])
type(homogeneous_numpy_array[0])  # numpy.str_
type(homogeneous_numpy_array[1])  # numpy.str_
type(homogeneous_numpy_array[2])  # numpy.str_

# modifying a list vs a numpy array
# lists can use extend to modify the list in place
python_list.extend([10, 12, 13])  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 13]
numpy_array.extend([10, 12, 13])  # AttributeError: 'numpy.ndarray' object has no attribute 'extend'

# numpy arrays must use np.append instead of extend
numpy_array = np.append(numpy_array, [10, 12, 13])

# python lists can be added to other lists
new_python_list = python_list + [14, 15]  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15]

# adding a shorter list to a numpy array raises a broadcasting ValueError
numpy_array + [14, 15]  # ValueError

# use np.append instead
# array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15])
new_numpy_array = np.append(numpy_array, [14, 15])

# python lists have the append method
python_list.append(0)  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 13, 0]

# for numpy arrays, np.append is used differently (it returns a new array)
numpy_array = np.append(numpy_array, [0])
```
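One more difference worth sketching (my own example, kept to plain Python so it runs without NumPy): arithmetic operators mean different things on the two types. On lists, `*` repeats and `+` concatenates; on NumPy arrays they act element-wise.

```python
python_list = [1, 2, 3]

# on a list, * means repetition
repeated = python_list * 2
print(repeated)  # [1, 2, 3, 1, 2, 3]

# element-wise doubling needs an explicit comprehension
doubled = [x * 2 for x in python_list]
print(doubled)  # [2, 4, 6]

# with numpy (not executed here), arithmetic is element-wise instead:
# np.array([1, 2, 3]) * 2  ->  array([2, 4, 6])
```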

Python `lists` and NumPy `arrays` have much in common, but there are meaningful differences as well.

Now that we know that there *are* meaningful differences, what can we attribute them to? This explainer from UCF highlights differences in:

- Size
- Performance
- Functionality

I'm tempted to go down this 🐇 🕳 of further `lists` vs `arrays` comparisons, but we'll hold off for now.

This is a continuation of my coverage of Data Science from Scratch, by Joel Grus (ch2) where the Python crash course brings us to strings and exceptions.

These topics are nice-to-know rather than central to data science, so we'll cover them briefly (of course, I'll revisit this section if I find they're more important than I thought).

For strings, the main concept highlighted is **string interpolation** where a variable is inserted in a string in some fashion. There are several ways to do this, but the **f-string** approach is most up-to-date and recommended.

Here are some examples:

```
# first we'll create variables that are pointed at strings (my first and last names)
first_name = "Paul"
last_name = "Apivat"
# f-string (recommended)
f_string = f"{first_name} {last_name}"
# string addition, 'Paul Apivat'
string_addition = first_name + " " + last_name
# string format, 'Paul Apivat'
string_format = "{0} {1}".format(first_name, last_name)
# percent format (NOT recommended), 'Paul Apivat'
pct_format = "%s %s" %('Paul','Apivat')
```
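Two more f-string capabilities worth noting (my own extension of the examples above, not from the book): braces can hold arbitrary expressions, and a colon introduces a format specifier.

```python
first_name = "Paul"
pi_ish = 3.14159

# expressions inside the braces
greeting = f"{first_name.upper()} has {len(first_name)} letters"
print(greeting)  # PAUL has 4 letters

# format specifiers after a colon
rounded = f"{pi_ish:.2f}"
print(rounded)  # 3.14
```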

The author covers exceptions to make the point that they're not all that bad in Python, and that it's worth handling exceptions yourself to make code more readable. Here's my own example, slightly different from the book's:

```
integer_list = [1, 2, 3]
heterogeneous_list = ["string", 0.1, True]

# you can sum a list of integers, here 1 + 2 + 3 = 6
sum(integer_list)

# but you cannot sum a list of heterogeneous data types;
# doing so raises a TypeError
sum(heterogeneous_list)
# ---------------------------------------------------------------------------
# TypeError                                 Traceback (most recent call last)
# <ipython-input-12-3287dd0c6c22> in <module>
# ----> 1 sum(heterogeneous_list)
# TypeError: unsupported operand type(s) for +: 'int' and 'str'
# the error crashes your program and is not fun to look at

# so the idea is to handle the exception with your own message
try:
    sum(heterogeneous_list)
except TypeError:
    print("cannot add objects of different data types")
```

At this point, the primary benefit to handling exceptions may be for code readability, so we'll come back to this section if we see more useful examples.
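One small way to build on this, sketched below with a helper name (`safe_sum`) of my own invention rather than from the book: wrap the `try`/`except` in a function so the caller gets a sensible return value instead of a crash.

```python
heterogeneous_list = ["string", 0.1, True]

def safe_sum(items):
    """Try to sum a list; print an explanation and return None on failure."""
    try:
        return sum(items)
    except TypeError:
        print("cannot add objects of different data types")
        return None

safe_sum([1, 2, 3])           # returns 6
safe_sum(heterogeneous_list)  # prints the message and returns None
```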

As a relative newcomer to Python (from R), my goals are twofold. First, to go through this book and, as a byproduct, learn Python. Second, to look out for and highlight the areas where the Pythonic way of doing things is necessary to accomplish something in the *data science* process.

I'll be on the lookout for specific features of the Python language needed to carry out tasks in cleaning or pre-processing data, preparing the data for modeling, exploratory data analysis, or the mechanics of training, validating, and testing models.

In his coverage of functions, Grus emphasizes how in Python functions are *first-class* and can be passed as arguments to other functions. I'll be drawing from examples in the book and may supplement with external sources to examine the same concept from another angle.

The illustration of functions being passed as arguments is demonstrated below. Two functions, `double` and `apply_to_one`, are created. The name `my_double` is pointed at the `double` function. We then pass `my_double` into the `apply_to_one` function and set the result to `x`.

Whatever function is passed to `apply_to_one`, it gets called with 1 as its argument. So passing `my_double` means we are doubling 1, and `x` is 2.

But the important thing is that a function got passed to another function (these are known as higher-order functions).

```
def double(x):
    """this function doubles and returns the argument"""
    return x * 2

def apply_to_one(f):
    """calls the function f with 1 as its argument"""
    return f(1)

my_double = double

# x is 2 here
x = apply_to_one(my_double)
```

Here's an extension of the above example. We create an `apply_five_to` function that calls whatever function it receives with the integer 5 as its argument.

```
# extending the above example
def apply_five_to(e):
    """returns the function e with 5 as its argument"""
    return e(5)

# doubling 5 is 10
w = apply_five_to(my_double)
```

Since functions are going to be used extensively, here's another, more involved example I found on Trey Hunner's site. Two functions are defined, `square` and `cube`, and both are saved to a list called `operations`. Another list, `numbers`, is created.

Finally, a for-loop iterates through `numbers`, using the `enumerate` function to access both the index and the item. The index determines whether the `action` is `square` or `cube` (`operations[0]` is `square`, `operations[1]` is `cube`), and that function is then called with the item from the `numbers` list as its argument.

```
# create two functions
def square(n): return n**2
def cube(n): return n**3

# store those functions inside a list, operations, to reference later
operations = [square, cube]

# create a list of numbers
numbers = [2, 1, 3, 4, 7, 11, 18, 29]

# loop through the numbers list
# using enumerate to access both the index and the item
# [i % 2] evaluates to either 0 or 1, selecting square or cube as action
# the dunder attribute __name__ retrieves the name of the function
# print __name__ along with the result of calling action on the item
for i, n in enumerate(numbers):
    action = operations[i % 2]
    print(f"{action.__name__}({n}):", action(n))

# more explicit, yet verbose way to write the for-loop
for index, num in enumerate(numbers):
    action = operations[index % 2]
    print(f"{action.__name__}({num}):", action(num))
```

The for-loop prints out the following:

```
square(2): 4
cube(1): 1
square(3): 9
cube(4): 64
square(7): 49
cube(11): 1331
square(18): 324
cube(29): 24389
```

A special case of functions being passed as arguments to other functions is the Python anonymous function, `lambda`. With `lambda`, instead of defining a function with `def`, the function is defined *immediately* inside the call to another function. Here's an illustration:

```
# we'll reuse apply_five_to, which takes in a function and provides 5 as the argument
def apply_five_to(e):
    """returns the function e with 5 as its argument"""
    return e(5)

# this lambda function adds 4 to any argument
# when passing this lambda function to apply_five_to
# you get y = 5 + 4, so y = 9
y = apply_five_to(lambda x: x + 4)

# we can also change what the lambda function does without defining a separate function
# here the lambda function multiplies the argument by 4
# y = 20
y = apply_five_to(lambda x: x * 4)
```

While `lambda` functions are convenient and succinct, there seems to be consensus that you should just define a function with `def` instead.
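For instance, the `lambda x: x + 4` from the illustration above could be a named function; the behavior is identical, but the function gets a docstring and a name that shows up in tracebacks (the name `add_four` is my own choice):

```python
def apply_five_to(e):
    """returns the function e with 5 as its argument"""
    return e(5)

def add_four(x):
    """named equivalent of lambda x: x + 4"""
    return x + 4

# same result as apply_five_to(lambda x: x + 4)
y = apply_five_to(add_four)  # y is 9
```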

Here's an external example of `lambda` functions from Trey Hunner. In this example, a `lambda` function is used within the built-in `filter` function, which takes two arguments.

```
# calling help(filter) displays an explanation:
#
# class filter(object)
#  |  filter(function or None, iterable) --> filter object

# create a list of numbers
numbers = [2, 1, 3, 4, 7, 11, 18, 29]

# the lambda function returns True when n is an even number
# we filter the numbers list using the lambda function
# wrapped in a list, this returns [2, 4, 18]
list(filter(lambda n: n % 2 == 0, numbers))
```
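As an aside, the same filtering is often written as a list comprehension, which is the form many Python style guides recommend over `filter` plus `lambda`:

```python
numbers = [2, 1, 3, 4, 7, 11, 18, 29]

# keep only the even numbers
# equivalent to list(filter(lambda n: n % 2 == 0, numbers))
evens = [n for n in numbers if n % 2 == 0]
print(evens)  # [2, 4, 18]
```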

There are whole books, or at least whole chapters, that could be written about Python functions, but we'll limit our discussion for now to the idea that functions can be passed as arguments to other functions. I'll report back on this section as we progress through the book.

For more content on data science, machine learning, R, Python, SQL and more, find me on Twitter.

Starting with this post, I'll be documenting my progress through Joel Grus' Data Science from Scratch (DSFS).

As a newcomer to Python (coming from R), it took a minute to understand the Python 2 vs. 3 split and to explore the various tooling options. I tried out Spyder, then PyCharm, and finally settled on the Anaconda distribution to access Jupyter notebooks.

Coming into this book, I knew Joel Grus didn't like notebooks. I'm going to wait until I get to the end of the book to reach a personal verdict. As a relative newcomer to Python, I'm not attached to notebooks, but I have found some features to be nice (e.g., in-line plotting). I'm open to having my mind changed, and I'll take the author at his word.

He states explicitly that it's good discipline to "work in a virtual environment, and never use the 'base' Python installation" (p. 17). Fortunately, I had already gone through the process of setting up Python 3.8.5. My next task was to set up a virtual environment and install IPython. My IDE of choice is VSCode.

I'm happy to report that the setup process was relatively painless. I learned to setup a virtual environment for any work related to Data Science from Scratch and have started playing around with IPython.

The following are good to know: entering and exiting the virtual environment (I use conda); entering and exiting an IPython session; saving an IPython session, or specific lines of it, to a `.py` file; opening that `.py` file directly from the terminal within VSCode and making edits; and creating and opening a `.py` file within VSCode.

The commands I use for each of these tasks, with commented explanations, are as follows:
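The exact commands didn't survive in this post, but a rough sketch of that workflow (assuming conda and an environment name of my own choosing, `dsfs`) looks like this:

```shell
# create and activate a virtual environment (the name 'dsfs' is my choice)
conda create -n dsfs python=3.8
conda activate dsfs

# install IPython into the environment and start a session
conda install ipython
ipython

# inside IPython: save lines 1-10 of the session to a file, then exit
#   %save my_session.py 1-10
#   exit

# back in the terminal: open the file in VSCode to edit it
code my_session.py

# leave the virtual environment when done
conda deactivate
```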

In the next post, we'll get into functions.

For more content on data science, machine learning, R, Python, SQL and more, find me on Twitter.
