library(reticulate)
A Data-Centric Introduction to Python
Previous attendees have said…
- 14 previous attendees have left feedback
- 93% would recommend this session to a colleague
- 100% said that this session was pitched correctly
- I have no prior knowledge of python so this was a complete beginners session for me.
- A useful taster session with helpful links to further resources
- useful warts and all demonstration of python for python novices
A note on this mixed R/Python Quarto file: the text and Python sections were written in a Jupyter notebook, then converted into a Quarto document by:
quarto convert a_data_centric_introduction_to_python.ipynb
The R sections and the tabsets were then added, and the document was knitted using Quarto in RStudio after attaching the reticulate package (Ushey, Allaire, and Tang 2024) with library(reticulate).
A data-centric introduction to Python
This is a friendly beginner session introducing users to Python. It’s opinionated towards health and social care, assumes no previous Python knowledge, and will have lots of scope for practical demonstrations. Given that lots of users in the KIND network will have some prior experience of R, we’ll introduce some key Python features by comparison with R.
Session structure
- a brief general-purpose introduction to the language
- how to read and write Python (Jupyter/VS Code/Posit Workbench/Positron)
- a side-note about Excel Python
- Python for R developers - a practical demonstration
Python introduction
- “Python is a high-level, general-purpose programming language.”
- massive user-base
- highly extensible and flexible (\(10^5\) modules)
- the second-best language for everything
- multi-paradigm (oop, structured, …)
Reading and writing Python
- you’ll need:
- Python, currently at 3.12
- (almost certainly) something to manage modules - like pip or conda
- (almost certainly) an integrated development environment. Loads of options:
- practical demo of JupyterLab
- non-free use in posit.cloud
- RStudio via reticulate / Jupyter
- VS Code, which is pretty well industry standard for the wider Python ecosystem
- Positron, which is the new kid for data-flavoured Python work
Excel Python
- Python is coming to Excel, apparently…
- roll-out slower than expected
- gives an alternative to VBA etc
- code gets executed in the cloud, so no infrastructure faff…
- but a potential information governance headache
- on the off chance that you have it available, =PY() is the key function
Python for R people
You’re welcome to follow along using the free basic Python set-up at W3schools
- “hello world!”
- indents vs brackets
- Rmarkdown vs Jupyter
- packages vs modules - for data from csv comparison
- basic work with tabular data - for methods
- vector/tibble/list vs list/tuple/dict/set - for vectorisation vs list comps
- pandas for tabular data
- plotting comparison
“hello world!”
Initially, there’s very little to choose between R and Python, and everything is likely to feel very familiar.
print("hello world!")
hello world!
1 + 2
3
hw = "hello " + "world" + "!"
hw
'hello world!'
"hello world"
[1] "hello world"
1 + 2
[1] 3
hw <- paste("hello", "world", "!")
hw
[1] "hello world !"
Indents
- a first big difference: indents matter in Python
- they’re not optional: indentation is part of Python’s syntax
- broadly correspond to curly brackets in R
word = "care"
if word == "care":
    print("I have found someone from care")
else:
    print("No, I haven't found anyone from care")
I have found someone from care
word <- "care"
if (word == "care") {
  print("I have found someone from care")
} else {
  print("No, I haven't found anyone from care")
}
[1] "I have found someone from care"
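To see how indentation nests, here is a minimal sketch with a loop containing a condition. The list of words and the counter name are made up for illustration. Each extra level of indentation opens a new block, roughly as a nested pair of curly brackets would in R.

```python
# Illustrative values only: a short list of words to check
words = ["care", "health", "care"]
found = 0

for word in words:
    if word == "care":
        found += 1          # inside the if, which is inside the for
    print("checked", word)  # inside the for, but outside the if

print("found", found, "from care")  # back at the top level, so this runs once
```

Moving the second print four spaces to the right would put it inside the if block, so it would only run for matching words.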
Rmarkdown/Quarto vs Jupyter
- Jupyter provides interactive code- and markdown editing. Compare to the render/knit-based workflow of qmd/Rmd
- web-based, so perhaps more like posit.cloud / workbench than Rstudio
- comparatively harder to edit .ipynb files than .Rmd/.qmd in other tools
Packages vs modules
We’ll load the pandas module in Python, and the readr package in R (Wickham, Hester, and Bryan 2024) to compare and contrast loading external functions. We’ll use those to read some sample data (the KIND book of the week dataset).
botw_dat = "https://raw.githubusercontent.com/NES-DEW/KIND-community-standards/main/data/KIND_book_of_the_week.csv"
import pandas
botw = pandas.read_csv(botw_dat)
But we also have a lot of options for loading modules. We can alias, most usefully to give us short names for commonly-used functions:
import pandas as pd
botw = pd.read_csv(botw_dat)
We could even load an individual function from a module:
from pandas import read_csv as read_csv
botw = read_csv(botw_dat)
# one minor bit of cheating - we'll coerce the Year column to numeric
botw = botw.replace("1979 (1935)", 1979)
botw["Year"] = pandas.to_numeric(botw["Year"])
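The same three import styles work for any module. As a small sketch that runs without pandas, this uses the statistics module from Python's standard library. The list of years is a handful of illustrative values, not the full dataset.

```python
import statistics              # whole module, used via its full name
import statistics as st        # the same module under a shorter alias
from statistics import mean    # a single function, used without a prefix

years = [1999, 2011, 1986, 2016]   # illustrative values only

a = statistics.mean(years)   # full module name
b = st.mean(years)           # alias
c = mean(years)              # bare function name

print(a, b, c)
```

All three calls reach the same function. The difference is only in how much you type, and how obvious it is to a reader where the function came from.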
There are comparatively fewer options for package loading in R. You’d traditionally attach a whole package using library():
library(readr)
botw_dat <- "https://raw.githubusercontent.com/NES-DEW/KIND-community-standards/main/data/KIND_book_of_the_week.csv"
botw <- read_csv(botw_dat)
You can also call an individual function without attaching its package by namespacing it with the :: operator:
botw <- readr::read_csv(botw_dat)
It is also possible, although non-standard, to alias individual functions:
steve <- readr::read_csv
botw <- steve(botw_dat)
Fun with tabular data
Doing some basic playing with our tabular data shows that Python makes heavy use of methods: functions that belong to a particular type of object, and are called from that object with a dot. While methods can be used in R, in practice most R code relies on functions.
Both shape and index belong to the pandas DataFrame. Strictly, they are attributes rather than methods, because they’re read without brackets, whereas a method like head() is called with them. They’ll only work on pandas objects, which we’ll talk about more below.
botw.shape # shape is an attribute of the DataFrame
(30, 8)
len(botw.index) # as is index
30
botw.shape[0] # Python is 0-indexed
30
dim(botw) # dim is a function
[1] 30 8
nrow(botw) # as is nrow
[1] 30
dim(botw)[1] # R is 1-indexed
[1] 30
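The difference between functions, methods, and attributes shows up even without pandas. This is a small sketch using only built-in types. The variable names are illustrative.

```python
title = "The Code Book"

lower_title = title.lower()   # method: belongs to str objects, called with a dot and brackets
n_chars = len(title)          # function: takes the object as an argument
first_char = title[0]         # indexing starts at 0, so this is the first character
last_char = title[-1]         # negative indices count back from the end

z = 3 + 4j                    # a complex number
real_part = z.real            # attribute: read with a dot but no brackets

print(lower_title, n_chars, first_char, last_char, real_part)
```

Values read with a dot and no brackets, as botw.shape is, are usually called attributes. Anything that needs brackets to run, like lower() here, is a method.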
Data types
- there are four basic built-in collection types in Python
- list
- tuple
- dict
- set
numbers_list = [1,2,3,4,5] # changeable
numbers_list
[1, 2, 3, 4, 5]
numbers_tuple = (1,2,3,4,5) # unchangeable
numbers_tuple
(1, 2, 3, 4, 5)
numbers_dict = {"one":1, "two":2, "three":3} # changeable, ordered (since Python 3.7), no duplicate keys
numbers_dict
{'one': 1, 'two': 2, 'three': 3}
numbers_set = {1,2,3,4,5} # no duplicates; items can be added or removed, but not changed in place
numbers_set
{1, 2, 3, 4, 5}
# Modify in place semantics
numbers_list.reverse()
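A short sketch of what "changeable" means in practice for each of the four types. The demo_ names are invented here so that they don't overwrite the objects created above.

```python
demo_list = [1, 2, 3]
demo_list.append(4)                 # lists can be changed in place

demo_tuple = (1, 2, 3)
try:
    demo_tuple[0] = 99              # tuples cannot, so this raises a TypeError
    tuple_changed = True
except TypeError:
    tuple_changed = False

demo_dict = {"one": 1, "two": 2}
demo_dict["three"] = 3              # adds a new key
demo_dict["one"] = 100              # re-using a key overwrites its value

demo_set = {1, 2, 2, 3, 3, 3}       # duplicates are dropped on creation
demo_set.add(4)                     # new items can be added to a set

print(demo_list, tuple_changed, demo_dict, demo_set)
```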
- R has several basic data types, but in practice only three are commonly encountered. These are the vector, the data frame, and the list (confusing!):
numbers_vector <- c(1,2,3,4,5)
numbers_vector
[1] 1 2 3 4 5
numbers_dataframe <- data.frame(nums = numbers_vector)
numbers_dataframe
nums
1 1
2 2
3 3
4 4
5 5
numbers_list <- list(numbers_vector, numbers_dataframe)
numbers_list
[[1]]
[1] 1 2 3 4 5
[[2]]
nums
1 1
2 2
3 3
4 4
5 5
Loops, list comprehensions, and vectorization
There are various ways of repeatedly running code. We’ll demonstrate a couple of simple ones here. Note that both Python and R have rich and powerful functional programming tools available (like map), but we’ll park those for now.
You’ll need to use loops, or (much nicer) list comprehensions, in base Python. There’s no exact built-in counterpart of R’s vectorized functions:
double_numbers_loop = []
for n in numbers_list:
    double_numbers_loop.append(n * 2)
double_numbers_loop
[10, 8, 6, 4, 2]
List comprehension
Like a lovely lightweight loop syntax
double_numbers_list = [n*2 for n in numbers_list]
double_numbers_list
[10, 8, 6, 4, 2]
# and, more fancy...
double_even_numbers_list = [n*2 for n in numbers_list if (n%2 == 0) ]
double_even_numbers_list
[8, 4]
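The same comprehension pattern also builds dicts and sets, and has a functional equivalent in map. A minimal sketch, using a fresh list with the same values as the reversed numbers_list. The other names are illustrative.

```python
numbers = [5, 4, 3, 2, 1]   # same values as the reversed numbers_list

doubled = [n * 2 for n in numbers]                    # list comprehension
doubled_map = list(map(lambda n: n * 2, numbers))     # functional equivalent using map

squares_by_number = {n: n ** 2 for n in numbers}      # dict comprehension
remainders = {n % 3 for n in numbers}                 # set comprehension, duplicates dropped

print(doubled, doubled_map, squares_by_number, remainders)
```

map returns a lazy iterator, which is why it is wrapped in list() here.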
By and large, R is at its best with vectorized functions:
double_numbers_vector <- numbers_vector * 2
double_numbers_vector
[1] 2 4 6 8 10
Loops are possible too
double_numbers_loop <- vector("numeric", length = length(numbers_vector))
for (i in seq_along(numbers_vector)) {
  double_numbers_loop[i] <- numbers_vector[i] * 2 # index by position, not by value
}
R has copy-on-modify semantics, and so care needs to be taken to avoid writing poorly-performing loops. That means that loops are used comparatively rarely in R.
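By contrast, Python lists are changed in place, as numbers_list.reverse() did earlier. One consequence, sketched here with illustrative names, is that assigning a list to a second name does not copy it.

```python
original = [1, 2, 3]
alias = original                 # a second name for the same list, not a copy
alias.append(4)                  # changes the one shared list

independent = original.copy()    # an explicit (shallow) copy
independent.append(5)            # only the copy changes

print(original, alias, independent)
```

After this runs, original and alias both show the appended 4, while the 5 only appears in independent.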
Tabular data basics
- we’ll do a quick overview of pandas, based on their excellent 10 minute overview
- our botw object is a DataFrame, which behaves much like a dict of columns
- like tibbles, DataFrames can contain columns of different types
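To make the link between dicts and DataFrames concrete, here is a minimal sketch that builds a small DataFrame directly from a dict. The books name is invented for illustration, the three rows are taken from the start of the dataset, and pandas is assumed to be installed.

```python
import pandas as pd

# Each dict key becomes a column name, and each list becomes that column's values
books = pd.DataFrame({
    "Author": ["Simon Singh", "Kevin Mitnick", "David Oldroyd"],
    "Year": [1999, 2011, 1986],
})

print(books.shape)            # rows, then columns
print(books.dtypes)           # one type per column
print(list(books.columns))    # column names, in dict order
```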
botw.dtypes # find out what we're dealing with
Date object
Author object
Year int64
Title object
ISBN object
Worldcat object
KnowledgeNetwork object
Description object
dtype: object
botw.head() # shows first few rows
Date ... Description
0 06/03/2024 ... The Code Book: The Secret History of Codes and...
1 13/03/2024 ... Here's a book of the week suggestion following...
2 20/03/2024 ... NaN
3 27/03/2024 ... NaN
4 24/04/2024 ... We're looking at regular expressions in the co...
[5 rows x 8 columns]
botw.index # effectively counts rows
RangeIndex(start=0, stop=30, step=1)
botw.columns # gives column names
Index(['Date', 'Author', 'Year', 'Title', 'ISBN', 'Worldcat',
'KnowledgeNetwork', 'Description'],
dtype='object')
botw.describe() # simple summary
Year
count 30.000000
mean 2007.466667
std 14.505013
min 1954.000000
25% 2001.250000
50% 2011.000000
75% 2017.500000
max 2022.000000
botw.sort_values("Year") # sorting by column values
Date ... Description
20 21/08/2024 ... This week's book of the week was suggested by ...
9 29/05/2024 ... If last week's book was a paean to the use of ...
2 20/03/2024 ... NaN
6 08/05/2024 ... After the discussion last week about the troub...
0 06/03/2024 ... The Code Book: The Secret History of Codes and...
15 10/07/2024 ... There are a lot of statistics textbooks out th...
13 26/06/2024 ... We're still on a mini-exploration of manufactu...
12 19/06/2024 ... Last week's recommendation about agnotology sp...
26 02/10/2024 ... How do you communicate risks? For many of us w...
10 05/06/2024 ... It's now close to twenty years old, and deals ...
28 06/11/2024 ... This week's recommendation comes from Alupha C...
21 28/08/2024 ... This week's book of the week was suggested by ...
7 15/05/2024 ... If I was posh enough to have a Latin motto, it...
1 13/03/2024 ... Here's a book of the week suggestion following...
22 04/09/2024 ... This week's book of the week was suggested by ...
18 07/08/2024 ... This is an excellent introduction to disease g...
16 17/07/2024 ... This book suggestion comes from a conversation...
11 12/06/2024 ... While the word [agnotology](https://simple.wik...
19 14/08/2024 ... This is a fun and thought-provoking set of ess...
17 31/07/2024 ... If you've ever been stunned by an unexpectedly...
3 27/03/2024 ... NaN
4 24/04/2024 ... We're looking at regular expressions in the co...
8 22/05/2024 ... A love-letter to the power of domain knowledge...
14 03/07/2024 ... This week's BotW suggestion comes from Anna Sc...
5 01/05/2024 ... Anyone who works with data knows that our data...
23 11/09/2024 ... Rosalyn Pearson, a Senior Information Analyst ...
29 13/11/2024 ... This is a high-risk recommendation, because I'...
24 18/09/2024 ... A possibly-controversial choice this week, wit...
27 30/10/2024 ... This week's recommendation comes from Kelsey P...
25 25/09/2024 ... A recommendation this week from Vasudha Singh,...
[30 rows x 8 columns]
botw["Date"] # selecting a column and creating a series
0 06/03/2024
1 13/03/2024
2 20/03/2024
3 27/03/2024
4 24/04/2024
5 01/05/2024
6 08/05/2024
7 15/05/2024
8 22/05/2024
9 29/05/2024
10 05/06/2024
11 12/06/2024
12 19/06/2024
13 26/06/2024
14 03/07/2024
15 10/07/2024
16 17/07/2024
17 31/07/2024
18 07/08/2024
19 14/08/2024
20 21/08/2024
21 28/08/2024
22 04/09/2024
23 11/09/2024
24 18/09/2024
25 25/09/2024
26 02/10/2024
27 30/10/2024
28 06/11/2024
29 13/11/2024
Name: Date, dtype: object
botw[2:4] # subsetting by index using a slice and returning a DataFrame
Date Author ... KnowledgeNetwork Description
2 20/03/2024 David Oldroyd ... NaN NaN
3 27/03/2024 Katrine Marçal ... NaN NaN
[2 rows x 8 columns]
botw[["Date"]] # subsetting entire columns
Date
0 06/03/2024
1 13/03/2024
2 20/03/2024
3 27/03/2024
4 24/04/2024
5 01/05/2024
6 08/05/2024
7 15/05/2024
8 22/05/2024
9 29/05/2024
10 05/06/2024
11 12/06/2024
12 19/06/2024
13 26/06/2024
14 03/07/2024
15 10/07/2024
16 17/07/2024
17 31/07/2024
18 07/08/2024
19 14/08/2024
20 21/08/2024
21 28/08/2024
22 04/09/2024
23 11/09/2024
24 18/09/2024
25 25/09/2024
26 02/10/2024
27 30/10/2024
28 06/11/2024
29 13/11/2024
botw.loc[4] # subsetting a single row by index label and returning a series
Date 24/04/2024
Author Tom Lean
Year 2016
Title Electronic Dreams: How 1980s Britain Learned t...
ISBN 978-1472918338
Worldcat https://search.worldcat.org/title/907966036
KnowledgeNetwork NaN
Description We're looking at regular expressions in the co...
Name: 4, dtype: object
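A note on loc: it selects by index label rather than by position, and the two only coincide in botw because the default index runs 0, 1, 2 and so on. A minimal sketch of the difference, assuming pandas is installed. The books name and the custom index values are made up for illustration.

```python
import pandas as pd

books = pd.DataFrame(
    {"Author": ["Simon Singh", "Kevin Mitnick", "David Oldroyd"],
     "Year": [1999, 2011, 1986]},
    index=[10, 11, 12],   # deliberately not 0, 1, 2
)

by_label = books.loc[11]        # the row whose index label is 11, as a Series
by_position = books.iloc[1]     # the second row by position, which is the same row

# Passing a list of labels keeps the result as a one-row DataFrame
one_row_frame = books.loc[[11], ["Author", "Year"]]

print(by_label, by_position, one_row_frame, sep="\n")
```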
botw.loc[4, ["Author", "Year"]] # subsetting by index label and columns; a single row still comes back as a series
Author Tom Lean
Year 2016
Name: 4, dtype: object
botw[botw["Year"] > 2010].sort_values("Year") # subsetting by years, and sorting
Date ... Description
1 13/03/2024 ... Here's a book of the week suggestion following...
18 07/08/2024 ... This is an excellent introduction to disease g...
22 04/09/2024 ... This week's book of the week was suggested by ...
11 12/06/2024 ... While the word [agnotology](https://simple.wik...
16 17/07/2024 ... This book suggestion comes from a conversation...
19 14/08/2024 ... This is a fun and thought-provoking set of ess...
4 24/04/2024 ... We're looking at regular expressions in the co...
3 27/03/2024 ... NaN
17 31/07/2024 ... If you've ever been stunned by an unexpectedly...
8 22/05/2024 ... A love-letter to the power of domain knowledge...
14 03/07/2024 ... This week's BotW suggestion comes from Anna Sc...
5 01/05/2024 ... Anyone who works with data knows that our data...
23 11/09/2024 ... Rosalyn Pearson, a Senior Information Analyst ...
24 18/09/2024 ... A possibly-controversial choice this week, wit...
29 13/11/2024 ... This is a high-risk recommendation, because I'...
25 25/09/2024 ... A recommendation this week from Vasudha Singh,...
27 30/10/2024 ... This week's recommendation comes from Kelsey P...
[17 rows x 8 columns]
botw[botw["Author"].isin(["Katrine Marçal", "Caroline Criado Perez"])] # finding matching values
Date ... Description
3 27/03/2024 ... NaN
5 01/05/2024 ... Anyone who works with data knows that our data...
[2 rows x 8 columns]
botw.dropna() # removes any rows that contain missing values
Date ... Description
11 12/06/2024 ... While the word [agnotology](https://simple.wik...
14 03/07/2024 ... This week's BotW suggestion comes from Anna Sc...
15 10/07/2024 ... There are a lot of statistics textbooks out th...
17 31/07/2024 ... If you've ever been stunned by an unexpectedly...
20 21/08/2024 ... This week's book of the week was suggested by ...
22 04/09/2024 ... This week's book of the week was suggested by ...
23 11/09/2024 ... Rosalyn Pearson, a Senior Information Analyst ...
26 02/10/2024 ... How do you communicate risks? For many of us w...
[8 rows x 8 columns]
botw["Title"].str.lower() # returning the title column as a lower-case series
0 the code book
1 ghost in the wires
2 the arch of knowledge
3 who cooked adam smith's dinner
4 electronic dreams: how 1980s britain learned t...
5 invisible women: exposing data bias in a world...
6 the mismeasure of man (2nd ed)
7 being wrong: adventures in the margin of error
8 bad blood: secrets and lies in a silicon valle...
9 genesis and development of a scientific fact
10 in the beginning was the worm: finding the sec...
11 merchants of doubt
12 harvey's heart: the discovery of blood circula...
13 dark remedy: the impact of thalidomide and its...
14 how emotions are made
15 medical statistics at a glance
16 the half-life of facts
17 weapons of math destruction
18 disease maps: epidemics on the ground
19 the utopia of rules: on technology, stupidity,...
20 how to lie with statistics
21 "clean code: a handbook of agile software cra...
22 thinking, fast and slow
23 the 7 deadly sins of psychology
24 what tech calls thinking: an inquiry into the ...
25 how do you know if you are making a difference...
26 reckoning with risk: learning to live with unc...
27 hybrid humans: dispatches from the frontiers o...
28 asking the right questions: a guide to critica...
29 the alignment problem
Name: Title, dtype: object
botw["Date"] = pandas.to_datetime(botw["Date"], format='%d/%m/%Y') # parsing the BotW dates
botw.groupby(pd.DatetimeIndex(botw['Date']).month)[["Year"]].mean() # average year of publication by month of botw
Year
Date
3 2003.0
4 2016.0
5 2004.4
6 2004.5
7 2011.5
8 1997.0
9 2018.0
10 2012.0
11 2013.0
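The group-and-summarise step above can be easier to follow on a few rows. This is a minimal sketch, assuming pandas is installed, using three rows taken from the dataset and an invented books name. It groups on the month of the parsed date using the .dt accessor, one common alternative to wrapping the column in a DatetimeIndex.

```python
import pandas as pd

books = pd.DataFrame({
    "Date": ["06/03/2024", "13/03/2024", "24/04/2024"],
    "Year": [1999, 2011, 2016],
})

# Parse the day/month/year strings into proper dates
books["Date"] = pd.to_datetime(books["Date"], format="%d/%m/%Y")

# Group by calendar month, then average the publication year within each month
mean_year = books.groupby(books["Date"].dt.month)["Year"].mean()

print(mean_year)
```

The result is a Series indexed by month number, with one mean per month.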
library(dplyr) # we'll need dplyr for this work
str(botw) # shows data types etc
spc_tbl_ [30 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ Date : chr [1:30] "06/03/2024" "13/03/2024" "20/03/2024" "27/03/2024" ...
$ Author : chr [1:30] "Simon Singh" "Kevin Mitnick" "David Oldroyd" "Katrine Marçal" ...
$ Year : chr [1:30] "1999" "2011" "1986" "2016" ...
$ Title : chr [1:30] "The Code Book" "Ghost in the Wires" "The Arch of Knowledge" "Who Cooked Adam Smith's Dinner" ...
$ ISBN : chr [1:30] "978-1857028898" "978-0316037723" "978-0416013313" "978-1846275661" ...
$ Worldcat : chr [1:30] "https://search.worldcat.org/title/59579840" "https://search.worldcat.org/title/773175688" "https://search.worldcat.org/title/12663957" "https://search.worldcat.org/title/933444501" ...
$ KnowledgeNetwork: chr [1:30] NA NA NA NA ...
$ Description : chr [1:30] "The Code Book: The Secret History of Codes and Code-Breaking a book by . (bookshop.org) (to buy online but supp"| __truncated__ "Here's a book of the week suggestion following on from the codes theme from last time. It's the autobiography o"| __truncated__ NA NA ...
- attr(*, "spec")=
.. cols(
.. Date = col_character(),
.. Author = col_character(),
.. Year = col_character(),
.. Title = col_character(),
.. ISBN = col_character(),
.. Worldcat = col_character(),
.. KnowledgeNetwork = col_character(),
.. Description = col_character()
.. )
- attr(*, "problems")=<externalptr>
head(botw) # shows first few rows
# A tibble: 6 × 8
Date Author Year Title ISBN Worldcat KnowledgeNetwork Description
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 06/03/2024 Simon Singh 1999 The … 978-… https:/… <NA> The Code B…
2 13/03/2024 Kevin Mitn… 2011 Ghos… 978-… https:/… <NA> Here's a b…
3 20/03/2024 David Oldr… 1986 The … 978-… https:/… <NA> <NA>
4 27/03/2024 Katrine Ma… 2016 Who … 978-… https:/… <NA> <NA>
5 24/04/2024 Tom Lean 2016 Elec… 978-… https:/… <NA> We're look…
6 01/05/2024 Caroline C… 2019 Invi… 978-… https:/… <NA> Anyone who…
nrow(botw) # counts rows
[1] 30
names(botw) # column names
[1] "Date" "Author" "Year" "Title"
[5] "ISBN" "Worldcat" "KnowledgeNetwork" "Description"
summary(botw)
Date Author Year Title
Length:30 Length:30 Length:30 Length:30
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
ISBN Worldcat KnowledgeNetwork Description
Length:30 Length:30 Length:30 Length:30
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
botw |>
  arrange(Year) # native pipe operator in R. Piped code in Python requires modules
# A tibble: 30 × 8
Date Author Year Title ISBN Worldcat KnowledgeNetwork Description
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 21/08/2024 Darrell H… 1954 How … 978-… https:/… https://nhs.pri… "This week…
2 29/05/2024 Ludwig Fl… 1979… Gene… 978-… https:/… <NA> "If last w…
3 20/03/2024 David Old… 1986 The … 978-… https:/… <NA> <NA>
4 08/05/2024 Stephen J… 1996 The … 978-… https:/… <NA> "After the…
5 06/03/2024 Simon Sin… 1999 The … 978-… https:/… <NA> "The Code …
6 10/07/2024 Aviva Pet… 2000 Medi… 978-… https:/… https://nhs.pri… "There are…
7 19/06/2024 Andrew Gr… 2001 Harv… 978-… https:/… <NA> "Last week…
8 26/06/2024 Trent D. … 2001 Dark… 978-… https:/… <NA> "We're sti…
9 02/10/2024 Gerd Gige… 2002 Reck… 978-… https:/… https://nhs.pri… "How do yo…
10 05/06/2024 Andrew Br… 2004 In t… 978-… https:/… <NA> "It's now …
# ℹ 20 more rows
botw$Date # selecting a column as a vector
[1] "06/03/2024" "13/03/2024" "20/03/2024" "27/03/2024" "24/04/2024"
[6] "01/05/2024" "08/05/2024" "15/05/2024" "22/05/2024" "29/05/2024"
[11] "05/06/2024" "12/06/2024" "19/06/2024" "26/06/2024" "03/07/2024"
[16] "10/07/2024" "17/07/2024" "31/07/2024" "07/08/2024" "14/08/2024"
[21] "21/08/2024" "28/08/2024" "04/09/2024" "11/09/2024" "18/09/2024"
[26] "25/09/2024" "02/10/2024" "30/10/2024" "06/11/2024" "13/11/2024"
botw |>
  slice(3:4) # subsetting by index using slice and returning a tibble. Note the different indexing behaviour
# A tibble: 2 × 8
Date Author Year Title ISBN Worldcat KnowledgeNetwork Description
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 20/03/2024 David Oldr… 1986 The … 978-… https:/… <NA> <NA>
2 27/03/2024 Katrine Ma… 2016 Who … 978-… https:/… <NA> <NA>
botw |>
  select(Date) # subsetting entire columns
# A tibble: 30 × 1
Date
<chr>
1 06/03/2024
2 13/03/2024
3 20/03/2024
4 27/03/2024
5 24/04/2024
6 01/05/2024
7 08/05/2024
8 15/05/2024
9 22/05/2024
10 29/05/2024
# ℹ 20 more rows
as.character(botw[5,]) # subsetting by index and coercing to a vector. This is pretty non-idiomatic in R
[1] "24/04/2024"
[2] "Tom Lean"
[3] "2016"
[4] "Electronic Dreams: How 1980s Britain Learned to Love the Computer"
[5] "978-1472918338"
[6] "https://search.worldcat.org/title/907966036"
[7] NA
[8] "We're looking at regular expressions in the community meetup today. Regex, as the wikipedia page suggests, have been around for ages - positively archaeological in computing terms. So for the book of the week this week, I wanted to show off one of the most interesting bits of social history I've read: Tom Lean's Electronic Dreams. Lots of the history of computing is either primarily about the technical details, or is a broadly nostalgic look at obsolete tech. This book doesn't do either of those, instead spending its time giving a concise account of how personal computing worked as a social phenomenon. For example, how did people start getting paid to write computer games? What happened when the BBC got involved in personal computing? What happened to the various promises of digital revolutions as a replacement for manufacturing industries."
botw[5,] |>
  select(Author, Year) # subsetting by index and columns and returning a tibble
# A tibble: 1 × 2
Author Year
<chr> <chr>
1 Tom Lean 2016
botw |>
  filter(Year > 2010) |>
  arrange(Year) # subsetting by filtering years, then sorting using dplyr
# A tibble: 17 × 8
Date Author Year Title ISBN Worldcat KnowledgeNetwork Description
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 13/03/2024 Kevin Mit… 2011 Ghos… 978-… https:/… <NA> "Here's a …
2 07/08/2024 Tom Koch 2011 Dise… 978-… https:/… <NA> "This is a…
3 04/09/2024 Daniel Ka… 2011 Thin… 978-… https:/… https://nhs.pri… "This week…
4 12/06/2024 Naomi Ore… 2012 Merc… 978-… https:/… https://nhs.pri… "While the…
5 17/07/2024 Samuel Ar… 2012 The … 978-… https:/… <NA> "This book…
6 14/08/2024 David Gra… 2015 The … 978-… https:/… <NA> "This is a…
7 27/03/2024 Katrine M… 2016 Who … 978-… https:/… <NA> <NA>
8 24/04/2024 Tom Lean 2016 Elec… 978-… https:/… <NA> "We're loo…
9 31/07/2024 Cathy O'N… 2016 Weap… 978-… https:/… https://nhs.pri… "If you've…
10 22/05/2024 John Carr… 2018 Bad … 978-… https:/… <NA> "A love-le…
11 03/07/2024 Lisa Feld… 2018 How … 978-… https:/… https://nhs.pri… "This week…
12 01/05/2024 Caroline … 2019 Invi… 978-… https:/… <NA> "Anyone wh…
13 11/09/2024 Chris Cha… 2019 The … 978-… https:/… https://nhs.pri… "Rosalyn P…
14 18/09/2024 Adrian Da… 2020 What… 978-… https:/… <NA> "A possibl…
15 13/11/2024 Brian Chr… 2020 The … 978-… https:/… <NA> "This is a…
16 25/09/2024 Sarah Mor… 2022 How … 978-… https:/… <NA> "A recomme…
17 30/10/2024 Harry Par… 2022 Hybr… 978-… https:/… <NA> "This week…
botw[which(botw$Author %in% c("Katrine Marçal", "Caroline Criado Perez")),] # finding matching values using base R
# A tibble: 2 × 8
Date Author Year Title ISBN Worldcat KnowledgeNetwork Description
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 27/03/2024 Katrine Ma… 2016 Who … 978-… https:/… <NA> <NA>
2 01/05/2024 Caroline C… 2019 Invi… 978-… https:/… <NA> Anyone who…
botw |>
  tidyr::drop_na() # removes any rows that contain missing values
# A tibble: 8 × 8
Date Author Year Title ISBN Worldcat KnowledgeNetwork Description
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 12/06/2024 Naomi Ores… 2012 Merc… 978-… https:/… https://nhs.pri… While the …
2 03/07/2024 Lisa Feldm… 2018 How … 978-… https:/… https://nhs.pri… This week'…
3 10/07/2024 Aviva Petr… 2000 Medi… 978-… https:/… https://nhs.pri… There are …
4 31/07/2024 Cathy O'Ne… 2016 Weap… 978-… https:/… https://nhs.pri… If you've …
5 21/08/2024 Darrell Hu… 1954 How … 978-… https:/… https://nhs.pri… This week'…
6 04/09/2024 Daniel Kah… 2011 Thin… 978-… https:/… https://nhs.pri… This week'…
7 11/09/2024 Chris Cham… 2019 The … 978-… https:/… https://nhs.pri… Rosalyn Pe…
8 02/10/2024 Gerd Giger… 2002 Reck… 978-… https:/… https://nhs.pri… How do you…
botw$Title |>
  tolower() # returning the title column as a lower-case vector
[1] "the code book"
[2] "ghost in the wires"
[3] "the arch of knowledge"
[4] "who cooked adam smith's dinner"
[5] "electronic dreams: how 1980s britain learned to love the computer"
[6] "invisible women: exposing data bias in a world designed for men"
[7] "the mismeasure of man (2nd ed)"
[8] "being wrong: adventures in the margin of error"
[9] "bad blood: secrets and lies in a silicon valley startup"
[10] "genesis and development of a scientific fact"
[11] "in the beginning was the worm: finding the secrets of life in a tiny hermaphrodite"
[12] "merchants of doubt"
[13] "harvey's heart: the discovery of blood circulation"
[14] "dark remedy: the impact of thalidomide and its revival as a vital medicine"
[15] "how emotions are made"
[16] "medical statistics at a glance"
[17] "the half-life of facts"
[18] "weapons of math destruction"
[19] "disease maps: epidemics on the ground"
[20] "the utopia of rules: on technology, stupidity, and the secret joys of bureaucracy"
[21] "how to lie with statistics"
[22] "\"clean code: a handbook of agile software craftsmanship\""
[23] "thinking, fast and slow"
[24] "the 7 deadly sins of psychology"
[25] "what tech calls thinking: an inquiry into the intellectual bedrock of silicon valley"
[26] "how do you know if you are making a difference? a practical handbook for public service organisations"
[27] "reckoning with risk: learning to live with uncertainty"
[28] "hybrid humans: dispatches from the frontiers of man and machine"
[29] "asking the right questions: a guide to critical thinking"
[30] "the alignment problem"
botw |>
  mutate(Date = lubridate::dmy(Date)) |> # parsing the BotW dates
  group_by(month = lubridate::floor_date(Date, unit = "month")) |>
  summarise(mean_year = mean(as.numeric(Year), na.rm = TRUE)) # average year of publication by month of botw
# A tibble: 9 × 2
month mean_year
<date> <dbl>
1 2024-03-01 2003
2 2024-04-01 2016
3 2024-05-01 2011.
4 2024-06-01 2004.
5 2024-07-01 2012.
6 2024-08-01 1997
7 2024-09-01 2018
8 2024-10-01 2012
9 2024-11-01 2013
Plots
Using matplotlib
import matplotlib.pyplot as plt
plt.hist(botw["Year"], bins = [1970, 1980, 1990, 2000, 2010, 2020])
plt.title("The KIND network BotW is biased towards newer books")
library(ggplot2)
botw |>
  mutate(Year = readr::parse_number(Year)) |>
  ggplot() +
  geom_histogram(aes(x = Year), fill="#1F77B4", binwidth = 10, center = 1985) +
  ggtitle("The KIND network BotW is biased towards newer books") +
  theme_minimal()