"hello world"
[1] "hello world"
KIND learning network training materials by KIND learning network is licensed under CC BY-SA 4.0
September 24, 2024
Type demo(image) at the > prompt and press Enter
Ctrl + Enter to run your script
[1] "hello"
[1] "hello" "hello"
[1] 8 6 14 110
[1] FALSE FALSE FALSE TRUE
[1] "hello"
[1] "another"
[1] "another" "indexing"
[1] "character"
[1] "integer"
[1] "double"
[1] "logical"
# factors - the odd one
# mainly a way of storing categorical data, especially when you need it in non-alphabetical order
factor(c("thing", "string", "wing", "bling")) # alphabetical
[1] thing string wing bling
Levels: bling string thing wing
ing_things <- factor(c("thing", "string", "wing", "bling"), levels = c("wing", "bling", "string", "thing")) # custom order set by levels, not alphabetical
ing_things
[1] thing string wing bling
Levels: wing bling string thing
[1] string
Levels: wing bling string thing
# the list = a vector of vectors
# ragged - can store different kinds of values together
list(hh, hi, hw, ing_things)
[[1]]
[1] "hello" "hello"
[[2]]
[1] "hello"
[[3]]
[1] "hello world"
[[4]]
[1] thing string wing bling
Levels: wing bling string thing
$hw
[1] "hello" "hello"
$hi
[1] "hello"
$hw
[1] "hello world"
$silly_name
[1] thing string wing bling
Levels: wing bling string thing
[1] "list"
[1] thing string wing bling
Levels: wing bling string thing
[1] thing string wing bling
Levels: wing bling string thing
hw1 hw2 hi hw silly_name1
"hello" "hello" "hello" "hello world" "4"
silly_name2 silly_name3 silly_name4
"3" "1" "2"
Ctrl + ⏎
<-
The range operator is an easy way of making integer sequences:
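A minimal sketch of the range operator, with values chosen purely for illustration:

```r
# start:end builds every whole number from start to end, inclusive
1:5

# the result is an integer vector
class(1:5)

# ranges can count down as well as up
10:7
```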
There’s always a fancier way too:
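One candidate for the fancier way is base R's seq(). This is an assumption, as the original example isn't shown here. A short sketch:

```r
# seq() can step by something other than 1...
seq(from = 1, to = 10, by = 2)

# ...or produce a fixed number of evenly spaced values
seq(from = 0, to = 1, length.out = 5)
```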
Really important for lots of programming things
Note that most of these functions are vectorised, but will require you to use c() if you want to supply your values directly (i.e. if you don’t want to make a variable containing your values first). sum() is a rare exception:
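As a sketch of that exception (the values 1, 5 and 10 are borrowed from the mean() example in this section):

```r
# sum() adds up all of its arguments, so both forms give the same total
sum(1, 5, 10)
sum(c(1, 5, 10))

# mean() treats only its first argument as data, so without c()
# the result is not the mean of the three values
mean(1, 5, 10)
```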
[1] 16
[1] 16
mean(c(1,5,10)) # wrapping the values in c() is the general way to supply them directly to a function
[1] 5.333333
[1] 1.414214
[1] 21
[1] 3 9 14 18 21
[1] 1.732051 2.449490 2.236068 2.000000 1.732051
[1] 4.2
[1] 4
[1] 3
[1] 6
For odd reasons, there’s no built-in function to find the statistical mode of some numbers. It can be done, but the code is ugly (and exactly the sort of thing we’d usually avoid in beginner’s sessions). Included here for interest only:
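One possible way of finding the mode is sketched below. It is not necessarily the original code. It assumes that numbers holds 3, 6, 5, 4, 3 (which matches the outputs in this section), and stat_mode is a made-up helper name:

```r
numbers <- c(3, 6, 5, 4, 3) # assumed values, consistent with the outputs shown here

# stat_mode() is a hypothetical helper, not a built-in function
stat_mode <- function(x) {
  counts <- table(x)                                   # how often each value occurs
  top <- names(counts)[as.vector(counts) == max(counts)] # every value tied for most frequent
  as.numeric(top)                                      # table() stores the values as text, so convert back
}

stat_mode(numbers)
```

If several values tie for most frequent, this version returns all of them.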
There are also a few other fairly basic functions that you might find helpful:
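The outputs that follow are consistent with the base-R calls below. This is a reconstruction rather than the original code, and it assumes numbers holds 3, 6, 5, 4, 3:

```r
numbers <- c(3, 6, 5, 4, 3) # assumed values

sd(numbers)      # standard deviation
range(numbers)   # smallest and largest value
summary(numbers) # minimum, quartiles, median, mean and maximum in one go
table(numbers)   # how many times each value occurs
```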
[1] 1.30384
[1] 3 6
Min. 1st Qu. Median Mean 3rd Qu. Max.
3.0 3.0 4.0 4.2 5.0 6.0
numbers
3 4 5 6
2 1 1 1
There are three main ways of doing this. Traditionally, you’d bracket together several functions, and read from the inside out. Fastest to write, hardest to read and fix:
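A small sketch of the bracketed style, using an illustrative calculation (the steps and the values of numbers are assumptions):

```r
numbers <- c(3, 6, 5, 4, 3) # assumed values

# read from the inside out: sum first, then square root, then round
round(sqrt(sum(numbers)))
```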
or you can make intervening variables. Messy, but good if you need to be extra careful:
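The same illustrative calculation with intervening variables (total and root are made-up names):

```r
numbers <- c(3, 6, 5, 4, 3) # assumed values

total <- sum(numbers) # step 1: add everything up
root <- sqrt(total)   # step 2: square root of the total
round(root)           # step 3: round to the nearest whole number
```

Each step can be inspected on its own, which is what makes this style useful when you need to be careful.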
or, probably the best way, pipe the code together. Ctrl + Shift + m will give you a pipe symbol:
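A sketch of the piped version of the same illustrative calculation. It uses the native pipe |>, which needs R 4.1 or later. Depending on your RStudio settings, the shortcut may insert %>% instead:

```r
numbers <- c(3, 6, 5, 4, 3) # assumed values

# read top to bottom: take numbers, then sum, then square root, then round
numbers |>
  sum() |>
  sqrt() |>
  round()
```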
Note that the pipe method doesn’t automatically save your output. You’ll need to assign with <- to do that:
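For instance, to keep the result of a pipeline (rounded_root is a made-up name, and the values are assumed):

```r
numbers <- c(3, 6, 5, 4, 3) # assumed values

# the assignment goes at the start; the whole pipeline's result is stored
rounded_root <- numbers |>
  sum() |>
  sqrt() |>
  round()

rounded_root
```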
[1] "hello world"
[1] "HELLO WORLD"
[1] "this" "is" "a" "length" "seven" "character"
[7] "vector"
[1] "THIS" "IS" "A" "LENGTH" "SEVEN" "CHARACTER"
[7] "VECTOR"
[1] "hello world hello world"
[1] "just a string ed instrument"
[1] "question 3" "question 6" "question 5" "question 4" "question 3"
[1] "hello world" "hello world" "hello world" "hello world" "hello world"
[6] "hello world" "hello world" "hello world" "hello world" "hello world"
[[1]]
[1] "hello" "world"
[1] "hello" "world"
[1] 5
[1] FALSE FALSE FALSE FALSE TRUE FALSE FALSE
[1] "a" "character" "is" "length" "seven" "this"
[7] "vector"
longer_string
a character is length seven this vector
1 1 1 1 1 1 1
Don’t repeat your code. Long code is hard to read and understand. Three basic design patterns: the function, the loop, the if/else.
# basic function syntax
# need to run the definition before calling it
function_name <- function(argument){
  # some code doing something to the argument
  argument + 4 # the function will return the last value it produces
}
function_name(3)
[1] 7
# challenge - I'm bored of writing na.rm = TRUE. Could you make mean() automatically ignore the missing values?
new_mean <- function(x){
  mean(x, na.rm = TRUE)
}
new_mean(c(1,4,2,4,NA))
[1] 2.75
# seq_along as a sensible safe way to work with vectors
for(i in seq_along(numbers)){ # seq_along converts a vector into sequential integers 1,2,3,4... up to the length of the vector
  print(i)
}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] "c is true"
[1] "a and/or b are three"
[1] "nope"
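The last three outputs illustrate the if/else pattern. The sketch below shows how messages like these could be produced. The values of a, b and c and the exact conditions are assumptions, not the original code:

```r
# hypothetical values, chosen so that each branch below is easy to follow
a <- 3
b <- 5
c <- TRUE

# a lone if: the code runs only when the condition is TRUE
if (c) {
  print("c is true")
}

# conditions can be combined with || (or) and && (and)
if (a == 3 || b == 3) {
  print("a and/or b are three")
}

# else supplies a fallback for when the condition is FALSE
if (a > b) {
  print("yep")
} else {
  print("nope")
}
```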
We can bring in external code to help us with R. That external code is known as a package. There are thousands of packages in current use, as the relevant pages on CRAN will tell you.
We need to install packages before we can use them. That only needs to be done once for your R setup. To illustrate, let’s install a package, called palmerpenguins, which contains some interesting data:
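A sketch of the install step. The requireNamespace() check and the explicit CRAN mirror are additions so that the code can be re-run safely from a script. In an interactive session, install.packages("palmerpenguins") alone is usually enough:

```r
# install once per R setup; skipped if the package is already available
if (!requireNamespace("palmerpenguins", quietly = TRUE)) {
  install.packages("palmerpenguins", repos = "https://cloud.r-project.org")
}
```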
Once that package is installed, we can use the data (and functions) it contains by attaching them to our current script:
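Attaching is done with library(), and needs repeating in each new R session or script. A minimal sketch, assuming palmerpenguins is already installed:

```r
library(palmerpenguins) # attach the package so its data and functions can be used by name

exists("penguins") # check that the dataset is now visible
```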
Once we’ve done that, we’ll have several new items available to use. The most important here is the main penguins dataset:
# A tibble: 344 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
That’s tabular data - so formed into rows and columns, rectangular (so all columns the same lengths etc), and with each column containing only one type of data. Tabular data is probably the most widely used type of data in R. That means that there are lots of tools for working with it. Some basic examples:
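The outputs that follow are consistent with these base-R calls. This is a reconstruction, assuming palmerpenguins is installed:

```r
library(palmerpenguins)

nrow(penguins)  # number of rows
ncol(penguins)  # number of columns
head(penguins)  # the first six rows
names(penguins) # the column names
```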
[1] 344
[1] 8
# A tibble: 6 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
# ℹ 2 more variables: sex <fct>, year <int>
[1] "species" "island" "bill_length_mm"
[4] "bill_depth_mm" "flipper_length_mm" "body_mass_g"
[7] "sex" "year"
As well as those base-R functions, there are also many packages for working with tabular data. Probably the best-known package is dplyr, which we install and attach in the same way as palmerpenguins:
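A sketch of installing and attaching dplyr. The requireNamespace() check and the explicit CRAN mirror are additions for safe re-running:

```r
# install once; skipped if dplyr is already available
if (!requireNamespace("dplyr", quietly = TRUE)) {
  install.packages("dplyr", repos = "https://cloud.r-project.org")
}

library(dplyr) # attach for the current session
```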
The reason that dplyr is so popular is that some of the base-R ways of working with tabular data are a bit messy and hard to read:
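For example, base-R subsetting along these lines gives the first four species and island values. The exact calls are an assumption; the original code isn't shown:

```r
library(palmerpenguins)

# $ pulls out a column as a vector, then [1:4] takes its first four values
penguins$species[1:4]

# [[ ]] does the same job using the column name as a string
penguins[["island"]][1:4]
```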
[1] Adelie Adelie Adelie Adelie
Levels: Adelie Chinstrap Gentoo
[1] Torgersen Torgersen Torgersen Torgersen
Levels: Biscoe Dream Torgersen
dplyr generally produces much easier-to-read code, especially when using the pipe to bring together lines of code:
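The four previews that follow are consistent with select() calls like these. This is a reconstruction, assuming dplyr and palmerpenguins are installed:

```r
library(palmerpenguins)
library(dplyr)

penguins |> select(island)                             # keep a single column
penguins |> select(-island)                            # keep everything except one column
penguins |> select(species, flipper_length_mm, island) # keep several columns, in this order
penguins |> select(home_island = island)               # keep a column and rename it in one step
```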
# A tibble: 344 × 1
island
<fct>
1 Torgersen
2 Torgersen
3 Torgersen
4 Torgersen
5 Torgersen
6 Torgersen
7 Torgersen
8 Torgersen
9 Torgersen
10 Torgersen
# ℹ 334 more rows
# A tibble: 344 × 7
species bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
<fct> <dbl> <dbl> <int> <int> <fct>
1 Adelie 39.1 18.7 181 3750 male
2 Adelie 39.5 17.4 186 3800 female
3 Adelie 40.3 18 195 3250 female
4 Adelie NA NA NA NA <NA>
5 Adelie 36.7 19.3 193 3450 female
6 Adelie 39.3 20.6 190 3650 male
7 Adelie 38.9 17.8 181 3625 female
8 Adelie 39.2 19.6 195 4675 male
9 Adelie 34.1 18.1 193 3475 <NA>
10 Adelie 42 20.2 190 4250 <NA>
# ℹ 334 more rows
# ℹ 1 more variable: year <int>
# A tibble: 344 × 3
species flipper_length_mm island
<fct> <int> <fct>
1 Adelie 181 Torgersen
2 Adelie 186 Torgersen
3 Adelie 195 Torgersen
4 Adelie NA Torgersen
5 Adelie 193 Torgersen
6 Adelie 190 Torgersen
7 Adelie 181 Torgersen
8 Adelie 195 Torgersen
9 Adelie 193 Torgersen
10 Adelie 190 Torgersen
# ℹ 334 more rows
# A tibble: 344 × 1
home_island
<fct>
1 Torgersen
2 Torgersen
3 Torgersen
4 Torgersen
5 Torgersen
6 Torgersen
7 Torgersen
8 Torgersen
9 Torgersen
10 Torgersen
# ℹ 334 more rows
A note here: the penguins object that we’re working with is technically called a tibble. dplyr is designed to work with tibbles and other data frames, and most of its functions won’t work properly on other kinds of data structure, such as plain vectors or lists. The main idea underlying dplyr is that the many functions it contains should all work consistently, and work well together. So once you’ve got the hang of select, there’s not much new to say about filter, which picks rows based on their values:
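A sketch of filter() calls that would give previews like the ones that follow. The conditions are assumptions, in particular the threshold of 55:

```r
library(palmerpenguins)
library(dplyr)

penguins |> filter(species == "Adelie")   # rows matching an exact value
penguins |> filter(bill_length_mm > 55)   # rows passing a numeric comparison (threshold assumed)
penguins |> filter(is.na(bill_length_mm)) # rows where a value is missing
```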
# A tibble: 152 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ 142 more rows
# ℹ 2 more variables: sex <fct>, year <int>
# A tibble: 5 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Gentoo Biscoe 59.6 17 230 6050
2 Gentoo Biscoe 55.9 17 228 5600
3 Gentoo Biscoe 55.1 16 230 5850
4 Chinstrap Dream 58 17.8 181 3700
5 Chinstrap Dream 55.8 19.8 207 4000
# ℹ 2 more variables: sex <fct>, year <int>
# A tibble: 2 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen NA NA NA NA
2 Gentoo Biscoe NA NA NA NA
# ℹ 2 more variables: sex <fct>, year <int>
And mutate, which makes new columns, will work in the same way:
penguins |>
  mutate(new_col = 11) |> # every row the same
  select(species, new_col) # so that we can see the new values in the preview
# A tibble: 344 × 2
species new_col
<fct> <dbl>
1 Adelie 11
2 Adelie 11
3 Adelie 11
4 Adelie 11
5 Adelie 11
6 Adelie 11
7 Adelie 11
8 Adelie 11
9 Adelie 11
10 Adelie 11
# ℹ 334 more rows
penguins |>
  mutate(bill_vol = bill_length_mm * bill_depth_mm^2) |> # some calculation
  select(species, bill_vol)
# A tibble: 344 × 2
species bill_vol
<fct> <dbl>
1 Adelie 13673.
2 Adelie 11959.
3 Adelie 13057.
4 Adelie NA
5 Adelie 13670.
6 Adelie 16677.
7 Adelie 12325.
8 Adelie 15059.
9 Adelie 11172.
10 Adelie 17138.
# ℹ 334 more rows
penguins |>
  mutate(label = paste("From", island, "island, a penguin of the species", species)) |>
  select(label, body_mass_g) # mutate and then select. You can use your new columns immediately.
# A tibble: 344 × 2
label body_mass_g
<chr> <int>
1 From Torgersen island, a penguin of the species Adelie 3750
2 From Torgersen island, a penguin of the species Adelie 3800
3 From Torgersen island, a penguin of the species Adelie 3250
4 From Torgersen island, a penguin of the species Adelie NA
5 From Torgersen island, a penguin of the species Adelie 3450
6 From Torgersen island, a penguin of the species Adelie 3650
7 From Torgersen island, a penguin of the species Adelie 3625
8 From Torgersen island, a penguin of the species Adelie 4675
9 From Torgersen island, a penguin of the species Adelie 3475
10 From Torgersen island, a penguin of the species Adelie 4250
# ℹ 334 more rows
As before, we need to assign with <- to save our changes. Let’s add the bill_vol column to the data now:
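A sketch of that assignment, reusing the bill_vol calculation from the mutate() example:

```r
library(palmerpenguins)
library(dplyr)

# overwrite penguins with a copy that has the extra column
penguins <- penguins |>
  mutate(bill_vol = bill_length_mm * bill_depth_mm^2)

names(penguins) # bill_vol now appears as the last column
```

This only changes the copy of penguins in the current session; the version stored in the package is untouched.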
arrange sorts rows by the values in one or more columns:
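A sketch of arrange(), sorting by bill_vol. This is a reconstruction, and smallest_first is a made-up name. The column is recreated here so that the example runs on its own:

```r
library(palmerpenguins)
library(dplyr)

smallest_first <- penguins |>
  mutate(bill_vol = bill_length_mm * bill_depth_mm^2) |>
  arrange(bill_vol) # ascending order by default; missing values go last

smallest_first

# wrap the column in desc() to sort from largest to smallest instead
smallest_first |>
  arrange(desc(bill_vol))
```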
# A tibble: 344 × 9
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Gentoo Biscoe 42.9 13.1 215 5000
2 Gentoo Biscoe 42 13.5 210 4150
3 Gentoo Biscoe 40.9 13.7 214 4650
4 Adelie Dream 32.1 15.5 188 3050
5 Gentoo Biscoe 43.3 13.4 209 4400
6 Gentoo Biscoe 44.9 13.3 213 5100
7 Gentoo Biscoe 42.6 13.7 213 4950
8 Gentoo Biscoe 42.7 13.7 208 3950
9 Gentoo Biscoe 46.1 13.2 211 4500
10 Gentoo Biscoe 44 13.6 208 4350
# ℹ 334 more rows
# ℹ 3 more variables: sex <fct>, year <int>, bill_vol <dbl>
The nice thing about dplyr is that there are several other packages which work in similar ways. This package ecosystem gets called the tidyverse, and is extremely widely used to do data science work in R. A close relative of dplyr is the readr package, which reads in data to R and makes it into tibbles:
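A minimal sketch of reading a CSV file with readr, assuming readr is installed. The file is a small temporary one made up for the example, so the path, column names and values are all hypothetical:

```r
library(readr)

# write a tiny example CSV to a temporary location (hypothetical data)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("species,body_mass_g", "Adelie,3750", "Gentoo,5000"), csv_path)

# read_csv() guesses the column types and returns a tibble
small_penguins <- read_csv(csv_path)
small_penguins
```

With your own data, replace csv_path with the path to your file.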