Filtering and aggregating data with SQL

SQL
beginner
Published

November 27, 2025

No feedback found for this session

Forthcoming session(s)
Booking link Date
Filtering and aggregating data with SQL 10:00-11:30 Tue 3rd February 2026
  • This course is designed as a basic introduction to SQL delivered as live interactive sessions on Teams
  • It’s intended as a light introduction for users with good general digital skills (broadly level 2 in the Digital and Data Capability Framework)
  • It’s heavily based on the excellent W3 schools introductory SQL course
    • If you’re looking for a resource to use to support independent study, W3 schools is a better option than this page
    • this material is largely intended to support our live interactive training sessions
  • we use posit.cloud for this course. Although that platform is mainly meant for analysts writing R code, we can trick it to allow us to practice our SQL skills

About this session

This session is all about using two key patterns in SQL:

  • filtering to find matching rows of data in a table
  • then aggregating that data to produce useful summaries

Setup

  • You will need a free posit.cloud account. Please set this up and check that you have access before the session begins

  • Once you’ve logged-in to posit.cloud, please create a new Rstudio project New Rstudio project

  • In that project, create a new R script: Create a new R script from the file menu

  • Copy and paste the following R code into that script:

## install and attach R packages
pkg <- c("dbplyr", "RSQLite", "palmerpenguins")
install.packages(setdiff(pkg, rownames(installed.packages())))  
library(dbplyr)
library(dplyr)
library(palmerpenguins)

conn <- src_memdb() # create an sqlite db in memory
copy_to(conn, penguins, overwrite = T) # populate that db from R
  • save your script and click the Source button
    • that will create a simple SQL database in memory
  • now create a new SQL script (again, from the File menu)
  • delete all the pre-populated lines of code, as follows: Pre-populated code in new SQL script
  • to connect with our SQL db in memory, add the following line to the head of your SQL script:
-- !preview conn=src_memdb()$con
  • finally, save your SQL script (any file name is fine)

Finding matching rows

  • we can filter rows by values with the WHERE keyword. For example:
SELECT * FROM penguins WHERE flipper_length_mm=181;
7 records
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
Adelie Torgersen 39.1 18.7 181 3750 male 2007
Adelie Torgersen 38.9 17.8 181 3625 female 2007
Adelie Dream 37.6 19.3 181 3300 female 2007
Adelie Biscoe 36.5 16.6 181 2850 female 2008
Adelie Biscoe 38.1 17.0 181 3175 female 2009
Chinstrap Dream 58.0 17.8 181 3700 female 2007
Chinstrap Dream 42.4 17.3 181 3600 female 2007
  • or:
SELECT * FROM penguins WHERE species='Adelie'; -- single quotes for preference
Displaying records 1 - 10
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
Adelie Torgersen 39.1 18.7 181 3750 male 2007
Adelie Torgersen 39.5 17.4 186 3800 female 2007
Adelie Torgersen 40.3 18.0 195 3250 female 2007
Adelie Torgersen NA NA NA NA NA 2007
Adelie Torgersen 36.7 19.3 193 3450 female 2007
Adelie Torgersen 39.3 20.6 190 3650 male 2007
Adelie Torgersen 38.9 17.8 181 3625 female 2007
Adelie Torgersen 39.2 19.6 195 4675 male 2007
Adelie Torgersen 34.1 18.1 193 3475 NA 2007
Adelie Torgersen 42.0 20.2 190 4250 NA 2007
  • these filters are combinable. So we can find combinations where several conditions are all true using the AND keyword:
SELECT * FROM penguins WHERE species = 'Adelie' AND sex = 'male' AND island = 'Biscoe';
Displaying records 1 - 10
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
Adelie Biscoe 37.7 18.7 180 3600 male 2007
Adelie Biscoe 38.2 18.1 185 3950 male 2007
Adelie Biscoe 38.8 17.2 180 3800 male 2007
Adelie Biscoe 40.6 18.6 183 3550 male 2007
Adelie Biscoe 40.5 18.9 180 3950 male 2007
Adelie Biscoe 40.1 18.9 188 4300 male 2008
Adelie Biscoe 42.0 19.5 200 4050 male 2008
Adelie Biscoe 41.4 18.6 191 3700 male 2008
Adelie Biscoe 40.6 18.8 193 3800 male 2008
Adelie Biscoe 37.6 19.1 194 3750 male 2008
  • or combinations where either are true using OR:
SELECT * FROM penguins WHERE sex <> 'male' OR island = 'Biscoe'; -- <> = not equal
Displaying records 1 - 10
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
Adelie Torgersen 39.5 17.4 186 3800 female 2007
Adelie Torgersen 40.3 18.0 195 3250 female 2007
Adelie Torgersen 36.7 19.3 193 3450 female 2007
Adelie Torgersen 38.9 17.8 181 3625 female 2007
Adelie Torgersen 41.1 17.6 182 3200 female 2007
Adelie Torgersen 36.6 17.8 185 3700 female 2007
Adelie Torgersen 38.7 19.0 195 3450 female 2007
Adelie Torgersen 34.4 18.4 184 3325 female 2007
Adelie Biscoe 37.8 18.3 174 3400 female 2007
Adelie Biscoe 37.7 18.7 180 3600 male 2007
Exercise
  • are there any Adelie penguins not from Torgersen Island that weight less than 3500g with flippers that are at least 190mm long?

WHERE helpers

  • there are several keywords to sweeten the job of building a WHERE query. First, there’s a BETWEEN keyword to help find values in ranges:
SELECT * FROM penguins WHERE bill_length_mm BETWEEN 37.8 AND 38;
7 records
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
Adelie Torgersen 37.8 17.1 186 3300 NA 2007
Adelie Torgersen 37.8 17.3 180 3700 NA 2007
Adelie Biscoe 37.8 18.3 174 3400 female 2007
Adelie Biscoe 37.9 18.6 172 3150 female 2007
Adelie Biscoe 37.8 20.0 190 4250 male 2009
Adelie Biscoe 37.9 18.6 193 2925 female 2009
Adelie Dream 37.8 18.1 193 3750 male 2009
  • for text columns, you might find LIKE with wildcards helpful. % stands for one or more character, while _ stands for exactly one character:
SELECT * FROM penguins WHERE island LIKE 'Tor%' 
Displaying records 1 - 10
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
Adelie Torgersen 39.1 18.7 181 3750 male 2007
Adelie Torgersen 39.5 17.4 186 3800 female 2007
Adelie Torgersen 40.3 18.0 195 3250 female 2007
Adelie Torgersen NA NA NA NA NA 2007
Adelie Torgersen 36.7 19.3 193 3450 female 2007
Adelie Torgersen 39.3 20.6 190 3650 male 2007
Adelie Torgersen 38.9 17.8 181 3625 female 2007
Adelie Torgersen 39.2 19.6 195 4675 male 2007
Adelie Torgersen 34.1 18.1 193 3475 NA 2007
Adelie Torgersen 42.0 20.2 190 4250 NA 2007
  • or, possibly, you could select from a group of values using IN:
SELECT * FROM penguins WHERE island IN ('Dream', 'Biscoe') 
Displaying records 1 - 10
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
Adelie Biscoe 37.8 18.3 174 3400 female 2007
Adelie Biscoe 37.7 18.7 180 3600 male 2007
Adelie Biscoe 35.9 19.2 189 3800 female 2007
Adelie Biscoe 38.2 18.1 185 3950 male 2007
Adelie Biscoe 38.8 17.2 180 3800 male 2007
Adelie Biscoe 35.3 18.9 187 3800 female 2007
Adelie Biscoe 40.6 18.6 183 3550 male 2007
Adelie Biscoe 40.5 17.9 187 3200 female 2007
Adelie Biscoe 37.9 18.6 172 3150 female 2007
Adelie Biscoe 40.5 18.9 180 3950 male 2007
  • you can also build simple expressions inside the WHERE statement to help capture more complex segments of your data:
SELECT * FROM penguins WHERE (flipper_length_mm * bill_depth_mm) = 3628.4 
1 records
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
Adelie Biscoe 40.6 18.8 193 3800 male 2008

Keyword order

  • we’ve already looked at several examples where we’ve combined together keywords. For example, we can use SELECT, FROM, WHERE and OR together to retrieve all the flipper lengths for all the non-male penguins and/or all the penguins that live on Biscoe island:
SELECT flipper_length_mm FROM penguins WHERE sex <> 'male' OR island = 'Biscoe'
Displaying records 1 - 10
flipper_length_mm
186
195
193
181
182
185
195
184
174
180
  • this order is important: SQL keywords have an order in which they must be written

  • for our purposes, we’ll need to build each of our queries in the following order:

  • SELECT

  • FROM

  • WHERE

  • GROUP BY

  • ORDER BY

Ordering

  • for example, ORDER BY is written at the end of the query:
SELECT species, flipper_length_mm FROM penguins WHERE sex <> 'male' OR island = 'Biscoe' ORDER BY flipper_length_mm
Displaying records 1 - 10
species flipper_length_mm
Gentoo NA
Adelie 172
Adelie 174
Adelie 176
Adelie 178
Adelie 178
Chinstrap 178
Adelie 180
Adelie 180
Adelie 180
  • it’s good practice to be explicit about order direction, so you might add the ASC keyword to specify that you want to order your data in ascending order:
SELECT * FROM penguins WHERE sex <> 'male' ORDER BY flipper_length_mm ASC
Displaying records 1 - 10
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
Adelie Biscoe 37.9 18.6 172 3150 female 2007
Adelie Biscoe 37.8 18.3 174 3400 female 2007
Adelie Torgersen 40.2 17.0 176 3450 female 2009
Adelie Dream 39.5 16.7 178 3250 female 2007
Adelie Dream 33.1 16.1 178 2900 female 2008
Chinstrap Dream 46.1 18.2 178 3250 female 2007
Adelie Dream 42.2 18.5 180 3550 female 2007
Adelie Torgersen 38.9 17.8 181 3625 female 2007
Adelie Dream 37.6 19.3 181 3300 female 2007
Adelie Biscoe 36.5 16.6 181 2850 female 2008
  • or we could reverse-order using DESC:
SELECT * FROM penguins WHERE bill_depth_mm > 20 ORDER BY bill_depth_mm DESC
Displaying records 1 - 10
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
Adelie Torgersen 46.0 21.5 194 4200 male 2007
Adelie Torgersen 38.6 21.2 191 3800 male 2007
Adelie Dream 42.3 21.2 191 4150 male 2007
Adelie Torgersen 34.6 21.1 198 4400 male 2007
Adelie Dream 39.2 21.1 196 4150 male 2007
Adelie Biscoe 41.3 21.1 195 4400 male 2008
Chinstrap Dream 54.2 20.8 201 4300 male 2008
Adelie Torgersen 42.5 20.7 197 4500 male 2007
Adelie Biscoe 39.6 20.7 191 3900 female 2009
Chinstrap Dream 52.0 20.7 210 4800 male 2008
  • or order by multiple columns in different directions:
SELECT * FROM penguins WHERE bill_depth_mm > 20 ORDER BY island ASC, bill_depth_mm DESC
Displaying records 1 - 10
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
Adelie Biscoe 41.3 21.1 195 4400 male 2008
Adelie Biscoe 39.6 20.7 191 3900 female 2009
Adelie Biscoe 45.6 20.3 191 4600 male 2009
Adelie Dream 42.3 21.2 191 4150 male 2007
Adelie Dream 39.2 21.1 196 4150 male 2007
Chinstrap Dream 54.2 20.8 201 4300 male 2008
Chinstrap Dream 52.0 20.7 210 4800 male 2008
Adelie Dream 41.3 20.3 194 3550 male 2008
Chinstrap Dream 51.7 20.3 194 3775 male 2007
Adelie Dream 40.2 20.1 200 3975 male 2009
Exercise
  • as an experiment, try re-ordering the previous SQL query so that the WHERE and ORDER BY sections occur in reverse order
  • it will cause an error, and that’s to be expected Re-ordering keywords causes SQL queries to fail

Aggregation

  • effectively, we apply a keyword to the columns specified in the SELECT to aggregate:
SELECT MIN(bill_depth_mm) FROM penguins
1 records
MIN(bill_depth_mm)
13.1
  • rename your summary columns using AS:
SELECT MAX(bill_depth_mm) AS big_bill FROM penguins
1 records
big_bill
21.5
  • you can specify a column to group your data by - just beware of the keyword order!
SELECT species, SUM(bill_depth_mm) AS total_bill_depth FROM penguins GROUP BY species;
3 records
species total_bill_depth
Adelie 2770.3
Chinstrap 1252.6
Gentoo 1842.8
  • and you can group by multiple variables at once:
SELECT species, sex, SUM(bill_depth_mm) AS total_bill_depth FROM penguins GROUP BY species, sex;
8 records
species sex total_bill_depth
Adelie NA 91.6
Adelie female 1286.4
Adelie male 1392.3
Chinstrap female 598.0
Chinstrap male 654.6
Gentoo NA 58.2
Gentoo female 825.8
Gentoo male 958.8
  • it’s possible (but confusing) to group by an implicit group. If we remove sex from the SELECT part of the above query, for example, we end up with (apparently) multiple rows per group:
SELECT species, sex, SUM(bill_depth_mm) AS total_bill_depth FROM penguins GROUP BY species, sex
8 records
species sex total_bill_depth
Adelie NA 91.6
Adelie female 1286.4
Adelie male 1392.3
Chinstrap female 598.0
Chinstrap male 654.6
Gentoo NA 58.2
Gentoo female 825.8
Gentoo male 958.8
  • you can chain together multiple aggregation keywords to produce more complex results:
SELECT species, ROUND(AVG(bill_depth_mm), 1) AS average_bill_depth FROM penguins GROUP BY species
3 records
species average_bill_depth
Adelie 18.3
Chinstrap 18.4
Gentoo 15.0
  • COUNT is especially useful to count the number of rows in a filtered dataset
  • do note there can be dialect issues here, with several possible approaches to counting rows that may (or may not) work in your dialect of SQL
  • for example, the lovely concise DISTINCT shorthand is great, but unfortunately won’t work everywhere:
SELECT COUNT(DISTINCT island) FROM penguins
1 records
COUNT(DISTINCT island)
3
  • this could also be used with a rename in place:
SELECT COUNT(DISTINCT island) AS peng_count FROM penguins
1 records
peng_count
3
  • as a longer, but probably safer, alternative, and permitting several categories to be counted together:
SELECT COUNT(*) AS peng_count FROM (SELECT DISTINCT island, species FROM penguins)
1 records
peng_count
5
  • or, failing that:
SELECT island AS landmass, COUNT(*) AS [male n] FROM penguins WHERE sex = 'male' GROUP BY island 
3 records
landmass male n
Biscoe 83
Dream 62
Torgersen 23
  • if you’re trying to understand missing data, you should use the NULL keyword:
SELECT * FROM penguins WHERE bill_length_mm IS NULL -- really important if you're aggregating
2 records
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
Adelie Torgersen NA NA NA NA NA 2007
Gentoo Biscoe NA NA NA NA NA 2009
  • and there are some shortcuts for concatenating multiple columns together:
SELECT CONCAT(species,', ', island) AS species_land from penguins 
-- but be aware this can be very quirky across different SQL dialects
Displaying records 1 - 10
species_land
Adelie, Torgersen
Adelie, Torgersen
Adelie, Torgersen
Adelie, Torgersen
Adelie, Torgersen
Adelie, Torgersen
Adelie, Torgersen
Adelie, Torgersen
Adelie, Torgersen
Adelie, Torgersen