Tidy data in Excel

excel
beginner
Published

September 26, 2024

Session materials

Previous attendees have said…

  • 19 previous attendees have left feedback
  • 95% would recommend this session to a colleague
  • 100% said that this session was pitched correctly

Three random comments from previous attendees
  • Enjoyed this today - learned a few new tips’n’tricks! There’s alwasy somehting you pickup even when you think you know the topic. - ‘Text to columns’ skip column functionality - Data Validation Use Case and Circles - Excel ‘Float’ menu keyboard interaction
  • Easy to follow and although I would describe myself as intermediate Excel user, I still picked up a few new tricks.
  • Good beginnner session. Learned some new techniques that will help my excel use going forward.

Welcome

  • this session is for 🌶 Excel beginners
  • this is a mainly-practical session, and you’ll need Excel of some sort to follow along

Where does this fit in?

KIND Excel beginner skill tree
  • for this session, you’ll need to be familiar with the Excel basics (getting around in Excel, opening/saving/closing files, and a little bit of A1 referencing)
  • we’re going to dodge formatting and formulas as much as possible today

Session outline

  • understanding tidy data
  • a word of warning
  • a practical introduction to making tidy data:
    • values
    • tables
    • autofill
  • exercises and demos

Understanding tidy data

R4DS Figure 5.1: The following three rules make a dataset tidy: variables are columns, observations are rows, and values are cells, via CC BY-NC-ND 3.0 US
  • tidy data is not specific to Excel (e.g. this session is partly adapted from chapter 5 of R for Data Science)
  • if we tidy our data, we’ll be able to compute it
  • standards for tidying data help make for safe and effective analysis

A word of warning

  • tidying data can be very slow and complicated
  • in Excel, there are lots of advanced tools that can speed things up
    • PowerQuery especially
  • this is a beginner’s session, so we’ll avoid the more fancy tools
  • but if your process takes lots of manual work, it’s definitely worth exploring alternative ways of working

Values

  • values is the word we use to describe each bit of information in an Excel spreadsheet. Some examples:
    • a date, like 2024-06-28
    • a number, like 11.2
    • a name, like NHS Grampian
    • a cost, like £12.50
  • each value should have its own cell

Entering values

  • how would you enter this data: 2018, 2019, 2020, 2021, 2022?
  • please now:
    • open Excel
    • start a new workbook
    • add a column header year in cell A1
    • then add each of those five values in the five cells underneath (down to A6)

More values

  • we’re going to be using some birthrate data from the NRS for this session. We’ll start by adding some birth rate data

  • this is given as births per 1,000 women in five year age brackets. We’ll start with 25-29 year old mothers

  • please add the header 25-29yrs in cell B1

  • here are the values for our five years: 73.4, 71, 66.8, 69.6, 66.7

Tables

  • you should keep your new data in a table
  • Insert > Table
    Or you might prefer the Ctrl + T shortcut
  • tables allow you to sort and filter your data
  • they also act as a useful ‘container’ (or data structure) for your data

Extending tables

30-34yrs
90.9
88.6
83.4
85.9
84.4
  • please could you add some new data to your table showing births for another age group
  • you should be able to drag the blue table corner over your new column to extend the table
    blue table corner

Autofill

  • one last way of adding values: autofill
  • drag again to make a new empty column, and label it difference
  • in D2 (the first ‘proper’ cell), copy this formula: =C2-B2. This will calculate the difference in birth rates between the two columns
  • finally, click the small green corner of that newly-filled cell to autofill the column small green corner

Back to tidy data

  • we’ve now got some data with:
    • each value in a cell
    • each variable in a column
      • here, this is a maternal age bracket
    • each observation in a row
      • here, this is a year
  • we could work through and extend this data by hand, but we’ll now switch over to some supplied data to save all the typing
    • errors are common in manual data-entry
    • if you can import data, that’s usually better than re-typing it

Many values per cell

  • We often find useful data with more than one value per cell
  • this can be helpful for humans
    multiple values is good for humans, but bad for Excel
  • Excel can’t do anything with this data

Exercise one: many values per cell

  • find the Exercise one sheet in the exercise file
  • try calculating an average for each of the groups
  • or, if you’re more confident, try plotting the data

Nothing works!

Divide by zero errors in the averages
Plots produce pure nonsense

What’s the solution?

  • make this data have one value per cell
  • we could remove all the percentages by hand - but that’s going to be very slow
  • Text to Columns gives us a better tool for the job
    Text to Columns gives us a better tool for the job

Text to Columns

  • select a column
  • in the Data tab of the ribbon, you should find the Text to Columns tool
    Text to Columns tool
  • note that you can keep, or remove, the percentage column. We’ll skip it, to keep things simple
    keep, or remove, the percentage column

Try working with that data again

We can plot the data now

Exercise two: transposed data

  • sometimes you’ll find data where the columns and rows have been flipped
    data where the columns and rows have been flipped
  • that’s slow to fix by hand, but luckily you can transpose it, which swaps rows and columns
  • select your data, and copy/paste into a new cell
  • then use the transpose option
    transpose option
  • you might need to fix formatting and labels: fix formatting and labels
  • the Clear formats option might help this Clear formats
  • if you run into trouble, please note that transposing only works on data that is not in a table

Demo one: validation and really messy data

  • one of the most time-consuming bits of tidying is checking your values
  • we’ll briefly demonstrate the data validation tool
    the data validation tool
  • this allows you to describe what format you think your data should be in, and then highlights anything that doesn’t fit
  • we select the years column
  • then open the data validation tool
  • then set appropriate validation options, so Whole number between 2010 and 2030
    set appropriate validation options
  • then select Circle Invalid Data
    select Circle Invalid Data
  • we can now go through and fix anything circled

Demo two: reshaped data

  • there are several ways of reshaping data that’s not in a tidy format
  • we’ll look at the manual way here, but - as it’s horrible - I’d be keen to encourage you to investigate Power Query or Pivot Tables to reshape if this is a regular part of your working day. PQ takes < 10 seconds, PT not much longer

Manual reshape

  • sort the data by year
    sort the data by year
  • then copy and paste blocks of data, making sure to keep the years aligned
    copy and paste blocks of data
  • make sure you then copy the age brackets to label the column
    copy the age brackets to label
  • then delete the spare years columns, and the age brackets
    delete the spare years