Data tidying in Excel

excel
intermediate
Published

January 21, 2025

Session materials

No feedback found for this session

Welcome

  • this session is 🌶🌶 - for intermediate Excel users
  • materials can be found on our training pages: tinyurl.com/kindtrp
  • you’ll need Excel, ideally M365 desktop to follow along
  • you’ll also need the sample data

Session outline

  • understanding tidy data
  • a practical introduction to making tidy data
  • tables
  • autofill
  • text to columns
  • validation, manual reshaping, and PowerQuery demo

Understanding tidy data

R4DS Figure 5.1: The following three rules make a dataset tidy: variables are columns, observations are rows, and values are cells, via CC BY-NC-ND 3.0 US
  • tidy data is not specific to Excel (e.g. this session is partly adapted from chapter 5 of R for Data Science)
  • if we tidy our data, we’ll be able to compute it
  • standards for tidying data help make for safe and effective analysis

Values

  • values is the word we use to describe each bit of information in an Excel spreadsheet. Some examples:
  • a date, like 2024-06-28
  • a number, like 11.2
  • a name, like NHS Grampian
  • a cost, like £12.50
  • each value should have its own cell

Entering values

  • how would you enter this data: 2018, 2019, 2020, 2021, 2022?
  • please now:
  • open Excel
  • start a new workbook
  • add a column header year in cell A1
  • then add each of those five values in the five cells underneath (down to A6)

More values

  • we’re going to be using some birthrate data from the NRS for this session. We’ll start by adding some birth rate data
  • this is given as births per 1,000 women in five year age brackets. We’ll start with 25-29 year old mothers
  • please add the header 25-29yrs in cell B1
  • here are the values for our five years: 73.4, 71, 66.8, 69.6, 66.7

Tables

  • you should keep your new data in a table
  • Insert > Table
    Or you might prefer the Ctrl + T shortcut
  • tables allow you to sort and filter your data
  • they also act as a useful ‘container’ (or data structure) for your data

Extending tables

30-34yrs
90.9
88.6
83.4
85.9
84.4
  • please could you add some new data to your table showing births for another age group
  • you should be able to drag the blue table corner over your new column to extend the table
    blue table corner

Autofill

  • one last way of adding values: autofill
  • drag again to make a new empty column, and label it difference
  • in D2 (the first ‘proper’ cell), copy this formula: =C2-B2. This will calculate the difference in birth rates between the two columns
  • finally, click the small green corner of that newly-filled cell to autofill the column small green corner

Back to tidy data

  • we’ve now got some data with:
  • each value in a cell
  • each variable in a column
  • here, this is a maternal age bracket
  • each observation in a row
  • here, this is a year
  • we could work through and extend this data by hand, but we’ll now switch over to some supplied data to save all the typing
    • errors are common in manual data-entry
    • if you can import data, that’s usually better than re-typing it

Many values per cell

  • We often find useful data with more than one value per cell
  • this can be helpful for humans
    multiple values is good for humans, but bad for Excel
  • Excel can’t do anything with this data

Exercise one: many values per cell

  • find the Exercise one sheet in the exercise file
  • try calculating an average for each of the groups
  • or, if you’re more confident, try plotting the data

Nothing works!

Divide by zero errors in the averages
Plots produce pure nonsense

What’s the solution?

  • make this data have one value per cell
  • we could remove all the percentages by hand - but that’s going to be very slow
  • Text to Columns gives us a better tool for the job
    Text to Columns gives us a better tool for the job

Text to Columns

  • select a column
  • in the Data tab of the ribbon, you should find the Text to Columns tool
    Text to Columns tool
  • note that you can keep, or remove, the percentage column. We’ll skip it, to keep things simple
    keep, or remove, the percentage column

Try working with that data again

We can plot the data now

Exercise two: transposed data

  • sometimes you’ll find data where the columns and rows have been flipped
    data where the columns and rows have been flipped
  • that’s slow to fix by hand, but luckily you can transpose it, which swaps rows and columns
  • select your data, and copy/paste into a new cell
  • then use the transpose option
    transpose option
  • you might need to fix formatting and labels: fix formatting and labels
  • the Clear formats option might help this Clear formats
  • if you run into trouble, please note that transposing only works on data that is not in a table

Validation and really messy data

  • one of the most time-consuming bits of tidying is checking your values
  • we’ll briefly demonstrate the data validation tool
    the data validation tool
  • you can follow-along from the Demo one (validation) worksheet
  • this allows you to describe what format you think your data should be in, and then highlights anything that doesn’t fit
  • we select the years column
  • then open the data validation tool
  • then set appropriate validation options, so Whole number between 2010 and 2030
    set appropriate validation options
  • then select Circle Invalid Data
    select Circle Invalid Data
  • we can now go through and fix anything circled

Reshaped data

  • there are several automated ways of reshaping data that’s not in a tidy format
  • we’ll look at the manual way here, but - as it’s horrible - I’d be keen to encourage you to investigate Power Query or Pivot Tables to reshape if this is a regular part of your working day. PQ takes < 10 seconds, PT not much longer
  • you can follow-along from the Demo two (reshaping) worksheet

Manual reshape

  • sort the data by year
    sort the data by year
  • then copy and paste blocks of data, making sure to keep the years aligned
    copy and paste blocks of data
  • make sure you then copy the age brackets to label the column
    copy the age brackets to label
  • then delete the spare years columns, and the age brackets
    delete the spare years