| date | day | attendance |
|---|---|---|
| 2025-12-23 | Tue | 348 |
| 2025-12-24 | Wed | 304 |
| 2025-12-25 | Thu | 210 |
| 2025-12-26 | Fri | 315 |
| 2025-12-27 | Sat | 391 |
Datafy your system
In this session, we’ll talk about some of the decisions that we need to make when we want to collect some data about our system. Most systems are more complicated than they at first appear. That means we’ll need to make some decisions about how we understand our system before we can start measuring it. We’ll describe this process, and the decisions that we take during it, as datafying our system. This is the most high-concept session in this course, because what we’re really talking about here is the data worldview: how do we need to treat our system differently if we’re going to try to understand (and improve) it using data?
Session aim
This session aims to give you an understanding of how we need to translate systems into data, and why that understanding matters.
Exercises
- E1: definition of a week
- E2: different definitions mean different outcomes
- E3: thinking about your system
- E4: write a data dictionary
Data
Week-classification dataset (.xlsx/.rds), which is a short synthetic dataset with thirty daily measurements of attendance, together with their dates and day of the week.
Key concepts
- the idea of datafying a system
- why definitions matter for data
- the composition and role of a data dictionary
Introduction
In this session, we’ll venture into an area that isn’t usually found in data or statistics courses, but that lies at the heart of data work. In earlier sessions, we repeatedly talked about the overall aim of using data, which is that it gives you simple ways of understanding the behaviour of complicated systems. That understanding is very useful, because it then allows us to (hopefully) improve our system.
That’s a claim we’ve made several times before, and it’s the sort of thing that crops up pretty often in introductory data science courses and the like. But it contains the germ of an interesting problem, which is about how we can take a complicated system, and translate that system into simple data.
Note that this isn’t really about measurement. We’ll talk in the next session about the fine details of measuring things to make data. This session deals with a broader topic, which is about how we choose which things to measure, and which things to leave out, when we’re trying to turn our system into data. More simply, today is about figuring out how to collect data from your system, in the broadest terms. In this session, we’ll refer to the thinking that we do to let us collect data from something as datafying.
Traditionally, this topic is very much in the background. Most of us haven’t ever been explicitly taught how to datafy something, and have had to formulate our own ideas about the right way to do that largely by trial and error, guidance from colleagues, and by copying other bits of good practice that we’ve seen. In other words, how to datafy systems is usually treated as tacit knowledge - something that you learn from experience, rather than get taught. I think that’s a mistake, and hope this session goes some way to making sense of how to datafy systems, and what that means for us. Even if you’re not planning to start your own data collection, it’s definitely worth understanding how datafying works, because a major cause of errors when working with data is talking at cross purposes about how a system was datafied.
Being selective
There’s a misleading idea that working with data is about collecting as much information as possible. That’s not true at all! In fact, doing good work with data largely begins with being highly selective.
Most of us work in systems that are, to put it mildly, very complicated. Trying to collect every bit of data about the way that a care home, a GP practice, or a housing provider works isn’t wise or useful: it’d very rapidly expand to become more than a full-time job. Worse, that everything-data would become very hard to deal with. If our data is designed to capture everything about our system, then our data is going to be as complicated to understand and navigate as our system is. And that’s to miss the point about data work. The point of data is to let you take simple measurements of complicated systems, and achieve good things by sticking those measurements together, comparing them to other data, and in general working with them without having to pay careful attention to the system that produced them, or deal with all the complications of a real system. A 1:1 map usually isn’t very useful!
That means that a prime rationale for doing data work in the first place is because it’s a tool for making complicated systems simpler. That depends on several issues, as we’ll discuss below, but their underlying logic is similar: be selective, and pay attention only to some aspects of your system. Don’t try and capture everything! The remainder of this session is about how to be selective, and how to be reasonably sure you’re not missing anything important, or misleading people about what you have paid attention to. We’ll also return to give some more practical advice about how to be selective in the next session.
Most words are vague
As soon as we start being selective about our systems, we run into an interesting problem about language, which is that most ordinary words have several different meanings. That’s a simple idea, with lots of interesting results, but for this course we’ll focus on one reason why ordinary language being vague matters for us.1
We don’t need to go very far to find vagueness. Even a simple dictionary will usually give a few, often rather different, definitions of a word.2
Usually, that vagueness isn’t a serious problem, because we’re generally good at inferring what is really meant by context. Okay, we can still occasionally run into areas of serious confusion in ordinary language, but happily most of the time we can infer what someone means from the context. I can happily say “I ran to catch the bus, and then ran a bath, but then stopped as the hot water ran out” and generally hope to be understood without too much trouble.3
Let’s try and collect some cases of this sort of vagueness now about something that we might try and datafy: the week.
- We quite often want to collect weekly data. But what does that mean?
- In the chat, please give as many definitions of week as you can
We’re going to use some of the real definitions of a week later in the session. Please try to come up with your own before looking at these examples, but once you’ve had a brief think about how you might define a week, feel free to open the section below to see some real examples.
Because a year isn’t neatly divisible by seven, and because different classifications of week are used in different places, there are several common definitions of a week that can (and do) give different results. Just to note a few examples in current use:
- ISO 8601 weeks, 7 day Mon-Sun blocks, numbered from the week containing the first Thursday of January
- US weeks, 7 day Sun-Sat blocks, starting from the 1st January (also popular across “much of America, South and Southeast Asia, and southern Africa” according to Wikipedia)
- tax weeks, successive 7 day blocks, beginning from the 6th April
- epidemiological weeks, 7 day Sun-Sat blocks, numbered from the week containing the first Wednesday in January
- retail weeks, 7 day blocks, running continuously year-on-year, with occasional 53 week years to account for days over
- and the informal idea of the working week - see your Teams calendar - with 5 day blocks, running non-continuously Mon-Fri (and maybe Tues-Fri occasionally?)
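To see how some of these definitions behave in practice, here is a minimal Python sketch of three of them. The function names are invented for illustration, and the `us_week` function rests on an assumption: that week 1 is the (possibly partial) Sun-Sat week containing 1 January. Other Sun-Sat conventions exist and would give different numbers.

```python
from datetime import date

def iso_week(d: date) -> int:
    """ISO 8601: Mon-Sun weeks; week 1 contains the first Thursday of January."""
    return d.isocalendar()[1]

def us_week(d: date) -> int:
    """Sun-Sat weeks; ASSUMES week 1 is the week containing 1 January."""
    jan1 = date(d.year, 1, 1)
    # Number of days between the Sunday on or before 1 January and 1 January
    offset = (jan1.weekday() + 1) % 7
    day_of_year = d.timetuple().tm_yday
    return (day_of_year - 1 + offset) // 7 + 1

def tax_week(d: date) -> int:
    """Successive 7-day blocks counted from 6 April."""
    start = date(d.year, 4, 6)
    if d < start:
        # Dates before 6 April belong to the tax year that began the previous April
        start = date(d.year - 1, 4, 6)
    return (d - start).days // 7 + 1

# Print the week number each definition gives for a few dates around the new year
for d in [date(2025, 12, 27), date(2025, 12, 28), date(2025, 12, 29), date(2026, 1, 1)]:
    print(d.isoformat(), iso_week(d), us_week(d), tax_week(d))
```

Running this over the turn of the year shows the same date receiving a different week number under each definition, which is the problem the rest of this session deals with.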
What happens if we don’t datafy?
Why do these different definitions of a week matter? Let’s start with an exercise using some of the week definitions above.
Here’s a preview of the main dataset for this session (.xlsx/.rds):
It contains three variables - date containing dates, day containing short day-of-the-week labels, and attendance containing attendance data. Together, that data is intended to be something like the kind of ordinary service-use data that gets collected in many services.4
- please take one of the definitions of a week from the list above, and use it to classify those dates by adding a week column
- now repeat with a second definition, and compare the two week columns
- if you’re Excel confident, you might also try grouping the attendance data by each of those weeks, and seeing if the overall results change
So because there are multiple definitions of “a week” that could be used to make data, and because changing which definition we use affects our results, we need some way of specifying exactly how we’re understanding a week in this data.
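For anyone who would rather try the exercise in code than in Excel, the sketch below totals attendance by week under two definitions. It uses only the ten rows shown in this session’s preview and sample tables, not the full thirty-row dataset. The helper names are invented, and the Sun-Sat rule again assumes week 1 is the week containing 1 January.

```python
from collections import defaultdict
from datetime import date

# The ten rows that appear in this session's preview and sample tables
rows = [
    (date(2025, 12, 23), 348), (date(2025, 12, 24), 304),
    (date(2025, 12, 25), 210), (date(2025, 12, 26), 315),
    (date(2025, 12, 27), 391), (date(2025, 12, 28), 373),
    (date(2025, 12, 29), 293), (date(2025, 12, 30), 327),
    (date(2025, 12, 31), 214), (date(2026, 1, 1), 354),
]

def iso_key(d):
    """(ISO year, ISO week): Mon-Sun weeks."""
    iso = d.isocalendar()
    return (iso[0], iso[1])

def sun_sat_key(d):
    """(calendar year, week): Sun-Sat weeks, ASSUMING week 1 contains 1 January."""
    jan1 = date(d.year, 1, 1)
    offset = (jan1.weekday() + 1) % 7
    return (d.year, (d.timetuple().tm_yday - 1 + offset) // 7 + 1)

def totals_by_week(rows, key):
    """Sum attendance within each week, using `key` to classify dates."""
    totals = defaultdict(int)
    for d, attendance in rows:
        totals[key(d)] += attendance
    return dict(totals)

print(totals_by_week(rows, iso_key))
print(totals_by_week(rows, sun_sat_key))
```

Both calls add up the same ten attendance figures, but they split them into a different number of weeks with different weekly totals. Anyone comparing “weekly attendance” between two services would need to know which rule each service used.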
Again, that sort of specifying might initially seem unnecessarily picky and pedantic. Even when we’re having quite precise conversations about work, we usually can get away with not specifying something like a week precisely, because we expect our colleagues to be able to work out from the context what we mean, or (at the very worst) be able to come back to us to discuss and ask questions if things don’t quite work out. If we’re doing something local with data - collecting some information about our service to use internally, for example - we can maybe get away with relying on everyone’s knowledge of the service to help them understand the local context. Our colleagues are likely to know how we’ve counted a week in our service, and at the very worst know that we created the data, and so can come back to us with any unanswered questions.
Most data isn’t like this: it isn’t closely tied to where it was made. In fact, much of the power of data comes from deliberately moving data around without worrying at all about where it was made and the details of the systems that gave rise to it. That ability to move data from place to place is what lets us aggregate data (sticking together, say, all the counts of care home beds in the country to give a national total) or compare different areas (say, seeing if there are more care home beds in Aberdeenshire than Highland), and so on. Moving data in that way means that we need to agree on some of the core aspects of our data: we need to tell people interested in the total national number of care home beds that we’re counting them in the right way. We need to pick a definition here, and we need to be explicit about which one we’re using. It’s definitely not good enough to just say “weeks” when we’re hoping to make data that will be useful to us, our services, and to other people.5 As we saw in the first session, that kind of use of a dataset across several different data journeys is one of the main reasons that we collect data at all.
How do we datafy?
So far, we’ve mainly concentrated on describing a problem about vagueness. Let’s now start to think about a solution to our problem, which we’ll call datafying.6 To summarise the themes that we’ve already discussed:
- turning what happens in a system into data is mainly about throwing away information
- you need rules for that: it’s a principled throwing away of information
- those rules should be written down and available to anyone who might potentially collect or use your data
In essence, datafying a system involves making three related decisions so that we can safely start measuring and collecting data about that system. These decisions are:
- What aspect(s) of your system should your data describe?: we can’t capture everything, but instead need to concentrate on the salient parts of our system. Good datasets are usually specific to a topic. What do you want the data to tell you?
- What access do you have?: not everything we want to measure is measurable. Which salient parts of your system can you reasonably hope to measure with the resources you have available?
- What assumptions will you make when you measure?: as the week example above tells us, the definitions we choose can change our results. How will we tell people how we’ve measured our system? Note this isn’t about giving a definition for everything, but instead giving definitions for assumptions that have made a difference to our data.
This is a two part activity.
Part one is to come up with an example of a system that you might be interested in datafying. You’ll have about five minutes to think about where you work. During that time, please reflect on (and be ready to present) a short summary describing:
- briefly, what your service does, and is like
- a suggestion about some aspect of your system you might be interested in datafying. Try and be selective and specific here.
- any relevant information about access and constraints that might be important when collecting data
You might find it helpful to make brief notes on those three areas during this first part of the activity.
Part two is then collaborative. Taking your short notes, you’ll have a short discussion with another member of the group. You’ll present your idea about datafying some aspect of your system, listen to their presentation about their plans to datafy something, and then you can discuss issues around access and assumptions. Working in pairs, especially given the range of roles that participants in our training work in, is likely to be extremely helpful in understanding what assumptions would need explaining.
For example, say we’re working in a service with pre-booked scheduled appointments, and we’re interested in measuring how many people use our service in a week. We’d need to refine that desire to make sure we’re measuring what we think we’re measuring. Your group conversation might involve making some decisions about:
- what counts as a person, for our purposes? If someone visits us for an appointment, and brings a relative with them, should that count as one user or two?
- what counts as a use, for us? Does the postie dropping off letters count as a use? How about people who arrive, but leave before their planned appointment?
- as per the earlier discussion, what about a week? Is that working week, or a 7-day week? Which day of the week does that start on?
All of these questions have several possible good answers. The point of this session though is to try and underline that - even before getting to the point of collecting data - we need some precision about what the data we collect is about. That means that you don’t need to reach any kind of firm decision about your datafication during the discussion, but do please be ready to report back to the group any especially interesting, surprising, or controversial decisions you explored during your discussion. Do also retain any notes from this process - you’ll need them later in the session.
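If it helps to see why the questions in the appointments example matter, the sketch below counts the same handful of visits under two sets of rules. Everything in it is invented for illustration: the `Visit` record, the four example visits, and the `count_users` function are hypothetical, not taken from any real booking system.

```python
from dataclasses import dataclass

@dataclass
class Visit:
    booked: bool      # did this person have a pre-booked appointment?
    attended: bool    # did they stay for the appointment?
    companions: int   # relatives or others who came along

# Invented example visits (not real data)
visits = [
    Visit(booked=True, attended=True, companions=0),
    Visit(booked=True, attended=True, companions=1),    # brought a relative
    Visit(booked=True, attended=False, companions=0),   # arrived, left early
    Visit(booked=False, attended=False, companions=0),  # postie dropping off letters
]

def count_users(visits, *, include_companions, include_left_early, include_unbooked):
    """Count users under one explicit set of datafying decisions."""
    total = 0
    for v in visits:
        if not v.booked and not include_unbooked:
            continue
        if v.booked and not v.attended and not include_left_early:
            continue
        total += 1
        if include_companions:
            total += v.companions
    return total

strict = count_users(visits, include_companions=False,
                     include_left_early=False, include_unbooked=False)
broad = count_users(visits, include_companions=True,
                    include_left_early=True, include_unbooked=True)
print(strict, broad)
```

Neither count is wrong. They answer different questions, which is why the rule used needs to travel with the data.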
In conclusion, datafying a system is all about making the decisions you’ll later need to collect your data effectively, and for others to be able to use your data safely. That leads us on to the final section (and activity) in this session.
How do we tell people how we’ve datafied?
We’ve already spent a lot of time thinking about how we datafy. In this section, we’ll talk about how to communicate that work. We do that by writing a data dictionary.7 A data dictionary is just a group of metadata that describes the key decisions and assumptions that we made while datafying our system.8
What does a data dictionary look like? It usually consists of two parts. First, we give an overall description of the dataset and its intended purpose. That usually includes:
- a sentence or two giving the overall purpose of the dataset
- some very basic information about the size of the data (columns and rows, basically) and any specific formatting information
- if not obvious from the data itself, date and location information describing where the data was collected
- any licensing information
- if you’re feeling beneficent and helpful, a snapshot of a bit of the dataset itself
Next, we give variable-by-variable descriptions. That’s helpful, because it means we can use nice short column names, but give users proper full descriptions without having to resort to writing baroque column titles like rounded_total_monthly_number_hgv_mot_minor_injuries_highland. We can also itemise any key assumptions - like our definition of week - so that later users can understand exactly what our data means. We can also do things like descriptive statistics here - which we talk about more in later sessions - to help our users understand what they’re dealing with. For now though we’ll omit those helpful statistical bits, and concentrate on the assumptions.
For example, for the dataset we used previously (with added ISO 8601 week classifications), we could write a simple data dictionary, as below:
Here’s a very simple example of a data dictionary.
Description
30 rows and 4 columns of data describing the number of appointments per day for parts of December 2025 and January 2026 seen across our care service’s three sites.
Variables
- date, containing dates in yyyy-mm-dd format
- day, containing three-letter short days of the week in English
- week, containing an ISO 8601 week number
- attendance, containing a count of the number of service users for that date, collected from the appointment booking system and ignoring missed or cancelled appointments
Sample data
| date | day | week | attendance |
|---|---|---|---|
| 2025-12-27 | Sat | 52 | 391 |
| 2025-12-28 | Sun | 52 | 373 |
| 2025-12-29 | Mon | 1 | 293 |
| 2025-12-30 | Tue | 1 | 327 |
| 2025-12-31 | Wed | 1 | 214 |
| 2026-01-01 | Thu | 1 | 354 |
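A data dictionary can also be written in a form that a computer can read, which makes it possible to check data against the stated assumptions. The sketch below is one possible layout rather than a standard: the `data_dictionary` structure and the `check_row` function are invented for illustration, and the rows are the sample rows from the table above.

```python
from datetime import date

# One possible machine-readable layout for the data dictionary above
data_dictionary = {
    "description": "Number of appointments per day across three sites",
    "variables": {
        "date": "date in yyyy-mm-dd format",
        "day": "three-letter short day of the week in English",
        "week": "ISO 8601 week number",
        "attendance": "count of service users, ignoring missed or cancelled appointments",
    },
}

sample = [
    {"date": "2025-12-27", "day": "Sat", "week": 52, "attendance": 391},
    {"date": "2025-12-28", "day": "Sun", "week": 52, "attendance": 373},
    {"date": "2025-12-29", "day": "Mon", "week": 1, "attendance": 293},
    {"date": "2025-12-30", "day": "Tue", "week": 1, "attendance": 327},
    {"date": "2025-12-31", "day": "Wed", "week": 1, "attendance": 214},
    {"date": "2026-01-01", "day": "Thu", "week": 1, "attendance": 354},
]

DAY_LABELS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

def check_row(row):
    """Return a list of ways this row disagrees with the data dictionary."""
    problems = []
    missing = set(data_dictionary["variables"]) - set(row)
    if missing:
        return [f"missing variables: {sorted(missing)}"]
    d = date.fromisoformat(row["date"])  # raises ValueError if not yyyy-mm-dd
    if DAY_LABELS[d.weekday()] != row["day"]:
        problems.append("day label does not match date")
    if d.isocalendar()[1] != row["week"]:
        problems.append("week is not the ISO 8601 week for this date")
    if not isinstance(row["attendance"], int) or row["attendance"] < 0:
        problems.append("attendance is not a non-negative count")
    return problems

for row in sample:
    print(row["date"], check_row(row) or "ok")
```

A check like this only works because the dictionary says which definition of a week was used. If it said only “week number”, there would be nothing to check the week column against.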
For this final task, you’ll go back into pairs again. Pick one of your project notes, and develop a simple data dictionary. This is likely to involve a bit of imagination as we’re obviously doing this in a slightly artificial way.
For the first part, concentrate on what your data is intended to be about, any of your likely constraints, and on the specific areas you’d envisage collecting data about.
For the second part, think about each of your variables. Are there any areas where you’re going to need to specify how you’re understanding weeks/patients/costs or any of the other technical terms your data might contain? Can you think of suitable ways of defining these technical terms?
We can (and probably should) write our data dictionary before we start collecting data. Data projects have a habit of expanding beyond their initial boundaries, and an itemised list is an excellent preventative against projects senselessly growing arms and legs. In the next session, we’ll use this idea of a data dictionary as a starting point for thinking about how to actually collect data from a system.
We’ll revisit data dictionaries briefly later in the course once we’ve had a chance to think a bit more about how we summarise data.
bimodal distribution
Theoretical distribution with two peaks
categorical
Non-orderable data, such as names or addresses
coding
The method used to translate observations into data. For example, we could use a pain scale to translate individual reports of discomfort into a standard ten-point score.
data dictionary
A metadata description of the variables our dataset contains that describes any key assumptions necessary to safely create and use that data
datafying
The process of making decisions about how to record the activities of a system as data. For example, if we’re looking to collect data about the number of phone calls our service receives a week, we’d need to decide how to define a week (midnight Monday to Monday? Any rolling 168 hour period?) and how to count phone calls (Do we include missed calls? Calls of less than 1 minute?)
empirical
Made by measuring, and often used as the opposite of theoretical
empirical frequency distribution
The frequency with which each observation in our dataset occurs
frequency
How often some value occurs in a group of measurements. This can either be an absolute frequency (expressed as a count of the number of values) or a relative frequency, usually expressed as a percentage
histogram
Type of graph produced by grouping values into ranges (usually known as bins), then counting the number of values in each bin, then plotting the counts for each range
mean
A way of calculating an average by dividing the total of a variable by the total number of observations
metadata
Data about data. For example, the file size of a csv file, the number of rows in an Excel spreadsheet, or a note that the ISO8601 definition of a week was used in our data collection are all varieties of metadata
n
The total number of observations in a dataset
normal distribution
Also known as the bell curve, this is a theoretical distribution that is centred on the mean, symmetrical, and where just over 95% of values fall within two standard deviations of the mean
numerical
Data that consists of ordinary (cardinal) numbers that can be manipulated mathematically. Can be either discrete (one from a defined range of values, like school year), or continuous (like weight, any possible value).
observation
A collection of measurements of different variables for one item in our dataset. For example, if we collected daily counts of the number of new patients and discharged patients for a hospital site, each day would be a separate observation
ordinal
Orderable data. Most numerical data can be considered ordinal because it can be sensibly ranked. Would also include values with a natural ranking, such as month, or day of the week, but beware the local extent of some ranking practices.
outlier
Values of a variable that appear to not fit in with the main body of the data. They might represent genuinely exceptional cases, or might be the result of errors during data collection or tidying
range
The difference between the smallest and the largest value of a variable
skew
A term describing where the majority of observations in an empirical frequency distribution lie. Skew describes data where the data is broadly non-symmetrical about the mean. If observations are clustered to the left of the mean, the data should be described as right-skewed, whereas if observations are closely clustered to the right of the mean then it should be described as left-skewed
standard deviation
A measure of spread from the mean
tidy data
A simple method for standardising the structure of data. This was initially developed to help analysts working in R, but is generally applicable to most systems. See our beginner-level Excel session about tidy data.
type
A way of referring to the sort of data we’re dealing with. While there are several different possible classifications, for this course we discuss numerical, ordinal, and categorical data as being the most important types.
variable
A collection of measurements of one specific characteristic of our system. For example, if we collected daily counts of the number of new patients and discharged patients for several hospital sites, that data set would have four variables: the date, the hospital site, the number of new patients, and the number of discharged patients
Footnotes
1. There’s a lot of interesting philosophical work about this whole area, which is definitely better ignored here. I also like the quote from Osip Mandelstam in a long essay about Dante to the effect that “the word is a bundle and meaning sticks out of it in various directions.”
2. If you’ve ever been lucky enough to spend some time with the full Oxford English Dictionary (or reputable equivalent) you’ll know that some words have dozens or hundreds of different definitions - the words “run”, “put” and “set” are apparently exceptionally rich in this way.
3. If you’re coming to this sentence with English as a second language, please accept my sincere apologies.
4. We haven’t introduced the idea of variables properly yet, so don’t worry about the details for now. We’ll discuss them properly in the next session.
5. If you wanted to try and summarise that idea, you might say that data work is usually plural: most datafied services have lots of people involved, and making sure that we do things in a collaborative and consistent way is essential if our data is ever to be useful in answering questions about our service.
6. While the idea of datafying is widespread in work about data, to the best of my knowledge there isn’t a standard account of what it means to datafy something. I’d love a correction, and some sources of reading if you do know of one.
7. There are several other related terms that are used to describe similar structures. I don’t think there’s a sufficiently clear rationale to prefer one over the other, and will use data dictionary as a simple descriptive term that describes the kind of metadata we need to safely do our work.
8. Like lots of other material in this session, there’s no standard way of writing a data dictionary that I know of, and so this is a first attempt to give a short and clear recipe for doing so.