name | height |
---|---|
Emma | 1.6 |
Rani | 1.4 |
Sven | 1.5 |
BI for Beginners Session 5
Previous attendees have said…
- 60 previous attendees have left feedback
- 100% would recommend this session to a colleague
- 97% said that this session was pitched correctly
- clear and concise
- Very useful and it was particularly interesting seeing/hearing of the different issue users were having because of the differences in the version of Excel/Power Bi they had available to them. My heart goes out to eHealth because it’s incredibly difficult to keep everyone right and happy at the same time.
- I’m a beginner with some previous experience of being an end user getting data from powerBi so I have an understanding of the possibilities. I found the method of clicking along very helpful tied with Brendan’s positive presentation skills in openly checking issues and answering queries most beneficial. The level of content was right for me until the last session - I have no experience of DAX so this was all newer. I have been able to apply the skills I have learned with confidence. I feel I’m still lacking a better understanding of the semantics model and if I have a recurring set of data ie datix incidents, how I can update the data in a dashboard without having to rebuild the dashboard each time.
There are two Excel datasets for this session
Session outline
- In this session, we’re going to think about data modelling for Power BI
- we’ll start with a sub-set of the GP data we were working with over the past couple of sessions
- as we’ll see, the simplest (and most Excel-like) way of holding that data isn’t a natural fit for work in Power BI
- we’ll use the model to explore that data
- we’ll introduce the idea of a star schema as the usual way that you’ll need to organise your data for Power BI
Setup
- download both datasets (.xlsx full data and .xlsx split data) to somewhere sensible on your computer
- connect to the full data Excel file, and load the single table
- recreate a simple starting dashboard as a warm-up exercise
What’s the problem?
- this data is held as a single flat file
- while easy to understand, it’s highly repetitious: the seven GP practices in Orkney require 25 columns and 112 rows of data - so 2800 values in total
- this causes performance issues (as well as brain-ache)
What’s the solution?
- we can break this data into several related tables
- that reduces repetition and makes the data easier to understand
- this has the useful side-effect of speeding-up Power BI
Jumping ahead: a second bout of data loading
- create a new report
- connect to the .xlsx split data and load all four tables
- have a look at the data - you’ll see that it’s much more concise (582 values in total - so approx a fifth of the size of the full data)
- now inspect your new data in the model
The model
- we should see our data tables as blocks in the model
- there are also connections between some of those blocks
- Mouse-over the relationships in the model view
- What’s the difference between
and
?
- Now expand the Properties pane
Relational data
- instead of a single complex file, our data is now held as several related tables
- relationships - the lines between the blobs - show how the parts of our data are connected together
- for example, in the practices table, the HB column (which contains the health board code S08000025 each time) relates to the boards table HB column, which contains matching values
- that means that we can use the data from the boards table - like the name of the Health Board - in our Power BI work
Relationships
- you can see details of your relationships in the properties pane in the model view
- make sure the practices-to-boards relationship is set to active - take care to click
Apply
- re-create your original map using the postcodes from the practices table and the HBName column from the boards table
Creating a relationship
- if you look carefully at the model, you’ll see a missing relationship
- Power BI guesses which columns might share a relationship by looking for identical column names
- but the population table has a different column name from the practices table. Luckily, we can manually create a relationship:
- drag the
DataZone
column from practices to theDZ
column in the population table - that should open the
New relationship
dialogue: - make sure the cardinality is set to
Many to one (*:1)
, and make the relationship active
Cardinality
- cardinality describes one feature of a data relationship: how many items should we expect at each end of the relationship?
- the simplest case is 1:1. We’d find that where each value is unique. So if we have two data sets like this, showing something like sports players, with their heights in one table and the number of goals they score in a second table:
and
name | goals |
---|---|
Emma | 5 |
Rani | 8 |
Sven | 6 |
Think of the values in the name columns as keys that show us which rows of data correspond to each other. Using these names as keys, we’d expect a 1:1 relationship between the tables. Each key value is unique in each table, and so we’ll get exactly one value for Rani’s goals and height, for example. If we had several different rows for each player’s goals…
name | goals |
---|---|
Emma | 5 |
Rani | 8 |
Sven | 6 |
Emma | 4 |
Rani | 7 |
Sven | 6 |
This would give us a 1:many (which Power BI calls 1:*
) relationship, where we’d expect to get several rows from the goals table for each row in the height table.
The star schema
If you look at our data, we’ve got a single central table (with the practice details), and then a group of three lookup tables that contain fine-print details about each one of the practices. That’s starting to approximate a star schema
Starring your data
- we’ll leave this as a post-session exercise, but roughly, the steps are:
- take the full data query and duplicate five times (one per final table)
- rename each query (
main_table
,demographics
,boards
,population
,practice_details
) - select the columns you need for each table:
- PracticeCode, DataZone, HB in the main table
- PracticeCode through to GPCluster in the practices_details table
- PracticeCode, age_range, n_patients, and Sex in the demographics table
- DataZone, Population, SIMD2020v2_Rank and URname in the population table
- HB, HBName, HBDateEnacted in the boards table
- remove duplicate rows in each table
- then check the relationships to make sure the keys and cardinalities match up!