The Census is one of those data exercises that brings out the best and the worst in day-to-day data journalism. The best is the access to lovely, very granular data, which allows you to interrogate an area in detail. The worst is the amount of mucking around you have to do with the data just to get it to work.
The map above, in its more interactive form, was the most popular thing we did on the web last week – it was in the Guardian’s top ten on census day, even though we didn’t finish it until after 4pm GMT. So, how did we make it? This is how that process can work.
The Census was published by the Office for National Statistics at 9.30 in the morning. It is not released under embargo as so much government data is – if you’re not on the pre-release access list like the Home Secretary, then you get it the same time as everyone else.
The big deal about this release was that it was the first time people in England and Wales were asked which language they speak as their main one. This brings its own problems: English was lumped in with Welsh in Wales (but not in the rest of the country). So, we were stuck with something vaguely unsatisfactory to start with. I asked the ONS about this and they told me:
In Wales, we asked three questions, with the additional one being on Welsh language ability (looking at ability to speak, read and write Welsh). To avoid confusing/annoying respondents (we tested these questions a lot!), we concluded that we shouldn’t ask English proficiency of people whose main language was either English or Welsh (since Welsh speakers typically speak English as well, and there is no user requirement to know how much English they speak as all services are provided bilingually in Wales).
As we were limited on questionnaire space, we were forced to combine English and Welsh into a single category in the “Main Language” question, which was reported on today, and is causing the irritation.
The other thing about the data was that it was the first time the information had been published at the smallest levels of geography: lower super output areas, each of which has about 1,500 people living in it. It means that you can see things at a much smaller level than the larger local authorities – and get some real detail in there.
And this information is most interesting while it’s in the news, which means getting it on the website the same day it comes out – which also means publishing by lunchtime. We also used the census data to map transport use the next day – but the method is the same.
Step 1: Find the data
As main language was the big story (and the most interesting), I thought it would be good to show at that local level. But the main ONS data page for this census release only had this data at local authority level.
Nomis came to the rescue. For anyone who deals with UK data, Nomis is a godsend. It’s pretty old-fashioned but very simple. It allows you to download local-level ONS data on everything from unemployment and benefit claimants through to population figures and England & Wales census data – at any local level you want. Regions, parliamentary constituencies, local authorities, council wards and the level we needed here: lower super output areas.
I love Nomis so much that I live in daily fear of it being modernised as part of a misguided government IT plan.
If you want to reproduce how I did it, this is how you get the data:
- Go to Nomis
- Click Wizard query
- Then hit Census 2011 – Quick Statistics
- Then you have to choose your dataset. For languages, I went for this one: 2011 census qs204ew – main language. For the transport dataset, you want this one: 2011 census qs701ew – method of travel to work
- Next, you have to choose your geography. I went for 2011 super output areas – lower layer. When you hit that dropdown you get a choice of ‘none’, ‘some’ or ‘all’. Choose ‘some’ if you only want a place or two – or click ‘all’ to get all 34,753 of them, which is what I did. Then hit ‘next’
- Then you can choose your languages: I chose just the main ones, not least because Nomis won’t allow you to download more than 1m cells in one go. So it was Polish, Urdu, Bengali, French and so on. Then hit ‘next’
- Then you get a choice of Excel or CSV, depending on the data. Then hit ‘finish‘
That’s your 34,753-row spreadsheet. And it’s much easier to work with than the local authority level one given by the ONS on its main site – that one has the added complication of putting the names for regions, unitary authorities and smaller areas into different columns. I understand there are good filtering reasons for that, but it is always a pain to deal with.
Step 2: sort the spreadsheet
This one was pretty straightforward but for one thing – it only shows the raw numbers. And raw numbers are good for one thing: to get percentages from. If there are 50 Polish speakers in one area and 150 in another, it’s meaningless unless you know how many people live in each place.
But luckily the first column contains a total population – so you can use that to work out the percentage of each language’s speakers in each place. If you don’t know it, the sum goes like this:
25 Polish speakers in an area with 1,450 people in it means you go: 25 ÷ 1,450 × 100
Which gives you 1.72%. In the spreadsheet that’s a formula like `=C2/B2*100` – with the total population in column B and the language count in column C, though the column letters will depend on your download.
There’s a little trick here for getting that formula to apply to the whole column: click on the cell and you get a little square marker in its bottom right corner.
Double-click that and the formula shoots all the way down the column, until there’s a gap in the column to its left. Handy with such a deep spreadsheet. Then just repeat for each language.
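If you’d rather do the percentage step outside a spreadsheet, it can be sketched in a few lines of Python. The column names here are illustrative assumptions, not the actual Nomis headers:

```python
# Hypothetical rows mirroring the Nomis layout: a total-population
# column followed by one raw count per language.
rows = [
    {"lsoa": "E01000001", "all_residents": 1450, "polish": 25, "urdu": 10},
    {"lsoa": "E01000002", "all_residents": 1600, "polish": 48, "urdu": 112},
]

languages = ["polish", "urdu"]

for row in rows:
    total = row["all_residents"]
    for lang in languages:
        # percentage = raw count / total population * 100
        row[f"{lang}_pct"] = round(row[lang] / total * 100, 2)

print(rows[0]["polish_pct"])  # 25 speakers out of 1,450 people -> 1.72
```

The same loop handles every language column at once, which is the scripted equivalent of dragging the formula down and repeating it per language.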
Step 3: get the borders
The LSOA boundaries had been updated since the census, which meant our existing data file was out of date – but the ONS helpfully sent the new one which we downloaded and re-uploaded as a Fusion table.
And then I merged the two together, using the LSOA name as the join key, as that is unique to each place – you can find out how to do that in this Google Fusion tutorial. This Fusion table is the result.
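The merge step itself is just a join on a shared key. Here is a hedged sketch of the same idea in plain Python – the field names and values are assumptions for illustration, not the real boundary file schema:

```python
# Census stats and boundary records, both keyed by LSOA name,
# which is unique to each area (hypothetical sample data).
census = [
    {"lsoa_name": "City of London 001A", "polish_pct": 1.72},
    {"lsoa_name": "City of London 001B", "polish_pct": 3.0},
]
boundaries = [
    {"lsoa_name": "City of London 001A", "geometry": "<polygon A>"},
    {"lsoa_name": "City of London 001B", "geometry": "<polygon B>"},
]

# Index the boundaries by the join key, then attach each census row.
by_name = {b["lsoa_name"]: b for b in boundaries}
merged = [
    {**row, "geometry": by_name[row["lsoa_name"]]["geometry"]}
    for row in census
    if row["lsoa_name"] in by_name
]

print(len(merged))  # each matched LSOA keeps its stats plus its polygon
```

This is what the Fusion Tables merge does behind the scenes: each matched row ends up carrying both its statistics and its boundary geometry, ready to style on a map.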
And that’s pretty much it, but for a few coding and styling tricks to make the map background black. The captions for what happens when you click on an area are basic HTML – the custom code is at the bottom of the post. It brings in a Google chart too, and makes a table.
Google Fusion is not perfect – it works well with this tiny geography at a big zoom into central London, say, but zoom out and the tiling fails completely. CartoDB works better zoomed out but needs more coding:
The caption HTML code (for those who care):
This looks horrible laid out like this, and it always takes me ages to get right, but it does work.