Journalist datastores: where can you find them? A list.

The data behind the Upshot's Senate prediction model is available for download on its Github page

Where do journalists post their data? It’s a pretty core tenet of open journalism that you share your sources; i.e., you write a story about data, then you make the numbers available to download.

It matters because:

  1. Your audience is more likely to trust your story if they can test the sources
  2. Someone out there probably knows more about your story than you do — and can help make it better
  3. Your story can be improved upon and replicated
  4. Your data can be tested by the community for errors
  5. It encourages data visualizations of your work which you may not have the resources to do for yourself

So you’d assume that, with so much data journalism going on out there, we’d have record amounts of curated data ready to be downloaded and used. I’ve written before about the importance of making the data behind these stories available. This piece started as a set of links for my own use, with these points being the most important:

  • What’s the link? Is it a special interface to the datasets, or a Github repo up front? Github has become de rigueur for reporters as a storage centre. But there’s a difference between just linking directly to a Github page, which is less than user-friendly for the amateur, and creating a special interface that’s easy to use. It would be interesting to know whether the enthusiasm for opening up data is directly proportional to how easy it is to access.
  • How up-to-date is it? When was the datastore last updated and how complete is it?
  • How many datasets are there? Without knowing exactly how many articles or pieces of work are covered, it’s not possible to know what proportion of each site’s data journalism involves the data itself being published too.

You can read a bit more about open data news sources on The Source too.

This is a work in progress, but here’s the list so far (in alphabetical order):


FiveThirtyEight

Data link

Github page? Yes

Last update: September 25, 2014

Number of datasets: 26

Description: “This repository contains a selection of the data — and the data-processing scripts — behind the articles, graphics and interactives at FiveThirtyEight. We hope you’ll use it to check our work and to create stories and visualizations of your own.”


Data link

Github page? Yes

Last update: September 05, 2014

Number of datasets: 7

Description: “An index of all our open-source data, analysis, libraries, tools, and guides.”

Guardian Data

Data link

Github page? No (front page based on ScraperWiki scrape of Google Spreadsheets. Either it’s not working or no new datasets have been published since 2013)

Last update: June 5, 2013

Number of datasets: 800+

Description: “Lost track of the hundreds of datasets published by the Guardian Datablog since it began in 2009? Thanks to ScraperWiki, this is the ultimate list and resource. The table below is live and updated every day – if you’re still looking for that ultimate dataset, the chance is we’ve already done it.”

Full disclosure: Up until April 2013 I edited the Guardian Datablog. This data front page was created by the great @ChrisCross_UK, who has also left the Guardian.

Huffpost Data

Data link

Github page? Yes

Last update: July 08, 2014

Number of datasets: 3 (plus lots of code)

Description: None

La Nacion Data

Data link

Github page? No

Last update:

Number of datasets: hundreds



ProPublica

Data link

Github page? No (mixture of free FOIA datasets, links to original data or premium datasets behind investigations)

Last update: June 2014

Number of datasets: 12

Description: “ProPublica is making available the datasets that power our data journalism. The raw data we received as the result of a FOIA request is available for free, and datasets that reflect substantial cleaning and processing by our staff are available for a one-time fee.”

The Upshot

Data link

Github page? Yes

Last update: September 09, 2014

Number of datasets: 9

Description: “A New York Times website with analysis and data visualizations about politics, policy and everyday life.”
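For anyone who wants to keep a list like this one up to date, the three criteria used for each entry (Github page, last update, number of datasets) fit naturally into a small record. What follows is only a rough sketch, not how this list was actually compiled: the `Datastore` class, the helper functions and the `AS_OF` reference date are all hypothetical names and values chosen for illustration. The sample figures are copied from the entries above.

```python
from dataclasses import dataclass
from datetime import date, datetime
from typing import List, Optional


@dataclass
class Datastore:
    """One entry in the list (hypothetical structure, for illustration)."""
    name: str
    on_github: bool
    last_update: Optional[date]  # None when the entry gives no date
    datasets: Optional[int]      # None when the count is not an exact number


def parse_update(text: str) -> date:
    """Parse dates written the way the list writes them, e.g. 'June 5, 2013'."""
    return datetime.strptime(text, "%B %d, %Y").date()


def days_stale(store: Datastore, as_of: date) -> Optional[int]:
    """Days between the last update and the reference date, if known."""
    if store.last_update is None:
        return None
    return (as_of - store.last_update).days


def freshest_first(stores: List[Datastore], as_of: date) -> List[Datastore]:
    """Stores with a known update date, most recently updated first."""
    dated = [s for s in stores if s.last_update is not None]
    return sorted(dated, key=lambda s: days_stale(s, as_of))


# Values copied from the entries above.
stores = [
    Datastore("Guardian Data", False, parse_update("June 5, 2013"), None),  # "800+"
    Datastore("Huffpost Data", True, parse_update("July 08, 2014"), 3),
    Datastore("La Nacion Data", False, None, None),  # no date, "hundreds"
    Datastore("The Upshot", True, parse_update("September 09, 2014"), 9),
]

AS_OF = date(2014, 10, 1)  # assumed reference date, not taken from the post

for store in freshest_first(stores, AS_OF):
    print(store.name, days_stale(store, AS_OF), "days since last update")
```

Keeping missing dates and inexact counts as `None`, rather than guessing a value, keeps the gaps in the list visible instead of hiding them.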

You can find some more media Githubs here on this index — and here’s the Github home of Twitter’s interactive department too.

About Simon Rogers

Data journalist, writer, speaker. Author of 'Facts are Sacred', from Faber & Faber and a range of infographics for children books from Candlewick. Edited and launched the Guardian Datablog. Now works for Google in California as Data Editor and is Director of the Sigma awards for data journalism.





Free to share

Creative commons

Please share me around. Everything here is free to use under a Creative Commons Attribution-NonCommercial 3.0 Unported License
