Shortlisted for the Information Is Beautiful Awards

31st October 2013

We’re delighted and very honoured to be shortlisted for the Information Is Beautiful Awards this year. There are some great entries, so please go and check them out - and we certainly wouldn’t mind if you chose to vote for us :)

Merging spreadsheets with OpenRefine (Google Refine)

4th June 2013

Let’s face it - we’re not quite in the golden age of linked data yet.

Until we get there, let’s take a pragmatic look at how you can combine data from two spreadsheets using OpenRefine. We’ll walk through a relatively simple real-life example.

We recently wanted to import some data from the UK 2011 Census into Dataseed, using a map to visualise the geographic dimension. Unfortunately, the data used the new geographic codes exclusively, while we only had boundary data for the old codes. Luckily, we had a lookup table that simply mapped the old codes to the new ones.

So, the challenge then became to take the spreadsheet containing our boundary data, and change the old geographic codes to the new ones.

These are the steps we took:

  1. Create a project in OpenRefine for the boundary data spreadsheet.

  2. Create a project in OpenRefine for the table mapping old codes to new codes.

  3. In the boundary data project, use the “Add column based on this column” feature.

  4. Use the following GREL expression to pull in the data from NEW_COLUMN in PROJECT_NAME, linking on the COMMON_COLUMN:

    cell.cross("PROJECT_NAME", "COMMON_COLUMN").cells["NEW_COLUMN"].value[0]

  5. Et voilà, your new column has been created.

You’ll probably want to use the facet feature to check for rows that didn’t match. You might also want to add some exceptions or error handling to the GREL expression in step 4, but we’ll keep this simple for now.
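
If you’d rather do the same join outside Refine, here is a minimal pandas sketch of the equivalent operation, including the unmatched-rows check (file and column names are illustrative, not from our actual spreadsheets):

    # A pandas equivalent of the cross() lookup above; names are made up.
    import pandas as pd

    boundaries = pd.read_csv("boundaries.csv")  # contains an OLD_CODE column
    lookup = pd.read_csv("code_lookup.csv")     # maps OLD_CODE to NEW_CODE

    # A left join keeps every boundary row; rows with no match get NaN in
    # NEW_CODE, which plays the same role as faceting for blanks in Refine.
    merged = boundaries.merge(lookup, on="OLD_CODE", how="left")
    print(merged["NEW_CODE"].isna().sum(), "rows had no matching new code")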

Happy refining.

Most data isn’t “big,” and businesses are wasting money pretending it is

31st May 2013

Read the full article by Christopher Mims

I couldn’t agree with this more - “big data” has become a synonym for data analysis.

Our philosophy with Dataseed has always been to focus on quality of data rather than quantity. Christopher points to a paper stating that most of Facebook’s data problems are in the megabyte to gigabyte range. I was surprised to read that about Facebook, but I’m sure it’s true for the vast majority of businesses.

With Dataseed, we’re focusing on data in the 1MB to 1GB range. We believe most of the complexity comes from high dimensionality rather than size. Our datacube visualisations allow you to explore datasets with multiple dimensions using clickable, interactive charts.

Your responses to our data import survey

2nd April 2013

Thanks so much to everyone who answered the survey on data importing. The consensus was overwhelmingly in favour of CSV / Excel spreadsheet imports - so that’s what we’re building!

It’s not Dataseed, but here’s a screenshot of the aggregated responses. [roll on Survey Builder and Dataseed integration!]

How would you prefer to import your data?

12th March 2013

We’ve been beavering away, getting ready for beta. We want to make sure we get it right, so we’re asking you one important question:

Q: How would you prefer to get your data into Dataseed?

  • Google Spreadsheets
  • CSV / Excel file upload
  • RESTful JSON API
  • Browser extension that allows you to clip tables from webpages
  • Other (e.g. integration with your data source)

Please answer our 1-minute survey

Thank you so much for your support; we appreciate and rely upon your feedback!

New Features - Coming Soon!

23rd December 2012

We’ve been thinking about how we can make Dataseed work well for even more types of statistical data. We’d love to hear your feedback on these features - are they useful to you, how would you use them, can you think of a better way? Please leave a comment below…

Customising chart types

Fig 1. shows the “Vehicle Type” dimension displayed as both a bar chart and a pie chart. Both charts can be displayed at once, but more likely you would show one at a time and use the toggle tool to switch between them. This feature is already available on the current demo visualisation - use the settings on the “Causes of death” dimension to choose between a bubble and a bar chart.

Fig 1.

Hierarchical dimensions

Fig 2. uses a bubble tree to explore hierarchical dimensions. The first level of the hierarchy is the central bubble, and the second level is displayed around it. If you select a bubble in the second level, it will replace the central bubble and the third level will be displayed around it.

Fig 2.

Fig 3. shows an alternative way of exploring hierarchical dimensions. It uses a breadcrumb to navigate back up levels in the hierarchy, while double-clicking on a bubble drills down one level further into the data.

The first chart in Fig 3. shows the first level (year). The second chart shows the second level (month), with a cut on month 2.

This approach is compatible with multiple chart types.

Fig 3.

Custom layouts

We have been experimenting with custom layouts of charts to form a dashboard. We would like the user to be able to drag and drop charts into the main content area from the set of possible dimensions. The charts would restack according to a predetermined grid.

Fig 4.

Technical Architecture

1st November 2012

Dataseed’s architecture can be split into two parts — back-end and front-end.

All the technology currently used is open-source and all components are loosely coupled, communicating via APIs that can be opened up as integration points.

Back-end

The back-end is powered by three different data stores and some Python code, and is exposed through a RESTful API.

  • Graph store — for storing relationships and linked data model
  • Analytics engine — for performing fast calculations
  • Data warehouse — for storing original imported data

Perhaps the most powerful component at the moment is the analytics engine, which performs OLAP-like operations on multidimensional data. It is based on a schemaless inverted-index architecture and is built for the cloud, so it scales seamlessly across many servers. It is capable of performing statistical operations on data added in real time, as well as text analysis (keyword matching, n-grams and custom tokenizers) and geo-distance operations.
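
As a toy illustration of the inverted-index idea (our own Python sketch, not the engine’s actual implementation): each dimension value maps to the set of observations that carry it, so cutting on several dimensions at once reduces to a set intersection.

    # Toy inverted index over some observations; illustrative only.
    from collections import defaultdict

    observations = [
        {"id": 0, "gender": "M", "disease": "Cancer"},
        {"id": 1, "gender": "F", "disease": "Cancer"},
        {"id": 2, "gender": "M", "disease": "Diabetes"},
    ]

    index = defaultdict(set)
    for obs in observations:
        for dim in ("gender", "disease"):
            index[(dim, obs[dim])].add(obs["id"])

    # A cut on gender=M and disease=Cancer is just a set intersection:
    print(index[("gender", "M")] & index[("disease", "Cancer")])  # {0}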

The data model we use is inspired by the RDF Data Cube vocabulary, which is in turn designed to be compatible with a subset of SDMX, the ISO standard for statistical data and metadata exchange.
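
In practice this means each observation carries one value per dimension plus its measure(s). A rough sketch (the field names are our own illustration, not a fixed schema):

    # Hypothetical observation record in a Data Cube-inspired model.
    observation = {
        "dimensions": {"year": "2002", "area": "00FN",
                       "gender": "M", "disease": "Cancer"},
        "measures": {"deaths": 72245},
    }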

Front-end

The front-end is a client-side application built entirely in JavaScript, containing the following key components:


Fig 2.2.1 - Dataseed client modular architecture

  • Back-end API wrapper — to request datasets and operations thereupon.
  • Dataset model — to access datasets using native JavaScript data structures.
  • Chart model — with a pluggable architecture to implement new chart types.
  • Page manager — to lay out charts on the page in a responsive grid.

The code is structured as loosely coupled MVC modules, using design patterns that make it easy for developers to extend and modify. The frameworks and libraries used are well respected, including Backbone.js, D3.js, Grunt.js and Bootstrap.

An example visualisation tool will be provided as a demonstration of how the front-end can be used to query and visualise our sample dataset on mortality. This example includes bubble charts, bar charts and choropleth maps, all of which are designed to be interactive, with scales that recompute dynamically as the data changes. The design is also responsive: the page reconfigures its layout based on the browser / device size.

What does Dataseed do?

31st October 2012

The main functions that Dataseed offers can be summarised as:

  • 1.1. Import
  • 1.2. Link
  • 1.3. Visualise
  • 1.4. Analyse
  • 1.5. Export

1.1 Import

Data can be imported programmatically from any source using the RESTful API. This is the fastest way of getting large amounts of data into the system but it requires programming ability.

We also have a user interface to import data from spreadsheets which requires no programming ability.

Additional interfaces to import data from other sources, such as databases, CKAN, ScraperWiki or Google Spreadsheets, can be written on top of the API. It is also possible to support real-time data feeds, perhaps from physical sensors or social media streams.
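
For instance, a programmatic import over the API might look something like the Python sketch below. The endpoint, payload shape and authentication scheme are illustrative assumptions, not Dataseed’s actual API:

    # Hypothetical import call; endpoint, payload and auth are assumptions.
    import requests

    observations = [
        {"year": 2002, "area": "00FN", "gender": "M",
         "disease": "Cancer", "deaths": 72245},
    ]
    resp = requests.post(
        "https://api.example.com/datasets/mortality/observations",
        json=observations,
        headers={"Authorization": "Bearer YOUR_API_KEY"},
    )
    resp.raise_for_status()  # fail loudly if the import was rejected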

As an example of data to import, let’s take the following snippet of a spreadsheet containing mortality statistics from the Office for National Statistics:

Year    Area    Gender    Disease     No. of deaths
2002    00FN    M         Cancer      72245
2002    00FN    F         Cancer      74583
2002    00FN    M         Diabetes    5648
2002    00FN    F         Diabetes    4378

Fig 1.1.1 - ONS mortality observations

1.2 Link

Linking additional data is optional. The data fed into the importer need only contain a set of facts with one or more dimensions; the example above contains four dimensions: year, area, gender and disease. To make the data easier to understand, or to extract more meaning from it, we can link in more information about each of the dimensions.

For example, the data currently refers to areas by their ONS codes, but we would prefer to refer to them by more familiar names. To do so, we can load in another spreadsheet containing all the area codes and their names:

Area Code    Area Name
00FN         Birmingham
00FD         Manchester
00FA         Cornwall

Fig 1.2.1 - Area dimension meta-data

Once we have this data in the system, we can recognise ONS area codes in the future, and offer to use these names instead.

Additionally, we may wish to plot these areas on a map. The simplest way to do that would be to add geo-data of area boundaries to the spreadsheet shown in Fig 1.2.1:

Area Code    Area Name     Boundary GeoJSON
00FN         Birmingham    [-0.675915853559, 51.608213181169], [-0.692084437317, 51.655083439557], [-0.762116728164, 51.702360018498], [-0.711262710734, 51.72531379259]…
00FD         Manchester    [-0.724184876162, 51.749122534045], [-0.694807160618, 51.749369131919], [-0.673480280744, 51.768465052017], [-0.586388780072, 51.7521092153]…
00FA         Cornwall      [-0.563217840884, 51.711874238386], [-0.544024860281, 51.696812243168], [-0.548703878535, 51.682666637748], [-0.505115190046, 51.673072327805]…

Fig 1.2.2 - Area dimension meta-data with boundaries

We plan to be able to automatically link in data for entities that we recognise. We also plan to support the linked-data pattern of dereferenceable URIs that return an RDF Data Cube representation of your data.

1.3 Visualise

We have designed and built a visualisation tool that can be applied to any multidimensional dataset — an interactive data-cube.


Fig 1.3.1 - screenshot of unfiltered visualisation

Each chart shows the variation across one dimension in the dataset. Default charts are chosen for each dimension based on its meta-data and some simple analysis. For example, a bubble chart is a good way to visualise many values; pie charts are only really useful when there are two or three values. The charts can then be customised by the user.

This visualisation will be provided as an open-source project that contains a framework for extracting data using our API. The visualisation may be customised entirely, but we have also provided an additional API just for charts. This makes it possible to change just the way that each dimension is displayed. For example, instead of a bar chart, you could use a slider for the year, a third-party pie chart for gender, and a Google Map for the areas.

1.4 Analyse

We provide an aggregate total for the “number of deaths” measure in the dataset. It is also possible to click on any feature in a chart to filter, or “cut”, the dataset. Multiple dimensions can be cut at once, giving the slice and dice operations of OLAP terminology.


Fig 1.4.1 - visualisation filtered on year, disease and area

If a dimension has a hierarchy, you could drill up or down within the hierarchy and cut on values at various levels.

We also provide some additional statistical analysis, not displayed in the visualisation, for each dimension. The mean, minimum, maximum, variance, sum of squares and standard deviation are currently included.
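
As a rough sketch of these cut and summary operations (our pandas illustration, not Dataseed’s engine), using the mortality example:

    # Illustrative OLAP-style cuts and summary statistics in pandas.
    import pandas as pd

    df = pd.DataFrame({
        "year":    [2002, 2002, 2002, 2002],
        "area":    ["00FN", "00FN", "00FN", "00FN"],
        "gender":  ["M", "F", "M", "F"],
        "disease": ["Cancer", "Cancer", "Diabetes", "Diabetes"],
        "deaths":  [72245, 74583, 5648, 4378],
    })

    # A cut on two dimensions at once ("dice" in OLAP terminology):
    cut = df[(df["disease"] == "Cancer") & (df["gender"] == "M")]
    print(cut["deaths"].sum())  # aggregate total for the filtered measure

    # Per-dimension summary statistics like those listed above:
    print(df.groupby("gender")["deaths"].agg(["mean", "min", "max", "var", "std"]))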

1.5 Export

It is possible to export the entire visualisation, individual charts, or even individual facts as sentences.

Exports take the form of embed code that can be pasted into other websites, e.g. news articles, blog posts, interactive reports. These exported charts are still driven by Dataseed and will be updated dynamically from live data, making it possible to track and provide analytics on their usage.

Charts can also be exported as static SVG vector files, which can be imported into graphics programs such as Adobe Illustrator. This makes it easy for designers to create accurate charts from real data for further design or pre-press in familiar tools.

UK Mortality Data Preparation - Google Refine and Dataseed

29th October 2012

To showcase Dataseed's abilities, we needed some data to visualise! The Office for National Statistics releases yearly mortality statistics which detail the causes and locations of deaths throughout England and Wales (Scotland and NI are collated and released separately). As with any large dataset, there were many idiosyncrasies to get to grips with before the data was ready for use - not least the International Classification of Diseases (ICD-10) and the ever-changing administrative geography of the UK.

In this post I’m going to run through the steps we took to get the data into a set of “observations” suitable for importing into Dataseed. The software we used for this purpose was the excellent Google Refine, although similar results could probably be achieved in Excel if that’s what you’re more comfortable with.

  1. Firstly, all CSV files were imported into a new Google Refine project. The ONS data is released yearly, and each set is split up regionally with an additional “national” file for aggregate statistics. We ended up importing all the files into Refine, as the national files are the only place where data for non-residents is stored.

  2. The import settings used in Refine were:

    1. Set “Columns are separated by” to “commas (CSV)”
    2. Ignore first 6 line(s) at beginning of file
    3. Parse next 2 line(s) as column headers
    4. Discard initial 2 row(s) of data
  3. Once the files had been imported (which took a while!) all cells were stripped of whitespace using: “Edit cells” → “Common transforms” → “Trim leading and trailing whitespace”

  4. Next we created a new “Year” column based on the source file name (the File column), using “Edit column” → “Add column based on this column” and entering the following GREL expression:

    "20" + substring(value, 2, 4)
  5. As we are only visualising the top-level ICD chapters and not all the individual codes within them, we needed to filter the data to reflect this. Due to our familiarity with Python (and our unfamiliarity with GREL), we switched over to using Jython scripting in Refine at this point. The chapter column was created using “Add column based on this column” on the ICD column with the following script:

    if value in ('A00-B99', 'C00-D48', 'D50-D89', 'E00-E90', 'F00-F99', 'G00-G99', 'H00-H59',
                 'H60-H95', 'I00-I99', 'J00-J99', 'K00-K93', 'L00-L99', 'M00-M99', 'N00-N99',
                 'O00-O99', 'P00-P96', 'Q00-Q99', 'R00-R99', 'S00-T98', 'V01-Y89', 'Z00-Z99',
                 'U00-U99'):
        # Valid top-level chapter: keep it
        return value
    elif value == 'U50.9,V01-Y89':
        # List these external causes under the single V01-Y89 chapter
        return 'V01-Y89'
    # Everything else is an individual code we don't need
    return None

    This code only returns a value for valid chapters (the long tuple at the beginning) and transforms “U50.9,V01-Y89” to “V01-Y89” (as we want these external causes to be listed under a single “Accidents” chapter).

    Once the new column has been added, we can use it to filter our results by first creating a new facet with “Facet” → “Text facet”. As our new column only has values for the rows in the data that we want, we can remove all other rows by selecting “(blank)” in the facet and then going to “All” → “Edit rows” → “Remove all matching rows”.

  6. The data provided had many different region “levels”, e.g. the South West and the City of Bristol. These required normalisation if we wanted to use geographical visualisations such as a choropleth map. By only including county councils, metropolitan areas, unitary authorities and London boroughs, we managed to create one complete set of data, with no deaths being counted more than once.

    Due to the creation of new unitary authorities in 2009, we had to map the pre-2009 authorities onto their new unitary authorities for all years before 2009. We also had to remove the old authorities from the 2009 data.

    The final part of the geographic normalisation process was including a special case for the “A0005” region. This is a “virtual” region used to count deaths for non-UK residents.

    Bringing together all these requirements led us to the following Jython script:

    if value[:1] == 'F' or value[:1] == 'H':
        # Strip the level prefix from the region code
        value = value[1:]
        if cells['Year']['value'] == '2009':
            # Old (pre-2009) authorities should not appear in the 2009 data
            if value in ('15UF', '15UD', '15UC', '15UG', '15UB', '15UE', '20UE', '20UF',
                         '20UG', '20UH', '20UJ', '20UD', '20UB', '35UD', '35UG', '35UE',
                         '35UF', '35UB', '35UC', '39UC', '39UD', '39UE', '39UF', '39UB',
                         '46UD', '46UF', '46UB', '46UC', '09UC', '09UE', '09UD', '13UE',
                         '13UH', '13UB', '13UG', '13UC', '13UD', '15UH'):
                return None
        else:
            # Earlier years: map old authorities onto the new unitary authorities
            if value in ('15UF', '15UD', '15UC', '15UG', '15UB', '15UE'):
                return '00HE'
            elif value in ('20UE', '20UF', '20UG', '20UH', '20UJ', '20UD', '20UB'):
                return '00EJ'
            elif value in ('35UD', '35UG', '35UE', '35UF', '35UB', '35UC'):
                return '00EM'
            elif value in ('39UC', '39UD', '39UE', '39UF', '39UB'):
                return '00GG'
            elif value in ('46UD', '46UF', '46UB', '46UC'):
                return '00HY'
            elif value in ('09UC', '09UE'):
                return '00KC'
            elif value == '09UD':
                return '00KB'
            elif value in ('13UE', '13UH', '13UB'):
                return '00EW'
            elif value in ('13UG', '13UC', '13UD'):
                return '00EQ'
            elif value == '15UH':
                return '00HF'
        # Any other prefixed code is already a valid area: keep it
        return value
    elif value == 'A0005':
        # Virtual region counting deaths of non-UK residents: keep it
        return value
    # All other region levels are dropped to avoid double-counting
    return None

    After creating a new column with this script, we used the same facet / select “(blank)” / remove matching rows technique detailed in the previous step to filter down the data.

  7. We wanted to match the ICD code ranges to their associated chapters, so another Refine project was created using a CSV file of chapters with columns for “ICD Range”, “Chapter” and “Title”.

    Once this had been imported, we returned to the mortality Refine project and used the “Add column based on this column” feature with the following GREL expressions to link in the chapters’ numerals and titles:

    • "Chapter" column based on "ICD10 code":

      cell.cross("Chapters", "ICD Range").cells["Chapter"].value[0]
    • "Chapter Title" column based on "ICD10 code":

      cell.cross("Chapters", "ICD Range").cells["Title"].value[0]
  8. The final step was to use Refine’s “Export” function to output a new CSV file. The new aggregate file contained data from all years, filtered down to only the causes and regions we wanted, all ready to import and visualise in Dataseed!

Dataseed Roadmap

1st October 2012

We work in an agile process, so priorities are subject to change periodically. Below is a high-level list of features that are currently on the roadmap:

  • Release open-source front-end project

  • Release API to beta users

  • Release UI to upload spreadsheets

  • Add visualisation customisation features

  • Exports

    • Multidimensional visualisations, driven by live data
    • Raw data as CSV and RDF Data Cube
    • Single charts as SVG
  • Simple processors - e.g. bucketing, translation, calculation (e.g. date of birth to age; see the sketch at the end of this list)

  • New types of charts and widgets (e.g. network charts, sliders etc.)

  • In-page d3/CSS editor

  • Allow hierarchy in dimensions (roll-up)

  • Analytics on visualisation embeds

  • UI to drag & drop design multidimensional visualisations

  • Integrations. E.g.

    • CKAN - “Visualise this dataset with Dataseed”
    • ScraperWiki - periodically scrape data from webpages
    • Drupal module - send entities to Dataseed
    • With these tools for data wrangling and visualisation: http://selection.datavisualization.ch/
  • Measures with custom calculations

  • Animated visualisations / slideshow / walk-through presentations

  • Automated analysis - point out interesting features of the dataset

  • Live alerts for real-time data streams

  • More linked-data features!
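
As an aside on the “simple processors” item above, here is a hypothetical sketch of a date-of-birth-to-age-bucket processor (the function names and bucket width are our own illustration):

    # Hypothetical "simple processor": derive an age bucket from a date of birth.
    from datetime import date

    def dob_to_age(dob, today=None):
        today = today or date.today()
        # Subtract one if this year's birthday hasn't happened yet
        return today.year - dob.year - ((today.month, today.day) < (dob.month, dob.day))

    def age_bucket(age, width=10):
        lo = (age // width) * width
        return "%d-%d" % (lo, lo + width - 1)

    print(age_bucket(dob_to_age(date(1984, 6, 2))))  # e.g. "20-29"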