One of the goals of publishing State Records NSW’s archival control data is to promote creative re-use of that information. But what does that mean in practice… and how do you do it? In this post I give a small example of the type of mashup that can be created using the datasets on this site.
State Records NSW’s online catalogue, Archives Investigator, is often regarded simply as a gateway to archival records, but it is a much richer resource than that. Archives Investigator provides information about the people or agencies that created records and their business reasons for doing so. This contextual information helps us identify and interpret records. It also has a lot of untapped secondary value. For example, the ministries data is perhaps the simplest of the datasets State Records has released. On their own, the ministry entities in Archives Investigator simply list the names and date ranges of government ministries, with links to the portfolios they contained. In aggregate, however, this data provides a complete timeline of political change in New South Wales since self-government.
This timeline is an example of a mashup. A mashup is simply the combination of one set of data (in this case, information on ministry changes in New South Wales) with other data and/or applications. The great thing about mashups is that by combining data and applications together you can often create something that is greater than the sum of the individual parts.
Our starting point in creating this mashup is the ministries.xml file (you can find it in the context.zip file on the Datasets page). The structure of this file is very simple: for each ministry we have four fields of information.
To make the timeline, we use the TimelineSetter application, an application recently released as open source by ProPublica.org. The four fields in the ministries.xml file are enough to establish the framework for a TimelineSetter timeline: we have dates, titles for the events, and we can use the ministry numbers to create links back to Archives Investigator (the ‘Read More’ links in the timeline).
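As a rough illustration of that step, here is a minimal Python sketch that reads ministry records and writes the kind of CSV that TimelineSetter takes as input. The element names (number, title, start_date, end_date), the sample records and the link pattern are all assumptions made for this example, not the actual layout of ministries.xml or the real Archives Investigator URLs. The column headings follow TimelineSetter's documented CSV format, but check them against the version you install.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample data: the real ministries.xml may use different
# element names and date formats.
SAMPLE_XML = """<ministries>
  <ministry>
    <number>1</number>
    <title>First Example Ministry</title>
    <start_date>1901-01-01</start_date>
    <end_date>1902-06-30</end_date>
  </ministry>
  <ministry>
    <number>2</number>
    <title>Second Example Ministry</title>
    <start_date>1902-07-01</start_date>
    <end_date>1904-03-15</end_date>
  </ministry>
</ministries>"""

# Placeholder link pattern, standing in for the Archives Investigator URL.
LINK_TEMPLATE = "https://example.org/ministry/{number}"

# Column headings assumed from TimelineSetter's CSV input format.
COLUMNS = ["date", "display_date", "description", "link", "series", "html"]


def ministries_to_rows(xml_text):
    """Turn each <ministry> element into one timeline row (a dict)."""
    root = ET.fromstring(xml_text)
    rows = []
    for ministry in root.iter("ministry"):
        number = ministry.findtext("number", default="").strip()
        title = ministry.findtext("title", default="").strip()
        start = ministry.findtext("start_date", default="").strip()
        end = ministry.findtext("end_date", default="").strip()
        rows.append({
            "date": start,                          # where the event sits
            "display_date": f"{start} to {end}",    # label shown to readers
            "description": title,
            "link": LINK_TEMPLATE.format(number=number),  # 'Read More' link
            "series": "Ministries",
            "html": "",
        })
    return rows


def rows_to_csv(rows):
    """Serialise the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(rows_to_csv(ministries_to_rows(SAMPLE_XML)))
```

The series column is left as a single value here; it becomes more useful once each ministry can be tagged with something like a party name.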
This is a great start, and we already have a usable timeline, but we can make it better by mixing our data with other sources of publicly available data.
One thing our ministry data lacks is information about the political affiliation of different ministries: were they Labor or Coalition or something else? By matching the names of the leaders of each ministry to lists of political leaders for the two major political parties we can make a pretty good guess. Luckily enough, Wikipedia has two such lists.
If you are making a mashup using Wikipedia you should consider using a service such as DBpedia, which provides an interface to the structured data in Wikipedia. For our timeline, however, we can take a simpler approach by screen scraping the Wikipedia pages directly. Screen scraping just means extracting data from a web page’s HTML code. The party leader information in Wikipedia is neatly arranged in particular columns within tables: we can use that structure to identify the cells we need to pull out the names.
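To give a feel for how that works, here is a small Python sketch that pulls one column out of an HTML table and then uses the names to guess a ministry's party. The sample HTML, the made-up leader names, the column position and the idea of matching on surname are all assumptions for illustration; real Wikipedia tables are messier and the matching in the actual timeline code may differ.

```python
import re
from html.parser import HTMLParser

# Hypothetical table, loosely shaped like a list of party leaders.
SAMPLE_HTML = """<table>
<tr><th>No.</th><th>Leader</th><th>Term</th></tr>
<tr><td>1</td><td><a href="/wiki/A">Alex Example</a></td><td>1901-1902</td></tr>
<tr><td>2</td><td><a href="/wiki/B">Sam Sample</a></td><td>1902-1904</td></tr>
</table>"""


class TableColumnParser(HTMLParser):
    """Collect the text of one <td> column from every table row."""

    def __init__(self, column_index):
        super().__init__()
        self.column_index = column_index
        self.values = []
        self._cell_index = -1
        self._in_cell = False
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._cell_index = -1        # new row: restart the cell count
        elif tag == "td":
            self._cell_index += 1
            self._in_cell = True
            self._text = []

    def handle_data(self, data):
        if self._in_cell:
            self._text.append(data)      # includes text inside nested links

    def handle_endtag(self, tag):
        if tag == "td" and self._in_cell:
            self._in_cell = False
            if self._cell_index == self.column_index:
                text = " ".join("".join(self._text).split())
                if text:
                    self.values.append(text)


def extract_column(html, column_index):
    """Return the cell text found in the given column (0-based)."""
    parser = TableColumnParser(column_index)
    parser.feed(html)
    return parser.values


def guess_party(ministry_title, leaders_by_party):
    """Return the first party with a leader whose surname is in the title."""
    words = set(re.findall(r"[a-z']+", ministry_title.lower()))
    for party, leaders in leaders_by_party.items():
        for leader in leaders:
            if leader.split()[-1].lower() in words:
                return party
    return None                          # no match: leave uncategorised


if __name__ == "__main__":
    leaders = {"Labor": extract_column(SAMPLE_HTML, 1)}
    print(guess_party("Example Ministry", leaders))
```

Matching on surname alone will produce the occasional wrong answer (two leaders can share a name), which is why the result is best treated as a guess rather than a fact.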
Our timeline is looking better now. Along with the ministry information, we’ve been able to automatically categorise a good proportion of the ministries according to party affiliation. But it is still a little bare and it would be great to fill it out with some descriptive information about the different ministries.
The National Library of Australia’s Trove: Newspapers service is brilliant. If you haven’t seen it already, go take a look. It is a database of digitised Australian newspapers that runs from 1803 right up to the mid-1950s (and until the 1980s for the Australian Women’s Weekly). By searching in New South Wales papers on the dates of ministry change for the names of new premiers, we are sure to discover many links to relevant newspaper articles that will enrich our timeline.
In the same way that we screen scraped Wikipedia entries, we could probably also screen scrape Trove. This is a bit fiddly, however, and Tim Sherratt has provided a much better solution with his unofficial Australian newspapers API. An API, or application programming interface, is just a set of rules that defines how one computer service can talk to another. On the internet, APIs enable users to write programs that can interact with web sites. Tim Sherratt’s API allows us to query the Trove: Newspapers database and get back the results in formats (XML and JSON) that an application can read.
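The general pattern of talking to an API like this can be sketched in a few lines of Python: build a query URL from a premier's name and a date, then read the JSON that comes back. Everything specific here is a placeholder. The base URL, the parameter names and the shape of the response are invented for the example and are not the real interface of the unofficial newspapers API, so consult its own documentation before adapting this.

```python
import json
from urllib.parse import urlencode

# Placeholder endpoint: NOT the real address of the newspapers API.
API_BASE = "https://example.org/newspapers/api/search"

# Hypothetical response, standing in for what a JSON API might return.
SAMPLE_RESPONSE = """{
  "results": [
    {"title": "The New Ministry", "date": "1901-01-02",
     "url": "https://example.org/article/1"},
    {"title": "Ministers Sworn In", "date": "1901-01-02",
     "url": "https://example.org/article/2"}
  ]
}"""


def build_query_url(premier, date, state="nsw", fmt="json"):
    """Build a search URL; the parameter names are assumptions."""
    params = {"q": premier, "date": date, "state": state, "format": fmt}
    return f"{API_BASE}?{urlencode(params)}"


def article_links(response_text):
    """Return (title, url) pairs from a JSON response of the assumed shape."""
    data = json.loads(response_text)
    return [(item["title"], item["url"]) for item in data.get("results", [])]


if __name__ == "__main__":
    print(build_query_url("Alex Example", "1901-01-01"))
    for title, url in article_links(SAMPLE_RESPONSE):
        print(title, url)
```

In a real script the URL would be fetched over the network (for instance with urllib.request.urlopen) and the response passed to article_links; the sample string just keeps the sketch runnable offline.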
And that’s all it takes! By displaying State Records’ ministry data with the TimelineSetter tool and by connecting it with information from Wikipedia and Trove: Newspapers, we’ve quickly created a pretty useful tool. What ideas do you have for mashups using State Records’ data?
If you are interested in seeing the code that created this timeline, check out: https://gist.github.com/1005698