It could be a database of cases being dealt with, it could be a calendar of meetings, it might be a collection of PDF documents of the minutes of those meetings, or perhaps it’s even a filing cabinet containing manilla folders full of paper.
Even if we assume that we can get the data into a digital form, there will still be a wide range of different types of data. We can place them on a Web server so that people can download them, but it is useful to categorise them in a way that helps people understand what type of data they are getting and how easy it will be to use once they’ve downloaded it.
Tim Berners-Lee came up with a simple five star rating system that helps describe the nature of published open data. The rating system can be summarised as follows:
One star data:
The data is in a proprietary format that might be easily readable by a person, but is perhaps harder to process by a computer. This might be a PDF document for example. A PDF of a document describing the expenditure of a local council would allow people to read what has been spent, but perhaps not allow them to easily write a computer script to check if any expenditure was over a certain amount.
Two star data:
Here, the data is in a more machine-readable form but still in a proprietary format. An example might be a Microsoft Excel spreadsheet. It is easy to read, and a script could be written to examine it automatically, but the format is perhaps specific to a certain operating system or application that may not be free to use.
Three star data:
Now the data is in a non-proprietary format such as CSV (comma-separated values). This means that it can be opened by a range of applications and across a number of different computer platforms and operating systems. It is also relatively easy to process automatically using scripts, but the script will need to understand the layout of the file, for example what each of the columns means.
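As a rough illustration of the kind of script described above, the following Python sketch checks a small CSV file for expenditure over a given amount. The sample data, the column names (date, supplier, amount_gbp) and the function name rows_over are all invented for this example; a real council dataset would have its own columns, which the script would need to be told about.

```python
import csv
import io

# Hypothetical sample data, invented for illustration only.
# A real file would be opened with open(path, newline="") instead.
SAMPLE_CSV = """date,supplier,amount_gbp
2014-01-10,Acme Paving Ltd,1200.00
2014-01-15,Brightside Cleaning,480.50
2014-02-02,Civic IT Services,15250.00
"""


def rows_over(csv_file, threshold, amount_column="amount_gbp"):
    """Return the rows whose amount is greater than the threshold.

    The script has to know which column holds the amount; the CSV
    file itself carries no description of what its columns mean.
    """
    reader = csv.DictReader(csv_file)
    return [row for row in reader if float(row[amount_column]) > threshold]


large_payments = rows_over(io.StringIO(SAMPLE_CSV), 1000.0)
for row in large_payments:
    print(row["date"], row["supplier"], row["amount_gbp"])
```

Note that the meaning of amount_gbp lives only in the script and in the reader's head. That is the gap that four star data sets out to close.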
Four star data:
Data in this form uses specific Web technologies that allow us to describe the semantics of the data. For this MOOC, we don’t have scope to discuss Semantic Web technologies in great detail, although we’d encourage you to explore the area if you find it interesting. In simple terms, the data is written in a Web format such as RDF (Resource Description Framework), which describes the data in a way that allows machines to understand its semantics more easily.
RDF helps promote greater interoperability by allowing the construction of data models (ontologies), so that similar data can be described using the same vocabularies. This can help when constructing systems that need to access a range of similar datasets on similar systems. It should be noted that data in this format is generally harder for people to read directly. Special browsers have been developed to make the data easier for people to read, or alternative versions of the data can also be provided in one to three star formats.
Five star data:
This is the gold standard of open data: the data is written in a semantic format such as RDF and, importantly, refers to data in other datasets using references or links. In the same way that web pages refer to other web pages, datasets can also link to other datasets. This helps avoid large scale duplication of data and helps turn discrete datasets into a Web of data.
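To make the idea of linking a little more concrete, here is a minimal Python sketch that models RDF-style statements as plain (subject, predicate, object) tuples and picks out the ones that point into another dataset. This is not a real RDF library. Every URI below is a made-up example.org address, and the function name external_links is hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical statements as (subject, predicate, object) tuples.
# All URIs are invented for illustration and do not resolve to real data.
TRIPLES = [
    ("http://council.example.org/payment/42",
     "http://vocab.example.org/paidTo",
     "http://companies.example.org/company/acme-paving"),
    ("http://council.example.org/payment/42",
     "http://vocab.example.org/amountGBP",
     "1200.00"),
    ("http://council.example.org/payment/42",
     "http://vocab.example.org/approvedBy",
     "http://council.example.org/department/highways"),
]


def external_links(triples, home_host):
    """Return the triples whose object is a URI hosted outside home_host."""
    links = []
    for subject, predicate, obj in triples:
        host = urlparse(obj).netloc  # empty string for plain values like "1200.00"
        if host and host != home_host:
            links.append((subject, predicate, obj))
    return links


links = external_links(TRIPLES, "council.example.org")
for subject, predicate, obj in links:
    print(subject, "->", obj)
```

In this sketch the payment record does not copy the supplier's details. It simply points at the record held in the (imaginary) companies dataset, which is the kind of cross-dataset reference that distinguishes five star data from four star data.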
The Semantic Web is a rich area of Computer Science research and these technologies are gradually beginning to link up large datasets of information around the globe, providing unique opportunities for both ‘Big Data’ research, and more powerful commercial information systems.
Having decided in which format the data is to be made available, there will be many other issues that need resolving.
The data will probably need to be made available with a specific licence attached that specifies how people are able to make use of the data. These licences might require the user to obtain permission to use the data, they might allow the user to use the data for free, or they might restrict its use so that it can’t then be sold on to make a profit.
The mechanisms available for downloading the data will also need to be considered carefully. In some cases, where the data files are small, it may be possible just to download the files. If the dataset is large and users are likely to want to use only small portions of it, then search mechanisms may need to be in place to allow people to ask for just specific parts of the data.
If the data is in four or five star formats, then specific machine-understandable query mechanisms might be used, such as SPARQL, a language for computers to search large databases of RDF data.
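As a hedged illustration of what such a query might look like, the Python sketch below holds a small SPARQL query as a string and encodes it into a request URL. The endpoint address, the ex: vocabulary and the function name build_query_url are assumptions made up for this example. The only part taken from the SPARQL protocol itself is that a query can be sent as the `query` parameter of an HTTP GET request. The sketch builds the URL but does not contact any server.

```python
from urllib.parse import urlencode

# Hypothetical endpoint and vocabulary, invented for illustration only.
ENDPOINT = "http://council.example.org/sparql"

# Ask for up to ten payments with an amount over 1000.
# This assumes the amounts are stored as numeric values.
QUERY = """
PREFIX ex: <http://vocab.example.org/>
SELECT ?payment ?amount
WHERE {
  ?payment ex:amountGBP ?amount .
  FILTER (?amount > 1000)
}
LIMIT 10
"""


def build_query_url(endpoint, query):
    """Encode a SPARQL query as the 'query' parameter of a GET request URL."""
    return endpoint + "?" + urlencode({"query": query})


url = build_query_url(ENDPOINT, QUERY)
print(url)
```

Rather than downloading the whole dataset and filtering it locally, the user sends the question to the server and receives only the matching results.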
In many cases, centralised stores are used for the dissemination of open data. This reduces the need for government departments to run their own Web servers and maintain their own systems. An example of this is data.gov.uk, where thousands of UK government datasets from a large number of different government departments can be found.
Clearly, turning data resources into open data is not necessarily a simple task, but once the data is available, it can be read and reused by many different people and organisations. Often this reuse involves combining different data sources with different presentation mechanisms to provide new interfaces that help people understand the data. These combinations of visualisation tools and datasets are commonly referred to as ‘mashups’, and in the next step we will go on to look at one such mashup, which maps the UK crime statistics data.