Data
Overview
As a group allocated to research the domain of Data, we initially met to split it into three distinct areas.
The first area was Data acquisition. This involved researching how the EPA gather the data and understanding the different forms and formats of data that are available around Air Quality. The second area, Data analytics, focused primarily on understanding what is required to analyse the data. This led to understanding how the data should be validated, what trends can be identified, etc. The third area was Data Action. In this area we would investigate data visualisation, actioning the data and what processes or toolkits are available to work with data.
Data Acquisition
The first task in Data acquisition was to detail and document the different datasets that the EPA were gathering. We were keen to start breaking down what was available for us to work with. The approach taken here was to create a spreadsheet with each of the data resources listed and some metadata against each item.
With each resource we focused on the following areas:
Name: This is the name of the pollutant being measured.
Abstract/Description: This is the background of why the pollutant is being measured. Here we can see what European directives are involved and any meaningful health impacts.
URL: This is the live URL of the resource on the EPA site
Start Date: When the EPA started recording this data
Lineage: This section covers any background information or causes for the data to be recorded.
Frequency: This details the frequency of the data collected.
Unit: The unit of measurement for the data.
High risk individuals: Lists people in the public most at risk from specific pollutants.
Impacts: Details the health impacts of specific pollutants.
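The metadata fields above could also be captured as a simple record type. What follows is a minimal sketch in Python. The class name, the field names and the sample values are our own illustrative assumptions, not the layout of the actual spreadsheet.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DatasetRecord:
    """One row of the dataset inventory (field names are illustrative)."""
    name: str          # pollutant being measured
    description: str   # why it is measured, directives involved
    url: str           # live URL of the resource
    start_date: str    # when recording of this data started
    lineage: str       # background or causes for the data to be recorded
    frequency: str     # how often the data is collected
    unit: str          # unit of measurement
    high_risk_individuals: List[str] = field(default_factory=list)
    impacts: List[str] = field(default_factory=list)


# Placeholder values only; they are not taken from the EPA datasets.
example = DatasetRecord(
    name="Example pollutant",
    description="Placeholder description",
    url="https://example.org/dataset",
    start_date="unknown",
    lineage="Placeholder lineage",
    frequency="hourly",
    unit="µg/m3",
    high_risk_individuals=["placeholder group"],
)
```

Keeping the two list fields optional lets a record be created before the health research for that pollutant is complete.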
The next area of research focused on citizen data collection. We were keen to research what guidelines are available for potential users, what kits or Arduino projects are out there and what potential pitfalls surround this area. We were lucky to be able to source a detailed overview by the US EPA foundation on citizen data collection.
Additionally, we looked at a few Arduino projects that gave some indication of what sensors are required to collect the data.
Overall, a clear picture of the available data and the processes we need to be aware of for data acquisition was now documented and feeding our insights for the project.
Data Analytics
We began by dividing the analytics side of things into a number of areas:
- The Air Quality Index (AQI)
- How the AQI relates to health
- Other forms of distilling/analysing the data other than AQI (if any)
- Air quality patterns on daily, weekly, monthly and annual timeframes
- Air quality trends, how they are identified, measured and logged
- How air pollution episodes/real world events are captured, logged and linked with data records
- Why are these specific substances measured
- What is the process for validating data
The Air Quality Index (AQI) is the most common form that the data takes when being presented to the public and therefore one of the most meaningful to the public. We took a deep dive into the AQI in an effort to understand how the EPA currently analyse and process the data to then deliver it to the public.
Many countries have their own air quality rating system, the majority of which are very similar to the AQI used in Ireland and the United Kingdom. All of these rating systems have a set of air pollution bands which correlate to the health risks of exposure to the measured air. Although the number of bands can differ from country to country (usually 4-6), all bands tend to range from low risk (i.e. healthy) to very high risk. The range of index values also often differs from country to country. Index value ranges generally seem to fall into one of two categories: 1-10(+) (Ireland, UK, Hong Kong) or 0-300/500+ (United States, China, India).
Ireland’s AQI seems to be a direct copy of the UK’s AQI.
The UK’s AQI is recommended by the Committee on Medical Effects of Air Pollution (COMEAP). COMEAP provide independent advice to government departments and agencies on how air pollution impacts health. Its members come from a range of specialist fields such as air quality science, atmospheric chemistry, toxicology, physiology, epidemiology, statistics, pediatrics and cardiology, along with a lay member to ensure the work of the committee is understandable by the public.
The index is calculated from the concentrations of the above pollutants.
The breakpoints between index values differ per pollutant, e.g. index 1 is between 0 µg/m³ and 33 µg/m³ for ozone, whereas for nitrogen dioxide index 1 is between 0 µg/m³ and 67 µg/m³. While the pollutants are measured continuously, the results are averaged over a certain period of time. These periods also vary between pollutants, e.g. ozone results are averaged over 8 hours whereas nitrogen dioxide is averaged over 1 hour.
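To make the breakpoint and averaging logic concrete, here is a rough sketch in Python of how an index value could be looked up for one pollutant. Only the index 1 bounds (33 and 67) and the averaging periods come from the text above. The remaining breakpoints are assumed from the UK index table and should be checked against the published version, and the function names and the rounding step are our own assumptions.

```python
from bisect import bisect_left
from typing import Dict, List

# Upper bound (µg/m3) of index values 1-9; anything higher is index 10.
# Only the first entry per pollutant (33 and 67) comes from the text.
# The other values are ASSUMED from the UK index table and need checking.
UPPER_BOUNDS: Dict[str, List[int]] = {
    "ozone": [33, 66, 100, 120, 140, 160, 187, 213, 240],
    "no2": [67, 134, 200, 267, 334, 400, 467, 534, 600],
}

# Averaging window in hours per pollutant, as described in the text.
AVERAGING_HOURS: Dict[str, int] = {"ozone": 8, "no2": 1}


def rolling_mean(values: List[float], window: int) -> List[float]:
    """Mean of each run of `window` consecutive readings."""
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


def pollutant_index(pollutant: str, concentration: float) -> int:
    """Map an averaged concentration to an index value from 1 to 10."""
    bounds = UPPER_BOUNDS[pollutant]
    # Assumption: concentrations are rounded to whole numbers before lookup.
    return bisect_left(bounds, round(concentration)) + 1


def latest_index(pollutant: str, hourly: List[float]) -> int:
    """Index for the most recent averaging window of hourly readings."""
    window = AVERAGING_HOURS[pollutant]
    if len(hourly) < window:
        raise ValueError("not enough hourly readings for one averaging window")
    return pollutant_index(pollutant, rolling_mean(hourly, window)[-1])
```

For example, `latest_index("ozone", [40.0] * 8)` averages eight hourly readings of 40 µg/m³ and, with the assumed table, places the result in index 2.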
We explored trends and patterns that appear in the levels of the different pollutants over different time periods. The most obvious trend is the daily variation in NO2, which is connected to rush hour traffic levels. Seasonal changes to PM and SO2 levels are also possible. There is a strong correlation between PM levels and ambient temperature, because more solid fuel is burned at home in colder weather. This is particularly noticeable in rural towns which are not connected to the gas network and may or may not have the smoky coal ban enforced. We also noticed a weekly trend at the Rathmines monitoring station, where PM levels increased midweek. While unsure about the reasons for this, we speculate that it may be due to people consistently taking holidays on either side of the weekend, which would reduce the PM caused by traffic into the city centre on these days. Interestingly, the EPA produce the annual statistics as required by CAFE, but assessments of seasonal trends or patterns are not done routinely.
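Daily and weekly patterns of this kind can be surfaced by grouping readings by hour of day or day of week and averaging each group. The sketch below shows one possible way of doing this in Python. The function names and the sample readings are made up for illustration and are not EPA data.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean
from typing import Callable, Dict, Iterable, Tuple

Reading = Tuple[datetime, float]  # (timestamp, concentration)


def mean_by(readings: Iterable[Reading],
            key: Callable[[datetime], int]) -> Dict[int, float]:
    """Average concentration per group, where `key` picks the group."""
    buckets = defaultdict(list)
    for timestamp, value in readings:
        buckets[key(timestamp)].append(value)
    return {group: mean(values) for group, values in sorted(buckets.items())}


def mean_by_hour(readings: Iterable[Reading]) -> Dict[int, float]:
    """Daily pattern: mean per hour of the day (0-23)."""
    return mean_by(readings, lambda ts: ts.hour)


def mean_by_weekday(readings: Iterable[Reading]) -> Dict[int, float]:
    """Weekly pattern: mean per day of the week (0 = Monday)."""
    return mean_by(readings, lambda ts: ts.weekday())


# Made-up readings, used only to show the shape of the output.
sample = [
    (datetime(2014, 1, 6, 8), 40.0),   # Monday, morning rush hour
    (datetime(2014, 1, 7, 8), 60.0),   # Tuesday, morning rush hour
    (datetime(2014, 1, 6, 14), 20.0),  # Monday, mid-afternoon
]
hourly_pattern = mean_by_hour(sample)
weekly_pattern = mean_by_weekday(sample)
```

With a full year of hourly data, a rush hour effect would show up as peaks in the hourly means, and a midweek rise would show up in the weekday means.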
We also looked at other ways of reporting the pollutant measurements against the CAFE and WHO limits, besides those outlined by the CAFE/EU. As an example, we looked at the number of days in the year on which the pollutant levels at the Rathmines station exceeded the CAFE or WHO limits. The number of days on which the limits were exceeded in the year was as follows.
Number of days the CAFE and WHO limits were exceeded in Rathmines in 2014
These results are interesting as they tell a slightly different story to the public than the annual average.
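A count of exceedance days like this can be produced with a few lines of code. The following Python sketch assumes that an exceedance is judged on the daily mean of the readings, which may not match how each limit is formally defined. The function names are our own, and the threshold and readings in the usage lines are placeholders rather than actual limits or measurements.

```python
from collections import defaultdict
from datetime import date, datetime
from statistics import mean
from typing import Dict, Iterable, List, Tuple

Reading = Tuple[datetime, float]  # (timestamp, concentration)


def daily_means(readings: Iterable[Reading]) -> Dict[date, float]:
    """Mean concentration for each calendar day."""
    by_day = defaultdict(list)
    for timestamp, value in readings:
        by_day[timestamp.date()].append(value)
    return {day: mean(values) for day, values in by_day.items()}


def days_exceeding(readings: Iterable[Reading], limit: float) -> List[date]:
    """Days whose mean concentration is strictly above `limit`."""
    means = daily_means(readings)
    return sorted(day for day, average in means.items() if average > limit)


# Placeholder readings and threshold, for illustration only.
sample = [
    (datetime(2014, 1, 1, 9), 60.0),
    (datetime(2014, 1, 1, 18), 50.0),
    (datetime(2014, 1, 2, 9), 30.0),
    (datetime(2014, 1, 2, 18), 40.0),
]
exceeded = days_exceeding(sample, limit=50.0)
```

Running the same function once per limit gives the number of days for each, via `len(exceeded)`.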
Air pollution episodes seem to be captured and recorded on an ad hoc basis. There doesn’t appear to be a procedure or protocol for capturing real world events that might skew measurements. General events nationally are dealt with by the EPA, while local events are dealt with by the local authorities.
For data validation the EPA follow an annual data processing policy which is based on EU guidance. A flagging system is used to highlight the validity and extent of processing completed. More information on the validation process will be required if this area is to be pursued and explored further.
Data Action
The last area we explored in our research on data was data action, or how we should go about using the data. This began with basic research into the nature of data and the terminology of the field, as it was new to us. Looking into areas like exploratory data analytics, descriptive statistics and data mining shed light on what is a complex and multifaceted field.
We paid a visit to the library and found two very useful books to inform us about how people go about taking action to make the data they have acquired and analysed more meaningful.
These books were Visualizing Data by Ben Fry and Data Flow: Visualizing Information in Graphic Design 2 edited by Klanten, Ehmann, Bourquin and Tissot.
Fry outlines seven stages of visualizing data that we feel will be a very useful methodological framework for us going forward no matter what our design solution.
Acquire: Obtain the data, whether from a file on a disk or a source over a network.
Parse: Provide some structure for the data’s meaning and order it into categories.
Filter: Remove all but the data of interest.
Mine: Apply methods from statistics or data mining as a way to discern patterns or place the data in a mathematical context.
Represent: Choose a basic visual model, such as a bar graph, list or tree.
Refine: Improve the representation to make it clearer and more visually engaging.
Interact: Add methods for manipulating the data or controlling what features are visible.
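To get a feel for how the seven stages fit together, we can sketch them as a small pipeline. The Python below is our own toy illustration, not code from Fry's book. The sample data is made up, and a text bar chart stands in for a real visual model.

```python
import csv
import io
from statistics import mean
from typing import Dict, List

# Made-up sample data, used only to exercise the stages.
RAW = """station,pollutant,value
A,PM10,20
A,NO2,35
B,PM10,40
B,PM10,60
"""


def acquire() -> str:
    """Acquire: obtain the data (here, an in-memory string)."""
    return RAW


def parse(text: str) -> List[Dict[str, str]]:
    """Parse: give the data structure by reading it into rows."""
    return list(csv.DictReader(io.StringIO(text)))


def filter_rows(rows: List[Dict[str, str]], pollutant: str) -> List[Dict[str, str]]:
    """Filter: remove all but the pollutant of interest."""
    return [row for row in rows if row["pollutant"] == pollutant]


def mine(rows: List[Dict[str, str]]) -> Dict[str, float]:
    """Mine: reduce the rows to a mean value per station."""
    stations = sorted({row["station"] for row in rows})
    return {
        s: mean(float(row["value"]) for row in rows if row["station"] == s)
        for s in stations
    }


def represent(summary: Dict[str, float], scale: float = 10.0) -> List[str]:
    """Represent and refine: a text bar chart; `scale` tunes bar length."""
    return [f"{s} {'#' * round(v / scale)} {v:.1f}" for s, v in summary.items()]


def interact(pollutant: str) -> List[str]:
    """Interact: let the caller control which pollutant is shown."""
    return represent(mine(filter_rows(parse(acquire()), pollutant)))


chart = interact("PM10")
```

Each stage is a separate function, so any one of them can be swapped out, for example replacing the text chart with a plotted graph, without touching the others.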
He outlines in his book how each set of data has particular display needs and that the purpose to which you will put the data is almost as important as the data itself.
He also tells the designer to constantly question “Why was the data collected, what’s interesting about it, what stories can it tell?” pg
Use as little data as possible, no matter how precious it seems.
Find the smallest amount of data that can still convey something meaningful about the contents of the data set.
Who is your audience? What are their goals when approaching a visualisation? What do they stand to learn? In what way will they use it?
When reading Data Flow: Visualizing Information in Graphic Design 2 edited by Klanten, Ehmann, Bourquin and Tissot, the message that stood out was that a visualisation's function is to facilitate understanding and allow insight rather than to provide pretty pictures.
The distinction is made between information art and data visualization in this sense. Many examples are given that provided us with inspiration in terms of where we might take the data we have been given by the EPA.
Following our research into data and its myriad visual representations, we began laying out mood boards of interesting data visualisation examples to put on the studio wall for inspiration. We found examples from the Dublin Dashboard, as well as others from the Big Bang Data Expo in London, very interesting and will seek to emulate and improve on them when we get to our final design concepts.