Data-driven journalism
Data-driven journalism, often shortened to "ddj", is a term in use since 2009 for a journalistic process based on analyzing and filtering large data sets in order to create or elevate a news story. Many data-driven stories begin with newly available resources such as open-source software, open-access publishing and open data, while others are products of public records requests or leaked materials. This approach to journalism builds on older practices, most notably on computer-assisted reporting, a label used mainly in the US for decades. A closely related label is "precision journalism", based on a book by Philip Meyer, published in 1972, in which he advocated the use of techniques from the social sciences in researching stories.
Data-driven journalism takes a wider approach. At its core, the process builds on the growing availability of open data that is freely accessible online and can be analyzed with open-source tools. Data-driven journalism strives to reach new levels of service for the public, helping the general public or specific groups or individuals to understand patterns and make decisions based on the findings. As such, data-driven journalism might help to put journalists into a role relevant for society in a new way.
Since the introduction of the concept, a number of media companies have created "data teams" which develop visualizations for newsrooms. Notable examples are the teams at Reuters, ProPublica, and La Nación. In Europe, The Guardian and Berliner Morgenpost have very productive teams, as do public broadcasters.
As projects like the MP expenses scandal and the 2013 release of the "offshore leaks" demonstrate, data-driven journalism can assume an investigative role, dealing on occasion with "not-so-open", i.e. secret, data.
The annual Data Journalism Awards recognize outstanding reporting in the field of data journalism, and numerous Pulitzer Prizes in recent years have been awarded to data-driven storytelling, including the 2018 Pulitzer Prize in International Reporting and the 2017 Pulitzer Prize in Public Service.
Definitions
According to architect and multimedia journalist Mirko Lorenz, data-driven journalism is primarily a workflow that consists of the following elements: digging deep into data by scraping, cleansing and structuring it, filtering by mining for specific information, visualizing, and making a story. This process can be extended to provide results that cater to individual interests and the broader public.
Data journalism trainer and writer Paul Bradshaw describes the process of data-driven journalism in a similar manner: data must be found, which may require specialized skills in tools like MySQL or Python, then interrogated, for which an understanding of jargon and statistics is necessary, and finally visualized and mashed up with the aid of open-source tools.
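As a minimal illustration of the "interrogate" step described above, the following Python sketch loads a small, invented spending table into SQLite (standing in for a database like MySQL, so the example stays self-contained) and queries it for a pattern; the table and figures are hypothetical.

```python
import sqlite3

# Hypothetical data: a tiny table of departmental spending records.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE spending (department TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO spending VALUES (?, ?)",
    [("Health", 120.5), ("Health", 80.0), ("Transport", 45.2)],
)

# "Interrogate" the data: which departments spent the most in total?
for department, total in conn.execute(
    "SELECT department, SUM(amount) AS total FROM spending "
    "GROUP BY department ORDER BY total DESC"
):
    print(department, total)
```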
A more results-driven definition comes from data reporter and web strategist Henk van Ess: "Data-driven journalism enables reporters to tell untold stories, find new angles or complete stories via a workflow of finding, processing and presenting significant amounts of data with or without open tools." Van Ess claims that some of the data-driven workflow leads to products that "are not in orbit with the laws of good storytelling" because the result emphasizes showing the problem rather than explaining it. "A good data-driven production has different layers. It allows you to find personalized [details] that are only important for you, by drilling down to relevant [data], but also enables you to zoom out to get the big picture."
In 2013, Van Ess came up with a shorter definition that doesn't involve visualisation per se:
"Data journalism is journalism based on data that has to be processed first with tools before a relevant story is possible."
Reporting based on data
Telling stories based on the data is the primary goal. The findings from data can be transformed into any form of journalistic writing. Visualizations can be used to create a clear understanding of a complex situation. Furthermore, elements of storytelling can be used to illustrate what the findings actually mean, from the perspective of someone who is affected by a development. This connection between data and story can be viewed as a "new arc" trying to span the gap between developments that are relevant but poorly understood and a story that is verifiable, trustworthy, relevant and easy to remember.
Data quality
In many investigations the data that can be found may have omissions or be misleading. As one layer of data-driven journalism, a critical examination of data quality is important. In other cases the data might not be public or might not be in the right format for further analysis, e.g. only available as a PDF. Here the process of data-driven journalism can turn into stories about data quality or about institutions' refusals to provide the data. As the practice as a whole is still in its early stages of development, examinations of data sources, data sets, data quality and data formats are therefore an equally important part of this work.
Data-driven journalism and the value of trust
Based on the perspective of looking deeper into facts and drivers of events, there is a suggested change in media strategies: in this view, the idea is to move "from attention to trust". The creation of attention, which has been a pillar of media business models, has lost its relevance because reports of new events are often distributed faster via new platforms such as Twitter than through traditional media channels. Trust, on the other hand, can be understood as a scarce resource. While distributing information is much easier and faster via the web, the abundance of offerings creates costs for verifying and checking the content of any story, and that creates an opportunity. The view that media companies can be transformed into trusted data hubs has been described in an article cross-published in February 2011 on Owni.eu and Nieman Lab.
Process of data-driven journalism
The process of transforming raw data into stories is akin to refinement and transformation. The main goal is to extract information that recipients can act upon. The task of a data journalist is to extract what is hidden. This approach can be applied to almost any context, such as finances, health, the environment or other areas of public interest.
Inverted pyramid of data journalism
In 2011, Paul Bradshaw introduced a model he called "The Inverted Pyramid of Data Journalism".
Steps of the process
In order to achieve this, the process should be split up into several steps. While the steps leading to results can differ, a basic distinction can be made by looking at six phases:
- Find: Searching for data on the web
- Clean: Process to filter and transform data, preparation for visualization
- Visualize: Displaying the pattern, either as a static or animated visual
- Publish: Integrating the visuals, attaching data to stories
- Distribute: Enabling access on a variety of devices, such as the web, tablets and mobile
- Measure: Tracking usage of data stories over time and across the spectrum of uses.
Description of the steps
Finding data
Data can be obtained directly from governmental databases such as data.gov, data.gov.uk and the World Bank Data API, but also by placing Freedom of Information requests to government agencies; some requests are made and aggregated on websites like the UK's What Do They Know. While there is a worldwide trend towards opening data, there are national differences as to what extent that information is freely available in usable formats. If the data is in a webpage, scrapers are used to extract it into a spreadsheet (a minimal sketch follows below). Examples of scrapers are Import.io, ScraperWiki, OutWit Hub and Needlebase. In other cases OCR software can be used to get data from PDFs.
Data can also be created by the public through crowdsourcing, as shown in March 2012 at the Datajournalism Conference in Hamburg by Henk van Ess.
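Purely as an illustration of the scraping step, here is a minimal Python sketch assuming the third-party requests and BeautifulSoup libraries; the URL and the page's table structure are hypothetical placeholders, and a real scraper would need to respect the source site's terms of use.

```python
import csv

import requests
from bs4 import BeautifulSoup

# Hypothetical page containing an HTML table of budget figures.
url = "https://example.org/budget-table"
soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")

# Write each table row out as a row in a spreadsheet-friendly CSV file.
with open("budget.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for row in soup.select("table tr"):
        writer.writerow(cell.get_text(strip=True)
                        for cell in row.find_all(["th", "td"]))
```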
Cleaning data
Usually data is not in a format that is easy to visualize. Examples are that there are too many data points, or that the rows and columns need to be sorted differently. Another issue is that, once investigated, many datasets need to be cleaned, structured and transformed. Various tools like Google Refine, Data Wrangler and Google Spreadsheets allow uploading, extracting or formatting data.
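As a rough sketch of the kind of work tools like Google Refine automate, the following example uses the pandas library (an assumption, not a tool named in the text) to normalize column names, strip stray whitespace and coerce types; the file and column names are hypothetical and carried over from the example above.

```python
import pandas as pd

# Load a hypothetical raw export.
df = pd.read_csv("raw_spending.csv")

# Normalize column names: "Department Name " -> "department_name".
df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]

# Strip whitespace and unify capitalization in a text column.
df["department"] = df["department"].str.strip().str.title()

# Coerce a numeric column; unparseable values become NaN and are dropped.
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
df = df.dropna(subset=["amount"])

df.to_csv("clean_spending.csv", index=False)
```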
Visualizing data
To visualize data in the form of graphs and charts, applications such as Many Eyes or Tableau Public are available. Yahoo! Pipes and Open Heat Map are examples of tools that enable the creation of maps based on data spreadsheets. The number of options and platforms is expanding. Some new offerings provide options to search, display and embed data, an example being Timetric.
To create meaningful and relevant visualizations, journalists use a growing number of tools; a scripted approach is sketched after the list below. There are by now several descriptions of what to look for and how to do it. The most notable published articles are:
- Joel Gunter: "#ijf11: Lessons in data journalism from the New York Times"
- Steve Myers: "Using Data Visualization as a Reporting Tool Can Reveal Story’s Shape", including a link to a tutorial by Sarah Cohen
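For journalists who script charts themselves rather than rely on hosted applications, a bare-bones sketch with pandas and matplotlib (both assumptions, not tools named in this section) might look like the following; it reuses the hypothetical cleaned file from the cleaning sketch above.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Aggregate the cleaned, hypothetical data by department.
df = pd.read_csv("clean_spending.csv")
totals = df.groupby("department")["amount"].sum().sort_values()

# A simple horizontal bar chart, saved as a static image for publication.
totals.plot(kind="barh", title="Spending by department")
plt.xlabel("Amount")
plt.tight_layout()
plt.savefig("spending_by_department.png")
```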
Publishing the data story
There are different options for publishing data and visualizations. A basic approach is to attach the data to single stories, similar to embedding web videos. More advanced concepts allow the creation of single dossiers, e.g. displaying a number of visualizations, articles and links to the data on one page. Often such specials have to be coded individually, as many content management systems are designed to display single posts based on the date of publication.
Distributing data
Providing access to existing data is another phase which is gaining importance. Think of such sites as "marketplaces" where datasets can be found easily by others. Especially when the insights for an article were gained from open data, journalists should provide a link to the data they used so that others can investigate.
Providing access to data and enabling groups to discuss what information could be extracted is the main idea behind Buzzdata, a site using the concepts of social media such as sharing and following to create a community for data investigations.
Other platforms include:
- Help Me Investigate
- Timetric
- ScraperWiki
Measuring the impact of data stories
Measuring how often a data story is viewed and shared is the final phase of the process. In this context, the extent of such tracking, such as collecting user data or any other information that could be used for marketing reasons or other uses beyond the control of the user, should be viewed as problematic. One newer, non-intrusive option to measure usage is a lightweight tracker called PixelPing. The tracker is the result of a project by ProPublica and DocumentCloud, and there is a corresponding service to collect the data. The software is open source and can be downloaded via GitHub.
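PixelPing's actual implementation differs, but the general technique of a non-intrusive tracking pixel can be sketched as follows in Python with Flask (a hypothetical stand-in, not the project's own code): the endpoint serves a 1x1 GIF and logs only a story identifier, with no cookies and no user data.

```python
from flask import Flask, Response, request

app = Flask(__name__)

# A 43-byte transparent 1x1 GIF, the classic "tracking pixel" payload.
PIXEL = (b"GIF89a\x01\x00\x01\x00\x80\x00\x00\x00\x00\x00\xff\xff\xff!"
         b"\xf9\x04\x01\x00\x00\x00\x00,\x00\x00\x00\x00\x01\x00\x01\x00"
         b"\x00\x02\x02D\x01\x00;")

@app.route("/pixel.gif")
def pixel():
    # Log only the story identifier passed by the embedding page --
    # nothing that identifies the reader.
    app.logger.info("story viewed: %s", request.args.get("story", "unknown"))
    return Response(PIXEL, mimetype="image/gif")

# A story page would embed something like:
# <img src="https://tracker.example.org/pixel.gif?story=budget-2011">
if __name__ == "__main__":
    app.run()
```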
Examples
There is a growing list of examples of how data-driven journalism can be applied:
- The Guardian, one of the pioneering media companies in this space, has compiled an extensive list of data stories; see "All of our data journalism in one spreadsheet".