How reproducible and replicable open contracting data analysis can help investigations
Red Palta (Latin American Network of Journalists for Transparency and Anti-Corruption) was born with the idea of using open data for collaborative investigations into corruption across Latin America. For the first story, La Leche Prometida, the data journalism teams that form Red Palta (Datasketch, El Faro, La Diaria, La Nación, Ojoconmipisto.com, Ojo Público and PODER) looked at milk, an essential product on every household’s shopping list. Dairy products are also part of many social assistance programs aimed at the most vulnerable populations in countries such as Colombia, El Salvador, Guatemala, and Peru.
When conducting investigations with data, we often want to explore something more than once: we might want to update the data quickly as the story develops, show whether new policies have an influence in practice, or simply be transparent about our method so others can understand the scope of the data and even perform the same analysis with other datasets. But working with public data can quickly become challenging when trying to reproduce or replicate results either in your own country or elsewhere.
The biggest challenges when working with government datasets include data in closed or unsuitable formats (for example, amounts stored as text instead of numbers), poor data quality, no guarantee that the data will remain available, data spread across multiple entities, and data standards that are not implemented properly.
Public procurement systems are constantly adapting and evolving, and these changes can affect data use. Earlier this year, Colombia Compra Eficiente made adjustments to its system, during which the open data on its contracts became unavailable for some days without prior notice to users.
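To illustrate the "text instead of a number" problem, here is a minimal sketch in Python. It is not code from the Red Palta repositories: the function name `parse_amount` is hypothetical, and it assumes amounts are published with a "." decimal point and "," thousands separators, which will not hold for every country.

```python
from decimal import Decimal, InvalidOperation
from typing import Optional


def parse_amount(raw: object) -> Optional[Decimal]:
    """Convert a contract amount published as text into a Decimal.

    Assumption (illustrative only): '.' is the decimal point and ','
    is a thousands separator. Other locales need their own rule.
    Returns None when the value cannot be read as a number.
    """
    if raw is None:
        return None
    text = str(raw).strip()
    # Keep only digits, the decimal point and the minus sign; this drops
    # currency symbols, spaces and thousands separators.
    cleaned = "".join(ch for ch in text if ch.isdigit() or ch in ".-")
    if not cleaned:
        return None
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        # Leftovers such as "1.2.3" or "-" are not valid numbers.
        return None
```

Returning `None` instead of guessing keeps unreadable values visible, so they can be counted and reviewed rather than silently turned into zeros.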
As open data systems adapt and evolve we need to keep the data clean so that our queries can run smoothly. To ensure that this burden does not increase, we at Red Palta started implementing practices of reproducible research in our data analysis workflows.
The importance of reproducible and replicable analysis of open contracting data
When we talk about being able to redo our data analysis, we must ask ourselves whether the structure of the data we are analyzing is still the same. If not, we will have to start from scratch or reshape the data so that it can be plugged into our predefined structures. This is one of the benefits of using data standards such as the Open Contracting Data Standard (OCDS): it allows us to assume that our data has the same structure. With this, our analysis becomes reproducible: when we feed the analysis or code the same data we used in the past, we can expect to get the same result as before. And the procedure is replicable if we can run the same analysis with different data (either containing new samples or entirely new data) that follows the same structure.
These concepts originally emerged in the context of science, specifically open science, but they can easily be applied to the analysis of open contracting data.
The Turing Way provides a clear presentation of the difference.
That is:
- Reproducible: Same data – Same Analysis/Code.
- Replicable: Different data – Same Analysis/Code.
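The distinction can be sketched in a few lines of Python. This is an illustrative example, not Red Palta's code: the function name and the sample records are made up, and the only thing taken from OCDS is the general shape of a release, with `awards` that carry a `value.amount` and a list of `suppliers`.

```python
from collections import defaultdict


def total_awarded_by_supplier(releases):
    """Sum award amounts per supplier name across OCDS-style releases.

    Simplification: when an award lists several suppliers, the full
    amount is counted for each of them.
    """
    totals = defaultdict(float)
    for release in releases:
        for award in release.get("awards", []):
            amount = (award.get("value") or {}).get("amount") or 0
            for supplier in award.get("suppliers", []):
                totals[supplier.get("name", "unknown")] += amount
    return dict(totals)


# Made-up sample data that follows the same structure.
data_a = [
    {"ocid": "example-001", "awards": [
        {"value": {"amount": 1000.0}, "suppliers": [{"name": "Supplier A"}]},
    ]},
    {"ocid": "example-002", "awards": [
        {"value": {"amount": 500.0}, "suppliers": [{"name": "Supplier A"}]},
        {"value": {"amount": 250.0}, "suppliers": [{"name": "Supplier B"}]},
    ]},
]
data_b = [
    {"ocid": "example-101", "awards": [
        {"value": {"amount": 300.0}, "suppliers": [{"name": "Supplier C"}]},
    ]},
]

# Reproducible: same data, same code, same result every time.
first_run = total_awarded_by_supplier(data_a)
second_run = total_awarded_by_supplier(data_a)

# Replicable: the same code runs unchanged on different data
# because both datasets share one structure.
other_country = total_awarded_by_supplier(data_b)
```

The analysis function never needs to know which country the data came from; the shared structure is what makes the second call possible without rewriting anything.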
One of the main objectives of Red Palta is to create reproducible and replicable frameworks to ease the work of journalists reporting on corruption issues using open contracting data. For the members of our network, this constitutes a great challenge to solve, as we face different scenarios that we have to account for technically to be able to streamline data analysis and visualization for our investigations.
The challenge stems from the different contexts and levels of access to information in the participating countries. Some countries, like Mexico, have very good access to information laws, while others, like El Salvador, have little public openness. In countries like Colombia you can analyze large data dumps of public contract data, while in others, like Peru, you must devise ways to work around the lack of open information systems to get access to data in bulk. Even in countries where OCDS data is available, like Colombia and Mexico, it is sometimes better in practice to resort to a tabular format, which is more practical to work with. In countries where OCDS data is only partly available, like Uruguay and Argentina, it could not be used in practice, as the data was available only at the federal level (Uruguay) or the local level (Buenos Aires). Differences in the granularity of the data also had to be taken into account, as some countries have detailed tender data and even item information at the contract level, while others do not.
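As a rough illustration of what "resorting to a tabular format" can look like, the sketch below flattens nested OCDS-style releases into one CSV row per award and supplier. It is written in Python for illustration only; the column names in `FIELDS` and the function names are our own choices, not part of OCDS or of the Red Palta code.

```python
import csv
import io

# Hypothetical column layout chosen for this example.
FIELDS = ["ocid", "award_id", "supplier", "amount", "currency"]


def flatten_awards(releases):
    """Yield one flat row per (award, supplier) pair."""
    for release in releases:
        for award in release.get("awards", []):
            value = award.get("value") or {}
            # Keep awards without suppliers as a row with an empty name.
            for supplier in award.get("suppliers") or [{}]:
                yield {
                    "ocid": release.get("ocid", ""),
                    "award_id": award.get("id", ""),
                    "supplier": supplier.get("name", ""),
                    "amount": value.get("amount", ""),
                    "currency": value.get("currency", ""),
                }


def to_csv(releases):
    """Return the flattened awards as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(flatten_awards(releases))
    return buffer.getvalue()
```

A flat table like this loses some of the nesting in the original records, which is the trade-off for being able to open the data in a spreadsheet or load it into a data frame.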
Despite data discrepancies across countries, open contracting data is very useful at the exploration stage of journalistic investigations for generating investigative leads, especially those related to red flags in open contracting.
For this first series, La Leche Prometida, each media outlet presented an individual investigation.
In El Salvador, for example, the investigation showed that milk destined for public schools was not reaching all institutions. And according to the report, the schools that did receive the milk got it in powder form, without considering that many of these institutions do not have access to drinking water.
In Peru and Guatemala, the investigation showed conflicts of interest and million-dollar contracts in the supply of this product. In Uruguay, the issue had a more economic angle: for example, the report said that the State has failed to grant subsidies to family farming enterprises, although the law allows it.
During the reporting period, Red Palta members fed data into a database that would eventually allow them to discover patterns in the region. One is the role of transnational companies, which are often the main beneficiaries, as in Colombia and Mexico. Another is the role milk plays in delivering social programs, which increases the risk of links to political campaign financing, as the example of Grupo Nutresa’s contracts in Colombia shows, or to nepotism when the political agenda benefits a family company, as in Guatemala.
Connecting over the same issue across countries helped paint a more complete picture of the often complex regional realities and identify common angles.
Scaling Open Contracting Data Investigations
Finding common ground for the investigations and data exchange solves only part of the equation. As we scale the fight for transparency and accountability, we need to push for ways to streamline the process of transforming data into insights quickly enough for journalists to react. This is why implementing reproducible and replicable frameworks is crucial.
Here are some tips:
- Use version control systems: set up repositories to track historic changes to the code being used for analysis.
- Use literate programming when possible: Use executable documentation on the scripts and reports, that is, scripts that not only generate charts and graphs but that are inherently accompanied by documentation.
- Automate testing: When doing reproducible analysis we not only need to ensure the data is the same, we also need to make sure the code is the same. Using versioned packages is desirable to replicate the exact computer environments and ensure the exact same code is running under the same conditions.
- Publication: Having a workflow that allows you to easily re-run the analysis and output intermediate reports is very valuable during the data exploration phase with the team of journalists or context experts. Once the analysis is approved, we only need to re-run it as often as the data changes to monitor possible research leads.
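One way to make "same data, same code, same conditions" checkable is to save a small manifest next to each report. The following Python sketch is an assumption about how this could be done, not a description of our current tooling; `build_manifest` and its field names are hypothetical.

```python
import hashlib
import platform
from importlib import metadata


def build_manifest(data: bytes, packages):
    """Describe one analysis run: a data fingerprint plus the environment.

    `packages` is a list of distribution names whose installed versions
    should be recorded; names that are not installed are stored as None.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return {
        # The same bytes always produce the same SHA-256 digest.
        "data_sha256": hashlib.sha256(data).hexdigest(),
        "python": platform.python_version(),
        "packages": versions,
    }
```

Committing the manifest to the same repository as the scripts ties each published result to a specific version of the code, the data and the packages. If two runs disagree, comparing their manifests shows whether the data or the environment changed.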
Here’s how we are applying this to Red Palta:
We use static site generators for all our reports. Red Palta’s site was built with Hugo. In the same repositories we keep the scripts that generate the reproducible analyses, and we are working on making them more coherent as we develop further investigations. A lot of the code for the analysis and visualizations was written in R, with reproducible reports in R Markdown (Rmd). Different software packages are being developed, tested and versioned, especially around data visualization and simplifying the analysis of open contracting data.
Some other things we are tinkering with:
- Interconexión: A protocol for exchanging information in a decentralized way with different civil society and journalism organizations in Latin America.
- A joint repository of multiple public databases from different Latin American countries, curated and maintained by the members of the network.
- We are still working on finding better ways to do version control, not only for the code but also for the data itself. Public data may appear and disappear for many reasons.
- Ensuring data discoverability with the introduction of https://datatxt.org/ specifications in our workflows as a way to automate data exchanges.
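On versioning the data itself, one simple idea we can sketch (an assumption, not something already in place) is to keep every downloaded file under a name derived from its content, so a dataset that later disappears or changes upstream is still available in the exact form that was analyzed. The Python function `save_snapshot` and the folder layout below are hypothetical.

```python
import hashlib
from pathlib import Path


def save_snapshot(content: bytes, store: Path, source_name: str) -> Path:
    """Store a copy of a downloaded dataset, named after its SHA-256 digest.

    Identical content maps to the same file, so re-downloading unchanged
    data adds nothing; changed data gets a new file alongside the old one.
    """
    digest = hashlib.sha256(content).hexdigest()
    target = store / source_name / f"{digest}.bin"
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(content)
    return target
```

The number of files under a source's folder then equals the number of distinct versions seen so far, which is a quick signal that a publisher changed its data.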