Sprinting at EuroPython 2014
A phenomenal two days of sprinting just wrapped up two back-to-back conferences in Berlin.
On July 15 – 17, the Open Contracting team participated in the Open Knowledge Festival. We presented our latest work, got feedback from the community, and gained experience, knowledge and insight from an incredible range of people working on Open Data around the world.
As a happy coincidence, OKFest was followed by the annual European Python conference, EuroPython. In April, Open Contracting participated in the North American Python conference, PyCon. We took part in the sprints there and were delighted to have 14 volunteers from the conference stay on for 4 days to work on Open Contracting (more here).
So, with EuroPython right next to the Open Knowledge Festival, I stayed on in Berlin to run a sprint with the EuroPython community. The EuroPython sprints were held over the weekend of July 26 & 27. We had an incredible weekend with 9 volunteers, building a pipeline of tools to help Open Contracting.
Our Open Contracting crew was:
- Mihai Dinca [dincamihai]
- Sarah Bird [birdsarah] (that’s me)
- Florian Pilz [florianpilz]
- Jure Cuhalev [gandalfar]
- Tomaž Šolc [avian2]
- Alex Morega [mgax]
- Joren Retel [jorenretel]
- Beatrice Macucoa [beaMacuco]
- and (missing from the photo):
- Josip Delic [delijati]
- Danny Crasto [danwald]
Some quick background. Last month, the Open Contracting Data Standard team released our Draft Data Model for consultation. We received feedback online, through conversations, and at our workshop at the Open Knowledge Festival. Since then, I have started drafting a first cut of a serialization, specifying fields that might be in the standard. I chose to use JSON Schema, not because that is a final decision of the Open Contracting Data Standard team, but because it was a good tool to get started with. So going into the sprints we had a partial JSON Schema to work with.
The sprint work split into little chunks that all fed into one another, plus some little extras.
- First, Alex worked on the existing library json-schema-random, which allows you to generate sample JSON data from a JSON Schema. Out of the box, this excellent library took a JSON Schema and made sample JSON with random data. Alex tweaked it so that it could also produce an empty JSON sample, with no random data in it, generated directly from the schema. This is useful as it allows us to work while the schema is evolving without having to maintain sample data. Alex was also able to make a pull request to offer these enhancements back to the original project.
- This blank sample data feeds into the work that Mihai and Florian did. As part of testing and demonstrating the new standard, we want to test it against existing datasets so that we can see what real data looks like. Mihai and Florian built a new library called “ocds_mapper”. With the blank JSON file from Alex, we can look at the fields of an existing dataset and specify which of them correspond to our new Open Contracting fields. Florian and Mihai’s library takes this mapping and a CSV from an existing publisher and turns the CSV data into a series of Open Contracting “Releases” (see the Draft Data Model) in our new format.
- This output then feeds into Jure’s work. Jure built a validation website. You can upload a file, provide a URL or paste in JSON, and the validator will tell you whether the data is valid and provide some clear messages when there are errors, e.g. a missing required field or an invalid value.
- Once we know that we have a valid set of releases, we want to be able to compile them into records. (Again, see the Draft Data Model to learn more about Records and Releases.) First, Alex wrote a small script that finds all the releases with the same unique Contracting ID and groups them into an Open Contracting Record. (It’s worth noting that normally you wouldn’t have to do this, as publishers would also provide an Open Contracting Record document, but as the standard doesn’t exist yet, we had to do this step ourselves for the existing data we have.)
- With our bare Open Contracting record in hand (just a list of releases), we now move to the compile stage. This is where we take all the data that has been released over time and pull it into one single record that reflects the current state of the data (as well as a history of changes, where appropriate). To do this, Tomaž wrote a new library called json-merge that effectively enhances JSON Schema and allows you to specify a “merge strategy” for each field, so that as new data comes in you know whether to overwrite the old value or store all the changes. The output of this is a single chunk of data that reflects the current state of a contracting process in one place, bringing together data from across the contracting process. This was no small feat, and it is still a work in progress. In trying to implement it, we got some really good feedback on what the technical & data challenges might be with our releases -> record model.
- Finally, with the output as a series of unique contracting processes, we are able to start visualizing our data. The Open Contracting Data Standard team has been working to build out key use cases for contracting data. Within this, the corruption and fraud use case has some metrics that you can extract from the data. Mihaly Fazekas, who works on this, kindly suggested a couple of easy-to-generate metrics so that we could demonstrate visualizations from the Open Contracting data. And so Joren built a couple of visualizations that take standardized open contracting data, as produced by all the steps above, and turn it into visuals that we could reuse across any dataset that uses the Open Contracting Data Standard.
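To make the middle of that pipeline a little more concrete, here is a minimal sketch of the map, group and compile steps in plain Python. It is not the code from ocds_mapper, Alex’s script or json-merge. The column names, the field names (contracting_id, date, status, document) and the two merge strategies are all assumptions made up for illustration, and the real libraries may work quite differently.

```python
import csv
import io
from collections import defaultdict

# Hypothetical mapping: release field -> column name in a publisher's CSV.
MAPPING = {"contracting_id": "Ref", "date": "Published",
           "status": "Stage", "document": "Doc"}

# Hypothetical merge strategies; fields not listed default to "overwrite".
STRATEGIES = {"document": "append"}


def csv_to_releases(csv_text, mapping=MAPPING):
    """Turn each CSV row into a release dict using the field mapping."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [{field: row[column] for field, column in mapping.items()}
            for row in reader]


def group_releases(releases):
    """Group releases that share a contracting ID into bare records."""
    records = defaultdict(list)
    for release in releases:
        records[release["contracting_id"]].append(release)
    return dict(records)


def compile_record(releases, strategies=STRATEGIES):
    """Fold a record's releases, oldest first, into one current-state dict."""
    compiled = {}
    for release in sorted(releases, key=lambda r: r["date"]):
        for field, value in release.items():
            if strategies.get(field, "overwrite") == "append":
                # Keep every value ever released for this field.
                compiled.setdefault(field, []).append(value)
            else:
                # Later releases replace earlier values.
                compiled[field] = value
    return compiled


# Made-up sample data, not taken from any real publisher.
SAMPLE_CSV = """Ref,Published,Stage,Doc
A-1,2014-01-10,tender,notice.pdf
B-7,2014-02-01,tender,call.pdf
A-1,2014-03-05,award,award.pdf
"""

records = group_releases(csv_to_releases(SAMPLE_CSV))
compiled = {cid: compile_record(rels) for cid, rels in records.items()}
print(compiled["A-1"])
```

With these sample rows, the compiled record for A-1 ends up with the status from its latest release (award) while keeping both of its documents, which is the overwrite versus store-everything choice described above.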
It was amazing to see the whole pipeline come together over a weekend. But wait! That wasn’t all!
The sprinting team went out for a taste of Berlin’s famous beer gardens on Saturday night and Joren’s wife joined us. By the end of the evening, we’d persuaded her to come sprint with us the next day, so Beatrice joined us and got her first taste of Django, adding some much-needed tests to our standard-collaborator tool.
And finally, following on from our work in Montreal, there is an interesting challenge in being able to compare, across datasets, which categories contracts were issued in. For example, how much was spent on Construction in Canada compared to the UK? The challenge is that different countries use different systems to classify goods and services, so a similar item may be classified as C123 in one place and 07435 somewhere else, and also labeled with a similar, but not identical, title. Danny and Josip investigated how we could build a hierarchy of meaning from existing classification systems as a first step to understanding if we could automate some basic matching between classification systems.
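As a rough illustration of what “basic matching” between two classification systems could look like, here is a small sketch that compares labels by the words they share. This is not what Danny and Josip built. It is a deliberately naive word-overlap (Jaccard) comparison, the labels are invented, and only the codes C123 and 07435 come from the example above (07001 is made up too).

```python
def tokens(label):
    """Lower-case a label and split it into a set of words."""
    return set(label.lower().replace(",", " ").split())


def similarity(a, b):
    """Jaccard similarity: shared words divided by all distinct words."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def best_match(label, candidates):
    """Return the (code, label) pair in `candidates` most similar to `label`."""
    return max(candidates.items(),
               key=lambda item: similarity(label, item[1]))


# Invented labels for illustration only.
SYSTEM_A = {"C123": "Road construction works"}
SYSTEM_B = {"07435": "Construction work for roads",
            "07001": "Office furniture"}

code, label = best_match(SYSTEM_A["C123"], SYSTEM_B)
print(code, label)
```

Even in this toy case the overlap is weak: “road” and “roads” do not count as the same word, so only “construction” matches. That hints at why a hierarchy of meaning is a more promising starting point than comparing titles alone.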
And, on top of all of that, I think we all had fun. I know I did. My thanks again to the wonderful Open Contracting EuroPython team and to EuroPython for hosting sprints (we were all well-fed and watered with a never-ending supply of Club Mate, the German programmer’s drink of choice).