Apr 4, 2016 10:04 AM

How the 11.5 million Panama Papers were analysed


The biggest leak in history has connected more than 70 current and former world leaders to tax evasion schemes that channel billions of pounds into secretive off-shore accounts. This is how the data was analysed.

The Panama Papers show that law firm Mossack Fonseca helped hundreds of clients, with connections to some of the most powerful people in the world, launder money, dodge tax and potentially avoid sanctions.

The papers themselves were leaked to news organisations by an unknown person and have been shared with more than 100 news organisations and 400 journalists – the investigation has been ongoing for almost a year.

Making the raw data accessible to journalists involved converting it into digital formats, processing it on high-performance computers, and using algorithms to find well-known names among the thousands of details.

How big is the Panama Papers leak?

The leaked documents themselves have not been published, although the International Consortium of Investigative Journalists (ICIJ) says the full list of companies linked to the papers will be revealed in May. How much data they contain, however, is known.

The leak reportedly has more than 11.5 million internal files from Mossack Fonseca. These include, but aren't limited to, emails, contracts, transcriptions and scanned documents. In total, the leak contains: 4.8 million emails, three million database entries, two million PDFs, one million images and 320,000 text documents. The dataset is bigger than any from Wikileaks, or the Edward Snowden disclosures.

In total the data comes to 2.6 terabytes of information. Included in the files, which were first obtained by Süddeutsche Zeitung, is data from 1977 through to 2015. "The data shows that Mossack Fonseca worked with more than 14,000 banks, law firms, company incorporators and other middlemen to set up companies, foundations and trusts for customers," the ICIJ says.

Mossack Fonseca, the company at the centre of the revelations (Getty Images)

How do you analyse 11.5 million files?

To report on the leaked documents, those with access to the data first needed to make it machine-readable and searchable. "Heterogeneous data is hard to ingest and cross-reference," Gabriel Brostow, an associate professor in computer science at University College London, told WIRED. "Tables, figures, PDFs are almost impenetrable."

Süddeutsche Zeitung and the ICIJ worked with software company Nuix to initially sort and organise the files. The data was kept on private servers that were not connected to the outside world, Carl Barron, a senior consultant from Nuix, explained to WIRED. Once separated, it would be indexed. "We would bring out the text of this information, we would bring out all of the metadata, and then we could start using Nuix to investigate it from the big data and analytical perspective," Barron said.

The biggest challenge for processing the data was the amount of text that couldn't initially be recognised by machines. Optical character recognition (OCR) was used to transform the data into text that could be understood and searched by computers. Once the text was extracted it could then be inserted into the index and database. The final database size was predicted by Barron to be 30 per cent of the original data size. "We allowed ICIJ and Süddeutsche Zeitung to run their keyword searches, we could also bring out entities: first names, second names and figures," Barron said. "We could also use our analytics to find how these names refer to the documents. If you find a person's name in an email, you may want to find out where else that person has been mentioned across all of the other data."
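To give a rough sense of what "inserting text into an index" means, here is a minimal sketch of an inverted index in Python. It is not Nuix's software or the method the ICIJ used. The sample documents, the names in them and the function names are all invented for illustration. The sketch assumes the text has already been extracted, for example by OCR.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each token to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the IDs of documents that contain every word in the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    # Start from the documents containing the first word, then narrow down.
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Hypothetical extracted text, standing in for OCR and email output.
documents = {
    "email-001": "Please forward the contract to Jane Example today.",
    "scan-042": "Shareholder register: Jane Example, director.",
    "memo-007": "Invoice for incorporation services.",
}

index = build_index(documents)
hits = search(index, "Jane Example")
print(sorted(hits))
```

Because the index is built once, each later search only looks up a handful of words rather than rereading every file. This is also how a name found in one email can be traced to every other document that mentions it. Note that this sketch matches documents containing all the query words in any order. It does not do exact phrase search.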

Once the information had been indexed, algorithms were used to look for specific links in the vast database. Finally, this automatically generated information was combined with manually created data. "The journalists compiled lists of important politicians, international criminals, and well-known professional athletes, among others," Süddeutsche Zeitung explained in an editorial. From here it was possible to create a search tool for the names on the list.

The news organisation says: "The 'party donations scandal' list contained 130 names, and the UN sanctions list more than 600. In just a few minutes, the powerful search algorithm compared the lists with the 11.5 million documents."
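Comparing a list of names with a set of documents can be sketched in a few lines of Python. This is only an illustration of the general idea, not the search tool Süddeutsche Zeitung describes. The names, documents and helper functions below are hypothetical. It also assumes that stripping accents and ignoring case and punctuation is an acceptable way to treat two spellings as the same name.

```python
import re
import unicodedata


def normalise(text):
    """Strip accents, lowercase, and reduce the text to space-separated words."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return " ".join(re.findall(r"[a-z0-9]+", without_accents.lower()))


def match_watchlist(names, documents):
    """Return {name: [document IDs]} for each name found as a whole phrase."""
    # Pad with spaces so "ann lee" does not match inside "joann leeds".
    padded_docs = {
        doc_id: f" {normalise(text)} " for doc_id, text in documents.items()
    }
    matches = {}
    for name in names:
        needle = normalise(name)
        if not needle:
            continue  # skip empty entries on the list
        found = sorted(
            doc_id
            for doc_id, text in padded_docs.items()
            if f" {needle} " in text
        )
        if found:
            matches[name] = found
    return matches


# Hypothetical watchlist and documents, invented for this example.
watchlist = ["José Ejemplo", "Alex Sample"]
documents = {
    "doc-1": "Beneficial owner: JOSE EJEMPLO (passport copy attached)",
    "doc-2": "Meeting notes, no names.",
    "doc-3": "Transfer approved by Jose  Ejemplo.",
}

matches = match_watchlist(watchlist, documents)
print(matches)
```

A real system would run the list against a prebuilt index rather than scanning every document for every name. It would also need to cope with misspellings, transliterations and people who share a name, none of which this sketch attempts.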

This article was originally published by WIRED UK