Poorly anonymized logs reveal NYC cab drivers’ detailed whereabouts

Botched attempt to scrub data reveals driver details for 173 million taxi trips.

In the latest gaffe to demonstrate the privacy perils of anonymized data, New York City officials have inadvertently revealed the detailed comings and goings of individual taxi drivers over more than 173 million trips.

City officials released the data in response to a public records request and specifically obscured the drivers' hack license numbers and medallion numbers. Rather than including those numbers in plaintext, the 20-gigabyte file contained one-way cryptographic hashes generated with the MD5 algorithm. Instead of a record showing medallion number 9Y99 or hack number 5296319, for example, those numbers were converted to 71b9c3f3ee5efb81ca05e9b90c91c88f and 98c2b1aeb8d40ff826c6f1580a600853, respectively. Because they're one-way hashes, they can't be mathematically converted back into their original values. Presumably, officials used the hashes to preserve the privacy of individual drivers, since the records provide a detailed view of their locations and work performance over an extended period of time.

It turns out there's a significant flaw in the approach. Because both the medallion and hack numbers follow predictable patterns, it was trivial to run every possible value through the same MD5 algorithm and then compare the output to the hashes contained in the 20GB file. Software developer Vijay Pandurangan did just that, and in less than two hours he had completely de-anonymized all 173 million entries.

"Security researchers have been warning for a while that simply using hash functions is an ineffective way to anonymize data," Pandurangan wrote in a post published over the weekend on Medium. "In this case, it's substantially worse because of the structured format of the input data. This anonymization is so poor that anyone could, with less than two hours work, figure which driver drove every single trip in this entire dataset. It would even be easy to calculate drivers' gross income or infer where they live."

The incident is only the latest to underscore how easy it often is to de-anonymize data presumed to be scrubbed clean of personal details. E-mail addresses of neo-Nazi advocates posting anonymous comments online were deciphered late last year thanks to the poor approach of Gravatar, the service that works with GitHub and millions of other sites. In 2009, researchers uncovered similar flaws in US Census Bureau releases that attempted to omit the locations of people's homes and workplaces. As long ago as 2006, AOL demonstrated the same pitfalls when it released 20 million search queries from 658,000 users. Although the company took care to remove names and other personal information, the disclosure proved to be ham-fisted after privacy advocates showed that the data could still be used to identify the people making the searches.

The New York City taxi data was easy to de-anonymize because the values being obscured follow predictable formats. Hack license numbers are always six-digit numbers or seven-digit numbers that begin with a five. That makes for a maximum of two million possible numbers, a search space that takes a matter of seconds to exhaust using the rules built into cracking apps such as Hashcat. Medallion numbers similarly conform to specific patterns that make for a total of only 22 million possible combinations. The recent disclosure of so much personal New York City cab driver data may compound existing privacy concerns over the use of GPS devices to monitor drivers' movements and fares.
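To make the size of that search space concrete, here is a minimal Python sketch of an exhaustive search over the hack license format described above. It is an illustration, not Pandurangan's code. It assumes that each number was hashed as a plain string of ASCII digits with no salt, and that six-digit numbers may carry leading zeros (which is what yields one million six-digit values). The function names `candidate_hack_numbers` and `deanonymize` are hypothetical.

```python
from __future__ import annotations

import hashlib
from typing import Iterator


def candidate_hack_numbers() -> Iterator[str]:
    """Yield every candidate hack license number as a digit string.

    Assumption: six-digit numbers run from 000000 to 999999 and
    seven-digit numbers run from 5000000 to 5999999.
    """
    for n in range(1_000_000):
        yield f"{n:06d}"
    for n in range(5_000_000, 6_000_000):
        yield str(n)


def deanonymize(hashed_ids: set[str]) -> dict[str, str]:
    """Map each MD5 hex digest in hashed_ids back to the number behind it."""
    recovered = {}
    for candidate in candidate_hack_numbers():
        # Assumption: the released value is the unsalted MD5 of the digits.
        digest = hashlib.md5(candidate.encode("ascii")).hexdigest()
        if digest in hashed_ids:
            recovered[digest] = candidate
    return recovered
```

Passing the set of distinct hashed hack numbers from the released file to `deanonymize` would return the plaintext number for every hash that fits the assumed format. The loop computes only two million digests, which is why the hash function's one-way property offers no protection here.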

Pandurangan said he constructed a rainbow table to cycle through all 24 million hashes (the two million possible hack numbers plus the 22 million possible medallion numbers). In fact, he didn't use a rainbow table but rather a plain table of precomputed hashes, one for every possible value. A rainbow table is a special type of precomputed table that covers all or almost all possible entries while storing only the endpoints of long hash chains, which greatly reduces the file size at the cost of extra computation during each lookup. Rainbow tables and other types of precomputed data have largely fallen out of vogue as graphics cards and cracking applications such as Hashcat and John the Ripper have grown increasingly powerful.
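The distinction can be illustrated with a small Python sketch of a plain precomputed table, the kind of structure the paragraph above describes. Every digest is stored in full next to the value that produced it, so reversing a hash is a single dictionary lookup. The helper name `build_lookup_table` is hypothetical, and the table below covers only a 1,000-value slice of the seven-digit range to keep the example small. A full table would simply be fed every candidate value.

```python
import hashlib


def build_lookup_table(candidates):
    """Return a dict mapping MD5 hex digest -> original plaintext value."""
    return {
        hashlib.md5(value.encode("ascii")).hexdigest(): value
        for value in candidates
    }


# Illustrative slice only: seven-digit numbers 5296000 through 5296999.
table = build_lookup_table(str(n) for n in range(5_296_000, 5_297_000))


def lookup(digest):
    """Return the plaintext behind digest, or None if it is not in the table."""
    return table.get(digest)
```

A table like this trades storage for speed: it needs one entry per candidate, whereas a rainbow table would store far fewer entries and recompute parts of each chain on demand.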

No, MD5 is not to blame

Pandurangan said there are at least two things that New York City officials could have done to better protect taxi drivers' privacy. The first would have been to assign a random number to each hack license number and medallion number and use the substitute numbers throughout the disclosure. The other would have been to create a secret AES key and then encrypt each value individually.

Readers should bear in mind that the de-anonymization flaw in this case had nothing to do with the choice of MD5 as the hashing algorithm. While MD5 is extremely fast and computationally undemanding, that benefit is of little value to crackers attacking a dataset of only 24 million entries. The second weakness of MD5, its susceptibility to so-called cryptographic collision attacks, has no application in this case.
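The first of the two remedies Pandurangan suggests, substituting a random number for each identifier, might look something like the following Python sketch. It is one possible reading of that suggestion rather than a vetted anonymization scheme. The class name `Pseudonymizer` and the choice of 16 hexadecimal characters per substitute are assumptions made for illustration.

```python
import secrets


class Pseudonymizer:
    """Assign each real identifier a random substitute and reuse it consistently."""

    def __init__(self):
        self._mapping = {}   # real identifier -> random substitute
        self._used = set()   # substitutes already handed out

    def pseudonym_for(self, real_id: str) -> str:
        if real_id not in self._mapping:
            # The substitute comes from a random source, not from real_id,
            # so enumerating possible license numbers reveals nothing.
            token = secrets.token_hex(8)
            while token in self._used:
                token = secrets.token_hex(8)
            self._used.add(token)
            self._mapping[real_id] = token
        return self._mapping[real_id]
```

Because a substitute is not derived from the number it replaces, there is nothing to brute-force. The mapping itself then becomes the secret: it would have to be kept private or destroyed once the dataset is produced.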

"The cat is already out of the bag in this case," Pandurangan wrote, "but hopefully in the future, agencies will think carefully about the method they use to anonymize data before releasing it to the public."

Channel Ars Technica