On web browsing privacy in anonymized NetFlows
Anonymization of network traces is widely viewed as a necessary condition for releasing such data for research purposes. For obvious privacy reasons, an important goal of trace anonymization is to prevent the recovery of web browsing activities. While several studies have examined the possibility of reconstructing web browsing activities from anonymized packet-level traces, we argue that these approaches fail to account for a number of challenges inherent in real-world network traffic and, moreover, are unlikely to succeed on coarser NetFlow logs. By contrast, we develop new approaches that identify target web pages within anonymized NetFlow data and address many real-world challenges, such as browser caching and session parsing. We evaluate the effectiveness of our techniques in identifying the front pages of the 50 most popular web sites on the Internet (as ranked by alexa.com), both in a closed-world experiment similar to that of earlier work and in tests with real network flow logs. Our results show that certain types of web pages with unique and complex structure remain identifiable despite the use of state-of-the-art anonymization techniques. The concerns raised herein pose a threat to web browsing privacy insofar as the attacker can approximate the web browsing conditions represented in the flow logs.