1. bookVolume 2019 (2019): Issue 2 (April 2019)
Journal Details
First Published
16 Apr 2015
Publication timeframe
4 times per year
access type Open Access

Cardinality Estimators do not Preserve Privacy

Published Online: 04 May 2019
Page range: 26 - 46
Received: 31 Aug 2018
Accepted: 16 Dec 2018
Journal Details
First Published
16 Apr 2015
Publication timeframe
4 times per year

Cardinality estimators like HyperLogLog are sketching algorithms that estimate the number of distinct elements in a large multiset. Their use in privacy-sensitive contexts raises the question of whether they leak private information. In particular, can they provide any privacy guarantees while preserving their strong aggregation properties?

We formulate an abstract notion of cardinality estimators, that captures this aggregation requirement: one can merge sketches without losing precision. We propose an attacker model and a corresponding privacy definition, strictly weaker than differential privacy: we assume that the attacker has no prior knowledge of the data. We then show that if a cardinality estimator satisfies this definition, then it cannot have a reasonable level of accuracy. We prove similar results for weaker versions of our definition, a nd a nalyze t he p rivacy o f existing algorithms, showing that their average privacy loss is significant, e ven f or m ultisets w ith l arge cardinalities. We conclude that strong aggregation requirements are incompatible with any reasonable definition o f privacy, and that cardinality estimators should be considered as sensitive as raw data. We also propose risk mitigation strategies for their real-world applications.


[1] [n. d.]. Approximate algorithms in Apache Spark: HyperLogLog and Quantiles. https://databricks.com/blog/2016/05/19/approximate-algorithms-in-apache-spark-hyperloglog-and-quantiles.html. Accessed: 2018-05-03.Search in Google Scholar

[2] [n. d.]. BigQuery Technical Documentation on Functions & Operators. https://cloud.google.com/bigquery/docs/reference/standardsql/functions-and-operators. Accessed: 2017-10-24.Search in Google Scholar

[3] [n. d.]. HyperLogLog implementation for mssql. https://github.com/shuvava/mssql-hll. Accessed: 2018-05-03.Search in Google Scholar

[4] [n. d.]. PostgreSQL extension extension adding HyperLogLog data structures as a native data type. https://github.com/citusdata/postgresql-hll. Accessed: 2018-05-03.Search in Google Scholar

[5] Vikas G Ashok and Ravi Mukkamala. 2014. A scalable and efficient privacy preserving global itemset support approximation using Bloom filters. In IFIP Annual Conference on Data and Applications Security and Privacy. Springer, 382–389.Search in Google Scholar

[6] Ziv Bar-Yossef, TS Jayram, Ravi Kumar, D Sivakumar, and Luca Trevisan. 2002. Counting distinct elements in a data stream. In International Workshop on Randomization and Approximation Techniques in Computer Science. Springer, 1–10.Search in Google Scholar

[7] Raef Bassily, Adam Groce, Jonathan Katz, and Adam Smith. 2013. Coupled-worlds privacy: Exploiting adversarial uncertainty in statistical data privacy. In Foundations of Computer Science (FOCS), 2013 IEEE 54th Annual Symposium on. IEEE, 439–448.Search in Google Scholar

[8] Raef Bassily and Adam Smith. 2015. Local, private, efficient protocols for succinct histograms. In Proceedings of the forty-seventh annual ACM symposium on Theory of computing. ACM, 127–135.Search in Google Scholar

[9] Raef Bassily, Uri Stemmer, Abhradeep Guha Thakurta, et al. 2017. Practical locally private heavy hitters. In Advances in Neural Information Processing Systems. 2288–2296.Search in Google Scholar

[10] Kevin Beyer, Peter J Haas, Berthold Reinwald, Yannis Sismanis, and Rainer Gemulla. 2007. On synopses for distinct-value estimation under multiset operations. In Proceedings of the 2007 ACM SIGMOD international conference on Management of data. ACM, 199–210.Search in Google Scholar

[11] Raghav Bhaskar, Abhishek Bhowmick, Vipul Goyal, Srivatsan Laxman, and Abhradeep Thakurta. 2011. Noiseless database privacy. In International Conference on the Theory and Application of Cryptology and Information Security. Springer, 215–232.Search in Google Scholar

[12] George Casella and Roger L Berger. 2002. Statistical inference. Vol. 2. Duxbury Pacific Grove, CA.Search in Google Scholar

[13] Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: simplified data processing on large clusters. Commun. ACM 51, 1 (2008), 107–113.Search in Google Scholar

[14] Claudia Diaz, Stefaan Seys, Joris Claessens, and Bart Preneel. 2002. Towards measuring anonymity. In International Workshop on Privacy Enhancing Technologies. Springer, 54–68.Search in Google Scholar

[15] Yitao Duan. 2009. Privacy without noise. In Proceedings of the 18th ACM conference on Information and knowledge management. ACM, 1517–1520.Search in Google Scholar

[16] Marianne Durand and Philippe Flajolet. 2003. Loglog counting of large cardinalities. In European Symposium on Algorithms. Springer, 605–617.Search in Google Scholar

[17] Cynthia Dwork. 2008. Differential privacy: A survey of results. In International Conference on Theory and Applications of Models of Computation. Springer, 1–19.Search in Google Scholar

[18] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. 2006. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference. Springer, 265–284.Search in Google Scholar

[19] Cynthia Dwork, Guy N Rothblum, and Salil Vadhan. 2010. Boosting and differential privacy. In Foundations of Computer Science (FOCS), 2010 51st Annual IEEE Symposium on. IEEE, 51–60.Search in Google Scholar

[20] Rolf Egert, Marc Fischlin, David Gens, Sven Jacob, Matthias Senker, and Jörn Tillmanns. 2015. Privately computing set-union and set-intersection cardinality via Bloom filters. In Australasian Conference on Information Security and Privacy. Springer, 413–430.Search in Google Scholar

[21] Úlfar Erlingsson, Vasyl Pihur, and Aleksandra Korolova. 2014. RAPPOR: Randomized aggregatable privacy-preserving ordinal response. In Proceedings of the 2014 ACM SIGSAC conference on computer and communications security. ACM, 1054–1067.Search in Google Scholar

[22] Cristian Estan, George Varghese, and Mike Fisk. 2003. Bitmap algorithms for counting active flows on high speed links. In Proceedings of the 3rd ACM SIGCOMM conference on Internet measurement. ACM, 153–166.Search in Google Scholar

[23] Giulia Fanti, Vasyl Pihur, and Úlfar Erlingsson. 2016. Building a rappor with the unknown: Privacy-preserving learning of associations and data dictionaries. Proceedings on Privacy Enhancing Technologies 2016, 3 (2016), 41–61.Search in Google Scholar

[24] Philippe Flajolet, Éric Fusy, Olivier Gandouet, and Frédéric Meunier. 2008. HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm. DMTCS Proceedings 1 (2008).10.46298/dmtcs.3545Search in Google Scholar

[25] Philippe Flajolet and G Nigel Martin. 1985. Probabilistic counting algorithms for data base applications. Journal of computer and system sciences 31, 2 (1985), 182–209.Search in Google Scholar

[26] Philippe Golle and Kurt Partridge. 2009. On the anonymity of home/work location pairs. Pervasive computing (2009), 390–397.Search in Google Scholar

[27] Michaela Gotz, Ashwin Machanavajjhala, Guozhang Wang, Xiaokui Xiao, and Johannes Gehrke. 2012. Publishing search logs—a comparative study of privacy guarantees. IEEE Transactions on Knowledge and Data Engineering 24, 3 (2012), 520–532.Search in Google Scholar

[28] Krzysztof Grining and Marek Klonowski. 2017. Towards Extending Noiseless Privacy: Dependent Data and More Practical Approach. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security. ACM, 546–560.Search in Google Scholar

[29] Stefan Heule, Marc Nunkesser, and Alexander Hall. 2013. HyperLogLog in practice: algorithmic engineering of a state of the art cardinality estimation algorithm. In Proceedings of the 16th International Conference on Extending Database Technology. ACM, 683–692.Search in Google Scholar

[30] Daniel Kifer and Ashwin Machanavajjhala. 2012. A rigorous and customizable framework for privacy. In Proceedings of the 31st ACM SIGMOD-SIGACT-SIGAI symposium on Principles of Database Systems. ACM, 77–88.Search in Google Scholar

[31] Ashwin Machanavajjhala, Daniel Kifer, John Abowd, Johannes Gehrke, and Lars Vilhuber. 2008. Privacy: Theory meets practice on the map. In Proceedings of the 2008 IEEE 24th International Conference on Data Engineering. IEEE Computer Society, 277–286.Search in Google Scholar

[32] Luca Melis, George Danezis, and Emiliano De Cristofaro. 2015. Efficient private statistics with succinct sketches. arXiv preprint arXiv:1508.06110 (2015).10.14722/ndss.2016.23175Search in Google Scholar

[33] Ilya Mironov. 2017. Renyi differential privacy. In Computer Security Foundations Symposium (CSF), 2017 IEEE 30th. IEEE, 263–275.Search in Google Scholar

[34] Anna Monreale, Wendy Hui Wang, Francesca Pratesi, Salvatore Rinzivillo, Dino Pedreschi, Gennady Andrienko, and Natalia Andrienko. 2013. Privacy-preserving distributed movement data aggregation. In Geographic Information Science at the Heart of Europe. Springer, 225–245.Search in Google Scholar

[35] Sriram Padmanabhan, Bishwaranjan Bhattacharjee, Tim Malkemus, Leslie Cranston, and Matthew Huras. 2003. Multidimensional clustering: A new data layout scheme in db2. In Proceedings of the 2003 ACM SIGMOD international conference on Management of data. ACM, 637–641.Search in Google Scholar

[36] Odysseas Papapetrou, Wolf Siberski, and Wolfgang Nejdl. 2010. Cardinality estimation and dynamic length adaptation for Bloom filters. Distributed and Parallel Databases 28, 2 (2010), 119–156.Search in Google Scholar

[37] Vibhor Rastogi, Michael Hay, Gerome Miklau, and Dan Suciu. 2009. Relationship privacy: output perturbation for queries with joins. In Proceedings of the twenty-eighth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems. ACM, 107–116.Search in Google Scholar

[38] David Rebollo-Monedero and Jordi Forné. 2010. Optimized query forgery for private information retrieval. IEEE Transactions on Information Theory 56, 9 (2010), 4631–4642.Search in Google Scholar

[39] David Rebollo-Monedero, Jordi Forne, and Josep Domingo-Ferrer. 2010. From t-closeness-like privacy to postrandomization via information theory. IEEE Transactions on Knowledge and Data Engineering 22, 11 (2010), 1623–1636.Search in Google Scholar

[40] Amit Shukla, Prasad Deshpande, Jeffrey F Naughton, and Karthikeyan Ramasamy. 1996. Storage estimation for multidimensional aggregates in the presence of hierarchies. In VLDB, Vol. 96. 522–531.Search in Google Scholar

[41] Florian Tschorsch and Björn Scheuermann. 2013. An algorithm for privacy-preserving distributed user statistics. Computer Networks 57, 14 (2013), 2775–2787.Search in Google Scholar

[42] Kyu-Young Whang, Brad T Vander-Zanden, and Howard M Taylor. 1990. A linear-time probabilistic counting algorithm for database applications. ACM Transactions on Database Systems (TODS) 15, 2 (1990), 208–229.Search in Google Scholar

[43] Shuheng Zhou, Katrina Ligett, and Larry Wasserman. 2009. Differential privacy with compression. In Information Theory, 2009. ISIT 2009. IEEE International Symposium on. IEEE, 2718–2722.Search in Google Scholar

Recommended articles from Trend MD

Plan your remote conference with Sciendo