1. bookVolume 2018 (2018): Issue 3 (June 2018)
Journal Details
First Published
16 Apr 2015
Publication timeframe
4 times per year
access type Open Access

Toward Distribution Estimation under Local Differential Privacy with Small Samples

Published Online: 28 Apr 2018
Page range: 84 - 104
Received: 30 Nov 2017
Accepted: 16 Mar 2018
Journal Details
First Published
16 Apr 2015
Publication timeframe
4 times per year

A number of studies have recently been made on discrete distribution estimation in the local model, in which users obfuscate their personal data (e.g., location, response in a survey) by themselves and a data collector estimates a distribution of the original personal data from the obfuscated data. Unlike the centralized model, in which a trusted database administrator can access all users’ personal data, the local model does not suffer from the risk of data leakage. A representative privacy metric in this model is LDP (Local Differential Privacy), which controls the amount of information leakage by a parameter ∈ called privacy budget. When ∈ is small, a large amount of noise is added to the personal data, and therefore users’ privacy is strongly protected. However, when the number of users ℕ is small (e.g., a small-scale enterprise may not be able to collect large samples) or when most users adopt a small value of ∈, the estimation of the distribution becomes a very challenging task. The goal of this paper is to accurately estimate the distribution in the cases explained above. To achieve this goal, we focus on the EM (Expectation-Maximization) reconstruction method, which is a state-of-the-art statistical inference method, and propose a method to correct its estimation error (i.e., difference between the estimate and the true value) using the theory of Rilstone et al. We prove that the proposed method reduces the MSE (Mean Square Error) under some assumptions.We also evaluate the proposed method using three largescale datasets, two of which contain location data while the other contains census data. The results show that the proposed method significantly outperforms the EM reconstruction method in all of the datasets when ℕ or ∈ is small.


[1] Aggarwal CC, Yu PS (2008) Privacy-Preserving Data Mining. SpringerSearch in Google Scholar

[2] Agrawal D, Aggarwal CC (2001) On the design and quantification of privacy preserving data mining algorithms. In: Proceedings of the 20th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS’01), pp 247-255Search in Google Scholar

[3] Agrawal R, Srikant R, Thomas D (2005) Privacy preserving OLAP. In: Proceedings of the 2005 ACM SIGMOD international conference on Management of data (SIGMOD’05), pp 251-262Search in Google Scholar

[4] Andrés ME, Bordenabe NE, Chatzikokolakis K, Palamidessi C (2013) Geo-indistinguishability: Differential privacy for location-based systems. In: Proceedings of the 20th ACM Conference on Computer and Communications Security (CCS’13), pp 901-914Search in Google Scholar

[5] Bao Y, Ullah A (2007) The second-order bias and mean squared error of estimators in time-series models. Journal of Econometrics 140(2):650-669Search in Google Scholar

[6] Bordenabe NE, Chatzikokolakis K, Palamidessi C (2014) Optimal geo-indistinguishable mechanisms for location privacy. In: Proceedings of the 21st ACM Conference on Computer and Communications Security (CCS’14), pp 251-262Search in Google Scholar

[7] Chatzikokolakis K, ElSalamouny E, Palamidessi C (2017) Practical mechanisms for location privacy. Proceedings on Privacy Enhancing Technologies (PoPETs) 2017(4):210-231Search in Google Scholar

[8] Chelmis C, Kolte J, Prasanna VK (2015) Big data analytics for demand response: Clustering over space and time. In: Proceedings of 2015 IEEE International Conference on Big Data (BigData’15), pp 2223-2232Search in Google Scholar

[9] Cover TM, Thomas JA (2006) Elements of Information Theory, Second Edition. Wiley-InterscienceSearch in Google Scholar

10] Data Breaches Increase 40 Percent in 2016, Finds New Report from Identity Theft Resource Center and CyberScout (2017) http://www.idtheftcenter.org/2016databreaches.htmlSearch in Google Scholar

[11] Duchi JC, Jordan MI, Wainwright MJ (2013) Local privacy and statistical minimax rates. In: Proceedings of the IEEE 54th Annual Symposium on Foundations of Computer Science (FOCS’13), pp 429-438Search in Google Scholar

[12] Dwork C (2006) Differential privacy. In: Proceedings of the 33rd international conference on Automata, Languages and Programming (ICALP’06), pp 1-12Search in Google Scholar

[13] Dwork C, Roth A (2014) The Algorithmic Foundations of Differential Privacy. Now PublishersSearch in Google Scholar

[14] Eagle N, Pentland A, Lazer D (2009) Inferring friendship network structure by using mobile phone data. Proceedings of the National Academy of Sciences (PNAS) 106(36):15,274-15,278Search in Google Scholar

[15] Efron B, Hastie T (2016) Computer Age Statistical Inference. Cambridge UniversitySearch in Google Scholar

[16] Eltarjaman W, Dewri R, Thurimella R (2017) Location privacy for rank-based geo-query systems. Proceedings on Privacy Enhancing Technologies (PoPETs) 2017(4):19-38Search in Google Scholar

[17] Úlfar Erlingsson, Pihur V, Korolova A (2014) RAPPOR: Randomized aggregatable privacy-preserving ordinal response. In: Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security (CCS’14), pp 1054-1067Search in Google Scholar

[18] Freudiger J, Shokri R, Hubaux JP (2011) Evaluating the privacy risk of location-based services. In: Proceedings of the 15th international conference on Financial Cryptography and Data Security (FC’11), pp 31-46Search in Google Scholar

[19] Gelman A, Meng XL, Stern H (1996) Posterior predictive assessment of model fitness via realized discrepancies. Statistica Sinca 6:733-807Search in Google Scholar

[20] Groetcsh C (1984) The Theory of Tikhonov Regularization for Fredholm Equations of the First Kind. Pitman Advanced Publishing ProgramSearch in Google Scholar

[21] Hasan O, Habegger B, Brunie L, Bennani N, Damiani E (2009) A discussion of privacy challenges in user profiling with big data techniques: The EEXCESS use case. In: Proceedings of 2013 IEEE International Congress on Big Data (BigData Congress’13), pp 25-30Search in Google Scholar

[22] Hastie T, Tibshirani R, Friedman J (2009) The Elements of Statistical Learning. Spinger, 2nd edition Search in Google Scholar

[23] Hino H, Shen H, Murata N, Wakao S, Hayashi Y (2013) A versatile clustering method for electricity consumption pattern analysis in households. IEEE Transactions on Smart Grid 4(2):1048-1057Search in Google Scholar

[24] Hsu J, Gaboardi M, Haeberlen A, Khanna S, Narayan A, Pierce BC, Roth A (2014) Differential privacy: An economic method for choosing epsilon. In: Proceedings of the 2014 IEEE 27th Computer Security Foundations Symposium (CSF’14), pp 398-410Search in Google Scholar

[25] Huang Z, Du W (2008) OptRR: Optimizing randomized response schemes for privacy-preserving data mining. In: Proceedings of IEEE 24th International Conference on Data Engineering (ICDE’08), pp 705-714Search in Google Scholar

[26] Hull B, Bychkovsky V, Zhang Y, Chen K, Goraczko M (2006) CarTel: A distributed mobile sensor computing system. In: Proceedings of the 4th International Conference on Embedded Networked Sensor Systems (SenSys’06), pp 125-138Search in Google Scholar

[27] Ishiguro M, Sakamoto Y, Kitagawa G (1997) Bootstrapping log likelihood and EIC, an extension of AIC. Annals of the Institute of Statistical Mathematics 49(3):411-43410.1023/A:1003158526504Open DOISearch in Google Scholar

[28] Johansen TA (1997) On Tikhonov regularization, bias and variance in nonlinear system identification. Automatica (3):441-44610.1016/S0005-1098(96)00168-9Open DOISearch in Google Scholar

[29] Jorgensen Z, Yu T, Cormode G (2015) Conservative or liberal? Personalized differential privacy. In: Proceedings of IEEE 31st International Conference on Data Engineering (ICDE’15), pp 1023-1034Search in Google Scholar

[30] Kairouz P, Bonawitz K, Ramage D (2016) Discrete distribution estimation under local privacy. In: Proceedings of the 33rd International Conference on Machine Learning (ICML’16), pp 2436-2444Search in Google Scholar

[31] Kairouz P, Oh S, Viswanath P (2016) Extremal mechanisms for local differential privacy. Journal of Machine Learning Research 17(1):492-542Search in Google Scholar

[32] Li N, Lyu M, Su D (2016) Differential Privacy: From Theory to Practice. Morgan & Claypool Publishers Search in Google Scholar

[33] Lichman M (2013) UCI machine learning repository. URL http://archive.ics.uci.edu/mlSearch in Google Scholar

[34] Lin J (1991) Divergence measures based on the shannon entropy. IEEE Transactions on Information Theory 37(1):145- 15110.1109/18.61115Open DOISearch in Google Scholar

[35] Lisovich MA, Mulligan DK, Wicker SB (2010) Inferring personal information from demand-response systems. IEEE Security & Privacy 8(1):11-2010.1109/MSP.2010.40Open DOISearch in Google Scholar

[36] Matsuo Y, Okazaki N, Izumi K, Nakamura Y, Nishimura T, Hasida K (2007) Inferring long-term user properties based on users’ location history. In: Proceedings of the 20th International Joint Conference on Artifical Intelligence (IJCAI’ 07), pp 2159-2165Search in Google Scholar

[37] Pastore A, Gastpar M (2016) Locally differentially-private distribution estimation. In: Proceedings of the 2016 IEEE International Symposium on Information Theory (ISIT’16), pp 2694-2698Search in Google Scholar

[38] Pyrgelis A, Troncoso C, Cristofaro ED (2017) What does the crowd say about you? Evaluating aggregation-based location privacy. Proceedings on Privacy Enhancing Technologies (PoPETs) 2017(4):76-96Search in Google Scholar

[39] Qin Z, Yang Y, Yu T, Khalil I, Xiao X, Ren K (2016) Heavy hitter estimation over set-valued data with local differential privacy. In: Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (CCS’16), pp 192-203Search in Google Scholar

[40] Quercia D, Leontiadis I, McNamara L, Mascolo C, Crowcroft J (2011) Spotme if you can: Randomized responses for location obfuscation on mobile phones. In: Proceedings of the 2011 31st International Conference on Distributed Computing Systems (ICDCS’11), pp 363-372Search in Google Scholar

[41] Rilstone P, Srivastava V, Ullah A (1996) The second-order bias and mean squared error of nonlinear estimators. Journal of Economics 75(2):369-39510.1016/0304-4076(96)89457-7Open DOISearch in Google Scholar

[42] Schiaffino S, Amandi A (2009) Intelligent user profiling. In: Bramer M (ed) Artificial Intelligence, Springer-Verlag, pp 193-216Search in Google Scholar

[43] Schubert E, Zimek A, Kriegel HP (2014) Generalized outlier detection with flexible kernel density estimates. In: Proceedings of the 14th SIAM International Conference on Data Mining (SDM’14), pp 542-550Search in Google Scholar

[44] Sei Y, Ohusuga A (2017) Differential private data collection and analysis based on randomized multiple dummies for untrusted mobile crowdsensing. IEEE Transactions on Information Forensics and Security 12(4):926-93910.1109/TIFS.2016.2632069Open DOISearch in Google Scholar

[45] Sekimoto Y, Shibasaki R, Kanasugi H, Usui T, Shimazaki Y (2011) Pflow: Reconstructing people flow recycling large-scale social survey data. IEEE Pervasive Computing 10(4):27-3510.1109/MPRV.2011.43Open DOISearch in Google Scholar

[46] Shekhar S, Evans MR, Gunturi V, Yang K (2012) Spatial big-data challenges intersecting mobility and cloud computing. In: Proceedings of the 11th ACM International Workshop on Data Engineering for Wireless and Mobile Access (MobiDE’12), pp 1-12Search in Google Scholar

[47] van der Vaart AW (1998) Asymptotic Statistics. Cambridge University PressSearch in Google Scholar

[48] Wang W, Carreira-Perpiñán MÁ (2013) Projection onto the probability simplex: An efficient algorithm with a simple proof, and an application. CoRR abs/1309.1541, URL http://arxiv.org/abs/1309.1541Search in Google Scholar

[49] Warner SL (1965) Randomized response: A survey technique for eliminating evasive answer bias. Journal of the American Statistical Association 60(309):63-69Search in Google Scholar

[50] Yang D, Zhang D, Qu B (2016) Participatory cultural mapping based on collective behavior data in location based social network. ACM Transactions on Intelligent Systems and Technology 7(3):30:1-30:23Search in Google Scholar

[51] Zheng Y, Zhang L, Xie X, Ma WY (2009) Mining interesting locations and travel sequences from GPS trajectories. In: Proceedings of the 18th International Conference on World Wide Web (WWW’09), pp 791-800Search in Google Scholar

Recommended articles from Trend MD

Plan your remote conference with Sciendo