Volume 2018 (2018): Issue 1 (January 2018)
Journal Details
Format: Journal
First Published: 16 Apr 2015
Publication timeframe: 4 times per year
Languages: English
Access type: Open Access

SafePub: A Truthful Data Anonymization Algorithm With Strong Privacy Guarantees

Published Online: 11 Jan 2018
Page range: 67 - 87
Received: 31 May 2017
Accepted: 16 Sep 2017

Methods for privacy-preserving data publishing and analysis trade off privacy risks for individuals against the quality of output data. In this article, we present a data publishing algorithm that satisfies the differential privacy model. The transformations performed are truthful: the algorithm neither perturbs input data nor generates synthetic output data. Instead, records are randomly drawn from the input dataset and the uniqueness of their features is reduced, which also offers an intuitive notion of privacy protection. Moreover, the approach is generic: it can be parameterized with different objective functions to optimize its output towards different applications, which we demonstrate by integrating six well-known data quality models. We present an extensive analytical and experimental evaluation and a comparison with prior work. The results show that our algorithm is the first practical implementation of the described approach and that it can be used with reasonable privacy parameters to achieve high degrees of protection. Moreover, when the generic method is parameterized with an objective function quantifying the suitability of data for building statistical classifiers, we measured prediction accuracies that compare very well with results obtained using state-of-the-art differentially private classification algorithms.
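The core idea described above (randomly sample records, then reduce the uniqueness of their features rather than perturbing values) can be sketched as follows. This is a minimal illustration, not the authors' actual algorithm: the function name, the choice of sampling probability, and the group-size threshold k are all assumptions made for the example.

```python
import math
import random


def sample_and_generalize(records, generalize, epsilon=1.0, k=5, seed=42):
    """Hedged sketch of a truthful sample-then-generalize scheme.

    1. Each record is independently included with probability beta,
       derived here from the privacy parameter epsilon (random
       sampling is known to amplify privacy guarantees).
    2. Included records are coarsened by a user-supplied `generalize`
       function, reducing the uniqueness of their features; values are
       never perturbed or synthesized, so the output stays truthful.
    3. Generalized records whose group is smaller than k are
       suppressed, so no rare feature combination is released.

    `generalize` must return hashable values.
    """
    rng = random.Random(seed)

    # Illustrative choice of sampling probability: beta = 1 - exp(-epsilon).
    # This particular formula is an assumption for the sketch.
    beta = 1.0 - math.exp(-epsilon)
    sample = [r for r in records if rng.random() < beta]

    # Coarsen each sampled record (truthful transformation).
    generalized = [generalize(r) for r in sample]

    # Count equivalence classes and suppress those smaller than k.
    counts = {}
    for g in generalized:
        counts[g] = counts.get(g, 0) + 1
    return [g for g in generalized if counts[g] >= k]


if __name__ == "__main__":
    # Example: ages 0..99 generalized to decades; rare decades are dropped.
    ages = list(range(100))
    released = sample_and_generalize(ages, lambda a: (a // 10) * 10)
    print(sorted(set(released)))
```

Because every released value appears at least k times and is a coarsened version of a true input value, the output offers the intuitive flavor of protection the abstract describes, while the random sampling step is what makes a differential-privacy analysis possible.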

