Volume 2021 (2021): Issue 3 (July 2021)
Journal Details
First Published
16 Apr 2015
Publication timeframe
4 times per year
Access type: Open Access

Growing synthetic data through differentially-private vine copulas

Published Online: 27 Apr 2021
Page range: 122 - 141
Received: 30 Nov 2020
Accepted: 16 Mar 2021

In this work, we propose a novel approach to data synthesis based on copulas, which are interpretable and robust models used extensively in the actuarial domain. More precisely, our method, COPULA-SHIRLEY, is based on the differentially private training of vine copulas, a family of copulas that can model and generate data of arbitrary dimension. The COPULA-SHIRLEY framework is simple yet flexible: it can be applied to many types of data while preserving utility, as demonstrated by experiments conducted on real datasets. We also evaluate the protection level of our data synthesis method through a membership inference attack recently proposed in the literature.
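
To make the general idea of copula-based synthesis concrete, the following is a minimal sketch and not the COPULA-SHIRLEY algorithm itself. It assumes a single bivariate Gaussian copula rather than a vine, and it is not differentially private: the dependence estimate receives no noise, and the sampler returns values drawn from the original columns. The function names (`pseudo_observations`, `empirical_quantile`, `fit_and_sample`) are hypothetical and chosen only for illustration.

```python
import random
from statistics import NormalDist


def pseudo_observations(column):
    """Map values to the open interval (0, 1) via scaled ranks: rank / (n + 1)."""
    n = len(column)
    order = sorted(range(n), key=lambda i: column[i])
    u = [0.0] * n
    for rank, i in enumerate(order, start=1):
        u[i] = rank / (n + 1)
    return u


def empirical_quantile(sorted_column, u):
    """Inverse empirical CDF: return the order statistic at level u."""
    idx = min(int(u * len(sorted_column)), len(sorted_column) - 1)
    return sorted_column[idx]


def fit_and_sample(xs, ys, n_samples, rng):
    """Fit a bivariate Gaussian copula (NOT private) and draw synthetic pairs."""
    std = NormalDist()
    # Step 1: move each marginal to the uniform scale, then to normal scores.
    zx = [std.inv_cdf(u) for u in pseudo_observations(xs)]
    zy = [std.inv_cdf(u) for u in pseudo_observations(ys)]
    # Step 2: estimate dependence as the correlation of the normal scores.
    n = len(zx)
    mx, my = sum(zx) / n, sum(zy) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(zx, zy))
    vx = sum((a - mx) ** 2 for a in zx)
    vy = sum((b - my) ** 2 for b in zy)
    rho = cov / (vx * vy) ** 0.5
    # Step 3: sample correlated normals, map them to uniforms, and invert
    # the empirical marginals to return to the original data scale.
    sx, sy = sorted(xs), sorted(ys)
    scale = max(0.0, 1.0 - rho * rho) ** 0.5
    synthetic = []
    for _ in range(n_samples):
        z1 = rng.gauss(0, 1)
        z2 = rho * z1 + scale * rng.gauss(0, 1)
        synthetic.append(
            (empirical_quantile(sx, std.cdf(z1)),
             empirical_quantile(sy, std.cdf(z2)))
        )
    return rho, synthetic
```

The sketch separates the marginal distributions from the dependence structure, which is the property that makes copulas interpretable. A private variant would need to perturb both the marginal estimates and the dependence estimate, and a vine would chain many such bivariate copulas to reach higher dimensions.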

