1. bookVolume 2022 (2022): Issue 2 (April 2022)
Journal Details
License
Format
Journal
eISSN
2299-0984
First Published
16 Apr 2015
Publication timeframe
4 times per year
Languages
English
access type Open Access

Privacy-preserving training of tree ensembles over continuous data

Published Online: 03 Mar 2022
Volume & Issue: Volume 2022 (2022) - Issue 2 (April 2022)
Page range: 205 - 226
Received: 31 Aug 2021
Accepted: 16 Dec 2021
Journal Details
License
Format
Journal
eISSN
2299-0984
First Published
16 Apr 2015
Publication timeframe
4 times per year
Languages
English
Abstract

Most existing Secure Multi-Party Computation (MPC) protocols for privacy-preserving training of decision trees over distributed data assume that the features are categorical. In real-life applications, features are often numerical. The standard “in the clear” algorithm to grow decision trees on data with continuous values requires sorting of training examples for each feature in the quest for an optimal cut-point in the range of feature values in each node. Sorting is an expensive operation in MPC, hence finding secure protocols that avoid such an expensive step is a relevant problem in privacy-preserving machine learning. In this paper we propose three more efficient alternatives for secure training of decision tree based models on data with continuous features, namely: (1) secure discretization of the data, followed by secure training of a decision tree over the discretized data; (2) secure discretization of the data, followed by secure training of a random forest over the discretized data; and (3) secure training of extremely randomized trees (“extra-trees”) on the original data. Approaches (2) and (3) both involve randomizing feature choices. In addition, in approach (3) cut-points are chosen randomly as well, thereby alleviating the need to sort or to discretize the data up front. We implemented all proposed solutions in the semi-honest setting with additive secret sharing based MPC. In addition to mathematically proving that all proposed approaches are correct and secure, we experimentally evaluated and compared them in terms of classification accuracy and runtime. We privately train tree ensembles over data sets with thousands of instances or features in a few minutes, with accuracies that are at par with those obtained in the clear. This makes our solution more efficient than the existing approaches, which are based on oblivious sorting.

Keywords

[1] R. Cramer, I. Damgård, and J. B. Nielsen, Secure Multiparty Computation and Secret Sharing. Cambridge University Press, 2015.10.1017/CBO9781107337756 Search in Google Scholar

[2] P. Mohassel and Y. Zhang, “SecureML: A system for scalable privacy-preserving machine learning,” in IEEE Symposium on Security and Privacy (SP), pp. 19–38, 2017.10.1109/SP.2017.12 Search in Google Scholar

[3] S. Wagh, D. Gupta, and N. Chandran, “SecureNN: 3-party secure computation for neural network training,” Proc. on Privacy Enhancing Technologies, no. 3, pp. 26–49, 2019.10.2478/popets-2019-0035 Search in Google Scholar

[4] M. De Cock, R. Dowsley, A. C. A. Nascimento, D. Railsback, J. Shen, and A. Todoki, “High performance logistic regression for privacy-preserving genome analysis,” BMC Medical Genomics, vol. 14(23), 2021.10.1186/s12920-020-00869-9781857733472626 Search in Google Scholar

[5] C. Guo, A. Hannun, B. Knott, L. van der Maaten, M. Tygert, and R. Zhu, “Secure multiparty computations in floating-point arithmetic,” arXiv:2001.03192, 2020. Search in Google Scholar

[6] T. G. Dietterich, “Ensemble methods in machine learning,” in International Workshop on Multiple Classifier Systems, vol. 1857 of LNCS, pp. 1–15, Springer, 2000. Search in Google Scholar

[7] D. J. Wu, T. Feng, M. Naehrig, and K. E. Lauter, “Privately evaluating decision trees and random forests.,” Proc. on Privacy Enhancing Technologies, no. 4, pp. 335–355, 2016.10.1515/popets-2016-0043 Search in Google Scholar

[8] K. Fritchman, K. Saminathan, R. Dowsley, T. Hughes, M. De Cock, A. Nascimento, and A. Teredesai, “Privacy-preserving scoring of tree ensembles: A novel framework for AI in healthcare,” in IEEE Big Data, pp. 2413–2422, 2018. Search in Google Scholar

[9] Á. Kiss, M. Naderpour, J. Liu, N. Asokan, and T. Schneider, “Sok: Modular and efficient private decision tree evaluation,” Proc. on Privacy Enhancing Technologies, no. 2, p. 187–208, 2019.10.2478/popets-2019-0026 Search in Google Scholar

[10] J. R. Quinlan, “Induction of decision trees,” Machine learning, vol. 1, no. 1, pp. 81–106, 1986.10.1007/BF00116251 Search in Google Scholar

[11] Y. Lindell and B. Pinkas, “Privacy preserving data mining,” in Annual International Cryptology Conf., pp. 36–54, 2000.10.1007/3-540-44598-6_3 Search in Google Scholar

[12] J. Vaidya and C. Clifton, “Privacy-preserving decision trees over vertically partitioned data,” in IFIP Annual Conf. on Data and Appl. Security and Privacy, pp. 139–152, 2005.10.1007/11535706_11 Search in Google Scholar

[13] S. Samet and A. Miri, “Privacy preserving ID3 using Gini index over horizontally partitioned data,” in 2008 IEEE/ACS Intern. Conf. on Comp. Syst. and Appl., pp. 645–651, 2008.10.1109/AICCSA.2008.4493598 Search in Google Scholar

[14] S. de Hoogh, B. Schoenmakers, P. Chen, and H. op den Akker, “Practical secure decision tree learning in a teletreatment application,” in Intern. Conf. on Financial Cryptography and Data Security, pp. 179–194, Springer, 2014.10.1007/978-3-662-45472-5_12 Search in Google Scholar

[15] M.-J. Xiao, K. Han, L.-S. Huang, and J.-Y. Li, “Privacy preserving C4.5 algorithm over horizontally partitioned data,” in Fifth International Conference on Grid and Cooperative Computing (GCC’06), pp. 78–85, IEEE, 2006.10.1109/GCC.2006.73 Search in Google Scholar

[16] Y. Shen, H. Shao, and L. Yang, “Privacy preserving C4.5 algorithm over vertically distributed datasets,” in Intern. Conf. on Networks Security, Wireless Communications and Trusted Computing, vol. 2, pp. 446–448, IEEE, 2009.10.1109/NSWCTC.2009.253 Search in Google Scholar

[17] G. Behera, “Privacy preserving C4.5 using Gini index,” in 2nd National Conference on Emerging Trends and Applications in Computer Science, pp. 1–4, 2011.10.1109/NCETACS.2011.5751385 Search in Google Scholar

[18] M. Abspoel, D. Escudero, and N. Volgushev, “Secure training of decision trees with continuous attributes,” in Proc. on Privacy Enhancing Technologies, no. 1, pp. 167–187, 2021.10.2478/popets-2021-0010 Search in Google Scholar

[19] J. R. Quinlan, C4.5: programs for machine learning. Elsevier, 2014. Search in Google Scholar

[20] P. Geurts, D. Ernst, and L. Wehenkel, “Extremely randomized trees,” Machine Learning, vol. 63, no. 1, pp. 3–42, 2006.10.1007/s10994-006-6226-1 Search in Google Scholar

[21] K. Deforth, M. Desgroseilliers, N. Gama, M. Georgieva, D. Jetchev, and M. Vuille, “XORBoost: Tree boosting in the multiparty computation setting.” Cryptology ePrint Archive, Report 2021/432, 2021. https://eprint.iacr.org/2021/432. Search in Google Scholar

[22] M. De Cock, R. Dowsley, C. Horst, R. Katti, A. Nascimento, W.-S. Poon, and S. Truex, “Efficient and private scoring of decision trees, support vector machines and logistic regression models based on pre-computation,” IEEE Transactions on Dependable and Secure Computing, vol. 16, no. 2, pp. 217–230, 2019.10.1109/TDSC.2017.2679189 Search in Google Scholar

[23] D. Beaver, “Commodity-based cryptography,” in STOC, vol. 97, pp. 446–455, 1997.10.1145/258533.258637 Search in Google Scholar

[24] B. David, R. Dowsley, R. Katti, and A. C. Nascimento, “Efficient unconditionally secure comparison and privacy preserving machine learning classification protocols,” in International Conference on Provable Security, pp. 354–367, Springer, 2015.10.1007/978-3-319-26059-4_20 Search in Google Scholar

[25] M. De Cock, R. Dowsley, A. C. A. Nascimento, and S. C. Newman, “Fast, privacy preserving linear regression over distributed datasets based on pre-distributed data,” in 8th ACM Workshop on Artificial Intelligence and Security (AISec), pp. 3–14, 2015.10.1145/2808769.2808774 Search in Google Scholar

[26] R. Canetti, “Universally composable security: A new paradigm for cryptographic protocols,” in 42nd Annual Symposium on Foundations of Computer Science, pp. 136–145, IEEE Computer Society, 2001.10.1109/SFCS.2001.959888 Search in Google Scholar

[27] D. Reich, A. Todoki, R. Dowsley, M. De Cock, and A. Nascimento, “Privacy-preserving classification of personal text messages with secure multi-party computation,” in Advances in Neural Information Processing Systems (NeurIPS), pp. 3752–3764, 2019. Search in Google Scholar

[28] N. Attrapadung, G. Hanaoka, S. Kiyomoto, T. Mimoto, and J. C. N. Schuldt, “A taxonomy of secure two-party comparison protocols and efficient constructions,” IEICE Trans. Fundam. Electron. Commun. Comput. Sci., vol. 102-A, no. 9, pp. 1048–1060, 2019. Search in Google Scholar

[29] D. Bogdanov, S. Laur, and J. Willemson, “Sharemind: A framework for fast privacy-preserving computations,” in European Symposium on Research in Computer Security, pp. 192–206, Springer, 2008.10.1007/978-3-540-88313-5_13 Search in Google Scholar

[30] T. Nishide and K. Ohta, “Multiparty computation for interval, equality, and comparison without bit-decomposition protocol,” in International Workshop on Public Key Cryptography, pp. 343–360, Springer, 2007.10.1007/978-3-540-71677-8_23 Search in Google Scholar

[31] L. Breiman, “Random forests,” Machine learning, vol. 45, no. 1, pp. 5–32, 2001.10.1023/A:1010933404324 Search in Google Scholar

[32] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011. Search in Google Scholar

[33] S. Garcia, J. Luengo, J. A. Sáez, V. Lopez, and F. Herrera, “A survey of discretization techniques: Taxonomy and empirical analysis in supervised learning,” IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 4, pp. 734–750, 2012.10.1109/TKDE.2012.35 Search in Google Scholar

[34] R. Dowsley, Cryptography Based on Correlated Data: Foundations and Practice. PhD thesis, Karlsruhe Institute of Technology, Germany, 2016. Search in Google Scholar

Recommended articles from Trend MD

Plan your remote conference with Sciendo