1. bookVolume 2020 (2020): Issue 2 (April 2020)
Journal Details
First Published
16 Apr 2015
Publication timeframe
4 times per year
access type Open Access

Secure and Scalable Document Similarity on Distributed Databases: Differential Privacy to the Rescue

Published Online: 08 May 2020
Page range: 209 - 229
Received: 31 Aug 2019
Accepted: 16 Dec 2019
Journal Details
First Published
16 Apr 2015
Publication timeframe
4 times per year

Privacy-preserving collaborative data analysis enables richer models than what each party can learn with their own data. Secure Multi-Party Computation (MPC) offers a robust cryptographic approach to this problem, and in fact several protocols have been proposed for various data analysis and machine learning tasks. In this work, we focus on secure similarity computation between text documents, and the application to k-nearest neighbors (k-NN) classification. Due to its non-parametric nature, k-NN presents scalability challenges in the MPC setting. Previous work addresses these by introducing non-standard assumptions about the abilities of an attacker, for example by relying on non-colluding servers. In this work, we tackle the scalability challenge from a different angle, and instead introduce a secure preprocessing phase that reveals differentially private (DP) statistics about the data. This allows us to exploit the inherent sparsity of text data and significantly speed up all subsequent classifications.


[1] Aspell dictionary creation. http://app.aspell.net/create.Search in Google Scholar

[2] Amazon Product Data. http://jmcauley.ucsd.edu/data/amazon/.Search in Google Scholar

[3] Multi-protocol SPDZ. https://github.com/data61/MP-SPDZ, 2018.Search in Google Scholar

[4] 1000 Genomes Project Consortium et al. A global reference for human genetic variation. Nature, 526(7571):68, 2015.Search in Google Scholar

[5] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL http://tensorflow.org/.Software available from tensorflow.org.Search in Google Scholar

[6] M. Al-Rubaie, P. Y. Wu, J. M. Chang, and S. Kung. Privacy-preserving PCA on horizontally-partitioned data. In DSC, pages 280–287, 2017.Search in Google Scholar

[7] M. Bafna and J. Ullman. The price of selection in differential privacy. In COLT, pages 151–168. PMLR, 2017.Search in Google Scholar

[8] K. E. Batcher. Sorting Networks and Their Applications. In AFIPS Spring Joint Computing Conference, volume 32, pages 307–314, 1968.Search in Google Scholar

[9] D. Beaver. Efficient Multiparty Protocols Using Circuit Randomization. In CRYPTO, pages 420–432, 1991.Search in Google Scholar

[10] J. Beel, B. Gipp, S. Langer, and C. Breitinger. Paper recommender systems: a literature survey. International Journal on Digital Libraries, 17(4):305–338, 2016.Search in Google Scholar

[11] A. Beimel, K. Nissim, and E. Omri. Distributed private data analysis: Simultaneously solving how and what. In CRYPTO, pages 451–468. Springer, 2008.Search in Google Scholar

[12] A. Ben-Efraim, Y. Lindell, and E. Omri. Optimizing Semi-Honest Secure Multiparty Computation for the Internet. In ACM Conference on Computer and Communications Security, pages 578–590, 2016.Search in Google Scholar

[13] C. Blundo, E. D. Cristofaro, and P. Gasti. EsPRESSo: Effi-cient privacy-preserving evaluation of sample set similarity. In DPM/SETOP, pages 89–103, 2012.Search in Google Scholar

[14] R. Bost, R. A. Popa, S. Tu, and S. Goldwasser. Machine learning classification over encrypted data. In NDSS, 2015.Search in Google Scholar

[15] S. Buyrukbilen and S. Bakiras. Secure similar document detection with simhash. In Secure Data Management, pages 61–75, 2013.Search in Google Scholar

[16] R. Canetti. Security and composition of multiparty cryptographic protocols. J. Cryptology, 13(1):143–202, 2000.Search in Google Scholar

[17] K. Chaudhuri and S. Dasgupta. Rates of convergence for nearest neighbor classification. In Advances in Neural Information Processing Systems, pages 3437–3445, 2014.Search in Google Scholar

[18] K. Chaudhuri and S. A. Vinterbo. A stability-based validation procedure for differentially private machine learning. In NIPS, pages 2652–2660, 2013.Search in Google Scholar

[19] H. Chen, I. Chillotti, Y. Dong, O. Poburinnaya, I. P. Razenshteyn, and M. S. Riazi. SANNS: scaling up secure approximate k-nearest neighbors search. IACR Cryptology ePrint Archive, 2019:359, 2019.Search in Google Scholar

[20] I. Damgård, V. Pastro, N. P. Smart, and S. Zakarias. Multi-party computation from somewhat homomorphic encryption. In CRYPTO, volume 7417, 2012.Search in Google Scholar

[21] J. Doerner. The Absentminded Crypto Kit. URL https://bitbucket.org/jackdoerner/absentminded-crypto-kit/.Search in Google Scholar

[22] J. Doerner and A. Shelat. Scaling ORAM for Secure Computation. In CCS, pages 523–535, 2017.Search in Google Scholar

[23] C. Dwork and A. Roth. The Algorithmic Foundations of Differential Privacy. Foundations and Trends in Theoretical Computer Science, 9(3–4):211–407, Aug. 2014.Search in Google Scholar

[24] C. Dwork, K. Kenthapadi, F. McSherry, I. Mironov, and M. Naor. Our Data, Ourselves: Privacy Via Distributed Noise Generation. In EUROCRYPT, pages 486–503, 2006.Search in Google Scholar

[25] C. Dwork, F. McSherry, K. Nissim, and A. D. Smith. Calibrating Noise to Sensitivity in Private Data Analysis. In TCC, pages 265–284, 2006.Search in Google Scholar

[26] A. Efros. How to stop worrying and learn to love nearest neighbors. NIPS workshop on Nearest Neighbors for Modern Applications with Massive Data, 2017.Search in Google Scholar

[27] M. Fanaeepour and B. I. P. Rubinstein. Histogramming privately ever after: Differentially-private data-dependent error bound optimisation. In ICDE, pages 1204–1207. IEEE Computer Society, 2018.Search in Google Scholar

[28] A. Gascón, P. Schoppmann, B. Balle, M. Raykova, J. Doerner, S. Zahur, and D. Evans. Privacy-Preserving Distributed Linear Regression on High-Dimensional Data. Proceedings on Privacy Enhancing Technologies, 2017(4):345–364, Oct. 2017.Search in Google Scholar

[29] N. Gilboa. Two party RSA key generation. In CRYPTO, pages 116–129. Springer, 1999.Search in Google Scholar

[30] O. Goldreich. The Foundations of Cryptography – Volume 2, Basic Applications. 2004.Search in Google Scholar

[31] A. Groce, P. Rindal, and M. Rosulek. Cheaper private set intersection via differentially private leakage. PoPETs, 2019 (3):6–25, 2019.Search in Google Scholar

[32] G. Guennebaud, B. Jacob, et al. Eigen v3. http://eigen.tuxfamily.org, 2010.Search in Google Scholar

[33] R. He and J. J. McAuley. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In WWW, pages 507–517. ACM, 2016.Search in Google Scholar

[34] X. He, A. Machanavajjhala, C. J. Flynn, and D. Srivastava. Composing differential privacy and secure computation: A case study on scaling private record linkage. In ACM Conference on Computer and Communications Security, pages 1389–1406, 2017.Search in Google Scholar

[35] Y. Huang, D. Evans, and J. Katz. Private set intersection: Are garbled circuits better than custom protocols? In NDSS, 2012.Search in Google Scholar

[36] W. Jiang, M. Murugesan, C. Clifton, and L. Si. Similar document detection with limited information disclosure. In ICDE, pages 735–743, 2008.Search in Google Scholar

[37] C. Juvekar, V. Vaikuntanathan, and A. Chandrakasan. Gazelle: A Low Latency Framework for Secure Neural Network Inference. IACR Cryptology ePrint Archive, 2018:73, 2018.Search in Google Scholar

[38] D. D. Lewis, Y. Yang, T. G. Rose, and F. Li. Rcv1: A new benchmark collection for text categorization research. Journal of machine learning research, 5(Apr):361–397, 2004.Search in Google Scholar

[39] F. Li, R. Shin, and V. Paxson. Exploring privacy preservation in outsourced k-nearest neighbors with multiple data owners. In CCSW, pages 53–64, 2015.Search in Google Scholar

[40] Y. Lindell. How to simulate it - A tutorial on the simulation proof technique. In Tutorials on the Foundations of Cryptography, pages 277–346. Springer International Publishing, 2017.Search in Google Scholar

[41] Y. Lindell and B. Pinkas. A Proof of Security of Yao’s Protocol for Two-Party Computation. J. Cryptology, 22(2): 161–188, 2009.Search in Google Scholar

[42] J. Liu, M. Juuti, Y. Lu, and N. Asokan. Oblivious neural network predictions via minionn transformations. In CCS, pages 619–631, 2017.Search in Google Scholar

[43] C. Manning, P. Raghavan, and H. Schütze. Scoring, term weighting and the vector space model. In Introduction to information retrieval, pages 100–122. 2008.Search in Google Scholar

[44] S. Mazloom and S. D. Gordon. Secure computation with differentially private access patterns. In ACM Conference on Computer and Communications Security, pages 490–507, 2018.Search in Google Scholar

[45] F. McSherry and K. Talwar. Mechanism design via differential privacy. In FOCS, pages 94–103, 2007.Search in Google Scholar

[46] I. Mironov, O. Pandey, O. Reingold, and S. P. Vadhan. Computational differential privacy. In CRYPTO, pages 126–142, 2009.Search in Google Scholar

[47] P. Mohassel and Y. Zhang. SecureML: A system for scalable privacy-preserving machine learning. In IEEE Symposium on Security and Privacy, pages 19–38, 2017.Search in Google Scholar

[48] M. Murugesan, W. Jiang, C. Clifton, L. Si, and J. Vaidya. Efficient privacy-preserving similar document detection. VLDB J., 19(4):457–475, 2010.Search in Google Scholar

[49] K. Nayak, X. S. Wang, S. Ioannidis, U. Weinsberg, N. Taft, and E. Shi. GraphSC: Parallel secure computation made easy. In IEEE Symposium on Security and Privacy, pages 377–394. IEEE Computer Society, 2015.Search in Google Scholar

[50] V. Nikolaenko, S. Ioannidis, U. Weinsberg, M. Joye, N. Taft, and D. Boneh. Privacy-preserving matrix factorization. In ACM Conference on Computer and Communications Security, pages 801–812. ACM, 2013.Search in Google Scholar

[51] V. Nikolaenko, U. Weinsberg, S. Ioannidis, M. Joye, D. Boneh, and N. Taft. Privacy-preserving ridge regression on hundreds of millions of records. In IEEE Symposium on Security and Privacy, pages 334–348, 2013.Search in Google Scholar

[52] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. VanderPlas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.Search in Google Scholar

[53] D. M. W. Powers. Applications and explanations of zipf’s law. In CoNLL, pages 151–160, 1998.Search in Google Scholar

[54] M. S. Riazi, B. Chen, A. Shrivastava, D. S. Wallach, and F. Koushanfar. Sub-linear privacy-preserving search with untrusted server and semi-honest parties. CoRR, abs/1612.01835, 2016.Search in Google Scholar

[55] H. Rong, H. Wang, J. Liu, and M. Xian. Privacy-preserving k-nearest neighbor computation in multiple cloud environments. IEEE Access, 4:9589–9603, 2016.Search in Google Scholar

[56] A. Waksman. A permutation network. J. ACM, 15(1): 159–163, 1968.Search in Google Scholar

[57] X. Wang, T.-H. H. Chan, and E. Shi. Circuit ORAM: On Tightness of the Goldreich-Ostrovsky Lower Bound. In ACM Conference on Computer and Communications Security, pages 850–861, 2015.Search in Google Scholar

[58] X. Wang, A. J. Malozemoff, and J. Katz. EMP-toolkit: Efficient MultiParty computation toolkit. https://github.com/emp-toolkit, 2016.Search in Google Scholar

[59] A. C.-C. Yao. How to Generate and Exchange Secrets (Extended Abstract). In FOCS, pages 162–167, 1986.Search in Google Scholar

[60] S. Zahur and D. Evans. Obliv-C: A Language for Extensible Data-Oblivious Computation. IACR Cryptology ePrint Archive, 2015:1153, 2015.Search in Google Scholar

[61] S. Zahur, X. S. Wang, M. Raykova, A. Gascón, J. Doerner, D. Evans, and J. Katz. Revisiting Square-Root ORAM: Efficient Random Access in Multi-party Computation. In IEEE Symposium on Security and Privacy, pages 218–234, 2016.Search in Google Scholar

Recommended articles from Trend MD

Plan your remote conference with Sciendo