Volume 2021 (2021): Issue 4 (October 2021)
Journal Details
License
Format
Journal
First Published
16 Apr 2015
Publication timeframe
4 times per year
Languages
English
Access type: Open Access

Unifying Privacy Policy Detection

Published Online: 23 Jul 2021
Page range: 480 - 499
Received: 28 Feb 2021
Accepted: 16 Jun 2021
Abstract

Privacy policies have become a focal point of privacy research. Because they are meant to reflect the privacy practices of a website, service, or app, they are often the starting point for researchers who analyze the accuracy of claimed data practices, user understanding of practices, or control mechanisms for users. Due to vast differences in structure, presentation, and content, it is often challenging to extract privacy policies from online resources like websites for analysis. In the past, researchers have relied on scrapers tailored to the specific analysis or task, which complicates comparing results across different studies.

To unify future research in this field, we developed a toolchain to process website privacy policies and prepare them for research purposes. The core part of this chain is a detector module for English and German, using natural language processing and machine learning to automatically determine whether given texts are privacy or cookie policies. We leverage multiple existing data sets to refine our approach, evaluate it on a recently published longitudinal corpus, and show that this corpus contains a number of misclassified documents. We believe that unifying data preparation for the analysis of privacy policies can help make different studies more comparable and is a step towards more thorough analyses. In addition, we provide insights into common pitfalls that may lead to invalid analyses.
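To make the idea of a policy detector concrete, the following is a minimal sketch of a text classifier that labels a document as a privacy policy or as other text. It is not the toolchain described above: the multinomial naive Bayes model, the toy training sentences, the label names `policy` and `other`, and the helper names `tokenize`, `train`, and `predict` are all assumptions chosen for illustration, and the sketch handles English only.

```python
import math
import re
from collections import Counter

# Toy training data, invented for illustration only (not from any real data set).
TRAIN = [
    ("we collect personal data such as your email address and share it with third parties", "policy"),
    ("this privacy policy explains how we use cookies and process your personal information", "policy"),
    ("you may request deletion of your personal data at any time", "policy"),
    ("our new summer menu features fresh pasta and seasonal salads", "other"),
    ("the team won the match after scoring twice in the second half", "other"),
    ("download the latest release and follow the installation guide", "other"),
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train(examples):
    """Count words per label and documents per label (hypothetical helper)."""
    word_counts = {}       # label -> Counter of token frequencies
    doc_counts = Counter() # label -> number of training documents
    vocab = set()
    for text, label in examples:
        tokens = tokenize(text)
        word_counts.setdefault(label, Counter()).update(tokens)
        doc_counts[label] += 1
        vocab.update(tokens)
    return word_counts, doc_counts, vocab


def predict(model, text):
    """Return the label with the highest naive Bayes log-probability."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in doc_counts:
        total_words = sum(word_counts[label].values())
        # Log prior of the label.
        score = math.log(doc_counts[label] / total_docs)
        for tok in tokenize(text):
            if tok not in vocab:
                continue  # ignore words never seen during training
            # Laplace (add-one) smoothed log likelihood of the token.
            score += math.log((word_counts[label][tok] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


model = train(TRAIN)
print(predict(model, "we process your personal data and use cookies"))
print(predict(model, "fresh salads and pasta on the menu"))
```

A realistic detector would need far more training data, a language identification step before classification, and a held-out evaluation set. The sketch only shows the basic shape of turning text into word features and picking the more likely label.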
