Volume 33 (2017): Issue 1 (March 2017)
Journal Details
Format: Journal
First Published: 01 Oct 2013
Publication timeframe: 4 times per year
Languages: English
Access type: Open Access

Three Methods for Occupation Coding Based on Statistical Learning

Published Online: 21 Feb 2017
Page range: 101 - 122
Received: 01 Mar 2016
Accepted: 01 Oct 2016

Occupation coding, an important task in official statistics, refers to coding a respondent’s text answer into one of many hundreds of occupation codes. To date, occupation coding is still at least partially conducted manually, at great expense. We propose three methods for automatic coding: a method that combines separate models for the detailed occupation codes and for aggregate occupation codes, a hybrid method that combines a duplicate-based approach with a statistical learning algorithm, and a modified nearest neighbor approach. Using data from the German General Social Survey (ALLBUS), we show that the proposed methods improve on both the coding accuracy of the underlying statistical learning algorithm and the coding accuracy of duplicates where duplicates exist. Further, we find that defining duplicates based on n-gram variables (a concept from text mining) is preferable to defining them based on exact string matches.
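The difference between the two duplicate definitions, and the general shape of a hybrid coder, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the use of word unigrams as the n-gram variables, the majority-vote rule among duplicates, the placeholder codes, and the stand-in fallback model are all assumptions made for illustration only.

```python
import re
from collections import Counter, defaultdict


def exact_key(answer):
    """Duplicate definition 1: the raw answer string itself."""
    return answer


def ngram_key(answer, n=1):
    """Duplicate definition 2 (assumed form): the set of word n-grams,
    ignoring case, punctuation, and (for n=1) word order."""
    words = re.findall(r"\w+", answer.lower())
    return frozenset(" ".join(words[i:i + n]) for i in range(len(words) - n + 1))


def build_duplicate_index(training_pairs, key_fn):
    """Map each key to a Counter of the codes observed for it in training data."""
    index = defaultdict(Counter)
    for answer, code in training_pairs:
        index[key_fn(answer)][code] += 1
    return index


def hybrid_code(answer, index, key_fn, fallback):
    """If the answer has duplicates in the training data, return their most
    frequent code; otherwise defer to the fallback learner (hypothetical)."""
    codes = index.get(key_fn(answer))
    if codes:
        return codes.most_common(1)[0][0], "duplicate"
    return fallback(answer), "model"


# Made-up training answers with placeholder codes (not real occupation codes).
training = [("Gymnasium Lehrer", "code_A"), ("Bäcker", "code_B")]

# Stand-in for a trained statistical learning model.
fallback = lambda answer: "code_from_model"

exact_index = build_duplicate_index(training, exact_key)
ngram_index = build_duplicate_index(training, ngram_key)

new_answer = "Lehrer, Gymnasium"
result_exact = hybrid_code(new_answer, exact_index, exact_key, fallback)
result_ngram = hybrid_code(new_answer, ngram_index, ngram_key, fallback)
print(result_exact)  # ('code_from_model', 'model')
print(result_ngram)  # ('code_A', 'duplicate')
```

In this toy example, the new answer differs from the training answer only in word order and punctuation. It therefore counts as a duplicate under the n-gram key but not under exact string matching, so the n-gram definition finds a duplicate where the exact-match definition has to fall back to the model.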
