Volume 37 (2021): Issue 2 (June 2021)
    Special Issue on New Techniques and Technologies for Statistics
Journal Details
Format: Journal
First Published: 01 Oct 2013
Publication timeframe: 4 times per year
Languages: English
Access type: Open Access

A Hybrid Technique for the Multiple Imputation of Survey Data

Published Online: 22 Jun 2021
Page range: 505 - 531
Received: 01 Mar 2019
Accepted: 01 Dec 2020
Abstract

Most of the background variables in MICS (Multiple Indicator Cluster Surveys) are categorical with many categories. Like many other survey data sets, the MICS 2014 women’s data suffer from a large number of missing values. Additionally, complex dependencies may exist among the large number of categorical variables in such surveys. The most commonly used parametric multiple imputation (MI) approaches, based on log-linear models or chained equations (MICE), become problematic in these situations, and the implemented algorithms often fail. On the other hand, nonparametric MI techniques based on Bayesian latent class models work very well when only categorical variables are considered. This article describes how chained equations MI for continuous variables can be made dependent on categorical variables that have been imputed beforehand using latent class models. Root mean square errors (RMSEs) and coverage rates of 95% confidence intervals (CIs) for generalized linear models (GLMs) with binary response are estimated in a simulation study, and the proposed method is compared with various existing MI methods. The proposed method outperforms the MICE algorithms in most cases and requires less computational time. The results of the simulation study are supported by a real data example.
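
The two evaluation criteria named in the abstract can be illustrated with a minimal sketch. It assumes each simulation replicate yields one point estimate and one standard error for a single GLM coefficient, and it uses a normal-approximation interval (z = 1.96). The function names `rmse` and `coverage_rate`, the true value 0.5, and the simulated inputs are hypothetical and are not taken from the article, whose own intervals may be built differently (for example from pooled MI variances).

```python
import math
import random


def rmse(estimates, true_value):
    """Root mean square error of replicate estimates around the true value."""
    squared_errors = [(e - true_value) ** 2 for e in estimates]
    return math.sqrt(sum(squared_errors) / len(estimates))


def coverage_rate(estimates, std_errors, true_value, z=1.96):
    """Share of replicates whose interval estimate +/- z*SE contains the true value."""
    hits = sum(
        1
        for e, se in zip(estimates, std_errors)
        if e - z * se <= true_value <= e + z * se
    )
    return hits / len(estimates)


# Hypothetical simulation output: 1000 replicate estimates of one coefficient.
random.seed(0)
true_beta = 0.5   # assumed true coefficient (illustrative only)
se = 0.1          # assumed standard error per replicate (illustrative only)
estimates = [random.gauss(true_beta, se) for _ in range(1000)]
std_errors = [se] * len(estimates)

print("RMSE:", rmse(estimates, true_beta))
print("Coverage:", coverage_rate(estimates, std_errors, true_beta))
```

With a well-calibrated imputation method, the coverage rate should lie close to the nominal 0.95, while a lower RMSE indicates estimates that sit closer to the true coefficient.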
