Vol. 19 No. 2 (2020): Revista UIS Ingenierías
Articles

Missing value imputation and outlier detection for functional data: an application for PM10 data

Rafael Alfonso Meléndez
Universidad de La Guajira
Stevenson Bolívar
Pontificia Universidad Javeriana
Roberto Rojano
Universidad de La Guajira

Published 2020-03-05

Keywords

  • functional data,
  • functional principal component analysis,
  • functional outliers,
  • PM10 particulate matter

How to Cite

Meléndez, R. A., Bolívar, S., & Rojano, R. (2020). Missing value imputation and outlier detection for functional data: an application for PM10 data. Revista UIS Ingenierías, 19(2), 1–10. https://doi.org/10.18273/revuin.v19n2-2020001

Abstract

Air pollution monitoring data such as PM10 concentrations are collected at automated stations and generally contain missing values caused by machine failures, routine maintenance, or human error. Incomplete data sets may bias the information drawn from them, so it is important to estimate the missing values well in order to ensure the quality of the analyzed data. In this paper, PM10 records over time are treated as functional objects, using the database of the environmental monitoring network of the Environmental Corporation of La Guajira (Corpoguajira). We implemented the methodology of Jeng-Min Chiou et al. (2014) to impute functional data. Because detecting outliers in pollutant records is also important for air quality monitoring and control, we implemented outlier detection for functional data alongside the imputation. We considered PM10 concentrations recorded during 2012 at the monitoring stations in the area of influence of open-pit mining. Imputation of the missing functional data was based on functional principal component analysis (FPCA), and outlier curves were detected with graphical procedures, namely the functional bagplot and the functional highest density region (HDR) boxplot. The results indicate that the curve for the Barranca station is atypical, and that the imputed intervals capture the dynamics shared with the trajectories of the other stations.
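
As a rough illustration of the idea of imputing gaps from principal components, the following is a minimal sketch, assuming each station's curve is sampled on a common time grid and gaps are marked as NaN. It uses a simple iterative truncated-PCA fill on the discretized curves, which is a simplified stand-in and not the procedure of Chiou et al. (2014) used in the paper. The function name `impute_curves` and its parameters are hypothetical.

```python
import numpy as np


def impute_curves(X, n_components=2, n_iter=200, tol=1e-8):
    """Fill NaN gaps in X (rows = curves, columns = grid points).

    Assumption: this is an iterative low-rank PCA reconstruction on a
    common grid, used only to illustrate the idea; it is not the
    authors' implementation.
    """
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)

    # Initial guess: pointwise mean of the observed curves at each grid point.
    filled = np.where(missing, np.nanmean(X, axis=0), X)

    for _ in range(n_iter):
        # Estimate the mean curve and leading components from the current fill.
        mean = filled.mean(axis=0)
        U, s, Vt = np.linalg.svd(filled - mean, full_matrices=False)
        k = n_components
        recon = mean + (U[:, :k] * s[:k]) @ Vt[:k]

        # Overwrite only the missing entries; observed values stay untouched.
        updated = np.where(missing, recon, X)
        change = np.max(np.abs(updated - filled))
        filled = updated
        if change < tol:
            break

    return filled
```

Calling `impute_curves(obs, n_components=1)` on a stations-by-time array returns an array of the same shape in which only the NaN entries have been replaced. The number of components is a tuning choice that this sketch leaves to the user.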

References

[1] J. Duyzer, D. Van Den Hout, P. Zandved & S. Van Ratingen, “Representativeness of air quality monitoring networks,” Atmospheric Environment, vol. 104, pp. 88-101, 2015. doi: 10.1016/j.atmosenv.2014.12.067

[2] L. Zhao, Y. Xie, J. Wang & J. Xu, “A performance assessment and adjustment program for air quality monitoring networks in Shanghai,” Atmospheric Environment, vol. 122, pp. 382-392, 2015. doi: 10.1016/j.atmosenv.2015.09.069

[3] M. Dostal, A. Pastorkova, S. Rychlik, E. Rychlikova, V. Svecova, E. Schallerova & R. Sram, “Comparison of child morbidity in regions of Ostrava, Czech Republic, with different degrees of pollution: a retrospective cohort study,” Environmental Health, vol. 12, pp. 1-11, 2013. doi: 10.1186/1476-069X-12-74

[4] G. Muránszky, M. Ovari, I. Virag, P. Csiba, R. Dobai & G. Zaray, “Chemical characterization of PM10 fractions of urban aerosol,” Microchemical Journal, vol. 98, pp. 1-10, 2011. doi: 10.1016/j.microc.2010.10.002

[5] F. Lu, D. Xu & Y. Cheng, “Systematic review and meta-analysis of the adverse health effects of ambient PM2.5 and PM10 pollution in the Chinese population,” Environmental Research, vol. 136, pp. 196–204, 2015. doi: 10.1016/j.envres.2014.06.029

[6] Plan Nacional para el desarrollo minero, Visión año 2019. Ministerio de Minas y Energías, Bogotá, 2012.

[7] N. Noor & M. Zainudin, “A review: Missing values in environmental data sets,” in Proceedings of the International Conference on Environment, 2008 [Online]. Available: https://scholar.google.com.pk.

[8] N. Noor, M. Abdullah, A. Yahaya, & N. Ramli, “Comparison of Linear Interpolation Method and Mean Method to Replace the Missing Values in Environmental Data Set,” Materials Science Forum, vol. 803, pp. 278-281, 2015. doi: 10.4028/www.scientific.net/MSF.803.278

[9] H. Lee & K. Kang, “Interpolation of Missing Precipitation Data Using Kernel Estimations for Hydrologic Modeling,” Advances in Meteorology, 2015.

[10] M. Fitri, N. Ramli, A. Yahaya, N. Sansuddin, N. Ghazali & W. Al Madhoun, “Monsoonal differences and probability distribution of PM10 concentration,” Environmental Monitoring and Assessment, vol. 163, pp. 655-667, 2010. doi: 10.1007/s10661-009-0866-0

[11] N. Noor, A. Yahaya, N. Ramli, & M. Abdullah, “The replacement of missing values of continuous air pollution monitoring data using mean top bottom imputation technique,” Journal of Engineering Research & Education, vol. 3, pp. 96-105, 2006.

[12] N. Shaadan, S. Deni & A. Jemain, “Assessing and comparing PM10 pollutant behaviour using functional data approach,” Sains Malaysiana, vol. 41, no. 11, pp. 1335-1344, 2012.

[13] H. Ahn, “Outlier detection in total phosphorus concentration data from South Florida rainfall,” J. Am. Water Resour. Assoc., vol. 35, no. 2, pp. 301–310, 1999. doi: 10.1111/j.1752-1688.1999.tb03591.x

[14] R. Gilbert, Statistical methods for environmental pollution monitoring. New York: Van Nostrand Reinhold, 1987.

[15] K. Reckhow & S. Chapra, Engineering approaches for lake management. Volume 1: Data analysis and empirical modeling. Boston: Butterworth, 1983.

[16] C. C. Aggarwal, Outlier Analysis. New York, NY, USA: Springer, 2013.

[17] V. Chandola, A. Banerjee, & V. Kumar, “Anomaly detection: A survey,” ACM Comput. Surv., vol. 41, no. 3, pp. 15:1–15:58, 2009.

[18] V. J. Hodge and J. Austin, “A survey of outlier detection methodologies,” Artif. Intell. Rev., vol. 22, no. 2, pp. 85–126, 2004. doi: 10.1007/s10462-004-4304-y

[19] C. C. Aggarwal & P. S. Yu, “Outlier detection for high dimensional data,” SIGMOD Rec., vol. 30, pp. 37–46, May 2001.

[20] J. P. Burman & M. C. Otto, “Census bureau research project: Outliers in time series,” Bureau of the Census, SRD Res. Rep., CENSUS/SRD/RR-88114, May 1988.

[21] A. J. Fox, “Outliers in time series,” J. Roy. Statist. Soc. B Methodol., vol. 34, no. 3, pp. 350–363, 1972.

[22] H. Cho, Y.-J. Kim, H. J. Jung, S. W. Lee & J. W. Lee, “OutlierD: An R package for outlier detection using quantile regression on mass spectrometry data,” Bioinformatics, vol. 24, no. 6, pp. 882–884, 2008.

[23] C. D. Muniz, P. G. Nieto, J. A. Fernandez, J. M. Torres & L. Taboada, “Detection of outliers in water quality monitoring samples using functional data analysis in San Esteban estuary (Northern Spain),” Science of the Total Environment, 2012. doi: 10.1016/j.scitotenv.2012.08.083

[24] J. Martinez, A. Saavedra, P. J. Garcia-Nieto, J. L. Piñeiro, C. Iglesias, J. Taboada, J. Sancho & J. Pastor, “Air quality parameters outlier’s detection using functional data analysis in the Langreo urban area (Northern Spain),” Applied Mathematics and Computation, 2014. doi: 10.1016/j.amc.2014.05.004

[25] R. Hyndman & H. Shang, “Rainbow Plots, Bagplots, and Boxplots for Functional Data,” Journal of Computational and Graphical Statistics, vol. 19, no. 1, pp. 29–45, 2010.

[26] P. Allison, Missing Data. Thousand Oaks, CA: Sage, 2001.

[27] J. Schafer & J. Graham, “Missing Data: Our View of the State of the Art,” Psychological Methods, vol. 7, no. 2, pp. 147–177, 2002.

[28] J. Schafer, Analysis of Incomplete Multivariate Data. London: Chapman and Hall, 1997.

[29] E. Beale & R. Little, “Missing Values in Multivariate Analysis,” Journal of the Royal Statistical Society, Series B, vol. 37, no. 1, pp. 129–145, 1975.

[30] L. Qu, L. Li, Y. Zhang & J. Hu, “PPCA-Based Missing Data Imputation for Traffic Flow Volume: A Systematical Approach,” IEEE Transactions on Intelligent Transportation Systems, vol. 10, no. 3, pp. 512–522, 2009. doi: 10.1109/TITS.2009.2026312

[31] C. Chen, J. Kwon, J. Rice, A. Skabardonis & P. Varaiya, “Detecting Errors and Imputing Missing Data for Single-Loop Surveillance System,” Transportation Research Record: Journal of the Transportation Research Board, vol. 1855, pp. 160–167, 2003. doi: 10.3141/1855-20

[32] R. Hyndman & M. Ullah, “Robust Forecasting of Mortality and Fertility Rates: A Functional Data Approach,” Computational Statistics and Data Analysis, vol. 51, no. 10, pp. 4942–4956, 2007. doi: 10.1016/j.csda.2006.07.028

[33] R. Hyndman & H. Shang, “Rainbow Plots, Bagplots, and Boxplots for Functional Data,” Journal of Computational and Graphical Statistics, vol. 19, no. 1, pp. 29–45, 2010. doi: 10.1198/jcgs.2009.08158

[34] J. Chiou, Y. Zhang, W. Chen & Ch. Chang, “A functional data approach to missing value imputation and outlier detection for traffic flow data,” Transportmetrica B: Transport Dynamics, vol. 2, no. 2, pp. 106-129, 2014. doi: 10.1080/21680566.2014.892847

[35] D. Rubin, “Inference and Missing Data,” Biometrika, vol. 63, no. 3, pp. 581–592, 1976. doi: 10.2307/2335739

[36] R. Little & D. Rubin, Statistical Analysis with Missing Data, 2nd ed. New York: Wiley, 2002, pp. 255-260.

[37] P. Hall & M. Hosseini-Nasab, “On properties of functional principal components analysis,” J. R. Statist. Soc. B, vol. 68, part 1, pp. 109-126, 2006.

[38] F. Yao, H. G. Müller & J. L. Wang, “Functional Data Analysis for Sparse Longitudinal Data,” Journal of the American Statistical Association, vol. 100, no. 470, pp. 577–590, 2005. doi: 10.1198/016214504000001745

[39] K. Deregowski & M. Krzysko, “Principal components analysis for functional data,” Colloquium Biometricum, vol. 41, pp. 5-7, 2011.