CLASSIFICATION OF HIGH AND LOW LEVEL OF PM10 CONCENTRATIONS IN KLANG AND SHAH ALAM, MALAYSIA

Particulate matter (PM) comprises a complex mixture of small solid or liquid particles of organic and inorganic elements that float freely in the air. PM10 is defined as particulate matter with an aerodynamic diameter of 10 μm or less. The main objective of this paper is to classify the level of PM10 in selected locations in Peninsular Malaysia using discriminant analysis. Two important components were considered in this study, namely the meteorological factors and the pollutant factors. The meteorological factors comprise wind speed, wind direction, humidity and temperature, while the pollutant factors consist of Carbon Monoxide (CO), Sulphur Dioxide (SO2), Nitrogen Dioxide (NO2) and Ozone (O3). The classification of high or low level of PM10 concentrations was based on the Malaysia Ambient Air Quality Guideline (MAAQG). The findings indicated that the classification equation differs from location to location due to different levels of PM10 concentrations, the location of monitoring stations and the factors affecting air pollution in that location. The simulation data also verified that the classification of PM10 concentrations closely resembled the real conditions that occurred in Klang in October 2015.


Introduction
Air pollution can be defined as the presence of unwanted chemicals or other elements in the air that affect air quality and human health (World Health Organization, 2018). In 2015, over 90% of the world's population lived in air-polluted areas (HEI International Scientific Oversight Committee, 2017). One of the most important causes of the deterioration in air quality is particulate matter (PM), which has adverse health effects (Capasso et al., 2015). Both short-term and long-term exposure to air pollutants has been associated with health effects (World Health Organization, 2018).
Five major risk factors for total deaths in the world are high blood pressure, smoking, high fasting plasma glucose, high total cholesterol and ambient particulate matter (HEI International Scientific Oversight Committee, 2017). Particulate matter consists of tiny solid or liquid particles that float freely in the air. PM10 refers to particles with a size of up to 10 microns (μm). The smaller the particle size, such as PM1, the more severely it affects human health if the particles are inhaled excessively into the lungs (Beh et al., 2013). The dominant pollutant in Malaysia is PM10 (Department of Environment Malaysia, 2018). A study by Elhadi et al. (2018) stated that vehicle exhaust and non-exhaust emissions, industrial emissions, resuspended dust and oil combustion were the most dominant sources of PM10. PM10 may cause adverse effects on the environment, reduce visibility and increase the risk of health problems for individuals with asthma or cardiopulmonary diseases, the elderly and children (Abd Rahman, 2013; Weinmayr et al., 2010).
A number of statistical analyses involving PM10 have been carried out in Malaysia. The statistical methods that have interested past researchers include regression, used in the studies by Abdullah et al. (2017), Juneng et al. (2011), Mert Cubukcu & Sinem Ozcan (2015) and Ul-saufie et al. (2012); correlation analysis in Biancofiore et al. (2017), How & Ling (2016) and Wie & Moon (2017); path analysis in Sahanavin et al. (2018); and the Markov chain model in a study by Mohamad et al. (2017).
Other studies that applied multivariate analysis were Hama et al. (2018) and Dominick et al. (2012), which utilized principal component analysis (PCA), and Isiyaka & Azid (2015) and Shah Ismail et al. (2017), which used discriminant analysis but focused only on meteorological factors. Meanwhile, some researchers applied time series analysis, as in Latif et al. (2014), Wan Mahiyuddin et al. (2013), Sharma et al. (2018) and Gupta et al. (2018). Other researchers used classical probability distributions (Md Yusof, 2009; Md Yusof et al., 2011; Mohamed Noor et al., 2011) and extreme value distributions (Ahmat et al., 2014; Ahmat et al., 2015, 2016). Though discriminant analysis has been applied in some air pollution studies, the focus of the studies was only on gaseous pollutants (Isiyaka & Azid, 2015; Shah Ismail et al., 2017). Studies of PM10 that incorporate both gaseous pollutants and meteorological factors, however, are still lacking. In addition, none of the studies classifies PM10 concentrations into high and low categories based on the national guideline. The majority of the studies focused only on the prediction or forecasting of PM10 concentrations and not on their classification. In view of this situation, this research was carried out to classify PM10 concentrations as low or high, based on an interim guideline by the Department of Environment, Malaysia (DOE), using both gaseous pollutants and meteorological factors.

Scope of study
This study utilized hourly data on meteorological parameters and pollutants in two urban areas (Klang and Shah Alam) for a period of 17 years, from 2000 to 2016. The data were furnished by the Department of Environment (DOE), Malaysia. These two locations were selected because they constantly experienced high levels of PM10 concentrations. This research examined the effects of meteorological parameters (temperature, humidity, wind speed and wind direction) and gaseous pollutants (SO2, NO2, O3 and CO) on PM10 concentrations. The level of PM10 was classified as high or low based on the new Malaysian interim guideline value of 150 μg/m³ set by the Department of Environment, Malaysia (DOE). For the purpose of discriminant analysis, the data were divided into two parts: 80% were used for training (to find the discriminant functions) and the remaining 20% were used for validation.
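The labelling and the 80/20 split described above can be sketched as follows. This is a minimal illustration rather than the procedure actually used: the readings are made up, the random shuffle is an assumption (the text does not say whether the split was random or chronological), and a value of exactly 150 μg/m³ is assigned to the low group by assumption. The names `label_pm10` and `split_train_validation` are hypothetical.

```python
import random

THRESHOLD_UG_M3 = 150.0  # guideline value quoted in the text


def label_pm10(value_ug_m3, threshold=THRESHOLD_UG_M3):
    """Return 'high' when the PM10 value exceeds the threshold, else 'low'.

    Assumption: a value exactly equal to the threshold is labelled 'low'.
    """
    return "high" if value_ug_m3 > threshold else "low"


def split_train_validation(records, train_fraction=0.8, seed=0):
    """Shuffle a copy of the records and split it into training and validation parts.

    Assumption: a seeded random split; the source does not specify the method.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


# Made-up PM10 readings in ug/m3, for illustration only.
pm10_values = [42.0, 163.5, 88.1, 150.0, 201.3, 97.4, 55.0, 176.9, 120.2, 64.8]
records = [(value, label_pm10(value)) for value in pm10_values]
train, validation = split_train_validation(records)
print(len(train), len(validation))
```

With ten records the split yields eight training and two validation records; fixing the seed keeps the split reproducible.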

Missing Value Treatment
Missing data are common in air quality datasets, usually because of unavoidable problems such as machine failures, changes in the settings of air monitoring stations or human error in handling the datasets. There are three types of missing data: missing completely at random (MCAR), missing at random (MAR) and missing not at random (MNAR) (Gelman & Hill, 2006). The multiple imputation technique was used in this study to overcome the problem of missing data. Multiple imputation can lead to consistent, efficient and normal estimates when the data are MAR (Soley-Bori et al., 2013).
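Multiple imputation produces several completed datasets, and the estimates obtained from them are usually combined with Rubin's rules. The sketch below shows only that pooling step. The imputation model and the number of imputations used in this study are not stated, so the five estimates and variances are hypothetical, as is the function name `pool_rubin`.

```python
from statistics import mean, variance


def pool_rubin(estimates, within_variances):
    """Pool m per-imputation estimates and their variances using Rubin's rules."""
    m = len(estimates)
    if m < 2 or m != len(within_variances):
        raise ValueError("need at least two imputations with matching variances")
    pooled_estimate = mean(estimates)
    within = mean(within_variances)   # average within-imputation variance
    between = variance(estimates)     # variance across imputations (m - 1 divisor)
    total = within + (1 + 1 / m) * between
    return pooled_estimate, total


# Hypothetical mean PM10 estimates (ug/m3) from five imputed datasets.
estimates = [61.2, 60.8, 61.9, 60.5, 61.6]
within_variances = [0.40, 0.42, 0.39, 0.41, 0.38]
pooled, total_var = pool_rubin(estimates, within_variances)
print(f"pooled estimate = {pooled:.2f}, total variance = {total_var:.2f}")
```

The total variance adds the between-imputation spread to the average within-imputation variance, so the uncertainty caused by the missing values is carried into the final estimate.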

Discriminant Analysis
Discriminant analysis is a statistical technique that can be used to classify or separate individuals into different groups (the dependent variable) based on a set of quantitative independent random variables. The main objective of discriminant analysis is to predict group membership based on a set of quantitative variables. Discriminant function analysis is used to determine which continuous variables discriminate between two or more naturally occurring groups and can be used to determine which variables are the best predictors. The process involved two steps (Poulsen & French, 2008): (i) testing the significance of a set of discriminant functions, and (ii) classification.
In this study, the data were carefully checked and cleaned so that they did not violate any of the assumptions needed for the discriminant analysis to be carried out. The statistical method used for the selection of the significant factors to be included in the discriminant equation was the stepwise model. For cases with an equal sample size in each group, the classification function ($D_j$) is equal to the sum shown in Eq. (1): ...
For the $j$th group ($j = 1, \ldots, k$), $x$ is the raw score of each predictor and $c_{j0}$ is a constant. If M is a column matrix of means for group j, then the constant ($c_{j0}$)
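Eq. (1) and the expression for the constant are not reproduced in this text. As a point of reference only, linear classification functions with equal group sizes are commonly written in the textbook form below. It is assumed, not confirmed, that this matches the notation above. The symbols $p$ (number of predictors), $\mathbf{C}_j$ (vector of coefficients for group $j$), $\mathbf{M}_j$ (column matrix of predictor means for group $j$) and $\mathbf{W}$ (pooled within-group variance-covariance matrix) are introduced here for illustration.

```latex
\begin{align*}
D_j &= c_{j0} + c_{j1} x_1 + c_{j2} x_2 + \cdots + c_{jp} x_p, \qquad j = 1, \ldots, k \\
\mathbf{C}_j &= \mathbf{W}^{-1} \mathbf{M}_j \\
c_{j0} &= -\tfrac{1}{2}\, \mathbf{C}_j^{\top} \mathbf{M}_j
\end{align*}
```

Under this form, an observation is assigned to the group with the largest $D_j$.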

Performance Indicators
The performance of the classification function is assessed via its error rates (the probabilities of misclassification). The error rate and the percentage of observations misclassified by the discriminant functions are used to measure the performance of any discriminant function (Helwig, 2017). The Apparent Error Rate (APER) was used to assess the goodness of fit of the function in this study. It is calculated as the fraction of observations in the training sample that are misclassified by the sample classification functions, as shown in Table 1 and Eq. (2), and is recognized as the proportion of items in the training set that are misclassified (Johnson & Wichern, 2014). Table 2 provides a sample calculation of the APER. The cross validation procedure was as follows:
i. The sample is split into training and validation parts.
ii. The training sample is used to build the discriminant function.
iii. The validation sample is used to evaluate the performance of the discriminant functions.
iv. The cross validation error rate is the percentage of observations in the validation data that are misclassified by the classification functions.
v. The cross validation rate can overcome the bias problem, but it requires a large sample.
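The APER described above is, in the notation of Johnson & Wichern, (n1M + n2M) / (n1 + n2), where n1M and n2M are the numbers of misclassified items in groups 1 and 2, and n1 and n2 are the group sizes. The sketch below computes this fraction from a two-group confusion table. The counts are hypothetical and are not taken from Table 2 or Table 7, and `error_rate` is an illustrative name.

```python
def error_rate(confusion):
    """Fraction of misclassified observations in a confusion table.

    confusion[actual][predicted] holds the count of observations whose true
    group is `actual` and whose assigned group is `predicted`.
    """
    total = sum(sum(row.values()) for row in confusion.values())
    wrong = sum(count
                for actual, row in confusion.items()
                for predicted, count in row.items()
                if predicted != actual)
    return wrong / total


# Hypothetical training-sample counts, for illustration only.
training_confusion = {
    "low": {"low": 760, "high": 20},
    "high": {"low": 12, "high": 8},
}
aper = error_rate(training_confusion)
print(f"APER = {aper:.3f}")
```

Applying the same function to the counts from the validation sample gives the cross validation error rate of step iv.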

Software
IBM SPSS Statistics version 25.0 was used in this research to carry out the discriminant analysis and to interpret the results.
Table 3 and Table 4 provide the significance test results for the pollutant and meteorological parameters in Klang and Shah Alam respectively. The null hypothesis is that the parameters are not significant, against the alternative hypothesis that the parameters are significant. The p-values (reported as 0.000) are less than 0.05; hence, the parameters are deemed significant. As summarized in Table 5, the same two pollutant factors (CO and SO2) affected PM10 concentrations at both locations, since the two locations are close to each other and are affected by similar pollutants. However, different meteorological factors affected PM10 concentrations in these two locations. It was found that only humidity affected PM10 concentrations in Shah Alam, compared to three significant meteorological factors in Klang (wind speed, humidity and temperature).

Classification of High and Low concentrations of PM10
The concentrations of PM10 were classified as high or low using discriminant analysis based on the Malaysia Ambient Air Quality Guideline (MAAQG). A daily maximum PM10 concentration of more than 150 μg/m³ is classified as high, while a daily maximum PM10 concentration of less than 150 μg/m³ is initially classified as low. Table 6 tabulates the discriminant equations for Klang and Shah Alam. SO2 was identified as the most significant factor affecting the level of PM10 concentrations in Klang and Shah Alam. Once the discriminant equations have been identified, each PM10 concentration can be classified as either high or low via classification scores: an observation is assigned to the group for which it has the highest classification score.
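The rule of assigning an observation to the group with the highest classification score can be sketched as follows. The coefficients, constants and predictor values are hypothetical and chosen only to make the arithmetic easy to follow; the fitted equations are those in Table 6. The function names are illustrative.

```python
def classification_score(coefficients, constant, x):
    """Linear classification score: constant plus the weighted sum of predictors."""
    return constant + sum(coefficients[name] * x[name] for name in coefficients)


def classify(functions, x):
    """Return the group with the highest classification score, plus all scores."""
    scores = {group: classification_score(f["coef"], f["const"], x)
              for group, f in functions.items()}
    return max(scores, key=scores.get), scores


# Hypothetical classification functions, NOT the fitted values from Table 6.
functions = {
    "low": {"const": -5.0, "coef": {"CO": 1.0, "SO2": 100.0}},
    "high": {"const": -12.0, "coef": {"CO": 3.0, "SO2": 200.0}},
}
observation = {"CO": 3.0, "SO2": 0.03}  # made-up predictor values
group, scores = classify(functions, observation)
print(group)
```

For this made-up observation the score for the high group is larger than the score for the low group, so the observation is classified as high.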

Performance Indicator
As shown in Table 7, the discriminant functions were considered good since all the misclassification rates were less than 5%. In general, a misclassification rate of up to about 30% is considered acceptable.

Simulation
For illustration, Table 8 shows the calculation of the discriminant scores using the discriminant equations obtained in Section 3.2 and the resulting classification into the discriminant categories. The illustration data were from 1 October to 19 October 2015 for Klang. Several incidences of high PM10 concentrations were recorded during this period. These were due to four tropical cyclones, namely "Dujuan", "Mujigae", "Koppu" and "Champi", that caused southwesterly winds and brought substantial smoke from burning areas in Sumatra and Kalimantan, resulting in a prolonged haze in September and October 2015 (Department of Environment Malaysia, 2016). The results in Table 8 show an excellent agreement with the findings in the Malaysia Environmental Quality Report (MEQR) 2015.

Conclusion
The research identified that the main pollutants affecting the level of PM10 concentrations in Klang and Shah Alam were Carbon Monoxide (CO) and Sulphur Dioxide (SO2). Klang and Shah Alam are located near main roads and industrial and residential areas, and thus experience a high density of vehicles, which contributes to high concentrations of these two pollutants (Azid et al., 2015). The misclassification rates show that the discriminant functions obtained were good, since both misclassification rates were less than 5%. The simulation results show an excellent agreement with the real conditions that occurred in Klang in October 2015, as reported in the Malaysia Environmental Quality Report (MEQR) 2015. Therefore, the discriminant functions can be used to classify high and low levels of PM10 concentrations in Klang and Shah Alam.