TIME SERIES MODEL FOR CARBON MONOXIDE (CO) AT SEVERAL INDUSTRIAL SITES IN PENINSULAR MALAYSIA

Malaysia is reported to experience explosive rise in the demand of transport vehicles in recent years due to rapid economic development and population growth. As a result, air pollution is expected to increase in conjunction with the increase in the number of the vehicles. In particular, Carbon Monoxide (CO) has been identified as the main component of the emission sources from vehicles other than Nitrogen Oxide (NOx), hydrocarbon lead and particulate matter of size less than 10 micron (PM 10 ) . This provides the reason why CO concentration is often used to reflect traffic density in an area. CO has both short-term and long-term effect on human’s health. Thus, knowledge on CO behaviour and the future levels at an area is important to help decision makers in managing air pollution due to vehicles emission in the country. This study was conducted to describe CO data and to determine a suitable time series model to enable the prediction of CO levels at two industrial sites; Perai and Pasir Gudang, Malaysia. The model obtained could help management to mitigate CO pollution at the sites. The analysis was conducted using daily maximum data which was obtained from the Department of Environment Malaysia from 2010 to 2014. The performance of the best model was determined using several performance measures such as MAE, RMSE and MAPE. The study has found that the most appropriate time series model for Perai is


Introduction
In Malaysia, power stations, motor vehicles and open burning has been identified as the main contributors to the country's air pollution emissions (Afroz et al., 2003). In relation to this phenomenon, it is expected that there will be an increase in the amount of primary pollutant emitted by vehicles and industries such as carbon monoxide in the future years. In the context of air pollution from vehicles emission, Carbon Monoxide (CO) has been identified as the largest contributor followed by NO2, SO2 and PM10 (Azedah et al. 2015). Other than the mentioned pollutant, non-methane volatile organic compounds and ammonia are also omitted by industries (Fore and Mbohwa, 2010). Thus, this problem should have become a driven and motivating factor for the responsible body and the decision makers for the planning and designing of policy to mitigate and sustain the quality of atmospheric environment.
It has been reported that the number of transport vehicles registered in all types of categories has increased year by year in Malaysia (Noresah et al., 2012). This increment is in conjunction with the rapid economy growth, increase in the population and urban development. The private own vehicles registered in Malaysia had also show an increasing trend. These increasing trends can be traced back. In 1980 only 80,400 units of new vehicles registered. It had increased to 106,400 units (1990), 224,900 units (1995), 282,103 units (2000) and 416,600 units (2005) respectively (MAA, 2016). Malaysia Automotive Association (MAA), had also forecasted a further positive growth in number of new vehicle registration from 2017 to 2020 as shown in Table 1. Exposure to air pollution is the main environmental threat to human's health in many busy and high population areas including industrial areas, cities and urban towns (Pasca et al., 2013). Long time exposure to high pollutant level in the air also contributes to wide range of chronic respiratory diseases, aggravates heart diseases and other types of particulates pollution (Miller et al., 2016). In particular, due to increase of human activities, emissions of particulate matters and gaseous matters including CO have been rising. Expansion of industries and transport systems has made this situation more critical. Other than human's health, such industrial activities also have great impact on the ecology and agriculture as well as safety effects (Xu et al., 2014).
CO is an odorless gas. It is a product of incomplete combustion and occurs when carbon in the fuel is partially oxidized rather than fully oxidized to Carbon Dioxide (CO2) (Shuhaili et al., 2016). CO also being released when engines burn fossil fuel which is a gas that comes out from the burning fossil fuels, mostly in motor vehicles. Emissions of CO are higher when engines are not tuned properly, and when fuel is not completely burned. Cars emitted lot of the CO found outdoors. CO has been identified as the main components of the emission sources from vehicles other than NOx, hydrocarbon lead and PM10. Vehicles sources have contributed to at least 70 % to 75% of the total air pollution in the Malaysia environment (Afroz et al., 2003;Yahya et al., 2006). CO is a significantly toxic gas that can lead to significant toxicity of the central nervous system and heart, thus affect human's health, in fact can cause fatality (Chen et al., 2011).
Having knowing the health effect of CO emitted by vehicles to human and the growing trends of vehicles in Malaysia, there is a need to investigate the temporal pattern of the pollutant as well as to predict its future concentration levels. So that the study results will be able to assist government agencies and authorities to provide inputs in solving air pollution problems and issues due to vehicles usage. This requires an appropriate time series prediction model in providing the input and the insights. Even though, for Malaysia data set, several modelling techniques have been used to predict CO including the application of Artificial Neural Network (ANN) and Imperial Competitive Algorithm (Mahmoudzadeh et al., 2013;Rathi et. al. 2018), a hybrid approach of Radial Basis Function network (Mamat et al., 2003), Multi-Layer Perceptron Techniques (MLP) and the hybrid (Raji et al., 2005), multiple regression (Afsaneh et al., 2014;Sembiring et. al., 2010), only a few had applied ARIMA time series modelling. The studies focused on CO level at the chosen locations such as east coast region (Kelantan, Terengganu and Pahang) and Bachang, Melaka (Ibrahim et al., 2009;Hamid et al., 2017).
Unlike the previous studies conducted for Malaysia data, this research is conducted with the purpose to investigate and predict the reading of daily maximum CO focuses at industrial sites; Perai in Penang and Pasir Gudang in Johor states of Malaysia. The SARIMA model application on CO data which was never being applied for these two popular industrial sites data set in Malaysia, can be considered as the novelty of this paper. The methodology is further discussed in the following sections.

Data and the Study Locations
Secondary data of daily maximum CO levels in (ppm) within the period from year 2010 until 2014 were obtained from Air Quality Division of Department of Environment, Malaysia. The analysis on CO data was conducted to investigate CO variation and to obtain the appropriate time series model to predict CO level at two selected industrial sites, Pasir Gudang in Johor and Perai in Penang, Malaysia. The reasons for choosing these sites is because the sites are among the busiest industrial location in Peninsular Malaysia, equipped with air quality monitoring stations and have high number of registered vehicles recorded. The instrument that used to collect CO is called Gas Filter Correlation CO Analyzer (Model T300). The software used in the analysis are R and MINITAB.

Research Framework
Figure 1 describes the process flow on how this study is conducted to achieve the objectives. First and foremost, after the data have been collected, a quality checking process was conducted. Missing values were replaced using median imputation method. Exploratory analysis was then conducted to describe the variation and pattern of CO as well as to detect data stationary and seasonal influence. Based on the exploratory analysis the process of building the time series model was conducted followed by performance evaluation and model validation.

Time Series Models
Within the framework of the Box-Jenkins methodology, it consists of four basic models including the Autoregressive (AR), Moving Average (MA), Mixed Autoregressive and Moving Average (ARMA) and Mixed Autoregressive Integrated Moving Average (ARIMA) (Lazim, 2007;Mah et. al. 2018).
In the AR model, the current value of the variable is defined as a function of its previous values plus an error term. In other words, the dependent variable, yt is taken as the function of the time-lagged values of itself. While in the MA model, the moving average is the average of the actual observation, yt whereas in this case it is a function of the error terms. The MA model links the current values of the time series to random errors that have occurred in the previous periods rather that the values of the actual series themselves. Alternatively, there are series that fit better by combining both AR and MA models under the assumption of data stationary. Such model is known as the Mixed Autoregressive Moving Average (ARMA) model. However, when data is found not stationary, an improved model should be used involving an integrated (I) component. It is known as the Mixed Autoregressive Integrated Moving Average (ARIMA) model. This model is formulated when the stationary assumption of the variable is not met. A degree of data differencing is conducted to overcome stationarity problem. Stationary means that the properties of the series don't depend on the time when it is captured.
The general term for this model is ARIMA (p,d,q) where p refers to the AR order, d is degree of differencing and q is the order of MA. Generally, estimating parameter of ARIMA model is describing dependent variable Y by combining a p th order AR model, d th degree of differencing and a q th order MA model. A non-seasonal ARIMA forecasting equation can be written like this. Let Y denote the original series; y denotes the differenced series. It is defined that: (2) Second differencing, when d=2: And so on.
The forecasting equation for y is denoted as follows: The construction of an ARIMA model must follows several stages as shown below: Stage 1 : Stationaries the series, if necessary, by differencing. Stage 2: Plot ACF and PACF to study the pattern of autocorrelations and partial autocorrelations to determine if lags of the stationaries series and/or lags of the forecast errors should be included in the forecasting. Stage 3: Identify the initial value for parameter p and q from ACF and PACF plot. Stage 4: Fit the model that is suggested and check its residual diagnostics.
Often time series possess a seasonal component that repeats itself after every S observation (S=12 for monthly and S=4 for quarterly). In order to deal with seasonality, ARIMA processes have been generalized. ARIMA model for series with seasonal component present, an additional differencing is necessary to be performed in order to eliminate the seasonal effect. It is called 'seasonal differencing'. This model is known as SARIMA. The non-seasonal & seasonal differences are combined to stationarize the series such that: No difference, when d=0: And so on.
Seasonal ARIMA models rely on seasonal lags and differences to fit the seasonal pattern. The seasonal part of an ARIMA model is summarized by three additional numbers; Pthe number of seasonal autoregressive terms, Dthe number of seasonal differences and Qthe number of seasonal moving-average terms. The SARIMA term can be written as SARIMA (p,d,q)(P,D,Q)s where s is the number of period in a season. This is represented more formally by equation 7; Where S is the number of periods per season, yt is the time series observation at time t, t  is white noise,  is the mean of the series,  is seasonal AR parameter,  is non-seasonal AR parameter,  is the seasonal MA parameter,  is non-seasonal MA parameter and B is the back shift operator. The back shift operator can be simplified in AR and MA term as follows: The non-seasonal components for AR and MA are; The seasonal components for AR and MA are: It is noticed that, on the left hand side of equation (7), the seasonal and non-seasonal AR components multiply each other, and on the right hand side, the seasonal and non-seasonal MA components multiply each other.

Performance Indicator
Several performance indicators were used to identify the most appropriate Time Series model for the CO data at the study locations which include Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), Mean Absolute Percentage Error (MAPE) and Akaike Information Criterion (AIC) (Lazim, 2007). For n series of data points, with k numbers of parameter in the model, t y is an actual data at time t, t ŷ is the predicted value, t  is the error defined by The most appropriate or optimal model is chosen based on the lowest value of MSE, RMSE, MAPE and AIC.

Exploratory Analysis
Figure 2 (a) shows the average maximum CO cycle over the seven days' period in a week and Figure 2 (b) exhibits the average monthly fluctuation. CO has increased during the weekdays that reach the peak on Friday at Perai and Thursday at Pasir Gudang. Meanwhile the concentration had drop during weekend. Over a year period, on the average there are two cycles of CO movement with the peak in May and October at Pasir Gudang, meanwhile no obvious cyclical pattern was observed at Perai. Month of May is the lowest while the highest is in December. The time series plot for the data set in Figure 3 (a) indicates there exist an irregular pattern at Perai. For Pasir Gudang data set, Figure 3 (b) also depicts irregular and some cyclical pattern which indicates the present of seasonal behavior. With the null hypothesis that CO time series is stationary, the Augmented Dickey-Fuller (ADF) test results of p-value equals 0.01 has provided the evidence that the CO data are not stationary for both Perai and Pasir Gudang. This results suggest for data differencing to be conducted to improve this situation of non-stationarity. Other than that, based on Pettitt's test, the CO data is also proven not homogeneous for Perai and Pasir Gudang sites. The null hypothesis that a time series is homogeneous need to be rejected due to very small p-value (0.000) at both sites respectively.

Determining Time Series Model for Daily Maximum CO at Perai Site
In order to increase the quality of data stationary, first differencing was performed based on the results in Figure 4 (a) which is based on the decay pattern of ACF plot and one high spike followed by several smaller spike in PACF plot. Figure 4 (b) shows that the data is stationary. The next step is to identify the parameter p, d and q for ARIMA model. Parameter d=1 while based on Figure 4 (b), from the ACF plot, q is between 1 and 2 while PACF suggest that p is between 2 to 7. Thus we begin the initial model with the first p=2 significant spike for AR and the first q=1 significant spike for degree of MA shown by ACF and PACF respectively; ARIMA (2, 1, 1). Based on this model several other models will be proposed and the performance are compared to determine which one the best is given by the lowest MSE, RMSE, MAPE and AIC values. The best model is ARIMA (3,1,1) and the equation can be written as follows: Figure 5 illustrates the forecasting of the Daily Maximum CO for Prai sites for 30 days ahead. It is shows that CO maximum can be between 0.5 to 2.0 ppm in the next 30 days' period. The thick blue line represents the mean of the forecasted values. The ARIMA (3,1,1) model is fitted to the validation data, the daily maximum CO for the period from January 2014 until December 2014. The result is shown in Table 3. Based on the results, it can be concluded that model ARIMA (3,1,1) is a valid model since the performance measures of the training data set has improved (i.e. shown by the reduction of MAE, RMSE, MAPE and AIC).

Determining Times Series Model for Daily Maximum CO at Pasir Gudang
Seasonality in a time series is a regular pattern of changes that repeats over S time periods, where S defines the number of time periods until the pattern repeats again. Figure 3 indicates that the daily maximum CO data contains seasonal component. However, the number of time periods of the seasonal cycles was not obvious. For this study, based on the results shown in Figure 2 (b), since the data is daily data, it is decided to let S=7 the period of a cyclical CO to repeat over seven days in a week. Because there is no obvious linear trend and no obvious seasonality based on exploratory data analysis results in section 3.1, both seasonal and non-seasonal differencing will be performed. The results of ACF and PACF after the process has been conducted are shown in Figure 6.
The ACF and PACF were plotted to collect conclusive evidence on stationarity condition and so as to guest for initial model parameter for SARIMA (p,d,q)(P,D,Q)7. The results of Figure  4 (a) indicates, that it is still cannot be concluded that the data series in seasonal differencing is not yet stationary, so in order to render the original series stationary, the first differencing for non-seasonal must be performed. When looking at the results after the first differencing for non-seasonal component, the ACF has shown a decay pattern, unfortunately not for PACF as shown in Figure 4 (b). Hence, aiming for improvement, a second differencing was conducted. However, it is found that the results were getting worse. Therefore, it was then decided to stop apply differencing and conclude that data series only be performed with first differencing. In order to determine the appropriate model formulations to be fitted to the data series one needs to observe for the behavior of significant spike(s) in ACF and PACF plots.
From ACF plot in Figure 4 (b), two significant spikes are observed which at point lag 1 and 34. This one spike can be used to specify the non-seasonal MA part of the model. Another significant spike is also observed at lag 35 to suggest the seasonal SMA part of the model. Similarly, to identify the AR part of the model one needs to observe the PACF plot. Eight significant spikes are observed, one at lag 1, 2, 3, 4, 5, 6, 8 and lag 11 to suggest the non-seasonal AR part of the model. There are also fairly significant spikes at lag 7, 14 and 28 to indicate the seasonal SAR part of the model. However, even if these observations are made it is still cannot be perfectly sure of the correct values of the respective p, q, P and Q that can be assigned to the model. Hence, to ensure that a well specified model is not missed out, several model formulations will be proposed and the performances are compared. In this study, guided and supported by the previous results, the parameter we let the d=D=1 and p=2, q is in the range [1,8], P=1 and Q is in the range [1,3]. The results and the performance of the several proposed models are given in Table 4. It is found that the model that gave the best performance is SARIMA (2, 1, 8) (1, 1, 2)7 when fitted to the training data set. Backward shift operator for SARIMA is written as:  Figure 7 illustrates the forecasting of the Daily Maximum CO for Pasir Gudang site for 30 days ahead. It is shows that CO maximum can be between 0.01 to 1.5 ppm in the next 30 days' period. The thick blue line represents the mean of the forecasted values. The validity of the obtained model is verified, using validation data set (i.e. daily maximum CO for year 2014) and the results is shown in Table 4. The reduction in the value of MAE, RMSE, MAPE and AIC provided the evidence that the model is valid.

Conclusion
In summary, this study has shown that Perai and Pasir Gudang sites have different temporal behavior of CO levels. By day of week basis, the results indicate that Perai site has the highest traffic level on Thursday. While at Pasir Gudang the day with highest maximum CO is Friday. However, both sites experiencing lowest CO during weekend (Sunday) and less prominent at the beginning month of the year. Pasir Gudang has clear pattern of bimodal peak in May and October while Perai has the lowest maximum CO in May with no clear pattern of monthly cyles. Daily maximum CO time series at Perai shows an irregular pattern at Perai while for Pasir Gudang data, the series depicts irregular and some cyclical pattern which indicates the present of seasonal behavior. Based on several performance measures such as MAE, RMSE, MAPE and AIC, the study has found that the best appropriate prediction model for Perai is ARIMA(3,1,1) and SARIMA (2,1,8)(1,1,2)7 for Pasir Gudang.