NEURO-FUZZY DATA MINING SYSTEM FOR IDENTIFYING E-COMMERCE RELATED THREATS

E-commerce is driven by Information Technology (IT), especially the web, and relies largely on innovative technologies facilitated by Electronic Data Interchange (EDI) and electronic payment over the web. Several studies have shown that e-commerce platforms are compromised by means of phishing and fraud attacks. This has made it necessary to find innovative methods for protecting e-commerce systems and users from these threats. This research integrates a Case Based Reasoning Module (CBRM) and an Adaptive Neuro-Fuzzy Inference System (ANFIS) to identify and categorise e-commerce websites and transactions as legitimate or illegitimate by analysing and evaluating selected attributes. This may provide a more secure platform for e-commerce users. The system, which was implemented in MATLAB, can be deployed on e-commerce systems and servers to monitor e-commerce requests with the aim of distinguishing legitimate from illegitimate websites and transactions. The results of the implementation indicate that the developed system is promising.


Introduction
E-commerce is widely understood as the buying and selling of products and services over the web, and any transaction completed fully through digital means is often viewed as e-commerce (Kishor, 2013; Singh et al., 2016). Personal computers, laptops, mobile phones and the internet are considered the infrastructure that aided the emergence of e-commerce and e-transactions (Akinyede & Akinyede, 2015). E-commerce has brought numerous benefits to technology-driven commerce, making shopping and selling faster and easier at any time. Despite the convenience of shopping and selling on the web, e-commerce is bedevilled by security threats such as phishing and fraud, largely because it uses the web as its driving infrastructure. Securing customers' data is a major challenge hindering the growth of e-commerce (Bandara et al., 2019). Fraudsters constantly seek ways to take advantage of online shoppers who make novice errors. Common errors that make people susceptible to security threats include shopping on websites that are not secure, giving out too much personal information and leaving computers vulnerable to viruses (Murphy, 2018). Phishing is a criminal mechanism that employs both social engineering and technical subterfuge to steal purchasers' private identification data and financial account credentials for malicious purposes (Niranjanamurthy & Dharmendra, 2013; Ramachandran & Chang, 2016). Usually, phishing attacks direct the recipient to a website designed to mimic the target enterprise's actual visual identity with the aim of obtaining private personal data, often leaving the victim unaware of the malicious event. Acquiring this type of private data appeals to black hat hackers because it allows them to impersonate their victims and make fraudulent financial transactions. Victims often incur huge financial losses or have their entire identity stolen, usually for criminal purposes (Chirag et al., 2012; Irvin-Erickson, 2019).
Despite numerous studies on enhancing the safety of e-commerce websites and transactions, various threats persist, the most commonly experienced being phishing and financial fraud. Most e-commerce users are unaware that, after visiting malicious or phishing websites, their web browsers can expose useful information about their transactions to hackers and fraudsters (Ramachandran & Chang, 2016).
With the increasing number of phishing websites, which pose a threat to the overall security of e-commerce platforms, there is a need to develop a strong and effective solution for identifying e-commerce phishing websites and fraudulent transactions. The study adopts a neuro-fuzzy based system and data mining techniques to assist in the evaluation and classification of e-commerce websites and transactions as "legitimate" or "illegitimate". The study also aims to employ knowledge mining, that is, processing data into information, which mostly involves identifying patterns within large data sets that are impossible for humans to discover manually (Lee & Yoon, 2017). The Adaptive Neuro-Fuzzy Inference System (ANFIS) is one of the systems that can acquire knowledge from inherently inaccurate data and maintain a high level of performance in the presence of uncertainty when supplying solutions to problems (Arinkoola, 2016).

Related Work
Detecting phishing websites and fraud is a crucial step towards ensuring security in e-commerce platforms, and several approaches have been adopted to address these problems. This section reviews different studies on phishing and fraud detection schemes. A fraud detection system for e-commerce transactions was developed by employing a prudential multiple consensus model (Carta et al., 2019). This was achieved using a data intelligence technique that integrates the effectiveness of several modern classification algorithms through a two-fold criterion: probabilistic and majority-based selection. The aim was to maximise the effectiveness of the model in detecting fraudulent transactions regardless of any data imbalance. The model was validated with a set of experiments on a large real-world dataset characterised by a high degree of data imbalance, and the results confirmed that the proposed model performed best compared to other existing classification algorithms. However, the model could not be evaluated and was also characterised by a high degree of data imbalance.
One approach is phishing webpage detection for secure online transactions (Fowdur & Khader, 2018), which was designed to detect phishing websites used for e-commerce transactions. In this approach, three layers of criteria are used: Google page rank, IP address in URL and quality of webpage content. The phishing website detection system consists of three modules: a data collection module, which finds phishing and genuine e-commerce websites for analysis; a fuzzy rule base containing fuzzy rules that help the inference ripper engine draw logical conclusions about the genuineness of a webpage or website; and a classification module that uses symbolic logic to classify websites according to associated risk factors. In the second approach, a prototype intelligent Intrusion Detection System (IIDS) for e-banking was developed using fuzzy logic and data mining techniques (Khraisat et al., 2019). The system was designed using fuzzy logic to give risk managers more information with which to efficiently manage and detect the risks associated with website phishing by combining historical data and expert input. Fuzzy logic and data mining algorithms, including C4.5, RIPPER, PART, PRISM and CBA, were used to assess e-banking phishing website risk using twenty-seven (27) factors. Linguistic variables were used to represent key phishing characteristic indicators associated with e-banking phishing website probability. The system was implemented using WEKA and MATLAB. Two publicly available data sets were used to test the implemented system.
A secure environment for a client-side e-commerce payment system (Akinyede et al., 2014) was developed to protect customers' personal and transaction data from fraud using encryption. The system was divided into three parts, namely: merchant server-side scripting, which handles customers' requests; customer-side scripting, which makes requests to the online server; and the host side, which deals with funds transfer. The security mechanism employed in this system is a symmetric cryptographic scheme based on the Advanced Encryption Standard (AES) encryption and decryption algorithm as a means of protecting transaction data and credentials in e-commerce transactions, and this technique provided an efficient solution for protecting consumers' transactions.
With the aim of protecting e-commerce systems from internet fraud (Phani & Mahaboob, 2013), a prototype application that detects fraudulent e-commerce transactions was developed. A genetic algorithm with multiple criteria was developed to detect fraud, namely payment card usage frequency, payment card usage location, overdraft on the payment card and payment card balance. The prototype application was built in Java with a graphical user interface (GUI) to ensure user friendliness. Intensive performance evaluation of the prototype was also performed.
A neuro-fuzzy approach was employed to detect phishing websites and protect purchasers when performing online transactions (Aburrous et al., 2010). A hybrid neuro-fuzzy method was used to develop a phishing website detection model that offered an effective solution. The results suggest that the proposed neuro-fuzzy system, which used five (5) inputs, was powerful in detecting phishing websites with high accuracy in real time. The proposed system made use of rules, user-behaviour profile, phish-tank and pop-ups from emails. Two-fold cross-validation was applied to train the proposed model, and a set of 243 rules was generated. Researchers have also proposed a Transductive Support Vector Machine (TSVM)-based system for phishing page detection. The system was independent of the attack method and did not affect the users' behaviour. Although the system performs well, the result is only a preliminary investigation of detecting phishing web pages using TSVM. As a result, much remains to be done to improve its performance.

Methodology
ANFIS and CBRM are adopted in developing the proposed Neuro-Fuzzy Data Mining System model. The model, referred to as the Adaptive Neuro-Fuzzy Inference System Design Model (ANFIS-DM), identifies and classifies e-commerce websites and transactions as either legitimate or illegitimate by systematically evaluating features or attributes of e-commerce websites and transaction data to detect phishing websites and fraudulent transactions. A set of defined linguistic variables is modelled for correct interpretation of results using a scale, as shown in Table 1 and Table 2, where the classification of the features extracted from e-commerce transactions is formulated from the possible outcomes or conditions of each parameter/attribute. Figure 1 depicts the procedure of the ANFIS-DM model and the crisp values of input parameters representing the model's attributes. The fuzzy set of parameters (attributes) is represented by 'X', which is defined as in equation 1.
$$X = \{x_1, x_2, \ldots, x_n\} \qquad (1)$$
where x_n represents the nth parameter or attribute of X and n is the total number of parameters in X (here n = 15). For each parameter, a group of constraints is defined, which makes it easy to scale properly, and a standard or acceptable range of values/labels is assigned. In this research, the features/attributes of e-commerce transactions listed in Table 2 are extracted from an e-commerce transaction request, which serves as an input/sample case to the ANFIS-DM. These extracted attributes form the query submitted to the case-based system to find the case most similar to the new sample case. The closest similar case is retrieved by employing the K-nearest Neighbour (KNN) algorithm using the Euclidean distance. The retrieved case serves as input to the fuzzy module, where fuzzification and inference take place using generated rules to return a distinct final output/result, with the classification formulated from the possible conditions of each parameter/attribute. Basically, the procedure of the proposed ANFIS-DM model is composed of six functional blocks (see Figure 1):
a. a rule base, serving as the input, containing a number of fuzzy if-then rules;
b. a database which defines the membership functions of the fuzzy sets used in the fuzzy rules;
c. a CBR system, a decision-making paradigm that performs the inference operations on the rules;
d. a fuzzification module which transforms the crisp inputs into degrees of match with linguistic values;
e. an ANFIS module which serves as a basis for constructing a set of fuzzy if-then rules with appropriate membership functions to generate the stipulated input-output pairs;
f. a defuzzification module which transforms the fuzzy results of the inference into a crisp output.
Usually, the rule base and the database are jointly referred to as the knowledge base.

Case Base Reasoning (CBR) Module
CBR is a decision-making paradigm in which new cases are solved by relying on previously solved comparable instances (Yikun et al., 2019). The CBR approach mimics how humans reason and learn, which makes it a promising approach for building intelligent systems (Zhai et al., 2019). In this research, a database of solved cases is employed, and every case is described by a group of input attributes associated with a designated output. Extracting useful information from this database can help the CBR system provide a reliable result on yet-to-be-solved cases. The CBR model was adopted to ensure the efficiency and reliability of the system. In this research, the CBR module focuses on two primary steps of the CBR cycle: retrieval and reuse of solutions from previous cases. Case retrieval is performed based on the similarity of the solved case to the new (unsolved) case. Here, the cases most similar to the new case are retrieved by employing the K-nearest Neighbour (KNN) algorithm using the Euclidean distance, as shown in equation 2.
$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \qquad (2)$$
where i = 1, 2, 3, …, n; d(x, y) is the distance between the new (unsolved) case and a retrieved case; x_i is the value of each attribute for the new case; y_i is the value of each attribute for the retrieved cases; and n is the total number of attributes. The retrieved cases are provided as input to the ANFIS module for further processing, which incorporates model training and rule generation.
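The retrieval step described above can be illustrated with a minimal Python sketch. It is not the authors' MATLAB implementation: the function names, the toy case base, the three-attribute encoding and the choice of k = 2 are all assumptions made for illustration only.

```python
import math

def euclidean_distance(x, y):
    """Euclidean distance between two attribute vectors of equal length."""
    if len(x) != len(y):
        raise ValueError("cases must have the same number of attributes")
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def retrieve_nearest_cases(case_base, new_case, k=3):
    """Return the k solved cases closest to new_case.

    case_base is a list of (attributes, label) pairs; ties keep the
    original order of the case base because sorted() is stable.
    """
    ranked = sorted(case_base, key=lambda case: euclidean_distance(case[0], new_case))
    return ranked[:k]

# Hypothetical toy case base: three attributes per case,
# label 1 = legitimate, -1 = illegitimate (assumed encoding).
case_base = [
    ([1, 1, 0], 1),
    ([1, 0, 1], 1),
    ([-1, -1, 0], -1),
    ([-1, 0, -1], -1),
]
nearest = retrieve_nearest_cases(case_base, [1, 1, 1], k=2)
nearest_labels = [label for _, label in nearest]
```

In this sketch the retrieved cases would then be handed to the fuzzy module; a full CBR cycle would also revise and retain cases, which the text states is outside the scope of the module.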

Adaptive Neuro-Fuzzy Inference System Module
ANFIS is based on the Takagi-Sugeno-Kang (TSK) type of fuzzy model (Takagi & Sugeno, 1985; Shafaei et al., 2017). It integrates both neural network and fuzzy logic principles and can capture the benefits of both techniques in a single framework (Sampson et al., 2019). ANFIS is a data-driven technique representing a neural network approach to solving function approximation problems. Data-driven approaches for the synthesis of such networks are typically based on clustering a training set of numerical samples of the unknown function to be approximated. Since its introduction, ANFIS networks have been applied efficiently in classification tasks, rule-based process control, pattern recognition and similar problems. This fuzzy model generates fuzzy rules from an input/output data set.
The ANFIS under consideration has a number of inputs and one output. The rule base contains fuzzy IF-THEN rules of Takagi and Sugeno's type, in which xi is the antecedent, w is the firing strength of the rule and f(xi) is a crisp function in the consequent. The ANFIS structure usually consists of 5 layers. Figure 2 shows the architecture of the ANFIS module of the ANFIS-DM.

In layer 1 (L1), a Gaussian Membership (GM) function is used to map the input value of each xi node to its appropriate membership value (see equation 3), where i = 1, 2, 3, …, n. The Gaussian Membership Function is specified by two parameters, ci and σi, where ci represents the membership function's centre (threshold) and σi represents its width. These parameters are called the premise parameters and are used to adjust the shape of the membership function.
In layer 2 (L2), each node calculates the firing strength (weight, wi) of a rule via multiplication: the inputs to this layer are the membership values, and each node multiplies its inputs to give an output that represents the firing strength of a rule. The output of this layer is given by equation (4), where i = 1, 2, 3, …, n.
In layer 3 (L3), the nodes calculate the ratio of each rule's firing strength to the sum of all the rules' firing strengths. The result is a normalized firing strength, given by equation (5), where i = 1, 2, 3, …, n.
In layer 5 (L5), the single node computes the overall output as the summation of the contributions from each rule. That is, the output of the fuzzy inference system is calculated by summing all rule outputs using equation (7).
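To make the layer descriptions concrete, the following Python sketch runs one forward pass through a small first-order Sugeno-type ANFIS. It is an illustration under stated assumptions, not the authors' model: the Gaussian form exp(-(x - c)^2 / (2σ^2)) is the commonly used one and is assumed here; layer 4 is not described in the text above, so the sketch assumes the standard ANFIS step in which each normalized firing strength is multiplied by a linear consequent; and the rule dictionary layout, parameter values and two-rule example are hypothetical.

```python
import math

def gaussian_mf(x, c, sigma):
    """Layer 1: Gaussian membership value with centre c and width sigma (assumed form)."""
    return math.exp(-((x - c) ** 2) / (2.0 * sigma ** 2))

def anfis_forward(inputs, rules):
    """One forward pass through a first-order Sugeno-type ANFIS.

    Each rule is a dict with (hypothetical layout):
      'mf'     : list of (c, sigma) premise parameters, one pair per input
      'coeffs' : list of linear consequent coefficients, one per input
      'bias'   : constant term of the consequent
    Returns the crisp output and the normalized firing strengths.
    """
    # Layers 1 and 2: membership values multiplied into one firing strength per rule
    firing = []
    for rule in rules:
        w = 1.0
        for x, (c, sigma) in zip(inputs, rule["mf"]):
            w *= gaussian_mf(x, c, sigma)
        firing.append(w)

    # Layer 3: ratio of each firing strength to the sum of all firing strengths
    total = sum(firing)
    if total == 0.0:
        raise ValueError("no rule fired for this input")
    normalized = [w / total for w in firing]

    # Layer 4 (assumed, not described in the text): weighted linear consequents
    contributions = []
    for w_bar, rule in zip(normalized, rules):
        f = rule["bias"] + sum(p * x for p, x in zip(rule["coeffs"], inputs))
        contributions.append(w_bar * f)

    # Layer 5: overall output is the sum of all rule contributions
    return sum(contributions), normalized

# Hypothetical two-input, two-rule system: one rule votes +1, the other -1.
rules = [
    {"mf": [(1.0, 1.0), (1.0, 1.0)], "coeffs": [0.0, 0.0], "bias": 1.0},
    {"mf": [(-1.0, 1.0), (-1.0, 1.0)], "coeffs": [0.0, 0.0], "bias": -1.0},
]
output, normalized = anfis_forward([1.0, 1.0], rules)
```

With constant consequents of +1 and -1, an input sitting on the first rule's centres yields an output close to +1, which shows how the normalized firing strengths blend the rule outputs. Training, which tunes the premise and consequent parameters, is not covered by this sketch.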

Implementation
The model was implemented using the MATLAB programming tool. Several MATLAB commands were applied and stored in an M-file (a MATLAB code file). The implementation stages are listed as follows.
a. Building the Graphical User Interface (GUI): the GUI was designed using the MATLAB GUIDE command.
b. Retrieving the dataset from the database: the dataset named "e-commerce Phishing" was retrieved from the University of California, Irvine (UCI) Machine Learning Repository in Attribute-Relation File Format (ARFF). It was converted to a Comma Separated Value (CSV) file and later transferred to a MySQL database called e-commerce. The tables in the e-commerce database are retrieved for pre-processing using MATLAB code fragments.
c. Data conversion and pre-processing: the retrieved tables are converted into matrices for further pre-processing using
    % extract the cell arrays held by the retrieved tables
    cc = traintable.Data; dd = testtable.Data; ee = checktable.Data;
    % convert the cell arrays to numeric matrices
    c = cell2mat(cc); d = cell2mat(dd); e = cell2mat(ee);
The ANFIS-DM model was developed using the ANFIS model development in three phases as follows:
i) Phase One: Generating Initial Rules
    in_fis = genfis2(intrain, outtrain, radii);
Note: genfis2 is a MATLAB command. It generates an initial Sugeno-type FIS structure using subtractive clustering and requires separate sets of input and output data as input arguments. When there is only one output, genfis2 may be used to generate an initial FIS for model training. genfis2 accomplishes this by extracting a set of rules that models the data behaviour or generates patterns from the initial data set.
ii) Phase Two: Generating Adaptive Model structures
    [fis1, error, stepsize, fis, chkErr] = anfis(datatrain, in_fis, trnOpt, dispOpt, datatest);
iii) Phase Three: Assigning names to inputs and outputs. The refined rules generated from the FIS, as well as the inputs, are assigned names for easy identification.
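Stage b above mentions converting the ARFF dataset to CSV before loading it into MySQL. The text does not say which tool was used, so the following Python sketch only illustrates the idea. It assumes a dense ARFF file with unquoted, comma-separated values, and the attribute names in the sample are hypothetical.

```python
import csv
import io

def arff_to_csv(arff_text):
    """Convert dense ARFF text to CSV text with a header row of attribute names.

    Assumes simple ARFF: no quoted values containing commas and no sparse rows.
    """
    names = []
    rows = []
    in_data = False
    for raw in arff_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and ARFF comments
        if in_data:
            rows.append([value.strip() for value in line.split(",")])
        elif line.lower().startswith("@attribute"):
            names.append(line.split()[1])  # second token is the attribute name
        elif line.lower().startswith("@data"):
            in_data = True
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(names)
    writer.writerows(rows)
    return out.getvalue()

# Hypothetical three-attribute sample in ARFF form.
sample = """@relation phishing
@attribute url_length {1,0,-1}
@attribute ip_in_url {1,-1}
@attribute result {1,-1}
@data
1,-1,1
-1,1,-1
"""
csv_text = arff_to_csv(sample)
```

The resulting CSV text could then be imported into a database table; that import step depends on the database client in use and is not shown.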
The Surface Viewer is a graphical interface for examining the output surface of a fuzzy inference system for any one or two input variables. In Figure 3, the input variables URL length and IP address in URL were considered. The Surface Viewer is a read-only editor because it does not alter the fuzzy system or its associated fuzzy inference system structure in any way. The rule viewer is used to view the entire implication process of the Fuzzy Inference System from beginning to end. The line of indices can be moved around to correspond to the inputs; the system then re-adjusts and computes a new output, as shown in Figure 4. Eight rules were generated to drive the inference mechanism for the ANFIS-DM, as shown in Figure 5.
The ANFIS-DM model is selected from the name drop-down list and the input attributes are loaded from the database. These input attributes provide the model with the required attribute data for analysis and evaluation (see Figure 6). The computing clusters run the model, and the interface (in Figure 7) then displays the parameters and information regarding the result of the ANFIS-DM model.
The interface in Figure 8 displays the result of the ANFIS-DM model by showing plotted graphs of the desired output and the actual output for different instances of testing data. The screenshots in Figures 9a and 9b show the results for instances 1-33 and 860-892 of the testing data. For every instance, the integer value on the left is the desired/expected output and the real value on the right is the actual output.

Discussions
The ANFIS-DM was trained using 4499 data instances of known output and validated with 1001 instances of test data. Statistical tests offer a certain level of assurance about the validity and accuracy of a model. In this research, the performance of the proposed ANFIS-DM was computed using the Root Mean Square Error (RMSE). The performance of ANFIS-DM was evaluated and compared with two other predictive models, namely Linear Regression and Artificial Neural Network (ANN), to ascertain how accurately the ANFIS-DM model classifies e-commerce websites in comparison. The estimated time taken to build each of the models was also captured. The results are presented in Table 3. The RMSE is the square root of the variance in the residuals and indicates the absolute fit of the model to the data; the model with the lowest RMSE value is the best-fitting model. Table 3 shows that Linear Regression had an RMSE of 0.561 on training data and 0.579 on testing data, ANN had an RMSE of 0.526 on training data and 0.532 on testing data, and ANFIS-DM had an RMSE of 0.501 on training data and 0.510 on testing data. ANFIS-DM therefore had the lowest RMSE values on both training and testing data, which indicates that it performed best of the three models. This also suggests that ANFIS-DM is likely to classify future data and instances better than Linear Regression and ANN.
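For reference, the RMSE used in this comparison can be computed as in the short Python sketch below. The desired and actual values are made-up illustrative numbers, not results from Table 3, and the +1/-1 label encoding is an assumption.

```python
import math

def rmse(desired, actual):
    """Root mean square error between desired and actual outputs."""
    if len(desired) != len(actual) or not desired:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(d - a) ** 2 for d, a in zip(desired, actual)]
    return math.sqrt(sum(squared_errors) / len(desired))

# Hypothetical desired (integer) and actual (real-valued) outputs.
desired = [1, -1, 1, -1]
actual = [0.5, -0.5, 1.0, -1.0]
score = rmse(desired, actual)
```

A lower score means the actual outputs sit closer to the desired ones, which is the sense in which the model with the lowest RMSE is taken as the best fit.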
The time taken to build each model was also considered in evaluating the three models. Linear Regression built its model in 10 seconds, making it the fastest of the three and the least demanding of computing resources. ANN was the next fastest, with a build time of 29 seconds. ANFIS-DM took 47 seconds, making it the slowest of the three and indicating that it used the most computing resources. The performance evaluation in Table 3 thus shows that the ANFIS-DM model was the most accurate of the three models; however, it required more computing resources than Linear Regression and ANN.
This research was carried out with the goal of providing an end-user software model that helps analyse and mine e-commerce website data to discover hidden patterns that are used in identifying and classifying e-commerce websites as either legitimate or illegitimate/phishing. ANFIS-DM provides the best-fit solution to new instances based on existing instances. Results obtained from the system were satisfactory. The system can handle instances that may be very complex, since rules can be amended or added to adjust the decision mechanism of the software model.

Conclusion
ANFIS-DM performed accurately in identifying and classifying e-commerce websites and transactions as legitimate or illegitimate. A hybrid of CBR and ANFIS was used to develop a software model that assists e-commerce platforms and servers in analysing e-commerce data to identify and classify e-commerce websites and transactions as legitimate or illegitimate. It should, however, be noted that the system was not designed to protect transactions or websites from security threats to e-commerce; it was intended only to identify and classify e-commerce websites and transactions as legitimate or illegitimate. Further research on designing preventive measures is of utmost importance to the field.
engineering and the committee on graduate studies of African university of science and technology in partial fulfillment of the requirements for the degree of Doctor of Philosophy.