1. WHICH APPLICATION IS USED?
I will use RapidMiner to analyse the data pemilu (election data) dataset and to build a prediction model of Elektabilitas Caleg (the electability of legislative candidates).
RapidMiner is a software platform developed by the company of the same name that provides an integrated environment for machine learning, data mining, text mining, predictive analytics and business analytics. It is used for business and commercial applications as well as for research, education, training, rapid prototyping, and application development, and it supports all steps of the data mining process, including data preparation, results visualization, validation and optimization. RapidMiner is developed on an open core model. The free RapidMiner Basic Edition, which is limited to 1 logical processor and 10,000 data rows, is available under the AGPL license.
I am going to create a model for predicting the electability of legislative candidates, using the data sets provided on Dropbox. RapidMiner offers many algorithms and operators, but for this prediction I will use three main algorithms: Decision Tree (C4.5), Naïve Bayes (NB) and K-Nearest Neighbor (K-NN). The purpose of the model is to predict whether a legislative candidate will be elected or not.
2. HOW TO MAKE THE ELECTION PREDICTION MODEL
After opening RapidMiner (I am using RapidMiner v7.2), we start by creating a new, blank process, since we are going to build the process ourselves. First, in the Operators panel, choose Read CSV rather than Read Excel, because our data sets are in CSV format, and drag it onto the process canvas. Open the Import Configuration Wizard and adjust the options to match the parameters given on Dropbox. Second, drag Set Role onto the canvas; in the Set Role parameters, set the attribute name to TERPILIH ATAU TIDAK (elected or not) and the target role to LABEL. Third, add one more operator, X-Validation, to estimate the performance of the model. Double-click the X-Validation operator to open its TRAINING and TESTING canvases; the TRAINING side is where we place the learning algorithm, such as DECISION TREE, NAÏVE BAYES, or K-NEAREST NEIGHBOR (K-NN), and the TESTING side is where the trained model is applied and its performance measured. After connecting the whole process, click the play button, and RapidMiner will show the results.
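The flow described above (read a CSV, mark one column as the label, split the examples into training and testing parts) can also be sketched outside RapidMiner in plain Python. This is only an illustration under assumptions: the inline CSV, the semicolon delimiter, every column name other than TERPILIH ATAU TIDAK, and the helper names read_examples and k_fold_splits are hypothetical, and the folds are assigned round-robin, which is not necessarily how X-Validation samples.

```python
import csv
import io

# Hypothetical miniature CSV; the real file and its columns come from the dataset.
SAMPLE_CSV = """NAMA;SUARA SAH;TERPILIH ATAU TIDAK
A;1200;YA
B;300;TIDAK
C;450;TIDAK
D;2100;YA
E;150;TIDAK
F;700;TIDAK
"""

LABEL = "TERPILIH ATAU TIDAK"  # the attribute that receives the LABEL role


def read_examples(text, delimiter=";"):
    """Parse CSV text into (features, label) pairs, similar to Read CSV + Set Role."""
    examples = []
    for row in csv.DictReader(io.StringIO(text), delimiter=delimiter):
        label = row.pop(LABEL)  # separate the label from the regular attributes
        examples.append((row, label))
    return examples


def k_fold_splits(examples, k):
    """Yield (training, testing) lists so that each example is tested exactly once."""
    for fold in range(k):
        testing = [ex for i, ex in enumerate(examples) if i % k == fold]
        training = [ex for i, ex in enumerate(examples) if i % k != fold]
        yield training, testing


examples = read_examples(SAMPLE_CSV)
splits = list(k_fold_splits(examples, k=3))
for number, (training, testing) in enumerate(splits, start=1):
    print(f"fold {number}: {len(training)} training, {len(testing)} testing")
```

In each fold a model would be trained on the training list and evaluated on the testing list, which is the role the TRAINING and TESTING canvases play inside X-Validation.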
DECISION TREE ALGORITHM (C4.5)
Design of Decision Tree
For the first algorithm, we use Decision Tree as our model; it can classify both nominal and numerical data. In RapidMiner, the attribute with the label role is the one predicted by the Decision Tree operator. According to the RapidMiner website, each interior node of the tree corresponds to one of the input attributes. The number of edges of a nominal interior node is equal to the number of possible values of the corresponding input attribute. Outgoing edges of numerical attributes are labeled with disjoint ranges. Each leaf node represents a value of the label attribute given the values of the input attributes represented by the path from the root to the leaf.
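How an attribute is chosen for an interior node can be illustrated with a small sketch. It assumes the gain ratio criterion usually associated with C4.5; the toy attributes (incumbent, gender) and their values are hypothetical and are not taken from the election dataset.

```python
import math
from collections import Counter


def entropy(items):
    """Shannon entropy (in bits) of a list of nominal values."""
    total = len(items)
    return -sum((n / total) * math.log2(n / total) for n in Counter(items).values())


def gain_ratio(values, labels):
    """Information gain of splitting on a nominal attribute, divided by its split info."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)  # one group per outgoing edge
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    split_info = entropy(values)
    return gain / split_info if split_info > 0 else 0.0


# Hypothetical toy data: six candidates and two nominal attributes.
labels = ["YA", "YA", "TIDAK", "TIDAK", "TIDAK", "TIDAK"]
incumbent = ["yes", "yes", "no", "no", "no", "no"]  # separates the classes perfectly
gender = ["L", "P", "L", "P", "L", "P"]             # carries no information here

print("incumbent:", gain_ratio(incumbent, labels))
print("gender:", gain_ratio(gender, labels))
```

The attribute with the higher gain ratio (here the hypothetical incumbent attribute) would be placed at the node, with one outgoing edge per value.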
To make this easier to understand, let us apply it to our case.
The results show that the decision tree has an accuracy of 93.16%. Of the examples predicted as TIDAK (not elected), 362 are truly TIDAK and 14 are truly YA (elected); of the examples predicted as YA, 15 are truly TIDAK and 34 are truly YA.
NAÏVE BAYES ALGORITHM
Design of Naive Bayes
According to the RapidMiner website, a Naive Bayes classifier is a simple probabilistic classifier based on applying Bayes’ theorem (from Bayesian statistics) with strong (naive) independence assumptions. A more descriptive term for the underlying probability model would be ‘independent feature model’. In simple terms, a Naive Bayes classifier assumes that the presence (or absence) of a particular feature (i.e. attribute) of a class is unrelated to the presence (or absence) of any other feature. For example, a fruit may be considered to be an apple if it is red, round, and about 4 inches in diameter. Even if these features depend on each other or upon the existence of the other features, a Naive Bayes classifier considers all of these properties to contribute independently to the probability that this fruit is an apple.
The advantage of the Naive Bayes classifier is that it only requires a small amount of training data to estimate the means and variances of the variables necessary for classification. Because independent variables are assumed, only the variances of the variables for each label need to be determined and not the entire covariance matrix.
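The idea of estimating only a mean and a variance per attribute and per label can be sketched as a small Gaussian Naive Bayes classifier. This is a minimal illustration, not RapidMiner's implementation: the function names, the variance floor, and the toy rows (valid votes in thousands, position on the party list) are assumptions.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(rows, labels):
    """Estimate, per class, the prior plus a (mean, variance) pair for every attribute."""
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    model = {}
    for label, members in by_class.items():
        stats = []
        for column in zip(*members):  # one column per attribute
            mean = sum(column) / len(column)
            variance = sum((x - mean) ** 2 for x in column) / len(column)
            stats.append((mean, max(variance, 1e-9)))  # floor avoids division by zero
        model[label] = (len(members) / len(rows), stats)
    return model


def predict(model, row):
    """Return the class with the highest log posterior, treating attributes as independent."""
    best_label, best_score = None, -math.inf
    for label, (prior, stats) in model.items():
        score = math.log(prior)
        for x, (mean, variance) in zip(row, stats):
            score += -0.5 * math.log(2 * math.pi * variance) - (x - mean) ** 2 / (2 * variance)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data: (valid votes in thousands, position on the party list).
rows = [(2.1, 1), (1.8, 2), (0.3, 5), (0.4, 7), (0.2, 6)]
labels = ["YA", "YA", "TIDAK", "TIDAK", "TIDAK"]
model = fit_gaussian_nb(rows, labels)

print(predict(model, (1.9, 1)))
print(predict(model, (0.25, 6)))
```

Because each attribute contributes its own term to the score, no covariance between attributes is ever estimated, which is the saving the paragraph above describes.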
- training set (Data Table): The input port expects an ExampleSet. In the example process it is the output of the Select Attributes operator. The output of other operators can also be used as input.
- model (Model): The Naive Bayes classification model is delivered from this output port. This classification model can now be applied to unseen data sets to predict the label attribute.
- example set (Data Table): The ExampleSet that was given as input is passed unchanged to the output through this port. This is usually used to reuse the same ExampleSet in further operators or to view the ExampleSet in the Results Workspace.
Naïve Bayes (NB) has an accuracy of 89.14%. The distribution model for the label attribute TERPILIH ATAU TIDAK gives class TIDAK a prior of 0.887 and class YA a prior of 0.113. Whether a legislative candidate is considered elected or not is judged by reference to the candidate's valid votes. Of the examples predicted as TIDAK, 357 are truly TIDAK and 26 are truly YA; of the examples predicted as YA, 20 are truly TIDAK and 22 are truly YA.
K-NEAREST NEIGHBOR ALGORITHM (K-NN)
Design of K-NN
According to the RapidMiner website, the k-NN operator generates a k-Nearest Neighbor model from the input ExampleSet; this model can be a classification or a regression model, depending on the input ExampleSet. The k-Nearest Neighbor algorithm is based on learning by analogy, that is, by comparing a given test example with training examples that are similar to it. The training examples are described by n attributes. Each example represents a point in an n-dimensional space. In this way, all of the training examples are stored in an n-dimensional pattern space. When given an unknown example, a k-nearest neighbor algorithm searches the pattern space for the k training examples that are closest to the unknown example. These k training examples are the k “nearest neighbors” of the unknown example. “Closeness” is defined in terms of a distance metric, such as the Euclidean distance.
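The search for the k closest training examples can be sketched in a few lines of Python. This is an illustrative sketch that assumes numerical attributes, Euclidean distance and a simple majority vote; the function names and the toy points (valid votes in thousands, position on the party list) are hypothetical.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(training, query, k=1):
    """Majority vote among the k training examples closest to the query point."""
    nearest = sorted(training, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: ((valid votes in thousands, list position), label).
training = [
    ((2.1, 1), "YA"),
    ((1.8, 2), "YA"),
    ((0.3, 5), "TIDAK"),
    ((0.4, 7), "TIDAK"),
    ((0.2, 6), "TIDAK"),
]

print(knn_predict(training, (1.9, 1.5), k=1))
print(knn_predict(training, (0.5, 5.5), k=3))
```

With k=1 the prediction is simply the label of the single closest training example. Attributes on very different scales would dominate the distance, so in practice they are often normalized first.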
K-Nearest Neighbour has an accuracy of 89.63%. The result is a 1-nearest-neighbour classification model built from 425 examples with 10 dimensions (1 special attribute and 9 regular attributes) and two classes, YA (yes) and TIDAK (no). Of the examples predicted as TIDAK, 358 are truly TIDAK and 25 are truly YA; of the examples predicted as YA, 19 are truly TIDAK and 23 are truly YA.
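The three confusion matrices reported above can be compared directly. The sketch below recomputes accuracy and the recall for class YA from the counts given in the text; the function names are illustrative. The pooled accuracies it prints (93.18%, 89.18% and 89.65%) differ slightly from the reported 93.16%, 89.14% and 89.63%; a plausible explanation, stated here as an assumption, is that the reported figures are averages over the validation folds rather than a single pooled ratio.

```python
# Confusion matrices as reported in the text: (predicted, actual) -> count.
MATRICES = {
    "Decision Tree": {("TIDAK", "TIDAK"): 362, ("TIDAK", "YA"): 14,
                      ("YA", "TIDAK"): 15, ("YA", "YA"): 34},
    "Naive Bayes": {("TIDAK", "TIDAK"): 357, ("TIDAK", "YA"): 26,
                    ("YA", "TIDAK"): 20, ("YA", "YA"): 22},
    "K-NN": {("TIDAK", "TIDAK"): 358, ("TIDAK", "YA"): 25,
             ("YA", "TIDAK"): 19, ("YA", "YA"): 23},
}


def accuracy(matrix):
    """Share of all examples whose predicted class equals the actual class."""
    correct = sum(n for (predicted, actual), n in matrix.items() if predicted == actual)
    return correct / sum(matrix.values())


def recall(matrix, cls):
    """Share of the examples that truly belong to cls and were predicted as cls."""
    actual_total = sum(n for (_, actual), n in matrix.items() if actual == cls)
    return matrix[(cls, cls)] / actual_total


for name, matrix in MATRICES.items():
    print(f"{name}: accuracy {accuracy(matrix):.2%}, recall YA {recall(matrix, 'YA'):.2%}")
# Decision Tree: accuracy 93.18%, recall YA 70.83%
# Naive Bayes: accuracy 89.18%, recall YA 45.83%
# K-NN: accuracy 89.65%, recall YA 47.92%
```

On these counts the decision tree finds 34 of the 48 candidates who were actually elected, while Naïve Bayes and K-NN find fewer than half of them, even though all three accuracies are close to or above 89%. Since 377 of the 425 examples are TIDAK, accuracy alone hides much of this difference.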