A prerequisite for any data mining project is to understand data flows. The process illustrated below has five basic steps:

  • Pull data
  • Prepare data
  • Split data into training and test sets
  • Train the model
  • Evaluate the model

[Figure: data flow diagram]

Pull Data

The data source can range from an enterprise data warehouse to a single flat file. Where we get the data matters less than having **quality** data. In the demo below we will use the RODBC package to pull data from SQL Server.
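A minimal sketch of the pull step, assuming a local SQL Server instance hosting the classificationDemo database used later in this post:

library(RODBC)

# "." points at the local default SQL Server instance; swap in your own server
conn <- odbcDriverConnect("Driver=SQL Server;Server=.;Database=classificationDemo;Trusted_Connection=yes")

# pull only the columns the model needs
my.data <- sqlQuery(conn, "SELECT flavor, likeIt FROM dbo.candySummary")
close(conn)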

Prepare Data

Data mining models rarely work with data straight from the source system. Depending on the model we may need to perform one or more transformations. Typical data mining related transformations include but are not limited to:
  • Replace NULL values with the column mean
  • Convert nominal data types to numerical
  • Discretize continuous values
  • Convert target variables "Yes" and "No" to 1 and 0

The R language provides many built-in functions to perform these transformations. Several are demonstrated in the script below.
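As a quick standalone sketch of the four transformations listed above, using base R plus RSNNS on a small made-up data frame (df and its columns are hypothetical names, not part of the demo data):

library(RSNNS)

df <- data.frame(age    = c(23, NA, 31, 45),
                 flavor = c("apple", "cherry", "grape", "apple"),
                 likeIt = c("Yes", "No", "Yes", "No"))

# replace NULL/NA values with the column mean
df$age[is.na(df$age)] <- mean(df$age, na.rm = TRUE)

# convert a nominal column to numeric via one-hot encoding
flavorInputs <- decodeClassLabels(df$flavor)

# discretize a continuous column into bins
df$ageBand <- cut(df$age, breaks = 3, labels = c("young", "middle", "older"))

# convert the "Yes"/"No" target variable to 1 and 0
df$likeIt <- ifelse(df$likeIt == "Yes", 1, 0)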

Split Data into Training and Test Sets

We reserve a portion of the data for model evaluation. There is no hard rule for what percentage of the total data we should hold out for testing. Start with a 70/30 or an 80/20 split and adjust from there. A move in either direction involves tradeoffs: holding out more data makes the evaluation more reliable, but leaves less data to train on.
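A minimal sketch of a 70/30 split in base R, assuming a prepared data frame df:

set.seed(42)  # make the split reproducible
trainIdx <- sample(nrow(df), size = floor(0.7 * nrow(df)))
trainSet <- df[trainIdx, ]   # 70% used to fit the model
testSet  <- df[-trainIdx, ]  # 30% held out for evaluation

The RSNNS helper splitForTrainingAndTest(), used in the script below, does the same job while keeping inputs and targets aligned.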

Train the Model

A model is nothing more than a data structure that the training algorithm populates based on the training data. In theory we should see the model's error rate gradually decline as training iterations accumulate. It's appropriate to stop training once a desired error is reached or after a fixed number of iterations.
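A minimal sketch of the stopping-criteria idea with RSNNS, assuming inputs and targets matrices have already been prepared:

library(RSNNS)

# cap training at a fixed number of iterations
model <- mlp(x = inputs, y = targets, size = c(3), maxit = 200)

# the sum-of-squares error is recorded at each training iteration;
# if the curve flattens well before maxit, fewer iterations would have done
plot(model$IterativeFitError, type = "l", xlab = "iteration", ylab = "SSE")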

Evaluate Model

After training, we can start making predictions on the test set. This is where the fun begins. Measuring prediction accuracy is a good place to start. One should also analyze the model's error rate, processing time, and compute resources.
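A sketch of two quick checks, assuming the model and split objects built in the script below:

# confusion matrix: rows are the true class, columns the predicted class
cm <- confusionMatrix(my.splitData$targetsTest, my.testprediction)

# overall accuracy = correct predictions / total predictions
accuracy <- sum(diag(cm)) / sum(cm)

# rough processing-time check for the training step
system.time(mlp(x = my.splitData$inputsTrain, y = my.splitData$targetsTrain,
                size = c(3), maxit = 200))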

R Script

The script below walks through all of the data flow steps mentioned in this blog post. It uses the same randomly generated data set as my previous post: http://bit.ly/2umfXYH
#install.packages("RODBC")
#install.packages("RSNNS")
#install.packages("scales")
#install.packages("reshape")
#install.packages("devtools")
#install.packages("ggplot2")

library(RSNNS)
library(RODBC)
library(scales)
library(reshape)
library(devtools)
library(ggplot2)
source_url('https://gist.githubusercontent.com/fawda123/7471137/raw/466c1474d0a505ff044412703516c34f1a4684a5/nnet_plot_update.r')
source_url('https://gist.githubusercontent.com/fawda123/6206737/raw/d6f365c283a8cae23fb20892dc223bc5764d50c7/gar_fun.r')


myServer <- "."  # local server; change to your server
myDatabase <- "classificationDemo"
myDriver <- "SQL Server"

connectionString <- paste0(
    "Driver=", myDriver,
    ";Server=", myServer,
    ";Database=", myDatabase,
    ";Trusted_Connection=yes")

conn <- odbcDriverConnect(connectionString)



my.data <- sqlQuery(conn, 'SELECT flavor, likeIt FROM dbo.candySummary')

close(conn)




# take a look at the raw data
my.data


## decodeClassLabels() one-hot encodes the nominal flavor column for model use
my.dataInputs <- decodeClassLabels(my.data[, 1])

# the likeIt column is the target variable
my.dataTargets <- my.data[, 2]



## split the data into train and test sets; ratio is the fraction held out for test
my.splitData <- splitForTrainingAndTest(my.dataInputs, my.dataTargets, ratio = 0.33)




my.model <- mlp(x = my.splitData$inputsTrain, y = my.splitData$targetsTrain,
                size = c(3),  # hidden layer size; don't worry about this parameter yet
                maxit = 200)  # how many training iterations, i.e. pick up a piece of candy: yes I like it or no I don't



my.testprediction <- predict(my.model, my.splitData$inputsTest)



# convert predicted probabilities to class labels with a 0.9 decision threshold
my.testprediction <- ifelse(my.testprediction >= .9, 1, 0)




# attach the predictions to the one-hot encoded test inputs
my.outputtest <- cbind(my.splitData$inputsTest, my.testprediction)



colnames(my.outputtest) <- c("apple", "blueRaspberry", "cherry", "grape", "watermelon", "prediction")

# take a look
my.outputtest


# compare predicted vs. actual classes on the test set
confusionMatrix(my.splitData$targetsTest, my.testprediction)


# training error by iteration; it should decline and then flatten
plotIterativeError(my.model)


## let's get a visual of the model


plot.nnet(my.model)