A prerequisite for any data mining project is to understand data flows. The process illustrated below has five basic steps:
- Pull data
- Prepare data
- Split data into training and test sets
- Train the model
- Evaluate model
## Pull Data

The data source can range from an enterprise data warehouse to a single flat file. Where we get the data is secondary to having **quality** data. We will use the RODBC driver to pull data from SQL Server in the demo below.
## Prepare Data

Data mining models rarely work with data straight from the source system. Depending on the model, we may need to perform one or more transformations. Typical data mining transformations include, but are not limited to:
- Replace NULL values with the column mean
- Convert nominal data types to numerical
- Discretize continuous values
- Convert target variables "Yes" and "No" to 1 and 0
The R language provides many built-in functions to perform these transformations. Several are demonstrated in the script below.
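These transformations can be sketched in base R. The toy data frame below is a hypothetical stand-in for real source data:

```r
# Toy data frame standing in for source data (hypothetical values)
df <- data.frame(
  flavor = c("cherry", "grape", "cherry"),
  score  = c(2.5, NA, 7.5),
  likeIt = c("Yes", "No", "Yes"),
  stringsAsFactors = FALSE
)

# Replace NULL/NA values with the column mean
df$score[is.na(df$score)] <- mean(df$score, na.rm = TRUE)

# Convert a nominal column to numeric dummy columns (one-hot encoding)
dummies <- model.matrix(~ flavor - 1, data = df)

# Discretize a continuous column into bins
df$scoreBand <- cut(df$score, breaks = c(0, 5, 10), labels = c("low", "high"))

# Convert target "Yes"/"No" to 1 and 0
df$likeIt <- ifelse(df$likeIt == "Yes", 1, 0)
```

The same ideas apply whatever the source: fill missing values before encoding, and encode nominal columns before handing the data to a numeric model.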
## Split Data into Training and Test Sets

We reserve a portion of the data for model evaluation. There is no hard rule for what percentage of the total data we should hold out for test. Start off with a 70/30 or an 80/20 split and adjust from there. A move in either direction involves tradeoffs.
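A 70/30 holdout can be sketched in base R (the data frame here is a toy stand-in; the demo script later uses RSNNS's `splitForTrainingAndTest` instead):

```r
set.seed(42)                       # make the split reproducible
toyData <- data.frame(x = rnorm(100), y = rnorm(100))

# Randomly hold out 30% of the rows for test
testIdx  <- sample(nrow(toyData), size = 0.3 * nrow(toyData))
testSet  <- toyData[testIdx, ]     # 30% held out for evaluation
trainSet <- toyData[-testIdx, ]    # 70% used for training
```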
## Train the Model

A model is nothing more than a metadata structure. An algorithm populates the metadata based on the training data. In theory, we should see the model error rate gradually decline the more training iterations we perform. It's appropriate to stop training once a desired error is reached or after a fixed number of iterations.
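The stopping rule can be illustrated with a toy training loop. This is not the RSNNS training algorithm, just plain gradient descent on a one-parameter linear fit; the error target and iteration cap are illustrative:

```r
# Toy data: y is roughly 2 * x plus a little noise
set.seed(1)
x <- runif(50)
y <- 2 * x + rnorm(50, sd = 0.05)

w <- 0            # the single model parameter, populated by training
maxit <- 200      # fixed cap on training iterations
targetError <- 0.05

for (i in seq_len(maxit)) {
  pred  <- w * x
  error <- mean((y - pred)^2)          # mean squared error this iteration
  if (error <= targetError) break      # desired error reached: stop early
  w <- w + 0.5 * mean((y - pred) * x)  # gradient step toward lower error
}
```

The error declines each iteration until either the target is hit or the cap runs out, which is exactly the behavior `plotIterativeError` visualizes for an RSNNS model.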
## Evaluate Model

After training, we can start making predictions on the test set. This is where the fun begins. Measuring the prediction accuracy is a good place to start. One should also analyze the model error rate, the processing time, and the compute resources required.
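Accuracy and a confusion matrix can be computed with base R alone; the actual/predicted vectors below are toy values:

```r
# Toy held-out labels and model predictions (hypothetical values)
actual    <- c(1, 0, 1, 1, 0)
predicted <- c(1, 0, 0, 1, 0)

accuracy  <- mean(predicted == actual)   # proportion predicted correctly
confusion <- table(actual, predicted)    # rows = actual, columns = predicted
```

RSNNS ships its own `confusionMatrix` helper, which the demo script uses for the same purpose.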
## R script

The script below goes through all the data flow steps mentioned in this blog post. It uses the same randomly generated data set as my previous post, located here: http://bit.ly/2umfXYH
```r
# install.packages("RODBC")
# install.packages("RSNNS")
# install.packages("scales")
# install.packages("reshape")
# install.packages("devtools")
# install.packages("ggplot2")
library(RSNNS)
library(RODBC)
library(scales)
library(reshape)
library(devtools)
library(ggplot2)

source_url('https://gist.githubusercontent.com/fawda123/7471137/raw/466c1474d0a505ff044412703516c34f1a4684a5/nnet_plot_update.r')
source_url('https://gist.githubusercontent.com/fawda123/6206737/raw/d6f365c283a8cae23fb20892dc223bc5764d50c7/gar_fun.r')

myServer   <- "."                   # local server; change to your server
myDatabase <- "classificationDemo"
myDriver   <- "SQL Server"

connectionString <- paste0(
  "Driver=", myDriver,
  ";Server=", myServer,
  ";Database=", myDatabase,
  ";trusted_connection=true"
)

conn <- odbcDriverConnect(connectionString)
my.data <- sqlQuery(conn, 'SELECT flavor, likeIt FROM dbo.candySummary')
close(conn)
my.data

## decodeClassLabels one-hot encodes the flavor column for model use
my.dataInputs  <- decodeClassLabels(my.data[, 1])
my.dataTargets <- my.data[, 2]

## Split the data into train and test sets
my.splitData <- splitForTrainingAndTest(my.dataInputs, my.dataTargets, ratio = 0.33)

my.model <- mlp(x = my.splitData$inputsTrain,
                y = my.splitData$targetsTrain,
                size = c(3),  # hidden layer size; don't worry about this parameter yet
                maxit = 200)  # training iterations, i.e. pick up a piece of candy:
                              # yes I like it, or no I don't

my.testprediction <- predict(my.model, my.splitData$inputsTest)
my.testprediction <- ifelse(my.testprediction >= 0.9, 1, 0)

my.outputtest <- cbind(my.splitData$inputsTest, my.testprediction)
colnames(my.outputtest) <- c("apple", "blueRaspberry", "cherry",
                             "grape", "watermelon", "prediction")

## Take a look at the results
confusionMatrix(my.splitData$targetsTest, my.testprediction)
plotIterativeError(my.model)

## Get a visual of the model
plot.nnet(my.model)
```