Photo by Caleb George / Unsplash

UPDATE: This blog post was part of presentation delivered 03/27. The full code demo is located at the bottom of this post.

So there I was, working late into the night trying to find the best configurations for my predictive model. I had already scripted out a simulation in hopes of finding the most optimal model for the problem. It seemed like a good idea at the time, however I quickly learned that my laptop could not handle the workload efficiently. After letting the simulations run for 20 hours, I decided that it was time to find an alternative approach. Within a couple of google searches I found a solution that cut processing time down to 20 minutes.

Before I reveal the specifics of this solution I'll set the stage for a toy problem. It's one commonly encountered in elementary statistics: What is the probability of flipping a fair coin 10 times and getting heads exactly 10 times? The probability of 10 heads in 10 flips is computed by taking the probability of one coin flip landing heads (50%) raised to the number of trials (10).

(1) 0.510 = 0.0009766

If I wanted to witness 10 heads in 10 trials then I would need to perform about 1000 iterations of the 10 flip trial, on average. We can simulate what this would look like in R using the SAMPLE function on a vector defined as c(1,0) where 1 = heads and 0 = tails.


coin <- c(1, 0)

flips <- 10

sample(coin, flips, replace = TRUE)

Running this code will return a vector of 10 elements, each of which has a 50% chance of head. In the screenshot below we landed six heads in 10 attempts.

p1

We'd usually get four to six heads, with the occasional three or seven, as seen in the screenshot below. Outside those ranges we might start to think the coin was rigged.

2

Now what I really want to see is a 10 out of 10 occurrence at least once. I'm pretty confident we will see a 10 out of 10 occur in 3000 attempts.


coin <- c(1, 0)
n <- 3000
flips <- 10

x<- colSums(matrix(sample(coin,  n * flips, replace = TRUE)
       ,nrow = flips))

x[x==10]

In fact, we find three occurrences of a perfect 10 heads.

n1

10 heads out of 10 trials is neat, but what about 20 out of 20 attempts? Going back to the first equation gives us the probability.

.520 = 0.0000009537

We would need to run a little over one million simulations before we could expect to see 20 out of 20 heads. I want to add a little cushion, so we'll run three million. How does my laptop handle this? It hardly breaks a sweat, finishing in 2.26 seconds.


coin <- c(1, 0)
n <- 3000000
flips <- 20

starttime <- Sys.time()
x<- colSums(matrix(sample(coin,  n * flips, replace = TRUE)
       ,nrow = flips))
endtime <- Sys.time()
endtime - starttime

x[x==20]

We observe four occurrences of 20 heads out of 20 flips.

n2

As the number of flips witin a trial increases the probability of seeing heads for every flip decreases exponentially. I might see 30 heads out of 30 flips once in a billion attempts. If it took 2.26 seconds to run three million simulations then three billion should finish in 38 minutes, assuming a linear scale.

Enter the Alternative Solution
Instead of waiting 38 minutes I'll take a shortcut and demo the "alternative approach" I alluded to earlier. The doAzureParallel package enables me to run my simulation against a cluster provisioned in Azure. The steps for setting up the cluster are well documented on github - https://github.com/Azure/doAzureParallel.

For demo purposes, I'm spinning up 20 total VMs with two vcpu and 16 GB each. The last I checked, a user could spin up over 16,000 nodes.

cluster

With my cluster ready, I can send the simulation workload to the cloud using the foreach package. Specifying %do% runs the simulations locally against my laptop and specifying %dopar% runs the simulations remote against the cluster.


library(doAzureParallel)
coin <- c(1, 0)
flipsper <- 30
trials <- 18750000  # Any more trials and VMs cannot allocate enough memory for vector sizes
iter <- 160

starttime <- Sys.time()
x <- foreach(i = 1:iter, .combine = 'c') %dopar% {

 y <-   colSums(matrix(sample(coin, trials * flipsper, replace = TRUE), nrow = flipsper))

    y[y == flipsper]

}
endtime <- Sys.time()
endtime - starttime

Nodes actively running code turn green. Once a node has finished executing it is automatically assigned a new task from the queue.

ActiveNodes

The simulations took a little over four minutes. In that time I got three perfect 30 out 30 heads.

finaltime

Closing Thoughts
Data Science teams have an increasing need for powerful algorithms and massive data that go beyond toy problems and coin flips. They need massive computer infrastructure to test ideas or run big data apps. Azure Batch helps teams implement these solutions and overcome infrastructure challenges. If the probability of finding the right answer is one in a billion then we should be able to find the right answer in four minutes1.

UPDATE: Presentation code.


coin <- c(1, 0) # 1 = Heads, 0 = Tails
trials <- 3000
flipsper <- 10
#sample(coin,1)

x <- colSums(matrix(sample(coin, trials * flipsper, replace = TRUE)
       , nrow = flipsper))

length(x)

x[x == 10]

1 / .5 ^ 20

coin <- c(1, 0)
trials <- 3000000
flipsper <- 20

starttime <- Sys.time()
x <- colSums(matrix(sample(coin, trials * flipsper, replace = TRUE), nrow = flipsper))
endtime <- Sys.time()
endtime - starttime

x[x == 20]

hist(x)

1 / .5 ^ 30

rm(list = ls())
gc()
coin <- c(1, 0)
trials <- 3000000000 #50000000 #
flipsper <- 30
starttime <- Sys.time()
x <- colSums(matrix(sample(coin, trials * flipsper, replace = TRUE), nrow = flipsper))
endtime <- Sys.time()
endtime - starttime
x[x == 30]

library(doAzureParallel)


setwd(set your directory here)

generateClusterConfig("pool_config.json")
generateCredentialsConfig("credentials.json")

setCredentials("credentials.json")

pool <- makeCluster("pool_config.json")

registerDoAzureParallel(pool)


rm(list = ls())
gc()
coin <- c(1, 0)
flipsper <- 30
trials <- 18750000
iter <- 160

starttime <- Sys.time()

x <- foreach(i = 1:iter, .combine = 'c') %dopar% {
    y <- colSums(matrix(sample(coin, trials * flipsper, replace = TRUE), nrow = flipsper))
    y[y == flipsper]
}
endtime <- Sys.time()
endtime - starttime
x[x == 30]



stopCluster(pool) #end

1. It’s possible your results will differ. Many variables impact the performance and this is by no means intended to be a literal benchmark.