tdjr-mlstudio-theme

Machine Learning, where are we so far?

In part 1 we introduced a few basic concepts in Machine Learning (ML), along with a mildew problem affecting my orchids which ML will help us tackle.
In part 2 we continued by building upon these concepts and assembled a model in Azure Machine Learning Studio and prepared it for training.
In this, the final part, we’ll train our model, score its output, then look at how we might integrate it into a system we could use to warn us when conditions are ripe for our unwelcome visitor to reappear.

Time for some training.

The proper type and calibre of training makes all the difference. Whether my Jesuit [1] instructors, whom I’m paraphrasing, were completely correct may be debatable. In the case of Machine Learning, proper training is critical.

We left off needing only a few final tweaks before sending our model off to train, so let’s have at it.
Here’s a quick look at where we left off. If you’re building along, your model should have a very similar graph.

tdjr-mlStudio-ModelWiredUp

Ensure the Two-Class Boosted Decision Tree module is selected, then set the properties as follows:

Properties for Two-Class Boosted Decision Tree training module

Let’s take a moment to briefly review these settings.

Create trainer mode is a property shared by all of the various training modules in ML Studio. The options are Single Parameter, which we are using, and Parameter Range. Our data is tagged with a key, categories, and an indicator. Think of this organization as the arguments to a method: the categories act as the parameters and the indicator as the result, so the ML system is looking for an f(x) where the categories return the indicator for each item in the provided training dataset. We are providing fixed parameters, but in many cases one may need to provide a range of parameters, which is done by setting this property to Parameter Range and configuring a Sweep Parameters module.
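As an illustration only (ML Studio’s implementation is its own), the difference between the two trainer modes can be sketched with scikit-learn, where a fixed configuration plays the role of Single Parameter and a grid search plays the role of Parameter Range plus a Sweep Parameters module:

```python
# Hypothetical scikit-learn analogy for the two trainer modes.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Single Parameter: one fixed configuration, as in our experiment.
single = GradientBoostingClassifier(learning_rate=0.2, n_estimators=100)

# Parameter Range: sweep a grid of candidate settings and keep the best,
# much like wiring in a Sweep Parameters module.
sweep = GridSearchCV(
    GradientBoostingClassifier(),
    param_grid={"learning_rate": [0.05, 0.1, 0.2],
                "n_estimators": [50, 100]},
)
```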

Maximum number of leaves, Minimum number of samples, Learning rate, and Number of trees set bounds on the depth, breadth, and complexity permissible when searching for a solution.

Random number seed allows for repeatability between full iterations of a model. Machine learning uses randomization when trying variations; by supplying the same seed value to the randomizer, one can repeat the same “random” pattern on every run.
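For readers more at home in code, these module properties map roughly onto the hyperparameters of an open-source boosted-tree implementation. A sketch of that analogy using scikit-learn (the values shown are my understanding of ML Studio’s defaults, not a recommendation, and the mapping is approximate):

```python
from sklearn.ensemble import GradientBoostingClassifier

model = GradientBoostingClassifier(
    max_leaf_nodes=20,    # Maximum number of leaves per tree
    min_samples_leaf=10,  # Minimum number of samples per leaf node
    learning_rate=0.2,    # Learning rate
    n_estimators=100,     # Number of trees constructed
    random_state=42,      # Random number seed, for repeatable runs
)
```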

Next select the Train Model module, then on the Properties pane select Launch column selector. It should be set with the target column. If it’s not, set it as shown:

Train Model module column selector dialog

Lastly, select the Score Model module and ensure its Append score column property is checked.

tdjr-MLStudio-ScoreModelProperties

 

Fire in the hole

tdjr-MLStudio_runButton

The middle button is titled RUN. Let’s train it, RUN!

After you click RUN, Azure will queue your model for processing, then start it. As it works through the modules, a green check mark is added to each module as its work completes. The process can be halted at any time by pressing Stop. Once training has completed, the canvas will refresh and you should see something similar to this:

Training completed

Let’s see how we did. Right-click on the output of the Score Model module and select Visualize.

tdjr-MLStudio_ScoreFirstVisualize

Hmm, that’s not what you expected, is it?

tdjr-MLStudio_ScoreVisualizeSomethingsAmiss

Scored Labels? Shouldn’t we be scoring how well the model has learned to predict the target value? Yes, that should be what’s scored. In our attempt to be very specific about our data we have taught the machine to predict classifications based on the target (the indicator value), not the target itself. We have chosen to classify instead of regress.

“Experiment”

Our situation is easily rectified.

First let’s get rid of our excessive classification of data:
1. Remove two of the Metadata Editor modules, leaving only the one whose Selected Column is At Risk.
2. Remove the Convert to Indicator Values module.
3. Wire the output of the Clean Missing Data module to the input of the one remaining Metadata Editor module.
4. Wire the output of the one remaining Metadata Editor module to the input of the Split Data module.

Then, since we aren’t going to be classifying our data to any significant level, we’ll change our learning to focus on regression instead of classification:
1. Remove the Two-Class Boosted Decision Tree module.
2. Add a Boosted Decision Tree Regression module.
3. Set up the properties of the Boosted Decision Tree Regression module the same as the Two-Class Boosted Decision Tree.
4. Wire the Boosted Decision Tree Regression module’s output to the input of the Train Model module.
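In scikit-learn terms (again an analogy for illustration, not what ML Studio does internally), the swap above amounts to replacing a classifier with a regressor while keeping the same boosting settings:

```python
from sklearn.ensemble import (GradientBoostingClassifier,
                              GradientBoostingRegressor)

# Shared boosting settings, carried over from the classification setup.
settings = dict(max_leaf_nodes=20, min_samples_leaf=10,
                learning_rate=0.2, n_estimators=100, random_state=42)

classifier = GradientBoostingClassifier(**settings)  # predicts a class label
regressor = GradientBoostingRegressor(**settings)    # predicts a continuous value
```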

We’re left with a much simpler model to train.

Updated model

Finding the correct training path is often trial and error. As this example shows, ML Studio greatly eases that process, so experiment with your experiments.

Run the new, simpler model, then Visualize the score.

Scoring that makes sense

This makes a lot more sense. The data appears just as our source data, but with an additional Scored Labels column on the right. The value of Scored Labels is what the ML system predicts the target value should be.
Take the first example: the target value is false, or zero. The ML system predicts it should be -0.000248, giving us a margin of error of a few hundredths of a percent. Offhand, that sounds pretty good. But, as with checking the incoming data, we don’t want to hand-proof every result item. ML Studio provides evaluation modules to help.
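The arithmetic behind that margin is easy to check by hand (the scored value here is read off the visualization above):

```python
target = 0.0        # the actual indicator value for the first row
scored = -0.000248  # the Scored Labels value the model produced
error = abs(scored - target)
print(f"absolute error: {error:.6f}, about {error * 100:.4f}% of the 0-1 range")
```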
Add an Evaluate Model module to the canvas, then wire its input to the output of the Score Model module.

tdjr-MLStudio-EvaluateModelWireup

Run the model again.

…lies, damn lies, and statistics[2]

Right-click on the output of the Evaluate Model module, select Visualize, and we are presented with a statistical overview of how well our trained model predicts outcomes in our dataset.
An old professor of mine drilled into us, “statistics don’t provide information, but they may provide insight into information.” True or not, it is worthwhile to keep in mind.
Mean Absolute Error – how close, on average, predictions were to actual outcomes. Our model is very close here.
Root Mean Squared Error – a characterization of the overall error in the model, somewhat akin to using standard deviation to characterize a mean.
Relative Absolute Error – the absolute difference between expected and predicted values, normalized against the absolute deviation of the actual values.
Relative Squared Error – another normalization, this time using the square of the error between actual and predicted values.
Coefficient of Determination – also known as R². A value from 0 to 1 used to indicate the overall ability of the model to predict outcomes; 1 is perfect and 0 is no better than random. Useful, but with care.
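For the curious, these statistics are straightforward to compute yourself. A minimal sketch of the standard formulas (my own implementation, not ML Studio’s code):

```python
import numpy as np

def evaluate(actual, predicted):
    """Compute the regression statistics reported by an evaluation step."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    err = actual - predicted       # per-row prediction error
    dev = actual - actual.mean()   # deviation of actuals from their mean
    rse = np.sum(err ** 2) / np.sum(dev ** 2)
    return {
        "Mean Absolute Error": np.mean(np.abs(err)),
        "Root Mean Squared Error": np.sqrt(np.mean(err ** 2)),
        "Relative Absolute Error": np.sum(np.abs(err)) / np.sum(np.abs(dev)),
        "Relative Squared Error": rse,
        "Coefficient of Determination": 1 - rse,  # R-squared
    }
```

Note that the coefficient of determination is just one minus the relative squared error, which is why those two numbers always move together.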

Taken as a whole, our new model does a pretty fair job. Statistically, the error rates are low and the overall predictability is within the 90th percentile.

Evaluation Result

Doing something useful with it

Let’s put our model to use by converting it to a web service to which we’ll provide daily temperature/GDD data; in return we’ll find out whether we’re at risk and should take action.

After successfully training our model, a new action is available: SET UP WEB SERVICE.

tdjr-MSStudio_SetupWebServiceButton

Selecting Set Up Web Service | Predictive Web Service will automate much of the process. Let’s do it.
After a few minutes our model is transformed into a new Experiment that uses our trained model.

Web Service transformation

Run the new Experiment and let it complete.
Select Deploy Web Service and let it complete.
And we find ourselves at the new Web Service’s Dashboard.

tdjr-MLStudio_WSDashboard_1

From here we can easily connect our daily temperature tracking (and GDD derivation) to make a simple call once a day and alert us if we’re at risk.
Once my MeWatts show up, I’ll even set the alert to trigger the MeWatts to flush and ventilate automatically.
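A sketch of what that once-a-day call might look like from Python. Everything here is a placeholder for illustration: the endpoint URL, API key, and column names all come from your own deployed service’s dashboard, and the request body follows the classic ML Studio request/response pattern as I understand it:

```python
import json
import urllib.request

# Placeholders: copy the real values from your web service's dashboard.
ENDPOINT = "https://example.azureml.net/workspaces/<id>/services/<id>/execute?api-version=2.0"
API_KEY = "<your-api-key>"

def build_payload(temp_f, gdd):
    """Package today's readings in the shape the service expects."""
    return {
        "Inputs": {
            "input1": {
                "ColumnNames": ["Temp", "GDD"],  # hypothetical column names
                "Values": [[str(temp_f), str(gdd)]],
            }
        },
        "GlobalParameters": {},
    }

def check_risk(temp_f, gdd):
    """POST today's readings and return the service's scored response."""
    req = urllib.request.Request(
        ENDPOINT,
        data=json.dumps(build_payload(temp_f, gdd)).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Authorization": "Bearer " + API_KEY},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)  # the Scored Labels value is in here
```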

What all have we covered?

An odd scenario with mildew on orchids; the culprit belongs to a class of diseases called powdery mildews which affect many crops.
Some basic concepts of what Machine Learning is and how it conceptually differs from other methods of solving problems with software.
Assembling some data to use for training.
Getting that data into Azure Machine Learning Studio and assuring the data is in good shape.
Ways of characterizing the data to make it more useful in training.
Training Machine Learning methods with our data.
Understanding that blind alleys and dead ends are sometimes part of the process and Azure Machine Learning Studio makes dealing with those straightforward.
Scoring and evaluating a trained model to get a sense of its ability to predict.
Publishing the trained model as a web service so we and others can make use of its learning and newfound ability to predict.

Thank you for sticking with this long blog post. I hope you’ve found it valuable and informative.
Azure Machine Learning Studio is a fabulous tool, play with it. It’s addictive.

Footnotes
1. Men astutely trained in letters and fortitude.
2. Popularized by Mark Twain, who attributed it to the great British PM Benjamin Disraeli. However, Twain’s attribution is likely inaccurate, as the line doesn’t occur in any of Disraeli’s writings and was unknown until well after his death.