In part one we introduced a few basic concepts in Machine Learning (ML), along with a mildew problem affecting my orchids which ML will help us tackle.

ML Training Wheels with a Blower a.k.a. Azure Machine Learning Studio

With Azure, Microsoft has introduced a capable platform for creating, training, and consuming Machine Learning. Alongside it they have added an excellent visual modeling tool, Azure Machine Learning Studio, which greatly eases getting started yet will continue to serve one quite capably far beyond the basics.

Head over to Get started with Azure Machine Learning (it’s free for 30 days).
Complete the first three steps and I’ll meet you there.
Once you’ve completed sign up, you’ll find yourself at Machine Learning Studio.
When you’re ready to continue, select New then Blank Experiment.

[Screenshots: ML Studio landing page; New, Blank Experiment]

At last, Machine Learning Studio itself

The layout and organization are familiar:

  1. a palette of tools (modules, in ML Studio parlance) is to the left,
  2. properties for the currently selected module are edited to the far right,
  3. a canvas in the center for placing modules and wiring them together.

Providing data.

At the heart of all Machine Learning is data. First up is providing ML Studio some data to work with. While my opening narrative is quite accurate, I will be using weather data I retrieved from NOAA’s Climate Data Online service, to which I added GDD calculations before saving the result as a CSV. If you haven’t already retrieved a copy of my data you can get it here.
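For anyone wanting to build a similar file from their own readings, here is a rough sketch of the GDD step. It is an illustration under assumptions, not the exact calculation behind my CSV: the 50°F base temperature is a placeholder, and treating GDD-3Day through GDD-6Day as trailing sums of daily GDD is one plausible reading of those column names. The sample readings are made up.

```python
from datetime import date

BASE_TEMP_F = 50.0  # assumed base temperature; pick the one that fits your crop


def daily_gdd(min_f, max_f, base_f=BASE_TEMP_F):
    """Growing degree days for one day: mean temperature above the base, floored at 0."""
    return max(0.0, (min_f + max_f) / 2.0 - base_f)


def add_gdd_columns(rows, windows=(3, 4, 5, 6)):
    """Add GDD plus trailing N-day GDD sums to rows already sorted by date."""
    gdds = [daily_gdd(r["DailyMinF"], r["DailyMaxF"]) for r in rows]
    out = []
    for i, row in enumerate(rows):
        new_row = dict(row, GDD=gdds[i])
        for n in windows:
            start = max(0, i - n + 1)  # shorter window at the start of the data
            new_row[f"GDD-{n}Day"] = sum(gdds[start:i + 1])
        out.append(new_row)
    return out


# Made-up readings, for illustration only.
sample = [
    {"Index": date(2010, 1, 1), "DailyMinF": 40.0, "DailyMaxF": 70.0},
    {"Index": date(2010, 1, 2), "DailyMinF": 50.0, "DailyMaxF": 80.0},
    {"Index": date(2010, 1, 3), "DailyMinF": 30.0, "DailyMaxF": 50.0},
]
for row in add_gdd_columns(sample):
    print(row["Index"], row["GDD"], row["GDD-3Day"])
```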

Importing data into ML Studio is little or no different than other programs: select a local file to upload, add a description, then wait for it to upload.

View Importing Data, detailed steps


1. Go to DATASETS and select NEW

2. Select the local file to upload

3. Be sure to give it a decent name

4. And after a brief wait, your data is ready

Time to start modeling

We’ll start by adding our new DataSet to a new Model, then we’ll take a quick “visualization” peek at our data, and finish up by connecting a Clean Missing Data module to our DataSet. Large sets of data often have low levels of missing or erroneous data. The Clean Missing Data module will help to correct any missing data in our small DataSet.

See detailed steps of adding data to our model


1. Drag our newly created DataSet onto the canvas


2. Take a quick peek at our data: right click on the DataSet and select Visualize


3. Drag a Clean Missing Data module to the canvas


4. Connect the output of our DataSet to the input of Clean Missing Data

So, just how good is our source data?

Let’s take a quick peek at our data again by right clicking on the ML Blog data DataSet module then picking DataSet | Visualize.

It seems we have no temperature readings for 2010-01-08. Where there’s one piece of missing data there may be more, but we don’t really want to manually proof the data line by line. We will let the Clean Missing Data module take care of this row and any others with missing data. There are many options for handling missing data, ranging from simple solutions such as deleting problematic rows to sophisticated statistical algorithms that determine best fit candidates. If the available choices don’t cover one’s needs it’s straightforward to create custom methods, typically with Python.

For our current issue, the occasional missing bit of data, we’ll fill in the blanks using MICE, with no cheese required. MICE (Multivariate Imputation by Chained Equations) imputes replacement values statistically: each column with missing values is modeled from the other columns, and that model supplies the replacements. For the curious, MICE is based on work by Schafer and Graham, 2002.
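To give a feel for the chained-equations idea, here is a heavily simplified sketch. It is not what ML Studio runs: it handles only two columns, uses plain least squares, and produces a single deterministic fill, where real MICE implementations draw multiple imputations with randomness. The function names and the sample readings are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return mean_y - slope * mean_x, slope


def chained_impute(rows, columns, rounds=10):
    """Fill None values by regressing each of two columns on the other, in turn."""
    a, b = columns
    missing = {c: [i for i, r in enumerate(rows) if r[c] is None] for c in columns}
    filled = [dict(r) for r in rows]
    # Start from column means so every regression has complete inputs.
    for c in columns:
        observed = [r[c] for r in rows if r[c] is not None]
        for i in missing[c]:
            filled[i][c] = sum(observed) / len(observed)
    for _ in range(rounds):
        for target, predictor in ((a, b), (b, a)):
            if not missing[target]:
                continue
            # Fit only on rows where the target was actually observed.
            train = [r for i, r in enumerate(filled) if i not in missing[target]]
            intercept, slope = fit_line([r[predictor] for r in train],
                                        [r[target] for r in train])
            for i in missing[target]:
                filled[i][target] = intercept + slope * filled[i][predictor]
    return filled


# Made-up readings with one missing minimum temperature.
readings = [
    {"DailyMinF": 40.0, "DailyMaxF": 60.0},
    {"DailyMinF": 50.0, "DailyMaxF": 70.0},
    {"DailyMinF": None, "DailyMaxF": 65.0},
    {"DailyMinF": 30.0, "DailyMaxF": 50.0},
]
print(chained_impute(readings, ("DailyMinF", "DailyMaxF"))[2])
```

In this toy data the minimum always sits 20 degrees below the maximum, so the missing minimum is filled from the day's maximum rather than from the column average.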

Make it so! (arguably my least favorite line in all of television)

First ensure the Clean Missing Data module is selected. In the Properties panel on the far right select Replace using MICE for Cleaning mode, Propagate for Cols with all missing values, and check Generate missing values.

You might be wondering why the Clean Missing Data module has only one input, but two outputs. Output 1 emits the results of the cleaning operations. Output 2 emits the cleaning transformation used by the module; the transformation can be saved and/or applied to additional DataSets, which is useful for consistency. With or without Ralph Waldo Emerson and his hobgoblins, it is a useful option to have 🙂

The path from data to information

For our problem today we will train a model using supervised methods. In a nutshell, we are going to provide the ML system data composed of the source conditions (mathematically, factors; in ML, categories) tied to their actual results, called indicators. The ML system will learn a way to reliably predict indicator values based on category data. If you’ve done some linear programming the paradigm will feel somewhat familiar, with the indicator taking the role of the objective function.

Time to tell the ML System about its data by classifying it into categories and an indicator. I’ve marked up our data definition to illustrate. The ML System should learn to determine Risk Present, marked /* indicator */, based on some or all of the seven fields marked /* category */.

To state in simple English, the ML System should learn to determine Risk Present based on some or all of DailyMinF, DailyMaxF, GDD, GDD-3Day, GDD-4Day, GDD-5Day, GDD-6Day.
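If it helps to see that division in code, here is a small sketch. The seven category column names come straight from the list above; I am assuming the indicator column is the one named At Risk in the file, and that values have already been parsed into numbers. The example row is made up.

```python
CATEGORY_COLUMNS = [
    "DailyMinF", "DailyMaxF", "GDD",
    "GDD-3Day", "GDD-4Day", "GDD-5Day", "GDD-6Day",
]
INDICATOR_COLUMN = "At Risk"  # assumed column name for the indicator


def split_row(row):
    """Return (category values, indicator value) for one row of the data."""
    categories = [float(row[name]) for name in CATEGORY_COLUMNS]
    indicator = bool(row[INDICATOR_COLUMN])
    return categories, indicator


# A made-up row, for illustration only.
example = {
    "Index": "2010-01-09", "DailyMinF": 41, "DailyMaxF": 63, "GDD": 2.0,
    "GDD-3Day": 5.0, "GDD-4Day": 7.5, "GDD-5Day": 9.0, "GDD-6Day": 12.0,
    "At Risk": 0,
}
print(split_row(example))
```

The Index column is deliberately left out of both groups: it identifies the row but carries nothing the model should learn from.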

Let’s do it

I am sure you’re way ahead of me here. We typically refer to information which tags data as metadata. And, no surprise, this is exactly the term ML Studio uses. We have three types of data here: a key, which is the datetime uniquely identifying a related set; a variety of temperature related factors; and one indicator, At Risk.

We’ll use three Metadata Editor modules to separate the data into three groups then we’ll mark the Indicator.

Under Data Transformation | Manipulation select and drag three Metadata Editors onto the canvas then wire them sequentially to Output 1 of the Clean Missing Data module.


Order doesn’t matter here; nonetheless, we’ll start with our Indicator, At Risk.

Ensure the first Metadata Editor is selected, then on the Properties panel select Launch column selector. Check Allow duplicates and preserve column order in selection (this will make a little more sense later). Then, since we want to select one column out of the existing set, make sure Begin With is set to No columns. In the next line select Include from the first drop down, column names from the second, and add the At Risk column to the edit field.

Then, back in Properties, set the Data type to Boolean, Categorical to Make categorical (we’ll adjust later), Fields to Label, and for convenience we’ll use a new column name of target.



You probably have a pretty good idea how we’ll categorize the other two sets so give it a go.  I’ll include images of the settings I used which you can compare against.

See Metadata details


[Screenshots: Metadata Editor properties for the factors; Metadata Editor properties for the index]

To review, we’ve added our data, provided some ability to compensate for missing values in our data, and separated our data into three groups: Index, At Risk (also target), and everything else.

Time to be explicit about the indicator.

First find a Convert to Indicator Values module by entering “Convert to ind” in the Search experiment items search box.

Drag a Convert to Indicator Values module onto the canvas and connect the output of the last Metadata editor to its input.
Ensure the Convert to Indicator Values module is selected and head on over to the Properties palette.
Select Launch column selector and select only the target column.

On the Properties tab ensure Overwrite categorical is unchecked.
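Roughly speaking, converting to indicator values turns one categorical column into a separate 0/1 column for each distinct value. Here is a sketch of that idea; the generated column names and the `overwrite` flag are my own stand-ins, and ML Studio’s exact naming and behavior may differ.

```python
def to_indicator_columns(rows, column, overwrite=False):
    """Add a 0/1 column per distinct value of `column`.

    With overwrite=False the original column is kept next to the new ones,
    which is how I read leaving Overwrite categorical unchecked.
    """
    values = sorted({row[column] for row in rows}, key=str)
    converted = []
    for row in rows:
        new_row = dict(row)
        if overwrite:
            del new_row[column]
        for value in values:
            # Hypothetical naming scheme: "<column>-<value>"
            new_row[f"{column}-{value}"] = 1 if row[column] == value else 0
        converted.append(new_row)
    return converted


# Made-up rows, for illustration only.
rows = [
    {"Index": "2010-01-09", "target": True},
    {"Index": "2010-01-10", "target": False},
]
for row in to_indicator_columns(rows, "target"):
    print(row)
```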

We are ready to start training our model.

The first step is to split our data into two discrete sets: one the model will use to determine a predictive solution, and the other we’ll use to see how well its predictive solution works out.

The module which does this is, no surprise, named Split Data. Add one to the canvas then wire the output of the Convert to Indicator Values module to its input.

You’ll notice that, like the Clean Missing Data module, it has one input and two outputs; here one output is for training and one for comparison. Comparison assesses how well the model predicts and results in a “score” indicating its ability to predict.

Let’s go ahead and add two more modules to the canvas which should help if my description isn’t clear.

Add a Train Model module (we’re getting there) and a Score Model module and wire them in like so:


We take our data and split it into two parts. The model uses one part in training; then we take the result of training and compare it against the other part to see how well it works.
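The same train-then-compare flow, shrunk down to a toy in code. The “model” here is nothing more than a single cut-off on GDD, a stand-in I made up to keep the example short; the data is invented too, and the score is simple accuracy rather than whatever ML Studio reports.

```python
def train_threshold(rows, feature="GDD", label="target"):
    """Pick the cut-off on one feature that misclassifies the fewest training rows."""
    candidates = sorted({r[feature] for r in rows})

    def errors(threshold):
        return sum((r[feature] >= threshold) != r[label] for r in rows)

    return min(candidates, key=errors)


def score(threshold, rows, feature="GDD", label="target"):
    """Fraction of held-out rows whose label the cut-off predicts correctly."""
    hits = sum((r[feature] >= threshold) == r[label] for r in rows)
    return hits / len(rows)


# Made-up rows: one set to learn from, one set the model never sees in training.
training = [
    {"GDD": 2.0, "target": False}, {"GDD": 5.0, "target": False},
    {"GDD": 12.0, "target": True}, {"GDD": 15.0, "target": True},
]
held_out = [
    {"GDD": 3.0, "target": False}, {"GDD": 14.0, "target": True},
    {"GDD": 11.0, "target": True},
]
threshold = train_threshold(training)
print(threshold, score(threshold, held_out))
```

The cut-off fits the training rows perfectly yet misses one held-out row, which is exactly why we keep some data aside for comparison.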

Data can be split many ways; much good ink is spilled on how and when to split data in certain manners. In our case our data is sequential, and we’re guessing the sequence is important. So, we’ll split it by date, with the first three years to be used for training and the last two for scoring. The Split Data module provides many options, including simple regular expressions (no lookbacks, for example) and relative expressions, so we’ll use the relative expression \"Index" > 12-31-2013 on the Properties tab of the Split Data module to split our data.
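Outside ML Studio, the same date-based split is only a few lines. This is a sketch with made-up rows; the Index column name and the cut-off date are taken from the expression above, and the function name is mine.

```python
from datetime import date

CUTOFF = date(2013, 12, 31)


def split_by_date(rows, cutoff=CUTOFF, key="Index"):
    """Return (training, scoring): rows up to the cutoff, then rows after it."""
    training = [r for r in rows if r[key] <= cutoff]
    scoring = [r for r in rows if r[key] > cutoff]
    return training, scoring


# Made-up rows straddling the cutoff.
rows = [
    {"Index": date(2013, 12, 30), "GDD": 0.0},
    {"Index": date(2013, 12, 31), "GDD": 1.5},
    {"Index": date(2014, 1, 1), "GDD": 0.0},
]
training, scoring = split_by_date(rows)
print(len(training), len(scoring))
```

Splitting on a date rather than at random keeps the model from peeking at days that fall after the ones it is later scored on.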


Looking at our data, having a bit of a numeric background would give reason to believe a solution could be found in some form of regression. Since our indicator is a yes-or-no value, though, we’ll start with what Microsoft calls a Two-Class Boosted Decision Tree. A decision tree challenges the data with a series of questions in an attempt to determine and classify important features of the data; boosting refers to building many trees in sequence, each concentrating on the errors of those before it, and combining them into a stronger whole. It is popular for a variety of biological problems. Had we not provided any GDD accumulation information we might have chosen a far more complex K-Means Clustering to start and progressed from there. From the density of jargon in these last few sentences it should be no surprise that a blog series covering just the surface of available training methods could run several months.
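To take a little of the mystery out of “boosted”, here is a bare-bones sketch of one classic boosting scheme, AdaBoost, over one-question trees (stumps). I am swapping this in because it is short enough to read in one sitting; it is not the algorithm or the settings behind ML Studio’s module, and the data is made up. Labels are written as +1 and -1.

```python
import math


def best_stump(xs, ys, weights):
    """Find the single-feature threshold rule with the lowest weighted error."""
    best = None
    for f in range(len(xs[0])):
        for t in sorted({x[f] for x in xs}):
            for sign in (1, -1):
                preds = [sign if x[f] >= t else -sign for x in xs]
                err = sum(w for p, y, w in zip(preds, ys, weights) if p != y)
                if best is None or err < best[0]:
                    best = (err, f, t, sign)
    return best


def train_boosted(xs, ys, rounds=3):
    """AdaBoost: each round re-weights the rows the previous stump got wrong."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, f, t, sign = best_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # this stump's say in the vote
        ensemble.append((alpha, f, t, sign))
        preds = [sign if x[f] >= t else -sign for x in xs]
        weights = [w * math.exp(-alpha * y * p)
                   for w, y, p in zip(weights, ys, preds)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    """Weighted vote of all stumps; returns +1 or -1."""
    vote = sum(alpha * (sign if x[f] >= t else -sign)
               for alpha, f, t, sign in ensemble)
    return 1 if vote >= 0 else -1


# Made-up one-feature data that no single threshold can separate.
xs = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
ys = [1, 1, -1, -1, 1, 1]
ensemble = train_boosted(xs, ys, rounds=3)
print([predict(ensemble, x) for x in xs])
```

No single stump gets this toy data right, but three of them voting together do, which is the whole appeal of boosting.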

So let’s add a Two-Class Boosted Decision Tree and wire it up. Your model should look something like this:


Feel free to play with the model.

In part 3 we’ll train the model, score its output, then look at how we might integrate the model into a system we can use to warn us when conditions are right for our unwelcome visitor to reappear.