Friday, March 19, 2021

Why you should use ML.NET for your next machine learning project

 

If you have spent some time in the ML space, you already know that the de facto language for ML tasks is Python. Because the most popular libraries for artificial intelligence, such as scikit-learn, spaCy and TensorFlow, are written for Python, it is hard not to pick Python for ML tasks.

But what if I told you that you can still use those models, pre-built in Python, and a lot more in a Microsoft .NET environment?


WHY ML.NET?

C# on .NET is often the base language for large-scale systems. That becomes a problem if we want to use ML in such an environment without setting up a separate ML system.


In other words, you would more or less have to set up a separate machine learning pipeline: an OS, a Python environment, training and validation, iteration of models, deployment to a server, and an API for accessing the results.


Doing it all in C# makes this (almost) a breeze. We can have the full ML cycle, from data import through training to serving predictions, in the same environment. (All of this is of course possible in Python as well, but here we want to stay connected to an existing .NET system.)







SPEED

Another important factor is speed. Handling large amounts of data in C# can be up to 5 times faster than in Python. I am not only talking about the GPU crunching part, but also about the pre- and post-processing of data and other parts where large amounts of data are handled.








 

COMPATIBILITY


ML.NET works well with your existing ML models and is compatible with other ML frameworks thanks to ONNX.
So if you already have a model created with TensorFlow in Python, you can convert it to ONNX and start using it in ML.NET.

More on ONNX here: ONNX.AI

 

RUNS ANYWHERE

 

Because it is based on .NET 5, it runs on any of the main cloud providers (GCP, AWS and Azure) as well as locally on Windows, Linux and Mac.

 

 


INTEGRATION TO EXISTING .NET CODEBASE

 

As both a C# and a Python programmer, I enjoy using Python notebooks to develop and iterate on ML models. But detaching a model from the notebook and converting it for use in production systems is not always convenient. Keeping everything in a single C# environment makes things easier for both us and our customers: we can develop production ML capabilities much faster than by staying purely Pythonic. You can always break the model out into a separate microservice with a custom API if you need to connect to other systems or languages.

 

 

 

C# IS A COMPLETE LANGUAGE FOR MACHINE LEARNING

 

The ML.NET framework offers most of the techniques and algorithms found in other ML-friendly languages such as R and Python. Ready-made implementations exist for many of the classical ML algorithms and frameworks (such as regression, LightGBM, NLP, TensorFlow and a lot more).

 
 

A SIMPLE EXAMPLE

 

As a comparison, I will show how to use ML.NET with C# alongside similar code for Python.

 

I will showcase different supervised ML algorithms for regression in both environments.

(Regression is a common ML technique used to predict numeric values, for example in sales forecasting or house price estimation.)


It will be a simple walk-through of the needed steps.

The flow will be as follows, for both C# and Python:

 

 

  1. Data Import

  2. Data Preparation

  3. Model training

  4. Evaluation
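
As a rough illustration of these four steps, here is a minimal sketch in plain Python. It is not the code used for the comparison in this post: the tiny inline CSV, the single feature column `x`, and the helper names (`load_rows`, `prepare`, `fit_line`, `mean_absolute_error`) are all made up for this example, and the model is an ordinary least-squares line rather than one of the regressors evaluated here.

```python
import csv
import io

# Hypothetical toy data standing in for a real CSV file (y = 2x + 1).
CSV_TEXT = """id,x,target
1,0.0,1.0
2,1.0,3.0
3,2.0,5.0
4,3.0,7.0
5,4.0,9.0
6,5.0,11.0
"""

def load_rows(text):
    """Step 1, data import: parse CSV text into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))

def prepare(rows):
    """Step 2, data preparation: drop the id and convert strings to floats."""
    xs = [float(r["x"]) for r in rows]
    ys = [float(r["target"]) for r in rows]
    return xs, ys

def fit_line(xs, ys):
    """Step 3, model training: least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mean_absolute_error(ys, preds):
    """Step 4, evaluation: average absolute difference."""
    return sum(abs(y - p) for y, p in zip(ys, preds)) / len(ys)

xs, ys = prepare(load_rows(CSV_TEXT))
# Train on the first four rows, evaluate on the last two.
slope, intercept = fit_line(xs[:4], ys[:4])
preds = [slope * x + intercept for x in xs[4:]]
mae = mean_absolute_error(ys[4:], preds)
print(f"slope={slope:.2f} intercept={intercept:.2f} MAE={mae:.2f}")
```

The real examples replace each function with a library call, but the order of the steps stays the same.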

 

 

Traditionally, and greatly simplified, after the first evaluation (step 4) you would go back and tweak the model (step 3), iterating until you are satisfied. After that you would deploy the model so that it can be served to other systems.

Serving a trained model is quite easy in ML.NET, and you can add it to your code, API, or web application, but that will be covered in a future post.


The data is borrowed from Kaggle, so if you want to repeat the examples, head over to https://www.kaggle.com/c/tabular-playground-series-jan-2021 and download it.

Then place the extracted data in a subfolder called 'tabular-playground-series-jan-2021'.


The Kaggle training data consists of an id column, 14 columns with decimal data (cont1 to cont14) and a final column which is the result (target in the code).

The training set consists of 300,000 rows and the test set of 200,000 rows.

 

Here is a snapshot of a few rows of the data we will be using:


id   cont1     cont2     cont3     cont4     cont5     cont6     ...  cont10    cont11    cont12    cont13    cont14    target
1    0.670390  0.811300  0.643968  0.291791  0.284117  0.855953  ...  0.779418  0.921832  0.866772  0.878733  0.305411  7.243043
3    0.388053  0.621104  0.686102  0.501149  0.643790  0.449805  ...  0.432632  0.439872  0.434971  0.369957  0.369484  8.203331
4    0.834950  0.227436  0.301584  0.293408  0.606839  0.829175  ...  0.823312  0.567007  0.677708  0.882938  0.303047  7.776091
5    0.820708  0.160155  0.546887  0.726104  0.282444  0.785108  ...  0.580843  0.769594  0.818143  0.914281  0.279528  6.957716
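
To make the column layout concrete, the sketch below picks out the feature columns by name, which is the same idea as the `id`/`target` filter in the C# code. The two sample rows are shortened, made-up stand-ins with only three `cont` columns (the real file has 14), and `split_columns` is a hypothetical helper name.

```python
import csv
import io

# Shortened, made-up sample: the real train.csv has cont1..cont14.
SAMPLE = """id,cont1,cont2,cont3,target
1,0.67,0.81,0.64,7.24
3,0.39,0.62,0.69,8.20
"""

def split_columns(fieldnames):
    """Return (feature column names, target column name), skipping the id."""
    features = [name for name in fieldnames if name not in ("id", "target")]
    return features, "target"

reader = csv.DictReader(io.StringIO(SAMPLE))
feature_names, target_name = split_columns(reader.fieldnames)

X, y = [], []
for row in reader:
    # Every feature value arrives as a string and is converted to float.
    X.append([float(row[name]) for name in feature_names])
    y.append(float(row[target_name]))

print(feature_names)  # ['cont1', 'cont2', 'cont3']
print(X[0], y[0])
```

Selecting by name rather than by position keeps the code working if columns are added or reordered.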


 


  
 
//C# version, using ML.NET
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;
using Microsoft.ML.Trainers;
using Microsoft.ML.Trainers.FastTree;
 
 
namespace regression
{
    class Program
    {
 
        //Uses the same dataset as the Python version. Walk up from the build output folder to the folder that holds the data.
        //Path.Combine keeps the relative path valid on Windows, Linux and Mac.
        private static readonly string DataFolder = Path.Combine(Path.GetFullPath(Path.Combine("..", "..", "..", "..", "..")), "tabular-playground-series-jan-2021");
        private static readonly string TrainingDataPath = Path.Combine(DataFolder, "train.csv");
        private static readonly string TestingDataPath = Path.Combine(DataFolder, "test.csv");
 
        static void Main(string[] args)
        {
            Console.WriteLine("************** Hello ML.Net! ***************");
 
 
            // Initialize the ML.NET environment
            var mlContext = new MLContext();
 
            //Load data with the IDataView interface (forward read-only cursor); ModelInput describes the columns
            IDataView trainingDataView = mlContext.Data.LoadFromTextFile<ModelInput>(
                                                        path: TrainingDataPath,
                                                        hasHeader: true,
                                                        separatorChar: ','
                                                        );
 
            IDataView validateDataView = mlContext.Data.LoadFromTextFile<ModelInput>(
                                                       path: TestingDataPath,
                                                       hasHeader: true,
                                                       separatorChar: ','
                                                       );
 
            //Similar to head() in Python
            var data = trainingDataView.Preview(maxRows: 5);
 
            //Create a validation split: hold out 40% of the rows, matching train_size=0.60 in the Python version
            var split = mlContext.Data.TrainTestSplit(trainingDataView, testFraction: 0.40, seed: 0);
            var trainData = split.TrainSet;
 
 
            //Load list of regressors to evaluate
            var regressors = new List<ITrainerEstimator<ISingleFeaturePredictionTransformer<object>, object>>()
                {
                    mlContext.Regression.Trainers.FastForest(new FastForestRegressionTrainer.Options { LabelColumnName = "target", NumberOfTrees = 50, FeatureFraction = 0.8 }),
                    mlContext.Regression.Trainers.FastTree(labelColumnName: "target", numberOfLeaves: 20, numberOfTrees: 100),
                    mlContext.Regression.Trainers.FastTreeTweedie(labelColumnName: "target"),
                    mlContext.Regression.Trainers.LbfgsPoissonRegression(labelColumnName: "target", optimizationTolerance: 1e-07f),
                    mlContext.Regression.Trainers.LightGbm(labelColumnName: "target", numberOfIterations: 500, learningRate: 0.1, numberOfLeaves: 30, minimumExampleCountPerLeaf: 20), //Light Gradient Boosting Machine
                    mlContext.Regression.Trainers.OnlineGradientDescent(labelColumnName: "target", learningRate: 0.1f), //Closest to Stochastic Gradient Descent in scikit-learn
                    //Extra regressors, not included in the Python comparison
                    mlContext.Regression.Trainers.Sdca(labelColumnName: "target", maximumNumberOfIterations: 200), //Stochastic Dual Coordinate Ascent; in scikit-learn this needs the extra lightning package
                    mlContext.Regression.Trainers.Gam(labelColumnName: "target", numberOfIterations: 10000), //Generalized Additive Model (GAM) does not exist in scikit-learn, but there is pyGAM
                };
 
 
            foreach (var algo in regressors)
            {
                var stopWatch = Stopwatch.StartNew();
 
                //Use the columns cont1-cont14 as features for the model. Skip the id column and the target column.
                var features = split.TrainSet.Schema
                    .Select(col => col.Name)
                    .Where(colName => colName != "id" && colName != "target")
                    .ToArray();
 
                //Create a basic pipeline; further transforms and data cleanup could be appended here
                var pipeLine = mlContext.Transforms.Concatenate("Features", features)
                    .Append(algo);
                var model = pipeLine.Fit(trainData);
 
                //Score the held-out part of train.csv; Kaggle's test.csv has no target column to evaluate against
                var predictions = model.Transform(split.TestSet);
                var metrics = mlContext.Regression.Evaluate(predictions, labelColumnName: "target", scoreColumnName: "Score");
                stopWatch.Stop();
 
                //GetType().Name gives the short trainer name, e.g. FastForestRegressionTrainer
                PrintRegressionMetrics(algo.GetType().Name, metrics);
                Console.WriteLine("Total time:  {0} Milliseconds", stopWatch.ElapsedMilliseconds);
            }
           
        }
 
        //Print stats on model
        public static void PrintRegressionMetrics(string name, RegressionMetrics metrics)
        {
            Console.WriteLine($"*************************************************");
            Console.WriteLine($"*       Metrics for {name} regression model      ");
            Console.WriteLine($"*------------------------------------------------");
            Console.WriteLine($"*       LossFn:        {metrics.LossFunction:0.##}");
            Console.WriteLine($"*       Absolute loss: {metrics.MeanAbsoluteError:#.##}");
            Console.WriteLine($"*       Squared loss:  {metrics.MeanSquaredError:#.##}");
            Console.WriteLine($"*       RMS loss:      {metrics.RootMeanSquaredError:#.##}");
            Console.WriteLine($"*************************************************");
        }
        
    
 
        //Unlike in Python, we need an explicitly defined data model.
        public class ModelInput
        {
            //Column 0 is the row id; it is loaded here but left out of the features.
            [ColumnName("id"), LoadColumn(0)]
            public float Id { get; set; }
            [ColumnName("cont1"), LoadColumn(1)]
            public float Cont1 { get; set; }
            [ColumnName("cont2"), LoadColumn(2)]
            public float Cont2 { get; set; }
            [ColumnName("cont3"), LoadColumn(3)]
            public float Cont3 { get; set; }
            [ColumnName("cont4"), LoadColumn(4)]
            public float Cont4 { get; set; }
            [ColumnName("cont5"), LoadColumn(5)]
            public float Cont5 { get; set; }
            [ColumnName("cont6"), LoadColumn(6)]
            public float Cont6 { get; set; }
            [ColumnName("cont7"), LoadColumn(7)]
            public float Cont7 { get; set; }
            [ColumnName("cont8"), LoadColumn(8)]
            public float Cont8 { get; set; }
            [ColumnName("cont9"), LoadColumn(9)]
            public float Cont9 { get; set; }
            [ColumnName("cont10"), LoadColumn(10)]
            public float Cont10 { get; set; }
            [ColumnName("cont11"), LoadColumn(11)]
            public float Cont11 { get; set; }
            [ColumnName("cont12"), LoadColumn(12)]
            public float Cont12 { get; set; }
            [ColumnName("cont13"), LoadColumn(13)]
            public float Cont13 { get; set; }
            [ColumnName("cont14"), LoadColumn(14)]
            public float Cont14 { get; set; }
 
            //Label Column
            [ColumnName("target"), LoadColumn(15)]
            public float Target { get; set; }
        }
        
    }
}
 
 




  
  
 
###### Prerequisites
#Python 3.7+
 
import pandas as pd
from pathlib import Path
import time
import os
        
 
 
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import TweedieRegressor
from sklearn.linear_model import PoissonRegressor
from sklearn.experimental import enable_hist_gradient_boosting  # noqa
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.linear_model import SGDRegressor
 
 
print("************** Hello scikit-learn! ***************")

# The data folder is expected to sit next to this script
script_dir = os.path.dirname(os.path.abspath(__file__))
input_path = Path(script_dir) / 'tabular-playground-series-jan-2021'

train = pd.read_csv(input_path / 'train.csv', index_col='id')

print("train length:" + str(len(train)))
print(train.head())


test = pd.read_csv(input_path / 'test.csv', index_col='id')
print("test length:" + str(len(test)))
print(test.head())
 
 
 
# Pull out the target, and make a validation split (60% training, 40% validation)
target = train.pop('target')
X_train, X_test, y_train, y_test = train_test_split(train, target, train_size=0.60)
 
# Load list of regressors to evaluate; the names are listed in the same order as the models
model_names = ["Random Forest", "Decision Tree", "Tweedie", "Poisson", "HistGradientBoosting (LightGbm)", "Stochastic Gradient Descent"]
models = [
    # Keeping parameters the same between ML.NET and Python whenever possible.
    RandomForestRegressor(max_features=0.8, n_estimators=50, n_jobs=-1),
    DecisionTreeRegressor(max_leaf_nodes=20, max_depth=100),
    TweedieRegressor(power=1.9, alpha=.1, max_iter=10000),
    PoissonRegressor(alpha=1e-07),
    HistGradientBoostingRegressor(max_iter=500, learning_rate=0.1, max_leaf_nodes=30, min_samples_leaf=20),  # Experimental, similar to LightGBM
    SGDRegressor(eta0=0.1)
]
for name, model in zip(model_names, models):
    startTimeStamp = round(time.monotonic() * 1000)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    #plot_results(name, y_test, y_pred)
    # mean_squared_error returns the squared loss (MSE); its square root is the RMS loss
    squared_loss = mean_squared_error(y_test, y_pred)
    rms_loss = squared_loss ** 0.5

    absolute_loss = mean_absolute_error(y_test, y_pred)
    endTimeStamp = round(time.monotonic() * 1000)
    print('*************************************************')
    print(f'*        Metrics for {name}          ')
    print('*------------------------------------------------')
    print(f'*        Absolute Loss:{absolute_loss:0.5f}')
    print(f'*        Squared Loss:{squared_loss:0.5f}')
    print(f'*        RMS Loss:{rms_loss:0.5f}')
    print(f'*        Total time: {endTimeStamp - startTimeStamp:3d} Milliseconds')
    print('*************************************************')
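
Both versions hold back part of train.csv for validation. The sketch below shows the idea in plain Python: shuffle the row indices with a fixed seed, then cut the list at the chosen fraction. `split_indices` is a hypothetical helper for illustration only. It is not how `TrainTestSplit` or `train_test_split` are implemented, and the two libraries will not produce the same partition even with the same seed.

```python
import random

def split_indices(n_rows, train_fraction, seed):
    """Shuffle row indices reproducibly and cut them into train/validation."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # a fixed seed gives a repeatable split
    cut = round(n_rows * train_fraction)
    return indices[:cut], indices[cut:]

train_idx, valid_idx = split_indices(n_rows=10, train_fraction=0.60, seed=0)
print(len(train_idx), len(valid_idx))  # 6 4
```

Because each row index ends up in exactly one of the two lists, no validation row leaks into training.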
  
  

RESULTS

 

C# (ML.NET) output, followed by the Python (scikit-learn) output:

************** Hello C# and ML.Net! ***************
*************************************************
*       Metrics for FastForestRegressionTrainer 
*------------------------------------------------
*       LossFn:        62,5
*       Absolute loss: 7,91
*       Squared loss:  62,5
*       RMS loss:      7,91
*************************************************
Total time:  2433 Milliseconds
*************************************************
*       Metrics for  FastTreeTweedieTrainer 
*------------------------------------------------
*       LossFn:        62,53
*       Absolute loss: 7,9
*       Squared loss:  62,53
*       RMS loss:      7,91
*************************************************
Total time:  2531 Milliseconds
*************************************************
*       Metrics for  FastTreeRegressionTrainer  
*------------------------------------------------
*       LossFn:        62,54
*       Absolute loss: 7,91
*       Squared loss:  62,54
*       RMS loss:      7,91
*************************************************
Total time:  2209 Milliseconds
*************************************************
*       Metrics for  LbfgsPoissonRegressionTrainer 
*------------------------------------------------
*       LossFn:        62,51
*       Absolute loss: 7,91
*       Squared loss:  62,51
*       RMS loss:      7,91
*************************************************
Total time:  870 Milliseconds
*************************************************
*       Metrics for  LightGbmRegressionTrainer  
*------------------------------------------------
*       LossFn:        62,55
*       Absolute loss: 7,91
*       Squared loss:  62,55
*       RMS loss:      7,91
*************************************************
Total time:  3612 Milliseconds
*************************************************
*       Metrics for  OnlineGradientDescentTrainer  
*------------------------------------------------
*       LossFn:        62,3
*       Absolute loss: 7,89
*       Squared loss:  62,3
*       RMS loss:      7,89
*************************************************
Total time:  570 Milliseconds
*************************************************
    

************** Hello Python and Scikit Learn! ***************
*************************************************
*        Metrics for Random Forest
*------------------------------------------------
*        Absolute Loss:0.59550
*        Squared Loss:0.50755
*        RMS Loss:0.71243
*************************************************
*        Total time: 21250 Milliseconds
*************************************************
*        Metrics for Tweedie
*------------------------------------------------
*        Absolute Loss:0.60990
*        Squared Loss:0.52399
*        RMS Loss:0.72387
*************************************************
*        Total time: 1563 Milliseconds
*************************************************
*        Metrics for Decision Tree
*------------------------------------------------
*        Absolute Loss:0.61793
*        Squared Loss:0.53414
*        RMS Loss:0.73085
*************************************************
*        Total time: 516 Milliseconds
*************************************************
*        Metrics for Poisson
*------------------------------------------------
*        Absolute Loss:0.61367
*        Squared Loss:0.52938
*        RMS Loss:0.72758
*************************************************
*        Total time: 328 Milliseconds
*************************************************
*        Metrics for HistGradientBoosting (LightGbm)
*------------------------------------------------
*        Absolute Loss:0.59042
*        Squared Loss:0.49450
*        RMS Loss:0.70321
*************************************************
*        Total time: 5312 Milliseconds
*************************************************
*        Metrics for Stochastic Gradient Descent
*------------------------------------------------
*        Absolute Loss:0.61564
*        Squared Loss:0.53150
*        RMS Loss:0.72904
*************************************************
*        Total time: 281 Milliseconds
*************************************************
    

 

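As we can see, most of the regressors ran about as fast or faster in Python. The exceptions are random forest regression, where C# was considerably faster (about 2.4 versus 21.3 seconds), and gradient boosting, where LightGBM in C# took about 3.6 seconds against 5.3 seconds for HistGradientBoosting in Python.
The loss values are also far apart (an absolute loss of around 7.9 in C# versus around 0.6 in Python). The two runs were not scored on the same rows, so the losses should not be compared directly.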
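
When reading the two listings it helps to keep the three loss definitions apart. The sketch below computes them by hand in plain Python on made-up numbers (the `y_true` and `y_pred` values are purely illustrative, and `regression_losses` is a hypothetical helper): the absolute loss is the mean of |error|, the squared loss is the mean of error squared, and the RMS loss is the square root of the squared loss.

```python
import math

def regression_losses(y_true, y_pred):
    """Return (absolute loss, squared loss, RMS loss) for two equal-length lists."""
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / len(errors)
    mse = sum(e * e for e in errors) / len(errors)
    return mae, mse, math.sqrt(mse)

# Illustrative values only, not taken from the Kaggle data.
y_true = [7.0, 8.0, 6.0, 9.0]
y_pred = [7.5, 7.0, 6.0, 8.5]
mae, mse, rmse = regression_losses(y_true, y_pred)
print(f"MAE={mae:.3f} MSE={mse:.3f} RMSE={rmse:.3f}")
```

Since the RMS loss is the square root of the squared loss, it is the larger of the two whenever the squared loss is below 1, which is a quick sanity check on any printed pair of values.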


CONCLUSION

Using C# has its benefits, and so has Python. I like training ML models in both languages, and in future projects I will use the language that best fits the customer's project environment. That means choosing the right tool for the right task instead of shoehorning in yet another programming language.
I hope this showed you that ML (well, at least regression) is usable with whichever tool in your belt you feel most comfortable with.

 

 
