But what if I told you that you can still use these models, pre-built in Python, and do a lot more, inside a Microsoft .NET environment?
WHY ML.NET?
C# on .NET is often the base language for large-scale systems. This sets us up for a problem if we want to use ML in that environment without setting up a separate ML system.
In other words, you would more or less have to set up a separate machine learning pipeline, with an OS, a Python environment, training and validation, model iteration, deployment to a server, and an API for accessing the results.
Doing it all in C# makes all of this (almost) a breeze. We can have the full ML cycle, from data import through training to serving predictions, in the same environment. (This is of course fully possible in Python as well, but here we want to stay connected to an existing .NET system.)
SPEED
Another important point is speed! Handling large amounts of data in C# can be up to 5 times faster than in Python. I am not only talking about the GPU crunching part but also the pre- and post-processing of data and other parts where large amounts of data are handled.
COMPATIBILITY
More on ONNX here: ONNX.AI
RUNS ANYWHERE
Based on .NET 5, it can be run on any of the main cloud providers (GCP, AWS and Azure) and also locally on Windows, Linux and Mac.
INTEGRATION TO EXISTING .NET CODEBASE
Being both a C# and Python programmer, I do enjoy using Python notebooks to develop and iterate on different ML models. But detaching a model from the notebook and converting it for use in production systems is not always that handy. Keeping everything in a single C# environment really makes it easier for both us and our customers. We can develop production ML capabilities much faster than by keeping it purely Pythonic. You can always detach it into a separate microservice with a custom API if you would like to connect to other systems or languages.
C# IS A COMPLETE LANGUAGE FOR MACHINE LEARNING
The ML.NET framework uses most of the techniques and algorithms used by other ML-friendly languages, such as R and Python. A lot of ready-made implementations of the classical ML algorithms and frameworks exist (such as regression, LightGBM, NLP, TensorFlow and a lot more).
A simple Example
As a comparison example, I will show you how to use ML.NET with C# and the equivalent code for Python.
I will showcase different supervised ML algorithms for regression in both environments.
(Regression is a common ML technique used to predict values such as sales forecasting or house prices)
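To make the idea concrete, here is a minimal sketch of a one-feature least-squares regression in plain Python. The house sizes and prices are made-up illustration values, and `fit_line` and `predict` are hypothetical helper names, not part of ML.NET or scikit-learn.

```python
from statistics import mean


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Slope = covariance(x, y) / variance(x)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def predict(model, x):
    """Apply the fitted line to a new input value."""
    slope, intercept = model
    return slope * x + intercept


# Made-up training data: living area in square metres -> price in thousands
areas = [50, 70, 90, 110]
prices = [150, 210, 270, 330]

model = fit_line(areas, prices)
print(predict(model, 100))
```

The regressors compared later in this post do the same job with many input columns and more flexible models, but the contract is identical: fit on known rows, then predict a number for unseen ones.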
The flow will be as follows, for both C# and Python:
1. Data import
2. Data preparation
3. Model training
4. Evaluation
Traditionally, and greatly simplified, after the first evaluation (step 4) you would go back and tweak the model (step 3), and iterate over and over until satisfied. After that you would deploy the model so that it can be served to other systems.
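The four steps and the evaluate-and-tweak loop can be sketched in plain Python as follows. Everything here is an illustrative assumption: the data is synthetic, the "model" is a tiny k-nearest-neighbour average, and the helper names (`make_data`, `split_rows`, `knn_predict`, `validation_mae`) are hypothetical. It is not the code used for the comparison in this post.

```python
import random


def make_data(n, seed=0):
    """Step 1: 'import' data (here: synthetic rows of (feature, target))."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.uniform(0, 10)
        rows.append((x, 2 * x + rng.gauss(0, 1)))
    return rows


def split_rows(rows, train_fraction=0.6):
    """Step 2: prepare the data by holding out a validation part."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


def knn_predict(train, x, k):
    """Step 3: the 'model' averages the targets of the k nearest training rows."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    return sum(y for _, y in nearest) / k


def validation_mae(train, valid, k):
    """Step 4: mean absolute error on rows the model has not seen."""
    return sum(abs(knn_predict(train, x, k) - y) for x, y in valid) / len(valid)


train, valid = split_rows(make_data(200))

# Iterate between step 3 and step 4: try a setting, evaluate, keep the best.
results = {k: validation_mae(train, valid, k) for k in (1, 3, 5, 9, 15)}
best_k = min(results, key=results.get)
print(best_k, round(results[best_k], 3))
```

In a real project the loop would vary trainer options such as the number of trees or the learning rate instead of `k`, but the structure stays the same.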
The serving of the trained model is quite easy in ML.NET and you can easily add it to your code, API, or web application, but that will be discussed in a future post.
The data is borrowed from Kaggle, so if you want to repeat the examples head over to https://www.kaggle.com/c/tabular-playground-series-jan-2021 and download it.
Then place the extracted data in a subfolder called 'tabular-playground-series-jan-2021'.
Besides a row id, the Kaggle data consists of 14 columns with decimal data (cont1 to cont14) and a final column which is the result (target in the code).
Both the training set and the test set consist of 300,000 rows.
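As a small sketch of that layout, the snippet below parses two made-up rows with Python's standard `csv` module and separates features from the label. It assumes the column order the code in this post relies on (id, cont1 to cont14, target); the values themselves are invented for illustration.

```python
import csv
import io

# Two made-up rows in the same layout as train.csv: id, cont1..cont14, target
header = ["id"] + [f"cont{i}" for i in range(1, 15)] + ["target"]
sample_csv = io.StringIO(
    ",".join(header) + "\n"
    + ",".join(["1"] + ["0.5"] * 14 + ["7.25"]) + "\n"
    + ",".join(["3"] + ["0.25"] * 14 + ["8.0"]) + "\n"
)

reader = csv.DictReader(sample_csv)
# Everything except the row id and the label is an input feature
feature_names = [name for name in reader.fieldnames if name not in ("id", "target")]

features, targets = [], []
for row in reader:
    features.append([float(row[name]) for name in feature_names])
    targets.append(float(row["target"]))

print(len(feature_names), targets)
```

Both the C# and the Python versions make the same selection: the id is ignored as an input and the target is what the model learns to predict.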
Here is a snapshot of a few rows of the data we will be using:
using System;
using Microsoft.ML;
using Microsoft.ML.Data;
using Microsoft.ML.Trainers;
using System.Linq;
using System.IO;
using System.Diagnostics;
using System.Collections.Generic;
using Microsoft.ML.Trainers.FastTree;
namespace regression
{
    class Program
    {
        // Using the same dataset as for Python; we walk up the directory tree to find it
        private static readonly string TrainingDataPath = Path.Combine(Path.GetFullPath(@"..\..\..\..\..\"), "tabular-playground-series-jan-2021", "train.csv");
        private static readonly string TestingDataPath = Path.Combine(Path.GetFullPath(@"..\..\..\..\..\"), "tabular-playground-series-jan-2021", "test.csv");

        static void Main(string[] args)
        {
            Console.WriteLine("************** Hello ML.Net! ***************");

            // Initialize ML.NET environment
            var mlContext = new MLContext();

            // Load data with the IDataView interface (forward read-only cursor).
            // The ModelInput class at the bottom tells the loader how to map the CSV columns.
            IDataView trainingDataView = mlContext.Data.LoadFromTextFile<ModelInput>(
                path: TrainingDataPath,
                hasHeader: true,
                separatorChar: ','
            );
            IDataView validateDataView = mlContext.Data.LoadFromTextFile<ModelInput>(
                path: TestingDataPath,
                hasHeader: true,
                separatorChar: ','
            );

            // Similar to head() in Python
            var data = trainingDataView.Preview(maxRows: 5);

            // Create a validation split: hold out 40% of train.csv, matching train_size=0.60 in the Python version
            var split = mlContext.Data.TrainTestSplit(trainingDataView, testFraction: 0.40, seed: 0);
            var trainData = split.TrainSet;

            // Load list of regressors to evaluate
            var regressors = new List<ITrainerEstimator<ISingleFeaturePredictionTransformer<object>, object>>()
            {
                mlContext.Regression.Trainers.FastForest(new FastForestRegressionTrainer.Options { LabelColumnName = "target", NumberOfTrees = 50, FeatureFraction = 0.8 }),
                mlContext.Regression.Trainers.FastTree(labelColumnName: "target", numberOfLeaves: 20, numberOfTrees: 100),
                mlContext.Regression.Trainers.FastTreeTweedie(labelColumnName: "target"),
                mlContext.Regression.Trainers.LbfgsPoissonRegression(labelColumnName: "target", optimizationTolerance: 1e-07f),
                mlContext.Regression.Trainers.LightGbm(labelColumnName: "target", numberOfIterations: 500, learningRate: 0.1, numberOfLeaves: 30, minimumExampleCountPerLeaf: 20), // Light Gradient Boosting Machine
                mlContext.Regression.Trainers.OnlineGradientDescent(labelColumnName: "target", learningRate: 0.1f), // Closest to Stochastic Gradient Descent in scikit-learn
                // Extra regressors, not included in the Python comparison
                mlContext.Regression.Trainers.Sdca(labelColumnName: "target", maximumNumberOfIterations: 200), // Stochastic Dual Coordinate Ascent, needs the extra lightning package for scikit-learn
                mlContext.Regression.Trainers.Gam(labelColumnName: "target", numberOfIterations: 10000), // Generalized Additive Model (GAM) does not exist in scikit-learn, but there is pyGAM
            };

            foreach (var algo in regressors)
            {
                var stopWatch = Stopwatch.StartNew();

                // Add the columns (cont1-cont14) as features for the model. Skip the first ID column and the last target column as input.
                var features = split.TrainSet.Schema
                    .Select(col => col.Name)
                    .Where(colName => colName != "id" && colName != "target")
                    .ToArray();

                // Create a basic pipeline; here we could also add transforms and cleanup
                var pipeLine = mlContext.Transforms.Concatenate("Features", features)
                    .Append(algo);

                var model = pipeLine.Fit(trainData);

                // Evaluate on the held-out part of train.csv; Kaggle's test.csv has no target column to score against
                var predictions = model.Transform(split.TestSet);
                var metrics = mlContext.Regression.Evaluate(predictions, "target", "Score");
                stopWatch.Stop();

                PrintRegressionMetrics(algo.ToString(), metrics);
                Console.WriteLine("Total time: {0} Milliseconds", stopWatch.ElapsedMilliseconds);
            }
        }

        // Print stats on model
        public static void PrintRegressionMetrics(string name, RegressionMetrics metrics)
        {
            Console.WriteLine($"*************************************************");
            Console.WriteLine($"* Metrics for {name} regression model ");
            Console.WriteLine($"*------------------------------------------------");
            Console.WriteLine($"* LossFn: {metrics.LossFunction:0.##}");
            Console.WriteLine($"* Absolute loss: {metrics.MeanAbsoluteError:#.##}");
            Console.WriteLine($"* Squared loss: {metrics.MeanSquaredError:#.##}");
            Console.WriteLine($"* RMS loss: {metrics.RootMeanSquaredError:#.##}");
            Console.WriteLine($"*************************************************");
        }

        // Compared to Python, we need a clearly defined data model.
        public class ModelInput
        {
            // Column 0 is the row id; it is loaded here but excluded from the features.
            [ColumnName("id"), LoadColumn(0)]
            public float id { get; set; }
            [ColumnName("cont1"), LoadColumn(1)]
            public float Cont1 { get; set; }
            [ColumnName("cont2"), LoadColumn(2)]
            public float Cont2 { get; set; }
            [ColumnName("cont3"), LoadColumn(3)]
            public float Cont3 { get; set; }
            [ColumnName("cont4"), LoadColumn(4)]
            public float Cont4 { get; set; }
            [ColumnName("cont5"), LoadColumn(5)]
            public float Cont5 { get; set; }
            [ColumnName("cont6"), LoadColumn(6)]
            public float Cont6 { get; set; }
            [ColumnName("cont7"), LoadColumn(7)]
            public float Cont7 { get; set; }
            [ColumnName("cont8"), LoadColumn(8)]
            public float Cont8 { get; set; }
            [ColumnName("cont9"), LoadColumn(9)]
            public float Cont9 { get; set; }
            [ColumnName("cont10"), LoadColumn(10)]
            public float Cont10 { get; set; }
            [ColumnName("cont11"), LoadColumn(11)]
            public float Cont11 { get; set; }
            [ColumnName("cont12"), LoadColumn(12)]
            public float Cont12 { get; set; }
            [ColumnName("cont13"), LoadColumn(13)]
            public float Cont13 { get; set; }
            [ColumnName("cont14"), LoadColumn(14)]
            public float Cont14 { get; set; }
            // Label column
            [ColumnName("target"), LoadColumn(15)]
            public float Target { get; set; }
        }
    }
}
###### Prerequisites
# Python 3.7+
import time
from pathlib import Path

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import TweedieRegressor
from sklearn.linear_model import PoissonRegressor
from sklearn.experimental import enable_hist_gradient_boosting  # noqa (only required for scikit-learn versions before 1.0)
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.linear_model import SGDRegressor

print("************** Hello scikit-learn! ***************")

# The data folder is expected next to this script
input_path = Path(__file__).resolve().parent / 'tabular-playground-series-jan-2021'

train = pd.read_csv(input_path / 'train.csv', index_col='id')
print("train length:" + str(len(train)))
print(train.head())

test = pd.read_csv(input_path / 'test.csv', index_col='id')
print("test length:" + str(len(test)))
print(test.head())

# Pull out the target, and make a validation split
target = train.pop('target')
X_train, X_test, y_train, y_test = train_test_split(train, target, train_size=0.60)

# Load list of regressors to evaluate (names are listed in the same order as the models)
model_names = ["Random Forest", "Decision Tree", "Tweedie", "Poisson", "HistGradientBoosting (LightGbm)", "Stochastic Gradient Descent"]
models = [
    # Keeping parameters the same between ML.NET and Python whenever possible.
    RandomForestRegressor(max_features=0.8, n_estimators=50, n_jobs=-1),
    DecisionTreeRegressor(max_leaf_nodes=20, max_depth=100),
    TweedieRegressor(power=1.9, alpha=.1, max_iter=10000),
    PoissonRegressor(alpha=1e-07),
    HistGradientBoostingRegressor(max_iter=500, learning_rate=0.1, max_leaf_nodes=30, min_samples_leaf=20),  # Similar to LightGBM
    SGDRegressor(eta0=0.1)
]

for name, model in zip(model_names, models):
    startTimeStamp = round(time.monotonic() * 1000)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    #plot_results(name, y_test, y_pred)
    squared_loss = mean_squared_error(y_test, y_pred)  # mean squared error
    rms_loss = squared_loss ** 0.5  # root mean squared error
    absolute_loss = mean_absolute_error(y_test, y_pred)
    endTimeStamp = round(time.monotonic() * 1000)
    print('*************************************************')
    print(f'* Metrics for {name} ')
    print('*------------------------------------------------')
    print(f'* Absolute Loss:{absolute_loss:0.5f}')
    print(f'* Squared Loss:{squared_loss:0.5f}')
    print(f'* RMS Loss:{rms_loss:0.5f}')
    print(f'* Total time: {endTimeStamp - startTimeStamp:3d} Milliseconds')
    print('*************************************************')
Results
[Results table: C# | Python]
As we can see, quite a few of the regression examples were similar or faster in Python, except for random forest regression, where C# was considerably faster.
Also, the metrics are evaluated differently, which shows up as dissimilar numbers between C# and Python.
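When comparing the numbers it helps to pin down what the three printed losses mean. Below is a minimal sketch in plain Python, assuming the labels map to mean absolute error, mean squared error and root mean squared error (the names ML.NET's `RegressionMetrics` uses for them). The sample values are made up, and `regression_metrics` is a hypothetical helper.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute the three losses printed by both programs."""
    errors = [p - t for t, p in zip(y_true, y_pred)]
    n = len(errors)
    mae = sum(abs(e) for e in errors) / n   # "Absolute loss"
    mse = sum(e * e for e in errors) / n    # "Squared loss"
    rmse = math.sqrt(mse)                   # "RMS loss"
    return {"absolute_loss": mae, "squared_loss": mse, "rms_loss": rmse}


# Made-up targets and predictions
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 9.0, 9.0]

metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

The RMS loss is always the square root of the squared loss, which makes a quick sanity check when two tools report numbers that do not seem to line up. The numbers are also only comparable when both tools score the same held-out rows.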
Conclusion
Using C# has its benefits and so has Python. I do like training ML models in both languages and will in future projects use the language that is most beneficial for the customer's project environment. That is, the right tool for the right task instead of shoehorning in yet another coding language.
I hope this showed you that ML (well, at least regression) is usable with whichever tool in your belt you feel most comfortable with.