
Let’s continue from the previous article, talking about features, the properties we use as input to our model in order to build a prediction.
In every Machine Learning course, it’s evident what we want to predict: the price of a flat per square meter, taxi tips in New York, etc. But in our specific case (and more generally), do we know what we want to extract from our data? Are our features good enough?
A good feature has a known value at prediction time: we can’t train a model on a complete set of data and then ask it for a prediction when some of that information is missing.
Let’s come back to our data:

Let’s suppose we want to make a prediction about a match that player X will play tomorrow. The only features we know today are: the player’s age, his (her) role, the match date, and whether he (she) belongs to the home team.
Everything else is unknown: how many minutes he (she) will play, how many goals or own goals he (she) will score, whether he (she) will receive a yellow or a red card, whether the home team will win, etc. This information is usually called labels: everything we want to predict.
In the last article, we did not include the player’s identity and the teams’ names as features in our dataset. Does the probability that X will score a goal or receive a card depend on his (her) history, or is it an absolute number unrelated to his (her) identity?
Another question: do these probabilities change now that player X belongs to team Y and no longer to team Z?
Even after years of studying Statistics, these questions remain too complex to be answered. I decided to recover the identities in our dataset to show you another classical technique in data preparation.
Suppose, for the sake of simplicity, that we have data for only five players, labeled with numbers from 1 to 5. Instead of the player’s ID, we associate each row in our dataset with a five-dimensional array (five being the total number of players). All elements of this array are zero except the one at the position corresponding to the player’s ID, which is one. Again, this is an example of one-hot encoding, a technique we talked about in the previous article, which improves the execution time and efficiency of our Machine Learning algorithms.
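The five-player encoding described above can be sketched in a few lines of plain C#. This is only an illustration: the class and method names (OneHotSketch, Encode) are hypothetical, and the sketch assumes that IDs run contiguously from 1 to the number of players.

```csharp
using System;

public static class OneHotSketch
{
    // Returns a vector of length playerCount that is zero everywhere except
    // at the position of the given player ID.
    // Assumption: IDs run from 1 to playerCount with no gaps.
    public static float[] Encode(int playerId, int playerCount)
    {
        if (playerId < 1 || playerId > playerCount)
            throw new ArgumentOutOfRangeException(nameof(playerId));

        var vector = new float[playerCount]; // all elements start at zero
        vector[playerId - 1] = 1f;           // mark the slot of this player
        return vector;
    }

    public static void Main()
    {
        // Encode player 3 out of 5 and print the resulting vector.
        Console.WriteLine(string.Join(", ", Encode(3, 5)));
    }
}
```

With thousands of players, most of each vector is zeros, which is why such columns are usually stored in a sparse form rather than as full arrays.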

To create a sparse column (another name for such columns), we need to pre-process our data to extract all the keys and build a dictionary from them. This dictionary must be available at prediction time and must be identical to the one used when training our model.
What if we add a new player to the original data and want to build a prediction for him (her)? We don’t have his (her) key. A classical technique is to add, from the beginning, a key for an unknown player (in our example, one for each role) and associate with it the average values of the other features. As more data arrives, we can create a new version of the dictionary and use it to train our model again.
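To make the idea of a reserved key more concrete, here is a minimal sketch of a key dictionary with a fallback slot. It is a simplification under stated assumptions: it reserves a single unknown slot instead of one per role, it leaves out the averaging of the other features, and the names PlayerDictionary, UnknownSlot, and Encode are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed class PlayerDictionary
{
    // Slot 0 is reserved for any player that was not seen during training.
    public const int UnknownSlot = 0;

    private readonly Dictionary<int, int> _slotById = new Dictionary<int, int>();

    // Builds the dictionary once, from the IDs found in the training data.
    public PlayerDictionary(IEnumerable<int> trainingIds)
    {
        foreach (var id in trainingIds.Distinct().OrderBy(i => i))
            _slotById[id] = _slotById.Count + 1; // known players use slots 1..N
    }

    // Number of known players plus the reserved unknown slot.
    public int Size => _slotById.Count + 1;

    // One-hot encodes a player; unseen IDs fall back to the unknown slot.
    public float[] Encode(int playerId)
    {
        var vector = new float[Size];
        var slot = _slotById.TryGetValue(playerId, out var known) ? known : UnknownSlot;
        vector[slot] = 1f;
        return vector;
    }
}

public static class UnknownPlayerDemo
{
    public static void Main()
    {
        var dictionary = new PlayerDictionary(new[] { 1, 2, 3, 4, 5 });

        // Player 2 was seen during training; player 42 was not.
        Console.WriteLine(string.Join(", ", dictionary.Encode(2)));
        Console.WriteLine(string.Join(", ", dictionary.Encode(42)));
    }
}
```

Because the same dictionary instance serves both training and prediction, the vector length never changes when a new player shows up. Only a retraining with a new dictionary version gives that player a slot of his (her) own.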
How can we manage a sparse column? Especially if we have thousands of players? Only a library can help us to solve such a big problem.
ML.NET allows us to enrich our .NET applications with Machine Learning features, both for online and offline scenarios.
Starting from the Microsoft .NET learning portal ( https://dotnet.microsoft.com/learn ), it’s possible to access the Machine Learning section.

On Linux and macOS we can install the ML.NET CLI (Command Line Interface), while on Windows an extension called ML.NET Model Builder is available for Visual Studio.


Starting from a .NET Core Console Application, we can access a contextual menu called Machine Learning, as shown in the following image:

This menu starts a wizard that leads us to choose one among a series of typical scenarios:

The custom scenario is what we need for our example to have more flexibility. We can load the dataset from an input csv file, choose the label for our predictions, and choose the features among those available.

When we start a project from scratch, such an approach will rarely produce a working model. We need to test our procedure step by step. Therefore, let’s start by installing a NuGet package called Microsoft.ML.

Starting from the .csv file containing all the data and from the names chosen in the header, let’s define the following two classes in our code:
using Microsoft.ML.Data; // LoadColumn and ColumnName attributes

public class PlayerData
{
    // Each LoadColumn index is the zero-based position of the column in the .csv file.
    [LoadColumn(0)]
    public int Id;
    [LoadColumn(1)]
    public float Age;
    [LoadColumn(5)]
    public string AwayTeam;
    [LoadColumn(6)]
    public float Year;
    [LoadColumn(7)]
    public float Month;
    [LoadColumn(8)]
    public float Day;
    [LoadColumn(10)]
    public float Minutes;
    [LoadColumn(12)]
    public string HomeTeam;
    [LoadColumn(18)]
    public float IsHomeTeam;
    [LoadColumn(19)]
    public float IsDefender;
    [LoadColumn(20)]
    public float IsMidfield;
    [LoadColumn(21)]
    public float IsForward;
    [LoadColumn(22)]
    public float IsNoRole;
}
public class PlayerDataPrediction
{
    // Regression trainers write the predicted value to the "Score" column;
    // mapping to "Minutes" would return the input column instead of the prediction.
    [ColumnName("Score")]
    public float Minutes;
}
PlayerData is the class corresponding to the input data, and its fields match the dataset columns. The LoadColumn attribute specifies the zero-based index of the dataset column from which each field is loaded.
PlayerDataPrediction is the class that holds the values we want to predict. Each field is preceded by a ColumnName attribute, which names the model output column the field is read from.
Every ML.NET operation starts with the creation of an instance of the MLContext class. Conceptually, it’s similar to the DbContext class in Entity Framework: a context for the execution of every Machine Learning activity. The constructor takes an optional integer seed for the pseudo-random number generator it uses internally. Using the same seed in our experiments makes the MLContext deterministic: its results will be reproducible in future experiments.
var mlContext = new MLContext(seed: 0);
Operations leading to the generation of a Model are:
- Loading data
- Extracting and manipulating data
- Running the Model training
- Saving the Model
ML.NET uses the IDataView interface to represent the data that flows through the pipeline used for reading and manipulating the input. An IDataView can be loaded from text files (.csv) or from real-time data (for example, from a SQL database or a log file).
IDataView dataView = mlContext.Data.LoadFromTextFile<PlayerData>(dataPath, hasHeader: true, separatorChar: ',');
This instruction loads a .csv file available at the dataPath path, stating that its first row contains a header and that the separator character is a comma.
The transformation pipeline starts by stating which of the available data columns will be the label:
var pipeline = mlContext.Transforms.CopyColumns(outputColumnName: "Label", inputColumnName: "Minutes")
and then indicates which categorical columns OneHotEncoding is applied to:
.Append(mlContext.Transforms.Categorical.OneHotEncoding(outputColumnName: "AwayTeamEncoded", inputColumnName: "AwayTeam"))
.Append(mlContext.Transforms.Categorical.OneHotEncoding(outputColumnName: "IdEncoded", inputColumnName: "Id"))
.Append(mlContext.Transforms.Categorical.OneHotEncoding(outputColumnName: "HomeTeamEncoded", inputColumnName: "HomeTeam"))
We proceed by concatenating all the features:
.Append(mlContext.Transforms.Concatenate("Features",
    "IdEncoded", "AwayTeamEncoded", "HomeTeamEncoded",
    "Age", "Year", "Month", "Day", "IsHomeTeam", "IsDefender",
    "IsMidfield", "IsForward", "IsNoRole"))
and finally run the model’s training, choosing one of the algorithms available in the library:
.Append(mlContext.Regression.Trainers.Sdca());
The algorithm we used is called Stochastic Dual Coordinate Ascent (SDCA). Why did we choose this one? Simply because it’s the only one that does not raise an exception with the input data we passed. Why do the other algorithms fail? Maybe because our data are still not correctly normalized and optimized.
The fact that SDCA works is not enough to conclude that the model will be usable in the prediction phase. We need to know how accurate it is.
The model is created and saved with the following instructions:
var model = pipeline.Fit(dataView);
mlContext.Model.Save(model, dataView.Schema, "model.zip");
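As a rough sketch of what the saved file is for, the snippet below loads model.zip and asks it for a single prediction. The saved pipeline includes the fitted one-hot encoders, so the dictionary used in training is the one used at prediction time. Several things here are assumptions and not part of the original code: the MinutesPrediction class is a hypothetical output class declared only for this sketch, the player values are made up, and PlayerData is the input class defined earlier in this article.

```csharp
using System;
using Microsoft.ML;
using Microsoft.ML.Data;

// Hypothetical output class for this sketch: ML.NET regression trainers
// write the predicted value to a column named "Score".
public class MinutesPrediction
{
    [ColumnName("Score")]
    public float PredictedMinutes;
}

public static class PredictSketch
{
    public static void Main()
    {
        var mlContext = new MLContext(seed: 0);

        // Load the fitted pipeline (transforms and trainer) saved as model.zip.
        ITransformer model = mlContext.Model.Load("model.zip", out DataViewSchema inputSchema);

        // A prediction engine scores one PlayerData instance at a time.
        var engine = mlContext.Model.CreatePredictionEngine<PlayerData, MinutesPrediction>(model);

        // Made-up values, for illustration only. Minutes is left at its
        // default because it is the label we want to predict.
        var sample = new PlayerData
        {
            Id = 1,
            Age = 25f,
            AwayTeam = "Team Z",
            HomeTeam = "Team Y",
            Year = 2021f,
            Month = 5f,
            Day = 16f,
            IsHomeTeam = 1f,
            IsMidfield = 1f
        };

        Console.WriteLine(engine.Predict(sample).PredictedMinutes);
    }
}
```

Whether the printed number means anything depends entirely on the model’s quality, which this sketch does not measure.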

Too simple, isn’t it? Are you curious to know whether we’ll succeed in predicting how many minutes player X will play in his (her) next match, starting from the available features? You will read about it in my next article, where I will show you the techniques used to create a data sample for evaluating the model’s performance.