Machine learning is in a phase of continuous expansion, helped along by the offerings of all the major cloud platforms.
In my first and second articles on this topic, we saw that a programmer can analyze data with high-level tools, even without deep knowledge of statistics and machine learning; in other words, we can build a reasonably effective model from our own data set, although it is quite unlikely that everything will work at the first attempt. In the last article about ML, I introduced the idea of feature crossing, which brings us back to data manipulation.
Here the high-level tools show their limits and push us to look for new ones: they become so complicated that we can no longer manage their options and wizards, which can suddenly fail (I wrote about this problem in my second article).
What should guide us in choosing a new, lower-level framework for machine learning and data analysis? In my opinion, we need to set personal preferences aside and choose tools that are supported and used by a large, robust community.
Python is the most widely used programming language in the ML world. The reason is simple: the core of ML is made up of complex algorithms and adaptable workflows.
With Python, data scientists can focus on ML problems rather than on the technical and architectural aspects of a programming language. Furthermore, the language's versatility lets us use different styles depending on our needs:
- Object-oriented (even if it is not fully supported)
Finally, Python offers a rich technology stack with a large number of libraries built for ML, such as Keras, TensorFlow, scikit-learn, SciPy, Seaborn, pandas, and NumPy.
I entered this world through a platform called Anaconda, created for data scientists and their collaborative work: projects can be shared as live notebooks. At its core it is a free Python distribution with a package manager, offering more than 1,500 open-source packages on Windows, macOS, and Linux.
Anaconda Navigator is the desktop user interface for launching applications and managing packages, execution environments, and notebooks.
The Navigator lets us create an environment, in which I installed the main ML libraries.
We then launch a Jupyter Notebook session in this environment; Jupyter Notebook is a web interface for creating and executing notebooks. The session lets us choose a working folder, where for convenience I placed our soccer dataset.
A notebook is a cell editor: we insert instructions into cells and then execute them. Cells can also hold formatted text (such as Markdown) to document and comment on the code.
The first instructions are imports of libraries that we will use later.
Let’s load the dataset into memory using the pandas library.
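The original screenshot is not reproduced here, so what follows is only a minimal sketch of this step, assuming the dataset is a CSV file. The column names and values are hypothetical placeholders (the real soccer dataset may look different), and an in-memory string stands in for the file so that the cell runs on its own.

```python
import io

import pandas as pd

# Hypothetical stand-in for the soccer CSV; the real file and its columns may differ.
csv_text = """name,age,goalkeeper,defender,midfield,forward,rating
Player A,24,0,1,0,0,71.5
Player B,31,0,0,1,0,78.0
Player C,19,0,0,0,1,65.2
Player D,28,1,0,0,0,74.3
"""

# With a real file this would be pd.read_csv("path/to/your_dataset.csv").
football_dataframe = pd.read_csv(io.StringIO(csv_text))

# Quick sanity checks: number of rows/columns and the inferred column types.
print(football_dataframe.shape)
print(football_dataframe.dtypes)
```

Checking the shape and the inferred types right after loading is a cheap way to catch a wrong separator or a numeric column that was read as text.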
The object loaded into memory can be shuffled using the following code:
football_dataframe = football_dataframe.reindex( np.random.permutation(football_dataframe.index))
And the result is the following:
Next we process our data, choosing the input features and the output (target) feature. To do this, we need two functions written in Python.
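The two functions from the notebook are not shown here; a minimal sketch of what such a pair might look like follows. The function names, the column names, and the toy data are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical column names; adapt them to the real dataset.
FEATURE_COLUMNS = ["age", "goalkeeper", "defender", "midfield", "forward"]
TARGET_COLUMN = "rating"


def preprocess_features(dataframe: pd.DataFrame) -> pd.DataFrame:
    """Return a copy holding only the input features."""
    return dataframe[FEATURE_COLUMNS].copy()


def preprocess_targets(dataframe: pd.DataFrame) -> pd.DataFrame:
    """Return a one-column frame holding only the output (target)."""
    return dataframe[[TARGET_COLUMN]].copy()


# Toy data, made up for the example.
football_dataframe = pd.DataFrame({
    "name": ["Player A", "Player B", "Player C"],
    "age": [24, 31, 19],
    "goalkeeper": [0, 0, 1],
    "defender": [1, 0, 0],
    "midfield": [0, 1, 0],
    "forward": [0, 0, 0],
    "rating": [71.5, 78.0, 65.2],
})

features = preprocess_features(football_dataframe)
targets = preprocess_targets(football_dataframe)
print(features.columns.tolist())
print(targets.columns.tolist())
```

Keeping inputs and target in two separate functions makes it harder to leak the target into the features by accident.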
We then prepare the training and validation samples, using 70% of the data for the former and the remaining 30% for the latter.
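As a rough sketch of the 70/30 split, assuming the rows have already been shuffled (the toy frame and its column names are invented for the example):

```python
import numpy as np
import pandas as pd

TRAIN_FRACTION = 0.7

# Toy frame standing in for the (already shuffled) dataset; values are made up.
rng = np.random.default_rng(seed=0)
examples = pd.DataFrame({
    "age": rng.integers(17, 40, size=10),
    "rating": rng.uniform(50, 90, size=10),
})

# The first 70% of the rows go to training, the remaining 30% to validation.
split_index = round(len(examples) * TRAIN_FRACTION)
training_examples = examples.iloc[:split_index]
validation_examples = examples.iloc[split_index:]

print(len(training_examples), len(validation_examples))
```

A positional split like this is only fair if the data was shuffled first; otherwise any ordering in the file ends up in one of the two samples.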
Now we can start using the most important library in the ML world: TensorFlow.
TensorFlow is an open-source software library with proven, optimized modules for implementing algorithms for many kinds of perception and language-understanding tasks. Used by about fifty teams working in both scientific research and production, it underpins dozens of Google commercial products such as voice recognition, Gmail, and Google Photos.
TensorFlow is available in our Jupyter notebook in version 2.0. Be careful: many ML tutorials use code written for earlier versions, which is not compatible with the latest one!
Let’s start using TensorFlow to define our features properly. The code fragment shown in the following image starts from the definition of numerical features through the method tf.feature_column.numeric_column("FeatureName") and derives bucketized features from them (do you remember the first article?). For this we can use the method tf.feature_column.bucketized_column, which takes a numeric column as input and turns it into a bucketized feature, using the list of bucket boundaries passed as the second argument.
For example, in the case of the players’ age:
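The notebook screenshot is not reproduced here. To make the idea concrete without depending on a specific TensorFlow version, here is a NumPy-only sketch of what bucketizing an age column means. It is an illustration, not the notebook's code: the helper name and the ages are assumptions. A sorted list of boundaries like the one it computes is the kind of second argument that tf.feature_column.bucketized_column expects.

```python
import numpy as np


def get_quantile_based_boundaries(values, num_buckets):
    """Return num_buckets - 1 boundaries that split values into equal-sized groups."""
    quantiles = np.arange(1.0, num_buckets) / num_buckets
    return [float(q) for q in np.quantile(values, quantiles)]


# Hypothetical ages; the real dataset column may be named and distributed differently.
ages = np.array([18, 19, 21, 22, 24, 25, 27, 29, 31, 34, 36, 38])

boundaries = get_quantile_based_boundaries(ages, num_buckets=4)

# np.digitize returns, for each age, the index of the bucket it falls into.
bucket_index = np.digitize(ages, boundaries)

print(boundaries)
print(bucket_index)
```

Quantile-based boundaries give every bucket roughly the same number of players, which tends to work better than equal-width intervals when the ages are unevenly distributed.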
When working with high-dimensional linear models, one often starts from an optimization technique called FTRL (Follow The Regularized Leader). We therefore need to write a function that creates a model based on this technique. The notebook code is available on GitHub: the train_model function is a standard implementation of FTRL (you can find it, for example, in Google’s teaching material).
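To give an idea of what the optimizer does under the hood, here is a small NumPy sketch of the per-coordinate FTRL-Proximal update applied to a linear regression. This is not the notebook's train_model function and not TensorFlow's implementation (in TensorFlow 2.x the optimizer is exposed as tf.keras.optimizers.Ftrl). The function names, the hyperparameter values, and the synthetic data are assumptions chosen for the demonstration.

```python
import numpy as np


def ftrl_weights(z, n, alpha, beta, l1, l2):
    """Closed-form FTRL-Proximal weights for the accumulated state (z, n)."""
    w = -(z - np.sign(z) * l1) / ((beta + np.sqrt(n)) / alpha + l2)
    w[np.abs(z) <= l1] = 0.0  # L1 regularization zeroes out weak coordinates
    return w


def train_linear_ftrl(X, y, steps=200, alpha=0.5, beta=1.0, l1=0.0, l2=0.0):
    """Fit y ~ X @ w with full-batch squared loss and per-coordinate FTRL."""
    z = np.zeros(X.shape[1])  # adjusted sum of past gradients
    n = np.zeros(X.shape[1])  # sum of squared past gradients
    for _ in range(steps):
        w = ftrl_weights(z, n, alpha, beta, l1, l2)
        gradient = (2.0 / len(y)) * (X.T @ (X @ w - y))
        # sigma encodes how much the per-coordinate learning rate just shrank.
        sigma = (np.sqrt(n + gradient ** 2) - np.sqrt(n)) / alpha
        z += gradient - sigma * w
        n += gradient ** 2
    return ftrl_weights(z, n, alpha, beta, l1, l2)


# Synthetic data: the "true" weights are made up for the demonstration.
rng = np.random.default_rng(seed=0)
X = rng.normal(size=(200, 2))
true_w = np.array([2.0, -3.0])
y = X @ true_w

w = train_linear_ftrl(X, y)
initial_mse = float(np.mean(y ** 2))  # error of the all-zero model
final_mse = float(np.mean((X @ w - y) ** 2))
print(w, initial_mse, final_mse)
```

With l1 greater than zero, coordinates whose accumulated gradient stays small are set exactly to zero, which is why FTRL is popular for sparse, high-dimensional linear models.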
Last step: the definition of the feature cross. TensorFlow offers a method called tf.feature_column.crossed_column, which takes as input a collection of (bucketized) feature columns to cross and the total number of hash buckets (hash_bucket_size).
For example, in our case we can create:
roles_crossed_ = tf.feature_column.crossed_column( set([bucketized_goalkeeper, bucketized_defender, bucketized_midfield, bucketized_forward, bucketized_age] ), hash_bucket_size=112)
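Conceptually, a crossed column turns each combination of bucket values into one categorical value and hashes it into a fixed number of buckets. The plain-Python sketch below illustrates the idea only: the helper name, the key format, and the use of crc32 are assumptions, and TensorFlow uses its own hashing internally.

```python
import zlib


def cross_buckets(bucket_indices, hash_bucket_size):
    """Map one combination of bucket indices to a single crossed-feature bucket."""
    key = "_X_".join(str(index) for index in bucket_indices)
    # crc32 is a stand-in for TensorFlow's own hash; any stable hash shows the idea.
    return zlib.crc32(key.encode("utf-8")) % hash_bucket_size


# Hypothetical players: (role bucket, age bucket) pairs produced by earlier bucketizing.
players = [(0, 2), (1, 3), (0, 2), (3, 1)]
crossed = [cross_buckets(p, hash_bucket_size=112) for p in players]
print(crossed)
```

Players with the same combination always land in the same crossed bucket, while different combinations may collide; a larger hash_bucket_size makes collisions less likely at the cost of a wider model.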
At this point, all that remains is to start the training, choosing the parameters carefully according to the computing power available to you.
The Jupyter notebook we wrote can be uploaded, together with our data, to the portal https://studio.azureml.net that we saw in the first article. Warning: you need to install the missing packages (TensorFlow, for example, as shown in the following image) via the pip package manager: !pip install tensorflow. Here too, pay attention to the versions of the packages that can be installed in the cloud.
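Since version mismatches are the usual source of trouble, a first cell like the following can report what is actually installed in the current environment. It is a small sketch using only the standard library (Python 3.8 or later); the helper name and the list of packages are just examples.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Report the versions of the libraries the notebook relies on.
for name in ("tensorflow", "pandas", "numpy"):
    version = installed_version(name)
    print(f"{name}: {version if version else 'not installed'}")
```

Running this both locally and in the cloud makes it easy to see whether the two environments differ before the notebook fails halfway through.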
As I was writing this article, Microsoft surprised us with an announcement that concerns Anaconda: dotnet interactive is now available in Jupyter through the platform we have been talking about.
First, install dotnet interactive with the command:
dotnet tool install --global Microsoft.dotnet-interactive
Then make it available in Jupyter:
dotnet interactive jupyter install
We can verify the installation through the following command:
jupyter kernelspec list
When we create a new Jupyter notebook, the new kernels are available:
This is a code example that you can write and run:
We have come to the end of this series of articles, in which I introduced you to the ML world from a developer's perspective. I hope I have given you some advice you can use in your own data analysis. Some of you will have preferred the ML.NET CLI, others the classic Azure ML portal, and others again the Python-based ecosystem. In any case, special attention must be paid to data preparation.
See you in the next article!