Iris Data Set with MySQL HeatWave Machine Learning & Zeppelin
MySQL – The world’s most popular open source database – is also probably the most popular database in the Cloud.
MySQL HeatWave database service is THE MySQL PaaS, a fully managed cloud service in Oracle Cloud Infrastructure (OCI).
The only MySQL service 100% developed, managed and supported by the MySQL Team.
In short:
For your analytics needs you can explore and visualize your data using a data visualization tool like Tableau, Qlik, Oracle Analytics Cloud, Apache Superset (see: Explore & visualize your MySQL HeatWave data with Superset)…
For your Machine Learning needs you can use a notebook like Jupyter, Apache Zeppelin (see: Interactively explore & visualize your MySQL HeatWave data with Apache Zeppelin),…
In this article I will show you how easy it is to use the Machine Learning capabilities of MySQL HeatWave. We will use the famous Iris dataset and we’re going to build, train, deploy and explain a machine learning model within MySQL HeatWave using Apache Zeppelin.
Prerequisites
To be able to reproduce this tutorial, you must have a MySQL HeatWave instance running (details here and also RTFM).
The data used in this article are available on my GitHub account here.
You need a MySQL client to load the data into MySQL HeatWave. I’m using MySQL Shell.
Furthermore if you want to use Zeppelin (although optional because we’re going to run mainly sql queries) , it must be properly setup.
Please read: Interactively explore & visualize your MySQL HeatWave data with Apache Zeppelin
You can also download the Zeppelin notebook here.
Finally, you might need to install some packages.
ex: (pip install pandas numpy unlzw3 sklearn seaborn matplotlib jupyter grpcio protobuf)
MySQL HeatWave ML
MySQL HeatWave ML makes it easy to use machine learning, whether you are a novice user or an experienced ML practitioner.
You provide the data, and HeatWave ML analyzes the characteristics of the data and creates an optimized machine learning model that you can use to generate predictions and explanations.
HeatWave ML supports supervised machine learning. That is, it creates a machine learning model by analyzing a labeled dataset to learn patterns that enable it to predict labels based on the features of the dataset. HeatWave ML supports both classification and regression models.
More information here.
All HeatWave ML operations are initiated by running CALL
or SELECT
statements, which can be easily integrated into your applications.
HeatWave ML routines include:
ML_TRAIN
: Trains a machine learning model for a given training dataset.ML_PREDICT_ROW
: Makes predictions for one or more rows of data.ML_PREDICT_TABLE
: Makes predictions for a table of data.ML_EXPLAIN_ROW
: Explains predictions for one or more rows of data.ML_EXPLAIN_TABLE
: Explains predictions for a table of data.ML_SCORE
: Computes the quality of a model.ML_MODEL_LOAD
: Loads a machine learning model for predictions and explanations.ML_MODEL_UNLOAD
: Unloads a machine learning model.
Load the data
In a real life scenario, your data are already inside your MySQL instance. Although you might need to prepare them upfront, because the relevant data set must be inside a single table.
I’m using MySQL Shell (installed on the server that host Zeppelin) to load the data into the iris_ML schema. The dump file – MySQL_HeatWave-iris_ML.sql – was previously uploaded to the server into /home/ubuntu.
mysqlsh --version
mysqlsh myUser:MyP4s$W0rd@10.0.1.2 --sql -f /home/ubuntu/MySQL_HeatWave-iris_ML.sql
The dump contains 4 tables:
- iris : the reference table ie source of truth, production data
- iris_test : contains the test dataset
- iris_train : contains the training dataset
- iris_validate : contains the validation dataset
Check HeatWave ML status
HeatWave ML is enable by default as soon as your MySQL HeatWave Cluster is active.
The rapid_ml_status variable provides the status of HeatWave ML:
SHOW GLOBAL STATUS LIKE 'rapid_ml_status';
HeatWave ML is up and running \o/
Data exploration
The first step of any project based on data is to look at the data.
Summarize and visualize the data will help you to have a better understanding of the project.
It may surprise you but SQL provides some commands and functions for data exploration, although much less extensive than R or Python 🙂 :
-- Tables structure
DESCRIBE iris_ML.iris; -- Reference table (production table)
DESCRIBE iris_ML.iris_train; -- Table that contains the training dataset
DESCRIBE iris_ML.iris_test; -- Test dataset table
DESCRIBE iris_ML.iris_validate; -- Validation dataset table
-- Number of rows
SELECT count(*) FROM iris_ML.iris;
SELECT count(*) FROM iris_ML.iris_train;
SELECT count(*) FROM iris_ML.iris_test;
SELECT count(*) FROM iris_ML.iris_validate;
iris simulate the live production table. It contains the original data (150 rows) and actually it will not be used (as is) by HeatWave ML.
To keep things simple as possible in this article, iris as the same structure than iris_train. But in a real life scenario it will most likely not the case.
This production table may have additional columns like for example: a primary key, timestamp, etc… any useful business related information that are not relevant for building our model.
iris_train contains the training dataset a subset of iris table (120 rows), used to train the machine learning model.
This table is typically a subset of the reference table(s).
iris_test contains the test dataset a subset of iris table (30 rows), different than iris_train. Its structure is almost similar to iris_train but without the target column (class) .
iris_validate contains the validation dataset (30 rows). Same data than iris_test but same structure than iris_train in other words this table has the target column (class).
General requirements for HeatWave ML here.
-- Data sample
SELECT * FROM iris_ML.iris LIMIT 10;
-- Class distribution
SELECT class, count(*) FROM iris_ML.iris GROUP BY class;
-- Summary Statistics
SELECT MIN(sepal_length), MAX(sepal_length), ROUND(AVG(sepal_length), 2), ROUND(STD(sepal_length), 2) FROM iris_ML.iris;
SELECT MIN(sepal_width), MAX(sepal_width), ROUND(AVG(sepal_width), 2), ROUND(STD(sepal_width), 2) FROM iris_ML.iris;
SELECT MIN(petal_length), MAX(petal_length), ROUND(AVG(petal_length), 2), ROUND(STD(petal_length), 2) FROM iris_ML.iris;
SELECT MIN(petal_width), MAX(petal_width), ROUND(AVG(petal_width), 2), ROUND(STD(petal_width), 2) FROM iris_ML.iris;
Class distribution is well balanced:
- Iris-virginica: 50 rows
- Iris-setosa: 50 rows
- Iris-versicolor: 50 rows
Data visualization
Visualize your data is probably the more convenient way to explore and understand them.
Below a little Python script to generate some graphs.
Beforehand, I had to edit the Zeppelin python interpreter – zeppelin.python – and replace python by python3:
The interpreter is: %python.ipython
The script is going to connect to MySQL HeatWave, thus you must update the database information (host, database, user, password)
%python.ipython
## Edit python interpreter on Zeppelin - zeppelin.python: python3 instead of python
## Update host, database, user and password (from you're MySQL HeatWave)
import mysql.connector as connection
import pandas as pd
import seaborn as sns
sns.set_palette('husl')
%matplotlib inline
import matplotlib.pyplot as plt
try:
mydb = connection.connect(host="10.0.1.2", database='iris_ML', user="<user>", passwd="<password>", port=3306)
query = "SELECT * FROM iris;"
data = pd.read_sql(query,mydb)
mydb.close() #close the connection
except Exception as e:
my_conn.close()
print(str(e))
data.head()
data.info()
data.describe()
data['class'].value_counts()
g = sns.pairplot(data, hue='class', markers='*')
plt.show()
HeatWave Machine Learning (ML) inclut tout ce dont les utilisateurs ont besoin pour créer, former, déployer et expliquer des modèles d’apprentissage automatique dans MySQL HeatWave, sans coût supplémentaire.
Dans ce webinaire vous apprendrez
The information above come from the Iris table (reference table). It could interesting to compare these data with the training dataset (iris_train) and the test dataset in order to evaluate the quality of these samples.
Replace query = “SELECT * FROM iris;” by query = “SELECT * FROM iris_train;” and then query = “SELECT * FROM iris_test;”
Training a Model
Now we have a better understanding of the data, let’s moving forward and train the model. We’re having a classification problem.
The ML_TRAIN
routine, when run on a training dataset, produces a trained machine learning (ML) model.
# Train the model using ML_TRAIN
CALL sys.ML_TRAIN('iris_ML.iris_train', 'class', JSON_OBJECT('task', 'classification'), @iris_model);
You can show the current model, selecting the session variable @iris_model:
SELECT @iris_model;
The lifetime duration of a session variable is… the session lifetime duration. So when the session is closed the session variable content is lost.
ML_TRAIN
stores the machine learning model – @iris_model – in the MODEL_CATALOG
table:
# Model information
SELECT model_id, model_handle, model_owner, target_column_name, train_table_name, model_type, task, model_object_size FROM ML_SCHEMA_admin.MODEL_CATALOG;
You can take the last created model using the following query:
SELECT model_handle FROM ML_SCHEMA_admin.MODEL_CATALOG ORDER BY model_id DESC LIMIT 1 INTO @iris_model;
When the model is created (using ML_TRAIN) you should load it into HeatWave ML (ML_MODEL_LOAD):
CALL sys.ML_MODEL_LOAD(@iris_model, NULL);
Row Prediction
HeatWave ML allows you to make prediction for individual rows or the entire table.
Row(s) predictions are generated by running ML_PREDICT_ROW
.
Data are specified in JSON
format.
# Predict 1 Row
SET @row_input = JSON_OBJECT(
"sepal_length", 7.3,
"sepal_width", 2.9,
"petal_length", 6.3,
"petal_width", 1.8
);
SELECT sys.ML_PREDICT_ROW(@row_input, @iris_model);
Row Explanation
Being able to understand and explain a prediction is important in order to trust your model and to be able to explain the results. It might also be required by your local regulation.
ML_EXPLAIN_ROW
generates explanations for one or more rows of data. Explanations help you understand which features have the most influence on a prediction.
Feature importance is presented as a value ranging from -1 to 1.
- A positive value indicates that a feature contributed toward the prediction.
- A negative value indicates that the feature contributed toward a different prediction
# Prediction explanation for 1 Row
SET @row_input = JSON_OBJECT(
"sepal_length", 7.3,
"sepal_width", 2.9,
"petal_length", 6.3,
"petal_width", 1.8
);
SELECT sys.ML_EXPLAIN_ROW(@row_input, @iris_model);
Table Prediction
ML_PREDICT_TABLE
generates predictions for an entire table and saves the results to an output table.
# Generate predictions for a table
-- CALL sys.ML_MODEL_LOAD(@iris_model, NULL); -- Uncomment if the model is not yet loaded
DROP TABLE IF EXISTS iris_ML.iris_predictions; -- Useful if a table prediction was already generated
CALL sys.ML_PREDICT_TABLE('iris_ML.iris_test', @iris_model, 'iris_ML.iris_predictions');
A new table called iris_predictions is created. You can display Its first 5 rows with:
SELECT * FROM iris_ML.iris_predictions LIMIT 5;
Table Explanation
ML_EXPLAIN_TABLE
explains predictions for an entire table and saves results to an output table.
# Prediction explanation for a table
-- CALL sys.ML_MODEL_LOAD(@iris_model, NULL); -- Uncomment if the model is not yet loaded
DROP TABLE IF EXISTS iris_ML.iris_explanations; -- Usueful if a table explanation was already done
CALL sys.ML_EXPLAIN_TABLE('iris_ML.iris_test', @iris_model, 'iris_ML.iris_explanations');
A new table called iris_explanations is created. You can display Its first 5 rows with:
SELECT * FROM iris_ML.iris_explanations LIMIT 5;
Scores
Scoring the model allows to assess the model’s reliability.
Models with a low score can be expected to perform poorly, producing predictions and explanations that cannot be relied upon. A low score typically indicates that the provided feature columns are not a good predictor of the target values.
HeatWave ML supports a variety of scoring metrics to help you understand how your model performs across a series of benchmarks.
Details here.
ML_SCORE
returns a computed metric indicating the quality of the model.
-- This example uses the accuracy: Computes the fraction of labels a model predicts correctly
CALL sys.ML_SCORE('iris_ML.iris_validate', 'class', @iris_model, 'accuracy', @accuracy_score);
SELECT @accuracy_score;
Other metrics are also available. See here.
Here we go!
We’ve seen the machine learning life cycle.
I told you that Machine Learning with MySQL HeatWave was easy 🙂
You only need to use a limited set of SQL routines:
ML_TRAIN
: Trains a machine learning model for a given training dataset.ML_PREDICT_ROW
: Makes predictions for one or more rows of data.ML_PREDICT_TABLE
: Makes predictions for a table of data.ML_EXPLAIN_ROW
: Explains predictions for one or more rows of data.ML_EXPLAIN_TABLE
: Explains predictions for a table of data.ML_SCORE
: Computes the quality of a model.
During the last 20 years MySQL has democratized the usage of transactional databases.
Now with MySQL HeatWave we are in the process of democratizing Analytics and Machine Learning.
With MySQL HeatWave,
valorizing your data has never been so easy!
Resources
- Introduction to the series Discovering MySQL HeatWave Database Service
- Interactively explore & visualize your MySQL HeatWave data with Apache Zeppelin
- MySQL HeatWave Database Service (MDS)
- MySQL HeatWave User Guide
- HeatWave ML
- Documentation of MySQL Database Service
- Oracle Cloud Infrastructure (OCI)
- Oracle Cloud Infrastructure Glossary
- Oracle Cloud Free Tier
- MySQL — The world’s most popular open source database
- MySQL Shell
- Performance comparison of MySQL HeatWave with Snowflake, Amazon Redshift, Amazon Aurora, and Amazon RDS for MySQL
- Oracle’s latest MySQL HeatWave cloud database simplifies use of machine learning
- Download the notebooks
- Apache Zeppelin
- Iris Data Set
Watch my videos on my YouTube channel and subscribe.
Thanks for using HeatWave & MySQL!
Cloud Solutions Architect at Oracle
MySQL Geek, author, blogger and speaker
I’m an insatiable hunger of learning.
—–
Blog: www.dasini.net/blog/en/
Twitter: https://twitter.com/freshdaz
SlideShare: www.slideshare.net/freshdaz
Youtube: https://www.youtube.com/channel/UC12TulyJsJZHoCmby3Nm3WQ
—–
[…] Lors de cette présentation, je vais m’appuyer sur cet article: Iris Data Set with MySQL HeatWave Machine Learning & Zeppelin […]