Iris Data Set with MySQL HeatWave Machine Learning & Zeppelin

July 18, 2022

Tags: Cloud, ML, tuto

MySQL – The world’s most popular open source database – is also probably the most popular database in the Cloud.
MySQL HeatWave database service is THE MySQL PaaS, a fully managed cloud service in Oracle Cloud Infrastructure (OCI).

The only MySQL service 100% developed, managed and supported by the MySQL Team.

In short:

MySQL HeatWave is a massively parallel, high performance, in-memory query accelerator that accelerates MySQL performance by orders of magnitude for analytics workloads, mixed workloads, and Machine Learning.

Architectural Features of MySQL HeatWave — ***MySQL HeatWave Architecture***

For your analytics needs you can explore and visualize your data using a data visualization tool like Tableau, Qlik, Oracle Analytics Cloud, Apache Superset (see: Explore & visualize your MySQL HeatWave data with Superset)…

For your Machine Learning needs you can use a notebook like Jupyter, Apache Zeppelin (see: Interactively explore & visualize your MySQL HeatWave data with Apache Zeppelin),…

In this article I will show you how easy it is to use the Machine Learning capabilities of MySQL HeatWave. We will use the famous Iris dataset and we’re going to build, train, deploy and explain a machine learning model within MySQL HeatWave using Apache Zeppelin.

Prerequisites

To be able to reproduce this tutorial, you must have a MySQL HeatWave instance running (details here and also RTFM).
The data used in this article are available on my GitHub account here.

You need a MySQL client to load the data into MySQL HeatWave. I’m using MySQL Shell.
Furthermore if you want to use Zeppelin (although optional because we’re going to run mainly sql queries) , it must be properly setup.
Please read: Interactively explore & visualize your MySQL HeatWave data with Apache Zeppelin
You can also download the Zeppelin notebook here.

Finally, you might need to install some packages.
ex: (pip install pandas numpy unlzw3 sklearn seaborn matplotlib jupyter grpcio protobuf)

MySQL HeatWave ML

MySQL HeatWave ML makes it easy to use machine learning, whether you are a novice user or an experienced ML practitioner.

You provide the data, and HeatWave ML analyzes the characteristics of the data and creates an optimized machine learning model that you can use to generate predictions and explanations.

HeatWave ML supports supervised machine learning. That is, it creates a machine learning model by analyzing a labeled dataset to learn patterns that enable it to predict labels based on the features of the dataset. HeatWave ML supports both classification and regression models.
More information here.

**In-database Machine Learning with MySQL HeatWave**

All HeatWave ML operations are initiated by running CALL or SELECT statements, which can be easily integrated into your applications.

HeatWave ML routines include:

ML_TRAIN: Trains a machine learning model for a given training dataset.
ML_PREDICT_ROW: Makes predictions for one or more rows of data.
ML_PREDICT_TABLE: Makes predictions for a table of data.
ML_EXPLAIN_ROW: Explains predictions for one or more rows of data.
ML_EXPLAIN_TABLE: Explains predictions for a table of data.
ML_SCORE: Computes the quality of a model.
ML_MODEL_LOAD: Loads a machine learning model for predictions and explanations.
ML_MODEL_UNLOAD: Unloads a machine learning model.

Load the data

In a real life scenario, your data are already inside your MySQL instance. Although you might need to prepare them upfront, because the relevant data set must be inside a single table.

I’m using MySQL Shell (installed on the server that host Zeppelin) to load the data into the iris_ML schema. The dump file – MySQL_HeatWave-iris_ML.sql – was previously uploaded to the server into /home/ubuntu.

mysqlsh --version

mysqlsh myUser:MyP4s$W0rd@10.0.1.2 --sql -f /home/ubuntu/MySQL_HeatWave-iris_ML.sql

The dump contains 4 tables:

iris : the reference table ie source of truth, production data
iris_test : contains the test dataset
iris_train : contains the training dataset
iris_validate : contains the validation dataset

Check HeatWave ML status

HeatWave ML is enable by default as soon as your MySQL HeatWave Cluster is active.

The rapid_ml_status variable provides the status of HeatWave ML:

SHOW GLOBAL STATUS LIKE 'rapid_ml_status';

HeatWave ML is up and running \o/

Data exploration

The first step of any project based on data is to look at the data.
Summarize and visualize the data will help you to have a better understanding of the project.

It may surprise you but SQL provides some commands and functions for data exploration, although much less extensive than R or Python 🙂 :

-- Tables structure 
DESCRIBE iris_ML.iris;  -- Reference table (production table)
DESCRIBE iris_ML.iris_train; -- Table that contains the training dataset
DESCRIBE iris_ML.iris_test; -- Test dataset table
DESCRIBE iris_ML.iris_validate; -- Validation dataset table

-- Number of rows
SELECT count(*) FROM iris_ML.iris;
SELECT count(*) FROM iris_ML.iris_train;
SELECT count(*) FROM iris_ML.iris_test;
SELECT count(*) FROM iris_ML.iris_validate;

iris simulate the live production table. It contains the original data (150 rows) and actually it will not be used (as is) by HeatWave ML.
To keep things simple as possible in this article, iris as the same structure than iris_train. But in a real life scenario it will most likely not the case.
This production table may have additional columns like for example: a primary key, timestamp, etc… any useful business related information that are not relevant for building our model.

iris_train contains the training dataset a subset of iris table (120 rows), used to train the machine learning model.
This table is typically a subset of the reference table(s).

iris_test contains the test dataset a subset of iris table (30 rows), different than iris_train. Its structure is almost similar to iris_train but without the target column (class) .

iris_validate contains the validation dataset (30 rows). Same data than iris_test but same structure than iris_train in other words this table has the target column (class).

General requirements for HeatWave ML here.

-- Data sample
SELECT * FROM iris_ML.iris LIMIT 10;


-- Class distribution
SELECT class, count(*) FROM iris_ML.iris GROUP BY class;


-- Summary Statistics
SELECT MIN(sepal_length), MAX(sepal_length), ROUND(AVG(sepal_length), 2), ROUND(STD(sepal_length), 2) FROM iris_ML.iris;
SELECT MIN(sepal_width), MAX(sepal_width), ROUND(AVG(sepal_width), 2), ROUND(STD(sepal_width), 2) FROM iris_ML.iris;
SELECT MIN(petal_length), MAX(petal_length), ROUND(AVG(petal_length), 2), ROUND(STD(petal_length), 2) FROM iris_ML.iris;
SELECT MIN(petal_width), MAX(petal_width), ROUND(AVG(petal_width), 2), ROUND(STD(petal_width), 2) FROM iris_ML.iris;

Class distribution is well balanced:

Iris-virginica: 50 rows
Iris-setosa: 50 rows
Iris-versicolor: 50 rows

Data visualization

Visualize your data is probably the more convenient way to explore and understand them.

Below a little Python script to generate some graphs.

Beforehand, I had to edit the Zeppelin python interpreter – zeppelin.python – and replace python by python3:

Update zeppelin.python: replace python by python3 — *zeppelin.python*: python3:

The interpreter is: %python.ipython
The script is going to connect to MySQL HeatWave, thus you must update the database information (host, database, user, password)

%python.ipython

## Edit python interpreter on Zeppelin - zeppelin.python: python3  instead of python
## Update host, database, user and password (from you're MySQL HeatWave)



import mysql.connector as connection
import pandas as pd
import seaborn as sns
sns.set_palette('husl')
%matplotlib inline
import matplotlib.pyplot as plt


try:
    mydb = connection.connect(host="10.0.1.2", database='iris_ML', user="<user>", passwd="<password>", port=3306)
    query = "SELECT * FROM iris;"
    data = pd.read_sql(query,mydb)
    mydb.close() #close the connection
except Exception as e:
    my_conn.close()
    print(str(e))

data.head()

data.info()

data.describe()

data['class'].value_counts()

g = sns.pairplot(data, hue='class', markers='*')

plt.show()

The information above come from the Iris table (reference table). It could interesting to compare these data with the training dataset (iris_train) and the test dataset in order to evaluate the quality of these samples.

Replace {query = “SELECT * FROM iris;”}
by {query = “SELECT * FROM iris_train;”}
and then by {query = “SELECT * FROM iris_test;”}

Training a Model

Now we have a better understanding of the data, let’s moving forward and train the model. We’re having a classification problem.

The ML_TRAIN routine, when run on a training dataset, produces a trained machine learning (ML) model.

# Train the model using ML_TRAIN

CALL sys.ML_TRAIN('iris_ML.iris_train', 'class', JSON_OBJECT('task', 'classification'), @iris_model);

You can show the current model, selecting the session variable @iris_model:

SELECT @iris_model;

The lifetime duration of a session variable is… the session lifetime duration. So when the session is closed the session variable content is lost.

ML_TRAIN stores the machine learning model – @iris_model – in the MODEL_CATALOG table:

# Model information

SELECT model_id, model_handle, model_owner, target_column_name, train_table_name, model_type, task, model_object_size  FROM ML_SCHEMA_admin.MODEL_CATALOG;

You can take the last created model using the following query:

SELECT model_handle FROM ML_SCHEMA_admin.MODEL_CATALOG ORDER BY model_id DESC LIMIT 1 INTO @iris_model;

When the model is created (using ML_TRAIN) you should load it into HeatWave ML (ML_MODEL_LOAD):

CALL sys.ML_MODEL_LOAD(@iris_model, NULL);

Starting with HeatWave 9.0, you can check which models are currently loaded and active, using sys.ML_MODEL_ACTIVE:

SQL>
CALL sys.ML_MODEL_ACTIVE('current', @model_info);
Query OK, 0 rows affected (0.7321 sec)


SELECT JSON_PRETTY(@model_info)\G
*************************** 1. row ***************************
JSON_PRETTY(@model_info): [
  {
    "total model size(bytes)": 13951
  },
  {
    "iris_ML.iris_train_admin_1657554696": {
      "task": "classification",
      "notes": "",
      "chunks": 1,
      "format": "HWMLv2.0",
      "n_rows": 120,
      "status": "Ready",
      "options": {
        "task": "classification",
        "model_explainer": "permutation_importance, shap",
        "optimization_metric": "accuracy",
        "prediction_explainer": "permutation_importance"
      },
      "n_columns": 4,
      "pos_class": null,
      "column_names": [
        "sepal_length",
        "sepal_width",
        "petal_length",
        "petal_width"
      ],
      "model_quality": "high",
      "training_time": 6.638953,
      "algorithm_name": "GaussianNB",
      "training_score": null,
      "build_timestamp": 1740563992,
      "n_selected_rows": 96,
      "training_params": {
        "sp_arr": null,
        "timezone": null,
        "recommend": "ratings",
        "force_use_X": false,
        "recommend_k": 3,
        "remove_seen": true,
        "ranking_topk": 10,
        "contamination": null,
        "lsa_components": 100,
        "semisupervised": null,
        "ranking_threshold": 1,
        "feedback_threshold": 1
      },
      "train_table_name": "iris_ML.iris_train",
      "model_explanation": {
        "shap": {
          "petal_width": 0.3578,
          "sepal_width": 0.0,
          "petal_length": 0.2829,
          "sepal_length": 0.0
        },
        "permutation_importance": {
          "petal_width": 0.3872,
          "sepal_width": 0.0,
          "petal_length": 0.3843,
          "sepal_length": 0.0
        }
      },
      "model_object_size": 13951,
      "n_selected_columns": 2,
      "target_column_name": "class",
      "optimization_metric": "accuracy",
      "exclude_feature_names": [],
      "include_feature_names": [],
      "selected_column_names": [
        "petal_length",
        "petal_width"
      ],
      "training_drift_metric": {
        "mean": 0.018764767953578888,
        "variance": 0.0008759556124732613
      }
    }
  }
]

Generate Model Explanations

Starting with HeatWave 9, you can train the model explainer. That will help you to understand which features have the most influence on a prediction:

SQL>
-- Train the model Explainer
CALL sys.ML_EXPLAIN('iris_ML.iris_train', 'class', @iris_model, JSON_OBJECT('model_explainer', 'shap', 'prediction_explainer', 'permutation_importance'));


-- Check the result
SELECT JSON_PRETTY(model_explanation) 
FROM ML_SCHEMA_admin.MODEL_CATALOG 
WHERE model_handle = @iris_model\G
*************************** 1. row ***************************
JSON_PRETTY(model_explanation): {
  "shap": {
    "petal_width": 0.3578,
    "sepal_width": 0.0,
    "petal_length": 0.2829,
    "sepal_length": 0.0
  },
  "permutation_importance": {
    "petal_width": 0.3872,
    "sepal_width": 0.0,
    "petal_length": 0.3843,
    "sepal_length": 0.0
  }
}

Row Prediction

HeatWave ML allows you to make prediction for individual rows or the entire table.

Row(s) predictions are generated by running ML_PREDICT_ROW.
Data are specified in JSON format.

# Predict 1 Row

SET @row_input = JSON_OBJECT( 
   "sepal_length", 7.3, 
   "sepal_width", 2.9, 
   "petal_length", 6.3, 
   "petal_width", 1.8
);  


SELECT sys.ML_PREDICT_ROW(@row_input, @iris_model);

Row Explanation

Being able to understand and explain a prediction is important in order to trust your model and to be able to explain the results. It might also be required by your local regulation.

ML_EXPLAIN_ROW generates explanations for one or more rows of data. Explanations help you understand which features have the most influence on a prediction.
Feature importance is presented as a value ranging from -1 to 1.

A positive value indicates that a feature contributed toward the prediction.
A negative value indicates that the feature contributed toward a different prediction

# Prediction explanation for 1 Row

SET @row_input = JSON_OBJECT( 
   "sepal_length", 7.3, 
   "sepal_width", 2.9, 
   "petal_length", 6.3, 
   "petal_width", 1.8
);    


SELECT sys.ML_EXPLAIN_ROW(@row_input, @iris_model);

Table Prediction

ML_PREDICT_TABLE generates predictions for an entire table and saves the results to an output table.

# Generate predictions for a table 

-- CALL sys.ML_MODEL_LOAD(@iris_model, NULL); -- Uncomment if the model is not yet loaded

DROP TABLE IF EXISTS iris_ML.iris_predictions; -- Useful if a table prediction was already generated

CALL sys.ML_PREDICT_TABLE('iris_ML.iris_test', @iris_model, 'iris_ML.iris_predictions');

A new table called iris_predictions is created. You can display Its first 5 rows with:

SELECT * FROM iris_ML.iris_predictions LIMIT 5;

Table Explanation

ML_EXPLAIN_TABLE explains predictions for an entire table and saves results to an output table.

# Prediction explanation for a table 

-- CALL sys.ML_MODEL_LOAD(@iris_model, NULL); -- Uncomment if the model is not yet loaded

DROP TABLE IF EXISTS iris_ML.iris_explanations;  -- Usueful if a table explanation was already done

CALL sys.ML_EXPLAIN_TABLE('iris_ML.iris_test', @iris_model, 'iris_ML.iris_explanations');

A new table called iris_explanations is created. You can display Its first 5 rows with:

SELECT * FROM iris_ML.iris_explanations LIMIT 5;

Scores

Scoring the model allows to assess the model’s reliability.

Models with a low score can be expected to perform poorly, producing predictions and explanations that cannot be relied upon. A low score typically indicates that the provided feature columns are not a good predictor of the target values.

HeatWave ML supports a variety of scoring metrics to help you understand how your model performs across a series of benchmarks.
Details here.

ML_SCORE returns a computed metric indicating the quality of the model.

-- This example uses the accuracy: Computes the fraction of labels a model predicts correctly

CALL sys.ML_SCORE('iris_ML.iris_validate', 'class', @iris_model, 'accuracy', @accuracy_score);

SELECT @accuracy_score;

Other metrics are also available. See here.

Here we go!
We’ve seen the machine learning life cycle.

I told you that Machine Learning with MySQL HeatWave was easy 🙂

You only need to use a limited set of SQL routines:

ML_TRAIN: Trains a machine learning model for a given training dataset.
ML_PREDICT_ROW: Makes predictions for one or more rows of data.
ML_PREDICT_TABLE: Makes predictions for a table of data.
ML_EXPLAIN_ROW: Explains predictions for one or more rows of data.
ML_EXPLAIN_TABLE: Explains predictions for a table of data.
ML_SCORE: Computes the quality of a model.

During the last 20 years MySQL has democratized the usage of transactional databases.
Now with MySQL HeatWave we are in the process of democratizing Analytics and Machine Learning.

With MySQL HeatWave,
valorizing your data has never been so easy!

Resources

Follow me on Linkedin

Watch my videos on my YouTube channel and subscribe.

My Slideshare account.

My Speaker Deck account.

Thanks for using HeatWave & MySQL!

Olivier DASINI

Cloud Solutions Architect at Oracle
MySQL Geek, strong interest in data and AI, author, blogger and speaker
I’m an insatiable hunger of learning.
—–

Blog: www.dasini.net/blog/en/
Twitter: https://twitter.com/freshdaz
SlideShare: www.slideshare.net/freshdaz
Youtube: https://www.youtube.com/channel/UC12TulyJsJZHoCmby3Nm3WQ
—–

One Response to “Iris Data Set with MySQL HeatWave Machine Learning & Zeppelin”

Webinar – Apprentissage automatique avec MySQL HeatWave | dasini.net - Journal d'experts MySQL

July 29th, 2022 at 12:07

[…] Lors de cette présentation, je vais m’appuyer sur cet article: Iris Data Set with MySQL HeatWave Machine Learning & Zeppelin […]

dasini.net – Diary of a MySQL expert