Building an AI Vision Search Engine with MySQL HeatWave GenAI

May 14, 2026

Modern AI systems increasingly rely on multimodal data: text, images, documents, audio, and video. Among these modalities, image understanding has become one of the most important capabilities for AI-powered applications.

Organizations now expect systems to:

  • Search images using natural language
  • Compare visually similar assets
  • Classify and enrich image catalogs
  • Validate AI-generated descriptions
  • Build semantic retrieval pipelines

Traditionally, implementing these capabilities required specialized computer vision infrastructure, external vector databases, custom ML pipelines, and multiple frameworks.

With MySQL HeatWave GenAI, many of these capabilities can now be implemented directly inside SQL workflows using built-in AI routines such as:

In this article, we will build the foundations of a Vision Model Evaluation Assistant using MySQL HeatWave GenAI.

The objective is not to create another image classifier, but rather a semantic image understanding platform capable of:

  • Generating image descriptions
  • Producing semantic embeddings
  • Performing text-to-image search
  • Performing semantic reverse image search
  • Evaluating image understanding quality

This article intentionally focuses on the database layer and the SQL logic powering the system.

Landscape of Vision-to-Embedding Approaches

Before building the solution, it is important to understand the major approaches currently used for AI-powered image understanding systems.

Let’s have a look on three paradigms for representing images in AI systems.

1. Direct Image Embeddings

The first approach consists of converting raw image pixels directly into vectors.

Popular architectures include:

In this approach:

  • The image is fed directly into a neural network
  • The network produces a dense vector embedding
  • Similarity search is performed directly on image vectors

Workflow

Strengths

  • Extremely fast similarity search
  • Excellent for image-to-image matching
  • End-to-end architecture
  • No intermediate textual representation required

Limitations

These embeddings are often difficult to interpret:

  • No human-readable explanation
  • Harder to debug
  • Harder to refine
  • Black-box behavior

If a search result is incorrect, it is difficult to understand why the model failed.

To summarize

Pros

  • High-performance retrieval
  • Strong visual similarity matching
  • Can compare image-to-image directly

Cons

  • Requires specialized model deployment
  • Less explainable
  • Harder to integrate into SQL-centric workflows

2. Semantic Metadata Embeddings

This is the approach used in this article.

Instead of embedding raw pixels directly, we first generate a textual understanding of the image.

The workflow becomes:

This pipeline is composed of two distinct stages:

Step 1: Image -> Text

A Vision Language Model (VLM) analyzes the image and generates:

  • A description and / or Keywords
  • Contextual understanding

Example:

{    "description": "A red motorcycle parked on a city street near buildings during daytime.",

    "keywords": ["motorcycle", "street", "urban", "vehicle", "city"]
}

Step 2: Text -> Embedding

The generated text is then converted into a vector embedding using a text embedding model.

This creates semantic vectors representing the meaning of the image.

Strengths

This approach offers several advantages:

Human-readable and explainable

Unlike direct image embeddings, the generated metadata is understandable by humans.

If a search fails, developers can inspect:

  • The generated description
  • The extracted keywords
  • The semantic interpretation

This creates a powerful audit trail.

SQL-native workflow

Because the representation becomes textual:

  • Standard embedding models can be used
  • VECTOR columns integrate naturally
  • SQL workflows remain simple

This aligns perfectly with MySQL HeatWave GenAI capabilities.

Trade-offs

The quality of retrieval depends heavily on:

  • Caption quality
  • Prompt engineering
  • Vision model performance

Low-level visual details may also be lost:

  • Texture
  • Exact shape
  • Pixel-specific patterns

To summarize

Pros

  • Human-readable
  • Easy to debug
  • Works well with SQL systems
  • Easier operationalization

Cons

  • Less precise for purely visual similarity
  • Depends on generated metadata quality

3. Multimodal Fusion Models

A third category consists of multimodal architectures capable of jointly processing:

  • Images
  • Text
  • Audio
  • Video
  • Structured data

These systems use:

  • Early fusion
  • Intermediate fusion
  • Late fusion
  • Hybrid fusion architectures

Their objective is to create unified multimodal embeddings.

Strengths

  • Very powerful semantic understanding
  • Strong cross-modal reasoning
  • Rich contextual interpretation

Challenges

These architectures are often:

  • More complex
  • Harder to operationalize
  • Less transparent
  • Difficult to integrate into SQL-centric architectures

Why We Chose the Image -> Text -> Embedding Approach?

In this article, we intentionally choose the Semantic Metadata Embeddings pipeline because it aligns perfectly with MySQL HeatWave GenAI capabilities.

Specifically:

This approach also creates a human-readable semantic audit trail.

If retrieval quality degrades, architects and developers can directly inspect the generated descriptions to understand:

  • What the model detected
  • What it missed
  • Why semantic similarity failed

For AI evaluation systems, explainability is often just as important as accuracy.

Overview of MySQL HeatWave GenAI

MySQL HeatWave is a fully managed cloud service that combines:

  • Transactional processing (OLTP)
  • Analytics (OLAP)
  • Machine Learning
  • Generative AI
  • Vector Store capabilities

within a single MySQL platform.

One of the most interesting aspects of HeatWave GenAI is that it enables AI-powered workflows directly inside SQL.

Instead of exporting data to external AI pipelines, developers can:

  • Generate embeddings
  • Execute semantic similarity search
  • Perform retrieval augmented generation (RAG)
  • Analyze images
  • Invoke LLMs

using SQL routines \o/.

MySQL HeatWave Key GenAI Capabilities

MySQL HeatWave GenAI includes:

  • Text generation
  • Text embeddings
  • Vector similarity search
  • In-database AI processing
  • Vision model integration

Key routines and data types used in this article include:

In this article, I’m using MySQL HeatWave version 9.6.1:

mysql> 
SELECT VERSION();
+-------------+
| VERSION()   |
+-------------+
| 9.6.1-cloud |
+-------------+

Significantly lower the Total Cost of Ownership

Traditionally, AI pipelines require:

  • Multiple services
  • External vector databases
  • Dedicated ML infrastructure
  • Separate orchestration layers

With MySQL HeatWave:

  • Data remains inside MySQL
  • AI logic becomes SQL-native
  • Governance is simplified
  • Operational complexity is reduced

By consolidating these diverse requirements into a single platform, MySQL HeatWave significantly lowers the Total Cost of Ownership (TCO). It eliminates the expensive licensing fees associated with multiple third-party services and external vector databases.

Furthermore, by making AI logic SQL-native and reducing operational complexity, organizations can leverage their existing database expertise rather than hiring specialized ML engineers to manage separate orchestration layers, drastically cutting both infrastructure and personnel overhead.

Summary of TCO Drivers

  • Skill Set Leverage: AI tasks become accessible to anyone with SQL skills, reducing the need for niche, high-cost specialists.
  • Reduced Licensing: Consolidating multiple services into one platform removes the need for separate vendor contracts.
  • Lower Infrastructure Costs: Eliminating dedicated ML infrastructure and external databases reduces the physical or cloud hardware footprint.
  • Operational Efficiency: Keeping data inside MySQL HeatWave simplifies governance and reduces the labor hours required for complex data movement and synchronization.

Generating Image Understanding with sys.ML_GENERATE

The foundation of our evaluation assistant is the ability to convert images into semantic understanding.

This is where sys.ML_GENERATE becomes extremely powerful.

Prerequisites

The environment requires:

  • A MySQL HeatWave instance with a HeatWave cluster attached
  • Access to supported vision models (like google.gemini-2.5-pro)
  • Access to supported embedding models (like cohere.embed-english-v3.0)
  • A table capable of storing:
    • Images
    • Descriptions
    • Keywords
    • Embeddings

Table Definition

CREATE TABLE image_details (  id_image int unsigned NOT NULL AUTO_INCREMENT,
  image_name varchar(255) NOT NULL,
  image_base64 longtext NOT NULL,
  image_description json NOT NULL,
  image_description_embedding vector(2048) NOT NULL COMMENT 'GENAI_OPTIONS=EMBED_MODEL_ID=cohere.embed-english-v3.0',
  image_keywords json NOT NULL,
  image_keywords_embedding vector(2048) NOT NULL COMMENT 'GENAI_OPTIONS=EMBED_MODEL_ID=cohere.embed-english-v3.0',
  created_at timestamp NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (id_image)
) ENGINE=InnoDB;

This table stores:

  • The original image (in base 64).
  • Semantic metadata (Image description and keywords).
  • Vector embeddings (of the description and the keywords).

directly inside MySQL.

What sys.ML_GENERATE Enables

The sys.ML_GENERATE routine supports multimodal prompts combining:

  • Text
  • Images

This enables:

  • Caption generation
  • Object detection prompts
  • Context extraction
  • Semantic interpretation

Please read the MySQL HeatWave ML_GENERATE Documentation if you want to know more.

Bridging Structured and Unstructured Data

One of the most interesting aspects of ML_GENERATE is that it bridges the gap between:

  • Structured SQL systems
  • Unstructured image content

The output becomes structured semantic metadata that can:

  • Be stored
  • Indexed
  • Embedded
  • Queried
  • Compared

using standard SQL operations.

Supported Vision Models

The supported vision models can be queried directly from MySQL HeatWave:

mysql>
SELECT * 
FROM sys.ML_SUPPORTED_LLMS
WHERE model_id LIKE 'google.gemini%'\G

*************************** 1. row ***************************
         provider: OCI Generative AI Service
         model_id: google.gemini-2.5-flash
availability_date: 2026-01-22
     capabilities: ["GENERATION"]
    default_model: 0
*************************** 2. row ***************************
         provider: OCI Generative AI Service
         model_id: google.gemini-2.5-pro
availability_date: 2026-01-22
     capabilities: ["GENERATION"]
    default_model: 0
*************************** 3. row ***************************
         provider: OCI Generative AI Service
         model_id: google.gemini-2.5-flash-lite
availability_date: 2026-01-22
     capabilities: ["GENERATION"]
    default_model: 0

The model used in this article is:

google.gemini-2.5-pro

The workflow is straightforward:

This transforms unstructured visual data into structured semantic data.

Examples

Object Detection Prompt

mysql> 
SELECT image_base64
FROM image_details
WHERE id_image = 1
INTO @image_base64;

SET @prompt_desc_key_json = 'What objects are present?';

SELECT JSON_UNQUOTE(
    JSON_EXTRACT(
        sys.ML_GENERATE(@prompt_desc_key_json, JSON_OBJECT("model_id", "google.gemini-2.5-pro", "image", @image_base64)),
        '$.text'
    )
) AS description
\G

*************************** 1. row ***************************
description: Based on the image provided, here are the objects that are present:

**Main Subject:**
*   A black motorcycle, which appears to be a Honda Hornet, is the central object. Its visible parts include:
    *   Wheels and tires
    *   Engine
    *   Exhaust pipe and muffler
    *   Frame
    *   Seat
    *   Fuel tank (with a black tank cover on it)
    *   Handlebars with mirrors and levers
    *   Front and rear disc brakes
    *   Gold-colored front forks

**Background and Surroundings:**
*   A white or light gray garage door with vertical panels.
*   A red brick wall on the left.
*   A white drainpipe between the wall and the garage door.
*   Pavement or concrete on the ground, with some patches of frost or thin snow.
*   Dry, brown leaves scattered on the ground.

Scene Context Prompt

mysql> 
SET @prompt_desc_key_json = 'What is the context of this scene?';

SELECT JSON_UNQUOTE(
    JSON_EXTRACT(
        sys.ML_GENERATE(@prompt_desc_key_json, JSON_OBJECT("model_id", "google.gemini-2.5-pro", "image", @image_base64)),
        '$.text'
    )
) AS description
\G

*************************** 1. row ***************************
description: Based on the image, the context of the scene is an outdoor, residential setting, likely during a cold season like late autumn or winter.

Here are the key details that establish this context:

*   **Subject:** The central focus is a modern, black and silver Honda Hornet motorcycle. It is parked and appears to be the subject of the photo, possibly taken by its owner.
*   **Location:** The motorcycle is parked on a paved surface, such as a driveway or a courtyard, in front of white, vertical-paneled garage doors. To the left, there is a red brick wall. This combination suggests a residential area.
*   **Season/Weather:** The ground is littered with dead, brown leaves and patches of what looks like frost or old, melting snow. This indicates cold weather, pointing to late autumn or winter. The lighting appears flat and overcast, which is common for that time of year.

This capability becomes the semantic foundation of the evaluation assistant. We are not merely generating captions.

We are enabling:

  • Queryable visual understanding
  • Semantic enrichment
  • Explainable AI retrieval

Text-to-Image Search (Semantic Retrieval)

One of the most powerful applications of semantic embeddings is natural language image search.

The objective is simple:

Input: Text
Output: Relevant Images

But internally, the logic is fundamentally different from traditional image retrieval systems.

The Core Idea

We are not searching for:

  • Pixels
  • Shapes
  • Exact visual patterns

We are searching for:

  • Meaning
  • Context
  • Semantics

This is semantic retrieval.

Workflow Overview

The workflow contains two phases:

  1. Preprocessing
  2. Query execution

Preprocessing Phase

Each image must first be semantically enriched.

The workflow is the following:

Examples


Supported Embedding Models

mysql> 
SELECT *
FROM sys.ML_SUPPORTED_LLMS
WHERE capabilities LIKE '["TEXT_EMBEDDINGS"]'\G

*************************** 1. row ***************************
         provider: HeatWave
         model_id: all_minilm_l12_v2
availability_date: 2024-07-01
     capabilities: ["TEXT_EMBEDDINGS"]
    default_model: 0
*************************** 2. row ***************************
         provider: HeatWave
         model_id: multilingual-e5-small
availability_date: 2024-07-24
     capabilities: ["TEXT_EMBEDDINGS"]
    default_model: 1
*************************** 3. row ***************************
         provider: OCI Generative AI Service
         model_id: cohere.embed-english-v3.0
availability_date: 2024-02-17
     capabilities: ["TEXT_EMBEDDINGS"]
    default_model: 0
*************************** 4. row ***************************
         provider: OCI Generative AI Service
         model_id: cohere.embed-multilingual-v3.0
availability_date: 2024-02-17
     capabilities: ["TEXT_EMBEDDINGS"]
    default_model: 0

Generating Semantic Metadata

mysql> 
SET @prompt_desc_key_json = '
Analyze the image and return a minified JSON object.
The JSON must contain two fields:
"description" (a concise, single-paragraph description of visible content with no speculation)
and "keywords" (a JSON array of relevant lowercase visual keywords).
Output ONLY the raw JSON on a single line without any markdown formatting, backticks, or preamble
';

Generate Description and Keywords

mysql> 
SELECT JSON_UNQUOTE(
    JSON_EXTRACT(
        sys.ML_GENERATE(@prompt_desc_key_json, JSON_OBJECT("model_id", "google.gemini-2.5-pro", "image", @image_base64)),
        '$.text'
    )
) AS description
\G

*************************** 1. row ***************************
description: {"description":"A side profile view of a black Honda Hornet naked motorcycle parked on a paved surface with scattered leaves. The bike has a black frame, engine, and fuel tank, with a silver tail section and exhaust pipe. The front forks are gold-colored. In the background, there is a red brick wall on the left and a light-colored, vertically-paneled garage door on the right.","keywords":["motorcycle","honda","honda hornet","naked bike","street bike","black","parked","outdoors","garage door","brick wall","engine","exhaust","two-wheeler","vehicle","wheel","tire","disc brake"]}

Generate Embeddings and Store Data

mysql> 
INSERT INTO image_details (
    image_name,
    image_base64,
    image_description,
    image_description_embedding,
    image_keywords,
    image_keywords_embedding
)
VALUES (
    image_name,
    image_base64,
    image_description,
    sys.ML_EMBED_ROW(
        @source_image_descriptions,
        '{"model_id": "cohere.embed-english-v3.0"}'
    ),
    image_keywords,
    sys.ML_EMBED_ROW(
        @source_image_keywords,
        '{"model_id": "cohere.embed-english-v3.0"}'
    )
);

Query Phase

At query time:

  1. The user enters a text description
  2. The text becomes an embedding
  3. Similarity search retrieves matching images

Generate Query Embedding

mysql> 
SET @embeddOptions = '{"model_id": "cohere.embed-english-v3.0"}';

SET @searchImage = 'street motorcycle';

SELECT sys.ML_EMBED_ROW(
    @searchImage,
    @embeddOptions
) INTO @searchImageEmbedding;

If you want to see what a vector embedding looks like under the hood, you can use the FROM_VECTOR() function:

mysql>
SELECT from_vector(@searchImageEmbedding)\G
*************************** 1. row ***************************
from_vector(@searchImageEmbedding): [2.09045e-02,1.06277e-02,2.59590e-03,-6.01807e-02,-4.44336e-02,8.44574e-03,-4.88281e-02,8.72803e-03,-1.45645e-02,2.78168e-02,-4.73328e-02,2.53296e-03,-5.47791e-03,-1.36337e-02,4.00696e-02,-1.90735e-02,6.69556e-02,-3.72314e-02,3.95203e-02,2.24457e-02,-1.06964e-02,7.75757e-02,-1 
                 ... even more numbers ...
,-3.04871e-02,5.33295e-03,3.96729e-03]

Similarity Search Using Descriptions

mysql> 
SELECT 
    image_name,
    DISTANCE(
        image_description_embedding,
        @searchImageEmbedding,
        'COSINE'
    ) AS min_distance
FROM image_details
ORDER BY min_distance
LIMIT 3;

+-----------------------+---------------------+
| image_name            | min_distance        |
+-----------------------+---------------------+
| vintage_sidecar.JPG   | 0.46460431814193726 |
| Le_Mans_24_moto.png   | 0.49795448780059814 |
| hornet_black_gold.jpg |  0.5088052749633789 |
+-----------------------+---------------------+

Similarity Search Using Description + Keywords

mysql> 
WITH distances AS (
    SELECT
        image_name,
        (
            DISTANCE(
                image_keywords_embedding,
                @searchImageEmbedding,
                'COSINE'
            )
            +
            DISTANCE(
                image_description_embedding,
                @searchImageEmbedding,
                'COSINE'
            )
        ) / 2 AS avg_distance
    FROM image_details
)
SELECT *
FROM distances
ORDER BY avg_distance ASC
LIMIT 3;

+-----------------------+---------------------+
| image_name            | avg_distance        |
+-----------------------+---------------------+
| hornet_black_gold.jpg | 0.48873063921928406 |
| vintage_sidecar.JPG   |  0.4904197156429291 |
| Yamaha_DTX.jpg        |  0.5230678915977478 |
+-----------------------+---------------------+

Simple Re-ranking Strategy

mysql> 
WITH initial_results AS (
    SELECT
        image_name,
        DISTANCE(
            image_keywords_embedding,
            @searchImageEmbedding,
            'COSINE'
        ) AS keywords_distance,

        DISTANCE(
            image_description_embedding,
            @searchImageEmbedding,
            'COSINE'
        ) AS description_distance

    FROM image_details

    ORDER BY keywords_distance + description_distance
    LIMIT 15
),

reranked_results AS (
    SELECT
        image_name,
        (
            0.3 * keywords_distance +
            0.7 * description_distance
        ) AS combined_distance

    FROM initial_results
)

SELECT *
FROM reranked_results
ORDER BY combined_distance ASC
LIMIT 3;

+-----------------------+--------------------+
| image_name            | combined_distance  |
+-----------------------+--------------------+
| vintage_sidecar.JPG   | 0.4800935566425323 |
| hornet_black_gold.jpg |  0.496760493516922 |
| Le_Mans_24_moto.png   | 0.5196295261383057 |
+-----------------------+--------------------+

Key Insight

The most important concept here is this:

  • We are not comparing images.
  • We are comparing semantic representations of images.

This distinction fundamentally changes how image retrieval systems can be designed using SQL-native AI capabilities.

Reverse Image Search (Image-to-Image via Semantics)

Traditional reverse image search systems rely heavily on:

  • Pixel similarity
  • Feature extraction
  • Specialized computer vision pipelines

Our approach is different.

We perform Semantic Image-to-Image Search.

This is:

  • Similarity of meaning
  • Not similarity of pixels

Workflow

Why This Is Interesting

Two images may be visually different while still being semantically related.

For example:

  • Different motorcycles
  • Different lighting conditions
  • Different camera angles

Yet both images may describe: « A motorcycle parked on an urban street ».

Semantic retrieval captures this meaning.

Advantages of the Semantic Approach

Simpler Architecture

No need for:

  • Complex Computer Vision frameworks
  • GPU-heavy image feature pipelines
  • Specialized image vector databases

Explainability

Because the retrieval is based on generated semantic metadata:

  • Results can be audited
  • Descriptions can be refined
  • Prompts can be improved

SQL-native Workflow

Everything remains inside MySQL HeatWave:

  • Metadata
  • Embeddings
  • Similarity search
  • Ranking logic

Architectural Trade-offs & Design Considerations

Building AI-powered semantic image systems involves several important architectural decisions.

Description vs Keywords

Both representations serve different purposes.

Descriptions

Descriptions provide:

  • Rich contextual understanding
  • Better semantic reasoning
  • Natural language flexibility

However:

  • They may introduce noise
  • Longer text may dilute embeddings

Keywords

Keywords provide:

  • Focused semantic signals
  • Better precision
  • Faster matching

However:

  • They lose contextual richness

Strategy implemented in this article

A hybrid approach:

  • Used descriptions for semantic context
  • Used keywords for precision
  • Combined both during re-ranking

However

I’m storing two embeddings doubles vector storage (2 × 2048 dimensions per image).
For very large libraries, consider using only one – either concatenated or choose the one that performs better on your validation set.

Storing Images in the Database vs Object Storage

Another important design consideration concerns image storage.

Storing Images in MySQL

Advantages:

  • Simpler architecture
  • Centralized governance
  • Easier transactional consistency

Disadvantages:

  • Larger database size increases buffer pool pressure
  • Backup overhead
  • Potentially performance impact (queries, replication,… )

Storing Images in Object Storage

Advantages:

  • Better scalability
  • Lower storage costs
  • Better CDN integration

Disadvantages:

  • More complex architecture
  • Additional security management

Recommended Hybrid Strategy

A practical enterprise approach is often:

  • Store thumbnails (small, e.g., 512×512) as base64 in MySQL (should be sufficient for vision model analysis)
  • Store full-resolution images in object storage for later retrieval or user download

This enables:

  • Fast previews
  • Efficient semantic retrieval
  • Scalable storage architecture

Prompt Engineering Matters

Semantic quality heavily depends on prompts.

Poor prompts generate:

  • Vague descriptions
  • Weak keywords
  • Low-quality embeddings

Well-designed prompts improve:

  • Retrieval quality
  • Explainability
  • Evaluation consistency

Prompt engineering becomes a critical part of semantic architecture design.

Péroraison

MySQL HeatWave GenAI significantly simplifies the implementation of semantic image understanding systems while substantially lowering the Total Cost of Ownership through the elimination of external infrastructure and fragmented AI pipelines.

By combining:

  • Vision-language models
  • Text embeddings
  • Vector search
  • SQL-native AI routines

organizations can build powerful multimodal systems directly inside MySQL HeatWave.

The « image -> text -> embedding » approach presented in this article provides:

  • Explainability
  • Operational simplicity
  • SQL-native integration
  • Semantic search capabilities

Most importantly, it enables architects and developers to build AI systems that remain understandable, debuggable, and governable — which is often one of the biggest challenges in enterprise AI adoption.

Resources

Follow me on Linkedin

Watch my videos on my YouTube channel and subscribe.

My Slideshare account.

My Speaker Deck account.

Thanks for using HeatWave & MySQL!

0

Querying the Unstructured: Natural Language to SQL for JSON Data

October 14, 2025

In Ask Your Database Anything: Natural Language to SQL (NL2SQL) in MySQL HeatWave, we have explored the innovative MySQL HeatWave GenAI technology that converts Natural Language into SQL, making it easier for you to interact with databases. This feature collects information on the schemas, tables, and columns that you have access to, and then uses a Large Language Model (LLM) to generate an SQL query for the question pertaining to your data. It also lets you run the generated query and view the result set.

Following our last article, Let Your AI DBA Assistant Write Your MySQL Queries, showcased an interesting use case: leveraging an AI DBA Assistant to generate monitoring and tuning queries for the Performance, Information, and Sys Schemas using plain English.

In this article, we will talk about how to handle JSON documents with MySQL HeatWave GenAI NL2SQL feature.

Context

I’m using MySQL HeatWave 9.4.2 on Oracle Cloud Infrastructure (OCI):

SQL> 
-- HeatWave MySQL server version
SHOW VARIABLES WHERE Variable_name IN ('version_comment', 'version');
+-----------------+--------------------------+
| Variable_name   | Value                    |
+-----------------+--------------------------+
| version         | 9.4.2-cloud              |
| version_comment | MySQL Enterprise - Cloud |
+-----------------+--------------------------+

Furthermore, I’m using a modified version (available on my GitHub account, file: nl2sql_world_json_nested_-_tables_json_nested.sql) of the well-known World database. Although the data hasn’t been updated in some time, it still serves as a useful tool for visualizing the results of SQL queries.

Here the schema:

CREATE TABLE `country_json_nested` (
  `ID` int NOT NULL AUTO_INCREMENT COMMENT 'Primary key: integer that uniquely identifies country_json JSON documents.',
  `country_description` json NOT NULL COMMENT 'Country information. Including country names, continent, region, surface area, identifiers (country code), demographics, economy (GNP), and government structure',
  PRIMARY KEY (`ID`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMMENT='Stores detailed information about countries, including identifiers, demographics, economy, and government structure.'
;

CREATE TABLE `country_language_json_nested` (
  `ID` int NOT NULL AUTO_INCREMENT COMMENT 'Primary key: integer that uniquely identifies country_language_json JSON documents.',
  `country_language_description` json NOT NULL COMMENT 'Country languages information. Including a three-letter country code referencing the country table, the name of the language spoken in the country, whether they are official and the share of the population using them.',
  PRIMARY KEY (`ID`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMMENT='Stores information about the languages spoken in each country, including whether they are official and the share of the population using them.'
;

CREATE TABLE `city_json_nested` (
  `ID` int NOT NULL AUTO_INCREMENT COMMENT 'Primary key: integer that uniquely identifies city_json JSON documents.',
  `city_description` json NOT NULL COMMENT 'Cities information. including their name, country, district, and population.',
  PRIMARY KEY (`ID`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMMENT='Stores information about cities, including their name, country, district, and population.'
;

Let’s run 3 queries on the data set:

MySQL>
-- Find the name, country code and the district of the city with ID 2989
SELECT 
    city_description->>"$.city_name" AS "City Name", 
    city_description->>"$.location.country_code" AS "Country Code", 
    city_description->>"$.location.district" AS District 
FROM city_json_nested 
WHERE city_description->>"$.ID_city" = 2989;
+-----------+--------------+-------------+
| city_name | country_code | district    |
+-----------+--------------+-------------+
| Grenoble  | FRA          | Rhône-Alpes |
+-----------+--------------+-------------+

-- Find the government form of Germany
SELECT 
    country_description->>"$.government.government_form" AS "Government Form" 
FROM country_json_nested 
WHERE country_description->>"$.identifiers.code" = 'FRA';
+-----------------+
| government_form |
+-----------------+
| Republic        |
+-----------------+

-- List all official languages spoken in Canada
SELECT 
    country_language_description->>"$.language_info.language" AS Language 
FROM country_language_json_nested 
WHERE country_language_description->>"$.country.country_code" = 'CAN' 
    AND country_language_description->>"$.language_info.is_official" = 'T';
+----------+
| language |
+----------+
| English  |
| French   |
+----------+

I’m employing the inline path operator (->>), which serves as a convenient shorthand for extracting and unquoting values: JSON_UNQUOTE(JSON_EXTRACT(...)).

The execution of these queries, while simple, still relies on existing SQL and JSON expertise. MySQL HeatWave GenAI can eliminates this dependency by enabling users to interact with the database via Natural Language to SQL. As highlighted in “Ask Your Database Anything: Natural Language to SQL in MySQL HeatWave” the new NL_SQL routine allows non-technical users to write requests in plain English, and have the system automatically generate the necessary SQL.

Introducing JSON

JSON (JavaScript Object Notation) is a lightweight, human-readable format for structuring and exchanging data. Built around key–value pairs and ordered arrays, JSON enables developers to model complex, hierarchical information without the rigidity of a fixed schema.
Its simplicity, flexibility, and near-universal support across programming languages have made it the standard for transmitting structured data between servers, web applications, and APIs in modern software systems.

JSON in the MySQL Context

In MySQL, the native JSON data type — introduced in version 5.7 — brings NoSQL-style flexibility into the realm of relational databases. It allows developers to store, validate, and optimize JSON documents directly within columns, combining the ACID compliance and transactional integrity of MySQL with the dynamic nature of semi-structured data.
Specialized JSON functions such as JSON_EXTRACT(), JSON_CONTAINS(), JSON_OBJECT() or JSON_TABLE() enable precise querying and manipulation, while indexing options like generated columns ensure high performance.
This hybrid approach lets MySQL handle everything from traditional records to complex API payloads within a unified, scalable system.

Some MySQL & JSON content that you may find useful:

Challenges of Using the JSON Data Type

NL2SQL in MySQL HeatWave GenAI is easy to use and highly effective when configured with the right context. To help the LLM generate accurate queries, it’s important to narrow its focus by specifying the relevant schemas and tables through the schemas and tables parameters. Additionally, using clear, descriptive names for tables, columns, and views ensures the model can better interpret their purpose and produce more precise SQL statements.

The core difficulty stems from JSON’s structure: the document simultaneously holds both data and metadata (keys), yet MySQL treats the entire document as undifferentiated raw data within a single column. From the database’s perspective, the column contains a blob of JSON, meaning there’s little or no external schema metadata — such as explicit column names, data types, or constraints — to help the LLM interpret the semantic meaning or internal structure of the fields. This inherent lack of external context makes intelligent processing more challenging.

Examples

Using nl_sql is quite simple:

CALL sys.NL_SQL(
  "NaturalLanguageStatement",
  @output,
  JSON_OBJECT(
    'schemas', JSON_ARRAY('DBName'),
    'model_id', 'ModelID'
  )
);

Replace the following:

  • NaturalLanguageStatement: natural-language statement. It can be a question, statement, or query pertaining to your data available in MySQL HeatWave.
  • DBName: database to consider for generating and running the SQL query.
  • ModelID: LLM to use.

You can find additional information here.

Let’s see an example:

MySQL>
SELECT model_id, provider 
FROM sys.ML_SUPPORTED_LLMS 
WHERE capabilities LIKE '["GENERATION"]';

+-------------------------------+---------------------------+
| model_id                      | provider                  |
+-------------------------------+---------------------------+
| llama2-7b-v1                  | HeatWave                  |
| llama3-8b-instruct-v1         | HeatWave                  |
| llama3.1-8b-instruct-v1       | HeatWave                  |
| llama3.2-1b-instruct-v1       | HeatWave                  |
| llama3.2-3b-instruct-v1       | HeatWave                  |
| mistral-7b-instruct-v1        | HeatWave                  |
| mistral-7b-instruct-v3        | HeatWave                  |
| cohere.command-latest         | OCI Generative AI Service |
| cohere.command-plus-latest    | OCI Generative AI Service |
| cohere.command-a-03-2025      | OCI Generative AI Service |
| meta.llama-3.3-70b-instruct   | OCI Generative AI Service |
| cohere.command-r-08-2024      | OCI Generative AI Service |
| cohere.command-r-plus-08-2024 | OCI Generative AI Service |
+-------------------------------+---------------------------+
MySQL> 
SHOW TABLES;
+------------------------------------+
| Tables_in_nl2sql_world_json_nested |
+------------------------------------+
| city_json_nested                   |
| country_json_nested                |
| country_language_json_nested       |
+------------------------------------+
MySQL nl2sql_world_json_nested SQL> SET @nlq = "Number of cities";
Query OK, 0 rows affected (0.0003 sec)

MySQL nl2sql_world_json_nested SQL> CALL sys.NL_SQL(@nlq, @output, '{"model_id": "meta.llama-3.3-70b-instruct", "schemas":["nl2sql_world_json_nested"]}');
+-----------------------------------------------------------------------+
| Executing generated SQL statement...                                  |
+-----------------------------------------------------------------------+
| SELECT COUNT(`ID`) FROM `nl2sql_world_json_nested`.`city_json_nested` |
+-----------------------------------------------------------------------+
1 row in set (1.6583 sec)

+-------------+
| COUNT(`ID`) |
+-------------+
|        4079 |
+-------------+
1 row in set (1.6583 sec)

That was straightforward!

Now, let’s raise the bar with a slightly more complex query:

MySQL nl2sql_world_json_nested SQL> SET @nlq = "Population of Monaco";
Query OK, 0 rows affected (0.0003 sec)

MySQL nl2sql_world_json_nested SQL> CALL sys.NL_SQL(@nlq, @output, '{"model_id": "meta.llama-3.3-70b-instruct", "schemas":["nl2sql_world_json_nested"]}');
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Executing generated SQL statement...                                                                                                                                   |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| SELECT JSON_EXTRACT(`city_description`, '$.population') FROM `nl2sql_world_json_nested`.`city_json_nested` WHERE JSON_EXTRACT(`city_description`, '$.name') = 'Monaco' |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (2.1982 sec)

Empty set (2.1982 sec)

Query OK, 0 rows affected (2.1982 sec)
MySQL nl2sql_world_json_nested SQL> SET @nlq = "List all official languages spoken in Canada";
Query OK, 0 rows affected (0.0004 sec)

MySQL nl2sql_world_json_nested SQL> CALL sys.NL_SQL(@nlq, @output, '{"model_id": "meta.llama-3.3-70b-instruct", "schemas":["nl2sql_world_json_nested"]}');
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Executing generated SQL statement...                                                                                                                                                                                                                                                                |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| SELECT JSON_EXTRACT(`country_language_description`, '$[*].language') FROM `nl2sql_world_json_nested`.`country_language_json_nested` WHERE JSON_EXTRACT(`country_language_description`, '$[*].country_code') = '"CAN"' AND JSON_EXTRACT(`country_language_description`, '$[*].is_official') = 'true' |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (2.3889 sec)

Empty set (2.3889 sec)

Query OK, 0 rows affected (2.3889 sec)

In both of these recent cases, the generated queries were syntactically correct, but the system failed to produce a semantically relevant query.

But all is not lost…

Workaround: Defining Views for Better Context

A highly effective workaround is to introduce well-defined views over the existing table. The main idea is to expose the JSON’s internal structure by effectively translating the semi-structured data into a fully relational schema. By using descriptive, explicit column names in these views, you furnish the NL2SQL model with the necessary semantic metadata. This strategy significantly improves accuracy for complex JOIN operations and accurate filtering by providing clear, accessible relational keys.

Let’s see what the views (nl2sql_world_json_nested_-_view_json_nested.sql) might look like:

CREATE OR REPLACE VIEW country_nested_flat AS
SELECT
    ID AS country_id,

    -- Identifiers
    JSON_UNQUOTE(JSON_EXTRACT(country_description, '$.identifiers.code')) AS country_code,
    JSON_UNQUOTE(JSON_EXTRACT(country_description, '$.identifiers.code2')) AS country_code2,
    JSON_UNQUOTE(JSON_EXTRACT(country_description, '$.identifiers.country_name')) AS country_name,
    JSON_UNQUOTE(JSON_EXTRACT(country_description, '$.identifiers.local_name')) AS local_name,

    -- Geography
    JSON_UNQUOTE(JSON_EXTRACT(country_description, '$.geography.continent')) AS continent,
    JSON_UNQUOTE(JSON_EXTRACT(country_description, '$.geography.region')) AS region,
    CAST(JSON_EXTRACT(country_description, '$.geography.surface_area') AS DECIMAL(12,2)) AS surface_area,
    CAST(JSON_EXTRACT(country_description, '$.geography.ID_capital') AS SIGNED) AS capital_id,

    -- Demographics
    CAST(JSON_EXTRACT(country_description, '$.demographics.country_population') AS SIGNED) AS population,
    CAST(JSON_EXTRACT(country_description, '$.demographics.life_expectancy') AS DECIMAL(5,2)) AS life_expectancy,
    CAST(JSON_EXTRACT(country_description, '$.demographics.independance_year') AS SIGNED) AS independance_year,

    -- Economy
    CAST(JSON_EXTRACT(country_description, '$.economy.GNP') AS DECIMAL(15,2)) AS gnp,
    CAST(JSON_EXTRACT(country_description, '$.economy.GNPOld') AS DECIMAL(15,2)) AS gnp_old,

    -- Government
    JSON_UNQUOTE(JSON_EXTRACT(country_description, '$.government.head_of_state')) AS head_of_state,
    JSON_UNQUOTE(JSON_EXTRACT(country_description, '$.government.government_form')) AS government_form

FROM country_json_nested;
CREATE OR REPLACE VIEW country_language_nested_flat AS
SELECT
    ID AS language_id,

    -- Country reference
    JSON_UNQUOTE(JSON_EXTRACT(country_language_description, '$.country.country_code')) AS country_code,

    -- Language details
    JSON_UNQUOTE(JSON_EXTRACT(country_language_description, '$.language_info.language')) AS language,
    JSON_UNQUOTE(JSON_EXTRACT(country_language_description, '$.language_info.is_official')) AS is_official,
    CAST(JSON_EXTRACT(country_language_description, '$.language_info.percentage') AS DECIMAL(5,2)) AS percentage

FROM country_language_json_nested;
CREATE OR REPLACE VIEW city_nested_flat AS
SELECT
    ID AS json_id,

    -- City identifiers
    CAST(JSON_EXTRACT(city_description, '$.ID_city') AS SIGNED) AS city_id,
    JSON_UNQUOTE(JSON_EXTRACT(city_description, '$.city_name')) AS city_name,

    -- Location details
    JSON_UNQUOTE(JSON_EXTRACT(city_description, '$.location.district')) AS district,
    JSON_UNQUOTE(JSON_EXTRACT(city_description, '$.location.country_code')) AS country_code,

    -- Demographics
    CAST(JSON_EXTRACT(city_description, '$.city_population') AS SIGNED) AS population

FROM city_json_nested;

Let’s now re-run our three previous queries, applying this new view-based approach:

MySQL nl2sql_world_json_nested SQL> SET @nlq = "Number of cities";
Query OK, 0 rows affected (0.0004 sec)

MySQL nl2sql_world_json_nested SQL> CALL sys.NL_SQL(@nlq, @output, '{"model_id": "meta.llama-3.3-70b-instruct", "schemas":["nl2sql_world_json_nested"]}');
+----------------------------------------------------------------------------+
| Executing generated SQL statement...                                       |
+----------------------------------------------------------------------------+
| SELECT COUNT(`city_id`) FROM `nl2sql_world_json_nested`.`city_nested_flat` |
+----------------------------------------------------------------------------+
1 row in set (1.8282 sec)

+------------------+
| COUNT(`city_id`) |
+------------------+
|             4079 |
+------------------+
1 row in set (1.8282 sec)

Query OK, 0 rows affected (1.8282 sec)
MySQL nl2sql_world_json_nested SQL> SET @nlq = "Population of Monaco";
Query OK, 0 rows affected (0.0004 sec)

MySQL nl2sql_world_json_nested SQL> CALL sys.NL_SQL(@nlq, @output, '{"model_id": "meta.llama-3.3-70b-instruct", "schemas":["nl2sql_world_json_nested"]}');
+-----------------------------------------------------------------------------------------------------------+
| Executing generated SQL statement...                                                                      |
+-----------------------------------------------------------------------------------------------------------+
| SELECT `population` FROM `nl2sql_world_json_nested`.`country_nested_flat` WHERE `country_name` = 'Monaco' |
+-----------------------------------------------------------------------------------------------------------+
1 row in set (2.0889 sec)

+------------+
| population |
+------------+
|      34000 |
+------------+
1 row in set (2.0889 sec)

Query OK, 0 rows affected (2.0889 sec)
MySQL nl2sql_world_json_nested SQL> SET @nlq = "List all official languages spoken in Canada";
Query OK, 0 rows affected (0.0004 sec)

MySQL nl2sql_world_json_nested SQL> CALL sys.NL_SQL(@nlq, @output, '{"model_id": "meta.llama-3.3-70b-instruct", "schemas":["nl2sql_world_json_nested"]}');
+---------------------------------------------------------------------------------------------------------------------------------------+
| Executing generated SQL statement...                                                                                                  |
+---------------------------------------------------------------------------------------------------------------------------------------+
| SELECT `language` FROM `nl2sql_world_json_nested`.`country_language_nested_flat` WHERE `country_code` = 'CAN' AND `is_official` = 'T' |
+---------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (2.0249 sec)

+----------+
| language |
+----------+
| English  |
| French   |
+----------+
2 rows in set (2.0249 sec)

Query OK, 0 rows affected (2.0249 sec)

Q.E.D.

Peroraison

Bridging natural language processing and semi-structured data presents both exciting opportunities and unique challenges. MySQL HeatWave GenAI’s NL2SQL capability demonstrates how natural language can simplify interaction with complex data systems, even when working with intricate JSON documents.
However, because JSON stores both data and metadata together, the absence of explicit schema information can limit how effectively an LLM interprets and formulates queries. Creating well-structured views that expose JSON’s internal organization offers a practical solution — transforming unstructured data into meaningful relational context.
Ultimately, this approach not only enhances NL2SQL’s accuracy but also showcases how MySQL HeatWave continues to evolve as a powerful engine for intelligent, natural language–driven analytics.

Resources

Follow me on Linkedin

Watch my videos on my YouTube channel and subscribe.

My Slideshare account.

My Speaker Deck account.

Thanks for using HeatWave & MySQL!

1

Let Your AI DBA Assistant Write Your MySQL Queries

October 7, 2025

Having explored the innovative MySQL HeatWave technology that converts Natural Language into SQL (Ask Your Database Anything: Natural Language to SQL in MySQL HeatWave), our next article in this series, dives into a practical use case demonstrating how an AI DBA Assistant can significantly simplify your query generation workflow.

In MySQL, there are 3 specialized system schemas designed to give DBAs and developers deeper visibility and control over the server. Together, they provide the tools needed to monitor performance, inspect metadata, and simplify management tasks:

The Performance Schema is a powerful instrumentation framework designed for low-level monitoring of server execution, enabling administrators and developers to gain deep insights into how the database is running. Unlike general status metrics, it collects highly detailed statistics about server events and resource usage directly from the server internals in real time, with the data stored in memory. It is particularly useful for for live diagnostics and performance tuning.
The primary purpose of the Performance Schema is to expose what is happening inside the MySQL server and, more importantly, why certain operations may be slow. This makes it an invaluable tool for tasks such as identifying poorly performing queries, diagnosing I/O wait bottlenecks, or analyzing mutex contention in multithreaded workloads. By surfacing these low-level insights, the Performance Schema empowers users to move beyond surface-level monitoring and perform precise root-cause analysis of performance issues.

The Information Schema is the SQL-standard-compliant interface for accessing metadata about the objects managed by the server. Acting as a central directory, it provides a structured view of databases, tables, columns, indexes, privileges, and overall server characteristics, making it the go-to source for understanding the logical organization of a MySQL instance. Its primary purpose is to expose database metadata—answering questions about what objects exist and how they are structured, rather than how they perform. While the metadata itself is stored on disk, MySQL presents it through in-memory tables that can be queried like regular tables.
Typical use cases include retrieving a list of all tables in a specific database, checking column data types, or examining indexes and privileges. By adhering to the SQL standard, the Information Schema ensures portability and consistency, allowing users to interact with MySQL metadata in a way that aligns with other relational database systems.

The Sys Schema is a set of user-friendly views, functions, and procedures that sits on top of the Performance Schema and Information Schema, transforming their often complex and technical data into a more readable, actionable format. Its main purpose is to simplify the process of interpreting server metadata and performance statistics, making it much easier for DBAs and developers to diagnose issues and optimize workloads without having to manually parse through raw instrumentation data.
By aggregating and presenting information from both underlying schemas, the Sys Schema provides clear insights into common administrative tasks, such as identifying the most time-consuming queries, monitoring active sessions, or detecting unused indexes. In essence, it acts as a usability layer, bridging the gap between MySQL’s powerful but intricate internal schemas and the practical needs of day-to-day database operations.

Together, these schemas form the foundation for effective MySQL performance tuning, troubleshooting, and administration.

However, fully leveraging the Performance Schema, Information Schema, and Sys Schema requires a solid command of SQL to query the data they expose. This is where AI can bridge the gap.
As explored in my article Ask Your Database Anything: Natural Language to SQL in MySQL HeatWave, MySQL HeatWave’s Natural Language to SQL (NL2SQL) capabilities make it possible to interact with your database by simply asking questions in plain English, without writing complex queries.

Let’s walk through a concrete example (a special thanks to my colleague Ivan for inspiring this idea):

The code is available on my GitHub account.

Peroraison

MySQL’s Performance, Information, and Sys Schemas give DBAs powerful tools to monitor, tune, and troubleshoot servers, while HeatWave NL2SQL makes those insights accessible in simple English.

Resources

Follow me on Linkedin

Watch my videos on my YouTube channel and subscribe.

My Slideshare account.

My Speaker Deck account.

Thanks for using HeatWave & MySQL!

3

Ask Your Database Anything: Natural Language to SQL in MySQL HeatWave

September 30, 2025

The ability to query data efficiently has always been central to unlocking insights, but writing SQL can be a barrier for many users who aren’t fluent in the language of databases. Analysts, product managers, and business users often know the questions they want to ask—just not how to express them in SQL. With the rise of large language models (LLMs) and advancements in database technology, that gap is closing quickly.

MySQL HeatWave now brings Natural Language to SQL (NL2SQL) capabilities directly into the database engine, allowing users to generate SQL queries from plain English statements. Instead of wrestling with complex joins, filters, or aggregate functions, users can simply type a natural-language request—such as “Show me the top 10 products by revenue this quarter”—and HeatWave automatically translates it into an SQL query.

In this article, we’ll explore how to leverage LLM-powered NL2SQL in MySQL HeatWave, walk through practical examples, and show how this feature empowers both technical and non-technical users to interact with data more intuitively.

Context

I’m using MySQL HeatWave 9.4.1 on Oracle Cloud Infrastructure (OCI):

SQL> 
-- HeatWave MySQL server version
SHOW VARIABLES WHERE Variable_name IN ('version_comment', 'version');
+-----------------+--------------------------+
| Variable_name   | Value                    |
+-----------------+--------------------------+
| version         | 9.4.1-cloud              |
| version_comment | MySQL Enterprise - Cloud |
+-----------------+--------------------------+

Furthermore, I’m using a modified version (available on my GitHub account) of the well-known World database. Although the data hasn’t been updated in some time, it still serves as a useful tool for visualizing the results of SQL queries.

Here the schema:

CREATE TABLE `country` (
  `code` char(3) NOT NULL DEFAULT '' COMMENT 'Primary key: three-letter country code (ISO standard).',
  `country_name` char(52) NOT NULL DEFAULT '' COMMENT 'Official name of the country.',
  `continent` enum('Asia','Europe','North America','Africa','Oceania','Antarctica','South America') NOT NULL DEFAULT 'Asia' COMMENT 'Continent where the country is located.',
  `region` char(26) NOT NULL DEFAULT '' COMMENT 'Geographical region within the continent.',
  `surface_area` decimal(10,2) NOT NULL DEFAULT '0.00' COMMENT 'Total surface area of the country in square kilometers.',
  `independance_year` smallint DEFAULT NULL COMMENT 'Year the country achieved independence (NULL if unknown).',
  `country_population` int NOT NULL DEFAULT '0' COMMENT 'Total population of the country.',
  `life_expectancy` decimal(3,1) DEFAULT NULL COMMENT 'Average life expectancy of the population in years.',
  `GNP` decimal(10,2) DEFAULT NULL COMMENT 'Gross National Product of the country in millions of USD.',
  `GNPOld` decimal(10,2) DEFAULT NULL COMMENT 'Gross National Product in an earlier year for comparison.',
  `local_name` char(45) NOT NULL DEFAULT '' COMMENT 'The country’s name in its local language.',
  `government_form` char(45) NOT NULL DEFAULT '' COMMENT 'Description of the form of government (e.g., Republic, Monarchy).',
  `head_of_state` char(60) DEFAULT NULL COMMENT 'Name of the current head of state (e.g., President, Monarch).',
  `ID_capital` int DEFAULT NULL COMMENT 'ID of the capital city (foreign key reference to city table).',
  `code2` char(2) NOT NULL DEFAULT '' COMMENT 'Two-letter country code (ISO standard).',
  PRIMARY KEY (`code`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMMENT='Stores detailed information about countries, including identifiers, demographics, economy, and government structure.'
;


CREATE TABLE `country_language` (
  `country_code` char(3) NOT NULL DEFAULT '' COMMENT 'Three-letter country code referencing the country table. First part of the composite primary key.',
  `language` char(30) NOT NULL DEFAULT '' COMMENT 'Name of the language spoken in the country. Second part of the composite primary key.',
  `is_official` enum('T','F') NOT NULL DEFAULT 'F' COMMENT 'Indicates whether the language is an official language of the country (T = true, F = false).',
  `percentage` decimal(4,1) NOT NULL DEFAULT '0.0' COMMENT 'Percentage of the country’s population that speaks this language.',
  PRIMARY KEY (`country_code`,`language`),
  KEY `country_code` (`country_code`),
  CONSTRAINT `country_language_ibfk_1` FOREIGN KEY (`country_code`) REFERENCES `country` (`code`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMMENT='Stores information about the languages spoken in each country, including whether they are official and the share of the population using them.'
;


CREATE TABLE `city` (
  `ID_city` int NOT NULL AUTO_INCREMENT COMMENT 'Primary key: integer that uniquely identifies each city.',
  `city_name` char(35) NOT NULL DEFAULT '' COMMENT 'Name of the city.',
  `country_code` char(3) NOT NULL DEFAULT '' COMMENT 'Three-letter country code referencing the country table.',
  `district` char(20) NOT NULL DEFAULT '' COMMENT 'District or administrative region where the city is located.',
  `city_population` int NOT NULL DEFAULT '0' COMMENT 'Population count of the city.',
  PRIMARY KEY (`ID_city`),
  KEY `country_code` (`country_code`),
  CONSTRAINT `city_ibfk_1` FOREIGN KEY (`country_code`) REFERENCES `country` (`code`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMMENT='Stores information about cities, including their name, country, district, and population.'
;

As you can see, the main difference from the original version is that I embedded the documentation directly into the tables using COMMENT clauses. In addition, some column names have been renamed. The goal is to capture as much relevant information as possible to provide business context. This contextual information will then be used to augment the LLM.

Let’s run 3 queries on the data set:

MySQL>
-- Find the name, country code and the district of the city with ID 2989

SELECT city_name, country_code, district 
FROM city 
WHERE ID_city = 2989;
+-----------+--------------+-------------+
| city_name | country_code | district    |
+-----------+--------------+-------------+
| Grenoble  | FRA          | Rhône-Alpes |
+-----------+--------------+-------------+
 
 
-- Find the government form of Germany

SELECT government_form
FROM country
WHERE code = 'FRA';
+-----------------+
| government_form |
+-----------------+
| Republic        |
+-----------------+


-- List all official languages spoken in Canada

SELECT language
FROM country_language
WHERE country_code = 'CAN' AND is_official = 'T';
+----------+
| language |
+----------+
| English  |
| French   |
+----------+

These queries are fairly simple, but they still require some SQL skills. What if you could write in plain English and have MySQL HeatWave understand you?
This is now possible with the NL_SQL routine, which generates and executes SQL queries directly from natural language statements.

Generate SQL Queries From Natural-Language Statements

Starting with MySQL HeatWave 9.4.1, MySQL HeatWave GenAI (on OCI, AWS, and Azure) allows you with the NL_SQL stored procedure, to generate SQL queries directly from natural language, making it easier to interact with your databases.

The feature gathers information about the schemas, tables, and columns you have access to, then leverages a Large Language Model to generate an appropriate SQL query based on your request. You can also execute the generated query and view the results instantly.

Using nl_sql is quite simple:

CALL sys.NL_SQL("NaturalLanguageStatement",@output, JSON_OBJECT('schemas',JSON_ARRAY('DBName'),'model_id','ModelID'));

Replace the following:

  • NaturalLanguageStatement: natural-language statement. It can be a question, statement, or query pertaining to your data available in MySQL HeatWave.
  • DBName: database to consider for generating and running the SQL query.
  • ModelID: LLM to use.

You can find additional information here.

Also as of MySQL HeatWave 9.3.2, you can view the list of available Large Language Models and embedding models directly from the database using the sys.ML_SUPPORTED_LLMS view. This enhancement, part of the MySQL HeatWave GenAI feature, allows you to stay current with the available models, including in-database models and those from external services like Oracle Cloud Infrastructure (OCI) Generative AI.

To display the list of supported content generation LLMs and their provider, you can use the following simple query:

MySQL> 
SELECT model_id, provider 
FROM sys.ML_SUPPORTED_LLMS 
WHERE capabilities LIKE '["GENERATION"]';
+-------------------------------+---------------------------+
| model_id                      | provider                  |
+-------------------------------+---------------------------+
| llama2-7b-v1                  | HeatWave                  |
| llama3-8b-instruct-v1         | HeatWave                  |
| llama3.1-8b-instruct-v1       | HeatWave                  |
| llama3.2-1b-instruct-v1       | HeatWave                  |
| llama3.2-3b-instruct-v1       | HeatWave                  |
| mistral-7b-instruct-v1        | HeatWave                  |
| mistral-7b-instruct-v3        | HeatWave                  |
| cohere.command-latest         | OCI Generative AI Service |
| cohere.command-plus-latest    | OCI Generative AI Service |
| cohere.command-a-03-2025      | OCI Generative AI Service |
| meta.llama-3.3-70b-instruct   | OCI Generative AI Service |
| cohere.command-r-08-2024      | OCI Generative AI Service |
| cohere.command-r-plus-08-2024 | OCI Generative AI Service |
+-------------------------------+---------------------------+

We can see here, the in-MySQL HeatWave models (provider HeatWave) as well as Oracle Cloud Infrastructure Generative AI service models (provider OCI Generative AI Service).

To view or use the OCI Generative AI Service models in this list, you need to enable the database system to access OCI services. For more information, see Authenticate OCI Generative AI Service.

When using the sys.NL_SQL routine in MySQL HeatWave, you have several options for the underlying (LLM) that translates your natural language queries into SQL.
The available models are part of the MySQL HeatWave GenAI feature. You can select from in-database models or, like we have seen above, if your environment is configured for it, models from the Oracle Cloud Infrastructure (OCI) Generative AI Service.

In MySQL HeatWave 9.4.1, some of the available models for natural language to SQL tasks include:

  • meta.llama-3.3-70b-instruct (OCI Generative AI Service)
  • llama3.1-8b-instruct-v1 (In-database HeatWave)
  • llama3.2-3b-instruct-v1 (In-database HeatWave)

The specific models available may vary by MySQL HeatWave version and the cloud service region you are using. To get the most current list, don’t forget that you can always query the sys.ML_SUPPORTED_LLMS view.

What Can I Ask My Database? Example Queries

Now, let’s see how it works in action. We’ll start with our earlier queries and then try out a few new ones.

Find the name, country code and the district of the city with ID 2989:

MySQL> SET @nlq = "Find the name, country code and the district of the city with ID 2989";
Query OK, 0 rows affected (0.0004 sec)

MySQL> CALL sys.NL_SQL(@nlq, @output, '{"model_id": "llama3.2-3b-instruct-v1", "schemas":["nl2sql_world"]}');
+--------------------------------------------------------------------------------------------------+
| Executing generated SQL statement...                                                             |
+--------------------------------------------------------------------------------------------------+
| SELECT `city_name`, `country_code`, `district` FROM `nl2sql_world`.`city` WHERE `ID_city` = 2989 |
+--------------------------------------------------------------------------------------------------+
1 row in set (3.7089 sec)

+-----------+--------------+-------------+
| city_name | country_code | district    |
+-----------+--------------+-------------+
| Grenoble  | FRA          | Rhône-Alpes |
+-----------+--------------+-------------+
1 row in set (3.7089 sec)

Find the government form of Germany:

MySQL> SET @nlq = "Find the government form of Germany";
Query OK, 0 rows affected (0.0003 sec)

MySQL> CALL sys.NL_SQL(@nlq, @output, '{"model_id": "llama3.1-8b-instruct-v1", "schemas":["nl2sql_world"]}');
+-----------------------------------------------------------------------------+
| Executing generated SQL statement...                                        |
+-----------------------------------------------------------------------------+
| SELECT `government_form` FROM `nl2sql_world`.`country` WHERE `code` = 'DEU' |
+-----------------------------------------------------------------------------+
1 row in set (4.0998 sec)

+------------------+
| government_form  |
+------------------+
| Federal Republic |
+------------------+
1 row in set (4.0998 sec)

List all official languages spoken in Canada:

MySQL> SET @nlq = "Find the top 5 most populated cities in the world";
Query OK, 0 rows affected (0.0002 sec)

MySQL> CALL sys.NL_SQL(@nlq, @output, '{"model_id": "meta.llama-3.3-70b-instruct", "schemas":["nl2sql_world"]}');
+----------------------------------------------------------------------------------------------------------+
| Executing generated SQL statement...                                                                     |
+----------------------------------------------------------------------------------------------------------+
| SELECT `city_name`, `city_population` FROM `nl2sql_world`.`city` ORDER BY `city_population` DESC LIMIT 5 |
+----------------------------------------------------------------------------------------------------------+
1 row in set (2.6276 sec)

+-----------------+-----------------+
| city_name       | city_population |
+-----------------+-----------------+
| Mumbai (Bombay) |        10500000 |
| Seoul           |         9981619 |
| São Paulo       |         9968485 |
| Shanghai        |         9696300 |
| Jakarta         |         9604900 |
+-----------------+-----------------+
5 rows in set (2.6276 sec)

Find the top 10 most multilingual countries (by number of languages spoken):

MySQL> SET @nlq = "Find the top 10 most multilingual countries (by number of languages spoken)";
Query OK, 0 rows affected (0.0004 sec)

MySQL> CALL sys.NL_SQL(@nlq, @output, '{"model_id": "meta.llama-3.3-70b-instruct", "schemas":["nl2sql_world"]}');
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Executing generated SQL statement...                                                                                                                                                                                                                        |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| SELECT `T1`.`country_name`, COUNT(`T2`.`language`) AS `num_languages` FROM `nl2sql_world`.`country` AS `T1` JOIN `nl2sql_world`.`country_language` AS `T2` ON `T1`.`code` = `T2`.`country_code` GROUP BY `T1`.`code` ORDER BY `num_languages` DESC LIMIT 10 |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (5.8831 sec)

+--------------------+---------------+
| country_name       | num_languages |
+--------------------+---------------+
| United States      |            12 |
| China              |            12 |
| India              |            12 |
| Canada             |            12 |
| Russian Federation |            12 |
| Tanzania           |            11 |
| South Africa       |            11 |
| Philippines        |            10 |
| Iran               |            10 |
| Kenya              |            10 |
+--------------------+---------------+
10 rows in set (5.8831 sec)

List all official languages spoken in Canada:

MySQL> SET @nlq = "List all official languages spoken in Canada";
Query OK, 0 rows affected (0.0003 sec)

MySQL> CALL sys.NL_SQL(@nlq, @output, '{"model_id": "llama3.2-3b-instruct-v1", "schemas":["nl2sql_world"]}');
+------------------------------------------------------------------------------------------------------------------+
| Executing generated SQL statement...                                                                             |
+------------------------------------------------------------------------------------------------------------------+
| SELECT `language` FROM `nl2sql_world`.`country_language` WHERE `is_official` = 'T' AND `country_code` IN ('CAN') |
+------------------------------------------------------------------------------------------------------------------+
1 row in set (3.0730 sec)

+----------+
| language |
+----------+
| English  |
| French   |
+----------+
2 rows in set (3.0730 sec)

Pretty impressive, isn’t it? 🙂

Click the picture to enlarge

A word of caution: Natural Language to SQL is not foolproof. The LLM can make mistakes and may generate an incorrect query. Additionally, while there are no guarantees, a larger number of model parameters (the b in 70b) typically leads to better results.

The next section will discuss key considerations for ensuring optimal performance and accuracy when using the sys.NL_SQL routine.

Getting the Best Results from Natural Language to SQL

Best Practices for Optimal Results

To get the most out of the NL_SQL feature, follow these best practices for generating accurate and efficient queries:

  • Provide Specific Context: Help the LLM by restricting its focus. Use the schemas and tables parameters to specify only the relevant schemas or tables for your query.
  • Use Descriptive Names: Ensure your tables, columns, and views have clear, semantically meaningful names. Using descriptive views is especially helpful for improving the accuracy of complex JOIN operations.
  • Be Specific with Values: To avoid errors, provide exact values in your natural language input, rather than vague descriptions.

Important Limitations and Considerations

While powerful, the NL_SQL feature does have a few limitations to keep in mind:

  • Query Accuracy: The generated SQL may not always be perfectly valid or optimal. Always review the output before executing it.
  • Performance: Generated queries can sometimes be complex, leading to unpredictable execution times and potentially large result sets that consume excessive resources.
  • Stateless Operations: The feature does not maintain the state of previous calls. Each invocation is independent and does not learn from prior interactions.
  • Metadata Volume: Accuracy can decrease if the database contains a large amount of metadata, as this can confuse the LLM.
  • Unsupported Features: The NL_SQL routine does not support temporary tables.

Peroraison

With the Natural Language to SQL feature in MySQL HeatWave, we’re witnessing a major shift in how we interact with databases. This technology empowers a broader range of users—from business analysts to product managers—to access and analyze data without needing deep SQL expertise. By simply using plain English, anyone can now translate a business question into an executable SQL query. This not only democratizes data access but also significantly accelerates the time from question to insight.

The sys.NL_SQL routine, backed by powerful LLMs, is a testament to the seamless integration of generative AI into core database services. While it’s not a silver bullet—and requires careful consideration of best practices like providing context and using descriptive names—it marks a fundamental step toward making data more accessible and intuitive.

By providing a bridge between human language and database queries, MySQL HeatWave is not just helping us write code; it’s enabling us to ask our databases anything, paving the way for a more natural and productive relationship with our data.

With the sys.NL_SQL routine, asking questions of your database has never been more natural.

Resources

Follow me on Linkedin

Watch my videos on my YouTube channel and subscribe.

My Slideshare account.

My Speaker Deck account.

Thanks for using HeatWave & MySQL!

4

Archive Smarter, Query Faster: Unlocking MySQL Performance with HeatWave

August 19, 2025

Managing large volumes of data is a challenge every organization faces as applications grow. For MySQL users, this often means deciding what to do with historical or less frequently accessed data: keep it in production tables at the cost of performance, or archive it and lose the ability to query it efficiently.
Traditionally, archiving has been a trade-off — helpful for keeping databases lean, but limiting when developers or analysts need to run queries across years of historical records.

HeatWave MySQL changes this equation!

By combining a high-performance in-memory query accelerator with a fully managed MySQL database service, HeatWave makes it possible to archive vast amounts of data while still enabling fast, interactive analytics. Instead of moving archived data into a separate system or relying on slow queries, developers and DBAs can continue to use familiar SQL while taking advantage of HeatWave’s speed and scale.

In this article, we’ll explore how to use HeatWave to archive MySQL data effectively and then run accelerated queries on that archived data — without compromising performance. Whether you’re managing billions of rows or just planning for future growth, this approach can help simplify your architecture while delivering the best of both worlds: efficient data storage and lightning-fast analytics.

Context

The database has been collecting and collecting daily time-stamped data for years.To optimize performance, we want to keep only the current year’s data plus the full year before it. All older data should be archived, yet still available for querying with very low latency.
Let’s see how HeatWave can help us archive efficiently while still running lightning-fast queries on historical data.


I’m using HeatWave 8.4.6 on Oracle Cloud Infrastructure (OCI):

SQL> 
-- HeatWave MySQL server version
SHOW VARIABLES WHERE Variable_name IN ('version_comment', 'version');
+-----------------+--------------------------+
| Variable_name   | Value                    |
+-----------------+--------------------------+
| version         | 8.4.6-cloud              |
| version_comment | MySQL Enterprise - Cloud |
+-----------------+--------------------------+

with a 1 node HeatWave Cluster enable.

HeatWave Cluster Enable
SQL> 
-- HeatWave cluster node number / HeatWave Cluster status 
SHOW STATUS WHERE Variable_name IN ('rapid_cluster_ready_number', 'rapid_cluster_status');
+----------------------------+-------+
| Variable_name              | Value |
+----------------------------+-------+
| rapid_cluster_ready_number | 1     |
| rapid_cluster_status       | ON    |
+----------------------------+-------+
HeatWave Cluster Information

We’ll illustrate this article’s examples using a dataset from Kaggle.

$ head -n 5 daily-minimum-temperatures-in-me.csv
Date,Daily minimum temperatures
1/1/1981,20.7
1/2/1981,17.9
1/3/1981,18.8
1/4/1981,14.6

After some transformation, the data are stored in a partitioned table called: min_temp

SQL > 
SHOW CREATE TABLE min_temp\G
*************************** 1. row ***************************
       Table: min_temp
Create Table: CREATE TABLE `min_temp` (
  `d` date DEFAULT NULL,
  `temp` decimal(3,1) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
/*!50100 PARTITION BY RANGE (year(`d`))
(PARTITION p1981 VALUES LESS THAN (1982) ENGINE = InnoDB,
 PARTITION p1982 VALUES LESS THAN (1983) ENGINE = InnoDB,
 PARTITION p1983 VALUES LESS THAN (1984) ENGINE = InnoDB,
 PARTITION p1984 VALUES LESS THAN (1985) ENGINE = InnoDB,
 PARTITION p1985 VALUES LESS THAN (1986) ENGINE = InnoDB,
 PARTITION p1986 VALUES LESS THAN (1987) ENGINE = InnoDB,
 PARTITION p1987 VALUES LESS THAN (1988) ENGINE = InnoDB,
 PARTITION p1988 VALUES LESS THAN (1989) ENGINE = InnoDB,
 PARTITION p1989 VALUES LESS THAN (1990) ENGINE = InnoDB,
 PARTITION p1990 VALUES LESS THAN (1991) ENGINE = InnoDB) */

It contains time series data with daily temperature measurements:

 SQL > 
SELECT * FROM min_temp LIMIT 3;
+------------+------+
| d          | temp |
+------------+------+
| 1981-01-01 | 20.7 |
| 1981-01-02 | 17.9 |
| 1981-01-03 | 18.8 |
+------------+------+


SELECT YEAR(d) AS temp_year FROM min_temp GROUP BY temp_year;
+-----------+
| temp_year |
+-----------+
|      1981 |
|      1982 |
|      1983 |
|      1984 |
|      1985 |
|      1986 |
|      1987 |
|      1988 |
|      1989 |
|      1990 |
+-----------+
10 rows in set (0.0014 sec)

Use the INFORMATION_SCHEMA.PARTITIONS table to inspect partition information:

SELECT PARTITION_NAME, PARTITION_ORDINAL_POSITION, PARTITION_DESCRIPTION, TABLE_ROWS 
FROM INFORMATION_SCHEMA.PARTITIONS 
WHERE TABLE_NAME = 'min_temp';
+----------------+----------------------------+-----------------------+------------+
| PARTITION_NAME | PARTITION_ORDINAL_POSITION | PARTITION_DESCRIPTION | TABLE_ROWS |
+----------------+----------------------------+-----------------------+------------+
| p1981          |                          1 | 1982                  |        365 |
| p1982          |                          2 | 1983                  |        365 |
| p1983          |                          3 | 1984                  |        365 |
| p1984          |                          4 | 1985                  |        365 |
| p1985          |                          5 | 1986                  |        365 |
| p1986          |                          6 | 1987                  |        365 |
| p1987          |                          7 | 1988                  |        365 |
| p1988          |                          8 | 1989                  |        365 |
| p1989          |                          9 | 1990                  |        365 |
| p1990          |                         10 | 1991                  |        365 |
+----------------+----------------------------+-----------------------+------------+

Exchanging Partitions and Subpartitions with Tables

In our case, a practical approach to archiving is to convert the partitions containing old data into regular tables. With {HeatWave} MySQL, this process is straightforward.
Let’s take a look at how it works.

In MySQL, it is possible to exchange a table partition or subpartition with a table using ALTER TABLE. There are some requirements, so you may want to read the documentation, but in short we first need to create a similar table than the one that contains the partitions. But this table must not be partitioned. Then we’ll be able to transfer the data from the partition to this new table.

In MySQL, you can use the ALTER TABLE command to exchange a partition or subpartition with a regular table. There are a few requirements to meet — so it’s worth reviewing the documentation — but in short, the process starts by creating a table with the same structure as the partitioned one, except without partitions. Once that table is ready, you can transfer the data from the partition into it.

To create the table (min_temp_1981) that will store the data from partition p1981, you can do, a CREATE… SELECT:

SQL > 
CREATE TABLE min_temp_1981 SELECT * FROM min_temp WHERE NULL;
Query OK, 0 rows affected (0.0126 sec)

Records: 0  Duplicates: 0  Warnings: 0


-- No data was copied
SELECT * FROM min_temp_1981 LIMIT 3;
Empty set (0.0005 sec)


SHOW CREATE TABLE min_temp_1981\G
*************************** 1. row ***************************
       Table: min_temp_1981
Create Table: CREATE TABLE `min_temp_1981` (
  `d` date DEFAULT NULL,
  `temp` decimal(3,1) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci

Then transfer the data, from partition p1981 to table min_temp_1981, using ALTER TABLE… EXCHANGE PARTITION… WITH TABLE:

SQL > 
ALTER TABLE min_temp EXCHANGE PARTITION p1981 WITH TABLE min_temp_1981;
Query OK, 0 rows affected (0.0181 sec)


SELECT * FROM min_temp_1981 LIMIT 3;
+------------+------+
| d          | temp |
+------------+------+
| 1981-01-01 | 20.7 |
| 1981-01-02 | 17.9 |
| 1981-01-03 | 18.8 |
+------------+------+


SELECT YEAR(d) AS temp_year FROM min_temp_1981 GROUP BY temp_year;
+-----------+
| temp_year |
+-----------+
|      1981 |
+-----------+

All the data from partition p1981 is now in the table min_temp_1981.

Like expected, in the partitioned table, the “1981 data” is not there anymore:

SQL > 
SELECT * FROM min_temp LIMIT 3;
+------------+------+
| d          | temp |
+------------+------+
| 1982-01-01 | 17.0 |
| 1982-01-02 | 15.0 |
| 1982-01-03 | 13.5 |
+------------+------+


SELECT YEAR(d) AS temp_year FROM min_temp GROUP BY temp_year ;
+-----------+
| temp_year |
+-----------+
|      1982 |
|      1983 |
|      1984 |
|      1985 |
|      1986 |
|      1987 |
|      1988 |
|      1989 |
|      1990 |
+-----------+
9 rows in set (0.0014 sec)

But the partition is still there, and it is now empty (TABLE_ROWS=0):

SQL> 
SELECT PARTITION_NAME, PARTITION_ORDINAL_POSITION, PARTITION_DESCRIPTION, TABLE_ROWS 
FROM INFORMATION_SCHEMA.PARTITIONS 
WHERE TABLE_NAME = 'min_temp';
+----------------+----------------------------+-----------------------+------------+
| PARTITION_NAME | PARTITION_ORDINAL_POSITION | PARTITION_DESCRIPTION | TABLE_ROWS |
+----------------+----------------------------+-----------------------+------------+
| p1981          |                          1 | 1982                  |          0 |
| p1982          |                          2 | 1983                  |        365 |
| p1983          |                          3 | 1984                  |        365 |
| p1984          |                          4 | 1985                  |        365 |
| p1985          |                          5 | 1986                  |        365 |
| p1986          |                          6 | 1987                  |        365 |
| p1987          |                          7 | 1988                  |        365 |
| p1988          |                          8 | 1989                  |        365 |
| p1989          |                          9 | 1990                  |        365 |
| p1990          |                         10 | 1991                  |        365 |
+----------------+----------------------------+-----------------------+------------+

Load data into the HeatWave Cluster

In order to have lightning-fast queries, the archived data must be loaded into the HeatWave cluster.

HeatWave cluster provides a distributed, scalable, shared-nothing, in-memory, hybrid columnar, query processing engine. It can accelerate analytic queries, query external data stored in object storage, and perform machine learning.

To successfully load a table into a HeatWave cluster, you must first ensure it has a primary key. If the table you are trying to archive lacks a primary key, the service will return an Unable to load table without primary key error:

SQL > 
-- Add a primary key
ALTER TABLE min_temp_1981 
    ADD COLUMN id_min_temp_1981 INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY FIRST;
Query OK, 0 rows affected (0.0698 sec)

Records: 0  Duplicates: 0  Warnings: 0


SHOW CREATE TABLE min_temp_1981\G
*************************** 1. row ***************************
       Table: min_temp_1981
Create Table: CREATE TABLE `min_temp_1981` (
  `id_min_temp_1981` int unsigned NOT NULL AUTO_INCREMENT,
  `d` date DEFAULT NULL,
  `temp` decimal(3,1) DEFAULT NULL,
  PRIMARY KEY (`id_min_temp_1981`)
) ENGINE=InnoDB AUTO_INCREMENT=366 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci


SELECT * FROM min_temp_1981 LIMIT 3;
+------------------+------------+------+
| id_min_temp_1981 | d          | temp |
+------------------+------------+------+
|                1 | 1981-01-01 | 20.7 |
|                2 | 1981-01-02 | 17.9 |
|                3 | 1981-01-03 | 18.8 |
+------------------+------------+------+

Now we can load the data into the HeatWave Cluster, using sys.heatwave_load:

SQL> 
CALL sys.heatwave_load(JSON_ARRAY("temp_archiving"), JSON_OBJECT('include_list', JSON_ARRAY('temp_archiving.min_temp_1981')));


+------------------------------------------+
| INITIALIZING HEATWAVE AUTO PARALLEL LOAD |
+------------------------------------------+
| Version: 3.21                            |
|                                          |
| Load Mode: normal                        |
| Load Policy: disable_unsupported_columns |
| Output Mode: normal                      |
|                                          |
+------------------------------------------+
6 rows in set (0.0175 sec)

+------------------------------------------------------------------------+
| OFFLOAD ANALYSIS                                                       |
+------------------------------------------------------------------------+
| Verifying input schemas: 1                                             |
| User excluded items: 0                                                 |
|                                                                        |
| SCHEMA                       OFFLOADABLE    OFFLOADABLE     SUMMARY OF |
| NAME                              TABLES        COLUMNS     ISSUES     |
| ------                       -----------    -----------     ---------- |
| `temp_archiving`                       1              3                |
|                                                                        |
| Total offloadable schemas: 1                                           |
|                                                                        |
+------------------------------------------------------------------------+
10 rows in set (0.0175 sec)

+-----------------------------------------------------------------------------------------------------------------------------+
| CAPACITY ESTIMATION                                                                                                         |
+-----------------------------------------------------------------------------------------------------------------------------+
| Default encoding for string columns: VARLEN (unless specified in the schema)                                                |
| Estimating memory footprint for 1 schema(s)                                                                                 |
|                                                                                                                             |
|                                TOTAL       ESTIMATED       ESTIMATED       TOTAL     DICTIONARY      VARLEN       ESTIMATED |
| SCHEMA                   OFFLOADABLE   HEATWAVE NODE      MYSQL NODE      STRING        ENCODED     ENCODED            LOAD |
| NAME                          TABLES       FOOTPRINT       FOOTPRINT     COLUMNS        COLUMNS     COLUMNS            TIME |
| ------                   -----------       ---------       ---------     -------     ----------     -------       --------- |
| `temp_archiving`                   1        3.27 MiB      256.00 KiB           0              0           0          1.00 s |
|                                                                                                                             |
| Sufficient MySQL host memory available to load all tables.                                                                  |
| Sufficient HeatWave cluster memory available to load all tables.                                                            |
|                                                                                                                             |
+-----------------------------------------------------------------------------------------------------------------------------+
12 rows in set (0.0175 sec)

+---------------------------------------------------------------------------------------------------------------------------------------+
| EXECUTING LOAD SCRIPT                                                                                                                 |
+---------------------------------------------------------------------------------------------------------------------------------------+
| HeatWave Load script generated                                                                                                        |
|   Retrieve load script containing 3 generated DDL command(s) using the query below:                                                   |
| Deprecation Notice: "heatwave_load_report" will be deprecated, please switch to "heatwave_autopilot_report"                           |
|   SELECT log->>"$.sql" AS "Load Script" FROM sys.heatwave_autopilot_report WHERE type = "sql" ORDER BY id;                            |
|                                                                                                                                       |
| Adjusting load parallelism dynamically per internal/external table.                                                                   |
| Using current parallelism of 4 thread(s) as maximum for internal tables.                                                              |
|                                                                                                                                       |
| Warning: Executing the generated script may alter column definitions and secondary engine flags in the schema                         |
|                                                                                                                                       |
| Using SQL_MODE: ONLY_FULL_GROUP_BY,STRICT_TRANS_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,NO_ENGINE_SUBSTITUTION |
|                                                                                                                                       |
| Proceeding to load 1 table(s) into HeatWave.                                                                                          |
|                                                                                                                                       |
| Applying changes will take approximately 113.00 ms                                                                                    |
|                                                                                                                                       |
+---------------------------------------------------------------------------------------------------------------------------------------+
16 rows in set (0.0175 sec)

+--------------------------------------------------+
| TABLE LOAD                                       |
+--------------------------------------------------+
| TABLE (1 of 1): `temp_archiving`.`min_temp_1981` |
| Commands executed successfully: 3 of 3           |
| Warnings encountered: 0                          |
| Table load succeeded!                            |
|   Total columns loaded: 3                        |
|   Table loaded using 1 thread(s)                 |
|   Elapsed time: 22.70 s                          |
|                                                  |
+--------------------------------------------------+
8 rows in set (0.0175 sec)

+-------------------------------------------------------------------------------+
| LOAD SUMMARY                                                                  |
+-------------------------------------------------------------------------------+
|                                                                               |
| SCHEMA                          TABLES       TABLES      COLUMNS         LOAD |
| NAME                            LOADED       FAILED       LOADED     DURATION |
| ------                          ------       ------      -------     -------- |
| `temp_archiving`                     1            0            3      22.70 s |
|                                                                               |
+-------------------------------------------------------------------------------+
6 rows in set (0.0175 sec)

Query OK, 0 rows affected (0.0175 sec)

You can monitor the process using the rpd_tables & rpd_table_id tables:

SQL> 
SELECT name, load_progress, load_status, query_count 
FROM performance_schema.rpd_tables JOIN performance_schema.rpd_table_id USING(id) 
WHERE name LIKE 'temp_archiving%' 
ORDER BY name\G
*************************** 1. row ***************************
         name: temp_archiving.min_temp_1981
load_progress: 100
  load_status: AVAIL_RPDGSTABSTATE
  query_count: 0


DESC performance_schema.rpd_table_id;
+-------------+--------------+------+-----+---------+-------+
| Field       | Type         | Null | Key | Default | Extra |
+-------------+--------------+------+-----+---------+-------+
| ID          | int unsigned | NO   |     | NULL    |       |
| NAME        | varchar(129) | YES  |     | NULL    |       |
| SCHEMA_NAME | varchar(64)  | YES  |     | NULL    |       |
| TABLE_NAME  | varchar(64)  | YES  |     | NULL    |       |
+-------------+--------------+------+-----+---------+-------+


DESC performance_schema.rpd_tables;
+-----------------------------+-------------------------------+------+-----+---------+-------+
| Field                       | Type                          | Null | Key | Default | Extra |
+-----------------------------+-------------------------------+------+-----+---------+-------+
| ID                          | int unsigned                  | NO   |     | NULL    |       |
| SNAPSHOT_SCN                | bigint unsigned               | YES  |     | NULL    |       |
| PERSISTED_SCN               | bigint unsigned               | YES  |     | NULL    |       |
| POOL_TYPE                   | varchar(64)                   | YES  |     | NULL    |       |
| DATA_PLACEMENT_TYPE         | varchar(64)                   | YES  |     | NULL    |       |
| NROWS                       | bigint unsigned               | NO   |     | NULL    |       |
| LOAD_STATUS                 | varchar(64)                   | YES  |     | NULL    |       |
| LOAD_PROGRESS               | double                        | YES  |     | NULL    |       |
| SIZE_BYTES                  | bigint unsigned               | NO   |     | NULL    |       |
| TRANSFORMATION_BYTES        | bigint unsigned               | YES  |     | NULL    |       |
| QUERY_COUNT                 | bigint unsigned               | NO   |     | NULL    |       |
| LAST_QUERIED                | timestamp(6)                  | YES  |     | NULL    |       |
| LOAD_START_TIMESTAMP        | timestamp(6)                  | YES  |     | NULL    |       |
| LOAD_END_TIMESTAMP          | timestamp(6)                  | YES  |     | NULL    |       |
| RECOVERY_SOURCE             | enum('MySQL','ObjectStorage') | YES  |     | NULL    |       |
| RECOVERY_START_TIMESTAMP    | timestamp(6)                  | YES  |     | NULL    |       |
| RECOVERY_END_TIMESTAMP      | timestamp(6)                  | YES  |     | NULL    |       |
| LOAD_TYPE                   | enum('SELF','USER')           | YES  |     | NULL    |       |
| LOGICAL_PARTS_LOADED_AT_SCN | longtext                      | YES  |     | NULL    |       |
+-----------------------------+-------------------------------+------+-----+---------+-------+

Now you can run your {analytics} SELECT queries like usually. 🙂

More details about HeatWave MySQL Performance.

Simplify the process with stored procedure

The workflow can be easily simplified using stored procedures (since we’re using HeatWave MySQL 8.4, we can’t take advantage of JavaScript stored procedures, which are only available starting from MySQL 9.2.).

Let’s review the 4 steps:

1. Clone table (Copy structure without partitions,… or data)
CALL clone_table(<schema_name>, <partitioned_table>, <archived_table>);

2. Exchanging Partitions
CALL exchange_partition(<schema_name>, <partitioned_table>, <partition_name>, <archived_table>);

3. Add a primary key
CALL add_primary_key_column(<schema_name>, <archived_table>, <primary_key>);

4. Load the data into the HeatWave Cluster
CALL sys.heatwave_load(JSON_ARRAY(<schema_name>), JSON_OBJECT(‘include_list’, JSON_ARRAY(<schema_name>.<archived_table>)));

While a robust production solution would be more complex, we can illustrate the core concepts with a basic implementation (these examples should only be used for testing and learning) :

Clone table

DELIMITER $$

CREATE PROCEDURE `clone_table`(
    IN _source_schema VARCHAR(64),
    IN _source_table VARCHAR(64),
    IN _destination_table VARCHAR(64)
) 
BEGIN
    -- Build the ALTER TABLE statement in order to create the table with the data
    SET @create_table_stmt = CONCAT(
            ' CREATE TABLE `',_source_schema, '`.`', _destination_table, 
            '` SELECT * FROM `',_source_schema, '`.`',_source_table, '`', 
            ' WHERE NULL '
        );

    -- Execute
    PREPARE stmt FROM @create_table_stmt;
    EXECUTE stmt;
    DEALLOCATE PREPARE stmt;
END$$

DELIMITER ;

Usage:

CALL clone_table('temp_archiving', 'min_temp', 'min_temp_1982');

Exchange Partition

DELIMITER $$

CREATE PROCEDURE exchange_partition(
    IN _source_schema VARCHAR(64),
    IN _partitioned_table VARCHAR(64),
    IN _partition_to_exchange VARCHAR(64),
    IN _new_table_name VARCHAR(64)
)
BEGIN
    -- Build the ALTER TABLE statement in order to exchange the partition with the newly created table
    SET @exchange_stmt = CONCAT(
        ' ALTER TABLE `',_source_schema, '`.`', _partitioned_table, '` ',
        ' EXCHANGE PARTITION `', _partition_to_exchange, '` ',
        ' WITH TABLE `',_source_schema, '`.`', _new_table_name, '`'
    );

    -- Execute
    PREPARE stmt FROM @exchange_stmt;
    EXECUTE stmt;
    DEALLOCATE PREPARE stmt;
END$$

DELIMITER ;

Usage:

CALL exchange_partition('temp_archiving', 'min_temp', 'p1982', 'min_temp_1982');

Add a Primary Key

DELIMITER $$

CREATE PROCEDURE add_primary_key_column(
    IN _source_schema VARCHAR(64),
    IN _table_name VARCHAR(64),
    IN _column_with_PK VARCHAR(64)
)
BEGIN
    -- Build the ALTER TABLE statement in order to add the PK
    SET @alter_table_stmt = CONCAT(
        ' ALTER TABLE `',_source_schema, '`.`', _table_name, '` ',
        ' ADD COLUMN `', _column_with_PK, '` INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY FIRST'
    );

    -- Execute
    PREPARE stmt FROM @alter_table_stmt;
    EXECUTE stmt;
    DEALLOCATE PREPARE stmt;
END$$

DELIMITER ;

Usage:

CALL add_primary_key_column('temp_archiving', 'min_temp_1982', 'id_min_temp_1982');

Finally, we can create a wrapper to perform all 4 steps at once:

Data Archiving Wrapper

DELIMITER $$

CREATE PROCEDURE `archiving_wrapper`(
    IN _source_schema VARCHAR(64),
    IN _source_table VARCHAR(64),
    IN _destination_table VARCHAR(64),
    IN _partition_to_exchange VARCHAR(64),
    IN _column_with_PK VARCHAR(64)
) 
BEGIN
    -- Copy structure without indexes, partitions,... and data
    CALL clone_table(_source_schema, _source_table, _destination_table);

    -- Exchange partition
    CALL exchange_partition(_source_schema, _source_table, _partition_to_exchange, _destination_table);

    -- Add a primary key 
    CALL add_primary_key_column(_source_schema, _destination_table, _column_with_PK);

    -- Load data into HeatWave Cluster 
    CALL sys.heatwave_load(
        JSON_ARRAY(_source_schema),
        JSON_OBJECT(
            'include_list',
            JSON_ARRAY(CONCAT(_source_schema, '.', _destination_table))
        )
    );             

END$$

DELIMITER ;

Usage:

SQL>
CALL archiving_wrapper('temp_archiving', 'min_temp', 'min_temp_1982', 'p1982', 'id_min_temp_1982');


+------------------------------------------+
| INITIALIZING HEATWAVE AUTO PARALLEL LOAD |
+------------------------------------------+
| Version: 3.21                            |
|                                          |
| Load Mode: normal                        |
| Load Policy: disable_unsupported_columns |
| Output Mode: normal                      |
|                                          |
+------------------------------------------+
6 rows in set (0.1021 sec)

+------------------------------------------------------------------------+
| OFFLOAD ANALYSIS                                                       |
+------------------------------------------------------------------------+
| Verifying input schemas: 1                                             |
| User excluded items: 0                                                 |
|                                                                        |
| SCHEMA                       OFFLOADABLE    OFFLOADABLE     SUMMARY OF |
| NAME                              TABLES        COLUMNS     ISSUES     |
| ------                       -----------    -----------     ---------- |
| `temp_archiving`                       1              3                |
|                                                                        |
| Total offloadable schemas: 1                                           |
|                                                                        |
+------------------------------------------------------------------------+
10 rows in set (0.1021 sec)

+-----------------------------------------------------------------------------------------------------------------------------+
| CAPACITY ESTIMATION                                                                                                         |
+-----------------------------------------------------------------------------------------------------------------------------+
| Default encoding for string columns: VARLEN (unless specified in the schema)                                                |
| Estimating memory footprint for 1 schema(s)                                                                                 |
|                                                                                                                             |
|                                TOTAL       ESTIMATED       ESTIMATED       TOTAL     DICTIONARY      VARLEN       ESTIMATED |
| SCHEMA                   OFFLOADABLE   HEATWAVE NODE      MYSQL NODE      STRING        ENCODED     ENCODED            LOAD |
| NAME                          TABLES       FOOTPRINT       FOOTPRINT     COLUMNS        COLUMNS     COLUMNS            TIME |
| ------                   -----------       ---------       ---------     -------     ----------     -------       --------- |
| `temp_archiving`                   1        3.27 MiB      256.00 KiB           0              0           0          1.00 s |
|                                                                                                                             |
| Sufficient MySQL host memory available to load all tables.                                                                  |
| Sufficient HeatWave cluster memory available to load all tables.                                                            |
|                                                                                                                             |
+-----------------------------------------------------------------------------------------------------------------------------+
12 rows in set (0.1021 sec)

+---------------------------------------------------------------------------------------------------------------------------------------+
| EXECUTING LOAD SCRIPT                                                                                                                 |
+---------------------------------------------------------------------------------------------------------------------------------------+
| HeatWave Load script generated                                                                                                        |
|   Retrieve load script containing 3 generated DDL command(s) using the query below:                                                   |
| Deprecation Notice: "heatwave_load_report" will be deprecated, please switch to "heatwave_autopilot_report"                           |
|   SELECT log->>"$.sql" AS "Load Script" FROM sys.heatwave_autopilot_report WHERE type = "sql" ORDER BY id;                            |
|                                                                                                                                       |
| Adjusting load parallelism dynamically per internal/external table.                                                                   |
| Using current parallelism of 4 thread(s) as maximum for internal tables.                                                              |
|                                                                                                                                       |
| Warning: Executing the generated script may alter column definitions and secondary engine flags in the schema                         |
|                                                                                                                                       |
| Using SQL_MODE: ONLY_FULL_GROUP_BY,STRICT_TRANS_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,NO_ENGINE_SUBSTITUTION |
|                                                                                                                                       |
| Proceeding to load 1 table(s) into HeatWave.                                                                                          |
|                                                                                                                                       |
| Applying changes will take approximately 113.00 ms                                                                                    |
|                                                                                                                                       |
+---------------------------------------------------------------------------------------------------------------------------------------+
16 rows in set (0.1021 sec)

+--------------------------------------------------+
| TABLE LOAD                                       |
+--------------------------------------------------+
| TABLE (1 of 1): `temp_archiving`.`min_temp_1982` |
| Commands executed successfully: 3 of 3           |
| Warnings encountered: 0                          |
| Table load succeeded!                            |
|   Total columns loaded: 3                        |
|   Table loaded using 1 thread(s)                 |
|   Elapsed time: 2.66 s                           |
|                                                  |
+--------------------------------------------------+
8 rows in set (0.1021 sec)

+-------------------------------------------------------------------------------+
| LOAD SUMMARY                                                                  |
+-------------------------------------------------------------------------------+
|                                                                               |
| SCHEMA                          TABLES       TABLES      COLUMNS         LOAD |
| NAME                            LOADED       FAILED       LOADED     DURATION |
| ------                          ------       ------      -------     -------- |
| `temp_archiving`                     1            0            3       2.66 s |
|                                                                               |
+-------------------------------------------------------------------------------+
6 rows in set (0.1021 sec)

Query OK, 0 rows affected (0.1021 sec)

Let’s check:

SQL> 
SELECT * FROM min_temp_1982 LIMIT 3;
+------------------+------------+------+
| id_min_temp_1982 | d          | temp |
+------------------+------------+------+
|                1 | 1982-01-01 | 17.0 |
|                2 | 1982-01-02 | 15.0 |
|                3 | 1982-01-03 | 13.5 |
+------------------+------------+------+


SELECT YEAR(d) AS temp_year FROM min_temp_1982 GROUP BY temp_year;
+-----------+
| temp_year |
+-----------+
|      1982 |
+-----------+

The year 1982 has been correctly archived.

SQL> 
SELECT * FROM min_temp LIMIT 3;
+------------+------+
| d          | temp |
+------------+------+
| 1983-01-01 | 18.4 |
| 1983-01-02 | 15.0 |
| 1983-01-03 | 10.9 |
+------------+------+


SELECT YEAR(d) AS temp_year FROM min_temp GROUP BY temp_year ;
+-----------+
| temp_year |
+-----------+
|      1983 |
|      1984 |
|      1985 |
|      1986 |
|      1987 |
|      1988 |
|      1989 |
|      1990 |
+-----------+
8 rows in set (0.0013 sec)


SELECT PARTITION_NAME, PARTITION_ORDINAL_POSITION, PARTITION_DESCRIPTION, TABLE_ROWS 
FROM INFORMATION_SCHEMA.PARTITIONS 
WHERE TABLE_NAME = 'min_temp';
+----------------+----------------------------+-----------------------+------------+
| PARTITION_NAME | PARTITION_ORDINAL_POSITION | PARTITION_DESCRIPTION | TABLE_ROWS |
+----------------+----------------------------+-----------------------+------------+
| p1981          |                          1 | 1982                  |          0 |
| p1982          |                          2 | 1983                  |          0 |
| p1983          |                          3 | 1984                  |        365 |
| p1984          |                          4 | 1985                  |        365 |
| p1985          |                          5 | 1986                  |        365 |
| p1986          |                          6 | 1987                  |        365 |
| p1987          |                          7 | 1988                  |        365 |
| p1988          |                          8 | 1989                  |        365 |
| p1989          |                          9 | 1990                  |        365 |
| p1990          |                         10 | 1991                  |        365 |
+----------------+----------------------------+-----------------------+------------+

Data was moved out of the partitioned table.

SQL> 
SELECT name, load_progress, load_status, query_count 
FROM performance_schema.rpd_tables JOIN performance_schema.rpd_table_id USING(id) 
WHERE name LIKE 'temp_archiving%' 
ORDER BY name;
+------------------------------+---------------+---------------------+-------------+
| name                         | load_progress | load_status         | query_count |
+------------------------------+---------------+---------------------+-------------+
| temp_archiving.min_temp_1981 |           100 | AVAIL_RPDGSTABSTATE |           0 |
| temp_archiving.min_temp_1982 |           100 | AVAIL_RPDGSTABSTATE |           0 |
+------------------------------+---------------+---------------------+-------------+

Data has been offloaded to the HeatWave Cluster.

In our case, this workload would only run once a year after the catch-up phase. But in practice, you could schedule this kind of workflow to run more regularly — whether with cron (or the event scheduler if you are a player 🙂 , your favorite open source data orchestration tool like Airflow, Prefect, Dagster,… or a cloud native managed service like OCI Data Integration.

Peroraison

Archiving data no longer has to mean sacrificing accessibility or performance. With HeatWave MySQL, DBAs and developers can seamlessly move older partitions into regular tables, load them into the HeatWave cluster, and continue running queries at scale — all with the familiar MySQL syntax they already know.

By combining efficient data archiving with in-memory acceleration, HeatWave allows organizations to strike the balance between keeping production databases lean and still being able to analyze years of historical data instantly. What was once a trade-off between storage efficiency and query performance is now a streamlined workflow that can be automated and adapted to your needs.

With HeatWave, your archived data is no longer just stored; it’s ready to deliver actionable insights at the speed of thought.

Stay tuned for more insights!

Follow me on Linkedin

Watch my videos on my YouTube channel and subscribe.

My Slideshare account.

My Speaker Deck account.

Thanks for using HeatWave & MySQL!

0

HeatWave MySQL Database Audit

July 3, 2025

A few years ago (in 2018), I wrote a MySQL Security series and one of the episode was about the MySQL Enterprise Audit feature — MySQL Security – MySQL Enterprise Audit — a powerful auditing capability that enables you to track and monitor database activity to ensure data integrity, strengthen security, and maintain compliance with regulatory requirements. This robust feature has also been available in HeatWave MySQL for the past few years (since 2023), bringing the same enterprise-grade auditing capabilities to the cloud.

HeatWave MySQL Database Audit is builds upon the established technology of MySQL Enterprise Audit, offering a comprehensive solution for tracking and analyzing database activities.

Key Benefits of HeatWave MySQL Database Audit

  • Rigorous Compliance & Forensics: Helps organizations meet stringent industry regulations (like FedRAMP, DISA STIG, PCI-DSS, HIPAA, SOX, GDPR, FERPA, and Center for Internet Security Benchmarks) by creating a detailed record of database events, essential for investigations and demonstrating adherence to policies.
  • Security Operations (SecOps): Enables real-time monitoring of user behaviors to detect and respond to potential security threats proactively.
  • Holistic Server Activity Tracking: Provides comprehensive auditing, from basic client connections and disconnections to more granular activities like interactions with specific schemas and tables, security changes, and errors.
  • Insights into Query and Statement Performance: Tracks query execution statistics, allowing for the identification of slow queries and performance bottlenecks, leading to database optimization.
  • Utilization & Optimization: Offers data-driven insights to pinpoint and streamline database operations and resource utilization.
  • “Trust but Verify” Security Principle: Allows for the monitoring of high-privilege users to prevent misuse of access.
  • Business Audit: Creates detailed records to prove data validity, accuracy, and integrity, demonstrating that no tampering has occurred.
  • Security Analysis: Serves as a vital component in a defense-in-depth strategy, facilitating both proactive (machine learning-based anomaly detection) and reactive (post-mortem analysis of attacks) security measures.

Feature Highlights:

  • Ready to Use: No installation steps are required, simplifying the process of securing and monitoring database activities.
  • Customizable Auditing: Allows Database Administrators to define filters to monitor specific operations, users, or broad activity categories, reducing audit noise and optimizing log size.
  • Real-time & Minimal Overhead: Provides instantaneous access to database activity with minimal impact on performance, supporting diverse use cases.
  • Automatic Log Rotation & Management: Includes built-in log rotation and automatic purging, ensuring optimal DB system performance without manual intervention. Audit logs are structured in JSON format, encrypted, and compressed for efficiency and security.
  • Multi-Instance Support: Works seamlessly with standalone, multi-instance, and High-Availability (HA) configurations, replicating audit filters and configurations to ensure no event is lost.
  • Access via SQL Interface: Enables querying of logs directly from the SQL interface using any MySQL client. Logs can also be channeled to OCI Logging Analytics, third-party monitoring tools, or SIEM systems for broader analysis.
  • Effortless Migration of Rules: Existing MySQL Enterprise Auditing rules (on-premise or other systems) can be easily migrated to HeatWave MySQL instances in the cloud, ensuring consistency.
  • Optional Query Execution Metrics: Can include details about query execution, such as slow queries, for performance analysis.
  • Option to Remove Sensitive Data: Allows for the omission of sensitive data from statements before logging.

Server Activity Tracking

The goal here is to track all DML (Data Manipulation Language) statements executed on a HeatWave MySQL instance, by an user account (assuming this user account it used by an application that requires audited). You can easily achieve this by creating a dedicated HeatWave MySQL Audit Log Filter that specifically logs:

  • INSERT
  • UPDATE
  • DELETE
  • TRUNCATE TABLE
  • REPLACE
  • LOAD DATA
  • LOAD XML

We’ll assign this filter to the dedicated user called auditee@%.

Workflow overview:

  1. Verify that the HeatWave MySQL Database Audit is enabled
  2. Create the DML specific audit filter
  3. Register the filter
  4. Assign the filter to the appropriate user account

Important: To utilize any filtering capabilities, the user performing these actions must possess the AUDIT_ADMIN privilege.

In this article, I’m using HeatWave 8.4.5:

SELECT VERSION();
+----------------+
| VERSION()      |
+----------------+
| 8.4.5-u2-cloud |
+----------------+

Verify that HeatWave MySQL Audit is enabled

HeatWave MySQL Database Audit is enable by default. You can check using the following queries:

SQL> 
-- Checks at the server plugins level
SELECT
    PLUGIN_NAME,
    PLUGIN_STATUS
FROM
    INFORMATION_SCHEMA.PLUGINS
WHERE
    PLUGIN_NAME LIKE 'audit%';
+-------------+---------------+
| PLUGIN_NAME | PLUGIN_STATUS |
+-------------+---------------+
| audit_log   | ACTIVE        |
+-------------+---------------+


-- Check at the component-based infrastructure level
SELECT 
    * 
FROM 
    mysql.component 
WHERE 
    component_urn LIKE '%audit%'\G
*************************** 1. row ***************************
      component_id: 6
component_group_id: 6
     component_urn: file://component_audit_api_message_emit

You can see all the audit feature related functions using performance_schema.user_defined_functions table:

SELECT * FROM performance_schema.user_defined_functions WHERE UDF_LIBRARY='audit_log.so';
+-----------------------------------+-----------------+----------+--------------+-----------------+
| UDF_NAME                          | UDF_RETURN_TYPE | UDF_TYPE | UDF_LIBRARY  | UDF_USAGE_COUNT |
+-----------------------------------+-----------------+----------+--------------+-----------------+
| audit_log_rotate                  | char            | function | audit_log.so |               1 |
| audit_log_read                    | char            | function | audit_log.so |               1 |
| audit_log_filter_remove_filter    | char            | function | audit_log.so |               1 |
| audit_log_encryption_password_set | integer         | function | audit_log.so |               1 |
| audit_log_filter_set_filter       | char            | function | audit_log.so |               1 |
| audit_log_encryption_password_get | char            | function | audit_log.so |               1 |
| audit_log_filter_remove_user      | char            | function | audit_log.so |               1 |
| audit_log_filter_set_user         | char            | function | audit_log.so |               1 |
| audit_log_read_bookmark           | char            | function | audit_log.so |               1 |
| audit_log_filter_flush            | char            | function | audit_log.so |               1 |
+-----------------------------------+-----------------+----------+--------------+-----------------+

The user_defined_functions table contains a row for each loadable function registered automatically by a component or plugin, or manually by a CREATE FUNCTION statement.

Create the DML Audit Filter

The filer to log all DMLs running on HeatWave MySQL is quite simple. We are using a JSON syntax:

SET @dml_filter = '{
  "filter": {
    "class": {
      "name": "table_access",
      "event": {
        "name": ["insert", "update", "delete"],
        "log": true
      }
    }
  }
}'
;

This filter logs only DML operations by checking SQL statement types (that is INSERT / UPDATE / DELETE / TRUNCATE TABLE / REPLACE / LOAD DATA / LOAD XML).

Register the Filter

Assign the filter a name, using the audit_log_filter_set_filter function (e.g., dml_logger):

SELECT audit_log_filter_set_filter('dml_logger', @dml_filter);

+--------------------------------------------------------+
| audit_log_filter_set_filter('dml_logger', @dml_filter) |
+--------------------------------------------------------+
| OK                                                     |
+--------------------------------------------------------+

To view audit filters, run the following command:

SELECT * FROM mysql_audit.audit_log_filter\G

*************************** 1. row ***************************
  NAME: dml_logger
FILTER: {"filter": {"class": {"name": "table_access", "event": {"log": true, "name": ["insert", "update", "delete"]}}}}

Apply the Filter to the User Account

Use the audit_log_filter_set_user function to start filtering an user account.

The syntax for the specific users auditee@% is:

SELECT audit_log_filter_set_user('auditee@%', 'dml_logger');

+----------------------------------------------------------+
| audit_log_filter_set_user('auditee@%', 'dml_logger') |
+----------------------------------------------------------+
| OK                                                       |
+----------------------------------------------------------+

If you want to assign the filter to all users, use the following syntax:

SELECT audit_log_filter_set_user('%', 'dml_logger');

In our context that could be useful to log all DMLs from any users / applications.

To unassign this rule, run the following command:

SELECT audit_log_filter_remove_user('%');

To view the assigned rules, run the following command:

SELECT * FROM mysql_audit.audit_log_user;

+-------------+------+------------+
| USER        | HOST | FILTERNAME |
+-------------+------+------------+
| auditee     | %    | dml_logger |
+-------------+------+------------+

Accessing and Analyzing Audit Data

To access audit data, users can simply query it using standard SQL. The primary method for retrieving this information is audit_log_read(), which returns the audit records in JSON format.

For a basic example of how to extract audit log entries, use the following command:

SELECT audit_log_read(audit_log_read_bookmark());

To display the audit data in a more readable format, use the JSON_PRETTY() and CONVERT() functions:

SELECT JSON_PRETTY(CONVERT(audit_log_read(audit_log_read_bookmark()) USING UTF8MB4))\G

You can refine your audit data extraction by passing additional parameters to the audit_log_read() function.
For example, to retrieve, 10 entries of audit logs starting from a specific timestamp, you can use:

SELECT JSON_PRETTY(CONVERT(audit_log_read('{ "start": { "timestamp": "2025-07-02 15:28:12" }, "max_array_length": 10 }')USING UTF8MB4))\G

Note. You can also use the MySQL’s JSON function JSON_TABLE, to transform audit data into a tabular format. For example to extract a subset of JSON name-value pairs and convert them into a structured table, making the data easier to work with and analyze.

Playground

My application uses the auditee user account, which has the necessary privileges on the s1 schema as well as the AUDIT_ADMIN privilege:

CREATE USER auditee@'%' IDENTIFIED BY 'My5up4rP@sS';

GRANT ALL ON s1.* TO auditee@'%';

GRANT AUDIT_ADMIN ON *.* TO auditee@'%';

and for this demo I’ll use the table s1.t1:

CREATE SCHEMA s1;
USE s1;

CREATE TABLE `t1` (
  `id` int NOT NULL AUTO_INCREMENT,
  PRIMARY KEY (`id`)
);

Inside a session using the auditee@% user account:

SQL> SELECT CURRENT_USER();
+----------------+
| CURRENT_USER() |
+----------------+
| auditee@%      |
+----------------+

Let’s run some queries:

SQL> INSERT t1 VALUES (NULL);
Query OK, 1 row affected (0.0029 sec)

SQL> SELECT * FROM t1;
+----+
| id |
+----+
|  1 |
+----+

We can access the audit data through the SQL interface using the audit_log_read() function, as demonstrated earlier:

SELECT JSON_PRETTY(CONVERT(audit_log_read(audit_log_read_bookmark()) USING UTF8MB4))\G

*************************** 1. row ***************************
JSON_PRETTY(CONVERT(audit_log_read(audit_log_read_bookmark()) USING UTF8MB4)): [
  {
    "id": 0,
    "ts": 1751493370,
    "class": "table_access",
    "event": "insert",
    "login": {
      "ip": "10.0.0.241",
      "os": "",
      "user": "auditee",
      "proxy": ""
    },
    "account": {
      "host": "jumpserver.xxxxxxxxxx.yyyyyyyy.oraclevcn.com",
      "user": "auditee"
    },
    "timestamp": "2025-07-02 21:56:10",
    "connection_id": 31332,
    "table_access_data": {
      "db": "s1",
      "query": "INSERT t1 VALUES (NULL)",
      "table": "t1",
      "sql_command": "insert"
    }
  },
  null
]

As expected, the INSERT command is logged, not the SELECT.

User auditee is now running the following queries:

UPDATE t1 SET id=100 WHERE id=1;

SELECT * FROM t1;

DELETE FROM t1;

The audit log should contain the two DML statements, but not the SELECT query:

-- events in the log
SELECT JSON_PRETTY(CONVERT(audit_log_read(audit_log_read_bookmark()) USING UTF8MB4))\G
*************************** 1. row ***************************
JSON_PRETTY(CONVERT(audit_log_read(audit_log_read_bookmark()) USING UTF8MB4)): [
  {
    "id": 0,
    "ts": 1751493469,
    "class": "table_access",
    "event": "update",
    "login": {
      "ip": "10.0.0.241",
      "os": "",
      "user": "auditee",
      "proxy": ""
    },
    "account": {
      "host": "jumpserver.xxxxxxxxxx.yyyyyyyy.oraclevcn.com",
      "user": "auditee"
    },
    "timestamp": "2025-07-02 21:57:49",
    "connection_id": 31332,
    "table_access_data": {
      "db": "s1",
      "query": "UPDATE t1 SET id=100 WHERE id=1",
      "table": "t1",
      "sql_command": "update"
    }
  },
  {
    "id": 1,
    "ts": 1751493469,
    "class": "table_access",
    "event": "delete",
    "login": {
      "ip": "10.0.0.241",
      "os": "",
      "user": "auditee",
      "proxy": ""
    },
    "account": {
      "host": "jumpserver.xxxxxxxxxx.yyyyyyyy.oraclevcn.com",
      "user": "auditee"
    },
    "timestamp": "2025-07-02 21:57:49",
    "connection_id": 31332,
    "table_access_data": {
      "db": "s1",
      "query": "DELETE FROM t1",
      "table": "t1",
      "sql_command": "delete"
    }
  },
  null
]

Et voilà!

Let’s finish this article with some examples in bulk

Starting from a particular timestamp (2025-07-02 21:56:10):

-- At most 3 events at that time
SELECT JSON_PRETTY(CONVERT(audit_log_read('{ "start": { "timestamp": "2025-07-02 21:56:10" }, "max_array_length": 3 }')USING UTF8MB4))\G
*************************** 1. row ***************************
JSON_PRETTY(CONVERT(audit_log_read('{ "start": { "timestamp": "2025-07-02 21:56:10" }, "max_array_length": 3 }')USING UTF8MB4)): [
  {
    "id": 0,
    "ts": 1751493370,
    "class": "table_access",
    "event": "insert",
    "login": {
      "ip": "10.0.0.241",
      "os": "",
      "user": "auditee",
      "proxy": ""
    },
    "account": {
      "host": "jumpserver.xxxxxxxxxx.yyyyyyyy.oraclevcn.com",
      "user": "auditee"
    },
    "timestamp": "2025-07-02 21:56:10",
    "connection_id": 31332,
    "table_access_data": {
      "db": "s1",
      "query": "INSERT t1 VALUES (NULL)",
      "table": "t1",
      "sql_command": "insert"
    }
  },
  {
    "id": 0,
    "ts": 1751493469,
    "class": "table_access",
    "event": "update",
    "login": {
      "ip": "10.0.0.241",
      "os": "",
      "user": "auditee",
      "proxy": ""
    },
    "account": {
      "host": "jumpserver.xxxxxxxxxx.yyyyyyyy.oraclevcn.com",
      "user": "auditee"
    },
    "timestamp": "2025-07-02 21:57:49",
    "connection_id": 31332,
    "table_access_data": {
      "db": "s1",
      "query": "UPDATE t1 SET id=100 WHERE id=1",
      "table": "t1",
      "sql_command": "update"
    }
  },
  {
    "id": 1,
    "ts": 1751493469,
    "class": "table_access",
    "event": "delete",
    "login": {
      "ip": "10.0.0.241",
      "os": "",
      "user": "auditee",
      "proxy": ""
    },
    "account": {
      "host": "jumpserver.xxxxxxxxxx.yyyyyyyy.oraclevcn.com",
      "user": "auditee"
    },
    "timestamp": "2025-07-02 21:57:49",
    "connection_id": 31332,
    "table_access_data": {
      "db": "s1",
      "query": "DELETE FROM t1",
      "table": "t1",
      "sql_command": "delete"
    }
  },
  null
]

First event (max_array_length: 1) at that particular timestamp (2025-07-02 21:56:10):

-- first event at that time
SELECT JSON_PRETTY(CONVERT(audit_log_read('{ "start": { "timestamp": "2025-07-02 21:56:10" }, "max_array_length": 1 }')USING UTF8MB4))\G
*************************** 1. row ***************************
JSON_PRETTY(CONVERT(audit_log_read('{ "start": { "timestamp": "2025-07-02 21:56:10" }, "max_array_length": 1 }')USING UTF8MB4)): [
  {
    "id": 0,
    "ts": 1751493370,
    "class": "table_access",
    "event": "insert",
    "login": {
      "ip": "10.0.0.241",
      "os": "",
      "user": "auditee",
      "proxy": ""
    },
    "account": {
      "host": "jumpserver.xxxxxxxxxx.yyyyyyyy.oraclevcn.com",
      "user": "auditee"
    },
    "timestamp": "2025-07-02 21:56:10",
    "connection_id": 31332,
    "table_access_data": {
      "db": "s1",
      "query": "INSERT t1 VALUES (NULL)",
      "table": "t1",
      "sql_command": "insert"
    }
  }
]

Second event (id=1) at that particular timestamp (2025-07-02 21:57:49)

-- event id=1 at that time
SELECT JSON_PRETTY(CONVERT(audit_log_read('{ "timestamp": "2025-07-02 21:57:49", "id": 1 }')USING UTF8MB4))\G
*************************** 1. row ***************************
JSON_PRETTY(CONVERT(audit_log_read('{ "timestamp": "2025-07-02 21:57:49", "id": 1 }')USING UTF8MB4)): [
  {
    "id": 1,
    "ts": 1751493469,
    "class": "table_access",
    "event": "delete",
    "login": {
      "ip": "10.0.0.241",
      "os": "",
      "user": "auditee",
      "proxy": ""
    },
    "account": {
      "host": "jumpserver.xxxxxxxxxx.yyyyyyyy.oraclevcn.com",
      "user": "auditee"
    },
    "timestamp": "2025-07-02 21:57:49",
    "connection_id": 31332,
    "table_access_data": {
      "db": "s1",
      "query": "DELETE FROM t1",
      "table": "t1",
      "sql_command": "delete"
    }
  },
  null
]

Peroraison

HeatWave MySQL Database Audit offers a powerful, enterprise-grade auditing framework that seamlessly extends MySQL Enterprise Audit capabilities to the cloud. By enabling fine-grained filtering, real-time monitoring, and flexible log access via SQL, it empowers database administrators and security teams to ensure compliance, enhance visibility, and strengthen operational security.

In this article, we demonstrated how to track all DML operations executed by a specific application user using a dedicated audit filter. From enabling the audit plugin to querying structured audit logs, HeatWave makes it straightforward to implement robust auditing practices with minimal overhead.

Whether you’re working toward regulatory compliance, safeguarding sensitive data, or optimizing database performance, HeatWave MySQL Database Audit equips you with the tools needed to meet modern data governance and security demands—efficiently and effectively.

By leveraging HeatWave MySQL Database Audit, you’re not just logging data; you’re building a foundation of trust and accountability for your critical database operations.

Stay tuned for more insights!

Follow me on Linkedin

Watch my videos on my YouTube channel and subscribe.

My Slideshare account.

My Speaker Deck account.

Thanks for using HeatWave & MySQL!

0

Handle CSV files with HeatWave MySQL

June 26, 2025

When it comes to loading data from CSV files into your MySQL environment, there’s no shortage of options.
In this post, I’ll walk you through two efficient, developer-friendly and MySQL-ish approaches:

Both can save you time and effort—especially when dealing with large datasets.

In this article, I’am using HeatWave 9.3.1:

SELECT version();
+----------------+
| version()      |
+----------------+
| 9.3.1-u1-cloud |
+----------------+

Also, I’m using the reviews.csv file for Paris, Île-de-France, France (03 March, 2025) from the Inside Airbnb dataset.

$ head /Data/project/source/reviews.csv
listing_id,date
3109,2017-10-28
3109,2017-11-03
3109,2018-07-24
3109,2019-10-24
5396,2009-06-30
5396,2009-07-03
5396,2009-07-08
5396,2009-09-10
5396,2009-12-02

MySQL Shell’s parallel import utility

You must create the table that will store the data before importing it from the CSV file:

SQL>
CREATE SCHEMA homestays;

USE homestays;

CREATE TABLE `reviews_from_mysqlsh` (
  `listing_id` bigint unsigned NOT NULL,
  `date` date DEFAULT NULL
) ENGINE=InnoDB 
;

Import data from an object storage bucket

Now we can import the data. MySQL Shell supports importing input data files stored on a “local” disk or in Oracle Cloud Infrastructure (OCI) Object Storage buckets, using the routine util.importTable.
Here, my CSV file is stored on an OCI object storage bucket named Airbnb.

Leveraging MySQL Shell’s JavaScript syntax, you can do:

JS>
util.importTable("reviews.csv", {schema: "homestays", table: "reviews_from_mysqlsh", dialect: "csv-unix", skipRows: 1, showProgress: true, osBucketName: "Airbnb", osNamespace: "mynamespace"})
Importing from file 'reviews.csv' to table `homestays`.`reviews_from_mysqlsh` in MySQL Server at 10.0.1.2:3306 using 1 thread
[Worker000]: reviews.csv: Records: 2068800  Deleted: 0  Skipped: 0  Warnings: 0
99% (47.29 MB / 47.29 MB), 6.75 MB/s                    
File 'reviews.csv' (47.29 MB) was imported in 8.2168 sec at 5.76 MB/s
Total rows affected in homestays.reviews_from_mysqlsh: Records: 2068800  Deleted: 0  Skipped: 0  Warnings: 0

The main configuration options are:

  • CSV file name: reviews.csv
  • database name: homestays
  • table name: reviews_from_mysqlsh
  • dialect: CSV file created on a Unix systems
  • Object storage bucket name: Airbnb
  • Object storage namespace: mynamespace

And now you can query your data:

JS>
\sql SELECT * FROM homestays.reviews_from_mysqlsh LIMIT 5;
+------------+------------+
| listing_id | date       |
+------------+------------+
|       3109 | 2017-10-28 |
|       3109 | 2017-11-03 |
|       3109 | 2018-07-24 |
|       3109 | 2019-10-24 |
|       5396 | 2009-06-30 |
+------------+------------+
SQL>
SELECT COUNT(*) FROM homestays.reviews_from_mysqlsh;
+----------+
| COUNT(*) |
+----------+
|  2068800 |
+----------+
1 row in set (0.5208 sec)
SQL>
EXPLAIN SELECT COUNT(*) FROM homestays.reviews_from_mysqlsh\G
*************************** 1. row ***************************
EXPLAIN: -> Aggregate: count(0)  (cost=816457..816457 rows=1)
    -> Table scan on reviews_from_mysqlsh  (cost=0.267..551720 rows=2.06e+6)

1 row in set (0.0934 sec)

As a side note, you can even do some basic transformations on the fly before loading the data. See decodeColumns and columns options.

Import data from a local disk

If the file is on a “local” disk, the syntax is the following:

JS>
util.importTable("/Data/project/source/reviews.csv", {schema: "homestays", table: "reviews_from_mysqlsh", dialect: "csv-unix", skipRows: 1, showProgress: true})
Importing from file '/Data/project/source/reviews.csv' to table `homestays`.`reviews_from_mysqlsh` in MySQL Server at 10.0.1.2:3306 using 1 thread
[Worker000]: reviews.csv: Records: 2068800  Deleted: 0  Skipped: 0  Warnings: 0
99% (47.29 MB / 47.29 MB), 7.27 MB/s                    
File '/Data/project/source/reviews.csv' (47.29 MB) was imported in 6.9141 sec at 6.84 MB/s
Total rows affected in homestays.reviews_from_mysqlsh: Records: 2068800  Deleted: 0  Skipped: 0  Warnings: 0

Import data to a MySQL server

Obviously, util.importTable also works on a classic (I mean non HeatWave) MySQL instance. In this context, you will most likely need to set local_infile variable to 1. Its default value is OFF:

mysql> 
SHOW GLOBAL VARIABLES LIKE 'local_infile';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| local_infile  | OFF   |
+---------------+-------+

It allows you to avoid the following error:

ERROR: The 'local_infile' global system variable must be set to ON in the target server, after the server is verified to be trusted.
Util.importTable: Invalid preconditions (RuntimeError)

Before changing this value I highly recommend to read: Security Considerations for LOAD DATA LOCAL. We don’t joke about security…

So basically, you’ll need to do something like this:

JS>
\sql SET GLOBAL local_infile = 1;

util.importTable("/Data/project/source/reviews.csv", {schema: "homestays", table: "reviews_from_mysqlsh", dialect: "csv-unix", skipRows: 1, showProgress: true})

\sql SET GLOBAL local_infile = 0;

Turning our attention back to HeatWave — as you may already know, a HeatWave cluster can dramatically accelerate your queries, enabling you to use the familiar MySQL API in analytics scenarios such as data warehousing and lakehousing.
To unlock these performance superpowers, you first need to load your data into the HeatWave cluster. Once that’s done, you can fully enjoy the incredible speed and efficiency it brings to your workloads!

Load data into the HeatWave Cluster

To load your data into your HeatWave Cluster from your MySQL table use the sys.heatwave_load stored procedure.

SQL>
CALL sys.heatwave_load(JSON_ARRAY("homestays"), JSON_OBJECT('include_list', JSON_ARRAY('homestays.reviews_from_mysqlsh')));

+------------------------------------------+
| INITIALIZING HEATWAVE AUTO PARALLEL LOAD |
+------------------------------------------+
| Version: 4.31                            |
|                                          |
| Load Mode: normal                        |
| Load Policy: disable_unsupported_columns |
| Output Mode: normal                      |
|                                          |
+------------------------------------------+
6 rows in set (0.0158 sec)

+-----------------------------------------------------------------------------------------+
| OFFLOAD ANALYSIS                                                                        |
+-----------------------------------------------------------------------------------------+
| Verifying input schemas: 1                                                              |
| User excluded items: 0                                                                  |
|                                                                                         |
| SCHEMA                       OFFLOADABLE    OFFLOADABLE     SUMMARY OF                  |
| NAME                              TABLES        COLUMNS     ISSUES                      |
| ------                       -----------    -----------     ----------                  |
| `homestays`                            0              0     1 table(s) are not loadable |
|                                                                                         |
| No offloadable schema found, HeatWave Auto Load terminating                             |
|                                                                                         |
| Total errors encountered: 1                                                             |
| Total warnings encountered: 3                                                           |
| Retrieve the associated logs from the report table using the query below:               |
|   SELECT log FROM sys.heatwave_autopilot_report WHERE type IN ('error', 'warn');        |
|                                                                                         |
+-----------------------------------------------------------------------------------------+
15 rows in set (0.0158 sec)

Oops!!
It failed! There is an error. Let’s check it:

SQL> 
SELECT log FROM sys.heatwave_autopilot_report WHERE type IN ('error', 'warn'); 
+-------------------------------------------------------------------------------------------------------------------------+
| log                                                                                                                     |
+-------------------------------------------------------------------------------------------------------------------------+
| {"error": "Unable to load table without primary key", "table_name": "reviews_from_mysqlsh", "schema_name": "homestays"} |
| {"warn": "1 table(s) are not loadable", "schema_name": "homestays"}                                                     |
| {"warn": "No offloadable tables found", "schema_name": "homestays"}                                                     |
| {"warn": "No offloadable tables found for given input target"}                                                          |
+-------------------------------------------------------------------------------------------------------------------------+

{“error”: “Unable to load table without primary key”, “table_name”: “reviews_from_mysqlsh”, “schema_name”: “homestays”}

Well, I conveniently forgot to mention one important requirement: the table must have a primary key. 😉

If your table doesn’t have a natural or meaningful primary key, no worries — one option is to use Generated Invisible Primary Keys (GIPKs). This allows MySQL to automatically add an invisible primary key behind the scenes.

An ALTER TABLE operation is all it takes:

SQL> 
ALTER TABLE homestays.reviews_from_mysqlsh ADD COLUMN my_row_id BIGINT UNSIGNED NOT NULL INVISIBLE AUTO_INCREMENT PRIMARY KEY FIRST;
Query OK, 0 rows affected (6.3596 sec)

Records: 0  Duplicates: 0  Warnings: 0

You can read more about GIPKs in the official documentation.

Now, let’s examine the table’s current state:

SQL > 
SHOW CREATE TABLE homestays.reviews_from_mysqlsh\G
*************************** 1. row ***************************
       Table: reviews_from_mysqlsh
Create Table: CREATE TABLE `reviews_from_mysqlsh` (
  `my_row_id` bigint unsigned NOT NULL AUTO_INCREMENT /*!80023 INVISIBLE */,
  `listing_id` bigint unsigned NOT NULL,
  `date` date DEFAULT NULL,
  PRIMARY KEY (`my_row_id`)
) ENGINE=InnoDB AUTO_INCREMENT=2068801 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci

We can try again to load the data into the HeatWave Cluster:

SQL> 
CALL sys.heatwave_load(JSON_ARRAY("homestays"), JSON_OBJECT('include_list', JSON_ARRAY('homestays.reviews_from_mysqlsh')));
+------------------------------------------+
| INITIALIZING HEATWAVE AUTO PARALLEL LOAD |
+------------------------------------------+
| Version: 4.31                            |
|                                          |
| Load Mode: normal                        |
| Load Policy: disable_unsupported_columns |
| Output Mode: normal                      |
|                                          |
+------------------------------------------+
6 rows in set (0.0124 sec)

+------------------------------------------------------------------------+
| OFFLOAD ANALYSIS                                                       |
+------------------------------------------------------------------------+
| Verifying input schemas: 1                                             |
| User excluded items: 0                                                 |
|                                                                        |
| SCHEMA                       OFFLOADABLE    OFFLOADABLE     SUMMARY OF |
| NAME                              TABLES        COLUMNS     ISSUES     |
| ------                       -----------    -----------     ---------- |
| `homestays`                            1              3                |
|                                                                        |
| Total offloadable schemas: 1                                           |
|                                                                        |
+------------------------------------------------------------------------+
10 rows in set (0.0124 sec)

+-----------------------------------------------------------------------------------------------------------------------------+
| CAPACITY ESTIMATION                                                                                                         |
+-----------------------------------------------------------------------------------------------------------------------------+
| Default encoding for string columns: VARLEN (unless specified in the schema)                                                |
| Estimating memory footprint for 1 schema(s)                                                                                 |
|                                                                                                                             |
|                                TOTAL       ESTIMATED       ESTIMATED       TOTAL     DICTIONARY      VARLEN       ESTIMATED |
| SCHEMA                   OFFLOADABLE   HEATWAVE NODE      MYSQL NODE      STRING        ENCODED     ENCODED            LOAD |
| NAME                          TABLES       FOOTPRINT       FOOTPRINT     COLUMNS        COLUMNS     COLUMNS            TIME |
| ------                   -----------       ---------       ---------     -------     ----------     -------       --------- |
| `homestays`                        1       77.61 MiB      256.00 KiB           0              0           0         58.00 s |
|                                                                                                                             |
| Sufficient MySQL host memory available to load all tables.                                                                  |
| Sufficient HeatWave cluster memory available to load all tables.                                                            |
|                                                                                                                             |
+-----------------------------------------------------------------------------------------------------------------------------+
12 rows in set (0.0124 sec)

+---------------------------------------------------------------------------------------------------------------------------------------+
| EXECUTING LOAD SCRIPT                                                                                                                 |
+---------------------------------------------------------------------------------------------------------------------------------------+
| HeatWave Load script generated                                                                                                        |
|   Retrieve load script containing 3 generated DDL command(s) using the query below:                                                   |
| Deprecation Notice: "heatwave_load_report" will be deprecated, please switch to "heatwave_autopilot_report"                           |
|   SELECT log->>"$.sql" AS "Load Script" FROM sys.heatwave_autopilot_report WHERE type = "sql" ORDER BY id;                            |
|                                                                                                                                       |
| Adjusting load parallelism dynamically per internal/external table.                                                                   |
| Using current parallelism of 4 thread(s) as maximum for internal tables.                                                              |
|                                                                                                                                       |
| Warning: Executing the generated script may alter column definitions and secondary engine flags in the schema                         |
|                                                                                                                                       |
| Using SQL_MODE: ONLY_FULL_GROUP_BY,STRICT_TRANS_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,NO_ENGINE_SUBSTITUTION |
|                                                                                                                                       |
| Proceeding to load 1 table(s) into HeatWave.                                                                                          |
|                                                                                                                                       |
| Applying changes will take approximately 57.79 s                                                                                      |
|                                                                                                                                       |
+---------------------------------------------------------------------------------------------------------------------------------------+
16 rows in set (0.0124 sec)

+----------------------------------------------------+
| TABLE LOAD                                         |
+----------------------------------------------------+
| TABLE (1 of 1): `homestays`.`reviews_from_mysqlsh` |
| Commands executed successfully: 3 of 3             |
| Warnings encountered: 0                            |
| Table load succeeded!                              |
|   Total columns loaded: 3                          |
|   Table loaded using 4 thread(s)                   |
|   Elapsed time: 7.92 s                             |
|                                                    |
+----------------------------------------------------+
8 rows in set (0.0124 sec)

+-------------------------------------------------------------------------------+
| LOAD SUMMARY                                                                  |
+-------------------------------------------------------------------------------+
|                                                                               |
| SCHEMA                          TABLES       TABLES      COLUMNS         LOAD |
| NAME                            LOADED       FAILED       LOADED     DURATION |
| ------                          ------       ------      -------     -------- |
| `homestays`                          1            0            3       7.92 s |
|                                                                               |
| Total errors encountered: 0                                                   |
| Total warnings encountered: 0                                                 |
|                                                                               |
+-------------------------------------------------------------------------------+
9 rows in set (0.0124 sec)

Et voilà!

The new table structure is:

SQL> 
SHOW CREATE TABLE reviews_from_mysqlsh\G
*************************** 1. row ***************************
       Table: reviews_from_mysqlsh
Create Table: CREATE TABLE `reviews_from_mysqlsh` (
  `my_row_id` bigint unsigned NOT NULL AUTO_INCREMENT /*!80023 INVISIBLE */,
  `listing_id` bigint unsigned NOT NULL,
  `date` date DEFAULT NULL,
  PRIMARY KEY (`my_row_id`)
) ENGINE=InnoDB AUTO_INCREMENT=2068801 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci SECONDARY_ENGINE=RAPID

Please note the SECONDARY_ENGINE=RAPID new clause.

And you can still query your table according to your needs:

SQL>
SELECT COUNT(*) FROM homestays.reviews_from_mysqlsh;
+----------+
| COUNT(*) |
+----------+
|  2068800 |
+----------+
1 row in set (0.1160 sec)
SQL>
EXPLAIN SELECT COUNT(*) FROM homestays.reviews_from_mysqlsh\G
*************************** 1. row ***************************
EXPLAIN: -> Aggregate: count(0)  (cost=16.6e+6..16.6e+6 rows=1)
    -> Table scan on reviews_from_mysqlsh in secondary engine RAPID  (cost=0..0 rows=2.07e+6)

1 row in set, 1 warning (0.0935 sec)
Note (code 1003): Query is executed in secondary engine; the actual query plan may diverge from the printed one

HeatWave’s Auto Parallel Load

HeatWave’s Auto Parallel Load is a key feature within HeatWave that automatically loads data into the HeatWave cluster, without requiring manual intervention or tuning. Data loading is performed using multiple threads across nodes in the HeatWave cluster, significantly speeding up the operation.

And guess what? you already know the command, it is sys.heatwave_load.

To use it, first we need to define the command using JSON syntax. We recommend assigning this JSON structure to a variable, such as @input_list:

SET @input_list = '
[
    {
        "db_name": "homestays",
        "tables": [
            {
                "table_name": "reviews_from_HW_load",
                "engine_attribute": {
                    "file": [
                        {
                            "uri": "oci://Airbnb@mynamespace/reviews.csv"
                        }
                    ],
                    "dialect": {
                        "format": "csv",
                        "field_delimiter": "auto",  
                        "record_delimiter": "auto",
                        "has_header": true
                    }
                }
            }
        ]
    }
]';

You’ll find the previously used configuration:

  • Database name: homestays
  • Table name: reviews_from_HW_load
  • The file is find following the URI:
    • Object storage bucket name: Airbnb
    • Object storage namespace: mynamespace
    • CSV file name: reviews.csv
  • The dialect is CSV, so the only information HeatWave’s Auto Parallel Load requires from us is the presence of a header in the file.

And like we have seen previously, run the stored procedure using the CALL statement:

SQL> 
CALL sys.heatwave_load(CAST(@input_list AS JSON), NULL);
+------------------------------------------+
| INITIALIZING HEATWAVE AUTO PARALLEL LOAD |
+------------------------------------------+
| Version: 4.31                            |
|                                          |
| Load Mode: normal                        |
| Load Policy: disable_unsupported_columns |
| Output Mode: normal                      |
|                                          |
+------------------------------------------+
6 rows in set (0.0128 sec)

+--------------------------------------------------------------------------------------------------------------------+
| LAKEHOUSE AUTO SCHEMA INFERENCE                                                                                    |
+--------------------------------------------------------------------------------------------------------------------+
| Verifying external lakehouse tables: 1                                                                             |
|                                                                                                                    |
| SCHEMA                   TABLE                    TABLE IS           RAW     NUM. OF      ESTIMATED     SUMMARY OF |
| NAME                     NAME                     CREATED      FILE SIZE     COLUMNS      ROW COUNT     ISSUES     |
| ------                   -----                    --------     ---------     -------      ---------     ---------- |
| `homestays`              `reviews_from_HW_load`   NO           45.10 MiB           2         2.07 M                |
|                                                                                                                    |
| New schemas to be created: 0                                                                                       |
| External lakehouse tables to be created: 1                                                                         |
|                                                                                                                    |
+--------------------------------------------------------------------------------------------------------------------+
10 rows in set (0.0128 sec)

+------------------------------------------------------------------------+
| OFFLOAD ANALYSIS                                                       |
+------------------------------------------------------------------------+
| Verifying input schemas: 1                                             |
| User excluded items: 0                                                 |
|                                                                        |
| SCHEMA                       OFFLOADABLE    OFFLOADABLE     SUMMARY OF |
| NAME                              TABLES        COLUMNS     ISSUES     |
| ------                       -----------    -----------     ---------- |
| `homestays`                            1              2                |
|                                                                        |
| Total offloadable schemas: 1                                           |
|                                                                        |
+------------------------------------------------------------------------+
10 rows in set (0.0128 sec)

+-----------------------------------------------------------------------------------------------------------------------------+
| CAPACITY ESTIMATION                                                                                                         |
+-----------------------------------------------------------------------------------------------------------------------------+
| Default encoding for string columns: VARLEN (unless specified in the schema)                                                |
| Estimating memory footprint for 1 schema(s)                                                                                 |
|                                                                                                                             |
|                                TOTAL       ESTIMATED       ESTIMATED       TOTAL     DICTIONARY      VARLEN       ESTIMATED |
| SCHEMA                   OFFLOADABLE   HEATWAVE NODE      MYSQL NODE      STRING        ENCODED     ENCODED            LOAD |
| NAME                          TABLES       FOOTPRINT       FOOTPRINT     COLUMNS        COLUMNS     COLUMNS            TIME |
| ------                   -----------       ---------       ---------     -------     ----------     -------       --------- |
| `homestays`                        1       57.57 MiB      192.00 KiB           0              0           0          7.00 s |
|                                                                                                                             |
| Sufficient MySQL host memory available to load all tables.                                                                  |
| Sufficient HeatWave cluster memory available to load all tables.                                                            |
|                                                                                                                             |
+-----------------------------------------------------------------------------------------------------------------------------+
12 rows in set (0.0128 sec)

+---------------------------------------------------------------------------------------------------------------------------------------+
| EXECUTING LOAD SCRIPT                                                                                                                 |
+---------------------------------------------------------------------------------------------------------------------------------------+
| HeatWave Load script generated                                                                                                        |
|   Retrieve load script containing 2 generated DDL command(s) using the query below:                                                   |
| Deprecation Notice: "heatwave_load_report" will be deprecated, please switch to "heatwave_autopilot_report"                           |
|   SELECT log->>"$.sql" AS "Load Script" FROM sys.heatwave_autopilot_report WHERE type = "sql" ORDER BY id;                            |
|                                                                                                                                       |
| Adjusting load parallelism dynamically per internal/external table.                                                                   |
| Using current parallelism of 4 thread(s) as maximum for internal tables.                                                              |
|                                                                                                                                       |
| Warning: Executing the generated script may alter column definitions and secondary engine flags in the schema                         |
|                                                                                                                                       |
| Using SQL_MODE: ONLY_FULL_GROUP_BY,STRICT_TRANS_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,NO_ENGINE_SUBSTITUTION |
|                                                                                                                                       |
| Proceeding to load 1 table(s) into HeatWave.                                                                                          |
|                                                                                                                                       |
| Applying changes will take approximately 7.01 s                                                                                       |
|                                                                                                                                       |
+---------------------------------------------------------------------------------------------------------------------------------------+
16 rows in set (0.0128 sec)

+----------------------------------------------------+
| TABLE LOAD                                         |
+----------------------------------------------------+
| TABLE (1 of 1): `homestays`.`reviews_from_HW_load` |
| Commands executed successfully: 2 of 2             |
| Warnings encountered: 0                            |
| Table load succeeded!                              |
|   Total columns loaded: 2                          |
|   Elapsed time: 30.95 s                            |
|                                                    |
+----------------------------------------------------+
7 rows in set (0.0128 sec)

+----------------------------------------------------------------------------------+
| LOAD SUMMARY                                                                     |
+----------------------------------------------------------------------------------+
|                                                                                  |
| SCHEMA                          TABLES       TABLES      COLUMNS         LOAD    |
| NAME                            LOADED       FAILED       LOADED     DURATION    |
| ------                          ------       ------      -------     --------    |
| `homestays`                          1            0            2      30.95 s    |
|                                                                                  |
| Total errors encountered: 0                                                      |
| Total warnings encountered: 2                                                    |
| Retrieve the associated logs from the report table using the query below:        |
|   SELECT log FROM sys.heatwave_autopilot_report WHERE type IN ('error', 'warn'); |
|                                                                                  |
+----------------------------------------------------------------------------------+
11 rows in set (0.0128 sec)

The operation completed successfully; however, two warnings were generated.
Details regarding these warnings are available in the sys.heatwave_autopilot_report table:

SQL > 
SELECT log FROM sys.heatwave_autopilot_report WHERE type IN ('error', 'warn')\G
*************************** 1. row ***************************
log: {"message": "[WARNINGS SUMMARY] Lakehouse Schema Inference had 1 warning(s) out of which 1 were not recorded (due to max_error_count limit or filtering rules)", "table_name": "reviews_from_HW_load", "schema_name": "homestays", "condition_no": 1}
*************************** 2. row ***************************
log: {"message": "[WARNINGS SUMMARY] 1 warning(s) with code: 6095(ER_LH_WARN_INFER_SKIPPED_LINES)", "table_name": "reviews_from_HW_load", "schema_name": "homestays", "condition_no": 2}

Fortunately, nothing critical.

And now you can query your table according to your needs:

SQL>
SELECT COUNT(*) FROM reviews_from_HW_load;
+----------+
| COUNT(*) |
+----------+
|  2068800 |
+----------+
1 row in set (0.1147 sec)
SQL>
EXPLAIN SELECT COUNT(*) FROM homestays.reviews_from_HW_load\G
*************************** 1. row ***************************
EXPLAIN: -> Aggregate: count(0)  (cost=16.6e+6..16.6e+6 rows=1)
    -> Table scan on reviews_from_HW_load in secondary engine RAPID  (cost=0..0 rows=2.07e+6)

1 row in set, 1 warning (0.0933 sec)
Note (code 1003): Query is executed in secondary engine; the actual query plan may diverge from the printed on

Peroration

Whether you’re working with traditional MySQL or taking advantage of the blazing-fast analytics capabilities of HeatWave, importing CSV data doesn’t have to be a bottleneck. With tools like MySQL Shell’s parallel import utility and HeatWave’s Auto Parallel Load, you have flexible, scalable options that fit a variety of use cases — from local file loading to seamless integration with object storage.

By combining these tools with features like Generated Invisible Primary Keys, you can streamline the ingestion process and get your data ready for powerful, real-time analytics with minimal overhead.

So next time you’re staring at a CSV file and a big dataset to analyze, you’re fully equipped to handle it — the MySQL way.

Stay tuned for more insights!

Follow me on Linkedin

Watch my videos on my YouTube channel and subscribe.

My Slideshare account.

My Speaker Deck account.

Thanks for using HeatWave & MySQL!

0

Oracle Dev Days 2025 – French Edition

May 13, 2025
Tags: ,

Update:
The presentation is available on my Speaker Deck account: Build an AI-Powered Search Engine with HeatWave GenAI – Semantic Search Capabilities with Large Language Models

Join the Oracle Dev Days – French Edition, from May 20 to 22, 2025!
This must-attend event (in French) offers a rich program exploring the latest advancements in AI, databases, cloud, and Java.

Join me on May 21 at 2:00 PM for the day dedicated to “Database & AI.” I’ll be presenting “Building an AI-Powered Search Engine with HeatWave GenAI.” I’ll show you how to go beyond the limits of traditional SQL to harness the power of LLM-driven semantic search.

This approach significantly enhances the relevance of search results by understanding context, interpreting user intent, and handling synonyms.

During this session, we’ll cover:

  • The technology stack used: SQL, Python, and JavaScript-based stored procedures
  • The architecture of a complete RAG (retrieval-augmented generation) pipeline, including data extraction, vectorization, storage, and querying within the database
  • The process of building a chatbot for natural language interaction with the AI

Discover how to implement a powerful, AI-enhanced semantic search engine directly within Oracle HeatWave GenAI.

To register, click here.

0

Build an AI-Powered Search Engine with HeatWave GenAI (part 3)

April 15, 2025

In Build an AI-Powered Search Engine with HeatWave GenAI (part 1), we introduced the fundamentals of creating an AI-powered search engine using HeatWave GenAI. We highlighted the advantages of semantic search powered by large language models over traditional SQL-based approaches and provided a hands-on guide for generating embeddings and running similarity searches — key techniques that significantly improve the retrieval of relevant content.

In the second opus — Build an AI-Powered Search Engine with HeatWave GenAI (part 2) — we shifted our focus to improving search result quality through reranking strategies and the use of article summaries for embedding generation. We demonstrated how to implement these enhancements entirely within HeatWave using JavaScript-based stored procedures. By assigning different weights to title and excerpt distances, and generating embeddings from sanitized summaries, we boosted the precision and relevance of search results. This approach showcases HeatWave GenAI’s ability to embed advanced AI capabilities directly within the database layer.

In this third installment, we’ll take it a step further by incorporating full article content into the search engine. While titles, excerpts, or even summaries may work well in many cases, there are situations where deeper, more detailed information is needed to return truly relevant answers.

What are we going to do?

The process is slightly more complex than what we’ve covered so far (in part 1 & part 2). In WordPress, article content is stored in HTML format within the post_content column. This will be our starting point, and the goal is to generate embeddings.

To achieve this, we’ll need to write a few lines of code. While this could be done directly within HeatWave using JavaScript stored procedures — as we saw in part 2: A Javascript, stored procedure & AI story — I’ll instead use the unofficial language of data: Python.
Please bear in mind that I’m not a developer, so this code is provided for illustrative purposes only. It may contain errors or limitations. Please use it at your own risk and adapt it to your specific needs (also feel free to share back).

Below are the steps we’ll follow to move forward:

  1. Defining the embeddings storage table
  2. Fetch articles from the database.
  3. Remove HTML tags and normalize whitespace.
  4. Split articles into overlapping chunks of words.
  5. Generate embeddings for a given article.
  6. Insert article chunks with their embeddings into HeatWave.

Let’s break down each step together!

I’am using HeatWave 9.2.2:

SELECT version();
+-------------+
| version()   |
+-------------+
| 9.2.2-cloud |
+-------------+

Defining the embeddings storage table

I created a new table named wp_post_chunks_embeddings_minilm to store the embeddings generated from article chunks.

  • chunk_embedding: embedding of the articles
  • chunk_text: contains the part of the articles. Mostly useful for debug
  • and some IDs,
    • chunk_id: id of each chunk
    • post_id: useful to join the wp_posts table and get article information
    • and finally, the table primary key: id

From wp_posts table, I’ll use the ID (the primary key) and post_content which contains the article in HTML format.
All these information will be used by the semantic AI-powered search engine.

SHOW CREATE TABLE wp_post_chunks_embeddings_minilm\G
*************************** 1. row ***************************
       Table: wp_post_chunks_embeddings_minilm
Create Table: CREATE TABLE `wp_post_chunks_embeddings_minilm` (
  `id` bigint unsigned NOT NULL AUTO_INCREMENT,
  `post_id` bigint unsigned DEFAULT NULL,
  `chunk_id` bigint unsigned DEFAULT NULL,
  `chunk_text` text,
  `chunk_embedding` vector(2048) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci

Fetch articles from the database

First, we need to grab a batch of published blog articles from HeatWave MySQL. Think of it as scooping out a small chunk of posts at a time from a big pool of content. Pagination is used in order to manage the memory consumption.

def fetch_articles_batch(db_config: Dict[str, Any], limit: int = 10, offset: int = 0) -> List[Dict[str, Any]]:
    """Fetch a batch of articles from the database"""

    with mysql.connector.connect(**db_config) as conn:
        with conn.cursor(dictionary=True) as cursor:
            cursor.execute("""
                SELECT ID, post_content FROM wp_posts 
                WHERE post_status = 'publish' AND post_type = 'post'
                LIMIT %s OFFSET %s
            """, (limit, offset))

            return cursor.fetchall()

Remove HTML tags and normalize whitespace.

Articles are stored in HTML format in the database. So the content must be cleaned because we need the raw text ie without all the formatting tags. With this function we:

  1. Remove all the HTML tags
  2. Replace them with spaces (so you don’t accidentally smash words together)
  3. Collapse extra whitespace
  4. Trimming off any leading or trailing spaces
def strip_html_tags_with_space(html_string: str) -> str:
    """Remove HTML tags and normalize whitespace"""

    if not html_string:
        return ""

    text = re.sub(r'<[^>]+>', ' ', html_string)
    text = re.sub(r'\s+', ' ', text)

    return text.strip()

Please also note that, depending of the context, extra processing can be done like Stemming or Lemmatization.

Split articles into overlapping chunks of words

Each article is a big block of text and it must be split it into smaller overlapping chunks. It is like slicing a loaf of bread, where each slice overlaps a bit with the one before it.

def split_text_by_words(text: str, post_id: int, chunk_size: int = 400, overlap: int = 80) -> List[Dict[str, Any]]:
    """Split text into overlapping chunks of words"""

    words = text.split()
    chunks = []
    start = 0
    chunk_id = 0

    while start < len(words):
        end = start + chunk_size
        chunk_words = words[start:end]
        chunk_text = ' '.join(chunk_words)

        char_start = len(' '.join(words[:start]))
        char_end = len(' '.join(words[:end]))

        chunks.append({
            "post_id": post_id,
            "chunk_id": chunk_id,
            "char_start": char_start,
            "char_end": char_end,
            "text": chunk_text
        })

        start += chunk_size - overlap
        chunk_id += 1

    return chunks

Chunking in RAG systems presents several challenges, including issues with chunk size (too small or too large), semantic coherence, and context understanding. These challenges can negatively impact retrieval accuracy, efficiency, and precision. Addressing them requires a carefully balanced approach to optimize chunking strategies while preserving computational efficiency and maintaining nuanced contextual understanding.

chunk_size: Defines the number of words per chunk. In this case, I’ve chosen 400 words. The ideal value can vary depending on your specific use case and content structure.

overlap: Specifies how much each chunk overlaps with the next one—here, it’s set to 20% of the chunk size. This overlap helps preserve context across chunks, improving continuity and retrieval accuracy.

Each chunk contains the id of the article (post_id), the id of the chunk (chunk_id) and the chunk itself (test). char_start & char_end are only there for debug purpose.

Generate embeddings for a given article

Now it is time to generate an embedding (aka a numerical vector) for each chunk. Think of an embedding as a way to turn text into numbers so that machines can understand and compare it.

We are using the HeatWave’s ML_EMBED_ROW stored function to generate embeddings:

def embed_row(db_config: Dict[str, Any], search_item: str, embed_options: str = '{"model_id": "minilm"}') -> str:
    """Generate an embedding for a given text using the ML_EMBED_ROW function"""

    with mysql.connector.connect(**db_config) as conn:
        with conn.cursor() as cursor:
            sql_embed_row = 'SELECT VECTOR_TO_STRING(vect) FROM (SELECT sys.ML_EMBED_ROW(%s, %s) AS vect) AS dt;'
            cursor.execute(sql_embed_row, (search_item, embed_options))
            result = cursor.fetchone()

            return "".join(result) if result else ""

The trick here in Python is obtaining the string representation of the vector, which we can do using the VECTOR_TO_STRING function.

embed_options: specifies the settings for the embedding model. Here we are using "minilm".

Insert article chunks with their embeddings into HeatWave

Now it is time to store the text chunks and their corresponding embeddings in the vector store HeatWave — specifically into the wp_post_chunks_embeddings_minilm table designed for this purpose:

def insert_chunks_into_mysql(chunks: List[Dict[str, Any]], db_config: Dict[str, Any], batch_size: int = 900):
    """Insert text chunks with embeddings into the database"""

    with mysql.connector.connect(**db_config) as conn:
        with conn.cursor() as cursor:
            insert_query = """
                INSERT INTO wp_post_chunks_embeddings_minilm (post_id, chunk_id, chunk_text, chunk_embedding)
                VALUES (%s, %s, %s, STRING_TO_VECTOR(%s))
            """
            buffer = []
            for i, chunk in enumerate(chunks, 1):
                chunk_embedding = embed_row(db_config, chunk['text'])
                buffer.append((
                    chunk['post_id'],
                    chunk['chunk_id'],
                    chunk['text'],
                    chunk_embedding
                ))

                if i % batch_size == 0:
                    cursor.executemany(insert_query, buffer)
                    conn.commit()
                    buffer.clear()

            if buffer:
                cursor.executemany(insert_query, buffer)
                conn.commit()

    print(f"Inserted {len(chunks)} chunks into database.")

batch_size: defines how many chunks are inserted into the database at once. In this case, once the buffer reaches 900 chunks, the data is flushed in bulk — a much more efficient approach than inserting each chunk individually, both in terms of performance and memory usage.

A key step here is converting the embeddings into HeatWave’s native vector type before insertion. This is done using the STRING_TO_VECTOR function.

Query Encoding and Vector Similarity Operations

Now it’s time to transform the user’s search query into a vector embedding, a numerical representation that captures its underlying meaning. This process, known as query encoding, allows us to go beyond simple keyword matching. Once the query is encoded as a vector, we can then perform a similarity search by comparing it against the precomputed vector embeddings of our documents within HeatWave. This comparison, often using metrics like cosine similarity, identifies the documents whose vector representations are most closely aligned with the query vector, thus retrieving the most semantically relevant results.

Encode the query into a vector embedding

Like we have seen in Build an AI-Powered Search Engine with HeatWave GenAI (part 1) and in Build an AI-Powered Search Engine with HeatWave GenAI (part 2), to generate a vector embedding for the query, we use the ML_EMBED_ROW routine.
This function applies the specified embedding model to encode the given text into a vector representation. The routine returns a VECTOR containing the numerical embedding of the text.

Using it is straightforward. Let’s define two variables: @searchItem (the text to encode) and @embeddOptions (the embedding model used for encoding):

SET @embeddOptions = '{"model_id": "minilm"}';
SET @searchItem = "Generative artificial intelligence";

-- Encode the query using the embedding model
SELECT sys.ML_EMBED_ROW(@searchItem, @embeddOptions) into @searchItemEmbedding;

Similarity search

To retrieve relevant blog content, we perform vector similarity calculations using the DISTANCE function. This function computes the distance between two vectors using COSINEDOT, or EUCLIDEAN distance metrics.
Here, the two vectors being compared are the encoded query (@searchItemEmbedding) and the precomputed embeddings stored in the wp_posts & wp_post_chunks_embeddings_minilm  tables.

You can now perform a cosine similarity search on the full articles using the following approach:

SQL >
-- Find the most semantically relevant WordPress posts 
SELECT 
    post_title,
    post_excerpt,
    guid,
    min_distance
FROM (
    SELECT 
        p.post_title,
        p.post_excerpt,
        p.guid,
        DISTANCE(@searchItemEmbedding, e.chunk_embedding, 'COSINE') AS min_distance,
        ROW_NUMBER() OVER (PARTITION BY p.post_title ORDER BY DISTANCE(@searchItemEmbedding, e.chunk_embedding, 'COSINE')) AS rn
    FROM wp_posts p
        INNER JOIN wp_post_chunks_embeddings_minilm e ON p.ID = e.post_id
) AS ranked
WHERE rn = 1
ORDER BY min_distance ASC
LIMIT 5\G

*************************** 1. row ***************************
  post_title: HeatWave GenAI: Sentiment Analysis Made Easy-Peasy
post_excerpt: This new AI tech, called generative AI (or GenAI), can dive deep into what people are saying and tell us if they’re feeling positive, negative, or neutral.
Let’s see how HeatWave GenAI, can help you to enhance your understanding of customer sentiment, improve decision-making, and drive business success.
        guid: https://dasini.net/blog/?p=3456
min_distance: 0.4673360586166382
*************************** 2. row ***************************
  post_title: HeatWave GenAI: Your AI-Powered Content Creation Partner
post_excerpt: Generative artificial intelligence (GenAI) is reshaping the content creation landscape. By training on vast datasets, these "intelligent" systems can produce new, human-quality content across a multitude of domains.

Oracle's HeatWave GenAI (starting with version 9.0.1) is at the forefront of this revolution, offering an integrated platform that combines in-database large language models (LLMs), vector stores, and scale-out vector processing to streamline content generation.
This article explores how HeatWave GenAI is empowering businesses to produce high-quality content rapidly and effectively, making it an indispensable tool for industries demanding speed, accuracy, and security.
        guid: https://dasini.net/blog/?p=1234
min_distance: 0.4789550304412842
*************************** 3. row ***************************
  post_title: Simplifying AI Development: A Practical Guide to HeatWave GenAI’s RAG &amp; Vector Store Features
post_excerpt: This tutorial explores HeatWave GenAI, a cloud service that simplifies interacting with unstructured data using natural language. It combines large language models, vector stores, and SQL queries to enable tasks like content generation, chatbot, and retrieval-augmented generation (RAG). The focus is on RAG and how HeatWave GenAI’s architecture helps users gain insights from their data.
        guid: https://dasini.net/blog/?p=2345
min_distance: 0.5191197395324707
*************************** 4. row ***************************
  post_title: Building an Interactive LLM Chatbot with  HeatWave Using Python
post_excerpt: AI-powered applications require robust and scalable database solutions to manage and process large amounts of data efficiently. HeatWave is an excellent choice for such applications, providing high-performance OLTP, analytics, machine learning and generative artificial intelligence capabilities.

In this article, we will explore a Python 3 script that connects to an HeatWave instance and enables users to interact with different large language models (LLMs) dynamically.
        guid: https://dasini.net/blog/?p=5678
min_distance: 0.6004981994628906
*************************** 5. row ***************************
  post_title: In-Database LLMs for Efficient Text Translation with HeatWave GenAI
post_excerpt: While HeatWave GenAI excels at generating English text, its capabilities extend to translation as well. Let's explore how we can effectively translate English content into French using this powerful tool.
        guid: https://dasini.net/blog/?p=9876
min_distance: 0.6385803818702698

This SQL query is designed to retrieve the top 5 WordPress posts that are most semantically similar to a given embedding vector (@searchItemEmbedding), using cosine distance for comparison — but with a twist: it only considers the single best-matching chunk per post.

Of course, you can also apply the techniques from part 1 and part 2 to further improve the results. Methods like reranking or incorporating additional fields—such as the article title—while fine-tuning the weights can significantly enhance result relevance.

Peroration

In this third and final installment of our series on building an AI-powered search engine with HeatWave GenAI, we’ve successfully expanded our capabilities to incorporate the full content of articles into the semantic search process.

By implementing a more intricate pipeline involving fetching, cleaning, chunking, embedding, and storing article text, we’ve aimed for a deeper level of understanding and more relevant search results. This evolution, building upon the foundations laid in Parts 1 and 2 with basic embedding and similarity searches, and refined with reranking and summary utilization, now leverages the comprehensive information within each article.

The demonstrated SQL queries and Python code illustrate how HeatWave GenAI can efficiently manage and query vector embeddings derived from large text datasets, paving the way for a highly sophisticated and context-aware search experience.

While further optimizations are always possible, this three-part journey showcases the powerful potential of HeatWave GenAI for creating intelligent, in-database search solutions.

Follow me on Linkedin

Watch my videos on my YouTube channel and subscribe.

My Slideshare account.

My Speaker Deck account.

Thanks for using HeatWave & MySQL!

1

Build an AI-Powered Search Engine with HeatWave GenAI (part 2)

April 8, 2025

In Build an AI-Powered Search Engine with HeatWave GenAI (part 1), we explored how to build an AI-powered search engine using HeatWave GenAI. We highlighted the advantages of AI-driven semantic search over traditional SQL-based methods and provided a detailed guide on generating embeddings and conducting similarity searches. These techniques enhance the retrieval of relevant articles, improving the user’s ability to find information efficiently.

In this second part, we will explore how to enhance the relevance of our answers using reranking techniques. Next, we will further refine our results by instructing the model to generate embeddings based on article summaries. All these steps will be performed within HeatWave, leveraging its capability to write Stored Programs in JavaScript.

I’am using HeatWave 9.2.1:

SELECT version();
+-------------+
| version()   |
+-------------+
| 9.2.1-cloud |
+-------------+

In the previous episode

if you remember, in example 3.1, we saw how to run a similarity search on title & excerpt (post_title_embedding & post_excerpt_embedding), using an elegant CTE:

SQL >
-- Ex 3.1 Similarity search on title & excerpt (post_title_embedding & post_excerpt_embedding)

WITH distances AS (
    SELECT
        ID, 
        post_title,
        post_excerpt,
        (
          DISTANCE(post_title_embedding, @searchItemEmbedding, 'COSINE') + 
          DISTANCE(post_excerpt_embedding, @searchItemEmbedding, 'COSINE')
        ) / 2 AS avg_distance
    FROM WP_embeddings.wp_posts_embeddings_minilm
)
SELECT *
FROM distances
ORDER BY avg_distance
LIMIT 5\G
*************************** 1. row ***************************
          ID: 1234
  post_title: HeatWave GenAI: Your AI-Powered Content Creation Partner
post_excerpt: Generative artificial intelligence (GenAI) is reshaping the content creation landscape. By training on vast datasets, these "intelligent" systems can produce new, human-quality content across a multitude of domains.

Oracle's HeatWave GenAI (starting with version 9.0.1) is at the forefront of this revolution, offering an integrated platform that combines in-database large language models (LLMs), vector stores, and scale-out vector processing to streamline content generation.
This article explores how HeatWave GenAI is empowering businesses to produce high-quality content rapidly and effectively, making it an indispensable tool for industries demanding speed, accuracy, and security.
avg_distance: 0.5131600499153137
...

This result can be further improved using reranking techniques.
Reranking involves reordering or refining an initial set of retrieved documents to enhance their relevance to a user’s query. This step is essential for optimizing search quality, often leading to a significant boost in the relevance of the retrieved information.

Reranking

Here we setting up the same context (ie embedding model, natural language query encoding and vector similarity operations) than in part 1:

SQL > 
-- Set variables
SET @embeddOptions = '{"model_id": "minilm"}';
SET @searchItem = "Generative artificial intelligence";

-- Encode the query using the embedding model
SELECT sys.ML_EMBED_ROW(@searchItem, @embeddOptions) into @searchItemEmbedding;

The retrieved results is now sorted using weights on title and excerpt distances:

-- Ex 3.2 Similarity search on title & excerpt (post_title_embedding & post_excerpt_embedding) with weights

WITH initial_results AS (
    SELECT
        ID, 
        post_title,
        post_excerpt,
        DISTANCE(post_title_embedding, @searchItemEmbedding, 'COSINE') AS title_distance, 
        DISTANCE(post_excerpt_embedding, @searchItemEmbedding, 'COSINE') AS excerpt_distance, 
    guid 
    FROM WP_embeddings.wp_posts_embeddings_minilm 
    ORDER BY title_distance + excerpt_distance  -- Simple combination
    LIMIT 15 -- Retrieve a larger initial set
),
reranked_results AS (
    SELECT
        ID,
        post_title,
        post_excerpt,
        (0.3 * title_distance + 0.7 * excerpt_distance) AS combined_distance,   -- Weighted combination
    guid 
    FROM initial_results
)
SELECT post_title, post_excerpt, combined_distance, guid
FROM reranked_results
ORDER BY combined_distance ASC
LIMIT 5\G

*************************** 1. row ***************************
       post_title: HeatWave GenAI: Sentiment Analysis Made Easy-Peasy
     post_excerpt: This new AI tech, called generative AI (or GenAI), can dive deep into what people are saying and tell us if they’re feeling positive, negative, or neutral.
Let’s see how HeatWave GenAI, can help you to enhance your understanding of customer sentiment, improve decision-making, and drive business success.
combined_distance: 0.49683985114097595
             guid: https://dasini.net/blog/?p=3456
*************************** 2. row ***************************
       post_title: HeatWave GenAI: Your AI-Powered Content Creation Partner
     post_excerpt: Generative artificial intelligence (GenAI) is reshaping the content creation landscape. By training on vast datasets, these "intelligent" systems can produce new, human-quality content across a multitude of domains.

Oracle's HeatWave GenAI (starting with version 9.0.1) is at the forefront of this revolution, offering an integrated platform that combines in-database large language models (LLMs), vector stores, and scale-out vector processing to streamline content generation.
This article explores how HeatWave GenAI is empowering businesses to produce high-quality content rapidly and effectively, making it an indispensable tool for industries demanding speed, accuracy, and security.
combined_distance: 0.4994780898094177
             guid: https://dasini.net/blog/?p=1234
*************************** 3. row ***************************
       post_title: Simplifying AI Development: A Practical Guide to HeatWave GenAI’s RAG &amp; Vector Store Features
     post_excerpt: This tutorial explores HeatWave GenAI, a cloud service that simplifies interacting with unstructured data using natural language. It combines large language models, vector stores, and SQL queries to enable tasks like content generation, chatbot, and retrieval-augmented generation (RAG). The focus is on RAG and how HeatWave GenAI’s architecture helps users gain insights from their data.
combined_distance: 0.6582363367080688
             guid: https://dasini.net/blog/?p=2345
*************************** 4. row ***************************
       post_title: Webinar - Apprentissage automatique avec MySQL HeatWave
     post_excerpt: HeatWave Machine Learning (ML) inclut tout ce dont les utilisateurs ont besoin pour créer, former, déployer et expliquer des modèles d’apprentissage automatique dans MySQL HeatWave, sans coût supplémentaire.

Dans ce webinaire vous apprendrez...
combined_distance: 0.694593733549118
             guid: https://dasini.net/blog/?p=6789
*************************** 5. row ***************************
       post_title: Building an Interactive LLM Chatbot with  HeatWave Using Python
     post_excerpt: AI-powered applications require robust and scalable database solutions to manage and process large amounts of data efficiently. HeatWave is an excellent choice for such applications, providing high-performance OLTP, analytics, machine learning and generative artificial intelligence capabilities.

In this article, we will explore a Python 3 script that connects to an HeatWave instance and enables users to interact with different large language models (LLMs) dynamically.
combined_distance: 0.7135995388031006
             guid: https://dasini.net/blog/?p=5678

In the example above, the excerpt is given more than twice the weight of the title (0.7 vs 0.3), based on the assumption that the excerpt holds more relevant information in this context.
Depending on your use case, you may want to fine-tune these weights, as adjusting them can significantly improve the quality of search results.

Now, let’s explore how we can further refine our results by leveraging article summaries along with JavaScript-based stored procedures and functions.

A Javascript, stored procedure & AI story

When searching through articles, relying solely on the title and excerpt may not yield the most relevant results.
What if we used a summary instead?
This approach can strike an excellent balance between relevance and implementation simplicity. With HeatWave GenAI, the entire workflow can be handled directly within the database using SQL and JavaScript-based stored procedures.

In WordPress, article content is stored as HTML in the post_content column. To make it suitable for processing by a large language model (LLM), this content must first be sanitized — that is, all HTML tags need to be removed, as they are not meaningful to the LLM.

Table preparation

As a preparation step, I added a new column (post_content_text longtext NOT NULL) to the wp_posts_embeddings_minilm table that we have used in part 1.

ALTER TABLE wp_posts_embeddings_minilm ADD COLUMN post_content_text longtext NOT NULL;

The structure of the table is now:

CREATE TABLE `wp_posts_embeddings_minilm` (
  `ID` bigint unsigned NOT NULL AUTO_INCREMENT,
  `post_title` text NOT NULL,
  `post_excerpt` text NOT NULL,
  `guid` varchar(255) NOT NULL DEFAULT '',
  `post_title_embedding` vector(2048) NOT NULL COMMENT 'GENAI_OPTIONS=EMBED_MODEL_ID=minilm',
  `post_excerpt_embedding` vector(2048) NOT NULL COMMENT 'GENAI_OPTIONS=EMBED_MODEL_ID=minilm',
  `post_content_text` longtext NOT NULL,
  PRIMARY KEY (`ID`),
  KEY `post_title` (`post_title`(255)),
  KEY `post_excerpt` (`post_excerpt`(255)),
  KEY `guid` (`guid`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci

This new column — post_content_text — will be populated with the sanitized content of post_content.

For example, original HTML in post_content:

<!-- wp:paragraph -->
<p>HeatWave GenAI brings LLMs directly into your database, enabling powerful AI capabilities and natural language processing.</p>
<!-- /wp:paragraph -->

Should be stored in post_content_text as:

HeatWave GenAI brings LLMs directly into your database, enabling powerful AI capabilities and natural language processing

Strip HTML tags with a javascript stored routine

This could be done in Javascript, inside HeatWave, as a stored program. Isn’t it magnificent?
HeatWave supports stored routines written in JavaScript, since version 9.0.

A simple implementation could be the following:

CREATE FUNCTION sp.stripHtmlTags(htmlString LONGTEXT) RETURNS LONGTEXT NO SQL LANGUAGE JAVASCRIPT AS
$$
    if (!htmlString) {
        return "";
    }

    // Replace HTML tags with a space
    return htmlString
        .replace(/<[^>]+>/g, " ")  // Replace all tags with a space
        .replace(/\s+/g, " ")      // Replace multiple spaces with a single space
        .trim(); 
$$
;

Please bear in mind that I’m not a developer, so this code is provided for illustrative purposes only. It may contain errors or limitations. Please use it at your own risk and adapt it to your specific needs (also feel free to share back).

Let’s see if it works:

SELECT sp.stripHtmlTags('<!-- wp:paragraph --><p>HeatWave GenAI brings LLMs directly into your database, enabling powerful AI capabilities and natural language processing.</p><!-- /wp:paragraph -->') ;
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| sp.stripHtmlTags('<!-- wp:paragraph --><p>HeatWave GenAI brings LLMs directly into your database, enabling powerful AI capabilities and natural language processing.</p><!-- /wp:paragraph -->') |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| HeatWave GenAI brings LLMs directly into your database, enabling powerful AI capabilities and natural language processing.                                                                       |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Looks like this Javascript stored function is doing the job.

We can now uses it to sanitized all the articles.

UPDATE wp_posts_embeddings_minilm 
    INNER JOIN wp_posts USING (ID) 
    SET post_content_text = sp.stripHtmlTags(post_content) 
    WHERE post_status = 'publish' 
       AND post_type = 'post' ; 

Now all the published posts have a sanitized text only version in the database. We can now uses it to generate article summaries.

Generating summaries with the JavaScript HeatWave GenAI API

HeatWave offers a JavaScript API that enables seamless integration with HeatWave GenAI, allowing you to perform natural language searches powered by LLMs. You can find more details in the official documentation.

To generate summaries for all my articles, I’ll create a stored procedure using the ml.generate method. This method supports two modes: single invocation and batch processing. While single invocation is ideal for handling new articles individually, we’ll focus on the batch mode here to process all existing articles efficiently:

SQL > 
CREATE PROCEDURE sp.summarizePostBatch() LANGUAGE JAVASCRIPT AS
$$
    let schema = session.getSchema("wordpress");
    let table = schema.getTable("wp_posts_embeddings_minilm");

    ml.generate(table, "post_content_text", "post_summary_json", {model_id: "mistral-7b-instruct-v1", task: "summarization"});
$$
;

ml.generate loads the model (mistral-7b-instruct-v1), generates a response (article summary) inside the post_summary_json column (automatically created), based on the prompt (article) from the post_content_text column and returns the response (a summary).

Then call the stored procedure:

CALL sp.summarizePostBatch();

 Query OK, 0 rows affected (1 hour 2 min 16.3605 sec)

The process successfully created a new JSON column named post_summary_json:

SQL > SHOW CREATE TABLE wp_posts_embeddings_minilm;

CREATE TABLE `wp_posts_embeddings_minilm` (
  `ID` bigint unsigned NOT NULL AUTO_INCREMENT,
  `post_title` text NOT NULL,
  `post_excerpt` text NOT NULL,
  `guid` varchar(255) NOT NULL DEFAULT '',
  `post_title_embedding` vector(2048) NOT NULL COMMENT 'GENAI_OPTIONS=EMBED_MODEL_ID=minilm',
  `post_excerpt_embedding` vector(2048) NOT NULL COMMENT 'GENAI_OPTIONS=EMBED_MODEL_ID=minilm',
  `post_content_text` longtext NOT NULL,
  `post_summary_json` json DEFAULT NULL,
  PRIMARY KEY (`ID`),
  KEY `post_title` (`post_title`(255)),
  KEY `post_excerpt` (`post_excerpt`(255)),
  KEY `guid` (`guid`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci

Using the HeatWave MySQL JSON functions, we can check the content of this new column:

SELECT LEFT(post_summary_json->>"$.text", 100) AS json FROM wp_posts_embeddings_minilm WHERE ID = 1234\G
*************************** 1. row ***************************
json: Generative artificial intelligence (GenAI) is reshaping the content creation landscape. By training 

Now let’s see how to create embeddings for all the summaries

Create the embeddings

We have seen in Build an AI-Powered Search Engine with HeatWave GenAI (part 1) how to create embeddings with the minilm model, using sys.ML_EMBED_TABLE routine.
We’ll follow the same process, but this time using JavaScript stored routines. Specifically, we’ll use the ml.embed method.

Before implementing the routine, we need to do some modification on the table. The reason is that in HeatWave 9.2.1, the table columns must be in one of the following format: varchar, tinytext, text, mediumtext and longtext. If it is not the case you’ll trigger the ERROR: 1644 (45000): ML006093: Type of 'post_summary_json' must be in ["varchar", "tinytext", "text", "mediumtext", "longtext"]
Quite explicit though!

We’ll transfer the content of post_summary_json into a text column named post_summary:

SQL >
ALTER TABLE wp_posts_embeddings_minilm ADD COLUMN post_summary text;

UPDATE wp_posts_embeddings_minilm SET post_summary =  post_summary_json->>"$.text"; 

And finally to avoid error ERROR: 1644 (45000): ML006093, we must drop post_summary_json:

ALTER TABLE wp_posts_embeddings_minilm DROP COLUMN post_summary_json;

ml.embed  also supports two variants, one for a single invocation, and one for batch processing.

Single invocation of ml.embed  

The goal is to create a stored procedure that encodes a summarized article into a vector embedding using the minilm embedding model:

SQL >
CREATE PROCEDURE sp.createEmbeddings(IN text2embed LONGTEXT, OUT vect VECTOR) LANGUAGE JAVASCRIPT AS
$$
    let embedding = ml.embed(text2embed, {model_id: "minilm"});
    vect = embedding;
$$
;

The input parameter is the summarized article:

SQL >
CALL sp.createEmbeddings(" This article explores the integration of HeatWave GenAI with Python for AI-driven applications. It provides a step-by-step guide to building a simple chatbot system that interacts with HeatWave using its in-database LLMs and external APIs from OCI Generative AI Service. The script demonstrates how to establish a connection to the HeatWave MySQL database, load and manage multiple LLMs within HeatWave, allow users to select their preferred model dynamically, facilitate chatbot interactions using HeatWave Chat, and retrieve and manage chat options. The article also discusses the benefits of using HeatWave GenAI for AI-driven applications, including its modular and scalable design, dynamic LLM selection, and powerful capabilities in AI.", @myVect);

The output parameter — @myVect — contains the embedding:

SQL >
SELECT @myVect\G
*************************** 1. row ***************************
@myVect: ?RK?p?????!<Q?c?
?????h?[???;??qV@=?mW???F?_?M=??<??$=EN&??Xv?6?`=E????P?V^v????=Dw?<Q?f?.?x?
                                                                            ?N?????<?1=
                                                                                        P-???=??O=
                                                                                                  ?A=21?=Ez????@??1j??*]?T??K????o?V?Ž?l
??+??=W??=3??????                                                                                                                       >?<iyx<?????<j????<???;z?_?H<??,=?'@?j?<N??=???M?C?`Wb?Qr?U?c???V<DmT<{N6=
                ????T??$?<?u
                            ?v$?=g?=?k???D?"??6?\??

A more human representation (or not) is possible using FROM_VECTOR:

SQL >
SELECT FROM_VECTOR(@myVect)\G
*************************** 1. row ***************************
FROM_VECTOR(@myVect): [-4.96396e-02,-8.32623e-02,9.86042e-03,-1.38911e-02,-1.98374e-02,-5.67944e-02,-3.55247e-02,2.43442e-03,-7.37070e-03,4.69574e-02,-5.25947e-02,-4.85626e-02,5.02704e-02,9.01929e-03,4.01000e-02,-2.53762e-03,-1.50358e-02,5.48794e-02,-3.82407e-02,-5.09363e-02,-6.01486e-02,6.62680e-02,1.97102e-02,-3.52313e-03,-6.07311e-02,-5.04761e-02,-2.28147e-02,2.70393e-02,3.71569e-02,-2.64454e-03,1.19642e-01,5.07343e-02,4.72823e-02,6.94298e-02,-1.87351e-02,-1.17510e-02,-5.71763e-02,-5.39956e-02,-1.96540e-02,-2.69486e-02,-3.46894e-02,-3.81926e-02,-9.65525e-02,3.40393e-02,2.23684e-02,1.51657e-02,-1.00843e-01,2.82697e-02,-2.81484e-02,2.06150e-02,-8.44855e-02,-9.88659e-02,1.22386e-02,4.21836e-02,-2.93204e-03,7.91035e-03,7.20659e-02,-8.70026e-03,-4.77326e-02,-1.38148e-02,-2.58113e-02,-4.66768e-03,-9.27910e-03,1.30998e-02,1.29655e-02,4.45084e-02,-2.69566e-02,7.25501e-02,7.15968e-02,-1.46195e-01,-6.30012e-02,-7.91235e-02,-5.19264e-02,2.40655e-02,-3.42919e-02,7.72180e-02,7.22950e-02,-7.19829e-02,-7.17984e-03,-2.89341e-02,-1.34624e-02,-4.61435e-02,1.58105e-02,7.25958e-02,-2.99175e-02,-7.80187e-02,-5.74389e-03,-2.57604e-02,7.18643e-02,4.47061e-02,2.45664e-02,-1.13831e-02,4.58998e-02,-1.19727e-02,5.77563e-02,9.52516e-02,3.84456e-02,-3.92485e-02,-3.90266e-02,2.87119e-02,-7.21781e-02,4.85492e-02,-8.24129e-02,-2.01809e-02,-1.77014e-02,6.55753e-02,-4.22585e-02,-6.67007e-02,6.90129e-02,-2.04837e-02,-5.49840e-02,-8.76547e-03,4.42857e-02,5.63733e-02,7.53812e-02,-4.63425e-02,1.82254e-02,1.18100e-02,-7.03449e-02,7.29235e-02,2.41146e-02,-3.35154e-02,-3.56129e-02,7.28611e-02,6.86842e-02,-5.54628e-02,-1.93234e-02,-1.89223e-03,9.01134e-02,6.44596e-02,2.64851e-02,9.06968e-02,4.06748e-02,5.16320e-02,3.33822e-02,5.54104e-02,-3.21358e-02,-6.66219e-02,2.00648e-02,1.09276e-01,-6.25206e-02,4.89027e-02,-4.17487e-02,-5.67389e-02,7.44691e-03,-4.42884e-02,3.18385e-02,-1.98049e-02,-7.42827e-02,-6.14671e-02,7.44849e-02,3.02982e-02,2.98403e-02,7.55829e-02,1.20768e-01,1.26906e-02,1.04550e-01,6.59261e-02,1.30538e-02,-1.39226e-02,-3.02682e-02,7.50245e-02,1.57195e-02,-2.63368e-03,-9.18113e-02,-6.67840e-02,-2.06254e-02,7.86020e-02,-2.10694e-02,1.30604e-03,5.31497e-02,5.42747e-02,-1.07493e-01,-4.50561e-02,-1.25854e-01,-2.63633e-02,2.99242e-02,-2.35737e-02,-1.81825e-02,2.76637e-02,-4.12063e-02,4.46310e-02,-1.92102e-02,-8.92773e-02,4.45359e-02,-1.11110e-02,-5.22624e-02,-2.45076e-02,-1.96262e-02,-1.98535e-02,8.15337e-02,7.87625e-02,5.73686e-02,1.47477e-02,6.91806e-03,1.53566e-02,7.33980e-03,1.07936e-02,1.51880e-03,2.81434e-02,-6.74229e-02,2.64910e-02,-1.99371e-02,5.59245e-02,2.73276e-02,-1.19287e-01,8.55642e-03,6.91925e-02,1.21589e-02,-4.05408e-02,-7.55252e-02,-3.61071e-02,2.72453e-02,-8.27803e-02,3.18795e-02,-5.05173e-03,-1.99760e-02,3.63341e-02,-6.50054e-02,-8.56357e-03,5.15133e-02,6.26782e-02,-4.57907e-02,-1.35400e-33,-3.12649e-02,1.42193e-02,-6.76464e-02,2.92021e-02,-5.87203e-02,-1.18268e-01,6.15763e-02,4.44978e-02,4.98419e-02,-1.72100e-02,-1.13092e-01,6.02123e-02,1.55891e-02,-5.68059e-02,-1.35132e-02,-4.19842e-02,-2.60558e-02,2.80308e-03,1.32598e-01,1.07273e-01,-3.47401e-02,4.75040e-02,-7.63112e-02,-2.59081e-02,2.11842e-02,2.25822e-02,-1.98663e-02,-1.04105e-02,-6.49599e-02,1.09372e-03,-5.17167e-02,1.38617e-02,-6.26832e-02,1.26345e-02,-1.71078e-02,5.77133e-02,-5.38819e-03,-6.03902e-02,1.65896e-02,-8.53037e-02,6.79781e-02,-2.80634e-02,-8.47060e-02,5.93425e-02,-1.46708e-02,4.85509e-02,-1.12832e-01,6.27912e-03,-6.16615e-03,-2.94036e-02,3.11167e-02,2.30855e-03,6.65979e-02,-6.83186e-02,4.35274e-02,-1.24276e-01,1.87970e-02,-3.84284e-02,-1.53174e-02,-3.97068e-02,1.07243e-02,-3.07626e-02,2.31247e-02,-3.29633e-02,-6.20930e-02,1.08394e-01,7.25451e-02,7.60207e-02,-4.54095e-02,-8.78456e-02,4.84518e-02,-4.07884e-02,7.50114e-02,1.14496e-02,2.41463e-02,5.81300e-02,-1.30148e-02,1.39014e-02,-1.40898e-02,-1.72861e-02,-3.93898e-02,-2.61493e-02,2.51394e-02,-2.53317e-02,-9.82916e-04,-2.21086e-02,5.22690e-02,3.53555e-02,6.22829e-03,3.66027e-02,2.17727e-03,6.67700e-02,-2.32061e-02,-1.69419e-03,-9.06145e-03,6.34304e-32,-1.78320e-02,-2.15657e-02,-6.16218e-03,5.44959e-02,1.88435e-02,-1.29032e-02,-4.80581e-02,4.21709e-02,1.50341e-02,4.37988e-02,8.35770e-02,-8.84255e-02,7.72715e-02,5.24237e-02,-3.56883e-02,-8.12593e-03,-3.41413e-02,-1.69388e-02,3.47573e-02,-1.21286e-01,3.22201e-02,3.63997e-02,-2.33719e-02,-2.07887e-02,-8.05931e-02,-7.69887e-02,-4.39181e-02,-9.76368e-03,-1.48867e-03,3.44279e-02,-3.62778e-02,-7.50277e-03,4.86021e-02,5.18756e-02,1.85040e-02,-2.77495e-03,-3.53111e-02,5.16714e-02,-2.28214e-02,-6.10510e-02,2.86529e-02,1.37693e-01,8.80472e-03,-5.79396e-02,3.57790e-02,1.58254e-03,3.59281e-02,-9.29721e-02,1.09798e-02,1.83488e-02,-1.42538e-03,-3.40717e-02,3.98733e-02,-1.05952e-02,3.72981e-02,-5.85627e-02,3.13814e-02,-5.80935e-02,-3.15143e-03,6.94674e-02,-6.51819e-03,6.39965e-02,3.89966e-02,-7.56700e-02]

Batch processing of ml.embed  

To generate embeddings for all current posts at once, batch processing is the most efficient approach. Let’s create a new stored procedure in JavaScript:

SQL >
CREATE PROCEDURE sp.createEmbeddingsBatch() LANGUAGE JAVASCRIPT AS
$$
    let schema = session.getSchema("wordpress");
    let table = schema.getTable("wp_posts_embeddings_minilm");

    ml.embed(table, "post_summary", "post_summary_embedding", {model_id: "minilm"});
$$
;

ml.embed loads the model (minilm), generates an embedding inside the post_summary_embedding column (automatically created), based on the post summaries from the post_summary column.

Then run the stored procedure:

SQL > 
CALL sp.createEmbeddingsBatch();
Query OK, 0 rows affected (32.9131 sec)

The process successfully created a new VECTOR column named post_summary_embedding:

SQL > SHOW CREATE TABLE wp_posts_embeddings_minilm;

CREATE TABLE `wp_posts_embeddings_minilm` (
  `ID` bigint unsigned NOT NULL AUTO_INCREMENT,
  `post_title` text NOT NULL,
  `post_excerpt` text NOT NULL,
  `guid` varchar(255) NOT NULL DEFAULT '',
  `post_title_embedding` vector(2048) NOT NULL COMMENT 'GENAI_OPTIONS=EMBED_MODEL_ID=minilm',
  `post_excerpt_embedding` vector(2048) NOT NULL COMMENT 'GENAI_OPTIONS=EMBED_MODEL_ID=minilm',
  `post_content_text` longtext NOT NULL,
  `post_summary` text,
  `post_summary_embedding` vector(2048) NOT NULL COMMENT 'GENAI_OPTIONS=EMBED_MODEL_ID=minilm',
  PRIMARY KEY (`ID`),
  KEY `post_title` (`post_title`(255)),
  KEY `post_excerpt` (`post_excerpt`(255)),
  KEY `guid` (`guid`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci

Query Encoding and Vector Similarity Operations

The penultimate step is to convert the user’s query into a vector representation that captures its semantic meaning. This process, known as query encoding, transforms text into numerical embeddings. Once encoded, we will be able to perform a similarity search to find the most relevant results by comparing the query’s embedding with precomputed vectors from HeatWave.

Encode the query into a vector embedding

Like we have seen in part1, to generate a vector embedding for the query, we use the ML_EMBED_ROW routine. This function applies the specified embedding model to encode the given text into a vector representation. The routine returns a VECTOR containing the numerical embedding of the text.

Using it is straightforward. Let’s define two variables: @searchItem (the text to encode) and @embeddOptions (the embedding model used for encoding):

SQL >
-- Set variables
SET @embeddOptions = '{"model_id": "minilm"}';
SET @searchItem = "Generative artificial intelligence";

-- Encode the query using the embedding model
SELECT sys.ML_EMBED_ROW(@searchItem, @embeddOptions) into @searchItemEmbedding;

You can print the vector using the following query;

SELECT FROM_VECTOR(vect) 
FROM (
      SELECT sys.ML_EMBED_ROW(@searchItem, @embeddOptions) AS vect
     ) AS dt;
[-6.21515e-02,1.61460e-02,1.25987e-02,-1.98096e-02,... (truncated)

Similarity search

To retrieve relevant blog content, we perform vector similarity calculations using the DISTANCE function. This function computes the distance between two vectors using COSINE, DOT, or EUCLIDEAN distance metrics.
Here, the two vectors being compared are the encoded query (@searchItemEmbedding) and the precomputed embeddings stored in the wp_posts_embeddings_minilm table (post_summary_embedding)

A cosine similarity search for article summaries can be conducted using:

-- Ex 4.1. Similarity search only on the post summaries (post_summary_embedding)

WITH distances AS (
    SELECT
        ID, 
        post_title,
        post_excerpt,
        DISTANCE(@searchItemEmbedding, post_summary_embedding, 'COSINE') AS min_distance
    FROM WP_embeddings.wp_posts_embeddings_minilm
)
SELECT *
FROM distances
ORDER BY min_distance
LIMIT 5\G

A cosine similarity search for article titles, excerpts and summaries can be conducted using:

-- Ex 5.1 Similarity search on title, excerpt & post summary (post_title_embedding, post_excerpt_embedding & post_summary_embedding)

WITH distances AS (
    SELECT
        post_title,
        post_excerpt,
        (
          DISTANCE(post_title_embedding, @searchItemEmbedding, 'COSINE') + 
          DISTANCE(post_excerpt_embedding, @searchItemEmbedding, 'COSINE') + 
          DISTANCE(post_summary_embedding, @searchItemEmbedding, 'COSINE')
        ) / 3 AS avg_distance,
        guid
    FROM WP_embeddings.wp_posts_embeddings_minilm
)
SELECT *
FROM distances
ORDER BY avg_distance
LIMIT 5\G

Finally, you can (try to) improve the results using a reranking technique:

-- Ex 5.2 Weighted Similarity search on title, excerpt & post summary (post_title_embedding, post_excerpt_embedding & post_summary_embedding)

WITH initial_results AS (
    SELECT
        post_title,
        post_excerpt,
        DISTANCE(post_title_embedding, @searchItemEmbedding, 'COSINE') AS title_distance, 
        DISTANCE(post_excerpt_embedding, @searchItemEmbedding, 'COSINE') AS excerpt_distance,
        DISTANCE(post_summary_embedding,  @searchItemEmbedding, 'COSINE') AS summary_distance, 
        guid
    FROM WP_embeddings.wp_posts_embeddings_minilm 
    ORDER BY title_distance + excerpt_distance + summary_distance ASC -- Simple combination
    LIMIT 15 -- Retrieve a larger initial set
),
reranked_results AS (
    SELECT
        ID,
        post_title,
        post_excerpt,
        (0.2 * title_distance + 0.3 * excerpt_distance + 0.5 * summary_distance) AS combined_distance,  -- Weighted combination
        guid
    FROM initial_results
)
SELECT post_title, post_excerpt, combined_distance, guid
FROM reranked_results
ORDER BY combined_distance ASC
LIMIT 5\G

*************************** 1. row ***************************
       post_title: HeatWave GenAI: Your AI-Powered Content Creation Partner
     post_excerpt: Generative artificial intelligence (GenAI) is reshaping the content creation landscape. By training on vast datasets, these "intelligent" systems can produce new, human-quality content across a multitude of domains.

Oracle's HeatWave GenAI (starting with version 9.0.1) is at the forefront of this revolution, offering an integrated platform that combines in-database large language models (LLMs), vector stores, and scale-out vector processing to streamline content generation.
This article explores how HeatWave GenAI is empowering businesses to produce high-quality content rapidly and effectively, making it an indispensable tool for industries demanding speed, accuracy, and security.
combined_distance: 0.5093500733375549
             guid: https://dasini.net/blog/?p=1234
*************************** 2. row ***************************
       post_title: Simplifying AI Development: A Practical Guide to HeatWave GenAI’s RAG &amp; Vector Store Features
     post_excerpt: This tutorial explores HeatWave GenAI, a cloud service that simplifies interacting with unstructured data using natural language. It combines large language models, vector stores, and SQL queries to enable tasks like content generation, chatbot, and retrieval-augmented generation (RAG). The focus is on RAG and how HeatWave GenAI’s architecture helps users gain insights from their data.
combined_distance: 0.637738311290741
             guid: https://dasini.net/blog/?p=2345
*************************** 3. row ***************************
       post_title: HeatWave GenAI: Sentiment Analysis Made Easy-Peasy
     post_excerpt: This new AI tech, called generative AI (or GenAI), can dive deep into what people are saying and tell us if they’re feeling positive, negative, or neutral.
Let’s see how HeatWave GenAI, can help you to enhance your understanding of customer sentiment, improve decision-making, and drive business success.
combined_distance: 0.6417026937007904
             guid: https://dasini.net/blog/?p=3456
*************************** 4. row ***************************
       post_title: Building an Interactive LLM Chatbot with  HeatWave Using Python
     post_excerpt: AI-powered applications require robust and scalable database solutions to manage and process large amounts of data efficiently. HeatWave is an excellent choice for such applications, providing high-performance OLTP, analytics, machine learning and generative artificial intelligence capabilities.

In this article, we will explore a Python 3 script that connects to an HeatWave instance and enables users to interact with different large language models (LLMs) dynamically.
combined_distance: 0.6545232772827148
             guid: https://dasini.net/blog/?p=5678
*************************** 5. row ***************************
       post_title: Webinar - Apprentissage automatique avec MySQL HeatWave
     post_excerpt: HeatWave Machine Learning (ML) inclut tout ce dont les utilisateurs ont besoin pour créer, former, déployer et expliquer des modèles d’apprentissage automatique dans MySQL HeatWave, sans coût supplémentaire.

Dans ce webinaire vous apprendrez...
combined_distance: 0.7031511843204499
             guid: https://dasini.net/blog/?p=6789

Peroration

In this second part of our journey into building an AI-powered search engine with HeatWave GenAI, we explored advanced techniques to refine search relevance. By incorporating reranking strategies and leveraging article summaries for embedding generation, we significantly improved the quality of retrieved results.

Furthermore, we demonstrated how to harness HeatWave’s support for JavaScript-based stored programs to sanitize content, generate summaries, and compute embeddings — all within the database. This seamless integration of AI-powered search within HeatWave showcases its potential for efficient, scalable, and intelligent information retrieval.

With these enhancements, our search engine is now more capable of understanding and delivering relevant results, although we still can further optimize performance by experimenting with different embedding models, or even integrating additional AI-driven ranking techniques.

Stay tuned for more insights!

To be continued…

Follow me on Linkedin

Watch my videos on my YouTube channel and subscribe.

My Slideshare account.

My Speaker Deck account.

Thanks for using HeatWave & MySQL!

2