Building an AI Vision Search Engine with MySQL HeatWave GenAI

May 14, 2026

Modern AI systems increasingly rely on multimodal data: text, images, documents, audio, and video. Among these modalities, image understanding has become one of the most important capabilities for AI-powered applications.

Organizations now expect systems to:

  • Search images using natural language
  • Compare visually similar assets
  • Classify and enrich image catalogs
  • Validate AI-generated descriptions
  • Build semantic retrieval pipelines

Traditionally, implementing these capabilities required specialized computer vision infrastructure, external vector databases, custom ML pipelines, and multiple frameworks.

With MySQL HeatWave GenAI, many of these capabilities can now be implemented directly inside SQL workflows using built-in AI routines such as sys.ML_GENERATE, sys.ML_EMBED_ROW, and the DISTANCE function.

In this article, we will build the foundations of a Vision Model Evaluation Assistant using MySQL HeatWave GenAI.

The objective is not to create another image classifier, but rather a semantic image understanding platform capable of:

  • Generating image descriptions
  • Producing semantic embeddings
  • Performing text-to-image search
  • Performing semantic reverse image search
  • Evaluating image understanding quality

This article intentionally focuses on the database layer and the SQL logic powering the system.

Landscape of Vision-to-Embedding Approaches

Before building the solution, it is important to understand the major approaches currently used for AI-powered image understanding systems.

Let’s look at three paradigms for representing images in AI systems.

1. Direct Image Embeddings

The first approach consists of converting raw image pixels directly into vectors.

Popular architectures include CLIP-style dual encoders, Vision Transformers (ViT), and convolutional networks such as ResNet.

In this approach:

  • The image is fed directly into a neural network
  • The network produces a dense vector embedding
  • Similarity search is performed directly on image vectors

Workflow

Image -> Image embedding model -> Dense vector -> Vector similarity search

Strengths

  • Extremely fast similarity search
  • Excellent for image-to-image matching
  • End-to-end architecture
  • No intermediate textual representation required

Limitations

These embeddings are often difficult to interpret:

  • No human-readable explanation
  • Harder to debug
  • Harder to refine
  • Black-box behavior

If a search result is incorrect, it is difficult to understand why the model failed.

To summarize

Pros

  • High-performance retrieval
  • Strong visual similarity matching
  • Can compare image-to-image directly

Cons

  • Requires specialized model deployment
  • Less explainable
  • Harder to integrate into SQL-centric workflows

2. Semantic Metadata Embeddings

This is the approach used in this article.

Instead of embedding raw pixels directly, we first generate a textual understanding of the image.

The workflow becomes: Image -> Generated text (description and keywords) -> Text embedding -> Vector similarity search.

This pipeline is composed of two distinct stages:

Step 1: Image -> Text

A Vision Language Model (VLM) analyzes the image and generates:

  • A description and/or keywords
  • Contextual understanding

Example:

{
  "description": "A red motorcycle parked on a city street near buildings during daytime.",
  "keywords": ["motorcycle", "street", "urban", "vehicle", "city"]
}

Step 2: Text -> Embedding

The generated text is then converted into a vector embedding using a text embedding model.

This creates semantic vectors representing the meaning of the image.

Strengths

This approach offers several advantages:

Human-readable and explainable

Unlike direct image embeddings, the generated metadata is understandable by humans.

If a search fails, developers can inspect:

  • The generated description
  • The extracted keywords
  • The semantic interpretation

This creates a powerful audit trail.

SQL-native workflow

Because the representation becomes textual:

  • Standard embedding models can be used
  • VECTOR columns integrate naturally
  • SQL workflows remain simple

This aligns perfectly with MySQL HeatWave GenAI capabilities.

Trade-offs

The quality of retrieval depends heavily on:

  • Caption quality
  • Prompt engineering
  • Vision model performance

Low-level visual details may also be lost:

  • Texture
  • Exact shape
  • Pixel-specific patterns

To summarize

Pros

  • Human-readable
  • Easy to debug
  • Works well with SQL systems
  • Easier operationalization

Cons

  • Less precise for purely visual similarity
  • Depends on generated metadata quality

3. Multimodal Fusion Models

A third category consists of multimodal architectures capable of jointly processing:

  • Images
  • Text
  • Audio
  • Video
  • Structured data

These systems use:

  • Early fusion
  • Intermediate fusion
  • Late fusion
  • Hybrid fusion architectures

Their objective is to create unified multimodal embeddings.

Strengths

  • Very powerful semantic understanding
  • Strong cross-modal reasoning
  • Rich contextual interpretation

Challenges

These architectures are often:

  • More complex
  • Harder to operationalize
  • Less transparent
  • Difficult to integrate into SQL-centric architectures

Why We Chose the Image -> Text -> Embedding Approach

In this article, we intentionally choose the Semantic Metadata Embeddings pipeline because it aligns perfectly with MySQL HeatWave GenAI capabilities.

Specifically:

  • sys.ML_GENERATE handles the image -> text stage
  • sys.ML_EMBED_ROW handles the text -> embedding stage
  • The DISTANCE function handles vector similarity search

This approach also creates a human-readable semantic audit trail.

If retrieval quality degrades, architects and developers can directly inspect the generated descriptions to understand:

  • What the model detected
  • What it missed
  • Why semantic similarity failed

For AI evaluation systems, explainability is often just as important as accuracy.

Overview of MySQL HeatWave GenAI

MySQL HeatWave is a fully managed cloud service that combines:

  • Transactional processing (OLTP)
  • Analytics (OLAP)
  • Machine Learning
  • Generative AI
  • Vector Store capabilities

within a single MySQL platform.

One of the most interesting aspects of HeatWave GenAI is that it enables AI-powered workflows directly inside SQL.

Instead of exporting data to external AI pipelines, developers can:

  • Generate embeddings
  • Execute semantic similarity search
  • Perform retrieval augmented generation (RAG)
  • Analyze images
  • Invoke LLMs

using SQL routines \o/.

MySQL HeatWave Key GenAI Capabilities

MySQL HeatWave GenAI includes:

  • Text generation
  • Text embeddings
  • Vector similarity search
  • In-database AI processing
  • Vision model integration

Key routines and data types used in this article include sys.ML_GENERATE, sys.ML_EMBED_ROW, the DISTANCE function, and the VECTOR column type.

In this article, I’m using MySQL HeatWave version 9.6.1:

mysql> 
SELECT VERSION();
+-------------+
| VERSION()   |
+-------------+
| 9.6.1-cloud |
+-------------+

Significantly Lowering the Total Cost of Ownership

Traditionally, AI pipelines require:

  • Multiple services
  • External vector databases
  • Dedicated ML infrastructure
  • Separate orchestration layers

With MySQL HeatWave:

  • Data remains inside MySQL
  • AI logic becomes SQL-native
  • Governance is simplified
  • Operational complexity is reduced

By consolidating these diverse requirements into a single platform, MySQL HeatWave significantly lowers the Total Cost of Ownership (TCO). It eliminates the expensive licensing fees associated with multiple third-party services and external vector databases.

Furthermore, by making AI logic SQL-native and reducing operational complexity, organizations can leverage their existing database expertise rather than hiring specialized ML engineers to manage separate orchestration layers, drastically cutting both infrastructure and personnel overhead.

Summary of TCO Drivers

  • Skill Set Leverage: AI tasks become accessible to anyone with SQL skills, reducing the need for niche, high-cost specialists.
  • Reduced Licensing: Consolidating multiple services into one platform removes the need for separate vendor contracts.
  • Lower Infrastructure Costs: Eliminating dedicated ML infrastructure and external databases reduces the physical or cloud hardware footprint.
  • Operational Efficiency: Keeping data inside MySQL HeatWave simplifies governance and reduces the labor hours required for complex data movement and synchronization.

Generating Image Understanding with sys.ML_GENERATE

The foundation of our evaluation assistant is the ability to convert images into semantic understanding.

This is where sys.ML_GENERATE becomes extremely powerful.

Prerequisites

The environment requires:

  • A MySQL HeatWave instance with a HeatWave cluster attached
  • Access to supported vision models (like google.gemini-2.5-pro)
  • Access to supported embedding models (like cohere.embed-english-v3.0)
  • A table capable of storing:
    • Images
    • Descriptions
    • Keywords
    • Embeddings

Table Definition

CREATE TABLE image_details (
  id_image int unsigned NOT NULL AUTO_INCREMENT,
  image_name varchar(255) NOT NULL,
  image_base64 longtext NOT NULL,
  image_description json NOT NULL,
  image_description_embedding vector(2048) NOT NULL COMMENT 'GENAI_OPTIONS=EMBED_MODEL_ID=cohere.embed-english-v3.0',
  image_keywords json NOT NULL,
  image_keywords_embedding vector(2048) NOT NULL COMMENT 'GENAI_OPTIONS=EMBED_MODEL_ID=cohere.embed-english-v3.0',
  created_at timestamp NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (id_image)
) ENGINE=InnoDB;

This table stores:

  • The original image (Base64-encoded).
  • Semantic metadata (image description and keywords).
  • Vector embeddings (of the description and of the keywords).

directly inside MySQL.
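As a client-side sketch of how the image_base64 payload might be produced before it reaches this table (the file name is hypothetical; only the encoding step is shown):

```python
import base64

def image_to_base64(image_bytes: bytes) -> str:
    # Encode raw image bytes into the Base64 text stored in image_base64.
    return base64.b64encode(image_bytes).decode("ascii")

# Hypothetical usage with a local file:
# with open("hornet_black_gold.jpg", "rb") as f:
#     payload = image_to_base64(f.read())

# The PNG magic bytes encode to the familiar "iVBORw" prefix.
print(image_to_base64(b"\x89PNG"))  # iVBORw==
```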

What sys.ML_GENERATE Enables

The sys.ML_GENERATE routine supports multimodal prompts combining:

  • Text
  • Images

This enables:

  • Caption generation
  • Object detection prompts
  • Context extraction
  • Semantic interpretation

Please read the MySQL HeatWave ML_GENERATE Documentation if you want to know more.

Bridging Structured and Unstructured Data

One of the most interesting aspects of ML_GENERATE is that it bridges the gap between:

  • Structured SQL systems
  • Unstructured image content

The output becomes structured semantic metadata that can:

  • Be stored
  • Indexed
  • Embedded
  • Queried
  • Compared

using standard SQL operations.

Supported Vision Models

The supported vision models can be queried directly from MySQL HeatWave:

mysql>
SELECT * 
FROM sys.ML_SUPPORTED_LLMS
WHERE model_id LIKE 'google.gemini%'\G

*************************** 1. row ***************************
         provider: OCI Generative AI Service
         model_id: google.gemini-2.5-flash
availability_date: 2026-01-22
     capabilities: ["GENERATION"]
    default_model: 0
*************************** 2. row ***************************
         provider: OCI Generative AI Service
         model_id: google.gemini-2.5-pro
availability_date: 2026-01-22
     capabilities: ["GENERATION"]
    default_model: 0
*************************** 3. row ***************************
         provider: OCI Generative AI Service
         model_id: google.gemini-2.5-flash-lite
availability_date: 2026-01-22
     capabilities: ["GENERATION"]
    default_model: 0

The model used in this article is:

google.gemini-2.5-pro

The workflow is straightforward: the Base64-encoded image and a text prompt are passed together to sys.ML_GENERATE, and the routine returns structured text.

This transforms unstructured visual data into structured semantic data.

Examples

Object Detection Prompt

mysql> 
SELECT image_base64
FROM image_details
WHERE id_image = 1
INTO @image_base64;

SET @prompt_desc_key_json = 'What objects are present?';

SELECT JSON_UNQUOTE(
    JSON_EXTRACT(
        sys.ML_GENERATE(@prompt_desc_key_json, JSON_OBJECT("model_id", "google.gemini-2.5-pro", "image", @image_base64)),
        '$.text'
    )
) AS description
\G

*************************** 1. row ***************************
description: Based on the image provided, here are the objects that are present:

**Main Subject:**
*   A black motorcycle, which appears to be a Honda Hornet, is the central object. Its visible parts include:
    *   Wheels and tires
    *   Engine
    *   Exhaust pipe and muffler
    *   Frame
    *   Seat
    *   Fuel tank (with a black tank cover on it)
    *   Handlebars with mirrors and levers
    *   Front and rear disc brakes
    *   Gold-colored front forks

**Background and Surroundings:**
*   A white or light gray garage door with vertical panels.
*   A red brick wall on the left.
*   A white drainpipe between the wall and the garage door.
*   Pavement or concrete on the ground, with some patches of frost or thin snow.
*   Dry, brown leaves scattered on the ground.

Scene Context Prompt

mysql> 
SET @prompt_desc_key_json = 'What is the context of this scene?';

SELECT JSON_UNQUOTE(
    JSON_EXTRACT(
        sys.ML_GENERATE(@prompt_desc_key_json, JSON_OBJECT("model_id", "google.gemini-2.5-pro", "image", @image_base64)),
        '$.text'
    )
) AS description
\G

*************************** 1. row ***************************
description: Based on the image, the context of the scene is an outdoor, residential setting, likely during a cold season like late autumn or winter.

Here are the key details that establish this context:

*   **Subject:** The central focus is a modern, black and silver Honda Hornet motorcycle. It is parked and appears to be the subject of the photo, possibly taken by its owner.
*   **Location:** The motorcycle is parked on a paved surface, such as a driveway or a courtyard, in front of white, vertical-paneled garage doors. To the left, there is a red brick wall. This combination suggests a residential area.
*   **Season/Weather:** The ground is littered with dead, brown leaves and patches of what looks like frost or old, melting snow. This indicates cold weather, pointing to late autumn or winter. The lighting appears flat and overcast, which is common for that time of year.

This capability becomes the semantic foundation of the evaluation assistant. We are not merely generating captions.

We are enabling:

  • Queryable visual understanding
  • Semantic enrichment
  • Explainable AI retrieval

Text-to-Image Search (Semantic Retrieval)

One of the most powerful applications of semantic embeddings is natural language image search.

The objective is simple:

Input: Text
Output: Relevant Images

But internally, the logic is fundamentally different from traditional image retrieval systems.

The Core Idea

We are not searching for:

  • Pixels
  • Shapes
  • Exact visual patterns

We are searching for:

  • Meaning
  • Context
  • Semantics

This is semantic retrieval.

Workflow Overview

The workflow contains two phases:

  1. Preprocessing
  2. Query execution

Preprocessing Phase

Each image must first be semantically enriched.

The workflow is the following: each image is passed to sys.ML_GENERATE to produce a description and keywords, and each of these texts is then passed to sys.ML_EMBED_ROW to produce the vector embeddings stored in the table.

Examples


Supported Embedding Models

mysql> 
SELECT *
FROM sys.ML_SUPPORTED_LLMS
WHERE capabilities LIKE '["TEXT_EMBEDDINGS"]'\G

*************************** 1. row ***************************
         provider: HeatWave
         model_id: all_minilm_l12_v2
availability_date: 2024-07-01
     capabilities: ["TEXT_EMBEDDINGS"]
    default_model: 0
*************************** 2. row ***************************
         provider: HeatWave
         model_id: multilingual-e5-small
availability_date: 2024-07-24
     capabilities: ["TEXT_EMBEDDINGS"]
    default_model: 1
*************************** 3. row ***************************
         provider: OCI Generative AI Service
         model_id: cohere.embed-english-v3.0
availability_date: 2024-02-17
     capabilities: ["TEXT_EMBEDDINGS"]
    default_model: 0
*************************** 4. row ***************************
         provider: OCI Generative AI Service
         model_id: cohere.embed-multilingual-v3.0
availability_date: 2024-02-17
     capabilities: ["TEXT_EMBEDDINGS"]
    default_model: 0

Generating Semantic Metadata

mysql> 
SET @prompt_desc_key_json = '
Analyze the image and return a minified JSON object.
The JSON must contain two fields:
"description" (a concise, single-paragraph description of visible content with no speculation)
and "keywords" (a JSON array of relevant lowercase visual keywords).
Output ONLY the raw JSON on a single line without any markdown formatting, backticks, or preamble
';

Generate Description and Keywords

mysql> 
SELECT JSON_UNQUOTE(
    JSON_EXTRACT(
        sys.ML_GENERATE(@prompt_desc_key_json, JSON_OBJECT("model_id", "google.gemini-2.5-pro", "image", @image_base64)),
        '$.text'
    )
) AS description
\G

*************************** 1. row ***************************
description: {"description":"A side profile view of a black Honda Hornet naked motorcycle parked on a paved surface with scattered leaves. The bike has a black frame, engine, and fuel tank, with a silver tail section and exhaust pipe. The front forks are gold-colored. In the background, there is a red brick wall on the left and a light-colored, vertically-paneled garage door on the right.","keywords":["motorcycle","honda","honda hornet","naked bike","street bike","black","parked","outdoors","garage door","brick wall","engine","exhaust","two-wheeler","vehicle","wheel","tire","disc brake"]}
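Before calling sys.ML_EMBED_ROW, the generated JSON has to be split into the two texts that are embedded separately. A minimal client-side sketch (the space-joined keyword format is an assumption, not prescribed by HeatWave):

```python
import json

def split_metadata(generated_text: str) -> tuple[str, str]:
    # Split the model's JSON answer into the description text and a
    # space-joined keyword string, each of which is embedded separately.
    data = json.loads(generated_text)
    return data["description"], " ".join(data["keywords"])

# Abbreviated sample shaped like the output above.
raw = ('{"description":"A black Honda Hornet motorcycle parked outdoors.",'
       '"keywords":["motorcycle","honda","parked","outdoors"]}')
desc, keywords = split_metadata(raw)
print(desc)      # A black Honda Hornet motorcycle parked outdoors.
print(keywords)  # motorcycle honda parked outdoors
```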

Generate Embeddings and Store Data

mysql> 
INSERT INTO image_details (
    image_name,
    image_base64,
    image_description,
    image_description_embedding,
    image_keywords,
    image_keywords_embedding
)
VALUES (
    @image_name,
    @image_base64,
    @source_image_descriptions,
    sys.ML_EMBED_ROW(
        @source_image_descriptions,
        '{"model_id": "cohere.embed-english-v3.0"}'
    ),
    @source_image_keywords,
    sys.ML_EMBED_ROW(
        @source_image_keywords,
        '{"model_id": "cohere.embed-english-v3.0"}'
    )
);

(The @image_name, @image_base64, @source_image_descriptions, and @source_image_keywords user variables are assumed to have been populated beforehand, e.g. from the metadata generated in the previous step.)

Query Phase

At query time:

  1. The user enters a text description
  2. The text becomes an embedding
  3. Similarity search retrieves matching images

Generate Query Embedding

mysql> 
SET @embeddOptions = '{"model_id": "cohere.embed-english-v3.0"}';

SET @searchImage = 'street motorcycle';

SELECT sys.ML_EMBED_ROW(
    @searchImage,
    @embeddOptions
) INTO @searchImageEmbedding;

If you want to see what a vector embedding looks like under the hood, you can use the FROM_VECTOR() function:

mysql>
SELECT from_vector(@searchImageEmbedding)\G
*************************** 1. row ***************************
from_vector(@searchImageEmbedding): [2.09045e-02,1.06277e-02,2.59590e-03,-6.01807e-02,-4.44336e-02,8.44574e-03,-4.88281e-02,8.72803e-03,-1.45645e-02,2.78168e-02,-4.73328e-02,2.53296e-03,-5.47791e-03,-1.36337e-02,4.00696e-02,-1.90735e-02,6.69556e-02,-3.72314e-02,3.95203e-02,2.24457e-02,-1.06964e-02,7.75757e-02,-1 
                 ... even more numbers ...
,-3.04871e-02,5.33295e-03,3.96729e-03]

Similarity Search Using Descriptions

mysql> 
SELECT 
    image_name,
    DISTANCE(
        image_description_embedding,
        @searchImageEmbedding,
        'COSINE'
    ) AS min_distance
FROM image_details
ORDER BY min_distance
LIMIT 3;

+-----------------------+---------------------+
| image_name            | min_distance        |
+-----------------------+---------------------+
| vintage_sidecar.JPG   | 0.46460431814193726 |
| Le_Mans_24_moto.png   | 0.49795448780059814 |
| hornet_black_gold.jpg |  0.5088052749633789 |
+-----------------------+---------------------+
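DISTANCE(v1, v2, 'COSINE') returns the cosine distance, i.e. 1 minus the cosine similarity, so smaller values mean closer meaning. A pure-Python equivalent for intuition (HeatWave computes this in-database over the 2048-dimensional vectors):

```python
import math

def cosine_distance(a, b):
    # Cosine distance = 1 - (a . b) / (|a| * |b|), the metric ORDER BY uses.
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.sqrt(sum(x * x for x in a)) *
                        math.sqrt(sum(x * x for x in b)))

print(cosine_distance([1.0, 0.0], [1.0, 0.0]))  # 0.0 (same direction)
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # 1.0 (orthogonal)
```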

Similarity Search Using Description + Keywords

mysql> 
WITH distances AS (
    SELECT
        image_name,
        (
            DISTANCE(
                image_keywords_embedding,
                @searchImageEmbedding,
                'COSINE'
            )
            +
            DISTANCE(
                image_description_embedding,
                @searchImageEmbedding,
                'COSINE'
            )
        ) / 2 AS avg_distance
    FROM image_details
)
SELECT *
FROM distances
ORDER BY avg_distance ASC
LIMIT 3;

+-----------------------+---------------------+
| image_name            | avg_distance        |
+-----------------------+---------------------+
| hornet_black_gold.jpg | 0.48873063921928406 |
| vintage_sidecar.JPG   |  0.4904197156429291 |
| Yamaha_DTX.jpg        |  0.5230678915977478 |
+-----------------------+---------------------+

Simple Re-ranking Strategy

mysql> 
WITH initial_results AS (
    SELECT
        image_name,
        DISTANCE(
            image_keywords_embedding,
            @searchImageEmbedding,
            'COSINE'
        ) AS keywords_distance,

        DISTANCE(
            image_description_embedding,
            @searchImageEmbedding,
            'COSINE'
        ) AS description_distance

    FROM image_details

    ORDER BY keywords_distance + description_distance
    LIMIT 15
),

reranked_results AS (
    SELECT
        image_name,
        (
            0.3 * keywords_distance +
            0.7 * description_distance
        ) AS combined_distance

    FROM initial_results
)

SELECT *
FROM reranked_results
ORDER BY combined_distance ASC
LIMIT 3;

+-----------------------+--------------------+
| image_name            | combined_distance  |
+-----------------------+--------------------+
| vintage_sidecar.JPG   | 0.4800935566425323 |
| hornet_black_gold.jpg |  0.496760493516922 |
| Le_Mans_24_moto.png   | 0.5196295261383057 |
+-----------------------+--------------------+
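The two-stage logic above (shortlist, then weighted combination) can be sketched client-side as follows. The 0.3/0.7 weights come from the query; the candidate distances are invented for illustration:

```python
def rerank(candidates, kw_weight=0.3, desc_weight=0.7, top_k=3):
    # Re-rank (name, keywords_distance, description_distance) tuples by a
    # weighted combination, mirroring the reranked_results CTE.
    scored = [(name, kw_weight * kw_d + desc_weight * desc_d)
              for name, kw_d, desc_d in candidates]
    return sorted(scored, key=lambda row: row[1])[:top_k]

# Illustrative candidate shortlist (distances invented for the example).
rows = [
    ("vintage_sidecar.JPG",   0.52, 0.46),
    ("hornet_black_gold.jpg", 0.47, 0.51),
    ("Le_Mans_24_moto.png",   0.55, 0.50),
]
for name, d in rerank(rows):
    print(name, round(d, 3))
```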

Key Insight

The most important concept here is this:

  • We are not comparing images.
  • We are comparing semantic representations of images.

This distinction fundamentally changes how image retrieval systems can be designed using SQL-native AI capabilities.

Reverse Image Search (Image-to-Image via Semantics)

Traditional reverse image search systems rely heavily on:

  • Pixel similarity
  • Feature extraction
  • Specialized computer vision pipelines

Our approach is different.

We perform Semantic Image-to-Image Search.

This is:

  • Similarity of meaning
  • Not similarity of pixels

Workflow

Query image -> sys.ML_GENERATE (description and keywords) -> sys.ML_EMBED_ROW -> similarity search against the stored embeddings

Why This Is Interesting

Two images may be visually different while still being semantically related.

For example:

  • Different motorcycles
  • Different lighting conditions
  • Different camera angles

Yet both images may describe: « A motorcycle parked on an urban street ».

Semantic retrieval captures this meaning.

Advantages of the Semantic Approach

Simpler Architecture

No need for:

  • Complex Computer Vision frameworks
  • GPU-heavy image feature pipelines
  • Specialized image vector databases

Explainability

Because the retrieval is based on generated semantic metadata:

  • Results can be audited
  • Descriptions can be refined
  • Prompts can be improved

SQL-native Workflow

Everything remains inside MySQL HeatWave:

  • Metadata
  • Embeddings
  • Similarity search
  • Ranking logic

Architectural Trade-offs & Design Considerations

Building AI-powered semantic image systems involves several important architectural decisions.

Description vs Keywords

Both representations serve different purposes.

Descriptions

Descriptions provide:

  • Rich contextual understanding
  • Better semantic reasoning
  • Natural language flexibility

However:

  • They may introduce noise
  • Longer text may dilute embeddings

Keywords

Keywords provide:

  • Focused semantic signals
  • Better precision
  • Faster matching

However:

  • They lose contextual richness

Strategy implemented in this article

A hybrid approach:

  • Used descriptions for semantic context
  • Used keywords for precision
  • Combined both during re-ranking

However

Storing two embeddings doubles vector storage (2 × 2048 dimensions per image).
For very large libraries, consider keeping only one embedding: either embed a concatenation of the description and keywords, or keep whichever representation performs better on your validation set.
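One possible shape for the single-embedding alternative mentioned above is to concatenate the description and keywords into one text and embed it once. The exact format is an assumption, not a HeatWave requirement:

```python
def combined_text(description: str, keywords: list[str]) -> str:
    # Build a single embedding input from both representations, halving
    # vector storage at the cost of per-field weighting during re-ranking.
    return description.rstrip(".") + ". Keywords: " + ", ".join(keywords)

print(combined_text(
    "A black Honda Hornet motorcycle parked outdoors.",
    ["motorcycle", "honda", "parked"],
))
```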

Storing Images in the Database vs Object Storage

Another important design consideration concerns image storage.

Storing Images in MySQL

Advantages:

  • Simpler architecture
  • Centralized governance
  • Easier transactional consistency

Disadvantages:

  • Larger database size increases buffer pool pressure
  • Backup overhead
  • Potential performance impact (queries, replication, …)

Storing Images in Object Storage

Advantages:

  • Better scalability
  • Lower storage costs
  • Better CDN integration

Disadvantages:

  • More complex architecture
  • Additional security management

Recommended Hybrid Strategy

A practical enterprise approach is often:

  • Store thumbnails (small, e.g., 512×512) as base64 in MySQL (should be sufficient for vision model analysis)
  • Store full-resolution images in object storage for later retrieval or user download

This enables:

  • Fast previews
  • Efficient semantic retrieval
  • Scalable storage architecture

Prompt Engineering Matters

Semantic quality heavily depends on prompts.

Poor prompts generate:

  • Vague descriptions
  • Weak keywords
  • Low-quality embeddings

Well-designed prompts improve:

  • Retrieval quality
  • Explainability
  • Evaluation consistency

Prompt engineering becomes a critical part of semantic architecture design.

Conclusion

MySQL HeatWave GenAI significantly simplifies the implementation of semantic image understanding systems while substantially lowering the Total Cost of Ownership through the elimination of external infrastructure and fragmented AI pipelines.

By combining:

  • Vision-language models
  • Text embeddings
  • Vector search
  • SQL-native AI routines

organizations can build powerful multimodal systems directly inside MySQL HeatWave.

The « image -> text -> embedding » approach presented in this article provides:

  • Explainability
  • Operational simplicity
  • SQL-native integration
  • Semantic search capabilities

Most importantly, it enables architects and developers to build AI systems that remain understandable, debuggable, and governable — which is often one of the biggest challenges in enterprise AI adoption.

Resources

Follow me on Linkedin

Watch my videos on my YouTube channel and subscribe.

My Slideshare account.

My Speaker Deck account.

Thanks for using HeatWave & MySQL!
