Building an AI Vision Search Engine with MySQL HeatWave GenAI
Modern AI systems increasingly rely on multimodal data: text, images, documents, audio, and video. Among these modalities, image understanding has become one of the most important capabilities for AI-powered applications.
Organizations now expect systems to:
- Search images using natural language
- Compare visually similar assets
- Classify and enrich image catalogs
- Validate AI-generated descriptions
- Build semantic retrieval pipelines
Traditionally, implementing these capabilities required specialized computer vision infrastructure, external vector databases, custom ML pipelines, and multiple frameworks.
With MySQL HeatWave GenAI, many of these capabilities can now be implemented directly inside SQL workflows using built-in AI routines such as:
- sys.ML_GENERATE
- sys.ML_EMBED_ROW
- The VECTOR data type
- Similarity search with DISTANCE and the other vector functions
In this article, we will build the foundations of a Vision Model Evaluation Assistant using MySQL HeatWave GenAI.
The objective is not to create another image classifier, but rather a semantic image understanding platform capable of:
- Generating image descriptions
- Producing semantic embeddings
- Performing text-to-image search
- Performing semantic reverse image search
- Evaluating image understanding quality
This article intentionally focuses on the database layer and the SQL logic powering the system.
Landscape of Vision-to-Embedding Approaches
Before building the solution, it is important to understand the major approaches currently used for AI-powered image understanding systems.
Let’s look at three paradigms for representing images in AI systems.
1. Direct Image Embeddings
The first approach consists of converting raw image pixels directly into vectors.
Popular architectures include:
- CLIP-like (Contrastive Language-Image Pretraining) models
- ResNet + projector architectures
- Vision Transformers (ViT)
In this approach:
- The image is fed directly into a neural network
- The network produces a dense vector embedding
- Similarity search is performed directly on image vectors
Workflow

Strengths
- Extremely fast similarity search
- Excellent for image-to-image matching
- End-to-end architecture
- No intermediate textual representation required
Limitations
These embeddings are often difficult to interpret:
- No human-readable explanation
- Harder to debug
- Harder to refine
- Black-box behavior
If a search result is incorrect, it is difficult to understand why the model failed.
To summarize
Pros
- High-performance retrieval
- Strong visual similarity matching
- Can compare image-to-image directly
Cons
- Requires specialized model deployment
- Less explainable
- Harder to integrate into SQL-centric workflows
2. Semantic Metadata Embeddings
This is the approach used in this article.
Instead of embedding raw pixels directly, we first generate a textual understanding of the image.
The workflow becomes:

This pipeline is composed of two distinct stages:
Step 1: Image -> Text
A Vision Language Model (VLM) analyzes the image and generates:
- A description and/or keywords
- Contextual understanding
Example:
{
  "description": "A red motorcycle parked on a city street near buildings during daytime.",
  "keywords": ["motorcycle", "street", "urban", "vehicle", "city"]
}
Step 2: Text -> Embedding
The generated text is then converted into a vector embedding using a text embedding model.
This creates semantic vectors representing the meaning of the image.
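To make "semantic vectors" concrete, here is a minimal, self-contained Python sketch of cosine distance — the same metric used later via HeatWave's DISTANCE(..., 'COSINE') function — applied to toy vectors. The vectors and their values are invented for illustration; real embeddings come from the embedding model and have far more dimensions:

```python
import math

def cosine_distance(a, b):
    # Cosine distance = 1 - cosine similarity; 0 means identical direction.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

# Toy 4-dimensional "embeddings" (real ones have e.g. 2048 dimensions).
red_bike   = [0.9, 0.1, 0.0, 0.4]
motorcycle = [0.8, 0.2, 0.1, 0.5]
kitten     = [0.0, 0.9, 0.8, 0.1]

print(cosine_distance(red_bike, motorcycle))  # small: related meanings
print(cosine_distance(red_bike, kitten))      # larger: unrelated meanings
```

Texts with related meanings produce embeddings that point in similar directions, which is exactly what makes the image searchable by meaning rather than by pixels.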
Strengths
This approach offers several advantages:
Human-readable and explainable
Unlike direct image embeddings, the generated metadata is understandable by humans.
If a search fails, developers can inspect:
- The generated description
- The extracted keywords
- The semantic interpretation
This creates a powerful audit trail.
SQL-native workflow
Because the representation becomes textual:
- Standard embedding models can be used
- VECTOR columns integrate naturally
- SQL workflows remain simple
This aligns perfectly with MySQL HeatWave GenAI capabilities.
Trade-offs
The quality of retrieval depends heavily on:
- Caption quality
- Prompt engineering
- Vision model performance
Low-level visual details may also be lost:
- Texture
- Exact shape
- Pixel-specific patterns
To summarize
Pros
- Human-readable
- Easy to debug
- Works well with SQL systems
- Easier operationalization
Cons
- Less precise for purely visual similarity
- Depends on generated metadata quality
3. Multimodal Fusion Models
A third category consists of multimodal architectures capable of jointly processing:
- Images
- Text
- Audio
- Video
- Structured data
These systems use:
- Early fusion
- Intermediate fusion
- Late fusion
- Hybrid fusion architectures
Their objective is to create unified multimodal embeddings.
Strengths
- Very powerful semantic understanding
- Strong cross-modal reasoning
- Rich contextual interpretation
Challenges
These architectures are often:
- More complex
- Harder to operationalize
- Less transparent
- Difficult to integrate into SQL-centric architectures
Why We Chose the Image -> Text -> Embedding Approach
In this article, we intentionally choose the Semantic Metadata Embeddings pipeline because it aligns perfectly with MySQL HeatWave GenAI capabilities.
Specifically:
- ML_GENERATE handles image-to-text generation
- ML_EMBED_ROW handles text-to-vector conversion
This approach also creates a human-readable semantic audit trail.
If retrieval quality degrades, architects and developers can directly inspect the generated descriptions to understand:
- What the model detected
- What it missed
- Why semantic similarity failed
For AI evaluation systems, explainability is often just as important as accuracy.
Overview of MySQL HeatWave GenAI

MySQL HeatWave is a fully managed cloud service that combines:
- Transactional processing (OLTP)
- Analytics (OLAP)
- Machine Learning
- Generative AI
- Vector Store capabilities
within a single MySQL platform.

One of the most interesting aspects of HeatWave GenAI is that it enables AI-powered workflows directly inside SQL.
Instead of exporting data to external AI pipelines, developers can:
- Generate embeddings
- Execute semantic similarity search
- Perform retrieval augmented generation (RAG)
- Analyze images
- Invoke LLMs
using SQL routines \o/.
MySQL HeatWave Key GenAI Capabilities
MySQL HeatWave GenAI includes:
- Text generation
- Text embeddings
- Vector similarity search
- In-database AI processing
- Vision model integration
Key routines and data types used in this article include sys.ML_GENERATE, sys.ML_EMBED_ROW, the VECTOR data type, and the DISTANCE function.
In this article, I’m using MySQL HeatWave version 9.6.1:
mysql>
SELECT VERSION();
+-------------+
| VERSION() |
+-------------+
| 9.6.1-cloud |
+-------------+
Significantly Lowering the Total Cost of Ownership
Traditionally, AI pipelines require:
- Multiple services
- External vector databases
- Dedicated ML infrastructure
- Separate orchestration layers
With MySQL HeatWave:
- Data remains inside MySQL
- AI logic becomes SQL-native
- Governance is simplified
- Operational complexity is reduced
By consolidating these diverse requirements into a single platform, MySQL HeatWave significantly lowers the Total Cost of Ownership (TCO). It eliminates the expensive licensing fees associated with multiple third-party services and external vector databases.
Furthermore, by making AI logic SQL-native and reducing operational complexity, organizations can leverage their existing database expertise rather than hiring specialized ML engineers to manage separate orchestration layers, drastically cutting both infrastructure and personnel overhead.
Summary of TCO Drivers
- Skill Set Leverage: AI tasks become accessible to anyone with SQL skills, reducing the need for niche, high-cost specialists.
- Reduced Licensing: Consolidating multiple services into one platform removes the need for separate vendor contracts.
- Lower Infrastructure Costs: Eliminating dedicated ML infrastructure and external databases reduces the physical or cloud hardware footprint.
- Operational Efficiency: Keeping data inside MySQL HeatWave simplifies governance and reduces the labor hours required for complex data movement and synchronization.
Generating Image Understanding with sys.ML_GENERATE
The foundation of our evaluation assistant is the ability to convert images into semantic understanding.
This is where sys.ML_GENERATE becomes extremely powerful.
Prerequisites
The environment requires:
- A MySQL HeatWave instance with a HeatWave cluster attached
- Access to supported vision models (like google.gemini-2.5-pro)
- Access to supported embedding models (like cohere.embed-english-v3.0)
- A table capable of storing:
- Images
- Descriptions
- Keywords
- Embeddings
Table Definition
CREATE TABLE image_details (
id_image int unsigned NOT NULL AUTO_INCREMENT,
image_name varchar(255) NOT NULL,
image_base64 longtext NOT NULL,
image_description json NOT NULL,
image_description_embedding vector(2048) NOT NULL COMMENT 'GENAI_OPTIONS=EMBED_MODEL_ID=cohere.embed-english-v3.0',
image_keywords json NOT NULL,
image_keywords_embedding vector(2048) NOT NULL COMMENT 'GENAI_OPTIONS=EMBED_MODEL_ID=cohere.embed-english-v3.0',
created_at timestamp NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (id_image)
) ENGINE=InnoDB;
This table stores:
- The original image (in base 64).
- Semantic metadata (Image description and keywords).
- Vector embeddings (of the description and the keywords).
directly inside MySQL.
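Because the image_base64 column stores the picture as LONGTEXT, the client must base64-encode the file before inserting it. Here is a minimal Python sketch; the helper function and file path are illustrative, not part of HeatWave:

```python
import base64

def image_to_base64(path: str) -> str:
    # Read the raw bytes and encode them as ASCII-safe base64 text,
    # suitable for a LONGTEXT column.
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("ascii")

# Round-trip check on in-memory bytes (no file needed):
payload = b"\x89PNG fake image bytes"
encoded = base64.b64encode(payload).decode("ascii")
assert base64.b64decode(encoded) == payload
```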
What sys.ML_GENERATE Enables
The sys.ML_GENERATE routine supports multimodal prompts combining:
- Text
- Images
This enables:
- Caption generation
- Object detection prompts
- Context extraction
- Semantic interpretation
Please read the MySQL HeatWave ML_GENERATE Documentation if you want to know more.
Bridging Structured and Unstructured Data
One of the most interesting aspects of ML_GENERATE is that it bridges the gap between:
- Structured SQL systems
- Unstructured image content
The output becomes structured semantic metadata that can:
- Be stored
- Indexed
- Embedded
- Queried
- Compared
using standard SQL operations.
Supported Vision Models
The supported vision models can be queried directly from MySQL HeatWave:
mysql>
SELECT *
FROM sys.ML_SUPPORTED_LLMS
WHERE model_id LIKE 'google.gemini%'\G
*************************** 1. row ***************************
provider: OCI Generative AI Service
model_id: google.gemini-2.5-flash
availability_date: 2026-01-22
capabilities: ["GENERATION"]
default_model: 0
*************************** 2. row ***************************
provider: OCI Generative AI Service
model_id: google.gemini-2.5-pro
availability_date: 2026-01-22
capabilities: ["GENERATION"]
default_model: 0
*************************** 3. row ***************************
provider: OCI Generative AI Service
model_id: google.gemini-2.5-flash-lite
availability_date: 2026-01-22
capabilities: ["GENERATION"]
default_model: 0
The model used in this article is:
google.gemini-2.5-pro
The workflow is straightforward:

This transforms unstructured visual data into structured semantic data.
Examples

Object Detection Prompt
mysql>
SELECT image_base64
FROM image_details
WHERE id_image = 1
INTO @image_base64;
SET @prompt_desc_key_json = 'What objects are present?';
SELECT JSON_UNQUOTE(
JSON_EXTRACT(
sys.ML_GENERATE(@prompt_desc_key_json, JSON_OBJECT("model_id", "google.gemini-2.5-pro", "image", @image_base64)),
'$.text'
)
) AS description
\G
*************************** 1. row ***************************
description: Based on the image provided, here are the objects that are present:
**Main Subject:**
* A black motorcycle, which appears to be a Honda Hornet, is the central object. Its visible parts include:
* Wheels and tires
* Engine
* Exhaust pipe and muffler
* Frame
* Seat
* Fuel tank (with a black tank cover on it)
* Handlebars with mirrors and levers
* Front and rear disc brakes
* Gold-colored front forks
**Background and Surroundings:**
* A white or light gray garage door with vertical panels.
* A red brick wall on the left.
* A white drainpipe between the wall and the garage door.
* Pavement or concrete on the ground, with some patches of frost or thin snow.
* Dry, brown leaves scattered on the ground.
Scene Context Prompt
mysql>
SET @prompt_desc_key_json = 'What is the context of this scene?';
SELECT JSON_UNQUOTE(
JSON_EXTRACT(
sys.ML_GENERATE(@prompt_desc_key_json, JSON_OBJECT("model_id", "google.gemini-2.5-pro", "image", @image_base64)),
'$.text'
)
) AS description
\G
*************************** 1. row ***************************
description: Based on the image, the context of the scene is an outdoor, residential setting, likely during a cold season like late autumn or winter.
Here are the key details that establish this context:
* **Subject:** The central focus is a modern, black and silver Honda Hornet motorcycle. It is parked and appears to be the subject of the photo, possibly taken by its owner.
* **Location:** The motorcycle is parked on a paved surface, such as a driveway or a courtyard, in front of white, vertical-paneled garage doors. To the left, there is a red brick wall. This combination suggests a residential area.
* **Season/Weather:** The ground is littered with dead, brown leaves and patches of what looks like frost or old, melting snow. This indicates cold weather, pointing to late autumn or winter. The lighting appears flat and overcast, which is common for that time of year.
This capability becomes the semantic foundation of the evaluation assistant. We are not merely generating captions.
We are enabling:
- Queryable visual understanding
- Semantic enrichment
- Explainable AI retrieval
Text-to-Image Search (Semantic Retrieval)
One of the most powerful applications of semantic embeddings is natural language image search.
The objective is simple:
- Input: Text
- Output: Relevant Images
But internally, the logic is fundamentally different from traditional image retrieval systems.
The Core Idea
We are not searching for:
- Pixels
- Shapes
- Exact visual patterns
We are searching for:
- Meaning
- Context
- Semantics
This is semantic retrieval.
Workflow Overview
The workflow contains two phases:
- Preprocessing
- Query execution
Preprocessing Phase
Each image must first be semantically enriched.
The workflow is the following:

Examples
Supported Embedding Models
mysql>
SELECT *
FROM sys.ML_SUPPORTED_LLMS
WHERE capabilities LIKE '["TEXT_EMBEDDINGS"]'\G
*************************** 1. row ***************************
provider: HeatWave
model_id: all_minilm_l12_v2
availability_date: 2024-07-01
capabilities: ["TEXT_EMBEDDINGS"]
default_model: 0
*************************** 2. row ***************************
provider: HeatWave
model_id: multilingual-e5-small
availability_date: 2024-07-24
capabilities: ["TEXT_EMBEDDINGS"]
default_model: 1
*************************** 3. row ***************************
provider: OCI Generative AI Service
model_id: cohere.embed-english-v3.0
availability_date: 2024-02-17
capabilities: ["TEXT_EMBEDDINGS"]
default_model: 0
*************************** 4. row ***************************
provider: OCI Generative AI Service
model_id: cohere.embed-multilingual-v3.0
availability_date: 2024-02-17
capabilities: ["TEXT_EMBEDDINGS"]
default_model: 0
Generating Semantic Metadata
mysql>
SET @prompt_desc_key_json = '
Analyze the image and return a minified JSON object.
The JSON must contain two fields:
"description" (a concise, single-paragraph description of visible content with no speculation)
and "keywords" (a JSON array of relevant lowercase visual keywords).
Output ONLY the raw JSON on a single line without any markdown formatting, backticks, or preamble
';
Generate Description and Keywords
mysql>
SELECT JSON_UNQUOTE(
JSON_EXTRACT(
sys.ML_GENERATE(@prompt_desc_key_json, JSON_OBJECT("model_id", "google.gemini-2.5-pro", "image", @image_base64)),
'$.text'
)
) AS description
\G
*************************** 1. row ***************************
description: {"description":"A side profile view of a black Honda Hornet naked motorcycle parked on a paved surface with scattered leaves. The bike has a black frame, engine, and fuel tank, with a silver tail section and exhaust pipe. The front forks are gold-colored. In the background, there is a red brick wall on the left and a light-colored, vertically-paneled garage door on the right.","keywords":["motorcycle","honda","honda hornet","naked bike","street bike","black","parked","outdoors","garage door","brick wall","engine","exhaust","two-wheeler","vehicle","wheel","tire","disc brake"]}
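Because the prompt asks for raw minified JSON, it is worth validating the model's answer before storing it — vision models occasionally wrap output in markdown fences despite instructions. The following defensive Python sketch is an assumption on the client side, not a HeatWave feature; the fence-stripping logic and field checks are illustrative:

```python
import json

def parse_model_json(raw: str) -> dict:
    # Strip accidental markdown fences the model may add despite the prompt.
    cleaned = raw.strip().strip("`")
    if cleaned.startswith("json"):
        cleaned = cleaned[len("json"):]
    data = json.loads(cleaned)
    # Enforce the contract defined in the prompt.
    if not isinstance(data.get("description"), str) or not data["description"]:
        raise ValueError("missing or invalid description")
    if not isinstance(data.get("keywords"), list) or not data["keywords"]:
        raise ValueError("missing or invalid keywords")
    return data

sample = '{"description":"A black motorcycle parked outdoors.","keywords":["motorcycle","black","parked"]}'
meta = parse_model_json(sample)
print(meta["keywords"])  # ['motorcycle', 'black', 'parked']
```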
Generate Embeddings and Store Data
mysql>
INSERT INTO image_details (
image_name,
image_base64,
image_description,
image_description_embedding,
image_keywords,
image_keywords_embedding
)
VALUES (
@image_name,
@image_base64,
@source_image_descriptions,
sys.ML_EMBED_ROW(
@source_image_descriptions,
'{"model_id": "cohere.embed-english-v3.0"}'
),
@source_image_keywords,
sys.ML_EMBED_ROW(
@source_image_keywords,
'{"model_id": "cohere.embed-english-v3.0"}'
)
);
Query Phase
At query time:
- The user enters a text description
- The text becomes an embedding
- Similarity search retrieves matching images
Generate Query Embedding
mysql>
SET @embeddOptions = '{"model_id": "cohere.embed-english-v3.0"}';
SET @searchImage = 'street motorcycle';
SELECT sys.ML_EMBED_ROW(
@searchImage,
@embeddOptions
) INTO @searchImageEmbedding;
If you want to see what a vector embedding looks like under the hood, you can use the FROM_VECTOR() function:
mysql>
SELECT from_vector(@searchImageEmbedding)\G
*************************** 1. row ***************************
from_vector(@searchImageEmbedding): [2.09045e-02,1.06277e-02,2.59590e-03,-6.01807e-02,-4.44336e-02,8.44574e-03,-4.88281e-02,8.72803e-03,-1.45645e-02,2.78168e-02,-4.73328e-02,2.53296e-03,-5.47791e-03,-1.36337e-02,4.00696e-02,-1.90735e-02,6.69556e-02,-3.72314e-02,3.95203e-02,2.24457e-02,-1.06964e-02,7.75757e-02,-1
... even more numbers ...
,-3.04871e-02,5.33295e-03,3.96729e-03]
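Since FROM_VECTOR() returns a JSON-style array of floats, a client can parse it directly. A small Python sketch — the literal below is a truncated, invented sample, not real model output:

```python
import json
import math

# Truncated, invented example of FROM_VECTOR() output (a JSON-style array).
raw = "[2.09045e-02,1.06277e-02,2.59590e-03,-6.01807e-02]"
vector = json.loads(raw)

print(len(vector))                            # number of components parsed
print(math.sqrt(sum(x * x for x in vector)))  # L2 norm of the sample vector
```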
Similarity Search Using Descriptions
mysql>
SELECT
image_name,
DISTANCE(
image_description_embedding,
@searchImageEmbedding,
'COSINE'
) AS min_distance
FROM image_details
ORDER BY min_distance
LIMIT 3;
+-----------------------+---------------------+
| image_name | min_distance |
+-----------------------+---------------------+
| vintage_sidecar.JPG | 0.46460431814193726 |
| Le_Mans_24_moto.png | 0.49795448780059814 |
| hornet_black_gold.jpg | 0.5088052749633789 |
+-----------------------+---------------------+
Similarity Search Using Description + Keywords
mysql>
WITH distances AS (
SELECT
image_name,
(
DISTANCE(
image_keywords_embedding,
@searchImageEmbedding,
'COSINE'
)
+
DISTANCE(
image_description_embedding,
@searchImageEmbedding,
'COSINE'
)
) / 2 AS avg_distance
FROM image_details
)
SELECT *
FROM distances
ORDER BY avg_distance ASC
LIMIT 3;
+-----------------------+---------------------+
| image_name | avg_distance |
+-----------------------+---------------------+
| hornet_black_gold.jpg | 0.48873063921928406 |
| vintage_sidecar.JPG | 0.4904197156429291 |
| Yamaha_DTX.jpg | 0.5230678915977478 |
+-----------------------+---------------------+
Simple Re-ranking Strategy
mysql>
WITH initial_results AS (
SELECT
image_name,
DISTANCE(
image_keywords_embedding,
@searchImageEmbedding,
'COSINE'
) AS keywords_distance,
DISTANCE(
image_description_embedding,
@searchImageEmbedding,
'COSINE'
) AS description_distance
FROM image_details
ORDER BY keywords_distance + description_distance
LIMIT 15
),
reranked_results AS (
SELECT
image_name,
(
0.3 * keywords_distance +
0.7 * description_distance
) AS combined_distance
FROM initial_results
)
SELECT *
FROM reranked_results
ORDER BY combined_distance ASC
LIMIT 3;
+-----------------------+--------------------+
| image_name | combined_distance |
+-----------------------+--------------------+
| vintage_sidecar.JPG | 0.4800935566425323 |
| hornet_black_gold.jpg | 0.496760493516922 |
| Le_Mans_24_moto.png | 0.5196295261383057 |
+-----------------------+--------------------+
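The same weighted re-ranking can be prototyped client-side. This Python sketch applies the 0.3/0.7 weights from the query above; the distance pairs are invented values for illustration, not the real query results:

```python
# (image_name, keywords_distance, description_distance) - invented values.
results = [
    ("vintage_sidecar.JPG",   0.52, 0.46),
    ("hornet_black_gold.jpg", 0.47, 0.51),
    ("Le_Mans_24_moto.png",   0.55, 0.50),
]

W_KEYWORDS, W_DESCRIPTION = 0.3, 0.7

reranked = sorted(
    ((name, W_KEYWORDS * kd + W_DESCRIPTION * dd) for name, kd, dd in results),
    key=lambda pair: pair[1],  # smaller combined distance = better match
)

for name, score in reranked:
    print(f"{name}: {score:.4f}")
```

Weighting the description higher reflects the intuition that rich context matters more than keyword overlap for natural-language queries; the weights are a tuning knob, not a fixed rule.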
Key Insight
The most important concept here is this:
- We are not comparing images.
- We are comparing semantic representations of images.
This distinction fundamentally changes how image retrieval systems can be designed using SQL-native AI capabilities.
Reverse Image Search (Image-to-Image via Semantics)
Traditional reverse image search systems rely heavily on:
- Pixel similarity
- Feature extraction
- Specialized computer vision pipelines
Our approach is different.
We perform Semantic Image-to-Image Search.
This is:
- Similarity of meaning
- Not similarity of pixels
Workflow

Why This Is Interesting
Two images may be visually different while still being semantically related.
For example:
- Different motorcycles
- Different lighting conditions
- Different camera angles
Yet both images may describe: « A motorcycle parked on an urban street ».
Semantic retrieval captures this meaning.
Advantages of the Semantic Approach
Simpler Architecture
No need for:
- Complex Computer Vision frameworks
- GPU-heavy image feature pipelines
- Specialized image vector databases
Explainability
Because the retrieval is based on generated semantic metadata:
- Results can be audited
- Descriptions can be refined
- Prompts can be improved
SQL-native Workflow
Everything remains inside MySQL HeatWave:
- Metadata
- Embeddings
- Similarity search
- Ranking logic
Architectural Trade-offs & Design Considerations
Building AI-powered semantic image systems involves several important architectural decisions.
Description vs Keywords
Both representations serve different purposes.
Descriptions
Descriptions provide:
- Rich contextual understanding
- Better semantic reasoning
- Natural language flexibility
However:
- They may introduce noise
- Longer text may dilute embeddings
Keywords
Keywords provide:
- Focused semantic signals
- Better precision
- Faster matching
However:
- They lose contextual richness
Strategy implemented in this article
A hybrid approach:
- Used descriptions for semantic context
- Used keywords for precision
- Combined both during re-ranking
However, storing two embeddings doubles the vector storage (2 × 2048 dimensions per image). For very large libraries, consider keeping only one: either embed the concatenated description and keywords, or choose whichever representation performs better on your validation set.
Storing Images in the Database vs Object Storage
Another important design consideration concerns image storage.
Storing Images in MySQL
Advantages:
- Simpler architecture
- Centralized governance
- Easier transactional consistency
Disadvantages:
- Larger database size increases buffer pool pressure
- Backup overhead
- Potential performance impact (queries, replication, ...)
Storing Images in Object Storage
Advantages:
- Better scalability
- Lower storage costs
- Better CDN integration
Disadvantages:
- More complex architecture
- Additional security management
Recommended Hybrid Strategy
A practical enterprise approach is often:
- Store thumbnails (small, e.g., 512×512) as base64 in MySQL (should be sufficient for vision model analysis)
- Store full-resolution images in object storage for later retrieval or user download
This enables:
- Fast previews
- Efficient semantic retrieval
- Scalable storage architecture
Prompt Engineering Matters
Semantic quality heavily depends on prompts.
Poor prompts generate:
- Vague descriptions
- Weak keywords
- Low-quality embeddings
Well-designed prompts improve:
- Retrieval quality
- Explainability
- Evaluation consistency
Prompt engineering becomes a critical part of semantic architecture design.
Peroration
MySQL HeatWave GenAI significantly simplifies the implementation of semantic image understanding systems while substantially lowering the Total Cost of Ownership through the elimination of external infrastructure and fragmented AI pipelines.
By combining:
- Vision-language models
- Text embeddings
- Vector search
- SQL-native AI routines
organizations can build powerful multimodal systems directly inside MySQL HeatWave.
The « image -> text -> embedding » approach presented in this article provides:
- Explainability
- Operational simplicity
- SQL-native integration
- Semantic search capabilities
Most importantly, it enables architects and developers to build AI systems that remain understandable, debuggable, and governable — which is often one of the biggest challenges in enterprise AI adoption.

Resources
- Ask Your Database Anything: Natural Language to SQL in MySQL HeatWave
- Let Your AI DBA Assistant Write Your MySQL Queries
- Querying the Unstructured: Natural Language to SQL for JSON Data
- MySQL HeatWave GenAI Supported Models and Languages
- Building an Interactive LLM Chatbot with HeatWave Using Python
- Build an AI-Powered Search Engine with HeatWave GenAI (part 1) / (part 2) / (part 3)
Watch my videos on my YouTube channel and subscribe.
Thanks for using HeatWave & MySQL!

MySQL Cloud AI & Analytics Solutions Architect at Oracle
MySQL Geek, strong interest in data and AI, author, blogger and speaker
I have an insatiable hunger for learning.
—–
Blog: www.dasini.net/blog/en/
Linkedin : www.linkedin.com/in/olivier-dasini
Speaker Deck: https://speakerdeck.com/freshdaz
SlideShare: www.slideshare.net/freshdaz
Youtube: https://www.youtube.com/channel/UC12TulyJsJZHoCmby3Nm3WQ
—–