Build an AI-Powered Search Engine with HeatWave GenAI (part 3)

April 15, 2025

In Build an AI-Powered Search Engine with HeatWave GenAI (part 1), we introduced the fundamentals of creating an AI-powered search engine using HeatWave GenAI. We highlighted the advantages of semantic search powered by large language models over traditional SQL-based approaches and provided a hands-on guide for generating embeddings and running similarity searches — key techniques that significantly improve the retrieval of relevant content.

In the second opus — Build an AI-Powered Search Engine with HeatWave GenAI (part 2) — we shifted our focus to improving search result quality through reranking strategies and the use of article summaries for embedding generation. We demonstrated how to implement these enhancements entirely within HeatWave using JavaScript-based stored procedures. By assigning different weights to title and excerpt distances, and generating embeddings from sanitized summaries, we boosted the precision and relevance of search results. This approach showcases HeatWave GenAI’s ability to embed advanced AI capabilities directly within the database layer.

In this third installment, we’ll take it a step further by incorporating full article content into the search engine. While titles, excerpts, or even summaries may work well in many cases, there are situations where deeper, more detailed information is needed to return truly relevant answers.

What are we going to do?

The process is slightly more complex than what we’ve covered so far (in part 1 & part 2). In WordPress, article content is stored in HTML format within the post_content column. This will be our starting point, and the goal is to generate embeddings from it.

To achieve this, we’ll need to write a few lines of code. While this could be done directly within HeatWave using JavaScript stored procedures — as we saw in part 2: A Javascript, stored procedure & AI story — I’ll instead use the unofficial language of data: Python.
Please bear in mind that I’m not a developer, so this code is provided for illustrative purposes only. It may contain errors or limitations. Use it at your own risk and adapt it to your specific needs (and feel free to share improvements back).

Below are the steps we’ll follow to move forward:

  1. Define the embeddings storage table.
  2. Fetch articles from the database.
  3. Remove HTML tags and normalize whitespace.
  4. Split articles into overlapping chunks of words.
  5. Generate embeddings for a given article.
  6. Insert article chunks with their embeddings into HeatWave.

Let’s break down each step together!

I’m using HeatWave 9.2.2:

SELECT version();
+-------------+
| version()   |
+-------------+
| 9.2.2-cloud |
+-------------+

Defining the embeddings storage table

I created a new table named wp_post_chunks_embeddings_minilm to store the embeddings generated from article chunks.

  • chunk_embedding: the embedding of an article chunk
  • chunk_text: the text of the chunk; mostly useful for debugging
  • and some IDs:
    • chunk_id: the id of each chunk
    • post_id: used to join the wp_posts table and retrieve article information
    • and finally, the table’s primary key: id

From the wp_posts table, I’ll use ID (the primary key) and post_content, which contains the article in HTML format.
All this information will be used by the semantic AI-powered search engine.

SHOW CREATE TABLE wp_post_chunks_embeddings_minilm\G
*************************** 1. row ***************************
       Table: wp_post_chunks_embeddings_minilm
Create Table: CREATE TABLE `wp_post_chunks_embeddings_minilm` (
  `id` bigint unsigned NOT NULL AUTO_INCREMENT,
  `post_id` bigint unsigned DEFAULT NULL,
  `chunk_id` bigint unsigned DEFAULT NULL,
  `chunk_text` text,
  `chunk_embedding` vector(2048) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci

Fetch articles from the database

First, we need to grab a batch of published blog articles from HeatWave MySQL. Think of it as scooping out a small batch of posts at a time from a big pool of content. Pagination is used to keep memory consumption under control.

import mysql.connector
from typing import Any, Dict, List

def fetch_articles_batch(db_config: Dict[str, Any], limit: int = 10, offset: int = 0) -> List[Dict[str, Any]]:
    """Fetch a batch of articles from the database"""

    with mysql.connector.connect(**db_config) as conn:
        with conn.cursor(dictionary=True) as cursor:
            # ORDER BY keeps pagination stable, so batches neither skip nor repeat rows
            cursor.execute("""
                SELECT ID, post_content FROM wp_posts
                WHERE post_status = 'publish' AND post_type = 'post'
                ORDER BY ID
                LIMIT %s OFFSET %s
            """, (limit, offset))

            return cursor.fetchall()

Remove HTML tags and normalize whitespace

Articles are stored in HTML format in the database, so the content must be cleaned: we need the raw text, i.e. without all the formatting tags. With this function we:

  1. Remove all the HTML tags
  2. Replace them with spaces (so you don’t accidentally smash words together)
  3. Collapse extra whitespace
  4. Trim off any leading or trailing spaces

import re

def strip_html_tags_with_space(html_string: str) -> str:
    """Remove HTML tags and normalize whitespace"""

    if not html_string:
        return ""

    text = re.sub(r'<[^>]+>', ' ', html_string)
    text = re.sub(r'\s+', ' ', text)

    return text.strip()

Please also note that, depending on the context, extra processing such as stemming or lemmatization can be applied.
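As a quick sanity check, here is how the cleaning function behaves on a small HTML fragment (the function is repeated so the snippet runs standalone):

```python
import re

def strip_html_tags_with_space(html_string: str) -> str:
    """Remove HTML tags and normalize whitespace (same as above)."""
    if not html_string:
        return ""
    text = re.sub(r'<[^>]+>', ' ', html_string)  # tags become spaces
    text = re.sub(r'\s+', ' ', text)             # collapse runs of whitespace
    return text.strip()

html = '<p>HeatWave <strong>GenAI</strong> rocks!</p>'
print(strip_html_tags_with_space(html))  # → HeatWave GenAI rocks!
```

Note that HTML entities such as &amp; are left as-is by this simple regex; if your content uses them, an extra decoding pass would be needed.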

Split articles into overlapping chunks of words

Each article is a big block of text, and it must be split into smaller overlapping chunks. It is like slicing a loaf of bread, where each slice overlaps a bit with the one before it.

def split_text_by_words(text: str, post_id: int, chunk_size: int = 400, overlap: int = 80) -> List[Dict[str, Any]]:
    """Split text into overlapping chunks of words"""

    words = text.split()
    chunks = []
    start = 0
    chunk_id = 0

    while start < len(words):
        end = start + chunk_size
        chunk_words = words[start:end]
        chunk_text = ' '.join(chunk_words)

        char_start = len(' '.join(words[:start]))
        char_end = len(' '.join(words[:end]))

        chunks.append({
            "post_id": post_id,
            "chunk_id": chunk_id,
            "char_start": char_start,
            "char_end": char_end,
            "text": chunk_text
        })

        start += chunk_size - overlap
        chunk_id += 1

    return chunks

Chunking in RAG systems presents several challenges, including issues with chunk size (too small or too large), semantic coherence, and context understanding. These challenges can negatively impact retrieval accuracy, efficiency, and precision. Addressing them requires a carefully balanced approach to optimize chunking strategies while preserving computational efficiency and maintaining nuanced contextual understanding.

chunk_size: Defines the number of words per chunk. In this case, I’ve chosen 400 words. The ideal value can vary depending on your specific use case and content structure.

overlap: Specifies how much each chunk overlaps with the next one—here, it’s set to 20% of the chunk size. This overlap helps preserve context across chunks, improving continuity and retrieval accuracy.

Each chunk contains the id of the article (post_id), the id of the chunk (chunk_id) and the chunk itself (text). char_start & char_end are only there for debugging purposes.
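To see the sliding window in action, here is the function applied to a tiny text with a deliberately small chunk size (the function is repeated so the snippet runs standalone):

```python
from typing import Any, Dict, List

def split_text_by_words(text: str, post_id: int, chunk_size: int = 400, overlap: int = 80) -> List[Dict[str, Any]]:
    """Split text into overlapping chunks of words (same as above)."""
    words = text.split()
    chunks = []
    start = 0
    chunk_id = 0
    while start < len(words):
        end = start + chunk_size
        chunk_words = words[start:end]
        chunks.append({
            "post_id": post_id,
            "chunk_id": chunk_id,
            "char_start": len(' '.join(words[:start])),
            "char_end": len(' '.join(words[:end])),
            "text": ' '.join(chunk_words)
        })
        start += chunk_size - overlap  # slide forward, keeping `overlap` words
        chunk_id += 1
    return chunks

text = "one two three four five six seven eight nine ten"
for c in split_text_by_words(text, post_id=1, chunk_size=4, overlap=1):
    print(c["chunk_id"], "→", c["text"])
```

With 10 words, a chunk size of 4 and an overlap of 1, the window advances 3 words at a time, so the last word of each chunk is repeated as the first word of the next one.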

Generate embeddings for a given article

Now it is time to generate an embedding (aka a numerical vector) for each chunk. Think of an embedding as a way to turn text into numbers so that machines can understand and compare it.

We are using HeatWave’s ML_EMBED_ROW stored function to generate embeddings:

def embed_row(db_config: Dict[str, Any], search_item: str, embed_options: str = '{"model_id": "minilm"}') -> str:
    """Generate an embedding for a given text using the ML_EMBED_ROW function"""

    with mysql.connector.connect(**db_config) as conn:
        with conn.cursor() as cursor:
            sql_embed_row = 'SELECT VECTOR_TO_STRING(vect) FROM (SELECT sys.ML_EMBED_ROW(%s, %s) AS vect) AS dt;'
            cursor.execute(sql_embed_row, (search_item, embed_options))
            result = cursor.fetchone()

            return "".join(result) if result else ""

The trick here in Python is obtaining the string representation of the vector, which we can do using the VECTOR_TO_STRING function.

embed_options: specifies the settings for the embedding model. Here we are using "minilm".
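If you ever need the vector back on the Python side, the string returned by VECTOR_TO_STRING can be parsed into a list of floats. I’m assuming here the usual bracketed, comma-separated format (e.g. "[0.12,-0.05,0.33]"), which json.loads handles directly:

```python
import json
from typing import List

def parse_vector_string(vector_string: str) -> List[float]:
    """Parse the textual vector returned by VECTOR_TO_STRING into a list of floats."""
    if not vector_string:
        return []
    return [float(x) for x in json.loads(vector_string)]

# Hypothetical (shortened) output of embed_row(); a real embedding
# has hundreds of dimensions.
print(parse_vector_string("[0.12,-0.05,0.33]"))  # → [0.12, -0.05, 0.33]
```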

Insert article chunks with their embeddings into HeatWave

Now it is time to store the text chunks and their corresponding embeddings in the HeatWave vector store — specifically in the wp_post_chunks_embeddings_minilm table designed for this purpose:

def insert_chunks_into_mysql(chunks: List[Dict[str, Any]], db_config: Dict[str, Any], batch_size: int = 900):
    """Insert text chunks with embeddings into the database"""

    with mysql.connector.connect(**db_config) as conn:
        with conn.cursor() as cursor:
            insert_query = """
                INSERT INTO wp_post_chunks_embeddings_minilm (post_id, chunk_id, chunk_text, chunk_embedding)
                VALUES (%s, %s, %s, STRING_TO_VECTOR(%s))
            """
            buffer = []
            for i, chunk in enumerate(chunks, 1):
                chunk_embedding = embed_row(db_config, chunk['text'])
                buffer.append((
                    chunk['post_id'],
                    chunk['chunk_id'],
                    chunk['text'],
                    chunk_embedding
                ))

                if i % batch_size == 0:
                    cursor.executemany(insert_query, buffer)
                    conn.commit()
                    buffer.clear()

            if buffer:
                cursor.executemany(insert_query, buffer)
                conn.commit()

    print(f"Inserted {len(chunks)} chunks into database.")

batch_size: defines how many chunks are inserted into the database at once. In this case, once the buffer reaches 900 chunks, the data is flushed in bulk — a much more efficient approach than inserting each chunk individually, both in terms of performance and memory usage.

A key step here is converting the embeddings into HeatWave’s native vector type before insertion. This is done using the STRING_TO_VECTOR function.
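Putting it all together, the whole pipeline can be driven by a small loop: fetch a batch, clean, chunk, insert, then move the offset forward. The sketch below takes the functions defined above as parameters (which also makes the loop easy to test in isolation); it is an illustration, not production code:

```python
from typing import Any, Callable, Dict, List

def process_all_articles(db_config: Dict[str, Any],
                         fetch: Callable, clean: Callable,
                         chunk: Callable, insert: Callable,
                         batch_size: int = 10) -> int:
    """Run fetch -> clean -> chunk -> insert over all articles, batch by batch."""
    offset = 0
    total_chunks = 0
    while True:
        articles = fetch(db_config, limit=batch_size, offset=offset)
        if not articles:
            break  # no more published articles
        chunks: List[Dict[str, Any]] = []
        for article in articles:
            text = clean(article['post_content'])
            chunks.extend(chunk(text, article['ID']))
        if chunks:
            insert(chunks, db_config)
        total_chunks += len(chunks)
        offset += batch_size
    return total_chunks
```

With the functions from this article, you would call it as process_all_articles(db_config, fetch_articles_batch, strip_html_tags_with_space, split_text_by_words, insert_chunks_into_mysql).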

Query Encoding and Vector Similarity Operations

Now it’s time to transform the user’s search query into a vector embedding, a numerical representation that captures its underlying meaning. This process, known as query encoding, allows us to go beyond simple keyword matching. Once the query is encoded as a vector, we can then perform a similarity search by comparing it against the precomputed vector embeddings of our documents within HeatWave. This comparison, often using metrics like cosine similarity, identifies the documents whose vector representations are most closely aligned with the query vector, thus retrieving the most semantically relevant results.

Encode the query into a vector embedding

Like we have seen in Build an AI-Powered Search Engine with HeatWave GenAI (part 1) and in Build an AI-Powered Search Engine with HeatWave GenAI (part 2), to generate a vector embedding for the query, we use the ML_EMBED_ROW routine.
This function applies the specified embedding model to encode the given text into a vector representation. The routine returns a VECTOR containing the numerical embedding of the text.

Using it is straightforward. Let’s define two variables: @searchItem (the text to encode) and @embeddOptions (the embedding model used for encoding):

SET @embeddOptions = '{"model_id": "minilm"}';
SET @searchItem = "Generative artificial intelligence";

-- Encode the query using the embedding model
SELECT sys.ML_EMBED_ROW(@searchItem, @embeddOptions) into @searchItemEmbedding;

Similarity search

To retrieve relevant blog content, we perform vector similarity calculations using the DISTANCE function. This function computes the distance between two vectors using the COSINE, DOT, or EUCLIDEAN distance metrics.
Here, the two vectors being compared are the encoded query (@searchItemEmbedding) and the precomputed embeddings stored in the wp_post_chunks_embeddings_minilm table (joined with wp_posts for article information).

You can now perform a cosine similarity search on the full articles using the following approach:

-- Find the most semantically relevant WordPress posts 
SELECT 
    post_title,
    post_excerpt,
    guid,
    min_distance
FROM (
    SELECT 
        p.post_title,
        p.post_excerpt,
        p.guid,
        DISTANCE(@searchItemEmbedding, e.chunk_embedding, 'COSINE') AS min_distance,
        ROW_NUMBER() OVER (PARTITION BY p.post_title ORDER BY DISTANCE(@searchItemEmbedding, e.chunk_embedding, 'COSINE')) AS rn
    FROM wp_posts p
        INNER JOIN wp_post_chunks_embeddings_minilm e ON p.ID = e.post_id
) AS ranked
WHERE rn = 1
ORDER BY min_distance ASC
LIMIT 5\G

*************************** 1. row ***************************
  post_title: HeatWave GenAI: Sentiment Analysis Made Easy-Peasy
post_excerpt: This new AI tech, called generative AI (or GenAI), can dive deep into what people are saying and tell us if they’re feeling positive, negative, or neutral.
Let’s see how HeatWave GenAI, can help you to enhance your understanding of customer sentiment, improve decision-making, and drive business success.
        guid: https://dasini.net/blog/?p=3456
min_distance: 0.4673360586166382
*************************** 2. row ***************************
  post_title: HeatWave GenAI: Your AI-Powered Content Creation Partner
post_excerpt: Generative artificial intelligence (GenAI) is reshaping the content creation landscape. By training on vast datasets, these "intelligent" systems can produce new, human-quality content across a multitude of domains.

Oracle's HeatWave GenAI (starting with version 9.0.1) is at the forefront of this revolution, offering an integrated platform that combines in-database large language models (LLMs), vector stores, and scale-out vector processing to streamline content generation.
This article explores how HeatWave GenAI is empowering businesses to produce high-quality content rapidly and effectively, making it an indispensable tool for industries demanding speed, accuracy, and security.
        guid: https://dasini.net/blog/?p=1234
min_distance: 0.4789550304412842
*************************** 3. row ***************************
  post_title: Simplifying AI Development: A Practical Guide to HeatWave GenAI’s RAG &amp; Vector Store Features
post_excerpt: This tutorial explores HeatWave GenAI, a cloud service that simplifies interacting with unstructured data using natural language. It combines large language models, vector stores, and SQL queries to enable tasks like content generation, chatbot, and retrieval-augmented generation (RAG). The focus is on RAG and how HeatWave GenAI’s architecture helps users gain insights from their data.
        guid: https://dasini.net/blog/?p=2345
min_distance: 0.5191197395324707
*************************** 4. row ***************************
  post_title: Building an Interactive LLM Chatbot with  HeatWave Using Python
post_excerpt: AI-powered applications require robust and scalable database solutions to manage and process large amounts of data efficiently. HeatWave is an excellent choice for such applications, providing high-performance OLTP, analytics, machine learning and generative artificial intelligence capabilities.

In this article, we will explore a Python 3 script that connects to an HeatWave instance and enables users to interact with different large language models (LLMs) dynamically.
        guid: https://dasini.net/blog/?p=5678
min_distance: 0.6004981994628906
*************************** 5. row ***************************
  post_title: In-Database LLMs for Efficient Text Translation with HeatWave GenAI
post_excerpt: While HeatWave GenAI excels at generating English text, its capabilities extend to translation as well. Let's explore how we can effectively translate English content into French using this powerful tool.
        guid: https://dasini.net/blog/?p=9876
min_distance: 0.6385803818702698

This SQL query is designed to retrieve the top 5 WordPress posts that are most semantically similar to a given embedding vector (@searchItemEmbedding), using cosine distance for comparison — but with a twist: it only considers the single best-matching chunk per post.

Of course, you can also apply the techniques from part 1 and part 2 to further improve the results. Methods like reranking or incorporating additional fields—such as the article title—while fine-tuning the weights can significantly enhance result relevance.

Peroration

In this third and final installment of our series on building an AI-powered search engine with HeatWave GenAI, we’ve successfully expanded our capabilities to incorporate the full content of articles into the semantic search process.

By implementing a more intricate pipeline involving fetching, cleaning, chunking, embedding, and storing article text, we’ve aimed for a deeper level of understanding and more relevant search results. This evolution, building upon the foundations laid in Parts 1 and 2 with basic embedding and similarity searches, and refined with reranking and summary utilization, now leverages the comprehensive information within each article.

The demonstrated SQL queries and Python code illustrate how HeatWave GenAI can efficiently manage and query vector embeddings derived from large text datasets, paving the way for a highly sophisticated and context-aware search experience.

While further optimizations are always possible, this three-part journey showcases the powerful potential of HeatWave GenAI for creating intelligent, in-database search solutions.

Follow me on Linkedin

Watch my videos on my YouTube channel and subscribe.

My Slideshare account.

My Speaker Deck account.

Thanks for using HeatWave & MySQL!
