{"id":7711,"date":"2025-04-15T09:42:34","date_gmt":"2025-04-15T08:42:34","guid":{"rendered":"https:\/\/dasini.net\/blog\/?p=7711"},"modified":"2025-04-15T09:42:35","modified_gmt":"2025-04-15T08:42:35","slug":"build-an-ai-powered-search-engine-with-heatwave-genai-part-3","status":"publish","type":"post","link":"https:\/\/dasini.net\/blog\/2025\/04\/15\/build-an-ai-powered-search-engine-with-heatwave-genai-part-3\/","title":{"rendered":"Build an AI-Powered Search Engine with HeatWave GenAI (part 3)"},"content":{"rendered":"\n
In Build an AI-Powered Search Engine with HeatWave GenAI (part 1)<\/a>, we introduced the fundamentals of creating an AI-powered search engine using HeatWave GenAI<\/strong>. We highlighted the advantages of semantic search powered by large language models<\/strong> over traditional SQL-based approaches and provided a hands-on guide for generating embeddings<\/strong> and running similarity searches<\/strong> \u2014 key techniques that significantly improve the retrieval of relevant content.<\/p>\n\n\n\n In the second opus \u2014 Build an AI-Powered Search Engine with HeatWave GenAI (part 2)<\/a> \u2014 we shifted our focus to improving search result quality through reranking strategies<\/strong> and the use of article summaries for embedding generation. We demonstrated how to implement these enhancements entirely within HeatWave using JavaScript-based stored procedures<\/strong>. By assigning different weights to title and excerpt distances, and generating embeddings from sanitized summaries, we boosted the precision and relevance of search results. This approach showcases HeatWave GenAI<\/strong>\u2019s ability to embed advanced AI capabilities directly within the database layer<\/strong>.<\/p>\n\n\n\n In this third installment, we\u2019ll take it a step further by incorporating full article content into the search engine. While titles, excerpts, or even summaries may work well in many cases, there are situations where deeper, more detailed information is needed to return truly relevant answers.<\/p>\n\n\n\n The process is slightly more complex than what we’ve covered so far (in part 1<\/a> & part 2<\/a>). In WordPress, article content is stored in HTML format within the To achieve this, we\u2019ll need to write a few lines of code. While this could be done directly within HeatWave using JavaScript stored procedures \u2014 as we saw in part 2: A Javascript, stored procedure & AI story<\/a> \u2014 I\u2019ll instead use the unofficial language of data: Python. 
Below are the steps we\u2019ll follow to move forward:<\/p>\n\n\n\nLet\u2019s break down each step together!<\/p>\n\n\n\nI’m using HeatWave 9.2.2:<\/p>\n\n\n\nI created a new table named From First, we need to grab a batch of published blog articles<\/strong> from HeatWave MySQL. Think of it as scooping out a small chunk of posts at a time from a big pool of content. Pagination is used to manage memory consumption.<\/p>\n\n\n\nArticles are stored in HTML format in the database, so the content must be cleaned<\/strong>: we need the raw text, i.e. without all the formatting tags. With this function we:<\/p>\n\n\n\nPlease also note that, depending on the context, extra processing such as Stemming<\/a> or Lemmatization<\/a> can be applied.<\/p>\n\n\n\nEach article is a big block of text<\/strong> and it must be split into smaller overlapping chunks<\/strong>. It is like slicing a loaf of bread, where each slice overlaps a bit with the one before it.<\/p>\n\n\n\nChunking in RAG systems presents several challenges, including issues with chunk size<\/strong> (too small or too large), semantic coherence<\/strong>, and context understanding<\/strong>. These challenges can negatively impact retrieval accuracy, efficiency, and precision. Addressing them requires a carefully balanced approach that optimizes the chunking strategy while preserving computational efficiency and nuanced contextual understanding.<\/p>\n\n\n\nEach chunk contains the id of the article ( Now it is time to generate an embedding<\/strong> (aka a numerical vector) for each chunk. 
Think of an embedding as a way to turn text into numbers so that machines can understand<\/em> and compare<\/em> it.<\/p>\n\n\n\nWe are using HeatWave’s ML_EMBED_ROW<\/strong><\/a> stored function to generate embeddings:<\/p>\n\n\n\nThe trick here in Python is obtaining the string representation of the vector<\/strong>, which we can do using the VECTOR_TO_STRING<\/strong><\/a> function.<\/p>\n\n\n\nNow it is time to store the text chunks and their corresponding embeddings in the vector store<\/strong> \u2014 HeatWave<\/strong> \u2014 specifically into the A key step here is converting the embeddings into HeatWave’s native vector type<\/strong> before insertion. This is done using the Now it’s time to transform the user’s search query into a vector embedding<\/strong>, a numerical representation that captures its underlying meaning. This process, known as query encoding<\/strong>, allows us to go beyond simple keyword matching. Once the query is encoded as a vector, we can then perform a similarity search<\/strong> by comparing it against the precomputed vector embeddings of our documents within HeatWave. This comparison, often using metrics like cosine similarity<\/strong>, identifies the documents whose vector representations are most closely aligned with the query vector, thus retrieving the most semantically relevant results.<\/p>\n\n\n\nAs we saw in <\/a>Build an AI-Powered Search Engine with HeatWave GenAI (part 1)<\/a> and in Build an AI-Powered Search Engine with HeatWave GenAI (part 2)<\/a>, to generate a vector embedding for the query, we use the ML_EMBED_ROW<\/a> routine. Using it is straightforward. 
Let\u2019s define two variables: To retrieve relevant blog content, we perform vector similarity calculations using the You can now perform a cosine similarity search <\/strong>on the full articles using the following approach:<\/p>\n\n\n\n This SQL query is designed to retrieve the top 5 WordPress posts<\/strong> that are most semantically similar<\/strong> to a given embedding vector ( Of course, you can also apply the techniques from part 1<\/a> and part<\/a> 2<\/a> to further improve the results. Methods like reranking or incorporating additional fields\u2014such as the article title\u2014while fine-tuning the weights can significantly enhance result relevance.<\/p>\n\n\n\n In this third and final installment of our series on building an AI-powered search engine with HeatWave GenAI<\/strong>, we’ve successfully expanded our capabilities to incorporate the full content of articles into the semantic search process. <\/p>\n\n\n\n By implementing a more intricate pipeline involving fetching, cleaning, chunking, embedding, and storing article text, we’ve aimed for a deeper level of understanding and more relevant search results. This evolution, building upon the foundations laid in Parts 1 and 2 with basic embedding<\/strong> and similarity searches<\/strong>, and refined with reranking<\/strong> and summary utilization, now leverages the comprehensive information within each article. <\/p>\n\n\n\n The demonstrated SQL queries<\/strong> and Python<\/strong> code illustrate how HeatWave GenAI can efficiently manage and query vector embeddings<\/strong> derived from large text datasets, paving the way for a highly sophisticated and context-aware search experience. <\/p>\n\n\n\n While further optimizations are always possible, this three-part journey showcases the powerful potential of HeatWave GenAI for creating intelligent, in-database search solutions.<\/p>\n\n\n\nWhat are we going to do?<\/h2>\n\n\n\n
post_content<\/em><\/code> column. This will be our starting point, and the goal is to generate embeddings. <\/p>\n\n\n\n
Please bear in mind that\u00a0I\u2019m not a developer<\/strong>, so\u00a0this code is provided for illustrative purposes only<\/strong>. It may contain errors or limitations. Please\u00a0use it at your own risk<\/strong>\u00a0and adapt it to your specific needs (also feel free to share back).<\/p>\n\n\n\n\n
SELECT version();\n+-------------+\n| version() |\n+-------------+\n| 9.2.2-cloud |\n+-------------+<\/code><\/pre>\n\n\n\n
Defining the embeddings storage table<\/h2>\n\n\n\n
wp_post_chunks_embeddings_minilm<\/em><\/code> to store the embeddings generated from article chunks.<\/p>\n\n\n\n
\n
chunk_embedding<\/code><\/em>: embedding of the articles<\/li>\n\n\n\n
chunk_text<\/code><\/em>: contains the part of the articles. Mostly useful for debug<\/li>\n\n\n\n
\n
chunk_id<\/code><\/em>: id of each chunk<\/li>\n\n\n\n
post_id<\/code><\/em>: useful to join the
wp_posts<\/code><\/em> table and get article information<\/li>\n\n\n\n
id<\/code><\/em> <\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n
wp_posts<\/code><\/em> table, I’ll use the
ID<\/code><\/em> (the primary key) and
post_content<\/code><\/em> which contains the article in HTML format.
All this information will be used by the semantic AI-powered search engine.<\/p>\n\n\n\nSHOW CREATE TABLE wp_post_chunks_embeddings_minilm\\G\n*************************** 1. row ***************************\n       Table: wp_post_chunks_embeddings_minilm\nCreate Table: CREATE TABLE `wp_post_chunks_embeddings_minilm` (\n  `id` bigint unsigned NOT NULL AUTO_INCREMENT,\n  `post_id` bigint unsigned DEFAULT NULL,\n  `chunk_id` bigint unsigned DEFAULT NULL,\n  `chunk_text` text,\n  `chunk_embedding` vector(2048) DEFAULT NULL,\n  PRIMARY KEY (`id`)\n) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci<\/code><\/pre>\n\n\n\n
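Before diving into the code: the Python snippets below share a few imports and a `db_config` dictionary that the listings reference without defining. A minimal sketch — every value here is a placeholder to adapt to your environment:

```python
# Shared setup assumed by the Python snippets in this article.
# Host, credentials and schema name are illustrative placeholders.
import re
from typing import Any, Dict, List

# The database snippets additionally need MySQL Connector/Python:
#   import mysql.connector

db_config: Dict[str, Any] = {
    "host": "127.0.0.1",
    "port": 3306,
    "user": "wp_user",
    "password": "********",
    "database": "wordpress",
}
```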
Fetch articles from the database<\/h2>\n\n\n\n
def fetch_articles_batch(db_config: Dict[str, Any], limit: int = 10, offset: int = 0) -> List[Dict[str, Any]]:\n \"\"\"Fetch a batch of articles from the database\"\"\"\n\n with mysql.connector.connect(**db_config) as conn:\n with conn.cursor(dictionary=True) as cursor:\n cursor.execute(\"\"\"\n SELECT ID, post_content FROM wp_posts \n WHERE post_status = 'publish' AND post_type = 'post'\n LIMIT %s OFFSET %s\n \"\"\", (limit, offset))\n\n return cursor.fetchall()<\/code><\/pre>\n\n\n\n
Remove HTML tags and normalize whitespace<\/h2>\n\n\n\n
\n
def strip_html_tags_with_space(html_string: str) -> str:\n \"\"\"Remove HTML tags and normalize whitespace\"\"\"\n\n if not html_string:\n return \"\"\n\n text = re.sub(r'<[^>]+>', ' ', html_string)\n text = re.sub(r'\\s+', ' ', text)\n\n return text.strip()<\/code><\/pre>\n\n\n\n
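As a quick sanity check of the cleaning step, tags become single spaces and runs of whitespace collapse (the function is restated here so the snippet runs standalone):

```python
import re

def strip_html_tags_with_space(html_string: str) -> str:
    """Remove HTML tags and normalize whitespace (as defined above)"""
    if not html_string:
        return ""
    text = re.sub(r'<[^>]+>', ' ', html_string)
    return re.sub(r'\s+', ' ', text).strip()

# A typical WordPress fragment:
print(strip_html_tags_with_space('<p>HeatWave <strong>GenAI</strong> rocks!</p>'))
# -> HeatWave GenAI rocks!
```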
Split articles into overlapping chunks of words<\/h2>\n\n\n\n
def split_text_by_words(text: str, post_id: int, chunk_size: int = 400, overlap: int = 80) -> List[Dict[str, Any]]:\n \"\"\"Split text into overlapping chunks of words\"\"\"\n\n words = text.split()\n chunks = []\n start = 0\n chunk_id = 0\n\n while start < len(words):\n end = start + chunk_size\n chunk_words = words[start:end]\n chunk_text = ' '.join(chunk_words)\n\n char_start = len(' '.join(words[:start]))\n char_end = len(' '.join(words[:end]))\n\n chunks.append({\n \"post_id\": post_id,\n \"chunk_id\": chunk_id,\n \"char_start\": char_start,\n \"char_end\": char_end,\n \"text\": chunk_text\n })\n\n start += chunk_size - overlap\n chunk_id += 1\n\n return chunks<\/code><\/pre>\n\n\n\n
chunk_size<\/em><\/code>: Defines the number of words per chunk. In this case, I’ve chosen 400 words. The ideal value can vary depending on your specific use case and content structure.<\/p>\n\n\n\n
overlap<\/em><\/code>: Specifies how much each chunk overlaps with the next one\u2014here, it’s set to 20% of the chunk size. This overlap helps preserve context across chunks, improving continuity and retrieval accuracy.<\/p>\n\n\n\n
post_id<\/code><\/em>), the id of the chunk (
chunk_id<\/code><\/em>) and the chunk itself (
text<\/code><\/em>). 
char_start<\/i><\/code> &
char_end<\/code><\/em> are only there for debugging purposes.<\/p>\n\n\n\n
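As a back-of-the-envelope check of the chunk geometry, the arithmetic below mirrors the window advance in `split_text_by_words` (assuming a hypothetical 1,000-word article and the defaults above):

```python
# The loop in split_text_by_words advances the window by
# chunk_size - overlap words per iteration, so chunk start
# positions for a 1,000-word article are easy to predict:
chunk_size, overlap, n_words = 400, 80, 1000
step = chunk_size - overlap            # 320 words per step
starts = list(range(0, n_words, step))
print(starts)   # [0, 320, 640, 960] -> 4 chunks, each sharing
                # its first 80 words with the previous one
```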
Generate embeddings for a given article<\/h2>\n\n\n\n
def embed_row(db_config: Dict[str, Any], search_item: str, embed_options: str = '{\"model_id\": \"minilm\"}') -> str:\n \"\"\"Generate an embedding for a given text using the ML_EMBED_ROW function\"\"\"\n\n with mysql.connector.connect(**db_config) as conn:\n with conn.cursor() as cursor:\n sql_embed_row = 'SELECT VECTOR_TO_STRING(vect) FROM (SELECT sys.ML_EMBED_ROW(%s, %s) AS vect) AS dt;'\n cursor.execute(sql_embed_row, (search_item, embed_options))\n result = cursor.fetchone()\n\n return \"\".join(result) if result else \"\"<\/code><\/pre>\n\n\n\n
embed_options<\/em><\/code>: specifies the settings for the embedding model. Here we are using
\"minilm<\/em>\"<\/code>.<\/p>\n\n\n\n
Insert article chunks with their embeddings into HeatWave<\/h2>\n\n\n\n
wp_post_chunks_embeddings_minilm<\/code><\/em> table designed for this purpose:<\/p>\n\n\n\n
def insert_chunks_into_mysql(chunks: List[Dict[str, Any]], db_config: Dict[str, Any], batch_size: int = 900):\n \"\"\"Insert text chunks with embeddings into the database\"\"\"\n\n with mysql.connector.connect(**db_config) as conn:\n with conn.cursor() as cursor:\n insert_query = \"\"\"\n INSERT INTO wp_post_chunks_embeddings_minilm (post_id, chunk_id, chunk_text, chunk_embedding)\n VALUES (%s, %s, %s, STRING_TO_VECTOR(%s))\n \"\"\"\n buffer = []\n for i, chunk in enumerate(chunks, 1):\n chunk_embedding = embed_row(db_config, chunk['text'])\n buffer.append((\n chunk['post_id'],\n chunk['chunk_id'],\n chunk['text'],\n chunk_embedding\n ))\n\n if i % batch_size == 0:\n cursor.executemany(insert_query, buffer)\n conn.commit()\n buffer.clear()\n\n if buffer:\n cursor.executemany(insert_query, buffer)\n conn.commit()\n\n print(f\"Inserted {len(chunks)} chunks into database.\")<\/code><\/pre>\n\n\n\n
batch_size<\/em><\/code>: defines how many chunks are inserted into the database at once. In this case, once the buffer reaches 900 chunks, the data is flushed in bulk \u2014 a much more efficient approach than inserting each chunk individually, both in terms of performance and memory usage.<\/p>\n\n\n\n
STRING_TO_VECTOR<\/a><\/em><\/strong><\/code> function.<\/p>\n\n\n\n
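The functions above can then be driven by a small paging loop. The original article does not show this glue code; the sketch below takes the steps as injectable callables so the flow stays easy to test, and would be wired up with `fetch_articles_batch`, `strip_html_tags_with_space`, `split_text_by_words` and `insert_chunks_into_mysql`:

```python
from typing import Any, Callable, Dict, List

def process_all_articles(
    fetch: Callable[[int, int], List[Dict[str, Any]]],
    clean: Callable[[str], str],
    chunk: Callable[[str, int], List[Dict[str, Any]]],
    store: Callable[[List[Dict[str, Any]]], None],
    batch_size: int = 10,
) -> int:
    """Page through published articles, clean and chunk each one,
    and hand the chunks to the storage callback.
    Returns the total number of chunks processed."""
    offset = 0
    total = 0
    while True:
        articles = fetch(batch_size, offset)   # e.g. fetch_articles_batch
        if not articles:
            break                              # no more published posts
        for article in articles:
            text = clean(article["post_content"])
            chunks = chunk(text, article["ID"])
            store(chunks)                      # e.g. insert_chunks_into_mysql
            total += len(chunks)
        offset += batch_size
    return total
```

With the pieces from this article it could be invoked as `process_all_articles(lambda l, o: fetch_articles_batch(db_config, l, o), strip_html_tags_with_space, split_text_by_words, lambda c: insert_chunks_into_mysql(c, db_config))`.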
Query Encoding and Vector Similarity Operations<\/h2>\n\n\n\n
Encode the query into a vector embedding<\/h3>\n\n\n\n
This function applies the specified embedding model to encode the given text into a vector representation. The routine returns a VECTOR<\/code><\/a><\/code> containing the numerical embedding of the text.<\/p>\n\n\n\n
@searchItem<\/em><\/code> (the text to encode) and
@embeddOptions<\/code><\/em> (the embedding model used for encoding):<\/p>\n\n\n\n
SET @embeddOptions = '{\"model_id\": \"minilm\"}';\nSET @searchItem = \"Generative artificial intelligence\";\n\n-- Encode the query using the embedding model\nSELECT sys.ML_EMBED_ROW(@searchItem, @embeddOptions) into @searchItemEmbedding;<\/code><\/pre>\n\n\n\n
Similarity search<\/h3>\n\n\n\n
DISTANCE<\/strong><\/a><\/code> function. This function computes the distance between two vectors using
COSINE<\/strong><\/code>,
DOT<\/strong><\/code>, or
EUCLIDEAN<\/strong><\/code> distance metrics.
Here, the two vectors being compared are the encoded query (@searchItemEmbedding<\/code>) and the precomputed embeddings stored in the
wp_posts & wp_post_chunks_embeddings_minilm<\/em><\/code> tables.<\/p>\n\n\n\n
SQL >\n-- Find the most semantically relevant WordPress posts \nSELECT \n post_title,\n post_excerpt,\n guid,\n min_distance\nFROM (\n SELECT \n p.post_title,\n p.post_excerpt,\n p.guid,\n DISTANCE(@searchItemEmbedding, e.chunk_embedding, 'COSINE') AS min_distance,\n ROW_NUMBER() OVER (PARTITION BY p.post_title ORDER BY DISTANCE(@searchItemEmbedding, e.chunk_embedding, 'COSINE')) AS rn\n FROM wp_posts p\n INNER JOIN wp_post_chunks_embeddings_minilm e ON p.ID = e.post_id\n) AS ranked\nWHERE rn = 1\nORDER BY min_distance ASC\nLIMIT 5\\G\n\n*************************** 1. row ***************************\n post_title: HeatWave GenAI: Sentiment Analysis Made Easy-Peasy\npost_excerpt: This new AI tech, called generative AI (or GenAI), can dive deep into what people are saying and tell us if they\u2019re feeling positive, negative, or neutral.\nLet\u2019s see how HeatWave GenAI, can help you to enhance your understanding of customer sentiment, improve decision-making, and drive business success.\n guid: https:\/\/dasini.net\/blog\/?p=3456\nmin_distance: 0.4673360586166382\n*************************** 2. row ***************************\n post_title: HeatWave GenAI: Your AI-Powered Content Creation Partner\npost_excerpt: Generative artificial intelligence (GenAI) is reshaping the content creation landscape. 
By training on vast datasets, these \"intelligent\" systems can produce new, human-quality content across a multitude of domains.\n\nOracle's HeatWave GenAI (starting with version 9.0.1) is at the forefront of this revolution, offering an integrated platform that combines in-database large language models (LLMs), vector stores, and scale-out vector processing to streamline content generation.\nThis article explores how HeatWave GenAI is empowering businesses to produce high-quality content rapidly and effectively, making it an indispensable tool for industries demanding speed, accuracy, and security.\n guid: https:\/\/dasini.net\/blog\/?p=1234\nmin_distance: 0.4789550304412842\n*************************** 3. row ***************************\n post_title: Simplifying AI Development: A Practical Guide to HeatWave GenAI\u2019s RAG & Vector Store Features\npost_excerpt: This tutorial explores HeatWave GenAI, a cloud service that simplifies interacting with unstructured data using natural language. It combines large language models, vector stores, and SQL queries to enable tasks like content generation, chatbot, and retrieval-augmented generation (RAG). The focus is on RAG and how HeatWave GenAI\u2019s architecture helps users gain insights from their data.\n guid: https:\/\/dasini.net\/blog\/?p=2345\nmin_distance: 0.5191197395324707\n*************************** 4. row ***************************\n post_title: Building an Interactive LLM Chatbot with HeatWave Using Python\npost_excerpt: AI-powered applications require robust and scalable database solutions to manage and process large amounts of data efficiently. 
HeatWave is an excellent choice for such applications, providing high-performance OLTP, analytics, machine learning and generative artificial intelligence capabilities.\n\nIn this article, we will explore a Python 3 script that connects to an HeatWave instance and enables users to interact with different large language models (LLMs) dynamically.\n guid: https:\/\/dasini.net\/blog\/?p=5678\nmin_distance: 0.6004981994628906\n*************************** 5. row ***************************\n post_title: In-Database LLMs for Efficient Text Translation with HeatWave GenAI\npost_excerpt: While HeatWave GenAI excels at generating English text, its capabilities extend to translation as well. Let's explore how we can effectively translate English content into French using this powerful tool.\n guid: https:\/\/dasini.net\/blog\/?p=9876\nmin_distance: 0.6385803818702698<\/code><\/pre>\n\n\n\n
@searchItemEmbedding<\/code>), using cosine distance<\/strong> for comparison \u2014 but with a twist: it only considers the single best-matching chunk per post<\/strong>.<\/p>\n\n\n\n
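For completeness, the same search can be issued from Python. The helper below only builds the parameterized SQL; it mirrors the query above, except that it partitions by `p.ID` rather than `post_title` (to avoid collisions between posts with identical titles) and adds a publish-status filter — both are my own adjustments, not part of the original query. Execution via `mysql.connector` is sketched in the trailing comment:

```python
def build_search_sql(top_k: int = 5) -> str:
    """Parameterized chunk-level similarity query: keep the single
    best-matching chunk per post, then return the top_k closest posts.
    Both %s placeholders take the query embedding string."""
    return f"""
        SELECT post_title, guid, min_distance
        FROM (
            SELECT p.post_title, p.guid,
                   DISTANCE(STRING_TO_VECTOR(%s), e.chunk_embedding, 'COSINE') AS min_distance,
                   ROW_NUMBER() OVER (
                       PARTITION BY p.ID
                       ORDER BY DISTANCE(STRING_TO_VECTOR(%s), e.chunk_embedding, 'COSINE')
                   ) AS rn
            FROM wp_posts p
            JOIN wp_post_chunks_embeddings_minilm e ON p.ID = e.post_id
            WHERE p.post_status = 'publish' AND p.post_type = 'post'
        ) AS ranked
        WHERE rn = 1
        ORDER BY min_distance ASC
        LIMIT {int(top_k)}
    """

# Execution sketch (embed_row and db_config as defined earlier):
#   embedding = embed_row(db_config, "Generative artificial intelligence")
#   cursor.execute(build_search_sql(5), (embedding, embedding))
```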
Peroration<\/h2>\n\n\n\n