Large language models (LLMs) often feel like having the world’s greatest toolkit for fixing a problem you can’t quite identify. This gap can feel particularly acute when working with search indexes such as Elasticsearch, Solr, or OpenSearch. Existing search technologies are exceptional at making rapid textual matches but cannot connect a query to documents that share none of its words.
Consider the Humble Search Bar
Finding the best search results often means going beyond what users put into the search bar. Every search bar inherits context from its surroundings. In other words, a search bar on a website for purchasing auto parts should apply an auto parts perspective to whatever the user enters. You shouldn’t expect the user to type something like “I’m searching for an auto part called a bumper.” Moreover, people often don’t have quite the right word for what they are looking for. Sometimes they have a perfectly good word, just not the word used by the results that would best serve them.
Large language models (LLMs) can bring the power of contextual inference into the search bar. A simple way to leverage that power is to identify and surface potential synonyms. This blog discusses one simple approach to managing that process, accounting for both the limitations of LLMs and the sheer volume of queries and candidate synonyms involved. Human oversight is treated as a critical component for keeping irrelevant or counterproductive synonyms out of the index, so an overarching goal is to make that human interaction as efficient and effective as possible.
A Quick Primer on LLMs
By using contextual inference, LLMs can make associations beyond direct text matches. Inside the model, each word is represented in a sort of coordinate system with hundreds, or even thousands, of dimensions. This coordinate system effectively creates innumerable conceptual vectors. There’s no way to know exactly what concepts are represented in the data. They could be as fundamental as “gender”, as weirdly specific as “astrophysics”, or as nebulous as “things people think about at 3pm”. Relatedness along these vectors is generally measured through a mechanism known as cosine similarity. Visually, the word “signal” might be pictured like this:
In this example, the red vector (1) might represent “things related to radios”, the green vector (2) might represent “cars”, and the blue vector (3) could be “astrophysics”. The word “signal” has a relationship to “wave”, “blinker”, and “pulsar” in each of these contexts, respectively.
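To make the idea concrete, here is a minimal sketch of cosine similarity. The three-dimensional vectors are invented stand-ins for the hundreds or thousands of dimensions a real model uses.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: near 1.0 means closely related."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Invented 3-dimensional embeddings; real models use hundreds or thousands of dimensions.
signal  = np.array([0.9, 0.8, 0.7])   # related to radios, cars, and astrophysics
blinker = np.array([0.1, 0.9, 0.0])   # strong on the "cars" dimension
wave    = np.array([0.8, 0.1, 0.3])   # strong on the "radios" dimension

print(cosine_similarity(signal, blinker))  # overlap via the "cars" dimension
print(cosine_similarity(signal, wave))     # overlap via the "radios" dimension
```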
Finding the Right Perspective
The diagram above is a limited representation of generalized word embeddings from LLMs such as Google’s Gemini or Meta’s Llama, both of which are built from massive datasets with no particular focus. A search for “signal” on the aforementioned auto parts website would ideally utilize only vector 2, while a search across scientific papers might make use of vectors 1 and 3. As a search professional, the goal is to favor the vectors that best match the user’s perspective when they fill in that search box.
Is This a Problem?
It might not seem like a big deal. After all, there’s no car part called a “wave”. Unfortunately, the store could sell a window cleaner called “Bright Wave”, or the product description for a window flag might include the text “waving in the wind.” A search for “signal” could then conceivably return window cleaner and sports team flags when the user most likely wants a replacement blinker. In many ways, this is a typical synonym issue in Solr, Elasticsearch, OpenSearch, and other BM25-based tools. Since synonyms apply universally across a field, it’s important to find the sweet spot where documents relevant to the user are returned without generating too many false positives.
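For reference, below is roughly what reviewed synonyms look like once they reach the engine, expressed here as Elasticsearch analysis settings in Python dictionary form. The filter and analyzer names are illustrative, and Solr and OpenSearch have equivalent mechanisms. Because the filter applies to the whole field, one bad pairing degrades every query that touches it.

```python
# Illustrative Elasticsearch analysis settings (filter/analyzer names are made up).
# The synonym filter applies field-wide, which is why each pair needs vetting.
index_settings = {
    "analysis": {
        "filter": {
            "reviewed_synonyms": {
                "type": "synonym_graph",
                "synonyms": [
                    "signal, blinker, turn signal",  # approved by a human reviewer
                    # "signal, wave" deliberately excluded: it matches window
                    # cleaner and flag descriptions, not auto parts.
                ],
            }
        },
        "analyzer": {
            "synonym_search_analyzer": {
                "tokenizer": "standard",
                "filter": ["lowercase", "reviewed_synonyms"],
            }
        },
    }
}
```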
Developing an LLM-to-Search-Synonym Pipeline
Since you never know what matches an LLM might return, validating synonyms before implementing them is critical. On a large commerce site, you might have tens of thousands of words users search on and hundreds of thousands of potential synonyms for those words. Turning that task into something a single subject matter expert or search professional can manage will take some planning.
Scoring!
The simplest mechanism for reducing the potential synonyms a human needs to consider is the vector search score. LLM-based lookups of this kind are performed as vector searches, which assign each potential synonym a score based on distance or relevancy, and that score determines the sort order. A vector search returns exactly the number of values requested because, technically, every document in the vector database is a “match”, just with increasingly bad relevancy.
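Here is a minimal sketch of that behavior, using brute-force cosine similarity over an in-memory matrix; a real deployment would use a vector database, and the vocabulary and vectors are stand-ins. The point is that asking for k candidates always yields k candidates, ranked by score, however weak the tail gets.

```python
import numpy as np

def top_k_synonyms(query_vec, vocab_vecs, vocab_words, k=5):
    """Return exactly k (word, score) pairs, best first.

    Every word in the vocabulary is technically a "match";
    the score simply gets worse further down the list.
    """
    # Cosine similarity of the query against every candidate at once.
    norms = np.linalg.norm(vocab_vecs, axis=1) * np.linalg.norm(query_vec)
    scores = (vocab_vecs @ query_vec) / norms
    best = np.argsort(scores)[::-1][:k]
    return [(vocab_words[i], float(scores[i])) for i in best]
```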
Let’s consider a log with 250,000 historical user searches. A search team could go through this log trying to think of potential synonyms that may find suitable results in a product catalog. Assuming they are doing due diligence and averaging roughly 15 seconds on each search, that log will require about 1,000 man-hours to process. So, the first thing to do is limit the maximum number of potential synonyms returned for each user search to something reasonable, like 5. Now the search team doesn’t need to imagine possible matches; they can simply look at the 5 that came back for each query and think “yes” or “no”. Less thinking means less time, and this alone could shave off a solid third.
That still leaves about 650 man-hours, but 5 options is only a maximum. After all, not every word a user searches will have 5 words even in the ballpark for use as a synonym. In fact, having 5 suitable synonyms is rare; it’s far more likely that none are available at all. Just because the LLM thinks one word is more like the user’s query than any other doesn’t mean the closest word is even worthy of consideration. This is where applying a threshold to the scores can drastically reduce the possibilities a search team needs to consider. Even a reasonable threshold could cut the word/synonym pairs up for consideration to less than 5% of the total “matches” made, turning 650 man-hours into a much more manageable 30 to 35. That is the kind of effort a first-time evaluation might require; iterative cycles might take just an hour a week once common synonym pairings have been evaluated by the search team.
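Continuing the sketch above, a threshold turns “exactly 5 candidates per query” into “0 to 5 candidates worth a human’s time”. The 0.75 cutoff here is an invented value; in practice it would be tuned against a labeled sample of your own query/synonym pairs.

```python
SCORE_THRESHOLD = 0.75  # invented cutoff; tune against a labeled sample

def candidates_for_review(query_vec, vocab_vecs, vocab_words, k=5):
    """Top-k candidates with anything below the threshold dropped.

    Reuses top_k_synonyms() from the earlier sketch. Most queries
    end up with zero surviving pairs, which is exactly the point.
    """
    return [(word, score)
            for word, score in top_k_synonyms(query_vec, vocab_vecs, vocab_words, k)
            if score >= SCORE_THRESHOLD]

# Back-of-envelope workload: 250,000 searches x 15 s each is roughly
# 1,000 man-hours unaided. Capping review at 5 candidates per query cuts
# that to ~650 hours; keeping under 5% of pairs via the threshold brings
# a first pass down to roughly 30 to 35 hours.
```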
Reduce Potential Synonyms
Another way to improve LLM results is to exclude words that should never be used as a synonym. It doesn’t matter if the LLM thinks “wave” is a good match for “signal” if the vector database doesn’t contain “wave” as an option. This pruning can happen automatically, as part of how potential synonyms are gathered and filtered before being added to the database.
For instance, a good way to build a pool of potential synonyms is to draw them from a product description field. However, description fields contain a lot of terms that LLMs consider highly interchangeable. This is notoriously true for numerical indicators. Setting up a process to block numerical words (such as “one” or even the actual number 1), as sketched below, removes the risk of them ever being offered as synonyms.
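A minimal sketch of such a gate, applied before candidate terms are embedded and indexed; the list of spelled-out numbers is deliberately truncated and would be extended for a real catalog.

```python
import re

# Truncated list of spelled-out numbers; extend for a real catalog.
NUMBER_WORDS = {"zero", "one", "two", "three", "four", "five",
                "six", "seven", "eight", "nine", "ten"}

def is_allowed_candidate(term: str) -> bool:
    """Reject digits and spelled-out numbers before they reach the vector database."""
    t = term.lower().strip()
    return t not in NUMBER_WORDS and not re.search(r"\d", t)

terms = ["blinker", "one", "1", "wiper", "3.5mm"]
print([t for t in terms if is_allowed_candidate(t)])  # ['blinker', 'wiper']
```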
Custom Source Embeddings
This is a little technical, so the details won’t be discussed here, but an LLM can be used to generate a reduced set of embeddings based on an overarching phrase or concept. If the search is specifically for auto parts, the LLM can produce a smaller model tuned to that idea. A well-constructed phrase yields a much better set of embeddings to search against, which improves scores and makes the threshold filtering more effective. Developing a good set of embeddings, however, is a non-trivial task.
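One lightweight approximation of this idea, rather than a full custom model, is to condition each term’s embedding on a domain phrase at encoding time. This sketch assumes the sentence-transformers library and a generic model name; the domain phrase itself is an invented example you would iterate on.

```python
from sentence_transformers import SentenceTransformer  # assumed dependency

model = SentenceTransformer("all-MiniLM-L6-v2")  # generic model, for illustration only

DOMAIN_PHRASE = "an auto part sold in a car parts store"  # invented; iterate on wording

def embed_in_domain(terms):
    """Embed each term alongside the domain phrase, nudging a word like
    "signal" toward blinkers rather than radios or pulsars."""
    return model.encode([f"{term}, {DOMAIN_PHRASE}" for term in terms])

vectors = embed_in_domain(["signal", "blinker", "wave"])
```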
In Conclusion
LLMs can surface synonym suggestions that might otherwise have been overlooked. However, some human intervention is required to ensure synonyms that would adversely affect the user experience never get added. The key to a useful implementation is reducing the number of user searches and potential synonyms a search team needs to evaluate: instead of 250,000 searches with an undefined set of possibilities, a good system may leave the team with only a few hundred legitimate pairs to consider, in a far more readable format. It may be helpful to build your own synonym management application based on these principles, or to take advantage of an existing product such as our FindTuner application. Good luck on your own path to turning an LLM into a synonym-generating machine!