AEM Search Indexing: Synonyms, Filters, and Stop Words (oh my!)
Editor's Note: Ali presents on AEM Search Index Configuration at Adobe IMMERSE'18 on June 14. Ali is one of five engineers from HS2 presenting at IMMERSE.
Adobe Experience Manager (AEM) offers powerful search capabilities under the hood that can make your site searches more robust and return the results your customers are looking for. One way to improve search is to set up indexing options that let us match words by their synonyms, index words based on various rules, select which words to ignore, remove HTML code, and more.
Synonyms tell AEM that a search for one word should also search for others. When searching for "Airplane", one may also want to find "plane" or "jet". Other options ensure that all words that are searched and indexed are automatically lowercased, have HTML stripped out, are split on whitespace characters, and so on. By updating the index definition files used by AEM, we can reindex our content so that queries pick up these custom options.
How It Works
By default, AEM uses Apache Lucene for full-text indexing, configured through the OOTB /oak:index/lucene node. We will customize this node to modify our searches by adding synonyms, filters, stopwords, and more. We start by adding an "analyzers" node to the existing lucene node in the JCR with the correct configuration.
Refer to https://jackrabbit.apache.org/oak/docs/query/lucene.html#Create_analyzer_via_composition for official documentation.
In the codebase, the change is made in /jcr_root/oak:index/.content.xml. The analyzers node holds a default analyzer, which in turn contains charFilters, tokenizer, and filters child nodes. The example here changes the OOTB lucene index; you can apply the same configuration to specific indexes to suit your own approach.
Don't forget to place the synonym.txt file under /jcr_root/oak:index/lucene/analyzers/default/filters/Synonym/synonym.txt and stop.txt under /jcr_root/oak:index/lucene/analyzers/default/filters/Stop/stop.txt.
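To make the node layout easier to picture, here is a minimal sketch that prints an outline of the analyzer nodes. It is an illustration, not the author's actual .content.xml: the node and property names (charFilters, tokenizer, filters, words, synonyms) follow the Oak documentation linked above, and the particular set of filters is an assumption based on the ones discussed in this article.

```python
# Hypothetical outline of the analyzer nodes under /oak:index/lucene.
# Keys starting with "@" are properties; all other keys are child nodes.
# The chosen filters are an assumption based on this article's examples.
ANALYZERS = {
    "analyzers": {
        "default": {
            "charFilters": {"HTMLStrip": {}},
            "tokenizer": {"@name": "Classic"},
            "filters": {
                "LowerCase": {},
                "Stop": {"@words": "stop.txt", "stop.txt": {}},
                "Synonym": {"@synonyms": "synonym.txt", "synonym.txt": {}},
            },
        }
    }
}


def render(tree, depth=0):
    """Return the tree as indented outline lines (+ node, - property)."""
    lines = []
    indent = "  " * depth
    for key, value in tree.items():
        if key.startswith("@"):
            lines.append(f"{indent}- {key[1:]} = {value!r}")
        else:
            lines.append(f"{indent}+ {key}")
            lines.extend(render(value, depth + 1))
    return lines


print("\n".join(render(ANALYZERS)))
```

The order of the filters matters because each one sees the output of the previous one. With LowerCase ahead of Synonym, for instance, the entries in synonym.txt should be lowercase as well.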
The synonym.txt file is a simple comma-separated list of synonyms. All matching terms should sit on a single row, and searching for any word in a row matches every other word in that row. Common uses for synonyms are matching on variations of a word (like plane/airplane or bulb/lightbulb). Synonyms should be relevant to the search terms you expect to be used within your application. One suggestion is to log any failed searches (0 results) and review them to determine whether more synonyms should be added. For example, a row reading "plane, airplane, jet" makes a search for any one of those terms match the other two.
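The row-matching rule can be illustrated with a short Python sketch. This is not how Lucene implements synonyms internally; it only mimics the behavior for simple comma-separated rows. The sample rows and the function names are made up for illustration.

```python
def load_synonyms(text):
    """Map every term to the full set of terms that share its row."""
    table = {}
    for row in text.splitlines():
        row = row.strip()
        if not row or row.startswith("#"):
            continue  # skip blank lines and comments
        terms = {t.strip().lower() for t in row.split(",") if t.strip()}
        for term in terms:
            table.setdefault(term, set()).update(terms)
    return table


def expand(term, table):
    """Return the sorted terms a search for `term` should also match."""
    key = term.lower()
    return sorted(table.get(key, {key}))


# Hypothetical synonym.txt content using the word pairs from this article.
SAMPLE = "plane, airplane, jet\nbulb, lightbulb\n"
table = load_synonyms(SAMPLE)
print(expand("Airplane", table))
```

A term that appears in no row simply expands to itself, which is why logging zero-result searches is a useful way to find rows worth adding.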
Filters & Tokenizers
Many different filters exist to modify indexed content. Some filters will remove code, replace patterns, convert accented characters, and more. In our example above, we've used the following filters:
- HTMLStrip (a CharFilter): Strips out all HTML/XML code from the indexed content. This helps make searches more precise.
- CharFilters in general: Let you change the text being indexed or searched before it is split into tokens (like using HTMLStrip above, or converting accented characters like é into non-accented characters like e).
- For more examples of CharFilters and how they're used, see https://lucene.apache.org/solr/guide/6_6/charfilterfactories.html
- LowerCase (a filter): Converts all searches and indexed content to lowercase. Without it, "Happy" doesn't match "happy". By ensuring both the stored index and the search query are always lowercase, we don't have to worry about matching words based on case.
- For more examples of filters and how they're used, see https://lucene.apache.org/solr/guide/6_6/filter-descriptions.html
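- Tokenizers: Convert a user's text input into individual "tokens" that will be used in the search. Which tokenizer you use determines how the input is split up. There are multiple types of tokenizers. In our example above, I've used the "classic" tokenizer.
- Classic tokenizer behavior: Splits text on whitespace and punctuation, but keeps email addresses and internet domain names as single tokens and does not split a hyphenated word that contains a number.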
- For more examples of tokenizers and how they're used, see https://lucene.apache.org/solr/guide/6_6/tokenizers.html
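The three stages in the list above (char filter, tokenizer, token filter) run in that order. The following Python sketch imitates the chain with regular expressions. It is a rough approximation for illustration only: the real HTMLStrip char filter and Classic tokenizer handle many more cases, and the function names here are invented.

```python
import html
import re

TAG = re.compile(r"<[^>]+>")  # crude stand-in for the HTMLStrip char filter
WORD = re.compile(r"\w+")     # crude stand-in for a tokenizer


def strip_html(text):
    """Char filter stage: remove tags and decode entities such as &amp;."""
    return html.unescape(TAG.sub(" ", text))


def tokenize(text):
    """Tokenizer stage: split the text into word tokens."""
    return WORD.findall(text)


def lowercase(tokens):
    """Token filter stage: lowercase every token."""
    return [t.lower() for t in tokens]


def analyze(text):
    """Run the stages in order: char filter, tokenizer, token filter."""
    return lowercase(tokenize(strip_html(text)))


print(analyze("<p>Happy <b>Landings</b>, Pilots!</p>"))
```

Because the same chain is applied to both the indexed content and the query, `analyze("Happy")` and `analyze("happy")` produce the same token, which is what lets the two match.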
Stopwords tell AEM not to index or query on specified words, and are used to avoid indexing simple words. These words are usually ignored so we don't match the wrong results. In a search for "Cans and Bottles", the actual terms the user is looking for are "Cans" and "Bottles". To avoid missing results that contain "cans" and "bottles" but not the word "and", include "and" as a stopword. The stop.txt file lists one word per line; words that are often ignored in searches include short function words such as "a", "and", "of", and "the" (note this is a very limited set of example words and you can use as many or as few as you like).
By default, AEM does not use any stopwords. Adding a stopwords file can make searches more effective, since queries no longer get hung up on common words that may or may not appear in the content (like our Cans and Bottles example).
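As a small illustration of the Cans and Bottles example, the sketch below filters stopwords out of a tokenized query. The stop.txt content and the function names are assumptions made for the example; Lucene applies the same idea inside its Stop filter.

```python
def load_stopwords(text):
    """Read one stopword per line, ignoring blank lines and # comments."""
    words = set()
    for line in text.splitlines():
        line = line.strip().lower()
        if line and not line.startswith("#"):
            words.add(line)
    return words


def remove_stopwords(tokens, stopwords):
    """Keep only the tokens that are not stopwords (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stopwords]


# Hypothetical stop.txt content: a deliberately tiny example list.
STOP_TXT = "a\nand\nof\nthe\n"
stopwords = load_stopwords(STOP_TXT)
print(remove_stopwords("Cans and Bottles".split(), stopwords))
```

With "and" removed, the query no longer requires a word that the matching content may not contain.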
After updating the analyzer configuration (synonyms, stopwords, or filters), you can trigger a reindex by setting the reindex property on the /oak:index/lucene node to true. The reindex starts when the change is saved.
Alternatively, if ACS AEM Commons is installed, you can navigate to /etc/acs-commons/oak-index-manager.html on the AEM server and click the reindex button for the correct index (lucene in our example).
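The first option, setting the reindex property, can also be scripted. The sketch below builds (but does not send) an HTTP POST to the Sling POST servlet that would set reindex to true. The host, port, and admin credentials are placeholder assumptions for a local author instance, and whether the request is accepted depends on how your instance's security filters are configured.

```python
import base64
import urllib.parse
import urllib.request


def build_reindex_request(base_url, index_path="/oak:index/lucene",
                          user="admin", password="admin"):
    """Build a POST request that sets reindex=true on the index node.

    The defaults (path, user, password) are placeholders, not recommendations.
    """
    body = urllib.parse.urlencode({
        "reindex": "true",
        # Ask the Sling POST servlet to store a Boolean rather than a String.
        "reindex@TypeHint": "Boolean",
    }).encode("ascii")
    request = urllib.request.Request(
        base_url.rstrip("/") + index_path, data=body, method="POST")
    token = base64.b64encode(f"{user}:{password}".encode("utf-8"))
    request.add_header("Authorization", "Basic " + token.decode("ascii"))
    return request


req = build_reindex_request("http://localhost:4502")
print(req.get_method(), req.full_url)
# To actually trigger the reindex, send it: urllib.request.urlopen(req)
```

Keeping the build step separate from the send step makes it easy to inspect the request first, which is worth doing because reindexing a large index can take a while.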
We skimmed a lot of topics around indexing options in AEM by modifying Lucene indexes. From searching on word synonyms (so searching on "plane" can match "airplane") to stripping out unwanted text and even ignoring certain words, we can start to build far more robust search functionality. There are many more filters and tokenizers to choose from, and for a deeper dive you can find more documentation at https://lucene.apache.org/solr/guide/6_6/understanding-analyzers-tokenizers-and-filters.html.