Introducing txtai, a Python package for creating a semantic search engine. In this practical post and its supporting Google Colaboratory notebook, we look at how you could use semantic search on your domain to make your content or products more accessible, and how to create a semantically related internal linking widget.

Before we get into it: if you are more interested in playing around with some code to implement semantic search and internal linking automation, jump to the following section.

What do we mean by semantic search?

Semantic search is searching with “meaning” – there is an understanding of the context between words, as opposed to lexical search, where a search engine looks for a literal match of a word or a set of words.

Semantic search has historically been pretty difficult. Think back to the early days of the internet when search engines were starting out. The results returned were often exact-match (i.e. the exact keyword was found in the webpage) or very close variations of the word (in other words, lexical search).

That meant webmasters had to use keyword synonyms within page content to rank for them. For example, if you sold sofas online you might have included all variations of sofa, like ‘couches’ and ‘settees’, because if you didn’t, it was very likely that you wouldn’t pop up in SERPs for those terms. Hence the widespread practice of keyword stuffing in the late 90s and early noughties.
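To make the limitation concrete, here is a minimal sketch of a purely lexical search in Python. The product titles are invented for this example, and the matching rule (every query word must literally appear in the title) is a simplification rather than how any particular search engine worked.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def lexical_search(query, documents):
    """Return the documents that literally contain every word in the query."""
    query_tokens = tokenize(query)
    return [doc for doc in documents if query_tokens <= tokenize(doc)]


# Hypothetical product titles, invented for illustration
products = [
    "Grey three-seater sofa",
    "Velvet corner sofa with footstool",
    "Oak dining table",
]

print(lexical_search("sofa", products))   # matches the two sofa listings
print(lexical_search("couch", products))  # matches nothing: no literal 'couch'
```

A shopper typing ‘couch’ gets nothing back here even though two relevant products exist, which is exactly the gap that keyword stuffing tried to paper over.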

Over the last couple of decades, Google and other search engines have invested heavily in the field of natural language processing (NLP) to help with semantic search. Going back to our previous example, if you search for ‘couch’ today, you’ll get results that include words like ‘sofa’ and ‘settee’. Search engines now “understand” that these words are synonymous.

If you grew up in the late 90s or early noughties, you can probably recall search engines’ limited capacity to understand language. And if you’re a real computer nerd, you might even have heard of the term “googlewhack”: a game in which you try to find a search result with one listing by typing in just two words from the dictionary. Today this has become nearly impossible to achieve due to search engines’ ability to better understand language (and the fact that the internet has grown vastly).

Without getting into the weeds of this vast topic area, you can see how NLP is used to process text and “understand” the relationships between words.

  

Implementing semantic search and internal linking widgets

In the next section of this blog I will be doing some simple programming in Python to build a semantic search engine. This “search engine” will be an internal site search for content on Melt Digital’s Knowledge Hub (although it will not be deployed to the site). We will then use this internal search engine to demo how we could create an internal linking widget/module for related posts (again, it will not be deployed), to help our users find similar content on our Knowledge Hub.

You might be thinking: why would I want or need semantic search on my website? Think about it from the perspective of an ecommerce site with a lot of inventory, like Amazon or ASOS.

Users typically carry out a search on these sites to find what they’re looking for. If the results are poor, it not only makes for an unsatisfactory user experience, but Amazon and ASOS also miss out on revenue because they are unable to surface all relevant results.

And now think about the power of linking to related products. Your customer has just added a sofa to their basket – how about a footstool to go with it? It’s quite easy to understand how linking to related products or content is powerful for revenue, but it’s also good for SEO because you’re building authority for given products or topics.

The consensus, loosely speaking, is that having tightly knit products or topics – by way of internal links – creates authority, leading to improved rankings and thus more traffic.

Getting started

For the remainder of this post I’ll be showing some code snippets and screen grabs from a Python notebook in Google Colaboratory. You can find the notebook here and make a copy to tweak for your use case.

Introducing txtai

txtai is a package written in Python that uses machine learning workflows to build a semantically searchable index. That’s a bit of a mouthful but essentially it’s a package used to ingest content and create a semantic search engine.

So, for example, if you search for ‘pizza’, you’ll get results that contain the word ‘pizza’, but you may also get results that don’t contain the word ‘pizza’ and instead include words like ‘cheese’ or ‘pepperoni’ – words that are often closely associated with ‘pizza’.

Building our search engine with txtai

To build our search engine, we need txtai to ingest all the content that we want to be searchable. So, I have taken all the blog posts from Knowledge Hub, along with their meta titles, the author of the articles and the date they were published:

We feed this to our txtai model in the required format. This might take some time depending upon the number of pages you’re using, but once it’s complete you can search your index. The notebook contains all this code, but you can also use the txtai documentation as a reference.
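As a rough sketch of that step (the notebook remains the reference), indexing could look something like the code below. The posts are placeholders invented for illustration, the helper names `to_index_rows` and `build_index` are my own, and the model path is an assumption; any sentence-transformers model supported by txtai should work. I have also assumed that only the titles are indexed.

```python
# Placeholder posts, invented for illustration; swap in your own content
posts = [
    {"title": "How the Women's Euros changed search trends", "author": "A. Writer", "date": "2022-08-01"},
    {"title": "A beginner's guide to internal linking", "author": "B. Writer", "date": "2022-06-15"},
    {"title": "Tournament season: what marketers can learn", "author": "C. Writer", "date": "2022-07-20"},
]


def to_index_rows(posts):
    """Convert post dicts into the (id, text, tags) tuples that txtai's index() accepts."""
    # Assumption: only the title is indexed; the list position doubles as the id
    return [(uid, post["title"], None) for uid, post in enumerate(posts)]


def build_index(posts, model_path="sentence-transformers/all-MiniLM-L6-v2"):
    """Build a txtai index over the posts. Requires `pip install txtai`."""
    # Imported here so the helper above can be run without txtai installed
    from txtai.embeddings import Embeddings

    embeddings = Embeddings({"path": model_path})
    embeddings.index(to_index_rows(posts))
    return embeddings


rows = to_index_rows(posts)
print(rows[0])
```

Calling `build_index(posts)` downloads the model on first use. After that, `embeddings.search("football", 3)` returns a list of `(id, score)` tuples, where each id is a position in `posts`.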

I’ve created a custom function in which you input a keyword and the function returns some articles in HTML. I’ve used IPython to display the HTML within my notebook to demonstrate how this could look on a website.  Running my function with the word ‘football’, we get the following HTML response:

You’ll notice that the first article, which is about changing search trends in response to this year’s Women’s Euros, contains the word ‘football’. In contrast, the second article has no mention of football in its title, but txtai has been able to “understand” that ‘Euros’, ‘tournament’ and other words are related to football. This is just a small example of how powerful semantic search can be.
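The custom function mentioned above could be sketched roughly as follows. This is not the exact notebook code: the function name `search_to_html`, the post fields and the URLs are all assumptions, and `FakeEmbeddings` is a stand-in that returns fixed results so the example runs without txtai. In practice you would pass in a real txtai index, whose `search` method returns `(id, score)` tuples in the same shape.

```python
from html import escape

# Placeholder posts and URLs, invented for illustration
posts = [
    {"title": "How the Women's Euros changed search trends", "author": "A. Writer",
     "date": "2022-08-01", "url": "https://example.com/euros-search-trends"},
    {"title": "A beginner's guide to internal linking", "author": "B. Writer",
     "date": "2022-06-15", "url": "https://example.com/internal-linking"},
    {"title": "Tournament season: what marketers can learn", "author": "C. Writer",
     "date": "2022-07-20", "url": "https://example.com/tournament-season"},
]


class FakeEmbeddings:
    """Stand-in for a txtai index: returns fixed (id, score) pairs for any query."""

    def search(self, query, limit=3):
        return [(0, 0.62), (2, 0.41), (1, 0.12)][:limit]


def search_to_html(embeddings, posts, keyword, limit=3):
    """Search the index for a keyword and render the hits as an HTML list."""
    items = []
    for uid, _score in embeddings.search(keyword, limit):
        post = posts[uid]
        items.append(
            '<li><a href="{url}">{title}</a> by {author}, {date}</li>'.format(
                url=escape(post["url"], quote=True),
                title=escape(post["title"]),
                author=escape(post["author"]),
                date=escape(post["date"]),
            )
        )
    return "<ul>\n" + "\n".join(items) + "\n</ul>"


widget_html = search_to_html(FakeEmbeddings(), posts, "football", limit=2)
print(widget_html)
# In a notebook you could render it with: from IPython.display import HTML; HTML(widget_html)
```

Escaping the titles and URLs is a small precaution worth keeping if the content comes from a CMS or a scrape, since stray characters would otherwise break the markup.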

Site search

Using txtai, we have built a basic semantic search engine – just think of what is possible when it’s tailored to your business. At a very basic level, you’re making content more accessible by search. At a deeper level you’re giving your audience what they want and need – and that means more business for you.

Again, thinking about a furniture ecommerce site, you could feed product names or descriptions into txtai and use it to make your products more discoverable, regardless of whether someone searches for ‘sofa’, ‘couch’ or ‘settee’.

Deploying this search engine to a site goes beyond the remit of this post, but to get started you could consider building a microservice in Python – Django, Flask or AWS Lambda are all popular choices.
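To give a flavour of what such a microservice might involve, here is a minimal sketch of an AWS Lambda-style handler. It assumes the function sits behind an API Gateway proxy integration (hence the `queryStringParameters` key) and that the query arrives as a `q` parameter. The `search_titles` function is a placeholder I have made up; a real service would query a txtai index that is loaded once outside the handler.

```python
import json


def search_titles(query, limit=5):
    """Placeholder lookup; a real service would call something like embeddings.search(query, limit)."""
    return [{"title": "Example post about " + query, "score": 1.0}][:limit]


def respond(status, payload):
    """Build a response dict in the shape API Gateway proxy integrations expect."""
    return {
        "statusCode": status,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(payload),
    }


def handler(event, context):
    """Entry point: read ?q=... from the request and return search results as JSON."""
    # queryStringParameters can be missing or None when no query string is sent
    params = event.get("queryStringParameters") or {}
    query = (params.get("q") or "").strip()
    if not query:
        return respond(400, {"error": "missing query parameter 'q'"})
    return respond(200, {"query": query, "results": search_titles(query)})


# Local smoke test with a hand-written event
print(handler({"queryStringParameters": {"q": "sofa"}}, None))
```

The same request-in, JSON-out shape carries over to Flask or Django; only the way the query parameter is read and the response is returned would change.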

It’s also worth noting that there are already many services out there which build semantic search into your site, such as Elastic, which has many other features built in. But of course, they all come with a cost.

Internal linking widget/automation

Going one step further, we can use this search engine to find related posts. Many sites use an internal linking widget for ‘related products’ or ‘related articles’. Using our example search engine for Melt Digital’s Knowledge Hub, we can find related posts for all our Knowledge Hub posts – we just need to feed our txtai model a Knowledge Hub title and it will return the most related posts.
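A sketch of that lookup is below. The idea is to search the index with a post’s own title, ask for one extra result, and drop the post itself from the hits. The function name `related_posts` and the titles are invented for illustration. `WordOverlapIndex` is only a stand-in so the example runs on its own: it ranks by shared words rather than meaning, but it exposes the same `search(query, limit)` shape as a txtai index, which is what you would pass in for real.

```python
# Placeholder titles, invented for illustration
posts = [
    {"title": "How the Women's Euros changed search trends"},
    {"title": "Search trends to watch this summer"},
    {"title": "A beginner's guide to internal linking"},
    {"title": "Internal linking mistakes to avoid"},
]


class WordOverlapIndex:
    """Stand-in index: ranks titles by shared words (not by meaning, unlike txtai)."""

    def __init__(self, posts):
        self.titles = [set(post["title"].lower().split()) for post in posts]

    def search(self, query, limit):
        words = set(query.lower().split())
        # Jaccard similarity: shared words divided by all distinct words
        scored = [(uid, len(words & title) / len(words | title))
                  for uid, title in enumerate(self.titles)]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:limit]


def related_posts(embeddings, posts, index, limit=3):
    """Return up to `limit` posts most similar to posts[index], excluding the post itself."""
    title = posts[index]["title"]
    # Ask for one extra hit because a post's own title normally matches itself best
    hits = embeddings.search(title, limit + 1)
    return [posts[uid] for uid, _score in hits if uid != index][:limit]


index = WordOverlapIndex(posts)
for post in related_posts(index, posts, 0, limit=2):
    print(post["title"])
```

Running this for every post and storing the results would give you the data behind a ‘related articles’ module, without anyone having to curate the links by hand.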

In the GIF below, I am discovering related posts for a given title, and outputting something that could resemble a rather ugly internal linking widget.

Summary

Hopefully this post has not only given you insight into the power of semantic search but also given you something actionable to get started with. If you have a site with a lot of content or inventory and limited search functionality, you should now have some ideas for how you could not only improve site search but also automate internal linking to assist users and search engines alike!

It’s worth reiterating that there are many tools out there which will do this job for you; you may find, though, that they are rigid and often challenging to integrate into your tech stack in the exact manner you require. As this technology becomes more widespread and usable, it’s worth considering how you can build tooling that is bespoke to your domain.

Lastly, it’s worth noting that our implementation is simplistic – in reality there are many optimisations that go beyond the remit of this post. If you would like to add internal semantic search and internal linking modules, bespoke to your domain, we’d love to hear from you.

 

If you would like to consider how to add semantic search and internal linking widgets to your domain, get in touch with us here.