Latent Semantic Indexing (LSI), also referred to as Latent Semantic Analysis (LSA), is a concept adopted by search engines to interpret keywords within the context in which they are searched for. By understanding the relationships between different terms, LSI helps a search engine return results that match what the user meant rather than just the words they typed.

How does LSI work?

In its most basic form, LSI is a mathematical model designed to match the content of a page with the contextual relevance a user is trying to find. For example, the search term “Big Ben” is clearly a reference to a clock tower. Taken word by word, however, a search engine could conceivably return hundreds of irrelevant articles about big items or famous people called Ben (who one would presume would be of the larger disposition). By scanning content and identifying terms that commonly appear together, basic LSI can be used to predict the intent behind a search. When you see a list of predictive terms in the Google drop-down bar (e.g. Big Ben London, Big Ben facts, Big Ben repair), these are an example of LSI at play, grouping various word combinations with the most relevant content.

Using LSI to predict intent

Many searches take the form of questions. To extend the previous example, the query “Can I go inside the Elizabeth Tower?” could be a search term used by any visitor to London. Although there is no explicit mention of Big Ben, LSI recognises that the terms Elizabeth, Big and Ben appear within multiple pages housing similar content, and thus displays relevant pages regardless of which term was used. Likewise, LSI picks up on the searcher’s intent signalled by the question phrase “can I”, so the search engine displays pages providing information about the Elizabeth Tower and Parliament tours on its SERPs.

Under the hood of LSI

Not all words are equal when it comes to the amount of semantic meaning they hold, that is, how much information a word carries about its context. Conjunctions, pronouns and common verbs (such as to be, do, think and see) carry little or no weight in the eyes of LSI. Strip these away and you’re largely left with a group of “content” words, which can be placed in what’s known as a term-document matrix (TDM). A TDM is essentially a fancy way of saying big grid: each row is a word, each column is a document, and each cell records whether that word appears in that document. In this simple form the information is binary, so a cell is given a value of 1 if the word appears in the document and 0 if it doesn’t. Comparing two rows shows how many documents the two words share, which is known as co-occurrence. This grid is what the mathematical technique underlying LSI, singular value decomposition (SVD), operates on. Every additional word adds another dimension: two words give a 2D picture, and a third adds a third axis, turning it into a 3D one. Now add a fourth, then a fifth, then another thousand… as you can see, things soon become very complicated indeed!
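
To make the big grid a little more concrete, here is a minimal Python sketch of a binary term-document matrix and a co-occurrence count. The three documents, the stop-word list and the helper names (content_words, co_occurrence) are all made up for illustration; a real search engine works at a vastly larger scale and with far more careful text processing.

```python
# Toy corpus and stop-word list, invented purely for illustration.
documents = [
    "big ben is a clock tower in london",
    "the elizabeth tower houses big ben",
    "ben is a big fan of football",
]
STOP_WORDS = {"is", "a", "in", "the", "of"}


def content_words(text):
    """Split a document into words and drop the low-value stop words."""
    return [word for word in text.lower().split() if word not in STOP_WORDS]


# Every distinct content word across the corpus, in a stable order.
vocabulary = sorted({word for doc in documents for word in content_words(doc)})

# One row per term, one column per document:
# 1 if the term appears in that document, 0 if it does not.
term_document_matrix = {
    term: [1 if term in content_words(doc) else 0 for doc in documents]
    for term in vocabulary
}


def co_occurrence(term_a, term_b):
    """Count the documents that contain both terms."""
    row_a = term_document_matrix[term_a]
    row_b = term_document_matrix[term_b]
    return sum(a * b for a, b in zip(row_a, row_b))


print(term_document_matrix["tower"])   # which documents mention "tower"
print(co_occurrence("big", "ben"))     # documents containing both words
```

In this toy corpus “big” and “ben” share all three documents, which is exactly why the pair on its own says so little about intent; it is the neighbouring words such as “tower” and “clock” that separate the landmark from the football fan.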

Although there’s no need to go into advanced mathematics here, it’s worth knowing that SVD is used to make sense of this multidimensional space, linking various words together to ascertain meaning from the fog of data. It can be helpful to think of words as planets in a vast universe: far-flung concepts won’t be visible if you stand on one and look into the night sky. However, words that are close together will form a solar system of sorts, indicative of a wider concept.
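
For the curious, the solar system idea can be sketched in a few lines of Python using NumPy’s numpy.linalg.svd. The tiny matrix below, the choice of keeping two dimensions (k = 2) and the helper names are assumptions made for illustration only; this is not a description of how any particular search engine implements LSI.

```python
import numpy as np

terms = ["ben", "clock", "elizabeth", "tower", "football", "league"]

# Rows are terms, columns are four made-up documents (1 = term appears).
# "ben" and "elizabeth" never share a document, but both keep company
# with "clock" and "tower".
A = np.array([
    [1, 0, 0, 0],  # ben
    [1, 1, 0, 0],  # clock
    [0, 1, 0, 0],  # elizabeth
    [1, 1, 0, 0],  # tower
    [0, 0, 1, 1],  # football
    [0, 0, 1, 0],  # league
], dtype=float)

# Decompose the grid, then keep only the k strongest dimensions.
U, S, Vt = np.linalg.svd(A, full_matrices=False)
k = 2  # illustrative choice; real systems keep far more dimensions
term_vectors = U[:, :k] * S[:k]  # each row places a term in the reduced space


def cosine(u, v):
    """Cosine similarity: 1 means same direction, 0 means unrelated."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def similarity(term_a, term_b, vectors):
    """Compare two terms using the given matrix of term vectors."""
    return cosine(vectors[terms.index(term_a)], vectors[terms.index(term_b)])


print(similarity("ben", "elizabeth", A))             # raw grid
print(similarity("ben", "elizabeth", term_vectors))  # reduced space
print(similarity("ben", "football", term_vectors))   # a far-flung planet
```

In the raw grid “ben” and “elizabeth” look unrelated because they never appear in the same document. Once the space is reduced, the two land in practically the same spot because they share neighbours, while “football” stays a long way off.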

And who said there was no meaning in the universe?  

Making the most of LSI

Fortunately, you don’t have to understand the minutiae of LSI to make the most of it. Just be aware that a keyword only has to appear once in a document to gain a value. With this in mind, if you’re serious about implementing an effective SEO strategy, creating high-value content that is relevant to the reader is a far better use of time than outdated black hat methods such as keyword stuffing. Google first introduced LSI as a way of purging low-quality, irrelevant sites in favour of high-quality content, putting the user experience front and centre. Do likewise and LSI will take care of the rest, meaning that good content doesn’t have to be compromised for the sake of search rankings.
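
The point about a keyword only needing to appear once can be shown with a very small Python sketch. It assumes the simple binary weighting described earlier, and the function name and the two sample sentences are invented for illustration.

```python
def binary_weight(document, keyword):
    """Return 1 if the keyword appears at least once, otherwise 0."""
    return 1 if keyword in document.lower().split() else 0


natural = "big ben is the nickname of the great bell in the elizabeth tower"
stuffed = "ben ben ben ben big ben ben cheap ben tours ben"

# Under binary weighting, repetition earns nothing extra.
print(binary_weight(natural, "ben"), binary_weight(stuffed, "ben"))
```

Both documents receive the same value for “ben”, however many times the stuffed one repeats it. Under this simple scheme, repetition adds nothing; what sets a page apart is the range of related terms it covers.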

In 2015 Google went a step further and confirmed the introduction of a further algorithm known as ‘RankBrain’. By utilising machine learning, it has added yet another layer of complexity to its search function. For the first time Google can automatically modulate the importance of domains and backlinks in accordance with real-life behaviour. That is to say, it can figure out what you’re trying to say. Even if you don’t. With time the system will only continue to improve itself, building on the work of LSI and leading the world of search into uncharted territory.