Google search, math and latent semantic analysis

By Murray Bourne, 13 Jul 2008

Google has become the dominant search engine because of its relevance and efficiency. Relevance is achieved through its proprietary PageRank algorithm, which determines which pages are the most likely to satisfy your search query. Efficiency is achieved by using thousands of PCs rather than big servers to hold all the indexing, document and media information.

I wrote about this a while back in Math that made Google rich.

Now let's move on to an aspect of matrices that search engines use, called latent semantic analysis.

Here's what Wikipedia has to say on the subject:

Latent semantic analysis (LSA) is a technique in natural language processing, in particular in vectorial semantics, of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. LSA can use a term-document matrix which describes the occurrences of terms in documents; it is a sparse matrix whose rows correspond to terms and whose columns correspond to documents, typically stemmed words that appear in the documents.

Let's put this in everyday language. Latent semantic indexing (another name for latent semantic analysis when it is applied to search) is something the search engines do when they analyze the content of a Web site in order to figure out what the site is about.

It's actually what we humans do every day of our lives: try to figure out the meaning in what we see, hear and feel.

That Wikipedia article delves into the matrix operations that are involved in latent semantic analysis.

(If you are a bit rusty, see an Introduction to Matrices.)
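To make the matrix side of this a little more concrete, here is a minimal sketch of the idea in Python with NumPy. It is only an illustration under stated assumptions, not how Google or any other search engine actually does it: the four tiny "documents" are made up, the words are not stemmed, and the choice of `k = 2` concepts is arbitrary. The sketch builds a term-document matrix (rows are terms, columns are documents), takes its singular value decomposition, keeps the two largest singular values, and compares documents in the reduced "concept" space.

```python
import numpy as np

# Toy corpus, invented purely for illustration.
docs = [
    "matrix algebra matrix inverse",
    "inverse matrix algebra",
    "search engine ranking",
    "web search engine ranking",
]

# Vocabulary: every distinct word, sorted (no stemming, no stop-word removal).
terms = sorted({word for doc in docs for word in doc.split()})

# Term-document matrix: rows are terms, columns are documents,
# each entry counts how often the term occurs in the document.
A = np.array([[doc.split().count(term) for doc in docs] for term in terms],
             dtype=float)

# Singular value decomposition: A = U @ diag(s) @ Vt.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the k largest singular values (k = 2 is an arbitrary choice here).
k = 2
doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T  # one row per document, k columns


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


# Similarity using raw word counts versus similarity in concept space.
raw_web = cosine(A[:, 2], A[:, 3])              # documents 3 and 4, raw counts
lsa_web = cosine(doc_vecs[2], doc_vecs[3])      # documents 3 and 4, concepts
lsa_unrelated = cosine(doc_vecs[0], doc_vecs[2])  # math doc vs search doc

print(f"raw counts, docs 3 and 4:    {raw_web:.3f}")
print(f"concept space, docs 3 and 4: {lsa_web:.3f}")
print(f"concept space, docs 1 and 3: {lsa_unrelated:.3f}")
```

In this toy case the two search-related documents look more alike in the concept space than they do on raw word counts, even though one of them contains the extra word "web", while the math document and the search document stay unrelated. Real collections are far larger and messier, so treat this only as a picture of the mechanics.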


4 Comments on “Google search, math and latent semantic analysis”

  1. khudhair says:

    Many thanks.
    I am a researcher interested in measuring text similarity, and I am currently working with LSA. I hope to benefit from your experience.

  2. Latent Semantic Explorer says:

    Hello,

    Do you consider that the Google patent "phrase based indexing in an information retrieval system" (https://www.google.com/patents/US7536408) describes an LSI method? Thank you for your post.

    PS: I built a PHP-based application called SEO Hero that queries Google with a given keyword and extracts the first 100 documents that rank for that query. The application then parses every document to extract words and phrases, and stores all these terms in a database along with data such as term frequency and document frequency for each entry. The main aim is to understand how strongly words are correlated with a given query. Is it too much to describe this tool as a "latent semantic explorer"? Thank you for any advice.

  3. Murray says:

    Thanks for sharing your application. I'll try to look at it more closely when I have time.

    I believe that Google patent would be a latent semantic indexing method, and it looks like it would be legitimate to call your app a "latent semantic explorer".

  4. Latent Semantic Explorer says:

    Hello Murray,

    Some people in the SEO industry didn't like the name "latent semantic explorer". They said it could mislead users because it is too reminiscent of LSA/LSI.

    Anyway, what we call it is not so important, so I have decided to describe SEO Hero as a Topic Explorer.

    I don't know if you had time to take a look, but thank you very much for your kindness.
