Monday, August 16, 2010

How does Google balance the conflict between phrase search and stopwords?

I know that when Google indexes documents, stopwords are ignored. But when a user runs a phrase search that contains one or two stopwords, how does Google retrieve them, given that these words are not indexed? Can anyone give me some clues? Thanks a lot.

I believe what Google and other search engines use is something called "approximate string matching". Its results are somewhat similar to what the LIKE operator gives you in SQL.





There's something called the "Levenshtein distance" (sometimes called the "edit distance"), which is a widely used measure that assigns a value to the difference between two strings.


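It basically figures out how many single-character inserts, deletes, or substitutions it would take to convert one string to another.

For example:

To change the word CAT to COT, it takes 1 substitution (A becomes O), giving an edit distance of 1. If only inserts and deletes are counted, it takes 1 deletion and 1 addition, giving a distance of 2.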
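
To make the CAT/COT example concrete, here is a minimal sketch of the standard dynamic-programming edit distance in Python. The function name `levenshtein` is my own choice for illustration, and nothing here is taken from any search engine's actual code. Note that because a substitution counts as a single edit, CAT to COT comes out as 1; counting it as one deletion plus one addition instead gives 2.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting inserts, deletes and substitutions as 1 each."""
    # previous[j] = distance between the first i-1 chars of a and the first j chars of b
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning the first i chars of a into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb (free if equal)
            ))
        previous = current
    return previous[-1]


print(levenshtein("CAT", "COT"))         # 1
print(levenshtein("kitten", "sitting"))  # 3
```

It only keeps two rows of the table at a time, so memory stays proportional to the length of the second string.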








A weighting system, which can be different for each program performing the search, is then applied to select the closest matches.


Combine that with some regular expressions, and very close matches can be found for complete sentences that contain these "stop words".


Then there are q-grams (also called "n-grams"), which are the short overlapping substrings of length q that a string breaks down into. Combined with Levenshtein distances, they provide some really good approximations for search results.
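
As a rough illustration of the q-gram idea, the sketch below splits strings into overlapping character q-grams and compares the resulting sets with Jaccard similarity. The helper names `qgrams` and `qgram_similarity`, the `#` padding character, and the choice of Jaccard are all assumptions of mine; other overlap measures would work as well, and this is not a claim about what Google does.

```python
def qgrams(text: str, q: int = 3) -> set:
    """Return the set of overlapping character q-grams of text."""
    # Pad both ends so the first and last characters appear in q grams too.
    padded = "#" * (q - 1) + text.lower() + "#" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def qgram_similarity(a: str, b: str, q: int = 3) -> float:
    """Jaccard similarity of the two q-gram sets (1.0 means identical sets)."""
    grams_a, grams_b = qgrams(a, q), qgrams(b, q)
    union = grams_a | grams_b
    if not union:
        return 1.0  # both strings produced no q-grams at all
    return len(grams_a & grams_b) / len(union)


print(sorted(qgrams("cat", 2)))
print(qgram_similarity("google", "gogle"))
print(qgram_similarity("google", "yahoo"))
```

A common way to combine the two techniques is to use the cheap q-gram overlap to shortlist candidates and then run the more expensive edit distance only on that shortlist.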





By combining these methods, Google can return excellent results without even considering these "stop words".





BTW, this is also how search engines guess at what you were trying to say ("Did you mean to search for this?").
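
A toy version of the "did you mean" guess can be sketched with Python's standard `difflib` module. Be aware that `difflib.get_close_matches` ranks candidates with `SequenceMatcher` ratios rather than Levenshtein distance, so this swaps in a different similarity measure. The `VOCABULARY` list, the `did_you_mean` name, and the 0.7 cutoff are made up for the example; a real engine would draw on far larger data.

```python
import difflib

# Hypothetical list of known-good words; a real system would have millions.
VOCABULARY = ["levenshtein", "distance", "search", "engine", "phrase", "stopword"]


def did_you_mean(word, vocabulary=VOCABULARY, cutoff=0.7):
    """Return the closest known word, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(word.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(did_you_mean("levenshtien"))  # levenshtein
print(did_you_mean("serach"))       # search
print(did_you_mean("xyzzy"))        # None
```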

