goglidea.blogg.se - Spelling corrector

SPELLING CORRECTOR HOW TO
SPELLING CORRECTOR UPDATE

In that case, we can normalize the historical frequencies of unique queries to establish a probability distribution. Let’s consider the simplifying assumption that all of our queries are single tokens. The language model estimates the probability of an intended query - that is, the probability, given no further information, that a searcher would set out to type a particular query. With these two models and Bayes’ theorem, we can score candidate spelling correction candidates and rank them based on their probability. In order to do that, we need two models: a language model that tells us a priori probability of a query, and an error model that tells us the probability of a query string given the intended query.

The goal of spelling correction is to find the correction, out of all possible candidate corrections (including the original query), that has the highest probability of being correct. This approach is most useful for words with unintuitive spellings, such as proper names or words adopted from other languages. It is also possible to index tokens based on how they sound, such as canonicalizing them into Metaphone codes. Hence, a spelling correction system must make tradeoffs among storage, efficiency, and quality in indexing and candidate generation. Moreover, as vocabulary grows, the index becomes more dense: in particular, short substrings are mapped to large numbers of tokens.

This index grows with the size of the corpus vocabulary, so it can become quite large.

SPELLING CORRECTOR UPDATE

We generate this index by identifying the unique tokens in the corpus, iterating through them, and inserting the appropriate substring-to-string mappings into a substring index, such as an n-gram index or a suffix tree.Īlthough this description assumes a batch, offline indexing process, it is straightforward to update the index incrementally as we insert, remove, or update documents. An index for approximate string matching enables discovery of these near-misses through retrieval of strings - in our case, tokens - by their substrings. Most misspelled tokens differ from the intended tokens by at most a few characters (i.e, a small edit distance). In contrast, indexing for spelling correction typically maps substrings of tokens ( character n-grams) to tokens. The fundamental data structure for document retrieval is an inverted index (also known as a posting list) that maps tokens to documents. Indexing for spelling correction is a bit different than for document retrieval. Let’s drill down into each of these aspects.

SPELLING CORRECTOR HOW TO

Determining whether and how to present a spelling correction. Computing the score or probability for each candidate. Identifying the spelling correction candidates for the query. Computing the model to estimate the probability of a particular misspelling, given an intended query. Computing the model to estimate the a priori probability of an intended query. Building the index used at query time for candidate generation.

Let’s look at the key aspects of a spelling-correction system. A great starting point is Peter Norvig’s classic post on “ How to Write a Spelling Corrector”, which walks through fundamental concepts like edit distance, language models, and error models. It’s still important, however, to understand how spelling correction works. Off-the-shelf spelling correction systems, such as Aspell or Hunspell, are highly customizable and should suit most people’s needs. That doesn’t mean, however, that you should build a spelling correction system from scratch. For today’s searchers, a search engine without robust spelling correction simply doesn’t work. Estimates of the fraction of misspelled search queries vary, but a variety of studies place it between 10% and 15%. Spelling correction is a must-have for any modern search engine.