Developing a spell checker

Authors

DOI:

https://doi.org/10.35699/1983-3652.2021.26469

Keywords:

Spell Checker, Spelling, Orthography, Affixes, Computational Linguistics

Abstract

Spell checkers are ubiquitous computational tools that help us write texts and messages correctly and improve information inquiry and data mining. This work presents the history of spell checker development and illustrates how an efficient spell checker can be built in a simple way from Norvig’s proposal. We also highlight some tools, such as affix removal and n-gram computation, and how they are used in the development of spell checkers. Moreover, we present an implementation of Norvig’s spell checker and its performance in automatic correction on different spelling error data sets. Finally, a comparison of spell checker performance shows that removing affixes is worthwhile.
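
As a rough illustration of the kind of corrector the abstract refers to, the sketch below follows the general idea of Norvig's proposal: generate every string one edit away from the input and pick the candidate that is most frequent in a reference text. It is a minimal sketch, not the authors' implementation. The toy corpus, the names CORPUS, WORDS and LETTERS, and the restriction to a single edit are assumptions made here for brevity; affix removal and n-gram scoring are not shown.

```python
from collections import Counter

# Toy corpus standing in for a real word-frequency source (an assumption).
CORPUS = "the spelling of a word is checked against the words of a known text"
WORDS = Counter(CORPUS.split())
LETTERS = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """Return every string one delete, transpose, replace or insert away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in LETTERS]
    inserts = [a + c + b for a, b in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)


def known(words):
    """Keep only the candidates that occur in the reference corpus."""
    return {w for w in words if w in WORDS}


def correction(word):
    """Prefer the word itself, then known one-edit neighbours, else give up."""
    candidates = known([word]) or known(edits1(word)) or {word}
    # Among the candidates, choose the one seen most often in the corpus.
    return max(candidates, key=lambda w: WORDS[w])
```

With this toy corpus, correction("speling") finds "spelling" through a single insertion, and a word with no known neighbour is returned unchanged. A practical checker would use a much larger corpus and would typically also consider candidates two edits away.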

Author Biographies

  • Leonardo Carneiro de Araujo, Universidade Federal de São João del Rei, Ouro Branco, MG, Brasil

    Holds a bachelor's degree (2003), a master's degree (2007), and a doctorate (2013) in Electrical Engineering, all from the Universidade Federal de Minas Gerais. He is currently an adjunct professor at the Universidade Federal de São João del-Rei. He has experience in Electrical Engineering, with an emphasis on Signal Processing, working mainly on the following topics: quantitative linguistics, signal processing, information theory, pattern recognition, speech recognition, and artificial intelligence.

  • Aline de Lima Benevides, Universidade de São Paulo, São Paulo, SP, Brasil

    Doctoral candidate in Letters in the Graduate Program in Linguistics at the Universidade de São Paulo (CNPq/USP), supervised by Prof. Dr. Raquel Santana Santos and co-supervised by Prof. Dr. Thaïs Cristófaro Silva (UFMG), with the research project "O processamento acentual em PB" (Stress processing in Brazilian Portuguese). She also holds a master's degree in Letters from the same program (CNPq/USP), supervised by Prof. Dr. Paulo Chagas de Souza (USP) and co-supervised by Prof. Dr. Thaïs Cristófaro Silva (UFMG), with the research project "O acento primário em português: uma abordagem experimental" (Primary stress in Portuguese: an experimental approach). She is currently a member of the research group "Aquisição e Uso de Estratégias rítmicas em português brasileiro" (Acquisition and Use of Rhythmic Strategies in Brazilian Portuguese), coordinated by Prof. Dr. Raquel Santana Santos. From 2010 to 2016 she was a member of the Grupo de Estudos de Fonologia e Morfologia (FONEMOS/USP) and, from 2014 to 2015, of the Grupo de Estudos de Linguística Computacional da USP (GLIC/USP). She holds a bachelor's degree in Letters with a major in Portuguese/Linguistics from the Universidade de São Paulo (USP), with research projects in Phonology, Experimental Phonology, Computational Linguistics, and Usage-Based Models, specifically concerning stress. She spent an undergraduate exchange period at the Universidade de Lisboa, with a research project that contrastively analyzed stress assignment and stress fluctuation in two varieties of Portuguese, Brazilian and European, supervised by Prof. Dr. Marisa Cruz.

  • João Pedro Hallack Sansão, Universidade Federal de São João del Rei, Ouro Branco, MG, Brasil

    He is an electrical engineer with bachelor's, master's, and doctoral degrees in Electrical Engineering from UFMG. He has experience in Electronic Engineering; his main interests are signal and image processing, machine learning, embedded electronic system design, psychoacoustics, and analysis of the dysphonic voice. He teaches electronics at the Alto Paraopeba Campus of the Universidade Federal de São João del-Rei.

References

ARAUJO, Leonardo. leolca/spellcheck v0.1-alpha. [S.l.]: Zenodo, 2020. DOI: 10.5281/ZENODO.3235670. Disponível em: https://zenodo.org/record/3235670. Acesso em: 27 nov. 2020.

ATKINSON, Kevin. Aspell. [S.l.: s.n.], 1998. Disponível em: http://aspell.net/. Acesso em: 8 jul. 2020.

AVANÇO, Lucas Vinicius; DURAN, Magali Sanches. Towards a Phonetic Brazilian Portuguese Spell Checker. In: PROCEEDINGS of ToRPorEsp Workshop PROPOR 2014. [S.l.: s.n.], 2014.

BEIDER, Alexander; MORSE, Stephen P. An Alternative to Soundex with Fewer False Hits. 2008. Disponível em: https://stevemorse.org/phonetics/bmpm.htm. Acesso em: 8 jul. 2020.

BEIDER, Alexander; MORSE, Stephen P. Phonetic Matching: A Better Soundex. 2010. Disponível em: http://stevemorse.org/phonetics/bmpm2.htm. Acesso em: 8 jul. 2020.

CACHO, Jorge Ramon Fonseca. Improving OCR Post Processing with Machine Learning Tools. Ago. 2012. Tese (Doutorado) – University of Nevada, Las Vegas.

CASTRO, Daniel. Métodos para la corrección ortográfica automática del español. 2012. Tese (Doutorado) – Universidad de Oriente, Santiago de Cuba. Disponível em: http://www.cerpamid.co.cu/sitio/files/tesis_master_daniel.pdf

CHURCH, Kenneth W.; GALE, William A. Probability scoring for spelling correction. Statistics and Computing, Springer Science e Business Media LLC, v. 1, n. 2, p. 93–103, dez. 1991. DOI: 10.1007/bf01889984.

CORPORATION, Microsoft. Michael Cameron; Hugh Williams. Query suggestions for no result web searches. US, set. 2013. US8583670B2, Depósito: 4 out. 2007. Concessão: 9 abr. 2009. Patent. Disponível em: https://patents.google.com/patent/US8583670B2/en.

CORPORATION, Microsoft. Eric D. Brill; Silviu-Petru Cucerzan. Systems and methods for spell checking. US, set. 2005. EP1577793A2, Depósito: 14 mar. 2005. Concessão: 21 nov. 2005. Patent. Disponível em: https://patents.google.com/patent/EP1577793A2/en.

DAMERAU, Fred J. A technique for computer detection and correction of spelling errors. Communications of the ACM, Association for Computing Machinery (ACM), v. 7, n. 3, p. 171–176, mar. 1964.

DAVIDSON, Leon. Retrieval of misspelled names in an airlines passenger record system. Communications of the ACM, Association for Computing Machinery (ACM), v. 5, n. 3, p. 169–171, mar. 1962. DOI: 10.1145/366862.366913. Disponível em: https://doi.org/10.1145/366862.366913.

DEFFNER, Renate; EDER, Klaus; GEIGER, Hans. Word Recognition as a First Step Towards Natural Language Processing with Artificial Neural Networks. In: KONNEKTIONISMUS in Artificial Intelligence und Kognitionsforschung. [S.l.]: Springer Berlin Heidelberg, 1990. p. 221–225. DOI: 10.1007/978-3-642-76070-9_27. Disponível em: https://doi.org/10.1007/978-3-642-76070-9_27.

DUNLAVEY, Michael R. Letter to the Editor: On Spelling Correction and Beyond. ACM, v. 24, n. 9, p. 608–608, set. 1981.

ELLIOTT, R. J. Annotating spelling list words with affixation classes. [S.l.], dez. 1988.

GLANTZ, Herbert T. On the recognition of information with a digital computer. In: PROCEEDINGS of the 1956 11th ACM national meeting. [S.l.]: ACM Press, 1956. DOI: 10.1145/800258.808966. Disponível em: https://doi.org/10.1145/800258.808966.

GOLDING, Andrew R.; SCHABES, Yves. Combining Trigram-based and feature-based methods for context-sensitive spelling correction. In: PROCEEDINGS of the 34th annual meeting on Association for Computational Linguistics. [S.l.]: Association for Computational Linguistics, 1996. DOI: 10.3115/981863.981873. Disponível em: https://doi.org/10.3115/981863.981873.

GORIN, Ralph E. SPELL: Spelling Check and Correction Program. [S.l.: s.n.], 1974. Disponível em: https://www.saildart.org/allow/SPELL.REG%5C%5bUP,DOC%5C%5d. Acesso em: 8 jul. 2020.

GORIN, Ralph E.; WILLISSON, Pace; KUENNING, Geoff. Ispell. [S.l.: s.n.], 1971. Disponível em: https://www.cs.hmc.edu/~geoff/ispell.html. Acesso em: 8 jul. 2020.

GUPTA, Prabhakar. A Context-Sensitive Real-Time Spell Checker with Language Adaptability. In: 2020 IEEE 14th International Conference on Semantic Computing (ICSC). [S.l.]: IEEE, fev. 2020. DOI: 10.1109/icsc.2020.00023.

HODGE, Victoria J.; AUSTIN, Jim. A comparison of a novel neural spell checker and standard spell checking algorithms. Pattern Recognition, Elsevier BV, v. 35, n. 11, p. 2571–2580, nov. 2002. DOI: 10.1016/s0031-3203(01)00174-1. Disponível em: https://doi.org/10.1016/s0031-3203(01)00174-1.

HODGE, Victoria J.; AUSTIN, Jim. A comparison of standard spell checking algorithms and a novel binary neural approach. IEEE Transactions on Knowledge and Data Engineering, Institute of Electrical e Electronics Engineers (IEEE), v. 15, n. 5, p. 1073–1081, set. 2003. DOI: 10.1109/tkde.2003.1232265. Disponível em: https://doi.org/10.1109/tkde.2003.1232265.

INC., Google. Noam Shazeer. Method of spell-checking search queries. US, mar. 2007. US7194684B1, Depósito: 9 abr. 2002. Concessão: 20 mar. 2007. Patent. Disponível em: https://patents.google.com/patent/US7194684B1/en.

INGELS, Peter. Connected Text Recognition Using Layered HMMs and Token Passing. CoRR, cmp-lg/9607036, 1996. Disponível em: http://arxiv.org/abs/cmp-lg/9607036.

JURAFSKY, Daniel; MARTIN, James H. Speech and Language Processing: An introduction to Natural Language Processing. [S.l.]: Prentice Hall, jan. 2008. ISBN 0131873210.

KERNIGHAN, Mark D.; CHURCH, Kenneth W.; GALE, William A. A spelling correction program based on a noisy channel model. In: PROCEEDINGS of the 13th conference on Computational linguistics. [S.l.]: Association for Computational Linguistics, 1990. DOI: 10.3115/997939.997975. Disponível em: https://doi.org/10.3115/997939.997975.

KIM, Jong Yong; SHAWE-TAYLOR, John. An approximate string-matching algorithm. Theoretical Computer Science, Elsevier BV, v. 92, n. 1, p. 107–117, jan. 1992. DOI: 10.1016/0304-3975(92)90138-6. Disponível em: https://doi.org/10.1016/0304-3975(92)90138-6.

KIM, Jong Yong; SHAWE-TAYLOR, John. Fast string matching using an n-gram algorithm. Software: Practice and Experience, Wiley, v. 24, n. 1, p. 79–88, jan. 1994. DOI: 10.1002/spe.4380240105. Disponível em: https://doi.org/10.1002/spe.4380240105.

KNUTH, Donald Ervin. The Art of Computer Programming. [S.l.]: Addison-Wesley, 1973. v. 3.

KUČERA, Henry; FRANCIS, Winthrop Nelson. Computational analysis of present-day American English. [S.l.]: Brown University Press, 1967.

KUKICH, Karen. Spelling correction for the telecommunications network for the deaf. Communications of the ACM, Association for Computing Machinery (ACM), v. 35, n. 5, p. 80–90, mai. 1992. DOI: 10.1145/129875.129882. Disponível em: https://doi.org/10.1145/129875.129882.

KUKICH, Karen. Techniques for automatically correcting words in text. ACM Computing Surveys, Association for Computing Machinery (ACM), v. 24, n. 4, p. 377–439, dez. 1992. DOI: 10.1145/146370.146380.

LIU, Lon-Mu et al. Adaptive post-processing of OCR text via knowledge acquisition. In: PROCEEDINGS of the 19th annual conference on Computer Science - CSC ’91. [S.l.]: ACM Press, 1991.

LUCCHESI, Cláudio L.; KOWALTOWSKI, Tomasz. Applications of finite automata representing large vocabularies. Software: Practice and Experience, Wiley, v. 23, n. 1, p. 15–30, jan. 1993. DOI: 10.1002/spe.4380230103. Disponível em: https://doi.org/10.1002/spe.4380230103.

MAHONEY, Michael Sean. An Oral History of Unix. [S.l.]: Zenodo, 1998. DOI: 10.5281/zenodo.2525530.

MCILROY, Malcolm Douglas. Development of a Spelling List. IEEE Transactions on Communications, Institute of Electrical e Electronics Engineers (IEEE), v. 30, n. 1, p. 91–99, jan. 1982. DOI: 10.1109/tcom.1982.1095395. Disponível em: https://doi.org/10.1109/tcom.1982.1095395.

MCMAHON, Lee E.; CHERRY, Lorinda L.; MORRIS, Robert. Statistical Text Processing. The Bell System Technical Journal, v. 57, n. 6, p. 2137–2154, 1978.

MIN, Kyongho; WILSON, William. Syntactic Recovery and Spelling Correction of Ill-formed Sentences. In: 3RD Conference of the Australasian Cognitive Science (CogSci95). [S.l.: s.n.], mar. 1995. p. 1–10.

MITTON, Roger. Fifty years of spellchecking. Writing Systems Research, Informa UK Limited, v. 2, n. 1, p. 1–7, jan. 2010. DOI: 10.1093/wsr/wsq004.

MITTON, Roger. Spelling checkers, spelling correctors and the misspellings of poor spellers. Information Processing & Management, Elsevier BV, v. 23, n. 5, p. 495–505, jan. 1987.

MORRIS, Robert; CHERRY, Lorinda L. Computer detection of typographical errors. IEEE Transactions on Professional Communication, Institute of Electrical e Electronics Engineers (IEEE), PC-18, n. 1, p. 54–56, mar. 1975. DOI: 10.1109/tpc.1975.6593963.

NAGATA, Masaaki. Japanese OCR error correction using character shape similarity and statistical language model. In: PROCEEDINGS of the 36th annual meeting on Association for Computational Linguistics. [S.l.]: Association for Computational Linguistics, 1998. p. 922–928.

NORVIG, Peter. How to write a spelling corrector. [S.l.: s.n.], 2007. http://norvig.com/spell-correct.html.

ODELL, Margaret King. The profit in records management. Systems, New York, v. 20, 1956.

OFLAZER, Kemal. Error-tolerant finite-state recognition with applications to morphological analysis and spelling correction. Computational Linguistics, v. 22, n. 1, p. 73–89, 1996.

OFLAZER, Kemal; GÜZEY, Cemaleddin. Spelling correction in agglutinative languages. In: PROCEEDINGS of the fourth conference on Applied natural language processing. [S.l.]: Association for Computational Linguistics, 1994. DOI: 10.3115/974358.974406. Disponível em: https://doi.org/10.3115/974358.974406.

PETERSON, James L. Computer programs for detecting and correcting spelling errors. Communications of the ACM, Association for Computing Machinery (ACM), v. 23, n. 12, p. 676–687, dez. 1980. DOI: 10.1145/359038.359041. Disponível em: https://doi.org/10.1145/359038.359041.

PHILIPS, Lawrence. Hanging on the metaphone. Computer Language, v. 7, n. 12, p. 39–43, 1990.

PHILIPS, Lawrence. The Double Metaphone Search Algorithm. C/C++ Users J., CMP Media, Inc., USA, v. 18, n. 6, p. 38–43, jun. 2000. ISSN 1075-2838.

PILÁN, Ildikó; VOLODINA, Elena. Exploring word embeddings and phonological similarity for the unsupervised correction of language learner errors. In: LATECH@COLING. [S.l.: s.n.], 2018.

PILAR ANGELES, Maria del; ESPINO-GAMEZ, Adrian; GIL-MONCADA, Jonathan. Comparison of a Modified Spanish Phonetic, Soundex, and Phonex coding functions during data matching process. In: 2015 International Conference on Informatics, Electronics & Vision (ICIEV). [S.l.]: IEEE, jun. 2015. DOI: 10.1109/iciev.2015.7334028. Disponível em: https://doi.org/10.1109/iciev.2015.7334028.

POLLOCK, Joseph J.; ZAMORA, Antonio. Automatic spelling correction in scientific and scholarly text. Communications of the ACM, Association for Computing Machinery (ACM), v. 27, n. 4, p. 358–368, abr. 1984. DOI: 10.1145/358027.358048.

ROBBINS, Arnold; BEEBE, Nelson H. F. Classic Shell Scripting. [S.l.]: O’Reilly Media, 2005. ISBN

RUSSELL, Robert C. Robert C. Russell. Index. 1918. US 1261167A, Depósito: 25 out. 1917. Concessão: 02 abr. 1918. Disponível em: https://patents.google.com/patent/US1261167A/en.

RUSSELL, Robert C. Robert C. Russell. Index. 1922. US 1435663A, Depósito: 28 nov. 1921. Concessão: 14 nov. 1922. Disponível em: https://patents.google.com/patent/US1435663A/en.

SAMUELSSON, Axel. Weighting Edit Distance to Improve Spelling Correction in Music Entity Search. Jun. 2017. Diss. (Mestrado) – KTH Royal Institute of Technology, Stockholm.

SHANNON, Claude E. A mathematical theory of communication. Bell Syst. Tech. J., v. 27, n. 3, p. 379–423, 1948.

TAGHVA, Kazem; STOFSKY, Eric. OCRSpell: an interactive spelling correction system for OCR errors in text. International Journal on Document Analysis and Recognition, Springer Science e Business Media LLC, v. 3, n. 3, p. 125–137, mar. 2001. DOI: 10.1007/pl00013558. Disponível em: https://doi.org/10.1007/pl00013558.

TAYLOR, W. D. Grope: A spelling error correction tool. [S.l.], 1981.

TONG, Xiang; EVANS, David A. A Statistical Approach to Automatic OCR Error Correction in Context. In: PROCEEDINGS of the Fourth Workshop on Very Large Corpora (WVLC-4). [S.l.: s.n.], 1996. p. 88–100.

VERBERNE, Suzan. Context-sensitive spell checking based on word trigram probabilities. 2002. Diss. (Mestrado) – University of Nijmegen.

WING, Alan M.; BADDELEY, Alan D. Spelling errors in handwriting: A corpus and distributional analysis. In: FRITH, Uta (Ed.). Cognitive Processes in Spelling. London: Academic Press, 1980. p. 251–285.

YANNAKOUDAKIS, Emmanuel J.; FAWTHROP, David. An intelligent spelling error corrector. Information Processing & Management, Elsevier BV, v. 19, n. 2, p. 101–108, jan. 1983. DOI: 10.1016/0306-4573(83)90046-8. Disponível em: https://doi.org/10.1016/0306-4573(83)90046-8.

YANNAKOUDAKIS, Emmanuel J.; FAWTHROP, David. The rules of spelling errors. Information Processing & Management, Elsevier BV, v. 19, n. 2, p. 87–99, jan. 1983. DOI: 10.1016/0306-4573(83)90045-6. Disponível em: https://doi.org/10.1016/0306-4573(83)90045-6.

YOUNG, Charlene W.; EASTMAN, Caroline M.; OAKMAN, Robert L. An analysis of ill-formed input in natural language queries to document retrieval systems. Information Processing & Management, Elsevier BV, v. 27, n. 6, p. 615–622, jan. 1991. DOI: 10.1016/0306-4573(91)90002-4. Disponível em: https://doi.org/10.1016/0306-4573(91)90002-4.

YUNUS, Ahmed; MASUM, Md. A Context Free Spell Correction Method using Supervised Machine Learning Algorithms. International Journal of Computer Applications, Foundation of Computer Science, v. 176, n. 27, p. 36–41, jun. 2020. DOI: 10.5120/ijca2020920288. Disponível em: https://doi.org/10.5120/ijca2020920288.

ZAMORA, Elena M.; POLLOCK, Joseph J.; ZAMORA, Antonio. The use of trigram analysis for spelling error detection. Information Processing & Management, Elsevier BV, v. 17, n. 6, p. 305–316, jan. 1981. DOI: 10.1016/0306-4573(81)90044-3. Disponível em: https://doi.org/10.1016/0306-4573(81)90044-3.

Published

2021-02-09

Issue

Section

Linguistics and Technology

How to Cite

Developing a spell checker. Texto Livre, Belo Horizonte-MG, v. 14, n. 1, p. e26469, 2021. DOI: 10.35699/1983-3652.2021.26469. Disponível em: https://periodicos.ufmg.br/index.php/textolivre/article/view/26469. Acesso em: 19 dec. 2024.
