Developing a spell checker
DOI:
https://doi.org/10.35699/1983-3652.2021.26469
Keywords:
Spell Checker, Spelling, Orthography, Affixes, Computational Linguistics
Abstract
Spell checkers are ubiquitous computational tools that help us write texts and messages correctly and that improve information inquiry and data mining. This work presents the history of spell checker development and illustrates how an efficient spell checker can be built, in a simple way, from Norvig’s proposal. We also highlight some tools used in the development of spell checkers, such as affix removal and n-gram computation. Moreover, we present an implementation of Norvig’s spell checker and its performance in automatic correction on different spelling error data sets. Finally, in a comparison of spell checker performance, we show that removing affixes is worthwhile.
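As a rough illustration of the kind of corrector the abstract refers to, the sketch below follows the general shape of Norvig's essay "How to write a spelling corrector": generate candidate words within one or two edits of the input and pick the most frequent known one. It is not the implementation evaluated in the article. The toy corpus and the names CORPUS, WORDS and LETTERS are assumptions made for this example, and affix removal is not covered.

```python
import re
from collections import Counter

# Toy corpus (an assumption for illustration); a real checker would count
# word frequencies over a large text collection.
CORPUS = ("the quick brown fox jumps over the lazy dog and "
          "the spelling of the word is correct")
WORDS = Counter(re.findall(r"[a-z]+", CORPUS.lower()))
LETTERS = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """Return every string one edit away from word
    (deletion, transposition, replacement or insertion)."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:]
                for left, right in splits if right for c in LETTERS]
    inserts = [left + c + right for left, right in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)


def known(words):
    """Keep only the words that occur in the corpus."""
    return {w for w in words if w in WORDS}


def candidates(word):
    """Prefer the word itself, then known words at edit distance 1,
    then distance 2; fall back to the unchanged word."""
    return (known([word])
            or known(edits1(word))
            or known(e2 for e1 in edits1(word) for e2 in edits1(e1))
            or {word})


def correction(word):
    """Pick the candidate with the highest corpus frequency."""
    return max(candidates(word), key=lambda w: WORDS[w])
```

With this toy corpus, correction("speling") returns "spelling" and correction("teh") returns "the". A word with no known neighbour within two edits is returned unchanged.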
References
ARAUJO, Leonardo. leolca/spellcheck v0.1-alpha. [S.l.]: Zenodo, 2020. DOI: 10.5281/ZENODO.3235670. Disponível em: https://zenodo.org/record/3235670. Acesso em: 27 nov. 2020.
ATKINSON, Kevin. Aspell. [S.l.: s.n.], 1998. Disponível em: http://aspell.net/. Acesso em: 8 jul. 2020.
AVANÇO, Lucas Vinicius; DURAN, Magali Sanches. Towards a Phonetic Brazilian Portuguese Spell Checker. In: PROCEEDINGS of ToRPorEsp Workshop PROPOR 2014. [S.l.: s.n.], 2014.
BEIDER, Alexander; MORSE, Stephen P. An Alternative to Soundex with Fewer False Hits. 2008. Disponível em: https://stevemorse.org/phonetics/bmpm.htm. Acesso em: 8 jul. 2020.
BEIDER, Alexander; MORSE, Stephen P. Phonetic Matching: A Better Soundex. 2010. Disponível em: http://stevemorse.org/phonetics/bmpm2.htm. Acesso em: 8 jul. 2020.
CACHO, Jorge Ramon Fonseca. Improving OCR Post Processing with Machine Learning Tools. Ago. 2012. Tese (Doutorado) – University of Nevada, Las Vegas.
CASTRO, Daniel. Métodos para la corrección ortográfica automática del español. 2012. Tese (Doutorado) – Universidad de Oriente, Santiago de Cuba. Disponível em: http://www.cerpamid.co.cu/sitio/files/tesis_master_daniel.pdf
CHURCH, Kenneth W.; GALE, William A. Probability scoring for spelling correction. Statistics and Computing, Springer Science and Business Media LLC, v. 1, n. 2, p. 93–103, dez. 1991. DOI: 10.1007/bf01889984.
CORPORATION, Microsoft. Michael Cameron; Hugh Williams. Query suggestions for no result web searches. US, set. 2013. US8583670B2, Depósito: 4 out. 2007. Concessão: 9 abr. 2009. Patent. Disponível em: https://patents.google.com/patent/US8583670B2/en.
CORPORATION, Microsoft. Eric D. Brill; Silviu-Petru Cucerzan. Systems and methods for spell checking. US, set. 2005. EP1577793A2, Depósito: 14 mar. 2005. Concessão: 21 nov. 2005. Patent. Disponível em: https://patents.google.com/patent/EP1577793A2/en.
DAMERAU, Fred J. A technique for computer detection and correction of spelling errors. Communications of the ACM, Association for Computing Machinery (ACM), v. 7, n. 3, p. 171–176, mar. 1964.
DAVIDSON, Leon. Retrieval of misspelled names in an airlines passenger record system. Communications of the ACM, Association for Computing Machinery (ACM), v. 5, n. 3, p. 169–171, mar. 1962. DOI: 10.1145/366862.366913. Disponível em: https://doi.org/10.1145/366862.366913.
DEFFNER, Renate; EDER, Klaus; GEIGER, Hans. Word Recognition as a First Step Towards Natural Language Processing with Artificial Neural Networks. In: KONNEKTIONISMUS in Artificial Intelligence und Kognitionsforschung. [S.l.]: Springer Berlin Heidelberg, 1990. p. 221–225. DOI: 10.1007/978-3-642-76070-9_27. Disponível em: https://doi.org/10.1007/978-3-642-76070-9_27.
DUNLAVEY, Michael R. Letter to the Editor: On Spelling Correction and Beyond. ACM, v. 24, n. 9, p. 608–608, set. 1981.
ELLIOTT, R. J. Annotating spelling list words with affixation classes. [S.l.], dez. 1988.
GLANTZ, Herbert T. On the recognition of information with a digital computer. In: PROCEEDINGS of the 1956 11th ACM national meeting. [S.l.]: ACM Press, 1956. DOI: 10.1145/800258.808966. Disponível em: https://doi.org/10.1145/800258.808966.
GOLDING, Andrew R.; SCHABES, Yves. Combining Trigram-based and feature-based methods for context-sensitive spelling correction. In: PROCEEDINGS of the 34th annual meeting on Association for Computational Linguistics. [S.l.]: Association for Computational Linguistics, 1996. DOI: 10.3115/981863.981873. Disponível em: https://doi.org/10.3115/981863.981873.
GORIN, Ralph E. SPELL: Spelling Check and Correction Program. [S.l.: s.n.], 1974. Disponível em: https://www.saildart.org/allow/SPELL.REG%5BUP,DOC%5D. Acesso em: 8 jul. 2020.
GORIN, Ralph E.; WILLISSON, Pace; KUENNING, Geoff. Ispell. [S.l.: s.n.], 1971. Disponível em: https://www.cs.hmc.edu/~geoff/ispell.html. Acesso em: 8 jul. 2020.
GUPTA, Prabhakar. A Context-Sensitive Real-Time Spell Checker with Language Adaptability. In: 2020 IEEE 14th International Conference on Semantic Computing (ICSC). [S.l.]: IEEE, fev. 2020. DOI: 10.1109/icsc.2020.00023.
HODGE, Victoria J.; AUSTIN, Jim. A comparison of a novel neural spell checker and standard spell checking algorithms. Pattern Recognition, Elsevier BV, v. 35, n. 11, p. 2571–2580, nov. 2002. DOI: 10.1016/s0031-3203(01)00174-1. Disponível em: https://doi.org/10.1016/s0031-3203(01)00174-1.
HODGE, Victoria J.; AUSTIN, Jim. A comparison of standard spell checking algorithms and a novel binary neural approach. IEEE Transactions on Knowledge and Data Engineering, Institute of Electrical and Electronics Engineers (IEEE), v. 15, n. 5, p. 1073–1081, set. 2003. DOI: 10.1109/tkde.2003.1232265. Disponível em: https://doi.org/10.1109/tkde.2003.1232265.
INC., Google. Noam Shazeer. Method of spell-checking search queries. US, mar. 2007. US7194684B1, Depósito: 9 abr. 2002. Concessão: 20 mar. 2007. Patent. Disponível em: https://patents.google.com/patent/US7194684B1/en.
INGELS, Peter. Connected Text Recognition Using Layered HMMs and Token Passing. CoRR, cmp-lg/9607036, 1996. Disponível em: http://arxiv.org/abs/cmp-lg/9607036.
JURAFSKY, Daniel; MARTIN, James H. Speech and Language Processing: An introduction to Natural Language Processing. [S.l.]: Prentice Hall, jan. 2008. ISBN 0131873210.
KERNIGHAN, Mark D.; CHURCH, Kenneth W.; GALE, William A. A spelling correction program based on a noisy channel model. In: PROCEEDINGS of the 13th conference on Computational linguistics. [S.l.]: Association for Computational Linguistics, 1990. DOI: 10.3115/997939.997975. Disponível em: https://doi.org/10.3115/997939.997975.
KIM, Jong Yong; SHAWE-TAYLOR, John. An approximate string-matching algorithm. Theoretical Computer Science, Elsevier BV, v. 92, n. 1, p. 107–117, jan. 1992. DOI: 10.1016/0304-3975(92)90138-6. Disponível em: https://doi.org/10.1016/0304-3975(92)90138-6.
KIM, Jong Yong; SHAWE-TAYLOR, John. Fast string matching using an n-gram algorithm. Software: Practice and Experience, Wiley, v. 24, n. 1, p. 79–88, jan. 1994. DOI: 10.1002/spe.4380240105. Disponível em: https://doi.org/10.1002/spe.4380240105.
KNUTH, Donald Ervin. The Art of Computer Programming. [S.l.]: Addison-Wesley, 1973. v. 3.
KUČERA, Henry; FRANCIS, Winthrop Nelson. Computational analysis of present-day American English. [S.l.]: Brown University Press, 1967.
KUKICH, Karen. Spelling correction for the telecommunications network for the deaf. Communications of the ACM, Association for Computing Machinery (ACM), v. 35, n. 5, p. 80–90, mai. 1992. DOI: 10.1145/129875.129882. Disponível em: https://doi.org/10.1145/129875.129882.
KUKICH, Karen. Techniques for automatically correcting words in text. ACM Computing Surveys, Association for Computing Machinery (ACM), v. 24, n. 4, p. 377–439, dez. 1992. DOI: 10.1145/146370.146380.
LIU, Lon-Mu et al. Adaptive post-processing of OCR text via knowledge acquisition. In: PROCEEDINGS of the 19th annual conference on Computer Science - CSC ’91. [S.l.]: ACM Press, 1991.
LUCCHESI, Cláudio L.; KOWALTOWSKI, Tomasz. Applications of finite automata representing large vocabularies. Software: Practice and Experience, Wiley, v. 23, n. 1, p. 15–30, jan. 1993. DOI: 10.1002/spe.4380230103. Disponível em: https://doi.org/10.1002/spe.4380230103.
MAHONEY, Michael Sean. An Oral History of Unix. [S.l.]: Zenodo, 1998. DOI: 10.5281/zenodo.2525530.
MCILROY, Malcolm Douglas. Development of a Spelling List. IEEE Transactions on Communications, Institute of Electrical and Electronics Engineers (IEEE), v. 30, n. 1, p. 91–99, jan. 1982. DOI: 10.1109/tcom.1982.1095395. Disponível em: https://doi.org/10.1109/tcom.1982.1095395.
MCMAHON, Lee E.; CHERRY, Lorinda L.; MORRIS, Robert. Statistical Text Processing. The Bell System Technical Journal, v. 57, n. 6, p. 2137–2154, 1978.
MIN, Kyongho; WILSON, William. Syntactic Recovery and Spelling Correction of Ill-formed Sentences. In: 3RD Conference of the Australasian Cognitive Science (CogSci95). [S.l.: s.n.], mar. 1995. p. 1–10.
MITTON, Roger. Fifty years of spellchecking. Writing Systems Research, Informa UK Limited, v. 2, n. 1, p. 1–7, jan. 2010. DOI: 10.1093/wsr/wsq004.
MITTON, Roger. Spelling checkers, spelling correctors and the misspellings of poor spellers. Information Processing & Management, Elsevier BV, v. 23, n. 5, p. 495–505, jan. 1987.
MORRIS, Robert; CHERRY, Lorinda L. Computer detection of typographical errors. IEEE Transactions on Professional Communication, Institute of Electrical and Electronics Engineers (IEEE), PC-18, n. 1, p. 54–56, mar. 1975. DOI: 10.1109/tpc.1975.6593963.
NAGATA, Masaaki. Japanese OCR error correction using character shape similarity and statistical language model. In: PROCEEDINGS of the 36th annual meeting on Association for Computational Linguistics. [S.l.]: Association for Computational Linguistics, 1998. p. 922–928.
NORVIG, Peter. How to write a spelling corrector. [S.l.: s.n.], 2007. http://norvig.com/spell-correct.html.
ODELL, Margaret King. The profit in records management. Systems, New York, v. 20, 1956.
OFLAZER, Kemal. Error-tolerant finite-state recognition with applications to morphological analysis and spelling correction. Computational Linguistics, v. 22, n. 1, p. 73–89, 1996.
OFLAZER, Kemal; GÜZEY, Cemaleddin. Spelling correction in agglutinative languages. In: PROCEEDINGS of the fourth conference on Applied natural language processing. [S.l.]: Association for Computational Linguistics, 1994. DOI: 10.3115/974358.974406. Disponível em: https://doi.org/10.3115/974358.974406.
PETERSON, James L. Computer programs for detecting and correcting spelling errors. Communications of the ACM, Association for Computing Machinery (ACM), v. 23, n. 12, p. 676–687, dez. 1980. DOI: 10.1145/359038.359041. Disponível em: https://doi.org/10.1145/359038.359041.
PHILIPS, Lawrence. Hanging on the metaphone. Computer Language, v. 7, n. 12, p. 39–43, 1990.
PHILIPS, Lawrence. The Double Metaphone Search Algorithm. C/C++ Users J., CMP Media, Inc., USA, v. 18, n. 6, p. 38–43, jun. 2000. ISSN 1075-2838.
PILÁN, Ildikó; VOLODINA, Elena. Exploring word embeddings and phonological similarity for the unsupervised correction of language learner errors. In: LATECH@COLING. [S.l.: s.n.], 2018.
PILAR ANGELES, Maria del; ESPINO-GAMEZ, Adrian; GIL-MONCADA, Jonathan. Comparison of a Modified Spanish Phonetic, Soundex, and Phonex coding functions during data matching process. In: 2015 International Conference on Informatics, Electronics & Vision (ICIEV). [S.l.]: IEEE, jun. 2015. DOI: 10.1109/iciev.2015.7334028. Disponível em: https://doi.org/10.1109/iciev.2015.7334028.
POLLOCK, Joseph J.; ZAMORA, Antonio. Automatic spelling correction in scientific and scholarly text. Communications of the ACM, Association for Computing Machinery (ACM), v. 27, n. 4, p. 358–368, abr. 1984. DOI: 10.1145/358027.358048.
ROBBINS, Arnold; BEEBE, Nelson H. F. Classic Shell Scripting. [S.l.]: O’Reilly Media, 2005. ISBN
RUSSELL, Robert C. Robert C. Russell. Index. 1918. US 1261167A, Depósito: 25 out. 1917. Concessão: 02 abr. 1918. Disponível em: https://patents.google.com/patent/US1261167A/en.
RUSSELL, Robert C. Robert C. Russell. Index. 1922. US 1435663A, Depósito: 28 nov. 1921. Concessão: 14 nov. 1922. Disponível em: https://patents.google.com/patent/US1435663A/en.
SAMUELSSON, Axel. Weighting Edit Distance to Improve Spelling Correction in Music Entity Search. Jun. 2017. Diss. (Mestrado) – KTH Royal Institute of Technology, Stockholm.
SHANNON, Claude E. A mathematical theory of communication. Bell Syst. Tech. J., v. 27, n. 3, p. 379–423, 1948.
TAGHVA, Kazem; STOFSKY, Eric. OCRSpell: an interactive spelling correction system for OCR errors in text. International Journal on Document Analysis and Recognition, Springer Science and Business Media LLC, v. 3, n. 3, p. 125–137, mar. 2001. DOI: 10.1007/pl00013558. Disponível em: https://doi.org/10.1007/pl00013558.
TAYLOR, W. D. Grope: A spelling error correction tool. [S.l.], 1981.
TONG, Xiang; EVANS, David A. A Statistical Approach to Automatic OCR Error Correction in Context. In: PROCEEDINGS of the Fourth Workshop on Very Large Corpora (WVLC-4). [S.l.: s.n.], 1996. p. 88–100.
VERBERNE, Suzan. Context-sensitive spell checking based on word trigram probabilities. 2002. Diss. (Mestrado) – University of Nijmegen.
WING, Alan M.; BADDELEY, Alan D. Spelling errors in handwriting: A corpus and distributional analysis. In: FRITH, Uta (Ed.). Cognitive Processes in Spelling. London: Academic Press, 1980. p. 251–285.
YANNAKOUDAKIS, Emmanuel J.; FAWTHROP, David. An intelligent spelling error corrector. Information Processing & Management, Elsevier BV, v. 19, n. 2, p. 101–108, jan. 1983. DOI: 10.1016/0306-4573(83)90046-8. Disponível em: https://doi.org/10.1016/0306-4573(83)90046-8.
YANNAKOUDAKIS, Emmanuel J.; FAWTHROP, David. The rules of spelling errors. Information Processing & Management, Elsevier BV, v. 19, n. 2, p. 87–99, jan. 1983. DOI: 10.1016/0306-4573(83)90045-6. Disponível em: https://doi.org/10.1016/0306-4573(83)90045-6.
YOUNG, Charlene W.; EASTMAN, Caroline M.; OAKMAN, Robert L. An analysis of ill-formed input in natural language queries to document retrieval systems. Information Processing & Management, Elsevier BV, v. 27, n. 6, p. 615–622, jan. 1991. DOI: 10.1016/0306-4573(91)90002-4. Disponível em: https://doi.org/10.1016/0306-4573(91)90002-4.
YUNUS, Ahmed; MASUM, Md. A Context Free Spell Correction Method using Supervised Machine Learning Algorithms. International Journal of Computer Applications, Foundation of Computer Science, v. 176, n. 27, p. 36–41, jun. 2020. DOI: 10.5120/ijca2020920288. Disponível em: https://doi.org/10.5120/ijca2020920288.
ZAMORA, Elena M.; POLLOCK, Joseph J.; ZAMORA, Antonio. The use of trigram analysis for spelling error detection. Information Processing & Management, Elsevier BV, v. 17, n. 6, p. 305–316, jan. 1981. DOI: 10.1016/0306-4573(81)90044-3. Disponível em: https://doi.org/10.1016/0306-4573(81)90044-3.
License
Copyright (c) 2020 Texto Livre: Linguagem e Tecnologia
This work is licensed under a Creative Commons Attribution 4.0 International License.
This is an open access article that allows unrestricted use, distribution and reproduction in any medium as long as the original article is properly cited.