Identifying Parallel Web Pages
Keywords: classification, parallel corpora, similarity functions
Abstract: Research on statistical machine translation and corpus-based approaches for cross-language information retrieval depends on the availability of multilingual data, particularly in the form of parallel corpora (collections of equivalent texts in two or more languages). However, the scarcity of parallel corpora limits the development of these applications. The Web is a vast repository of multilingual information, which has motivated research aimed at mining corpora from it. In this article, we present PPLocator, an approach for locating parallel Web pages. PPLocator was designed to be effective while keeping processing costs low, so it avoids exhaustive pairwise comparisons when identifying candidate pairs. In addition, it tries to minimize the number of pages that need to be downloaded during the intra-site crawl. An important characteristic of our approach is that it does not rely on resources such as dictionaries, translators, or language identifiers, and it demands little effort from the human expert. Experiments using real Web data from over 284K pages attest to the viability of PPLocator. The results show that it outperforms a baseline system in both recall and precision, even though the baseline uses more resources.