Transliteration is the process of mapping source-language phonemes or graphemes into target-language approximations. It is useful in many cross-language applications. Automatic transliteration has recently attracted increasing attention, especially between languages whose character representations differ significantly, e.g., English and Chinese. Although there are many English-Chinese cross-language applications, transliteration between the two languages has not been studied comprehensively. This thesis concentrates on English-Chinese name transliteration and compares the performance of several transliteration methods. The contributions of this work are summarized as follows: (1) Viewing transliteration as a sequence-tagging problem, we test three machine learning methods, namely memory-based learning, the Maximum Entropy model, and Conditional Random Fields, and compare the performance of each method on different feature sets. Conditional Random Fields give the best performance under identical conditions. (2) We also apply two machine translation approaches, the phrase-based model and the N-gram model, to the transliteration problem, and study how the size of the language model influences transliteration performance. Experiments show that the reranking capability of language models is crucial to transliteration performance. (3) The results above show that the performance differences among the statistical models are small: their transliteration results overlap substantially and usually contain most of the correct names, so selecting the best candidate is mainly a matter of ranking. To compensate for the shortcomings of the statistical models, we design a name transliteration system that integrates statistical models with web mining, and implement the transliteration module within this framework.