Develop Your Own Language Translation System

10 Aug 2010 | CPOL | 4 min read
An overview of Example Based Machine Translation (EBMT) systems and how to create your own using existing tools

Abstract

This article describes the development of an Example Based Machine Translation (EBMT) system using Java on the Linux platform for translating from one language to another. In this particular case, I translate English sentences into Hindi. The principle of translation in EBMT is simple: the system decides on an appropriate translation of an input sentence by analyzing the pre-translated sentences in its database. Therefore, the larger the database of pre-translated sentences, the greater the accuracy of the EBMT system is likely to be.

This article is greatly inspired by the work of Ralf Brown and Balakrishnan, who have done extensive research in this field.

Introduction and Background

Example based translation is essentially translation by analogy. If an EBMT system is given a set of sentences in the source language (the language being translated from) and their corresponding translations in the target language, it can use these examples to translate other, similar source-language sentences into the target language. The basic premise is that if a previously translated sentence occurs again, the same translation is likely to be correct again.

Software Used

Developing your own machine translation system is a difficult task. However, there are some tools that can help accelerate the process. I used the following tools in my EBMT system:

  1. Moses Decoder
  2. Giza++
  3. IRST-LM

Block Diagram

[Figure: EBMT.png, block diagram of the EBMT system]

Description

I divided the entire EBMT system into four modules.
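The four modules act as a fallback chain: each one is tried only when the previous one produces no translation. The following is a minimal sketch of that control flow, not the original implementation. The class name `TranslationCascade` and the stub modules in `main` are hypothetical and only stand in for the real modules.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical sketch: runs translation modules in order until one succeeds.
class TranslationCascade {
    private final List<Function<String, Optional<String>>> modules;

    TranslationCascade(List<Function<String, Optional<String>>> modules) {
        this.modules = modules;
    }

    /** Tries each module in order and returns the first translation produced. */
    Optional<String> translate(String sentence) {
        for (Function<String, Optional<String>> module : modules) {
            Optional<String> result = module.apply(sentence);
            if (result.isPresent()) {
                return result;
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Stub modules standing in for the first and last module of the article.
        Function<String, Optional<String>> exactMatch = s ->
                s.equals("Anshul plays football")
                        ? Optional.of("Anshul football khelta hai")
                        : Optional.empty();
        Function<String, Optional<String>> wordDecoder = s ->
                Optional.of("[word-by-word] " + s);

        TranslationCascade cascade =
                new TranslationCascade(Arrays.asList(exactMatch, wordDecoder));
        System.out.println(cascade.translate("Anshul plays football").orElse("(none)"));
        System.out.println(cascade.translate("Some other sentence").orElse("(none)"));
    }
}
```

Modelling each module as a function that returns an empty `Optional` on failure keeps the fallback order in one place, so modules can be added or reordered without touching the others.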

1. Module I: Exact Match Algorithm

In this module, the input English sentence is first compared against every sentence in the available bilingual corpus for an exact match. If a match is found, the corresponding Hindi sentence is retrieved and displayed as output.

When the input is a paragraph, it is first broken down into sentences, and each sentence is then translated one by one.
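A minimal sketch of the exact-match step is shown below. It assumes the corpus fits in memory as a hash map, so a lookup replaces the scan over every sentence. The class name, the normalization rules (case, whitespace and trailing punctuation are ignored) and the regex-based sentence splitter are all assumptions made for illustration, not details from the original system.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch of Module I: exact lookup in a bilingual corpus.
class ExactMatchTranslator {
    private final Map<String, String> corpus = new HashMap<>();

    /** Stores one pre-translated (English, Hindi) sentence pair. */
    void addExample(String english, String hindi) {
        corpus.put(normalize(english), hindi);
    }

    /** Assumed normalization: drop trailing punctuation, collapse spaces, lowercase. */
    static String normalize(String s) {
        return s.replaceAll("[.!?]+\\s*$", "")
                .trim()
                .replaceAll("\\s+", " ")
                .toLowerCase(Locale.ROOT);
    }

    /** Returns the stored Hindi sentence, or empty if there is no exact match. */
    Optional<String> translateSentence(String sentence) {
        return Optional.ofNullable(corpus.get(normalize(sentence)));
    }

    /** Naive splitter: breaks after '.', '!' or '?' followed by whitespace. */
    static List<String> splitSentences(String paragraph) {
        List<String> sentences = new ArrayList<>();
        for (String s : paragraph.trim().split("(?<=[.!?])\\s+")) {
            if (!s.isEmpty()) {
                sentences.add(s);
            }
        }
        return sentences;
    }

    /** Translates a paragraph sentence by sentence; unmatched sentences stay empty. */
    List<Optional<String>> translateParagraph(String paragraph) {
        List<Optional<String>> out = new ArrayList<>();
        for (String sentence : splitSentences(paragraph)) {
            out.add(translateSentence(sentence));
        }
        return out;
    }
}
```

A sentence that comes back empty would be handed to the next module. The splitter is deliberately simple and will mis-split abbreviations such as "Dr. Rao".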

2. Module II: Sentence Rule Based Translation

Every language has a grammar that describes how the words in a sentence should be organized. For instance, consider English vs. Hindi. English follows Subject-Verb-Object (SVO) word order, while Hindi follows Subject-Object-Verb (SOV) word order. To illustrate, compare the following two sentences:

English: Anshul plays football

Hindi: Anshul football khelta hai

This module converts the input sentence into a tokenized (template) form, in which specific words are replaced by their grammatical roles. For example, the above English sentence is converted to

<Subject> plays <Object>

This helps in generalizing the translation process.

Besides this, there are many other linguistic rules that must be taken into consideration while translating sentences.
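One way to picture the tokenized form is as a pair of templates: an English one that is matched against the input, and a Hindi one that places the captured slots in SOV order. The sketch below is a hypothetical illustration of that idea, not the rule engine used in the original system. The class name, the single-word slots and the Hindi template string are assumptions, and slot contents are copied over untranslated.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of Module II: template-based reordering (SVO -> SOV).
class TemplateTranslator {
    // A slot looks like <Subject>; names are letters and digits only.
    private static final Pattern SLOT = Pattern.compile("<([A-Za-z][A-Za-z0-9]*)>");

    private static final class Rule {
        final Pattern english;
        final List<String> slots;
        final String hindiTemplate;

        Rule(Pattern english, List<String> slots, String hindiTemplate) {
            this.english = english;
            this.slots = slots;
            this.hindiTemplate = hindiTemplate;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    /** Compiles an English template into a regex with one named group per slot. */
    void addRule(String englishTemplate, String hindiTemplate) {
        Matcher slot = SLOT.matcher(englishTemplate);
        StringBuilder regex = new StringBuilder();
        List<String> slots = new ArrayList<>();
        int last = 0;
        while (slot.find()) {
            regex.append(Pattern.quote(englishTemplate.substring(last, slot.start())));
            regex.append("(?<").append(slot.group(1)).append(">\\S+)"); // one word per slot
            slots.add(slot.group(1));
            last = slot.end();
        }
        regex.append(Pattern.quote(englishTemplate.substring(last)));
        rules.add(new Rule(Pattern.compile(regex.toString()), slots, hindiTemplate));
    }

    /** Applies the first matching rule; slot words are copied, not translated. */
    Optional<String> translate(String sentence) {
        for (Rule rule : rules) {
            Matcher m = rule.english.matcher(sentence.trim());
            if (m.matches()) {
                String out = rule.hindiTemplate;
                for (String name : rule.slots) {
                    out = out.replace("<" + name + ">", m.group(name));
                }
                return Optional.of(out);
            }
        }
        return Optional.empty();
    }
}
```

With the rule `<Subject> plays <Object>` mapped to `<Subject> <Object> khelta hai`, the input "Anshul plays football" yields "Anshul football khelta hai". A real rule set would also have to handle multi-word slots and agreement, which this sketch ignores.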

3. Module III: Phrase Decoder

When the first two modules fail to translate, we divide the sentence into phrases and run statistical machine translation algorithms on them to find the most probable translation of the input sentence.

Mathematically, we try to find:

$$H^* = \arg\max_{H} P(H \mid E) \qquad (1)$$

where E is the input English sentence and H ranges over candidate Hindi translations. This may look complicated, so let me explain how we arrive at the form that is actually computed.

According to Bayes' theorem,

$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$

In this case, we need to find the translated sentence A that has the maximum probability of being the correct translation of a given input sentence B. Since we are looking for the most likely A* for a fixed B, P(B) is constant and does not affect the maximization.

Thus, we want:

$$A^* = \arg\max_{A} P(A \mid B)$$

$$A^* = \arg\max_{A} \frac{P(B \mid A)\, P(A)}{P(B)}$$

$$A^* = \arg\max_{A} P(B \mid A)\, P(A)$$

With A = H and B = E, this is equation (1) rewritten as:

$$H^* = \arg\max_{H} P(E \mid H)\, P(H)$$

This module tries to find the most probable Hindi translation of an English sentence by finding, for each phrase, the Hindi phrase H that maximizes P(E | H) * P(H). The translated phrases are then combined to complete the sentence.

Note:

  • P(H) = [language model probability]:

    I used the IRST Language Model toolkit (IRST-LM), which measures fluency, that is, the probability of a Hindi sentence, and so favors fluent sentences as potential translations.

  • P(E | H) = [translation model probability, H -> E]:

    I used Giza++, which measures faithfulness, that is, the probability of the English sentence given a Hindi sentence, and so tests whether a given fluent sentence is actually a translation of the input.

  • arg max over H:

    I used the Moses decoder, which uses heuristic search to find H* effectively and efficiently.
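To make the maximization concrete, the sketch below scores a handful of candidate translations by the product of translation model and language model probabilities, and keeps the best one. It is only an illustration of the arg max. The probability values are made up, the class name is hypothetical, and a real decoder such as Moses searches a far larger space heuristically instead of enumerating candidates.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: pick the candidate H maximizing P(E|H) * P(H).
class NoisyChannelDemo {

    /**
     * Each map value holds { P(E|H), P(H) } for one candidate H.
     * Log probabilities are summed to avoid underflow on long sentences.
     */
    static String argMax(Map<String, double[]> candidates) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, double[]> entry : candidates.entrySet()) {
            double translationProb = entry.getValue()[0]; // P(E|H), faithfulness
            double languageProb = entry.getValue()[1];    // P(H), fluency
            double score = Math.log(translationProb) + Math.log(languageProb);
            if (score > bestScore) {
                bestScore = score;
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Illustrative, made-up probabilities for E = "Anshul plays football".
        Map<String, double[]> candidates = new LinkedHashMap<>();
        candidates.put("Anshul khelta hai football", new double[] {0.6, 0.02});
        candidates.put("Anshul football khelta hai", new double[] {0.6, 0.30});
        System.out.println(argMax(candidates));
    }
}
```

Both candidates are equally faithful to the English words in this toy example. The language model term is what separates them, because it gives the SOV ordering a higher assumed probability.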

4. Module IV: Word Decoder

This is the last attempt by EBMT to translate the input sentence. When Module III also fails to translate, EBMT breaks the sentence into words. For every word, it looks up the dictionary translation and simply stitches the outputs together into a translated sentence.
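The word-by-word fallback can be sketched as a dictionary lookup per token. The class name and the choice to pass unknown words through unchanged are assumptions for illustration; the article does not say how the original system treats words missing from the dictionary.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical sketch of Module IV: dictionary lookup, one word at a time.
class WordDecoder {
    private final Map<String, String> dictionary = new HashMap<>();

    /** Adds one English -> Hindi dictionary entry (English key is case-insensitive). */
    void addEntry(String english, String hindi) {
        dictionary.put(english.toLowerCase(Locale.ROOT), hindi);
    }

    /** Translates each word independently and joins the results with spaces. */
    String translate(String sentence) {
        StringJoiner out = new StringJoiner(" ");
        for (String word : sentence.trim().split("\\s+")) {
            // Assumption: words missing from the dictionary are kept as they are.
            out.add(dictionary.getOrDefault(word.toLowerCase(Locale.ROOT), word));
        }
        return out.toString();
    }
}
```

With only the entry "plays" -> "khelta hai", the input "Anshul plays football" becomes "Anshul khelta hai football". The words are translated but the English word order is kept, which is why this module is a last resort.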

Setup of EBMT

Basic preparation of an EBMT system requires you to do the following:

  1. Develop a bilingual corpus of pre-translated sentences from the source language to the target language.
  2. Once you have a decent-sized corpus, install Giza++, Moses and IRST-LM on your system.
  3. IRST-LM requires a monolingual file as well. This can easily be created by separating the two languages of the bilingual corpus.
  4. Finally, train on your corpus with Giza++. Behind the scenes, shell scripts and Perl scripts run that compute probabilities and generate various files such as the alignment file, translation table, fertility file, distortion table, etc.

[Figure: EBMT2.png]
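Step 3 can be done with a few lines of code. The sketch below assumes, purely for illustration, that the bilingual corpus is a UTF-8 text file with one `English<TAB>Hindi` pair per line. The article does not specify the storage format, and the file names `corpus.tsv`, `corpus.en` and `corpus.hi` are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: split a tab-separated bilingual corpus into two
// line-aligned monolingual files.
class CorpusSplitter {

    /** Returns two parallel lists: index 0 holds English lines, index 1 Hindi lines. */
    static List<List<String>> split(List<String> bilingualLines) {
        List<String> english = new ArrayList<>();
        List<String> hindi = new ArrayList<>();
        for (String line : bilingualLines) {
            String[] parts = line.split("\t", 2);
            if (parts.length < 2 || parts[0].trim().isEmpty() || parts[1].trim().isEmpty()) {
                continue; // skip malformed pairs so both sides stay aligned line by line
            }
            english.add(parts[0].trim());
            hindi.add(parts[1].trim());
        }
        List<List<String>> sides = new ArrayList<>();
        sides.add(english);
        sides.add(hindi);
        return sides;
    }

    public static void main(String[] args) throws IOException {
        // Assumed input and output file names; pass another input path as args[0].
        Path input = Paths.get(args.length > 0 ? args[0] : "corpus.tsv");
        List<List<String>> sides = split(Files.readAllLines(input, StandardCharsets.UTF_8));
        Files.write(Paths.get("corpus.en"), sides.get(0), StandardCharsets.UTF_8);
        Files.write(Paths.get("corpus.hi"), sides.get(1), StandardCharsets.UTF_8);
    }
}
```

Dropping a malformed line from both sides at once matters because the training tools rely on line N of one file being the translation of line N of the other. The Hindi side can then serve as the monolingual input for the language model.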

Result

Training with Giza++ took 1.5 days, after which my EBMT system was ready!

[Figure: EBMT4.png]

Future Work

Machine translation is a research field with a lot of work already done and a lot more yet to be done. I merely demonstrated how you can use existing tools to create your own machine translation system. This is my first step towards innovation and I have a long way to go...

History

  • 11th August, 2010: Initial post

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

