Dictionary-based machine translation


Machine translation can use a method based on dictionary entries, which means that the words will be translated as a dictionary does – word by word, usually without much correlation of meaning between them. Dictionary lookups may be done with or without morphological analysis or lemmatisation. While this approach to machine translation is probably the least sophisticated, dictionary-based machine translation is ideally suited to the translation of long lists of phrases at the subsentential level, e.g. inventories or simple catalogs of products and services.
It can also be used to expedite manual translation, if the person carrying it out is fluent in both languages and therefore capable of correcting syntax and grammar.
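A minimal sketch of this word-by-word approach, assuming a small invented English–German glossary and a deliberately naive suffix-stripping lemmatiser (both are illustrative assumptions, not part of any system described in this article), might look as follows:

# A minimal sketch of word-by-word dictionary translation.
# The glossary and the naive lemmatiser below are invented for illustration.

GLOSSARY = {          # hypothetical English -> German entries
    "red": "rot",
    "bicycle": "Fahrrad",
    "helmet": "Helm",
}

def lemmatise(word: str) -> str:
    """Very naive lemmatisation: strip a plural '-s' if the stem is known."""
    if word.endswith("s") and word[:-1] in GLOSSARY:
        return word[:-1]
    return word

def translate(text: str) -> str:
    """Translate word by word; unknown words are passed through unchanged."""
    out = []
    for token in text.lower().split():
        lemma = lemmatise(token)
        out.append(GLOSSARY.get(lemma, token))
    return " ".join(out)

print(translate("red bicycles"))   # -> "rot Fahrrad"

As the output shows, inflection and agreement are lost along the way, which is one reason the method suits term lists better than running text.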

LMT

LMT is a Prolog-based machine-translation system that works on specially made bilingual dictionaries, such as the Collins English-German, which have been rewritten in an indexed form that is easily readable by computers. This method uses a structured lexical database in order to correctly identify word categories in the source language, thus constructing a coherent sentence in the target language, based on rudimentary morphological analysis. The system uses "frames" to identify the position a certain word should have, from a syntactic point of view, in a sentence. These "frames" are mapped via language conventions, such as UDICT in the case of English.
In its early form LMT uses three lexicons, accessed simultaneously: source, transfer and target, although it is possible to encapsulate all of this information in a single lexicon. The program uses a lexical configuration consisting of two main elements. The first element is a hand-coded lexicon addendum which contains possibly incorrect translations. The second element consists of various bilingual and monolingual dictionaries for the two languages which are the source and target languages.
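The actual LMT system is implemented in Prolog; the snippet below is only a loose illustration, with invented entries and field names, of the idea of consulting source, transfer and target lexicons for a single word:

# Loose illustration of the three-lexicon lookup idea (source, transfer, target).
# LMT itself is Prolog-based; all entries and field names here are invented.

source_lexicon = {"reads": {"pos": "verb", "lemma": "read"},
                  "book":  {"pos": "noun", "lemma": "book"}}
transfer_lexicon = {"read": "lesen", "book": "Buch"}
target_lexicon = {"lesen": {"pos": "verb"}, "Buch": {"gender": "neuter"}}

def lookup(word: str) -> dict:
    """Consult all three lexicons for one source word."""
    analysis = source_lexicon.get(word, {"pos": "unknown", "lemma": word})
    target = transfer_lexicon.get(analysis["lemma"])
    return {
        "source": word,
        "category": analysis["pos"],        # a syntactic "frame" would use this slot
        "target": target,
        "target_features": target_lexicon.get(target, {}),
    }

print(lookup("reads"))
# -> {'source': 'reads', 'category': 'verb', 'target': 'lesen', 'target_features': {'pos': 'verb'}}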

Example-Based & Dictionary-Based Machine Translation

This method of dictionary-based machine translation explores a different paradigm from systems such as LMT. An example-based machine translation system is supplied with only a "sentence-aligned bilingual corpus". Using this data, the translation program generates a "word-for-word bilingual dictionary" which is used for further translation.
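A rough sketch of how such a word-for-word dictionary could be induced from a sentence-aligned corpus, using simple co-occurrence counts and a Dice association score (the toy corpus and the scoring choice are illustrative simplifications of real alignment methods):

from collections import Counter, defaultdict

# Toy sentence-aligned corpus (invented for illustration only).
corpus = [
    ("the dog sleeps", "der Hund schläft"),
    ("the cat sleeps", "die Katze schläft"),
    ("the dog eats",   "der Hund frisst"),
    ("the man eats",   "der Mann isst"),
]

src_count, tgt_count = Counter(), Counter()
cooc = defaultdict(Counter)
for src, tgt in corpus:
    s_words, t_words = set(src.split()), set(tgt.split())
    src_count.update(s_words)
    tgt_count.update(t_words)
    for s in s_words:
        for t in t_words:
            cooc[s][t] += 1

def best_translation(s: str) -> str:
    """Pick the target word with the highest Dice association score."""
    return max(cooc[s], key=lambda t: 2 * cooc[s][t] / (src_count[s] + tgt_count[t]))

print(best_translation("dog"))     # -> "Hund"
print(best_translation("sleeps"))  # -> "schläft"

With realistic data, frequent function words and sparse counts make this much harder, which is why production systems rely on more sophisticated alignment models.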
Whilst this system would generally be regarded as an entirely different way of doing machine translation from dictionary-based machine translation, it is important to understand the complementary nature of these paradigms. Since dictionary-based machine translation works best with exactly the kind of "word-for-word bilingual dictionary" word lists that an example-based system produces, coupling these two translation engines would generate a very powerful translation tool that is, besides being semantically accurate, capable of enhancing its own functionality via perpetual feedback loops.
A system which combines both paradigms in a way similar to what was described in the previous paragraph is the Pangloss Example-Based Machine Translation engine (PanEBMT). PanEBMT uses a
correspondence table between languages to create its corpus. Furthermore,
PanEBMT supports multiple incremental operations on its corpus, which facilitates
a biased translation used for filtering purposes.

Parallel Text Processing

Douglas Hofstadter, in his "Le Ton beau de Marot: In Praise of the Music of Language", demonstrates what a complex task translation is. The author produced and analysed dozens upon dozens of possible translations of an eighteen-line French poem, thus revealing the complex inner workings of syntax, morphology and meaning. Unlike most translation engines, which choose a single translation based on a back-to-back comparison of the texts in the source and target languages, Hofstadter's work shows the inherent level of error present in any form of translation when the meaning of the source text is too detailed or complex. Thus the problem of text alignment and the "statistics of language" is brought to attention.
These discrepancies led to Martin Kay's views on translation and translation engines as a whole. As Kay puts it, "More substantial successes in these enterprises will require a sharper image of the world than any that can be made out simply from the statistics of language use". Thus Kay has brought back to light the question of meaning within language and the distortion of meaning through processes of translation.

Lexical Conceptual Structure

One of the possible uses of dictionary-based machine translation is facilitating "Foreign Language Tutoring" (FLT). This can be achieved by using machine-translation technology as well as linguistics, semantics and morphology to produce "Large-Scale Dictionaries" in virtually any given language. Developments in lexical semantics and computational linguistics between 1990 and 1996 made it possible for "natural language processing" to flourish, gaining new capabilities and thereby benefiting machine translation in general.
"Lexical Conceptual Structure" is a representation
that is language independent. It is mostly used in foreign language tutoring, especially in the natural-language-processing element of FLT. LCS has also proved to be an indispensable tool for machine translation of any kind, such as dictionary-based machine translation. Overall, one of the primary goals of LCS is "to demonstrate that synonymous verb senses share distributional patterns".
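LCS analyses typically decompose verb meanings into primitives such as CAUSE, GO and BE. The nested structure below is only a rough, hand-made approximation of what such a language-independent representation might look like; it does not follow any particular published LCS notation.

# Approximate, hand-made illustration of an LCS-like representation for
# "John broke the window"; the primitives and nesting are simplified and
# do not follow any particular published notation exactly.
lcs = ("CAUSE",
       ("JOHN",),
       ("GO", ("WINDOW",), ("TOWARD", ("STATE", "BROKEN"))))

def render(node) -> str:
    """Pretty-print the nested tuples as a bracketed expression."""
    if isinstance(node, tuple):
        return "[" + " ".join(render(n) for n in node) + "]"
    return str(node)

print(render(lcs))  # -> [CAUSE [JOHN] [GO [WINDOW] [TOWARD [STATE BROKEN]]]]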

"DKvec"

"DKvec is a method for extracting bilingual lexicons, from
noisy parallel corpora based on arrival distances of words in noisy parallel
corpora". This method emerged in response to two problems plaguing the statistical extraction of bilingual lexicons: "How can noisy parallel corpora be used? How can non-parallel yet comparable corpora be used?"
The "DKvec" method has proven invaluable for machine translation in general, due to the success it has had in trials conducted on both English–Japanese and English–Chinese noisy parallel corpora. The figures for accuracy "show a 55.35% precision from a small corpus and 89.93% precision from a larger corpus". With such impressive numbers it is safe to assume the immense impact that methods such as "DKvec" have had on the evolution of machine translation in general, and especially on dictionary-based machine translation.
Algorithms used for extracting parallel corpora in a
bilingual format exploit the following rules in order to achieve a satisfactory
accuracy and overall quality:
  1. Words have one sense per corpus
  2. Words have a single translation per corpus
  3. No missing translations in the target document
  4. Frequencies of bilingual word occurrences are comparable
  5. Positions of bilingual word occurrences are comparable
These methods can be used to generate, or to look for, occurrence patterns, which in turn are used to produce the binary occurrence vectors used by the "DKvec" method.
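A simplified sketch of the kind of positional signal such methods exploit, using binary occurrence vectors over equal-sized segments of a toy text pair (the texts and the fixed segmentation are invented simplifications; DKvec itself works with arrival distances between successive occurrences):

# Simplified sketch of position-based lexicon matching: words whose binary
# occurrence vectors over corpus segments look alike are candidate translations.
# The toy texts and fixed segmentation are invented for illustration.

def occurrence_vector(word: str, tokens: list[str], segments: int = 4) -> list[int]:
    """1 if the word occurs in the segment, else 0."""
    size = max(1, len(tokens) // segments)
    return [int(word in tokens[i * size:(i + 1) * size]) for i in range(segments)]

def similarity(a: list[int], b: list[int]) -> float:
    """Fraction of segments on which the two vectors agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

english = "the bank opened the bank closed trade grew the bank merged trade fell".split()
spanish = "el banco abrió el banco cerró el comercio creció el banco fusionó el comercio cayó".split()

v_bank  = occurrence_vector("bank", english)
v_banco = occurrence_vector("banco", spanish)
v_trade = occurrence_vector("trade", english)

print(similarity(v_bank, v_banco))   # high: similar positional pattern
print(similarity(v_trade, v_banco))  # lower: different positional pattern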

History of Machine Translation

The history of machine translation starts around the mid-1940s. Machine translation was probably the first time computers were used for non-numerical purposes. Machine translation enjoyed intense research interest during the 1950s and 1960s, which was followed by a stagnation until the 1980s. After the 1980s, machine translation became mainstream again, enjoying even greater popularity than in the 1950s and 1960s, as well as rapid expansion, largely based on the text-corpora approach.
The basic concept of machine translation can be traced back
to the 17th century in the speculations surrounding "universal
languages and mechanical dictionaries". The first true practical machine
translation suggestions were made in 1933 by Georges Artsrouni in France and Petr
Trojanskij in Russia. Both had patented machines that they believed could be
used for translating meaning from one language to another. "In June 1952, the first MT conference was convened at MIT by Yehoshua Bar-Hillel". On 7 January 1954, a machine translation convention in New York, sponsored by IBM, served to popularize the field. The convention's fame came from the translation of short Russian sentences into English. This engineering feat mesmerised the public and the governments of both the US and the USSR, which consequently stimulated large-scale funding of machine translation research.
Although the enthusiasm for machine translation was extremely high, technical and knowledge limitations led to disillusionment regarding what machine translation was actually capable of doing, at least at that time. Machine translation thus lost popularity until the 1980s, when advances in linguistics and technology helped revitalise interest in the field.

Translingual information retrieval

"Translingual information retrieval consists of
providing a query in one language and searching document collections in one or
more different languages". Most methods of TLIR can be classified into two categories, namely statistical-IR approaches and query translation. Machine-translation-based TLIR works in one of two ways: either the query is translated into the target language, or the original query is used to search while the collection of possible results is translated into the query language and used for cross-referencing. Both methods have pros and cons.
All these points suggest that dictionary-based machine translation is the most efficient and reliable form of translation when working with TLIR, because the process "looks up each query term in a general-purpose bilingual dictionary, and uses all its possible translations."
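A minimal sketch of the dictionary-based query-translation step described above, with an invented toy bilingual dictionary; every possible translation of each query term is kept, which is what makes the approach simple but noisy:

# Minimal sketch of dictionary-based query translation for TLIR:
# each query term is looked up and *all* of its possible translations are kept.
# The bilingual dictionary below is a toy example, not a real resource.

BILINGUAL = {
    "bank":  ["banco", "orilla"],      # financial institution / river bank
    "loan":  ["préstamo"],
    "river": ["río"],
}

def translate_query(query: str) -> list[str]:
    """Expand an English query into all candidate Spanish terms."""
    terms = []
    for word in query.lower().split():
        terms.extend(BILINGUAL.get(word, [word]))  # pass unknown words through
    return terms

print(translate_query("bank loan"))  # -> ['banco', 'orilla', 'préstamo']

Keeping every translation avoids committing to a wrong sense, but, as the spurious "orilla" here shows, it also introduces noise that the retrieval step must absorb.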

Machine Translation of Very Close Languages

The examples of RUSLAN, a dictionary-based machine translation system between Czech and Russian, and CESILKO, a Czech–Slovak dictionary-based machine translation system, show that in the case of very close languages simpler translation methods are more efficient, fast and reliable.
The RUSLAN system was made in order to test the hypothesis that related languages are easier to translate. System development started in 1985 and was terminated five years later due to lack of further funding. The lessons taught by the RUSLAN experiment are that a transfer-based approach to translation retains its quality regardless of how close the languages are. The two main bottlenecks of "full-fledged transfer-based systems" are complexity and unreliability of syntactic analysis.

Multilingual Information Retrieval (MLIR)

"Information Retrieval systems rank documents according to
statistical similarity measures based on the co-occurrence of terms in queries
and documents". The MLIR system was created and optimised in such a way as to facilitate dictionary-based translation of queries. This is because queries tend to be short, often just a couple of words, and although this provides little context, translating them is more feasible than translating whole documents, for practical reasons. Despite all this, the MLIR system is highly dependent on many resources, such as automated language-detection software.
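The quotation above describes ranking by statistical similarity between query terms and document terms. A toy illustration of such ranking, applied to a translated query, might look like the following; the document collection and the plain term-overlap score are invented simplifications of real similarity measures:

# Toy illustration of ranking documents by overlap with a (translated) query.
# The collection and the term-overlap score are invented simplifications.

documents = {
    "doc1": "el banco ofrece un préstamo a bajo interés",
    "doc2": "el río desborda su orilla cada primavera",
    "doc3": "historia del comercio y la banca",
}

def score(query_terms: list[str], text: str) -> int:
    """Count how many query terms occur in the document."""
    words = set(text.split())
    return sum(term in words for term in query_terms)

query = ["banco", "orilla", "préstamo"]   # e.g. the output of dictionary-based query translation
ranking = sorted(documents, key=lambda d: score(query, documents[d]), reverse=True)
print(ranking)  # -> ['doc1', 'doc2', 'doc3']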

Key words

Linguistics = n. The study of the nature, structure, and variation of language, including phonetics, phonology, morphology, syntax, semantics, sociolinguistics, and pragmatics.
Computational linguistics = n. The branch of linguistics in which the techniques of computer science are applied to the analysis and synthesis of language and speech.
Syntax = n. a. the study of the rules for the formation of grammatical sentences in a language; b. the study of the patterns of formation of sentences and phrases from words; c. the rules or patterns so studied; d. (in computing) the grammatical rules and structural patterns governing the ordered use of appropriate words and symbols for issuing commands, writing code, etc., in a particular software application or programming language.