EDIT: I just did another search, and finally came up with some reference to a very similar idea here (“Dialog-Based Machine Translation”), though the implementation appears to be somewhat different: http://wam.inrialpes.fr/publications/2005/DocEng05-Choumane.pdf
- They’re talking about integration with a machine translation system per se; I’m talking about pre-tagging the source text to make future automated translation easier (though providing round-trip access to Google Translate or similar would be a very helpful adjunct part of the tool, to know which parts of the document need disambiguation).
- They talk about maintaining a parallel document of some kind, using tags in the source document to reference it; I propose that it would be simpler to maintain only tags directly within the source document, and that this approach would also make later automated translation of full web pages (integrated with other styling etc.) easier.
- They talk about the system telling the user when it’s confused and asking questions, which it then maintains in an “answer tree”; I propose that authors won’t have access to that information, and will just need to review the round-trip translation to understand where confusion is arising.
- Besides which, theirs is just an academic paper; if it’s been implemented in some commercial product, I doubt many people (aside from professional translators) are using it. I want this to be an extremely widespread, cheap mechanism that any website could use.
I’ve had this idea for a while (at least a few years now, maybe 5 or more? I’d have to recover some old computers to see when I first noted it down).
The basic idea is this: provide a way for authors of online material to “tag” their texts with disambiguation information that would help translation engines more easily glean the meaning of the original text. Some advantages of such a system:
- No knowledge of other languages would be needed for an author to improve the translatability of their text.
- The tags, once entered in the source language, could ease the automatic translation into any and all target languages.
- It would not require actually changing or rewriting the source text – just tagging with additional information.
- Any translation engine that understands the standard could take advantage of the additional information. Human translators could benefit from the additional information as well.
Specifically, provide a syntax (probably for XML and HTML, such as spans with appropriate “data” attributes) for tagging groups of one or more words within a text with disambiguation information. This could take the form of: a code for “proper noun, don’t translate”; a reference to a specific meaning (“noun, sense 2” – though probably as a unique identifier for that particular entry) within a specified online dictionary; a reference to an idiom dictionary to define a phrase; a reference to another word within the sentence (“he” refers to “Chuck”); and so on.
The translation engine (any translation engine) could then take advantage of this embedded metadata to better understand the source text’s meaning, and thus translate more accurately into other languages.
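To make this concrete, here is a minimal sketch of what a tagged sentence and an engine-side extractor might look like. The attribute names (`data-mt`, `data-mt-sense`, etc.) and the dictionary identifiers are hypothetical – no such standard exists yet – and the parser just collects the tagged spans and their metadata:

```python
from html.parser import HTMLParser

# Hypothetical tagging syntax: attribute names and sense identifiers
# are illustrative only, not an existing standard.
SAMPLE = (
    '<p>I asked <span data-mt="proper-noun">Chuck</span> about the '
    '<span data-mt-sense="wiktionary:bank#noun-2">bank</span> by the river, '
    'but <span data-mt-ref="Chuck">he</span> had already '
    '<span data-mt-idiom="idiomdict:hit-the-road">hit the road</span>.</p>'
)

class DisambiguationExtractor(HTMLParser):
    """Collects (span text, metadata) pairs for each tagged span."""

    def __init__(self):
        super().__init__()
        self._pending = None   # metadata of the span we are inside, if any
        self.tags = []         # list of (span text, metadata dict)

    def handle_starttag(self, tag, attrs):
        meta = {k: v for k, v in dict(attrs).items()
                if k.startswith("data-mt")}
        if tag == "span" and meta:
            self._pending = meta

    def handle_data(self, data):
        if self._pending is not None:
            self.tags.append((data, self._pending))
            self._pending = None

    def handle_endtag(self, tag):
        if tag == "span":
            self._pending = None

parser = DisambiguationExtractor()
parser.feed(SAMPLE)
parser.close()
for text, meta in parser.tags:
    print(text, meta)
```

An engine would of course feed these hints into its actual translation model rather than just printing them; the point is only that the metadata rides along inside ordinary HTML and is trivial to recover.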
The implementation on the authoring end might be a text editor with a “translation helper” plug-in. The author could select text (one or more words) and use the translation helper to add a disambiguation: marking it “proper noun”, choosing the appropriate dictionary meaning (which the helper would look up automatically based on the selected word), searching for an appropriate idiom, entering a replacement word or synonym, and so on.
This could be supplemented by a “round-trip” translation tool, which translates the text into a selected target language, then back into the original language. Authors could then concentrate on areas that produce the most confused output – that is, we don’t want them to have to laboriously tag everything on the page, just the problem areas. Similarly, they could start with one language, then check in other languages to see if additional disambiguation is needed.
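As a sketch of how the round-trip tool might prioritize problem areas: translate each sentence out and back, score how faithful the round trip was, and surface the worst offenders for tagging first. The `translate` callable here is a stand-in for some external translation service (its signature is an assumption), stubbed out below:

```python
import difflib

def round_trip_score(sentence, translate, src="en", pivot="fr"):
    """Translate src -> pivot -> src and return similarity (0..1) to the
    original. `translate(text, src, tgt)` is assumed to wrap some external
    translation service; it is passed in as a parameter here."""
    back = translate(translate(sentence, src, pivot), pivot, src)
    return difflib.SequenceMatcher(None, sentence.lower(), back.lower()).ratio()

def flag_problem_sentences(sentences, translate, threshold=0.8):
    """Return the sentences whose round trip is least faithful --
    the spots an author should consider tagging first."""
    return [s for s in sentences
            if round_trip_score(s, translate) < threshold]

# Toy stub standing in for a real engine: it garbles an idiom
# and passes everything else through unchanged.
def fake_translate(text, src, tgt):
    return text.replace("kicked the bucket", "struck the pail")

sentences = [
    "The old dog kicked the bucket last week.",
    "The meeting starts at noon.",
]
flagged = flag_problem_sentences(sentences, fake_translate)
print(flagged)
```

The similarity metric and threshold are placeholders; a real tool would likely highlight divergent regions inline rather than flag whole sentences.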
As time goes on and translation engines like Google Translate get “smarter” at gleaning meaning from context, the need for such tagging might be reduced. But in the meantime, it could also help with machine learning – i.e., the translation engine guesses, then compares its guess to the entered tag to see whether it was correct.
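In other words, the author-entered tags could double as free labeled data. A toy sketch of that comparison (the sense identifiers are hypothetical):

```python
def tagging_accuracy(predictions, tags):
    """Fraction of author-tagged spans where the engine's guessed sense
    matches the author's tag -- a free evaluation/training signal."""
    matches = sum(1 for span, guess in predictions.items()
                  if tags.get(span) == guess)
    return matches / len(tags) if tags else 0.0

# Author-entered disambiguation tags (hypothetical identifiers).
tags = {"bank": "bank#noun-2", "he": "ref:Chuck"}
# What the engine guessed from context alone, before seeing the tags.
predictions = {"bank": "bank#noun-1", "he": "ref:Chuck"}

print(tagging_accuracy(predictions, tags))  # one of two guesses matches: 0.5
```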
Again, the key difference here compared to “assisted” translation systems is that the “operator” needs no knowledge of any but the source language. This isn’t about providing hints to help translate into any particular language, but rather hints as to the meaning of the source text.
On the other hand, I wouldn’t want this to get bogged down in more-general efforts to promote the “semantic web” for uses other than translation. That way lies burial in the bowels of an obscure W3C proposal or RFC.
There are several reasons I think Google should be the main driver of such a standard:
- They are the biggest online translator (as far as I know), and hence would be the largest user of the resulting data.
- They are capable of providing and hosting the needed online “tagging” dictionaries for authors to reference.
- They are capable of driving ad-hoc standards like new HTML attributes (see `rel="nofollow"` and `rel="canonical"`).
In fact, since I am not interested in (or capable of) implementing machine translation engines or online dictionaries myself, I don’t see how this idea can go anywhere without Google. (Yes, I know about Yahoo’s Babelfish and Microsoft’s Bing Translator, and that there are others, but I think they are all bit players compared to Google, and also not the ones to drive a standard. And I’d rather see this implemented quickly than debated in committee for the next ten years.)
Google Translations team: does this sound at all interesting?