<?xml version="1.0" encoding="UTF-8"?><!-- generator="wordpress.com" -->
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	>

<channel>
	<title>machine-translation &#171; WordPress.com Tag Feed</title>
	<link>http://wordpress.com/tag/machine-translation/</link>
	<description>Feed of posts on WordPress.com tagged "machine-translation"</description>
	<pubDate>Sun, 07 Sep 2008 04:53:08 +0000</pubDate>

	<generator>http://wordpress.com/tags/</generator>
	<language>en</language>

<item>
<title><![CDATA[Does This Translate?]]></title>
<link>http://dkseto.wordpress.com/?p=1813</link>
<pubDate>Wed, 27 Aug 2008 15:54:05 +0000</pubDate>
<dc:creator>Dan Seto</dc:creator>
<guid>http://dkseto.wordpress.com/?p=1813</guid>
<description><![CDATA[I know of an office that is looking into providing real-time machine translation (MT) of its main we]]></description>
<content:encoded><![CDATA[<p>I know of an office that is looking into providing real-time machine translation (MT) of its main website. Setting aside the question of whether this is a GoodThing(r), I was curious as to the general state of MT.</p>
<p>The two free services I know of are <a href="http://translate.google.com/translate_t?sl=fr&#38;tl=en">Google Translate</a> and <a href="http://babelfish.yahoo.com/?fr=bf-home">Yahoo! Babel Fish</a> (there are probably others, so if you know of one that does better than these, please leave a comment). I took a paragraph from an article on the <a href="http://fr.wikipedia.org/wiki/Fran%C3%A7ais">French language on Wikipedia</a> and copied it into each service. Below is the original paragraph, followed by the Babel Fish translation and then the Google translation. (Depending on the character set used, the original paragraph may have strange characters displayed. I apologize if that occurs.)</p>
<p>I don't speak French so I don't know how close the translation is to the original, but it does appear things are moving along and perhaps MT is at a point where it is useful, if not perfect.</p>
<h3>Original Wikipedia Paragraph</h3>
<p>Le français est une langue romane parlée principalement en France, dont elle est originaire (la « langue d'oïl »), ainsi qu'au Canada (principalement au Québec, dans le nord du Nouveau-Brunswick et dans l'est et le nord-est de l'Ontario, et aussi au Manitoba, en Saskatchewan, en Alberta, et en Colombie-Britannique), en Belgique (en Région wallonne et en Région de Bruxelles-Capitale) et en Suisse (en Romandie). Le français est parlé comme deuxième ou troisième langue dans d'autres régions du monde, au total dans 51 pays du monde ayant pour la plupart fait partie des anciens empires coloniaux français et belge.</p>
<h3>Yahoo! Babel Fish Translation</h3>
<p>French is a Romance language spoken mainly in France, in which it is originating (the “language d' oil”), thus qu' in Canada (mainly in Quebec, in the north of New Brunswick and l' is and the North-East of l' Ontario, and also in Manitoba, in Saskatchewan, in Alberta, and a Colombia-British), Belgium (in Walloon region and Area of Brussels-Capital) and in Switzerland (in Romandie). French is spoken like second or third language in d' other areas of the world, on the whole in 51 countries of the world for the majority having belonged to the old French and Belgian colonial empires.</p>
<h3>Google Translation</h3>
<p>The french is a Romance language spoken mainly in France, where it originates (the "langue d'oïl") and Canada (mainly in Quebec, northern NB and in the east and north - eastern Ontario and Manitoba, Saskatchewan, Alberta and BC), Belgium (Walloon Region and Brussels-Capital Region) and Switzerland (Romandie). The french is spoken as a second or third language in other regions of the world total in 51 countries around the world have for the most part former colonial empires french and Belgium.</p>
<p><strong><em>Aloha!</em></strong></p>
]]></content:encoded>
</item>
<item>
<title><![CDATA[the ultimate translation ... error ]]></title>
<link>http://papilio.wordpress.com/?p=113</link>
<pubDate>Sun, 24 Aug 2008 20:50:29 +0000</pubDate>
<dc:creator>Joana Job</dc:creator>
<guid>http://papilio.wordpress.com/?p=113</guid>
<description><![CDATA[So next time I hear &#8220;machine translation is fine with me&#8221; I&#8217;ll be sure to shove th]]></description>
<content:encoded><![CDATA[<p><strong>So next time I hear "machine translation is fine with me" I'll be sure to shove this up your ass!</strong></p>
<p><img class="aligncenter size-full wp-image-114" src="http://papilio.wordpress.com/files/2008/08/translateservererror.jpg" alt="" width="596" height="446" />Don't get me wrong! Machine Translation does serve a purpose, but it does require supervision as well!</p>
<p>Anyway, just so you know, this is a restaurant ... where the main dish is probably "translation errors with rice".</p>
]]></content:encoded>
</item>
<item>
<title><![CDATA[Vint Cerf on the Future of the Internet]]></title>
<link>http://wave4.wordpress.com/?p=449</link>
<pubDate>Sun, 17 Aug 2008 20:24:14 +0000</pubDate>
<dc:creator>Mark P. Line</dc:creator>
<guid>http://wave4.wordpress.com/?p=449</guid>
<description><![CDATA[Vint Cerf on the Future of the Internet

I happen to be a computational linguist, among other things]]></description>
<content:encoded><![CDATA[<p><a href="http://www.guardian.co.uk/commentisfree/2008/aug/17/internet.google" target="_blank">Vint Cerf on the Future of the Internet</a></p>
<ul>
<li>I happen to be a computational linguist, among other things, and I disagree with Cerf's assessment of the present and near-future success of automatic translation -- you can't translate correctly if you can't think. Computers can't think, and we won't know how to make them think for quite some time.</li>
<li>His assessment of the impact of mobile computing is spot-on, though, and it's a point that's often not on the radar in western industrialized countries with hot and cold running PC's.</li>
</ul>
]]></content:encoded>
</item>
<item>
<title><![CDATA[Towards Machine Translation Friendly Sites ]]></title>
<link>http://elbesy.wordpress.com/?p=3</link>
<pubDate>Thu, 14 Aug 2008 09:41:50 +0000</pubDate>
<dc:creator>elbesy</dc:creator>
<guid>http://elbesy.wordpress.com/?p=3</guid>
<description><![CDATA[


 




Read this first: Writing for the web: Who is reading your text? 
 Unless we spread the awa]]></description>
<content:encoded><![CDATA[<table class="contentpaneopen" border="0">
<tbody>
<tr>
<td class="createdate" valign="top"> </td>
</tr>
<tr>
<td valign="top">
<div id="ajtrans109600" class="transdoc">
<p><a href="http://elbes.com/machine-translation/writing-for-the-web">Read this first: Writing for the web: Who is reading your text? </a></p>
<p><a href="http://www.elbes.com/machine-translation/"><img class="alignright" style="margin:3px;" src="http://elbes.com/images/elbes/mt_thumb.jpg" alt="Machine Translation Ready" /></a> Unless we spread the awareness of the existence of Machine Translation (MT), no progress can be made in this domain. Collaboration between writers and MT is vital for the latter to succeed. From here,  let's encourage web writers and authors to tailor their texts according to the basic needs of MT.  Once the text is polished and prepared for MT, authors can then tell their readers that their pages are "<strong>Machine Translation Friendly</strong>", or "Machine Translation Ready".<span> </span>For this purpose, adding a small etiquette will distinguish your friendly site from other unfriendly sites and your particular content from the rest of content. It would be very useful to place on your front page if all your content is "<strong>MT-Friendly</strong>" or on the pages that you think they meet the following minimum requirements:</p>
<p><strong>1. Short paragraphs</strong><br />
If Machine Translation is using Google AJAX Language API, then your paragraph should be less than 500 characters (including spaces); the rest will be dropped!</p>
<p><strong>2. Short phrases</strong><br />
Avoid long phrases as much as you can.</p>
<p><strong>3. Clear phrases</strong><br />
Avoid ambiguous usage of linguistic components (subject, verb, object, etc.). Unclear phrases will produce erroneous translations. Use your knowledge of a second language to imagine translation scenarios; how this would translate into that language.</p>
<p><strong>4. Correct spelling, no typos</strong><br />
Wrongly spelt words will not be recognized by MT systems. These are the "white socks" of your elegant site! Little extra effort will give your site the suitable socks.</p>
<p><strong>5. No slang or invented words</strong><br />
Words which do not exist in known dictionaries will be ignored by MT systems. No MT Data Base will contain all your vocabulary.</p>
<p><strong>6. No highly elaborated terminology</strong><br />
Try to simplify the usage of special terminology as much as you can; use common synonyms when possible.</p>
<p><strong>7. No acronyms or abbreviations at all</strong><br />
Acronyms and abbreviations can be translated into anything; they should be avoided completely.</p>
<p><strong>8. Don't mix languages</strong><br />
Mixing languages in your text is confusing for humans; for the machine it can be a nightmare. If you have to include text from other languages, declare the language of the text  in your HTML code (i.e. &#60;span lang="fr"&#62;Bonjour&#60;/span&#62;).</p>
<p><strong>9. Image position</strong><br />
If your images have "titles", try to place them at the beginning of your text, or at the end; otherwise MT system will think that this is the title of your article.</p>
<p><strong>10. Clean pages</strong><br />
Clean HTML code will make the life of MT much easier; messy code can cause your translated text to break.</p>
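<p>Rule 1 above is easy to check automatically before publishing. The following is only a minimal sketch: it assumes plain text with paragraphs separated by blank lines, and it takes the 500-character figure from rule 1 at face value rather than from any API documentation. The function names are invented for illustration.</p>

```python
MAX_CHARS = 500  # limit quoted in rule 1; an assumption, not a verified API constant


def split_paragraphs(text: str) -> list[str]:
    """Split plain text into paragraphs on blank lines, dropping empty ones."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def overlong_paragraphs(text: str, limit: int = MAX_CHARS) -> list[tuple[int, int]]:
    """Return (paragraph index, length) for each paragraph that is not under the limit."""
    return [
        (index, len(paragraph))
        for index, paragraph in enumerate(split_paragraphs(text))
        if len(paragraph) >= limit  # rule 1 asks for "less than 500 characters"
    ]
```

<p>Any paragraph the function reports is a candidate for splitting before the page is labelled MT-Friendly.</p>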
<hr size="1" />You can use this code on your site, if the whole site is Machine Translation Friendly:</div>
<p><a href="http://www.elbes.com/machine-translation/"><img style="margin:0;" src="http://www.elbes.com/images/elbes/mt_thumb.jpg" alt="Machine Translation Ready" width="176" height="93" /></a></p>
<div class="code">&#60;a href="http://www.elbes.com/machine-translation/"&#62;&#60;img style="float: right; margin: 3px;" title="Machine Translation Friendly" src="http://www.elbes.com/images/elbes/mt_thumb.jpg" alt="Machine Translation Ready" width="176" height="93" /&#62;&#60;/a&#62;</div>
<p>You can use this code on your pages or articles that you think they are Machine Translation Ready:</p>
<p><a href="http://www.elbes.com/machine-translation/"><img style="float:left;margin:0;" src="http://www.elbes.com/images/elbes/mtf.gif" alt="Machine Translation Ready" width="80" height="15" /></a></p>
<div class="code">&#60;a href="http://www.elbes.com/machine-translation/"&#62;&#60;img style="margin: 0px; float: left;" title="Machine Translation Friendly" src="http://www.elbes.com/images/elbes/mtf.gif" alt="Machine Translation Ready" width="80" height="15" /&#62;&#60;/a&#62;</div>
<p>Here you have other variations if you prefer, just rename the image in the code accordingly:</p>
<p><img style="float:left;margin:0;" src="http://www.elbes.com/images/elbes/mt.gif" alt="Machine Translation Ready" width="80" height="15" /><img style="float:left;margin:0;" src="http://www.elbes.com/images/elbes/mt_left.gif" alt="Machine Translation Ready" width="80" height="15" /><img style="margin:0;" src="http://www.elbes.com/images/elbes/mt2.gif" alt="Machine Translation Ready" width="80" height="15" /></p>
<p>Source: <a title="ELBES Multilingual Communication" href="http://elbes.com">ELBES Multilingual Communication - Machine Translation</a> (<a title="ELBES Multilingual Communication" href="http://multilingualism.org">Multilingualism.org</a>)</td>
</tr>
</tbody>
</table>
]]></content:encoded>
</item>
<item>
<title><![CDATA[Microsoft Releases Office Translator for Office 2003 and 2007]]></title>
<link>http://lostincode.wordpress.com/?p=106</link>
<pubDate>Mon, 11 Aug 2008 17:14:53 +0000</pubDate>
<dc:creator>ChrisCicc</dc:creator>
<guid>http://lostincode.wordpress.com/?p=106</guid>
<description><![CDATA[The Microsoft Research Machine Translation (MSR-MT) Team today announced they have released the new O]]></description>
<content:encoded><![CDATA[<p>The Microsoft Research Machine Translation (MSR-MT) Team today announced they have released the new Office Translator plugin for Office 2003 and 2007. <img class="alignright" src="http://blogs.msdn.com/blogfiles/translation/WindowsLiveWriter/NewfeaturesforJuly_97C6/image_thumb_1.png" alt="" width="460" height="211" /></p>
<p>I have yet to test out the translation capabilities, but when I do I'll post here!</p>
<p>In the meantime, go read the instructions on how to install it manually now, or wait for the Windows Update!</p>
<p><a href="http://blogs.msdn.com/translation/archive/2008/08/06/office-document-translation.aspx">Read</a></p>
]]></content:encoded>
</item>
<item>
<title><![CDATA[Google Translation Center: The World's Largest Translation Memory]]></title>
<link>http://gigaom.com/?p=16515</link>
<pubDate>Tue, 05 Aug 2008 00:50:51 +0000</pubDate>
<dc:creator>Guest Column</dc:creator>
<guid>http://gigaom.com/?p=16515</guid>
<description><![CDATA[Disclosure: I am the founder of Der Mundo, a multilingual blogging service and translation    commun]]></description>
<content:encoded><![CDATA[<p><em>Disclosure: I am the founder of <a href="http://dermundo.com/" target="_blank">Der Mundo</a>, a multilingual blogging service and translation    community that combines human and machine translation (provided in part by    Google), and I have researched translation technology for more than 10 years    via the <a href="http://www.worldwidelexicon.org/" target="_blank">Worldwide    Lexicon</a> project. </em></p>
<p><a href="http://blogoscoped.com/archive/2008-08-04-n48.html" target="_blank">Blogoscoped reports</a> that Google is preparing to launch Google Translation Center, a new translation tool for freelance and professional translators. This is an interesting move, and it has broad implications for the translation industry, which up until now has been fragmented and somewhat behind the times from a technology standpoint.</p>
<p>Google has been investing significant resources in a multi-year effort to develop its statistical machine translation technology. Statistical MT works by comparing large numbers of parallel texts that have been translated between languages, and from these it learns which words and phrases usually map to others, similar to the way humans acquire language. The problem with statistical MT is that it requires a large number of directly translated sentences. These are hard to find, and because of this SMT systems use sources like the proceedings of the European Parliament, the United Nations, etc., which are fine if you're writing in bureaucrat-speak but aren't so great for other texts. Google Translation Center is a straightforward and very clever way to gather a large corpus of parallel texts to train its machine translation systems.</p>
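<p>To make the idea of learning word mappings from parallel texts concrete, here is a deliberately tiny sketch. It scores word pairs by how often they appear in the same sentence pair, using the Dice coefficient. This is a stand-in chosen for brevity, not the method Google or any production SMT system uses, and the three-sentence corpus is invented.</p>

```python
from collections import Counter
from itertools import product


def cooccurrence_scores(pairs):
    """Score (source word, target word) pairs with the Dice coefficient."""
    src_count, tgt_count, joint = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        src_words = set(src.lower().split())
        tgt_words = set(tgt.lower().split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        # Every source word co-occurs with every target word of the same pair.
        joint.update(product(src_words, tgt_words))
    return {
        (s, t): 2 * n / (src_count[s] + tgt_count[t])
        for (s, t), n in joint.items()
    }


def best_translation(word, scores):
    """Return the highest-scoring target word for a source word, or None."""
    candidates = {t: v for (s, t), v in scores.items() if s == word}
    return max(candidates, key=candidates.get) if candidates else None


# Invented toy corpus of (English, French) sentence pairs.
corpus = [
    ("the house", "la maison"),
    ("the car", "la voiture"),
    ("a house", "une maison"),
]
scores = cooccurrence_scores(corpus)
print(best_translation("house", scores))  # prints "maison"
```

<p>With only three sentence pairs the counts already separate "maison" from "la", which hints at why the size and variety of the parallel corpus matter so much.</p>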
<p><!--more--></p>
<p>Part machine translator and part translation memory (a sort of search    engine for translation that helps translators to recall translations), GTC    will help translators by providing a free, global translation memory, and in    turn drive costs down by reducing the amount of work needed to complete a    text. It will help Google by providing an excellent source of high quality    parallel texts that can be fed back into the statistical translation    systems.</p>
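<p>A translation memory can be pictured as a store of previously translated segments with fuzzy lookup. The sketch below is an illustration only: the class name, the 0.75 similarity threshold and the sample segment are assumptions, and nothing here reflects how GTC itself is built. It uses Python's standard <code>difflib.SequenceMatcher</code> for the similarity score.</p>

```python
from difflib import SequenceMatcher


class TranslationMemory:
    """Minimal in-memory store of source segments and their translations."""

    def __init__(self):
        self._entries = {}

    def add(self, source, target):
        """Remember that `source` was translated as `target`."""
        self._entries[source] = target

    def lookup(self, query, threshold=0.75):
        """Return (stored source, stored target, similarity) for the closest
        stored segment at or above the threshold, or None if nothing qualifies."""
        best = None
        for source, target in self._entries.items():
            ratio = SequenceMatcher(None, query.lower(), source.lower()).ratio()
            if ratio >= threshold and (best is None or ratio > best[2]):
                best = (source, target, ratio)
        return best
```

<p>A translator who has already handled "Press the red button." would get that earlier translation suggested for "Press the green button." and only need to change one word.</p>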
<p>If Google releases an API for the translation management system, it could    establish a de facto standard for integrated machine translation and    translation memory, creating a language platform around which projects like    Der Mundo can build specialized applications and collect more training    data.</p>
<p>On the other hand, GTC could be bad news for translation service bureaus —    especially those that use proprietary translation management systems as a way    to hold customers and translators hostage. Most translation bureaus aren't    really technology companies and aren't very competent at building quality    software. Google Translation Center fills a void in the translation tools    market that was created when the few independent companies, such as Trados, were acquired.</p>
<p>For freelancers, GTC could be very good news; they could work directly with    clients and have access to high quality productivity tools. Overall this is a    welcome move that will force service providers to focus on quality, while    Google, which is competent at software, can focus on building tools. Google    has a pretty mixed track record with consumer-facing services outside its core    search business. But if it positions itself as a neutral service provider, it    could enable projects like Der Mundo and others to create powerful and    easy-to-use translation services for a broad range of industries.</p>
<p>Translation management is more complex than it appears, with different practices in different industries. If you're translating a news story, you want minimal cost and fast turnaround time (publish early, correct often). If you're translating a product spec sheet, you're willing to spend more to have it done right before it goes to press. Google would be smart to position GTC as a utility for translators and to encourage service bureaus to standardize around it, much as they did around earlier tools like Trados, and much as Google has done with its keyword ad business. That strategy would also eliminate a potential conflict of interest, as translation professionals are understandably wary of contributing to something that could put them out of work, and it would avoid channel conflicts with partners who will be Google's best advocates in selling to various clients.</p>
<p>While it's my guess that Google has no intention of directly monetizing the    service (charging a commission on transactions it brokers would expose Google    to a billing and payment disbursal nightmare), the R&#38;D value of collecting    millions of parallel sentences in every language pair imaginable is    indisputable, and it will pay off in unforeseen ways. So, my guess is Google    will make this a free tool for the translation industry to use, and it will    figure the money part out later. It can afford to be patient.</p>
<p>Translation is a very difficult problem. If it weren't, it would have been    solved a long time ago. I <a href="http://www.oreillynet.com/etel/blog/2007/09/the_end_of_the_language_barrie.html" target="_blank">remain convinced</a> that a multilingual web will be a reality    in a short time, and that a menagerie of tools and services will emerge over    the next few years — some geared toward helping translators, some toward    building translation communities, and others that make publishing multilingual    sites and blogs easy and intuitive.</p>
<p>As these emerge, the web will begin translating itself, and <a href="http://www.oreillynet.com/etel/blog/2007/09/the_end_of_the_language_barrie.html">within a short    time</a>,    we'll be able to read content from sources worldwide just as we currently    explore the web in our own language  today.</p>
]]></content:encoded>
</item>
<item>
<title><![CDATA[Tradd.us - Contextual translator]]></title>
<link>http://leagueoftranslators.wordpress.com/?p=18</link>
<pubDate>Wed, 23 Jul 2008 22:38:44 +0000</pubDate>
<dc:creator>Paula Góes</dc:creator>
<guid>http://leagueoftranslators.wordpress.com/?p=18</guid>
<description><![CDATA[&#8220;Tradd.us is an innovative and creative approach to text translation. Powered by Google transl]]></description>
<content:encoded><![CDATA[<blockquote><p>"Tradd.us is an innovative and creative approach to text translation. Powered by Google translator and a range of other services we are able to translate, analyze, identify and contextualize important information from your content, in a very quick and organized way."</p></blockquote>
<p>To check it out, first you need to sign up <a href="http://tradd.us/">here</a>. Soon after, you will receive an invitation by e-mail.</p>
<p><a href="http://leagueoftranslators.files.wordpress.com/2008/07/tradd.jpg"><img class="aligncenter size-medium wp-image-19" src="http://leagueoftranslators.wordpress.com/files/2008/07/tradd.jpg?w=300" alt="" width="300" height="181" /></a></p>
<p>The tool uses semantic analysis to select interesting pieces of the translation and divide them into tabs, under which you get extra information, such as synonyms. It also helps you to find relevant information automatically by showing the best-ranked sites for the selected terms, although it apparently also shows paid links.</p>
<p>It seems to be an interesting tool. Unfortunately, it didn't prove very useful for me at the moment, as the only available choice is translation from English into another language, and for work I would be more interested in the other direction. But let's experiment by translating the quote above into Portuguese:</p>
<blockquote><p><span>Tradd.us é uma abordagem inovadora e criativa ao texto da tradução. Powered by Google tradutora e uma série de outros serviços que são capazes de traduzir, analisar, identificar e contextualizar as informações importantes a partir do seu conteúdo, de uma maneira muito rápida e organizada</span></p></blockquote>
<p>Not too bad at all - only one mistake in the first sentence, an untranslated "<span>Powered by" in the second followed by a couple of little syntax mistakes. If I were translating it, it would be:</span></p>
<blockquote><p>Tradd.us é uma abordagem inovadora e criativa <em>para tradução de textos</em>. <em>Desenvolvida pelo</em> Google Translator e uma série de outros serviços, <em>somos</em> capazes de traduzir, analisar, identificar e contextualizar informações importantes <em>encontradas no</em> seu conteúdo, de uma maneira <em>bem</em> rápida e organizada</p></blockquote>
]]></content:encoded>
</item>
<item>
<title><![CDATA[In praise of machine translation]]></title>
<link>http://uffishthought.wordpress.com/?p=120</link>
<pubDate>Mon, 14 Jul 2008 21:03:46 +0000</pubDate>
<dc:creator>rip</dc:creator>
<guid>http://uffishthought.wordpress.com/?p=120</guid>
<description><![CDATA[The “Language Log” displays a wonderful photograph of a Chinese and English sign which attempts ]]></description>
<content:encoded><![CDATA[<p>The “Language Log” displays a wonderful photograph of a Chinese and English sign which attempts to show people the way to the dining-hall ...</p>
<p><a href="http://www.tulgeywood.de/?p=108"><em>[... continue ...]</em></a></p>
]]></content:encoded>
</item>
<item>
<title><![CDATA[real time translator]]></title>
<link>http://rozettasekihi.wordpress.com/?p=62</link>
<pubDate>Mon, 14 Jul 2008 09:16:53 +0000</pubDate>
<dc:creator>amanoh</dc:creator>
<guid>http://rozettasekihi.wordpress.com/?p=62</guid>
<description><![CDATA[so while i was walking home from class, i was suddenly hit with this idea that in the future speech ]]></description>
<content:encoded><![CDATA[<p>so while i was walking home from class, i was suddenly hit with this idea that in the future speech recognition and human interface software would be so advanced to the point that interpreting and translating will be left entirely up to computers and machines.</p>
<p>this idea crossed my mind when i was thinking back to how advanced tts has become in the past two years (especially those developed by microsoft's competitors) and when i was reminiscing back to the fact that it IS entirely possible to teach a computer how to speak a (human) language fluently, just impossible to give it its own tongue and a mind to communicate with other humans (and by communicating i mean the exchange of semantically and logically irrelevant language, like the ones only humans are capable of engaging in).</p>
<p>also the fact that a computer is supposedly incapable of independent bias led me to believe that in the future translation and interpretation will all be outsourced to computers and machinery, unless the field of computational linguistics hits a huge brick wall and fails to progress from now until the end of time.</p>
<p>so in hypothesizing such an occurrence of the future, i myself deduced the possible inner mechanism of a computer/program/machine capable of such a feat, and it scared me to think that such a machine could easily be built should the idea catch the attention of interested parties, or to think that there may already BE developers/inventors who could easily build such a mechanism... which led me to come up with a little blueprint of the machine of my own...</p>
<p>these are the key components of my "real time translator":</p>
<p>1. speech to text<br />
using speech-to-text technology, the machine will acoustically record and analyze the speech being spoken and transfer it into data, most likely in some form of text. <a href="http://www.brothersoft.com/downloads/speech-to-text.html">http://www.brothersoft.com/downloads/speech-to-text.html</a> is an example of speech to text technology being developed all around the world.</p>
<p>2. sentence breaker/pos tagger/word breaker<br />
after the speech is transformed into analyzable data, the spoken speech is analyzed by a sentence breaker which given its knowledge/background in the syntactic structure of the language being spoken, breaks down the speech cluster into sentences. after the speech is broken down into simple sentences, each word is separated and given a "part of speech" tag, depending on the word's placement within the sentence and the context of the sentence. The NLP project has demos for POS taggers and it is a widely known fact that Nuance and Microsoft have both been working on sentence breakers/word breakers for a long time now.</p>
<p>3. lexicalization<br />
after each word has been broken down and tagged with a part-of-speech, the word is then referenced to the main language lexicon, which is basically a huge dictionary that stores information regarding how each word is pronounced, its frequency in usage within the language, how the word is used in different parts of speech if such information is applicable and so on. after such information is acquired from the main language lexicon, it needs to be then cross-referenced to a lexicon containing the same information in the target language so that "translation" can take place.</p>
<p>4. pos tagger/syntax builder<br />
now that the "translated" data is available in the target language, another pos tagger needs to be applied in order to correctly label the new data, which will then be fed through a syntax builder in order for it to be correctly and accurately formed into a logical sentence in the target language.</p>
<p>5. text to speech<br />
once the sentence is completely translated into the target language and is found to be syntactically and semantically accurate, the sentence then needs to be fed through a text to speech engine which will then relay the speech back to the targeted audience. text to speech can be found everywhere in the modern computer age, anywhere from global navigation systems, registry id calls, and even in windows pc's which come standard with a mediocre version of it in every copy (if you're bored, go to accessories &#62; accessibility options &#62; text-to-speech)</p>
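<p>the five components above can be strung together as a pipeline. what follows is only a toy sketch under heavy assumptions: the speech stages are placeholders that pass text straight through, the lexicon and tag table are three invented english-to-french entries, and the "syntax builder" just keeps the source word order. every function name here is made up for illustration.</p>

```python
# Toy data, invented for illustration: (source word, POS tag) -> target word.
LEXICON = {
    ("the", "DET"): "le",
    ("cat", "NOUN"): "chat",
    ("sleeps", "VERB"): "dort",
}
POS_TAGS = {"the": "DET", "cat": "NOUN", "sleeps": "VERB"}


def speech_to_text(audio):
    """Component 1 placeholder: pretend the audio is already transcribed text."""
    return audio


def break_sentences(text):
    """Component 2a: naive sentence breaker that splits on full stops."""
    return [s.strip() for s in text.split(".") if s.strip()]


def tag_words(sentence):
    """Component 2b: word breaker plus POS tagger (UNK for unknown words)."""
    return [(w, POS_TAGS.get(w, "UNK")) for w in sentence.lower().split()]


def lexicalize(tagged):
    """Component 3: look each tagged word up; unknown words pass through unchanged."""
    return [(LEXICON.get((word, pos), word), pos) for word, pos in tagged]


def build_sentence(tagged):
    """Component 4 placeholder: keep word order, restore capital and full stop."""
    return " ".join(word for word, _ in tagged).capitalize() + "."


def text_to_speech(text):
    """Component 5 placeholder: return the text that would be spoken."""
    return text


def translate(audio):
    """Run every sentence through components 1 to 5 in order."""
    sentences = break_sentences(speech_to_text(audio))
    return " ".join(
        text_to_speech(build_sentence(lexicalize(tag_words(s)))) for s in sentences
    )


print(translate("The cat sleeps."))  # prints "Le chat dort."
```

<p>a word missing from the lexicon (try "The dog sleeps.") simply passes through untranslated, which is exactly the out-of-lexicon problem a real system would have to deal with.</p>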
<p>the understood difficulties of this project are numerous and considerable in scale: the lexicon will have to be updated on a regular basis to account for new words, terms, and definitions; machine translation would mean that translations will often lack variety and be monotonous in nature; the problem of how to set the machine to deal with terms and data that may not be within the lexicon (i.e. names of people, location, new things that may seem obscure); the irregularity of language that will most definitely throw the machine off course; and also the huge amount of processing power required would make instant translation/interpretation very hard or almost impossible.</p>
<p>but as mentioned before, the benefits of such a machine would be endless as it would bridge countless gaps and holes that are duly formed because of language barriers, although it could effectively mean that what was once a proud oral tradition of humankind will now be lost and permanently outsourced to heartless machines.</p>
<p>oh, and i'd be out of a job too, but that's beside the point...</p>
]]></content:encoded>
</item>
<item>
<title><![CDATA[English to Bengali Machine Translation System Anubadok is now 0.2]]></title>
<link>http://methopath.wordpress.com/?p=14</link>
<pubDate>Wed, 09 Jul 2008 01:27:55 +0000</pubDate>
<dc:creator>Golam Mortuza Hossain</dc:creator>
<guid>http://methopath.wordpress.com/?p=14</guid>
<description><![CDATA[After a gap of almost two years, I am happy to announce the second official release (version 0.2.0) ]]></description>
<content:encoded><![CDATA[<p>After a gap of almost two years, I am happy to announce the second official release (version 0.2.0) of <a href="http://anubadok.sourceforge.net/">Anubadok</a>, a free (as in freedom) <a href="http://en.wikipedia.org/wiki/Machine_translation">machine translation</a> system for English to Bengali. Anubadok is written in <a href="http://www.perl.com/">Perl</a> and uses the <a href="http://www.cis.upenn.edu/~treebank/">Penn Treebank</a> annotation system for natural language processing. To run Anubadok 0.2.0, you need to have the part-of-speech tagger <a href="http://gposttl.sourceforge.net/">GPoSTTL</a> installed on your system. The Anubadok system can be accessed online through the <a href="http://bengalinux.sourceforge.net/cgi-bin/anubadok/index.pl">Anubadok Online</a> interface run by <a href="http://www.bengalinux.org/">Ankur</a>.</p>
<p>The first official release (ver. 0.1) of Anubadok was an experimental release that mainly served as a proof of concept for an open-source English to Bengali <a href="http://en.wikipedia.org/wiki/Machine_translation">machine translation</a> system.</p>
<p>With the release of version 0.2.0, I am glad to upgrade its official tag from "an experimental software" to "a software under development" with clear and specific implementation targets. However, given the nature of the project, there are no specific time frames for future releases. Further, given that machine translation is considered an open research topic in <a href="http://en.wikipedia.org/wiki/Computational_linguistics">Computational Linguistics</a>, you should expect to see some surprises ;) even for well-implemented situations, especially if you are comparing the results of machine translation with human translations.</p>
<p>In English, there are four types of sentences by function: <strong>Declarative</strong>, <strong>Imperative</strong>, <strong>Interrogative</strong> and <strong>Exclamatory</strong>. Each of these can further take one of four basic structures: <a href="http://en.wikipedia.org/wiki/Simple_sentence">Simple</a>, <a href="http://en.wikipedia.org/wiki/Compound_sentence_(linguistics)">Compound</a>, <a href="http://en.wikipedia.org/wiki/Complex_sentence">Complex</a> and <a href="http://en.wikipedia.org/wiki/Complex-compound_sentence">Compound-Complex</a>.</p>
<p>The table below gives the approximate implementation status of each sentence type in the current release and, conversely, the targets for future implementations.</p>
<table cellpadding="10" border="2" cellspacing="10">
<caption>
<strong>Status Table (Version: Anubadok-0.2.0 ) </strong><br />
</caption>
<tbody>
<tr>
<td></td>
<td>Declar. </td>
<td>Imper.</td>
<td>Interro.</td>
<td>Exclam.</td>
</tr>
<tr>
<td>Simple</td>
<td style="color:green;">W</td>
<td style="color:green;">W</td>
<td style="color:green;">W</td>
<td style="color:blue;">M</td>
</tr>
<tr>
<td>Compound</td>
<td style="color:blue;">M</td>
<td style="color:blue;">M</td>
<td style="color:blue;">M</td>
<td style="color:blue;">M</td>
</tr>
<tr>
<td>Complex</td>
<td style="color:red;">N  </td>
<td style="color:red;">N</td>
<td style="color:red;">N</td>
<td style="color:red;">N</td>
</tr>
<tr>
<td>Compound-Complex</td>
<td style="color:red;">N</td>
<td style="color:red;">N</td>
<td style="color:red;">N</td>
<td style="color:red;">N</td>
</tr>
</tbody>
</table>
<p><span style="color:green;">W</span>: Well implemented<br />
<span style="color:blue;">M</span>: Moderately implemented<br />
<span style="color:red;">N</span>: Not/Not-well implemented</p>
<p>Anubadok does not yet have any code to handle Complex or Compound-Complex sentences, not even at a moderate level. This is where the next development push is needed.</p>
<p><strong>A few other salient features of this release:</strong></p>
<ul>
<li> The execution method of the Anubadok system has been rewritten. Anubadok itself is now implemented as a Perl module. This means one can access Anubadok directly from a Perl program by including the Anubadok libraries (Perl modules), or from any other program through an appropriate wrapper around the Perl module.
<li> The notion of "testsuites" has been introduced for Anubadok. For a given English sentence, a testsuite compares the machine-translated sentence with the expected Bengali sentence. This is quite a useful tool when adding new features or experimenting, as it ensures that already-implemented algorithms are not affected.
<li> The Anubadok system can now handle several kinds of input documents, including plain text files, arbitrary XML documents, and HTML files with inline JavaScript and CSS. As before, it can also translate Portable Object (PO) files directly.
<li> Anubadok packaging has been completely reorganized to follow the basic structure of a standard Perl package. Consequently, Anubadok can be installed using the standard Perl module installation method.
<li> Anubadok-0.2.0 comes with an updated dictionary with 15K+ entries in its database, almost double the number of entries in the 0.1 release. Credit for this goes to all the contributors to the Ankur <a href="http://www.bengalinux.org/english-to-bengali-dictionary/">English to Bengali dictionary</a> project. Anubadok's dictionary is now updated regularly using <a href="http://www.bengalinux.org/english-to-bengali-dictionary/dumps/">database dumps</a> of the Ankur E2B dictionary.
<li> Anubadok has now moved to its new website hosted by SourceForge.
<p> <a href="http://anubadok.sourceforge.net">http://anubadok.sourceforge.net</a></p>
<p>The latest Anubadok source code can be downloaded from the "trunk" branch of its <a href="http://anubadok.svn.sourceforge.net/viewvc/anubadok/">SVN repository</a>.</p>
<li> <a href="http://bengalinux.sourceforge.net/cgi-bin/anubadok/index.pl">Anubadok Online</a>, the online interface to the Anubadok system, has been upgraded substantially. It runs directly on the SVN version of the Anubadok engine. New entries contributed by users through this interface are submitted automatically to the Ankur E2B dictionary project.
<li> A brief document is now available for download as a PDF file from <a href="http://anubadok.sourceforge.net/">its website</a>. It describes the internal workings of the Anubadok system and the algorithm it uses by walking through a specific example sentence.
</ul>
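<p>The "testsuites" idea from the list above can be sketched in a few lines. This is only an illustration, not Anubadok's actual code: Anubadok is written in Perl, while the sketch uses Python, and the names <code>run_testsuite</code> and <code>toy_translate</code> as well as the sample sentence pair are assumptions made for the example.</p>

```python
def run_testsuite(translate, cases):
    """Run a translation function over (source, expected) pairs.

    Returns a list of (source, expected, actual) tuples, one for every
    case whose output differs from the expected sentence.
    """
    failures = []
    for source, expected in cases:
        actual = translate(source)
        if actual.strip() != expected.strip():
            failures.append((source, expected, actual))
    return failures


def toy_translate(sentence):
    # Hypothetical stand-in for the real engine: a one-entry lookup table.
    table = {"I eat rice.": "আমি ভাত খাই।"}
    return table.get(sentence, "")


if __name__ == "__main__":
    cases = [("I eat rice.", "আমি ভাত খাই।")]
    failures = run_testsuite(toy_translate, cases)
    print("all passed" if not failures else failures)
```

<p>Running such a suite after every change to the engine is what catches regressions: an empty failure list means the sentences that already worked still translate as expected.</p>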
]]></content:encoded>
</item>
<item>
<title><![CDATA[Semantic Web Application]]></title>
<link>http://moeen.wordpress.com/?p=65</link>
<pubDate>Thu, 29 May 2008 20:21:02 +0000</pubDate>
<dc:creator>moeen</dc:creator>
<guid>http://moeen.wordpress.com/?p=65</guid>
<description><![CDATA[Issues regarding semantic web application.
1. The size of input text.
2. The text must be well forme]]></description>
<content:encoded><![CDATA[<p>Issues regarding semantic web application.</p>
<p>1. The size of input text.</p>
<p>2. The text must be well formed.</p>
<p>3. If the lexicon has to be built, it could be treated as atomic or complex.</p>
<p>4. The treatment of the text can be divided into semantics, lexicon atoms, grammar and context.</p>
<p>5. The loss of information during translation.</p>
<p>6. A flat layer between two natural language... a language</p>
]]></content:encoded>
</item>
<item>
<title><![CDATA[EPO MT in MIP]]></title>
<link>http://patenttranslations.wordpress.com/?p=29</link>
<pubDate>Wed, 28 May 2008 22:37:11 +0000</pubDate>
<dc:creator>patenttranslations</dc:creator>
<guid>http://patenttranslations.wordpress.com/?p=29</guid>
<description><![CDATA[Sorry for the acronym string. I couldn&#8217;t resist.
MIP (Managing Intellectual Property) is a tra]]></description>
<content:encoded><![CDATA[<p>Sorry for the acronym string. I couldn't resist.</p>
<p>MIP (<a href="http://www.managingip.com/">Managing Intellectual Property</a>) is a trade journal out of London that does a good job of providing global coverage. They have a free newsletter called <a href="https://www.managingip.com/Register.aspx?MIPWeek=true">MIP Week</a>, which is perfect for people like me, who want to stay abreast but are too cheap and lazy to read the whole magazine.</p>
<p>Yesterday they had <a href="http://www.managingip.com/Article.aspx?ArticleID=1937564&#38;LS=EMS182433">more</a> on a <a href="http://www.managingip.com/Article.aspx?ArticleID=1918807">story that they originally reported in April</a> -- specifically, a translation breakthrough for Community Patents. The idea is to start granting European patents without demanding that applicants provide translations into all 23 official languages. People wanting to read the patent could then use a proposed MT (machine translation) system. Interestingly, MIP reports that the translations would have no legal value, and that for the 1% of patents that are litigated (man, that's a lot of litigated patents) human translations would prevail.</p>
<p>I can certainly imagine some juicy courtroom arguments when they start litigating patents without predetermined translations. It also makes me wonder about enforcement. If I spent money developing something that my research (reading MTed patents at the EPO) told me is not protected, and then I got sued for infringement because it turns out the MT translation was inaccurate, I would certainly feel mistreated. If my due diligence is expected to extend to procuring an accurate translation by myself, then the EPO is transferring translation costs to industry. If, on the other hand, due diligence doesn't extend that far, why should I be penalized?</p>
<p>In any case, it should be fun. It makes me wish I were a European legal translator.</p>
]]></content:encoded>
</item>
<item>
<title><![CDATA[Top 5 Natural Language Processing Applications]]></title>
<link>http://yooname.wordpress.com/?p=49</link>
<pubDate>Tue, 13 May 2008 12:31:23 +0000</pubDate>
<dc:creator>yooname</dc:creator>
<guid>http://yooname.wordpress.com/?p=49</guid>
<description><![CDATA[In the last decades, Natural Language Processing (NLP) has been equally hyped and criticized. All in]]></description>
<content:encoded><![CDATA[<p>In recent decades, Natural Language Processing (NLP) has been equally hyped and criticized. All in all, many applications have emerged in the real world following intense and continued research and development. Here's a list of the most prominent success stories.</p>
<p>Given that this blog is about named entity recognition (NER), itself an NLP application, we would be biased in adding NER to this list. As such, we've excluded ourselves from the chart-toppers ;)</p>
<p><strong>#5: Chat bots</strong></p>
<pre>"HELLO, MY NAME IS DOCTOR SBAITSO.</pre>
<pre>I AM HERE TO HELP YOU.
SAY WHATEVER IS ON YOUR MIND FREELY,
OUR CONVERSATION WILL BE KEPT IN THE STRICTEST CONFIDENCE.
MEMORY CONTENTS WILL BE WIPED CLEAN AFTER YOU LEAVE,</pre>
<pre>SO, TELL ME ABOUT YOUR PROBLEMS."</pre>
<p>The first time I chatted with <a href="http://en.wikipedia.org/wiki/Dr._Sbaitso">Dr. Sbaitso</a>, I was about 12 years old. Probably more than anything else, it has influenced my career path. Over the years, chat bots such as <a href="http://en.wikipedia.org/wiki/ELIZA">ELIZA</a>, <a href="http://en.wikipedia.org/wiki/Artificial_Linguistic_Internet_Computer_Entity">A.L.I.C.E.</a> and <a href="http://en.wikipedia.org/wiki/Jabberwacky">Jabberwacky</a> have propelled the art of conversational robots, leading to <a href="http://www.microsoft.com/serviceproviders/solutions/asa.mspx">Automated Service Agent</a> applications (see <a href="http://www.nextit.com/">NextIT</a>).</p>
<p>For their lasting impact on generations of NLP developers, and for the interesting improvements that ensued, chat bots rank #5.</p>
<p><strong>#4: NLP-based search engines</strong></p>
<p><a href="http://www.ask.com/">Ask Jeeves</a> pioneered it, <a href="http://www.powerset.com/">Powerset</a> redefined it, but we are all somewhat skeptical when it comes to beating Google's classic vector space models and ranking techniques. Do we really need shallow NLP parsing to answer "When did Einstein die?", or will <a href="http://scholar.google.com/scholar?hl=en&#38;q=the+One-Million+Fact+Extraction+Challenge">statistical fact extraction</a> suffice?</p>
<p>Though it is the Holy Grail of NLPers, it has not yet surpassed current information retrieval techniques. As such, NLP-based search engines rank #4.</p>
<p><strong>#3: Speech recognition</strong></p>
<p>Microsoft and Ford just teamed up to develop in-car speech recognition. But they forgot to include <a href="http://en.wikipedia.org/wiki/Electronic_Voice_Alert">Electronic Voice Alert</a>, a feature of mid-80s luxury Chrysler cars!</p>
<p>In all seriousness, automatic speech recognition (ASR) is a vital application for hands-free computing (for disabled persons or for certain circumstances, such as driving) and for <a href="http://www.nuance.com/naturallyspeaking/">transcription</a>. It is also poised to revolutionize <a href="http://www.coveo.com/en/Products/CAVS.aspx">audio-video content</a> retrieval.</p>
<p>For where it came from, and for where it's going, ASR ranks #3.</p>
<p><strong>#2: Machine translation</strong></p>
<p><em>"It is apparent to me that the possibilities of the aeroplane, which two or three years ago were thought to hold the solution to the [flying machine] problem, have been exhausted, and that we must turn elsewhere."</em> - <a title="http://en.wikiquote.org/wiki/Incorrect_predictions#Airplanes" href="http://en.wikiquote.org/wiki/Incorrect_predictions#Airplanes">Thomas Edison, inventor, 1895</a></p>
<p>The "heavier-than-air" problem that once plagued flight technology is probably the <a href="http://apperceptual.wordpress.com/2008/01/07/artificial-intelligence-considered-as-heavier-than-air-flight/">best comparison we can make to AI</a> and machine translation (MT). It was long believed that MT would require a completely automatic understanding of human language before a resolution finally came. But today's <a href="http://www.google.com/translate_t">Google</a> and <a href="http://iit-iti.nrc-cnrc.gc.ca/projects-projets/portage_e.html">Government of Canada</a> systems surpass human translation abilities (can you translate from French to Chinese? Not me.) Their good level of precision makes them useful in many applications.</p>
<p>People are constantly <a href="http://www.news.com/8301-13577_3-9857280-36.html">pinpointing these systems' shortcomings,</a> but nobody would contest their second-place ranking on this list.</p>
<p><strong>#1: Knowledge discovery in texts</strong></p>
<p>Have you ever heard of software that finds new relationships and interactions between genes, proteins or cells? By mining large collections of scientific literature, NLP agents can discover and highlight novel and surprising knowledge.</p>
<p>What makes knowledge discovery so promising is the hope that, in the near future, we may monitor all these documents that are just too abundant to be processed manually. Early forms of knowledge discovery, such as <a href="http://en.wikipedia.org/wiki/Data_mining">data mining</a>, are already used for Business Intelligence (BI), and outside the NLP world, examples of <a href="http://www.popsci.com/scitech/article/2006-04/john-koza-has-built-invention-machine">machine-made inventions</a> already exist.</p>
<p>As a form of <a href="http://en.wikipedia.org/wiki/Technological_singularity">technological singularity</a>, and as an emerging field of research for NLP, knowledge discovery gets first place on this list of top NLP applications.</p>
]]></content:encoded>
</item>
<item>
<title><![CDATA[MT Eval with Binary Comparisons]]></title>
<link>http://ealdent.wordpress.com/?p=613</link>
<pubDate>Tue, 13 May 2008 04:53:39 +0000</pubDate>
<dc:creator>Jason Adams</dc:creator>
<guid>http://ealdent.wordpress.com/?p=613</guid>
<description><![CDATA[The standard way of doing human evaluations of machine translation (MT) quality for the past few yea]]></description>
<content:encoded><![CDATA[<p style="text-align:justify;">The standard way of doing human evaluations of machine translation (MT) quality for the past few years has been to have human judges grade each sentence of MT output against a reference translation on measures of adequacy and fluency.  Adequacy is the level at which the translation conveys the information contained in the original (source language) sentence.  Fluency is the level at which the translation conforms to the standards of the target language (in most cases, English).  The judges give each sentence a score for both in the range of 1-5, similar to a movie rating.   It became apparent early on that not even humans correlate well with each other.  One judge may be sparing with the number of 5's he gives out, while another may give them freely.  The same problem crops up in recommender systems, which I have <a href="http://mendicantbug.com/2007/12/14/netflix-prize-good-science-or-not/" target="_self">talked about in the past</a>.</p>
<p style="text-align:justify;">It matters how well judges can score MT output, because that is the evaluation standard by which automatic metrics for MT evaluation are judged.  The better an MT metric correlates with how human judges would rate sentences, the better.  This not only helps properly gauge the quality of one MT system over another, it drives improvements in MT systems.  If judges don't correlate well with each other, how can we expect automatic methods to correlate well with them?  The standard practice now is to normalize the judges' scores in order to help remove some of the bias in the way each judge uses the rating scale.</p>
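<p style="text-align:justify;">To make the normalization step concrete, here is a minimal sketch.  The exact normalization used in evaluation campaigns is not specified above, so per-judge z-scoring is an assumption on my part, as are the function name <code>normalize_by_judge</code> and the toy ratings.</p>

```python
from statistics import mean, pstdev


def normalize_by_judge(scores):
    """Z-score each judge's ratings against that judge's own mean and spread.

    scores: dict mapping a judge name to a list of raw 1-5 ratings.
    Returns a dict with the same keys and lists of normalized scores.
    """
    normalized = {}
    for judge, ratings in scores.items():
        mu = mean(ratings)
        sigma = pstdev(ratings)
        if sigma == 0:
            # A judge who gives every sentence the same score carries no ranking signal.
            normalized[judge] = [0.0 for _ in ratings]
        else:
            normalized[judge] = [(r - mu) / sigma for r in ratings]
    return normalized


if __name__ == "__main__":
    raw = {"strict": [1, 2, 3], "lenient": [3, 4, 5]}
    print(normalize_by_judge(raw))
```

<p style="text-align:justify;">In this toy case the strict and the lenient judge end up with identical normalized scores, because they rank the three sentences the same way and differ only in how generously they use the scale.</p>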
<p style="text-align:justify;">Vilar et al. (2007) propose a new way of handling human assessments of MT quality:  binary system comparisons.  Instead of giving a rating on a scale of 1-5, they propose that judges compare the output from two MT systems and simply state which is better.  The definition of what constitutes "better" is left vague, but judges are instructed not to specifically look for adequacy or fluency.  By mixing up the sentences so that one judge is not judging the output of the same system (which could introduce additional bias), this method should simplify the task of evaluating MT quality while leading to better intercoder agreement.</p>
<p style="text-align:justify;">The results were favorable, and the advantages of this method seem to outweigh the fact that it requires more comparisons than the previous method required ratings.  The previous method needs two ratings (adequacy and fluency) per system output for each sentence, so the total grows as O(n), where <em>n</em> is the number of systems (the number of sentences is constant).  Binary system comparisons require more judgments because the systems have to be ordered:  O(log n!), which grows like n log n.  In most MT comparison campaigns the difference is negligible, but it becomes increasingly pronounced as <em>n</em> increases.</p>
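<p style="text-align:justify;">A quick back-of-the-envelope calculation shows how the two counts grow.  The constants here are my own assumptions: two ratings per system for the old scheme, and the information-theoretic minimum of ceil(log2(n!)) pairwise comparisons needed to order <em>n</em> systems in the worst case.  The paper's actual procedure may use more.</p>

```python
import math


def ratings_per_sentence(n_systems):
    # One adequacy score and one fluency score for each system's output.
    return 2 * n_systems


def min_comparisons_per_sentence(n_systems):
    # Lower bound for comparison-based sorting: ceil(log2(n!)).
    # For an integer x >= 1, (x - 1).bit_length() equals ceil(log2(x)) exactly.
    return (math.factorial(n_systems) - 1).bit_length()


if __name__ == "__main__":
    for n in (2, 5, 10, 20):
        print(n, ratings_per_sentence(n), min_comparisons_per_sentence(n))
```

<p style="text-align:justify;">With these constants the comparison count starts below the rating count for a handful of systems and overtakes it at around ten systems, which fits the observation that the gap only matters for large campaigns.</p>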
<p style="text-align:justify;">What would be interesting to me is a movie recommendation system that asks you a similar question:  which do you like better?  Of course, this means more work for you.  The standard approaches for collaborative filtering would have to change.  For example, doing singular value decomposition on a matrix of ratings would no longer be possible when all you have are comparisons between movies.  Also, people will still disagree with themselves (in theory).  You might say <em>National Treasure</em> was better than <em>Star Trek VI</em>, which was better than <em>Indiana Jones and the Last Crusade</em>, which was better than <em>National Treasure</em>.  You'd have to find some way to deal with cycles like this (ignoring it is one way).</p>
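<p style="text-align:justify;">Spotting such a cycle is straightforward if you treat each "A is better than B" answer as a directed edge and look for a loop with a depth-first search.  The sketch below is just one way to do it; <code>find_cycle</code> is a name I made up, and what to do with a cycle once found is left open.</p>

```python
def find_cycle(preferences):
    """Return one preference cycle as a list of items, or None if there is none.

    preferences: iterable of (better, worse) pairs.
    """
    graph = {}
    for better, worse in preferences:
        graph.setdefault(better, []).append(worse)
        graph.setdefault(worse, [])

    UNSEEN, ON_PATH, DONE = 0, 1, 2
    state = {node: UNSEEN for node in graph}
    path = []

    def visit(node):
        state[node] = ON_PATH
        path.append(node)
        for nxt in graph[node]:
            if state[nxt] == ON_PATH:
                # We walked back onto the current path: that closes a cycle.
                return path[path.index(nxt):] + [nxt]
            if state[nxt] == UNSEEN:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        state[node] = DONE
        return None

    for node in graph:
        if state[node] == UNSEEN:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


if __name__ == "__main__":
    prefs = [
        ("National Treasure", "Star Trek VI"),
        ("Star Trek VI", "Indiana Jones and the Last Crusade"),
        ("Indiana Jones and the Last Crusade", "National Treasure"),
    ]
    print(find_cycle(prefs))
```

<p style="text-align:justify;">For the three movies above, the function returns the loop starting and ending at <em>National Treasure</em>; a consistent set of preferences returns <code>None</code>.</p>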
<h3>References</h3>
<p>Vilar, D., G. Leusch, H. Ney, and R. E. Banchs. 2007. Human Evaluation of Machine Translation Through Binary System Comparisons. In <em>Proceedings of the Second Workshop on Statistical Machine Translation</em>. 96-103. [<a href="http://acl.ldc.upenn.edu/W/W07/W07-0713.pdf" target="_blank">pdf</a>]</p>
]]></content:encoded>
</item>
<item>
<title><![CDATA[Seven Grand Tech Challenges]]></title>
<link>http://ealdent.wordpress.com/?p=597</link>
<pubDate>Thu, 17 Apr 2008 01:05:54 +0000</pubDate>
<dc:creator>Jason Adams</dc:creator>
<guid>http://ealdent.wordpress.com/?p=597</guid>
<description><![CDATA[According to Gartner, these will keep us busy for the next 25 years.

Eliminate need to recharge bat]]></description>
<content:encoded><![CDATA[<p style="text-align:justify;">According to Gartner, <a href="http://www.networkworld.com/news/2008/040908-gartner-it-challenges.html" target="_blank">these will keep us busy</a> for the next 25 years.</p>
<ol>
<li>Eliminate need to recharge batteries on wireless devices</li>
<li>Improved parallel processing (at the PL and OS levels)</li>
<li>Gesture detection</li>
<li>Speech-to-speech machine translation</li>
<li>Long term persistent storage</li>
<li>100-fold increase in programmer productivity</li>
<li>Identifying the financial consequences of IT investment</li>
</ol>
<p style="text-align:justify;">I think numbers 1-3, 5, and 6 are almost certainly doable (though they all lie outside of my expertise).  Number 4 will at least make very long strides towards being widespread and easy to use.  I seriously doubt it will be perfect (and by perfect I mean as good as a trained translator).  Number 7 I have no idea about, but getting management to understand the exact benefits of IT has been elusive for the past twenty-five years.  I doubt IT managers even have that kind of understanding about it.  There are so many variables.  As humans are made to become more and more slaves to their corporate overlords (I mean protectors), perhaps productivity will become more predictable.</p>
]]></content:encoded>
</item>
<item>
<title><![CDATA[Fukudome and the machine translation]]></title>
<link>http://jhockey.wordpress.com/?p=154</link>
<pubDate>Tue, 01 Apr 2008 18:48:48 +0000</pubDate>
<dc:creator>simoncurrie</dc:creator>
<guid>http://jhockey.wordpress.com/?p=154</guid>
<description><![CDATA[Cubs biggest acquisition of the off season, Kosuke Fukudome, had a near-perfect MLB debut opening da]]></description>
<content:encoded><![CDATA[<p>The Cubs' biggest acquisition of the offseason, <a href="http://chicago.cubs.mlb.com/news/article.jsp?ymd=20080331&#38;content_id=2476242&#38;vkey=news_chc&#38;fext=.jsp&#38;c_id=chc" target="_blank">Kosuke Fukudome, had a near-perfect MLB debut on opening day at Wrigley Field</a>, getting a double, a single, a walk, and the game-tying home run in the bottom of the 9th. But this being the Cubs, they lost the game in extra innings.</p>
<p><a href="http://jhockey.wordpress.com/files/2008/04/guuzen3.jpg" title="guuzen3.jpg"><img src="http://jhockey.wordpress.com/files/2008/04/guuzen3.jpg" alt="guuzen3.jpg" /></a></p>
<p>In a funny case of mistranslation, it seems someone had been handing out <a href="http://chicago.cubs.mlb.com/news/article.jsp?ymd=20080331&#38;content_id=2473014&#38;vkey=news_chc&#38;fext=.jsp&#38;c_id=chc" target="_blank">bilingual "IT'S GONNA HAPPEN" handheld signs</a> to Cubs fans as part of opening day promotions (this is the 100th season since the Lovable Losers' last World Series win, and they have a pretty good team). But the Japanese printed on the signs, "偶然だぞ", actually means "You Were Lucky".</p>
<p>So, it's almost ironic that Fukudome had a great night while the home fans were unintentionally telling their newest import that he was just lucky. Hahaha, machine translations. That's why us translators will still have work for the foreseeable future.</p>
<p>Google Translate gives "偶然だぞ" for "It's gonna happen", and I guess that's what the person used without first checking it with any of the many Japanese speakers in Chicago. It's on par with the various <a href="http://engrish.com/recent.php" target="_blank">Engrish</a> found around Asia and <a href="http://www.hanzismatter.com/2004_10_01_archive.html" target="_blank">misused Chinese characters in the West</a>. It's at least kinda funny though.</p>
]]></content:encoded>
</item>
<item>
<title><![CDATA[WordNet 3.0 Vocabulary Helper]]></title>
<link>http://corpora.wordpress.com/?p=66</link>
<pubDate>Wed, 19 Mar 2008 02:20:43 +0000</pubDate>
<dc:creator>Warren</dc:creator>
<guid>http://corpora.wordpress.com/?p=66</guid>
<description><![CDATA[This seems like an interesting tool, WordNet 3.0 Vocabulary Helper. Wikipedia defines WordNet as som]]></description>
<content:encoded><![CDATA[<p>This seems like an interesting tool, <a href="http://poets.notredame.ac.jp/cgi-bin/wn">WordNet 3.0 Vocabulary Helper</a>. Wikipedia defines WordNet as something which "groups English words into sets of synonyms called synsets, provides short, general definitions, and records the various semantic relations between these synonym sets."</p>
<p>It was created at Princeton University for research in Machine Translation. An offline version can be downloaded from the official <a href="http://wordnet.princeton.edu/obtain">Princeton University website</a>.</p>
]]></content:encoded>
</item>
<item>
<title><![CDATA[Is Systran going statistical?]]></title>
<link>http://ealdent.wordpress.com/?p=541</link>
<pubDate>Wed, 27 Feb 2008 21:30:13 +0000</pubDate>
<dc:creator>Jason Adams</dc:creator>
<guid>http://ealdent.wordpress.com/?p=541</guid>
<description><![CDATA[Systran is one of the oldest companies around that provide machine translation software.  They powe]]></description>
<content:encoded><![CDATA[<p align="justify">Systran is one of the oldest companies around that provide machine translation software.  They power some language pairs of Microsoft's translation service, Altavista's Babelfish, and quite a few others (<a href="http://mendicantbug.com/2007/10/22/goodbye-systran/">including, until recently, Google</a>).  In the past, their software has been rule-based, so translation is done with a bilingual dictionary and a set of rules for changing text from one language into another.  Based on a recent bevy of job postings on <a href="http://linguistlist.org/" target="_blank">Linguist List</a>, it appears they are going statistical.  Maybe they have been for a while, I don't know, since I don't actually follow what they do, but this piqued my interest.</p>
<p>If your interest is piqued too, the listings are for:</p>
<ol>
<li><a href="http://linguistlist.org/issues/19/19-656.html" target="_blank">Research Scientist in computational linguistics</a></li>
<li><a href="http://linguistlist.org/issues/19/19-655.html" target="_blank">Program manager</a></li>
<li><a href="http://linguistlist.org/issues/19/19-654.html" target="_blank">Software Engineer</a></li>
</ol>
<p>And, of course, salary ranges are not provided.</p>
]]></content:encoded>
</item>
<item>
<title><![CDATA[Syntactic Features for MT Eval]]></title>
<link>http://ealdent.wordpress.com/?p=529</link>
<pubDate>Sun, 24 Feb 2008 05:57:35 +0000</pubDate>
<dc:creator>Jason Adams</dc:creator>
<guid>http://ealdent.wordpress.com/?p=529</guid>
<description><![CDATA[Stepping back in time in MT Eval from my last post, Liu and Gildea (2005) were among the first to re]]></description>
<content:encoded><![CDATA[<p align="justify">Stepping back in time in MT Eval from my <a href="http://mendicantbug.com/2008/02/20/labeled-dependencies-in-mt-evaluation/">last post</a>, Liu and Gildea (2005) were among the first to really bring syntactic information to evaluating machine translation output.  They proposed three metrics for evaluating machine hypotheses:  the subtree metric (STM), the tree kernel metric (TKM), and the headword chain metric (HWCM).  STM and TKM also had variants for dependency trees, which HWCM relies on.  Owczarzak et al. (2007) extended HWCM from dependency parses to LFG parses.  HWCM has attracted more attention since it showed better correlation at the sentence level than either STM or TKM (both versions) and outperformed BLEU on longer n-grams.  It's interesting to note, though, that the dependency-based tree kernel metric performed best of all at the corpus level.  Sentence-level granularity is typically more important for helping you tune your MT system.</p>
<p align="justify">The subtree metric is a fairly straightforward idea.  You begin by parsing both the hypothesis and the reference sentences using a parser like Charniak or Collins to get a Penn TreeBank style phrase structure tree.  You then compare all subtrees in the hypothesis to the reference trees, thresholding the number of matches by the best match in the reference trees.   The formula is given below:</p>
<div style="text-align:center;"><img src="http://ealdent.wordpress.com/files/2008/02/stm_formula.png" alt="subtree metric formula" /></div>
<p align="justify">The tree kernel metric uses convolution kernels discussed by Collins and Duffy (2001).  For the specifics of this method, I refer you to the respective papers (and I may post on it at a later date), but the general idea is that you can transform structured data (a tree) into a feature vector by using the <a href="http://en.wikipedia.org/wiki/Kernel_trick" target="_blank">kernel trick</a>.  The number of subtrees of a tree can be exponential in the size of the sentence, which would make explicit computation infeasible for long sentences.  The kernel trick lets us operate in this exponentially-high-dimensional space with a polynomial time algorithm.  Once we have constructed the feature vectors for the hypothesis and reference trees, we can score them with their cosine similarity:</p>
<div style="text-align:center;"><img src="http://ealdent.wordpress.com/files/2008/02/tkm_formula.png" alt="tree kernel metric" /></div>
<p align="justify">H(T1) and H(T2) are vectors with non-zero values for subtrees (dimensions) that appear in each tree, so the dot product of the two is the number of subtrees in common.  The score is computed as the maximum cosine similarity between the hypothesis and the references.</p>
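<p align="justify">Here is a rough sketch of how that computation can go, following the recursive subtree-counting idea in Collins and Duffy (2001) as I understand it.  The tree encoding (nested tuples with words as plain strings), the function names, and the omission of the decay factor from the original paper are all simplifying assumptions of mine.</p>

```python
import math


def label(node):
    # A node is either a word (str) or a tuple (label, child, child, ...).
    return node if isinstance(node, str) else node[0]


def production(node):
    # The grammar rule expanded at this node, e.g. ("NP", ("DT", "NN")).
    return (node[0], tuple(label(child) for child in node[1:]))


def internal_nodes(tree):
    if isinstance(tree, str):
        return []
    nodes = [tree]
    for child in tree[1:]:
        nodes.extend(internal_nodes(child))
    return nodes


def common_subtrees(n1, n2):
    # Number of common subtrees rooted at both n1 and n2.
    if production(n1) != production(n2):
        return 0
    count = 1
    for c1, c2 in zip(n1[1:], n2[1:]):
        if not isinstance(c1, str) and not isinstance(c2, str):
            count *= 1 + common_subtrees(c1, c2)
    return count


def tree_kernel(t1, t2):
    # Dot product H(T1) . H(T2), computed without building the vectors.
    return sum(common_subtrees(a, b)
               for a in internal_nodes(t1) for b in internal_nodes(t2))


def tkm(hypothesis, references):
    # Maximum cosine similarity between the hypothesis and any reference tree.
    h_norm = math.sqrt(tree_kernel(hypothesis, hypothesis))
    return max(tree_kernel(hypothesis, ref)
               / (h_norm * math.sqrt(tree_kernel(ref, ref)))
               for ref in references)


if __name__ == "__main__":
    hyp = ("NP", ("DT", "the"), ("NN", "dog"))
    ref = ("NP", ("DT", "the"), ("NN", "cat"))
    print(tree_kernel(hyp, hyp), tree_kernel(hyp, ref), tkm(hyp, [ref]))
```

<p align="justify">For the two toy noun phrases, each tree has six subtrees and they share three of them, so the cosine similarity comes out to 0.5.  The double loop over node pairs is what keeps the cost polynomial even though the implicit feature space is exponentially large.</p>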
<p align="justify">Finally, the headword chain metric (HWCM) relies on dependency parses, which I touched on in my previous post.</p>
<div align="justify">
<blockquote><p>In dependency grammars, a tree is built by linking a word to its head. So a determiner would be linked to the noun it modifies, the direct object would be linked to the verb, etc. Each link of this sort is a headword chain of length 2. As you build up the tree, you can construct longer and longer headword chains.</p></blockquote>
</div>
<p align="justify">The HWCM score is calculated just like the STM except by comparing headword chains.  The difference between the HWCM and the dependency version of the STM is that STM considers all subtrees whereas HWCM only looks at direct mother-daughter relations (no cousins or sisters).</p>
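<p align="justify">A small sketch may help make the headword chain idea concrete.  It is my own reading of the metric rather than the authors' code: the dependency tree is encoded as a list of (word, head index) pairs, chains of length 1 up to a chosen maximum are counted, hypothesis counts are clipped by the best count in any reference, and the per-length precisions are averaged.  The names <code>headword_chains</code> and <code>hwcm</code> and the default maximum length are assumptions.</p>

```python
from collections import Counter


def headword_chains(tree, length):
    """Count headword chains of the given length.

    tree: list of (word, head_index) pairs; head_index is -1 for the root.
    A chain lists words from an ancestor down to a descendant, following
    direct head-to-dependent links only.
    """
    chains = Counter()
    for start in range(len(tree)):
        chain = []
        j = start
        while j != -1 and length > len(chain):
            chain.append(tree[j][0])
            j = tree[j][1]
        if len(chain) == length:
            chains[tuple(reversed(chain))] += 1
    return chains


def hwcm(hypothesis, references, max_length=4):
    """Average clipped chain precision over chain lengths 1..max_length.

    Expects at least one reference tree.
    """
    total = 0.0
    for n in range(1, max_length + 1):
        hyp_chains = headword_chains(hypothesis, n)
        if not hyp_chains:
            continue  # no chains this long in the hypothesis: contributes zero
        ref_chains = [headword_chains(ref, n) for ref in references]
        clipped = sum(min(count, max(rc[chain] for rc in ref_chains))
                      for chain, count in hyp_chains.items())
        total += clipped / sum(hyp_chains.values())
    return total / max_length


if __name__ == "__main__":
    ref = [("the", 1), ("dog", 2), ("barks", -1)]
    hyp = [("the", 1), ("cat", 2), ("barks", -1)]
    print(hwcm(hyp, [ref], max_length=2))
```

<p align="justify">In the toy example, two of the three single words match but neither length-2 chain does (the wrong noun breaks both links), so with a maximum length of 2 the score is one third.  Note that chains longer than the tree is deep contribute zero, so the maximum length should be chosen with typical parse depth in mind.</p>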
<h3>References</h3>
<p align="justify">Michael Collins and Nigel Duffy. 2001. Convolution kernels for natural language. In <i>Advances in Neural Information Processing Systems</i>.</p>
<p align="justify">Ding Liu and Daniel Gildea.   2005.  Syntactic Features for Evaluation of Machine Translation. In <i>Proceedings of the Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization at the Association for Computational Linguistics Conference 2005,</i> Ann Arbor, Michigan.</p>
<p align="justify">Karolina Owczarzak, Josef van Genabith, and Andy Way.   2007.  Labelled Dependencies in Machine Translation Evaluation.  In <i>Proceedings of the Second Workshop on Statistical Machine Translation</i>, pages 104-111, Prague, June 2007.</p>
]]></content:encoded>
</item>

</channel>
</rss>
