Bilingual Terminology Mining - IIT Bombay

Guide : Prof. Pushpak Bhattacharyya
Bilingual Terminology Mining
By: Munish Minia (07D05016), Priyank Sharma (07D05017)
Introduction
Multilingual Terminology Mining Chain
  Term Extraction
  Term Alignment
    Direct Context-Vector Method
    Translation of Lexical Units
Linguistic Resources
  Comparable Corpora
  Bilingual Dictionary
Conclusion
Text mining research generally adopts a “big is beautiful” approach
  Justified by the need for large amounts of data in order to use statistical or stochastic methods [1]
Hypothesis: in terminology mining, the quality of the corpus matters more than its quantity
The Web is used as a comparable corpus
The comparability of the corpus should be based not only on the domain or sub-domain, but also on the type of discourse
1: Manning and Schütze, 1999
Multilingual Terminology Mining Chain
Architecture:
[Diagram. Boxes: source language documents, target language documents, terminology extraction (one box per language), lexical alignment process, bilingual dictionary, terms to be translated, translated terms]
A comparable corpus is taken as input
Output is a list of single- and multi-word candidate terms along with their candidate translations
Processes involved:
  Term Extraction
  Term Alignment
    Direct Context-Vector Method
    Translation of lexical units
Term Extraction
The terminological units extracted are multi-word (MW) terms whose syntactic patterns, expressed using POS tags, correspond either to a canonical structure or to a variation structure
For French, the main patterns are:
  N N
  N Prep N
  N Adj
For Japanese, the main patterns are:
  N N
  N Suff
  Adj N
  Pref N
Variants handled are:
  Morphological, for both French and Japanese
  Syntactical, for French
  Compounding, for Japanese
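The pattern-based extraction above can be sketched in a few lines. This is only a minimal illustration, not the extractor used in the work presented: it assumes the text is already POS-tagged, it uses a hypothetical tag set (N, Prep, Adj, Det, V), it covers only the French canonical patterns, and it ignores variants. The function name extract_candidates and the sample sentence are made up for the example.

```python
# Hypothetical tag set: N (noun), Prep (preposition), Adj (adjective).
# Only the French canonical patterns from the slide are listed; variants are ignored.
FRENCH_PATTERNS = [("N", "Prep", "N"), ("N", "N"), ("N", "Adj")]


def extract_candidates(tagged, patterns=FRENCH_PATTERNS):
    """Return every token span whose POS-tag sequence equals one of the patterns."""
    candidates = []
    for start in range(len(tagged)):
        for pattern in patterns:
            window = tagged[start:start + len(pattern)]
            if tuple(tag for _, tag in window) == pattern:
                candidates.append(" ".join(token for token, _ in window))
    return candidates


# Toy, hand-tagged sentence (illustrative data only).
sentence = [
    ("la", "Det"), ("sécrétion", "N"), ("d'", "Prep"), ("insuline", "N"),
    ("est", "V"), ("une", "Det"), ("fatigue", "N"), ("chronique", "Adj"),
]
candidates = extract_candidates(sentence)
```

On the toy sentence, the N Prep N pattern picks up the span starting at "sécrétion" and the N Adj pattern picks up "fatigue chronique".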
Variants handled in French
Morphological variant: morphological modification of one of the components of the base form
Syntactical variant: insertion of another word into the components of the base form
Compounding variant: agglutination of another word to one of the components of the base form
Example: sécrétion d’insuline (insulin secretion)
  Base form: N Prep N pattern
  Morphological variant: sécrétions d’insuline (insulin secretions)
  Syntactic variant: sécrétion pancréatique d’insuline (pancreatic insulin secretion)
  Syntactic variant: sécrétions de peptide et d’insuline (insulin and peptide secretion)
Variants handled in Japanese
Example: the MWT (insulin secretion) appears in the following forms:
  Base form: N N pattern:
  Compounding variant: agglutination of a word at the end of the base form: (insulin secretion ability)
Term Alignment
Aligns source multi-word terms (MWTs) with target single-word terms (SWTs) or MWTs
Direct context-vector method:
  Collect all lexical units in the context of each lexical unit i, within a window of n words around i
  For each lexical unit of the source and target languages, obtain a context vector Vi, which gathers the set of co-occurring units j together with the number of times that j and i occur together
  Normalize the context vectors using an association measure:
    Mutual information
    Log-likelihood
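The two steps above (collecting co-occurrences in a window, then normalizing) can be sketched as follows. This is a minimal sketch under stated assumptions: the corpus is a flat list of tokens, the window is symmetric, and the normalization shown is pointwise mutual information estimated from the co-occurrence counts themselves. The log-likelihood alternative is not shown, and the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def context_vectors(tokens, n=2):
    """Count, for each unit i, the units j seen within n words on either side of i."""
    vectors = defaultdict(Counter)
    for pos, unit in enumerate(tokens):
        lo, hi = max(0, pos - n), min(len(tokens), pos + n + 1)
        for ctx_pos in range(lo, hi):
            if ctx_pos != pos:
                vectors[unit][tokens[ctx_pos]] += 1
    return vectors


def pmi_normalize(vectors):
    """Replace raw co-occurrence counts with pointwise mutual information."""
    total = sum(sum(v.values()) for v in vectors.values())
    marginal = {i: sum(v.values()) for i, v in vectors.items()}
    return {
        i: {j: math.log(c * total / (marginal[i] * marginal[j])) for j, c in v.items()}
        for i, v in vectors.items()
    }


# Toy corpus (illustrative data only), window of one word on each side.
tokens = "insulin secretion rises when insulin secretion ability improves".split()
vectors = context_vectors(tokens, n=1)
pmi = pmi_normalize(vectors)
```

Because the window is symmetric, every context unit j also has its own vector, so the marginal counts needed by the normalization are always available.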
Term Alignment: Direct Context-Vector Method
Using the bilingual dictionary, translate the lexical units of the source context vector
For a word to be translated, compute the similarity between the translated context vector and all target vectors using a vector distance
The candidate translations of a lexical unit are the target lexical units closest to the translated context vector according to the vector distance
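A minimal sketch of this transfer-and-compare step is given below. The slides do not say which vector distance is used, so cosine similarity is an assumption made for the example. The dictionary, the vectors, and the function names are likewise hypothetical toy data, and context units missing from the dictionary are simply dropped.

```python
import math


def translate_vector(vector, dictionary):
    """Map each source context unit to its translations; unknown units are dropped."""
    translated = {}
    for unit, weight in vector.items():
        for target in dictionary.get(unit, []):
            translated[target] = translated.get(target, 0.0) + weight
    return translated


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


def rank_candidates(source_vector, target_vectors, dictionary, top=3):
    """Return the target units whose vectors are closest to the translated source vector."""
    translated = translate_vector(source_vector, dictionary)
    scored = [(cosine(translated, vec), unit) for unit, vec in target_vectors.items()]
    return [unit for _, unit in sorted(scored, reverse=True)[:top]]


# Toy data (illustrative only): context vector of the French word "insuline".
source_vector = {"sécrétion": 3.0, "pancréas": 2.0}
dictionary = {"sécrétion": ["secretion"], "pancréas": ["pancreas"]}
target_vectors = {
    "insulin": {"secretion": 4.0, "pancreas": 1.0},
    "fatigue": {"chronic": 5.0},
}
ranked = rank_candidates(source_vector, target_vectors, dictionary)
```

Here the translated vector shares both of its dimensions with the vector of "insulin" and none with that of "fatigue", so "insulin" is ranked first.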
Term Alignment: Translation
Translation of lexical units
  Depends on the coverage of the bilingual dictionary
  If the bilingual dictionary provides several translations for a lexical unit, consider all of them, but weight the different translations by their frequency in the target language
  For an MW, possible translations are generated using a compositional method
  If it is not possible to translate all the components of an MW, the MWT is not taken into account in the translation process
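The frequency weighting of multiple translations can be illustrated as below. The slides only say that translations are weighted by their target-language frequency; normalizing the weights so that they sum to 1 is an assumption made here, and the dictionary entries and frequency counts are invented toy values.

```python
def weighted_translations(unit, dictionary, target_freq):
    """Return {translation: weight}, with weights proportional to target-language frequency."""
    translations = dictionary.get(unit, [])
    total = sum(target_freq.get(t, 0) for t in translations)
    if total == 0:
        # Unit not covered by the dictionary, or none of its translations was observed.
        return {}
    return {t: target_freq.get(t, 0) / total for t in translations}


# Toy data (illustrative only).
dictionary = {"fatigue": ["fatigue", "tiredness", "weariness"]}
target_freq = {"fatigue": 60, "tiredness": 30, "weariness": 10}
weights = weighted_translations("fatigue", dictionary, target_freq)
```

A unit that is missing from the dictionary yields an empty result, which reflects the dependence on dictionary coverage noted above.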
Composition methods for French and Japanese
For Japanese
  Fatigue chronique (chronic fatigue)
    For fatigue, four translations are possible:
    Two translations for chronique:
  We generate all combinations [2] of translated elements and select those that refer to an existing MWT in the target language
For French
  For a multi-word of length n [3], produce all the combinations of MW unit elements of length less than or equal to n
  Syndrome de fatigue chronique (chronic fatigue disease) yields the four possible combinations:
    [Syndrome de fatigue chronique]
    [Syndrome de fatigue] [chronique]
    [Syndrome] [fatigue chronique]
    [Syndrome] [fatigue] [chronique]
  A subpart of the MW is translated directly if it is present in the bilingual dictionary
  90% of the candidate terms provided by term extraction are composed of only two content words, so we limit ourselves to the 4th combination
2: Grefenstette, 1999
3: Robitaille et al., 2006
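Both composition steps can be sketched briefly. The first function enumerates the ways of cutting a multi-word into contiguous sub-units, working on content words only (an assumption: the preposition "de" is left out). The second generates combinations of translated elements and keeps those attested in a list of target terms. Trying every word order is also an assumption added for this English toy example, and all data and function names are hypothetical.

```python
from itertools import permutations, product


def segmentations(words):
    """Yield every way of cutting the word list into contiguous sub-units."""
    if not words:
        yield []
        return
    for cut in range(len(words), 0, -1):
        head = " ".join(words[:cut])
        for rest in segmentations(words[cut:]):
            yield [head] + rest


def compose_translations(units, dictionary, target_terms):
    """Combine the translations of each unit and keep combinations attested as target terms."""
    options = [dictionary.get(unit, []) for unit in units]
    found = set()
    # If any unit has no translation, product() is empty and the term is skipped.
    for combo in product(*options):
        for ordering in permutations(combo):  # assumption: word order may differ
            candidate = " ".join(ordering)
            if candidate in target_terms:
                found.add(candidate)
    return sorted(found)


# Toy data (illustrative only).
combos = list(segmentations(["syndrome", "fatigue", "chronique"]))
dictionary = {"fatigue": ["fatigue", "tiredness"], "chronique": ["chronic", "chronicle"]}
target_terms = {"chronic fatigue", "insulin secretion"}
translations = compose_translations(["fatigue", "chronique"], dictionary, target_terms)
```

For the three content words of the example, the enumeration produces the same four combinations as listed above, in the same order.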
Comparable Corpora
Sets of texts in different languages that are not translations of each other
They share some characteristics or features: topic, period, media, author, discourse
One of the clearest examples is ICE, the International Corpus of English (Greenbaum 1991): corpora of around one million words in each of many varieties of English around the world
Bilingual Dictionary
A bilingual dictionary, or translation dictionary, is a specialized dictionary used to translate words or phrases from one language to another
Bilingual dictionaries can be:
  Unidirectional, meaning that they list the meanings of words of one language in another
  Bidirectional, allowing translation to and from both languages
Conclusion
The more frequent a term and its translation, the better the quality of the alignment
The discourse categorization of documents allows lexical acquisition to increase precision
Taking discourse into account yields candidate translations of better quality, even if the corpus size is reduced by half
  This reduction gives rise to a data sparsity problem
  The data sparsity problem can be partially solved by using comparable corpora of high quality
References
Emmanuel Morin, Béatrice Daille, Koichi Takeuchi, and Kyo Kageura. 2007. Bilingual Terminology Mining - Using Brain, not Brawn Comparable Corpora.
X. Saralegi, I. San Vicente, and A. Gurrutxaga. Automatic Extraction of Bilingual Terms from Comparable Corpora in a Popular Science Domain.
Thank You





