WP Leader: University of Edinburgh (UEDIN)
The aim of this work package is to develop machine translation systems optimized for the translation of MOOC text types for all language pairs addressed by the project. Given the diversity of languages, each system will use model components that target the specific problems that arise: syntactic divergence, differences in morphological richness, use of different writing systems (Latin, Cyrillic, etc.), and the amount of available training data. The need to develop systems for different text types and the lack of in-domain training data pose significant difficulties that will be addressed.
WP4 is divided into three tasks:
Task 4.1: Development of phrase-based MT systems (UEDIN) (M1-M36)
The canonical phrase-based model provides a robust basis for the development of statistical machine translation systems for all language pairs addressed by the project. These models will be extended to take advantage of additional data and tools to address the specific linguistic problems of each language (e.g., rich morphology or a different script).
Task 4.2: Development of syntax-based MT systems (UEDIN, UBER) (M1-M36)
For language pairs with the required resources (especially syntactic parsers), syntax-based machine translation systems will be developed. The main focus here is on German, Czech and Chinese. We expect that this task will benefit from parallel research and development in related projects.
Task 4.3: Adaptation of machine translation systems to MOOC text types (UEDIN) (M13-M36)
Since the majority of the training data will come from out-of-domain sources, and the text types in MOOCs (lecture transcripts, supporting material, discussion forums) present unique challenges, significant work is required to optimize the machine translation systems to the requirements of the project. This work includes normalisation of training data, data selection and weighting, and domain and topic models.
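As an illustration of what data selection could look like in practice, the following is a minimal sketch that ranks out-of-domain sentences by cross-entropy difference between an in-domain and an out-of-domain language model. This is one common approach, not necessarily the method the project will adopt. The add-one smoothed unigram models, the function names and the toy sentences are all assumptions made for the example; a real system would use stronger language models and much larger corpora.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Return (token counts, total token count) for whitespace-tokenised text."""
    counts = Counter(tok for s in sentences for tok in s.split())
    return counts, sum(counts.values())


def cross_entropy(sentence, model, vocab_size):
    """Per-token cross-entropy (in nats) under an add-one smoothed unigram model."""
    counts, total = model
    tokens = sentence.split()
    if not tokens:
        return 0.0
    log_prob = sum(
        math.log((counts[t] + 1) / (total + vocab_size)) for t in tokens
    )
    return -log_prob / len(tokens)


def select_in_domain_like(pool, in_domain, out_domain, keep):
    """Keep the `keep` pool sentences that look most like the in-domain sample.

    Sentences are ranked by H_in(s) - H_out(s); lower means more in-domain-like.
    """
    in_model = train_unigram(in_domain)
    out_model = train_unigram(out_domain)
    # Shared vocabulary size so both models are smoothed over the same event space.
    vocab = set(in_model[0]) | set(out_model[0])
    vocab |= {t for s in pool for t in s.split()}
    v = len(vocab)
    ranked = sorted(
        pool,
        key=lambda s: cross_entropy(s, in_model, v) - cross_entropy(s, out_model, v),
    )
    return ranked[:keep]


# Hypothetical toy data: a small MOOC-like sample and a generic sample.
in_domain = [
    "the lecture covers gradient descent",
    "this lecture explains the homework",
]
out_domain = [
    "the parliament adopted the resolution",
    "the council voted on the budget",
]
pool = [
    "the parliament voted on the resolution",
    "the lecture explains gradient descent",
]
selected = select_in_domain_like(pool, in_domain, out_domain, keep=1)
```

In this toy setting the lecture-style sentence is ranked ahead of the parliamentary one. The same score could also serve as a sentence weight rather than a hard selection threshold.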