WP Leader: Humboldt-Universität zu Berlin (UBER)
The overall goal of this work package is to collect and linguistically process the (parallel) data on which the MT systems developed in WP 4 will be trained and tested. The partners are responsible for the following language pairs:
- UBER: English-German, English-Portuguese, English-Bulgarian
- DCU: English-Croatian, English-Chinese, English-Polish, English-Italian
- UEDIN: English-Czech, English-Russian
- IURC: English-Greek
- University of Tilburg: English-Dutch
The partners will gather and linguistically process (POS tagging, chunking, dependency parsing, etc.) as much (parallel) data as possible. We expect most of these data to be out-of-domain with respect to the MOOCs targeted in the project. Since this may negatively affect MT performance, resource bootstrapping and parallel corpus development via crowdsourcing will also be carried out. All data obtained will be converted to a uniform format, which will facilitate the development of MT systems in WP 4.
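As an illustration only, one possible shape for such a uniform format is a single JSON record per line for each aligned sentence pair, carrying the linguistic annotations alongside the tokens. The JSON Lines encoding, all field names (`pair_id`, `domain`, `chunk`, etc.) and the helper functions below are assumptions made for this sketch; the work package does not prescribe them.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class Token:
    """One annotated token (hypothetical field set)."""
    form: str          # surface form
    pos: str = "_"     # POS tag, "_" when not annotated
    chunk: str = "_"   # chunk label, "_" when not annotated
    head: int = 0      # 1-based index of the dependency head, 0 = root
    deprel: str = "_"  # dependency relation, "_" when not annotated


@dataclass
class SentencePair:
    """One aligned source/target sentence pair (hypothetical field set)."""
    pair_id: str
    src_lang: str
    tgt_lang: str
    src: List[Token]
    tgt: List[Token]
    domain: str = "out-of-domain"  # e.g. "mooc" for in-domain data


def to_record(pair: SentencePair) -> str:
    """Serialise a pair to a single JSON line."""
    return json.dumps(asdict(pair), ensure_ascii=False, sort_keys=True)


def from_record(line: str) -> SentencePair:
    """Rebuild a pair from a JSON line written by to_record."""
    raw = json.loads(line)
    raw["src"] = [Token(**t) for t in raw["src"]]
    raw["tgt"] = [Token(**t) for t in raw["tgt"]]
    return SentencePair(**raw)
```

With a layout along these lines, data from every partner and language pair could be read by the same loader in WP 4, and unannotated fields would simply keep their placeholder values.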