The exponential growth of the Web is resulting in vast amounts of online content. However, the information expressed therein is not at easy reach: what we typically browse is only an infinitesimal part of the Web. And even if we had time to read all the Web we could not understand it, as most of it is written in languages we do not speak.

Computers, instead, have the power to process the entire Web. But, in order to ”read” it, that is perform machine reading, they still have to face the hard problem of Natural Language Understanding, i.e., automatically making sense of human language. To tackle this long-lasting challenge in Natural Language Processing (NLP), the task of semantic parsing has recently gained popularity. This aims at creating structured representations of meaning for an input text. However, current semantic parsers require supervision, binding them to the language of interest and hindering their extension to multiple languages.

MOUSSE proposes a research program to investigate radically new directions for enabling multilingual semantic parsing, without the heavy requirement of annotating training data for each new language. The key intuitions of our proposal are treating multilinguality as a resource rather than an obstacle and embracing the knowledge-based paradigm which allows supervision in the machine learning sense to be replaced with efficacious use of lexical knowledge resources.

In stage 1 of the project we will acquire a huge network of language-independent, structured semantic representations of sentences. In stage 2, we will leverage this resource to develop innovative algorithms that perform semantic parsing in any language. These two stages are mutually beneficial, progressively enriching less-resourced languages and contributing towards leveling the playing field for all languages.

Contract. no 726487


Exemplification Modeling: Can You Give Me an Example, Please?
Starting from (word, definition) pairs, we present a neural architecture capable of automatically generating usage examples for the word according to the requested semantics. It is possible to create high-quality sense-tagged data which cover the full range of meanings in any inventory of interest, and their interactions within sentences. The use of generated data as training corpus for Word Sense Disambiguation enables outperforming the current state of the art.
The first automatically-built, large-scale resource for lexical substitution. Through an automated approach, we are finally able to extract substitutes for words in context, building a large-scale dataset that allows simple models to be finetuned on lexical substitution, achieving results that compete with complex state-of-the-art models.
SGL: Speaking the Graph Languages of Semantic Parsing via Multilingual Translation
A state-of-the-art approach to cross-framework and cross-lingual semantic parsing, where we frame the task as multilingual NMT. This pushes the overall performances further up thanks to transfer learning and, besides, enables the usage of a single shared model.
A cross-lingual large-scale evaluation benchmark for the WSD task featuring sense-annotated development and test sets in 18 languages from six different linguistic families, together with language-specific silver training data.
A neural seq2seq model which contextualizes a target expression in a sentence by generating an ad hoc definition. The work is a unified approach to computational lexical-semantic tasks, encompassing state-of-the-art Word Sense Disambiguation, Definition Modeling and Word-in-Context.
A Survey on Multilingual Sense-Annotated Corpora for Word Sense Disambiguation
A survey picturing the main challenges in the field of multilingual Word Sense Disambiguation and highlighting the most important efforts in mitigating the knowledge-acquisition bottleneck problem.
A multilingual Word Sense Disambiguation system powered by SyntagNet. Designed to be fast, reliable, and easily accessible, SyntagRank allows the automatic labeling of concepts and named entities within the input sentence by exploiting the syntagmatic relations between them.
A manually-curated large-scale lexical-semantic combination database which associates pairs of concepts with pairs of co-occurring words, hence capturing sense distinctions evoked by syntagmatic relations. The database currently covers 78,000 noun-verb and noun-noun lexical combinations, with 88,019 semantic combinations linking 20,626 WordNet 3.0 unique synsets with a relation edge.
Sense Distribution Learning: EnDI and DaD
Two knowledge-based approaches for learning sense distributions from raw text data. Both approaches proved to attain state-of-the-art results in predicting the Most Frequent Sense of a word and to effectively scale to different languages.
Description: A multilingual sense-annotated resource, automatically built via the joint disambiguation of the Europarl parallel corpus in 21 languages, with almost 123 million sense annotations for over 155 thousand distinct concepts and entities, drawn from the multilingual sense inventory of BabelNet.
MultiMirror: Neural Cross-lingual Word Alignment for Multilingual Word Sense Disambiguation
Sense projection approach for multilingual WSD. Based upon a novel neural model for word alignment, MultiMirror automatically generates sense-annotated datasets in multiple languages that lead a simple mBERT-powered classifier to surpass the previous state of the art on standard benchmarks in Multilingual WSD.
Generating Senses and RoLes: An End-to-End Model for Dependency- and Span-based Semantic Role Labeling
A state-of-the-art approach for end-to-end Semantic Role Labeling based on joint generation of senses and roles that rivals the long-standing best-performing sequence labeling approaches.
Unifying Cross-Lingual Semantic Role Labeling with Heterogeneous Linguistic Resources
A state-of-the-art approach to exploit heterogeneous data to perform cross-lingual Semantic Role Labeling over 100 languages with 6 different inventories. Recipient of the NAACL 2021 outstanding paper award.
A simple, seq2seq symmetric Text-to-AMR parsing / AMR-to-Text generation approach which exploits a pre-trained encoder-decoder and achieves state-of-the-art performance in both tasks.
A cross-lingual AMR parser that exploits the existing training data in English to transfer semantic representations across languages. The achieved results shed light on the applicability of AMR as an interlingua and set the state of the art in Chinese, German, Italian and Spanish cross-lingual AMR parsing. Furthermore, a detailed qualitative analysis shows that the proposed parser can overcome common translation divergences among languages.
A methodology to create effective multimodal sense embeddings starting from BabelPic. We prove such embeddings to improve the performance of a Word Sense Disambiguation architecture over a strong unimodal baseline.
CluBERT is a multilingual approach to inducing the distributions of word senses from a corpus of raw sentences by clustering words' occurrences according to their contextual representation.
A novel coarse-grained sense inventory of 45 labels shared across lemmas and parts-of-speech. CSI labels are highly descriptive, allowing humans to easily annotate data. Moreover, when used as sense inventory for WSD, CSI leads a supervised model to reach great performances without making the disambiguation task trivial.
Embedding Words and Senses Together via Joint Knowledge-Enhanced Training
A model which exploits large corpora and knowledge from semantic networks in order to produce a unified vector space of word and sense embeddings.
A large-scale high-quality corpus of disambiguated definitions in multiple languages, comprising sense annotations of both concepts and named entities from a wide-coverage unified sense inventory.
Ten Years of BabelNet: A Survey
BabelNet is now ten years old. In this timeframe it has been functioning as a repository of knowledge in hundreds of different languages. In this survey we document several applications enabled by BabelNet as well as discuss the most fruitful future development directions for the NLP and AI communities.
ESC: Redesigning WSD with Extractive Sense Comprehension
A redesigned approach to Word Sense Disambiguation through Extractive Reading Comprehension. Our system, ESCHER, achieves unprecedented performances on a number of different benchmarks and settings.
Bridging the Gap in Multilingual Semantic Role Labeling: a Language-Agnostic Approach
A fully language-agnostic SRL model that does away with morphological and syntactic features to achieve robustness across languages. Our approach outperforms the current state of the art in 6 languages, especially whenever a scarce amount of training data is available.
A semi-supervised approach for producing contextualized multilingual sense representations for all the concepts in a language vocabulary. ARES’ embeddings achieve state-of-the-art results on the English and multilingual Word Sense Disambiguation task, and competitive results in the Word-in-Context task.
MuLaN (Multilingual Label propagatioN) is a label propagation technique tailored to Word Sense Disambiguation and capable of automatically producing sense-tagged training datasets in multiple languages, jointly leveraging contextualized word embeddings and the multilingual information enclosed in knowledge bases.
A neural supervised Word Sense Disambiguation system that is able to incorporate both synset embeddings and the WordNet graph. State-of-the-art results, going for the first time beyond the 80% performance, in English, French, German, Italian and Spanish benchmarks!
A platform created with the aim of making Semantic Role Labeling more accessible to a wider audience: with InVeRo, users can easily annotate sentences with intelligible verbs and roles.
A language-independent method for automatically producing multilingual sense-annotated datasets on a large scale by leveraging Wikipedia's inner structure.
A Transformer-based architecture for contextualized embeddings which makes use of a co-attentive layer to produce more deeply bidirectional representations, better-fitting for the WSD task. As a result, a WSD system trained with QBERT beats the state of the art.
Neural Sequence Learning Models for Word Sense Disambiguation
An in-depth study on end-to-end neural architectures tailored to the WSD task, from bidirectional Long Short-Term Memory to encoder-decoder models.
Recent Trends in Word Sense Disambiguation: A Survey
An up-to-date survey on what’s in modern Word Sense Disambiguation, covering training and test data, as well as automatic classification systems with a focus on the recent trend of hybridization between supervised and knowledge-based algorithms.
Framing Word Sense Disambiguation as a Multi-Label Problem for Model-Agnostic Knowledge Integration
A simple approach to take into account multiple sense annotations for a target word in context. Framing Word Sense Disambiguation as a multi-label classification problem also provides an effective way to seamlessly integrate relational knowledge into a model.
Language-independent vector representations of concepts which place multilinguality at their core while retaining explicit relationships between concepts. Conception achieves state-of-the-art performance in multilingual and cross-lingual word similarity and English Word Sense Disambiguation.
A large multilingual benchmark for the Word in Context task, featuring gold standards in 12 new languages from varied language families and with different degrees of resource availability. XL-WiC opens room for evaluating the lexical-semantic capabilities of neural models on different scenarios such as zero-shot and cross-lingual transfer.
The first multimodal dataset with a focus on non-concrete nominal and verbal concepts which is also linked to WordNet and BabelNet. BabelPic is enhanced with a methodology for the automatic extension of its coverage to any BabelNet synsets.
A novel large-scale manually-crafted semantic resource for wide-coverage, intelligible and scalable Semantic Role Labeling. Its goal is to manually cluster WordNet synsets that share similar semantics into a set of semantically-coherent frames.
A knowledge-based approach for producing sense embeddings in multiple languages that lie in a space comparable with that of BERT contextualized word representations.
LSTMEmbed: Learning Word and Sense Representations from a LargeSemantically Annotated Corpus with Long Short-Term Memories
A study on the capabilities of bidirectional LSTM models to learn representations of word senses from semantically annotated corpora.
A knowledge-based approach for producing large amount of sense-annotated corpora in virtually more than 200 languages. Train-O-Matic paved the way to supervised Word Sense Disambiguation in languages other than English where manually-annotated data are not available.


Rocco Tripodi
Simone Conia
Andrea Di Fabio
Najla Kalach
Federico Martelli
Giuliano Panzironi
Martina Piromalli
Valentina Pyatkin
Gabriele Tola
Alessandro Zinnai