The exponential growth of the Web is resulting in vast amounts of online content. However, the information expressed therein is not at easy reach: what we typically browse is only an infinitesimal part of the Web. And even if we had time to read all the Web we could not understand it, as most of it is written in languages we do not speak.

Computers, instead, have the power to process the entire Web. But, in order to ”read” it, that is perform machine reading, they still have to face the hard problem of Natural Language Understanding, i.e., automatically making sense of human language. To tackle this long-lasting challenge in Natural Language Processing (NLP), the task of semantic parsing has recently gained popularity. This aims at creating structured representations of meaning for an input text. However, current semantic parsers require supervision, binding them to the language of interest and hindering their extension to multiple languages.



MOUSSE proposes a research program to investigate radically new directions for enabling multilingual semantic parsing, without the heavy requirement of annotating training data for each new language. The key intuitions of our proposal are treating multilinguality as a resource rather than an obstacle and embracing the knowledge-based paradigm which allows supervision in the machine learning sense to be replaced with efficacious use of lexical knowledge resources.

In stage 1 of the project we will acquire a huge network of language-independent, structured semantic representations of sentences. In stage 2, we will leverage this resource to develop innovative algorithms that perform semantic parsing in any language. These two stages are mutually beneficial, progressively enriching less-resourced languages and contributing towards leveling the playing field for all languages.

Contract. no 726487

Projects

InVeRo
A platform created with the aim of making Semantic Role Labeling more accessible to a wider audience: with InVeRo, users can easily annotate sentences with intelligible verbs and roles.
website
SyntagNet
A manually-curated large-scale lexical-semantic combination database which associates pairs of concepts with pairs of co-occurring words, hence capturing sense distinctions evoked by syntagmatic relations. The database currently covers 78,000 noun-verb and noun-noun lexical combinations, with 88,019 semantic combinations linking 20,626 WordNet 3.0 unique synsets with a relation edge.
Sense Distribution Learning: EnDI and DaD
Two knowledge-based approaches for learning sense distributions from raw text data. Both approaches proved to attain state-of-the-art results in predicting the Most Frequent Sense of a word and to effectively scale to different languages.
EuroSense
Description: A multilingual sense-annotated resource, automatically built via the joint disambiguation of the Europarl parallel corpus in 21 languages, with almost 123 million sense annotations for over 155 thousand distinct concepts and entities, drawn from the multilingual sense inventory of BabelNet.
VerbAtlas
A novel large-scale manually-crafted semantic resource for wide-coverage, intelligible and scalable Semantic Role Labeling. Its goal is to manually cluster WordNet synsets that share similar semantics into a set of semantically-coherent frames.
CSI
A novel coarse-grained sense inventory of 45 labels shared across lemmas and parts-of-speech. CSI labels are highly descriptive, allowing humans to easily annotate data. Moreover, when used as sense inventory for WSD, CSI leads a supervised model to reach great performances without making the disambiguation task trivial.
Embedding Words and Senses Together via Joint Knowledge-Enhanced Training
A model which exploits large corpora and knowledge from semantic networks in order to produce a unified vector space of word and sense embeddings.
SenseDefs
A large-scale high-quality corpus of disambiguated definitions in multiple languages, comprising sense annotations of both concepts and named entities from a wide-coverage unified sense inventory.
OneSeC
A language-independent method for automatically producing multilingual sense-annotated datasets on a large scale by leveraging Wikipedia's inner structure.
QBERT
A Transformer-based architecture for contextualized embeddings which makes use of a co-attentive layer to produce more deeply bidirectional representations, better-fitting for the WSD task. As a result, a WSD system trained with QBERT beats the state of the art.
Neural Sequence Learning Models for Word Sense Disambiguation
An in-depth study on end-to-end neural architectures tailored to the WSD task, from bidirectional Long Short-Term Memory to encoder-decoder models.
SensEmBERT
A knowledge-based approach for producing sense embeddings in multiple languages that lie in a space comparable with that of BERT contextualized word representations.
LSTMEmbed: Learning Word and Sense Representations from a LargeSemantically Annotated Corpus with Long Short-Term Memories
A study on the capabilities of bidirectional LSTM models to learn representations of word senses from semantically annotated corpora.
Train-O-Matic
A knowledge-based approach for producing large amount of sense-annotated corpora in virtually more than 200 languages. Train-O-Matic paved the way to supervised Word Sense Disambiguation in languages other than English where manually-annotated data are not available.

People

Rocco Tripodi
Simone Conia
Andrea Di Fabio
Najla Kalach
Federico Martelli
Giuliano Panzironi
Martina Piromalli
Valentina Pyatkin
Gabriele Tola
Alessandro Zinnai