MOUSSE: Multilingual, Open-text Unified Syntax-independent SEmantics

Events

Workshop on Ten Years of BabelNet and Multilingual Neurosymbolic Natural Language Understanding

4-5^th JULY 2022

The event aims at disseminating research, fostering discussion as well as encouraging scientific contamination across perspectives of neuro-symbolic processing of language.
We will also take this opportunity to celebrate 10 years of BabelNet.
The workshop is co-organized by Sapienza, Babelscape and Accademia della Crusca. It is co-funded by the ERC, the European Language Grid and the AI Journal.

See the full program here

About

The exponential growth of the Web is resulting in vast amounts of online content. However, the information expressed therein is not at easy reach: what we typically browse is only an infinitesimal part of the Web. And even if we had time to read all the Web we could not understand it, as most of it is written in languages we do not speak.

Computers, instead, have the power to process the entire Web. But, in order to ”read” it, that is perform machine reading, they still have to face the hard problem of Natural Language Understanding, i.e., automatically making sense of human language. To tackle this long-lasting challenge in Natural Language Processing (NLP), the task of semantic parsing has recently gained popularity. This aims at creating structured representations of meaning for an input text. However, current semantic parsers require supervision, binding them to the language of interest and hindering their extension to multiple languages.

MOUSSE proposes a research program to investigate radically new directions for enabling multilingual semantic parsing, without the heavy requirement of annotating training data for each new language. The key intuitions of our proposal are treating multilinguality as a resource rather than an obstacle and embracing the knowledge-based paradigm which allows supervision in the machine learning sense to be replaced with efficacious use of lexical knowledge resources.

In stage 1 of the project we will acquire a huge network of language-independent, structured semantic representations of sentences. In stage 2, we will leverage this resource to develop innovative algorithms that perform semantic parsing in any language. These two stages are mutually beneficial, progressively enriching less-resourced languages and contributing towards leveling the playing field for all languages.

Contract. no 726487

Projects

Biases in Large Language Models

It discusses data selection bias and various forms of social bias present in the text generated by these models. The study explores bias related to gender, age, sexual orientation, ethnicity, religion, and culture. The paper concludes by suggesting approaches to measure, reduce, and address these biases.

Bias

Language Models

paper

MultiNERD

This paper presents a methodology for automatically generating a comprehensive multilingual NER dataset. The dataset addresses 10 languages, 15 NER categories, and 2 textual genres. It includes disambiguation information for multilingual entity linking and image URLs for multimodal systems.

Named Entity Recognition

Entity Linking

paper

website

MATESE

We propose MATESE metrics for automatic machine translation evaluation, leveraging the Multidimensional Quality Metrics (MQM) framework. MATESE reframes evaluation as a sequence tagging problem, achieving high correlation with human judgments and interpretability. MATESE-QE enables reference-free evaluation in challenging settings.

Machine Translation

Evaluation of machine translation

paper

Probing LMs for Predicate Argument Structures

This study investigates the ability of current pretrained language models to capture predicate-argument structures, providing insights into how we can take advantage of such knowledge to improve Semantic Role Labeling systems.

Semantic Role Labeling

Predicate Argument Structures

paper

website

GeneSis

GeneSis is the first generative approach to English lexical substitution. With this novel approach, we reach state-of-the-art results; moreover, we can effortlessly generate silver data for the task. We assess the quality of the generated resources both qualitatively and quantitatively, showing that the released datasets can help supervised models improve.

Lexical Substitution

Language Generation

paper

website

NER4EL

Currently available systems for Entity Linking often require pretraining on massive amounts of data in order to achieve state-of-the-art results. In this paper, we address this issue and present several ways to exploit Named Entity Recognition to narrow the gap between Entity Linking systems trained on large and small datasets.

Named Entity Recognition

Entity Linking

paper

website

SIR

Sense-enhanced Information Retrieval (SIR) brings Word Sense Disambiguation and Information Retrieval closer and provides additional semantic information for the query via sense definitions. This semantic information leads to improvements over a baseline that does not access semantics in multiple languages.

Information Retrieval

Word Sense Disambiguation

Sense-enhanced Information Retrieval

paper

website

Exemplification Modeling: Can You Give Me an Example, Please?

Starting from (word, definition) pairs, we present a neural architecture capable of automatically generating usage examples for the word according to the requested semantics. It is possible to create high-quality sense-tagged data which cover the full range of meanings in any inventory of interest, and their interactions within sentences. The use of generated data as training corpus for Word Sense Disambiguation enables outperforming the current state of the art.

Exemplification Modeling

Natural Language Generation

paper

website

ALaSca

The first automatically-built, large-scale resource for lexical substitution. Through an automated approach, we are finally able to extract substitutes for words in context, building a large-scale dataset that allows simple models to be finetuned on lexical substitution, achieving results that compete with complex state-of-the-art models.

Lexical Substitution

Dataset

paper

website

SGL: Speaking the Graph Languages of Semantic Parsing via Multilingual Translation

A state-of-the-art approach to cross-framework and cross-lingual semantic parsing, where we frame the task as multilingual NMT. This pushes the overall performances further up thanks to transfer learning and, besides, enables the usage of a single shared model.

AMR Parsing

Cross-Lingual AMR Parsing

paper

website

XL-WSD

A cross-lingual large-scale evaluation benchmark for the WSD task featuring sense-annotated development and test sets in 18 languages from six different linguistic families, together with language-specific silver training data.

Word Sense Disambiguation

Multilinguality

Lexical Semantics

Evaluation

paper

website

Generationary

A neural seq2seq model which contextualizes a target expression in a sentence by generating an ad hoc definition. The work is a unified approach to computational lexical-semantic tasks, encompassing state-of-the-art Word Sense Disambiguation, Definition Modeling and Word-in-Context.

Natural Language Generation

Word Sense Disambiguation

Definition Modeling

paper

website

A Survey on Multilingual Sense-Annotated Corpora for Word Sense Disambiguation

A survey picturing the main challenges in the field of multilingual Word Sense Disambiguation and highlighting the most important efforts in mitigating the knowledge-acquisition bottleneck problem.

Multilinguality

Word Sense Disambiguation

paper

SyntagRank

A multilingual Word Sense Disambiguation system powered by SyntagNet. Designed to be fast, reliable, and easily accessible, SyntagRank allows the automatic labeling of concepts and named entities within the input sentence by exploiting the syntagmatic relations between them.

Word Sense Disambiguation

paper

website

SyntagNet

A manually-curated large-scale lexical-semantic combination database which associates pairs of concepts with pairs of co-occurring words, hence capturing sense distinctions evoked by syntagmatic relations. The database currently covers 78,000 noun-verb and noun-noun lexical combinations, with 88,019 semantic combinations linking 20,626 WordNet 3.0 unique synsets with a relation edge.

Word Sense Disambiguation

Knowledge Base

paper

website

Sense Distribution Learning: EnDI and DaD

Two knowledge-based approaches for learning sense distributions from raw text data. Both approaches proved to attain state-of-the-art results in predicting the Most Frequent Sense of a word and to effectively scale to different languages.

Word Sense Disambiguation

Sense Distribution

paper

website

EuroSense

Description: A multilingual sense-annotated resource, automatically built via the joint disambiguation of the Europarl parallel corpus in 21 languages, with almost 123 million sense annotations for over 155 thousand distinct concepts and entities, drawn from the multilingual sense inventory of BabelNet.

Corpus

Word Sense Disambiguation

paper

website

BMR

BMR is a new language-independent formalism that abstracts away from language-specific constraints thanks to two multilingual semantic resources, BabelNet and VerbAtlas. To put our formalism into practice, we also created BMR 1.0, the first dataset labeled according to BMR.

Semantic Parsing

Cross-linguality

paper

website

Entity Disambiguation with Entity Definitions

This paper tackles the limitation of using only Wikipedia titles as textual representations in Entity Disambiguation (ED). It explores more expressive representations to improve disambiguation performance.

Entity Disambiguation

Entity Definitions

paper

website

EUREKA

EUREKA is an ensemble-based approach for automatic euphemism detection. EUREKA includes data correction, an expanded corpus (EuphAug), leveraging model representations of Potentially Euphemistic Terms (PETs), and utilizing semantically close sentences.

Euphemism

Ensemble learning

Data Augmentation

paper

AMuSE-WSD

AMuSE-WSD provides an easy way to disambiguate text in 40 languages thanks to its state-of-the-art multilingual neural model and its intuitive API. Take advantage of AMuSE-WSD to integrate sense knowledge into multilingual downstream applications!

Word Sense Disambiguation

Multilinguality

paper

website

UniteD-SRL

UniteD-SRL is a new benchmark for multilingual and cross-lingual Semantic Role Labeling. Differently from previous efforts, UniteD-SRL provides parallel gold-standard development and test sets annotated with a single inventory, VerbAtlas. UniteD-SRL is available in 4 languages: English, Chinese, French and Spanish.

Semantic Role Labeling

Multilinguality

Cross-Linguality

paper

website

SPRING

SPRING Online Services provide a Web interface and RESTful APIs for our state-of-the-art AMR parsing and generation system, SPRING (Symmetric PaRsIng aNd Generation). The Web interface has been developed to be easily used by the Natural Language Processing community, as well as by the general public. It provides, among other things, a highly interactive visualization platform and a feedback mechanism to obtain user suggestions for further improvements of the system’s output.

Semantic Parsing

Abstract Meaning Representation

Text Generation

paper

website

MultiMirror: Neural Cross-lingual Word Alignment\\for Multilingual Word Sense Disambiguation

Sense projection approach for multilingual WSD. Based upon a novel neural model for word alignment, MultiMirror automatically generates sense-annotated datasets in multiple languages that lead a simple mBERT-powered classifier to surpass the previous state of the art on standard benchmarks in Multilingual WSD.

Annotation Projection

Word Alignment

Multilingual WSD

paper

website

Generating Senses and RoLes: An End-to-End Model for Dependency- and Span-based Semantic Role Labeling

A state-of-the-art approach for end-to-end Semantic Role Labeling based on joint generation of senses and roles that rivals the long-standing best-performing sequence labeling approaches.

Semantic Role Labeling

Natural Language Generation

paper

website

Unifying Cross-Lingual Semantic Role Labeling with Heterogeneous Linguistic Resources

A state-of-the-art approach to exploit heterogeneous data to perform cross-lingual Semantic Role Labeling over 100 languages with 6 different inventories. Recipient of the NAACL 2021 outstanding paper award.

Semantic Role Labeling

Cross-Linguality

paper

website

SPRING

A simple, seq2seq symmetric Text-to-AMR parsing / AMR-to-Text generation approach which exploits a pre-trained encoder-decoder and achieves state-of-the-art performance in both tasks.

Natural Language Generation

Semantic Parsing

Abstract Meaning Representation

paper

website

XL-AMR

A cross-lingual AMR parser that exploits the existing training data in English to transfer semantic representations across languages. The achieved results shed light on the applicability of AMR as an interlingua and set the state of the art in Chinese, German, Italian and Spanish cross-lingual AMR parsing. Furthermore, a detailed qualitative analysis shows that the proposed parser can overcome common translation divergences among languages.

Multilinguality

Semantic Parsing

Transfer Learning

paper

website

EViLBERT

A methodology to create effective multimodal sense embeddings starting from BabelPic. We prove such embeddings to improve the performance of a Word Sense Disambiguation architecture over a strong unimodal baseline.

Multimodal Learning

Sense Embeddings

paper

website

CluBERT

CluBERT is a multilingual approach to inducing the distributions of word senses from a corpus of raw sentences by clustering words' occurrences according to their contextual representation.

Distribution Learning

Word Sense Disambiguation

paper

website

CSI

A novel coarse-grained sense inventory of 45 labels shared across lemmas and parts-of-speech. CSI labels are highly descriptive, allowing humans to easily annotate data. Moreover, when used as sense inventory for WSD, CSI leads a supervised model to reach great performances without making the disambiguation task trivial.

Word Sense Disambiguation

paper

website

Embedding Words and Senses Together via Joint Knowledge-Enhanced Training

A model which exploits large corpora and knowledge from semantic networks in order to produce a unified vector space of word and sense embeddings.

Word Sense Disambiguation

paper

SenseDefs

A large-scale high-quality corpus of disambiguated definitions in multiple languages, comprising sense annotations of both concepts and named entities from a wide-coverage unified sense inventory.

Corpus

Glosses

paper

website

Reducing Disambiguation Biases

This paper addresses the challenge of polysemous word disambiguation in Neural Machine Translation (NMT). It proposes a novel approach for creating high-precision sense-annotated parallel corpora and a tailored fine-tuning strategy that utilizes these annotations during training.

Machine Translation

paper

website

Nibbling at the hard core of WSD

An in-depth study on modern approaches and evaluations for Word Sense Disambiguation. As a result, we outline what is currently missing and put forward a novel benchmark to measure future progress in WSD.

Word Sense Disambiguation

paper

website

SRL4E

SRL4E is a unified evaluation framework focused on Semantic Role Labeling for Emotions, which unifies several datasets tagged with emotions and their semantic roles by using a common labeling scheme.

Semantic Role Labeling

Emotion Classification

Benchmark

paper

website

DiBiMT

DiBiMT is the first fully-manually curated benchmark for understanding and measuring Disambiguation Biases in Neural Machine Translation. Recipient of the Best Resource Paper award at ACL2022.

Machine Translation

Disambiguation Bias

paper

website

InVeRo-XL

InVeRo-XL is the first prepackaged end-to-end system for cross-lingual Semantic Role Labeling. You can use InVeRo-XL to get predicate sense and semantic role annotations in more than 40 languages and 7 predicate-argument structure inventories!

Semantic Role Labeling

Multilinguality

Cross-Linguality

paper

website

Integrating Personalized PageRank into Neural Word Sense Disambiguation

We improve EWISER (Bevilacqua and Navigli, 2020) by incorporating an online neural approximated PageRank. Our method exploits the global graph structure while keeping space requirements linear in the number of edges. We obtain strong improvements, matching the current state of the art.

Word Sense Disambiguation

paper

website

Ten Years of BabelNet: A Survey

BabelNet is now ten years old. In this timeframe it has been functioning as a repository of knowledge in hundreds of different languages. In this survey we document several applications enabled by BabelNet as well as discuss the most fruitful future development directions for the NLP and AI communities.

Knowledge Bases

Survey

Multilinguality

paper

ESC: Redesigning WSD with Extractive Sense Comprehension

A redesigned approach to Word Sense Disambiguation through Extractive Reading Comprehension. Our system, ESCHER, achieves unprecedented performances on a number of different benchmarks and settings.

Word Sense Disambiguation

paper

website

Bridging the Gap in Multilingual Semantic Role Labeling: a Language-Agnostic Approach

A fully language-agnostic SRL model that does away with morphological and syntactic features to achieve robustness across languages. Our approach outperforms the current state of the art in 6 languages, especially whenever a scarce amount of training data is available.

Multilinguality

Semantic Role Labeling

paper

website

ARES

A semi-supervised approach for producing contextualized multilingual sense representations for all the concepts in a language vocabulary. ARES’ embeddings achieve state-of-the-art results on the English and multilingual Word Sense Disambiguation task, and competitive results in the Word-in-Context task.

Sense Embeddings

Word Sense Disambiguation

paper

website

MuLaN

MuLaN (Multilingual Label propagatioN) is a label propagation technique tailored to Word Sense Disambiguation and capable of automatically producing sense-tagged training datasets in multiple languages, jointly leveraging contextualized word embeddings and the multilingual information enclosed in knowledge bases.

Multilinguality

Word Sense Disambiguation

paper

website

EWISER

A neural supervised Word Sense Disambiguation system that is able to incorporate both synset embeddings and the WordNet graph. State-of-the-art results, going for the first time beyond the 80% performance, in English, French, German, Italian and Spanish benchmarks!

Word Sense Disambiguation

Lexical Semantics

paper

website

InVeRo

A platform created with the aim of making Semantic Role Labeling more accessible to a wider audience: with InVeRo, users can easily annotate sentences with intelligible verbs and roles.

Semantic Role Labeling

paper

website

OneSeC

A language-independent method for automatically producing multilingual sense-annotated datasets on a large scale by leveraging Wikipedia's inner structure.

Word Sense Disambiguation

Dataset

paper ACL

paper LREC

website

QBERT

A Transformer-based architecture for contextualized embeddings which makes use of a co-attentive layer to produce more deeply bidirectional representations, better-fitting for the WSD task. As a result, a WSD system trained with QBERT beats the state of the art.

Word Sense Disambiguation

paper

website

Neural Sequence Learning Models for Word Sense Disambiguation

An in-depth study on end-to-end neural architectures tailored to the WSD task, from bidirectional Long Short-Term Memory to encoder-decoder models.

Word Sense Disambiguation

Neural Networks

paper

ID10M

This paper focuses on the automatic identification and understanding of idioms, a challenging task in Natural Language Understanding. It introduces a novel multilingual Transformer-based system for idiom identification, along with a high-quality training dataset in 10 languages and a manually curated evaluation benchmark.

Idiom Identification

paper

website

ExtEnD

ExtEnD is a reformulation as a text extraction problem of Entity Disambiguation, the task of linking a mention in context with its most suitable entity in a reference knowledge base.

Entity Disambiguation

paper

website

BMR

This paper introduces an innovative, fully semantic meaning representation that transcends existing formalisms. By abstracting text into meaning and incorporating language-independent concepts and semantic relations, we address key limitations and enable cross-modal connections in Natural Language Understanding. Our approach fosters multilingual communication across texts, images, videos, speech, sound, and logical formulas.

Semantic Parsing

Cross-linguality

paper

STEPS

STEPS is an end-to-end approach for free-form event process typing, understanding human intent by generating action-object sequences, outperforming previous methods.

Natural Language Understanding

Event Process Typing

Text Generation

paper

website

VDM

Visual Definition Modeling (VDM) enhances the Definition Modeling paradigm by grounding textual definitions to visual representations. This paper introduces DEMETER and DIONYSUS, two benchmarks that require models to generate textual definitions for target words or object patches given an image as context. We evaluate six different baselines, showing the complexity of the benchmarks and the effectiveness of a text-only encoder-decoder model.

Visual Modeling

Text Generation

paper

website

ConSeC

CONtinuous SEnse Comprehension (ConSeC) is a novel approach to Word Sense Disambiguation: leveraging an extractive re-framing of this task as a text extraction problem, we introduce a feedback loop strategy that allows the disambiguation of a target word to be conditioned not only on its context but also on the explicit senses assigned to nearby words. Using this novel approach, ConSeC sets a new state of the art both in English and Multilingual WSD!

Word Sense Disambiguation

Lexical Semantics

paper

website

WikiNEuRal: Combined Neural and Knowledge-based Silver Data Creation for Multilingual NER

We exploit the texts of Wikipedia and introduce a new methodology based on the effective combination of knowledge-based approaches and neural models, together with a novel domain adaptation technique, to produce high-quality training corpora for NER in multiple languages.

Named Entity Recognition

Multilinguality

paper

website

Recent Trends in Word Sense Disambiguation: A Survey

An up-to-date survey on what’s in modern Word Sense Disambiguation, covering training and test data, as well as automatic classification systems with a focus on the recent trend of hybridization between supervised and knowledge-based algorithms.

Word Sense Disambiguation

Survey

paper

Framing Word Sense Disambiguation as a Multi-Label Problem for Model-Agnostic Knowledge Integration

A simple approach to take into account multiple sense annotations for a target word in context. Framing Word Sense Disambiguation as a multi-label classification problem also provides an effective way to seamlessly integrate relational knowledge into a model.

Word Sense Disambiguation

Knowledge Integration

paper

website

Conception

Language-independent vector representations of concepts which place multilinguality at their core while retaining explicit relationships between concepts. Conception achieves state-of-the-art performance in multilingual and cross-lingual word similarity and English Word Sense Disambiguation.

Multilinguality

BabelNet

Meaning Representation

paper

website

XL-WiC

A large multilingual benchmark for the Word in Context task, featuring gold standards in 12 new languages from varied language families and with different degrees of resource availability. XL-WiC opens room for evaluating the lexical-semantic capabilities of neural models on different scenarios such as zero-shot and cross-lingual transfer.

Multilinguality

Word Sense Disambiguation

Evaluation

paper

BabelPic

The first multimodal dataset with a focus on non-concrete nominal and verbal concepts which is also linked to WordNet and BabelNet. BabelPic is enhanced with a methodology for the automatic extension of its coverage to any BabelNet synsets.

Multimodal Learning

paper

website

VerbAtlas

A novel large-scale manually-crafted semantic resource for wide-coverage, intelligible and scalable Semantic Role Labeling. Its goal is to manually cluster WordNet synsets that share similar semantics into a set of semantically-coherent frames.

Semantic Role Labeling

paper

website

SensEmBERT

A knowledge-based approach for producing sense embeddings in multiple languages that lie in a space comparable with that of BERT contextualized word representations.

Word Sense Disambiguation

Sense Embeddings

paper

website

LSTMEmbed: Learning Word and Sense Representations from a LargeSemantically Annotated Corpus with Long Short-Term Memories

A study on the capabilities of bidirectional LSTM models to learn representations of word senses from semantically annotated corpora.

Neural Networks

paper

website

Train-O-Matic

A knowledge-based approach for producing large amount of sense-annotated corpora in virtually more than 200 languages. Train-O-Matic paved the way to supervised Word Sense Disambiguation in languages other than English where manually-annotated data are not available.

Word Sense Disambiguation

Corpus

paper

website

People