Events

Workshop on Ten Years of BabelNet and Multilingual Neurosymbolic Natural Language Understanding
4-5th JULY 2022
The event aims at disseminating research, fostering discussion as well as encouraging scientific contamination across perspectives of neuro-symbolic processing of language.
We will also take this opportunity to celebrate 10 years of BabelNet.
The workshop is co-organized by Sapienza, Babelscape and Accademia della Crusca. It is co-funded by the ERC, the European Language Grid and the AI Journal.

About

The exponential growth of the Web is resulting in vast amounts of online content. However, the information expressed therein is not at easy reach: what we typically browse is only an infinitesimal part of the Web. And even if we had time to read all the Web we could not understand it, as most of it is written in languages we do not speak.

Computers, instead, have the power to process the entire Web. But, in order to ”read” it, that is perform machine reading, they still have to face the hard problem of Natural Language Understanding, i.e., automatically making sense of human language. To tackle this long-lasting challenge in Natural Language Processing (NLP), the task of semantic parsing has recently gained popularity. This aims at creating structured representations of meaning for an input text. However, current semantic parsers require supervision, binding them to the language of interest and hindering their extension to multiple languages.



MOUSSE proposes a research program to investigate radically new directions for enabling multilingual semantic parsing, without the heavy requirement of annotating training data for each new language. The key intuitions of our proposal are treating multilinguality as a resource rather than an obstacle and embracing the knowledge-based paradigm which allows supervision in the machine learning sense to be replaced with efficacious use of lexical knowledge resources.

In stage 1 of the project we will acquire a huge network of language-independent, structured semantic representations of sentences. In stage 2, we will leverage this resource to develop innovative algorithms that perform semantic parsing in any language. These two stages are mutually beneficial, progressively enriching less-resourced languages and contributing towards leveling the playing field for all languages.

Contract. no 726487

Projects

Biases in Large Language Models
It discusses data selection bias and various forms of social bias present in the text generated by these models. The study explores bias related to gender, age, sexual orientation, ethnicity, religion, and culture. The paper concludes by suggesting approaches to measure, reduce, and address these biases.
MultiNERD
This paper presents a methodology for automatically generating a comprehensive multilingual NER dataset. The dataset addresses 10 languages, 15 NER categories, and 2 textual genres. It includes disambiguation information for multilingual entity linking and image URLs for multimodal systems.
MATESE
We propose MATESE metrics for automatic machine translation evaluation, leveraging the Multidimensional Quality Metrics (MQM) framework. MATESE reframes evaluation as a sequence tagging problem, achieving high correlation with human judgments and interpretability. MATESE-QE enables reference-free evaluation in challenging settings.
Probing LMs for Predicate Argument Structures
This study investigates the ability of current pretrained language models to capture predicate-argument structures, providing insights into how we can take advantage of such knowledge to improve Semantic Role Labeling systems.
GeneSis
GeneSis is the first generative approach to English lexical substitution. With this novel approach, we reach state-of-the-art results; moreover, we can effortlessly generate silver data for the task. We assess the quality of the generated resources both qualitatively and quantitatively, showing that the released datasets can help supervised models improve.
NER4EL
Currently available systems for Entity Linking often require pretraining on massive amounts of data in order to achieve state-of-the-art results. In this paper, we address this issue and present several ways to exploit Named Entity Recognition to narrow the gap between Entity Linking systems trained on large and small datasets.
SIR
Sense-enhanced Information Retrieval (SIR) brings Word Sense Disambiguation and Information Retrieval closer and provides additional semantic information for the query via sense definitions. This semantic information leads to improvements over a baseline that does not access semantics in multiple languages.
Exemplification Modeling: Can You Give Me an Example, Please?
Starting from (word, definition) pairs, we present a neural architecture capable of automatically generating usage examples for the word according to the requested semantics. It is possible to create high-quality sense-tagged data which cover the full range of meanings in any inventory of interest, and their interactions within sentences. The use of generated data as training corpus for Word Sense Disambiguation enables outperforming the current state of the art.
ALaSca
The first automatically-built, large-scale resource for lexical substitution. Through an automated approach, we are finally able to extract substitutes for words in context, building a large-scale dataset that allows simple models to be finetuned on lexical substitution, achieving results that compete with complex state-of-the-art models.
SGL: Speaking the Graph Languages of Semantic Parsing via Multilingual Translation
A state-of-the-art approach to cross-framework and cross-lingual semantic parsing, where we frame the task as multilingual NMT. This pushes the overall performances further up thanks to transfer learning and, besides, enables the usage of a single shared model.
XL-WSD
A cross-lingual large-scale evaluation benchmark for the WSD task featuring sense-annotated development and test sets in 18 languages from six different linguistic families, together with language-specific silver training data.
Generationary
A neural seq2seq model which contextualizes a target expression in a sentence by generating an ad hoc definition. The work is a unified approach to computational lexical-semantic tasks, encompassing state-of-the-art Word Sense Disambiguation, Definition Modeling and Word-in-Context.
A Survey on Multilingual Sense-Annotated Corpora for Word Sense Disambiguation
A survey picturing the main challenges in the field of multilingual Word Sense Disambiguation and highlighting the most important efforts in mitigating the knowledge-acquisition bottleneck problem.
SyntagRank
A multilingual Word Sense Disambiguation system powered by SyntagNet. Designed to be fast, reliable, and easily accessible, SyntagRank allows the automatic labeling of concepts and named entities within the input sentence by exploiting the syntagmatic relations between them.
SyntagNet
A manually-curated large-scale lexical-semantic combination database which associates pairs of concepts with pairs of co-occurring words, hence capturing sense distinctions evoked by syntagmatic relations. The database currently covers 78,000 noun-verb and noun-noun lexical combinations, with 88,019 semantic combinations linking 20,626 WordNet 3.0 unique synsets with a relation edge.
Sense Distribution Learning: EnDI and DaD
Two knowledge-based approaches for learning sense distributions from raw text data. Both approaches proved to attain state-of-the-art results in predicting the Most Frequent Sense of a word and to effectively scale to different languages.
EuroSense
Description: A multilingual sense-annotated resource, automatically built via the joint disambiguation of the Europarl parallel corpus in 21 languages, with almost 123 million sense annotations for over 155 thousand distinct concepts and entities, drawn from the multilingual sense inventory of BabelNet.
BMR
BMR is a new language-independent formalism that abstracts away from language-specific constraints thanks to two multilingual semantic resources, BabelNet and VerbAtlas. To put our formalism into practice, we also created BMR 1.0, the first dataset labeled according to BMR.
Entity Disambiguation with Entity Definitions
This paper tackles the limitation of using only Wikipedia titles as textual representations in Entity Disambiguation (ED). It explores more expressive representations to improve disambiguation performance.
EUREKA
EUREKA is an ensemble-based approach for automatic euphemism detection. EUREKA includes data correction, an expanded corpus (EuphAug), leveraging model representations of Potentially Euphemistic Terms (PETs), and utilizing semantically close sentences.
AMuSE-WSD
AMuSE-WSD provides an easy way to disambiguate text in 40 languages thanks to its state-of-the-art multilingual neural model and its intuitive API. Take advantage of AMuSE-WSD to integrate sense knowledge into multilingual downstream applications!
UniteD-SRL
UniteD-SRL is a new benchmark for multilingual and cross-lingual Semantic Role Labeling. Differently from previous efforts, UniteD-SRL provides parallel gold-standard development and test sets annotated with a single inventory, VerbAtlas. UniteD-SRL is available in 4 languages: English, Chinese, French and Spanish.
SPRING
SPRING Online Services provide a Web interface and RESTful APIs for our state-of-the-art AMR parsing and generation system, SPRING (Symmetric PaRsIng aNd Generation). The Web interface has been developed to be easily used by the Natural Language Processing community, as well as by the general public. It provides, among other things, a highly interactive visualization platform and a feedback mechanism to obtain user suggestions for further improvements of the system’s output.
MultiMirror: Neural Cross-lingual Word Alignment\\for Multilingual Word Sense Disambiguation
Sense projection approach for multilingual WSD. Based upon a novel neural model for word alignment, MultiMirror automatically generates sense-annotated datasets in multiple languages that lead a simple mBERT-powered classifier to surpass the previous state of the art on standard benchmarks in Multilingual WSD.
Generating Senses and RoLes: An End-to-End Model for Dependency- and Span-based Semantic Role Labeling
A state-of-the-art approach for end-to-end Semantic Role Labeling based on joint generation of senses and roles that rivals the long-standing best-performing sequence labeling approaches.
Unifying Cross-Lingual Semantic Role Labeling with Heterogeneous Linguistic Resources
A state-of-the-art approach to exploit heterogeneous data to perform cross-lingual Semantic Role Labeling over 100 languages with 6 different inventories. Recipient of the NAACL 2021 outstanding paper award.
SPRING
A simple, seq2seq symmetric Text-to-AMR parsing / AMR-to-Text generation approach which exploits a pre-trained encoder-decoder and achieves state-of-the-art performance in both tasks.
XL-AMR
A cross-lingual AMR parser that exploits the existing training data in English to transfer semantic representations across languages. The achieved results shed light on the applicability of AMR as an interlingua and set the state of the art in Chinese, German, Italian and Spanish cross-lingual AMR parsing. Furthermore, a detailed qualitative analysis shows that the proposed parser can overcome common translation divergences among languages.
EViLBERT
A methodology to create effective multimodal sense embeddings starting from BabelPic. We prove such embeddings to improve the performance of a Word Sense Disambiguation architecture over a strong unimodal baseline.
CluBERT
CluBERT is a multilingual approach to inducing the distributions of word senses from a corpus of raw sentences by clustering words' occurrences according to their contextual representation.
CSI
A novel coarse-grained sense inventory of 45 labels shared across lemmas and parts-of-speech. CSI labels are highly descriptive, allowing humans to easily annotate data. Moreover, when used as sense inventory for WSD, CSI leads a supervised model to reach great performances without making the disambiguation task trivial.
Embedding Words and Senses Together via Joint Knowledge-Enhanced Training
A model which exploits large corpora and knowledge from semantic networks in order to produce a unified vector space of word and sense embeddings.
SenseDefs
A large-scale high-quality corpus of disambiguated definitions in multiple languages, comprising sense annotations of both concepts and named entities from a wide-coverage unified sense inventory.
Reducing Disambiguation Biases
This paper addresses the challenge of polysemous word disambiguation in Neural Machine Translation (NMT). It proposes a novel approach for creating high-precision sense-annotated parallel corpora and a tailored fine-tuning strategy that utilizes these annotations during training.
Nibbling at the hard core of WSD
An in-depth study on modern approaches and evaluations for Word Sense Disambiguation. As a result, we outline what is currently missing and put forward a novel benchmark to measure future progress in WSD.
SRL4E
SRL4E is a unified evaluation framework focused on Semantic Role Labeling for Emotions, which unifies several datasets tagged with emotions and their semantic roles by using a common labeling scheme.
DiBiMT
DiBiMT is the first fully-manually curated benchmark for understanding and measuring Disambiguation Biases in Neural Machine Translation. Recipient of the Best Resource Paper award at ACL2022.
InVeRo-XL
InVeRo-XL is the first prepackaged end-to-end system for cross-lingual Semantic Role Labeling. You can use InVeRo-XL to get predicate sense and semantic role annotations in more than 40 languages and 7 predicate-argument structure inventories!
Integrating Personalized PageRank into Neural Word Sense Disambiguation
We improve EWISER (Bevilacqua and Navigli, 2020) by incorporating an online neural approximated PageRank. Our method exploits the global graph structure while keeping space requirements linear in the number of edges. We obtain strong improvements, matching the current state of the art.
Ten Years of BabelNet: A Survey
BabelNet is now ten years old. In this timeframe it has been functioning as a repository of knowledge in hundreds of different languages. In this survey we document several applications enabled by BabelNet as well as discuss the most fruitful future development directions for the NLP and AI communities.
ESC: Redesigning WSD with Extractive Sense Comprehension
A redesigned approach to Word Sense Disambiguation through Extractive Reading Comprehension. Our system, ESCHER, achieves unprecedented performances on a number of different benchmarks and settings.
Bridging the Gap in Multilingual Semantic Role Labeling: a Language-Agnostic Approach
A fully language-agnostic SRL model that does away with morphological and syntactic features to achieve robustness across languages. Our approach outperforms the current state of the art in 6 languages, especially whenever a scarce amount of training data is available.
ARES
A semi-supervised approach for producing contextualized multilingual sense representations for all the concepts in a language vocabulary. ARES’ embeddings achieve state-of-the-art results on the English and multilingual Word Sense Disambiguation task, and competitive results in the Word-in-Context task.
MuLaN
MuLaN (Multilingual Label propagatioN) is a label propagation technique tailored to Word Sense Disambiguation and capable of automatically producing sense-tagged training datasets in multiple languages, jointly leveraging contextualized word embeddings and the multilingual information enclosed in knowledge bases.
EWISER
A neural supervised Word Sense Disambiguation system that is able to incorporate both synset embeddings and the WordNet graph. State-of-the-art results, going for the first time beyond the 80% performance, in English, French, German, Italian and Spanish benchmarks!
InVeRo
A platform created with the aim of making Semantic Role Labeling more accessible to a wider audience: with InVeRo, users can easily annotate sentences with intelligible verbs and roles.
OneSeC
A language-independent method for automatically producing multilingual sense-annotated datasets on a large scale by leveraging Wikipedia's inner structure.
QBERT
A Transformer-based architecture for contextualized embeddings which makes use of a co-attentive layer to produce more deeply bidirectional representations, better-fitting for the WSD task. As a result, a WSD system trained with QBERT beats the state of the art.
Neural Sequence Learning Models for Word Sense Disambiguation
An in-depth study on end-to-end neural architectures tailored to the WSD task, from bidirectional Long Short-Term Memory to encoder-decoder models.
ID10M
This paper focuses on the automatic identification and understanding of idioms, a challenging task in Natural Language Understanding. It introduces a novel multilingual Transformer-based system for idiom identification, along with a high-quality training dataset in 10 languages and a manually curated evaluation benchmark.
ExtEnD
ExtEnD is a reformulation as a text extraction problem of Entity Disambiguation, the task of linking a mention in context with its most suitable entity in a reference knowledge base.
BMR
This paper introduces an innovative, fully semantic meaning representation that transcends existing formalisms. By abstracting text into meaning and incorporating language-independent concepts and semantic relations, we address key limitations and enable cross-modal connections in Natural Language Understanding. Our approach fosters multilingual communication across texts, images, videos, speech, sound, and logical formulas.
STEPS
STEPS is an end-to-end approach for free-form event process typing, understanding human intent by generating action-object sequences, outperforming previous methods.
VDM
Visual Definition Modeling (VDM) enhances the Definition Modeling paradigm by grounding textual definitions to visual representations. This paper introduces DEMETER and DIONYSUS, two benchmarks that require models to generate textual definitions for target words or object patches given an image as context. We evaluate six different baselines, showing the complexity of the benchmarks and the effectiveness of a text-only encoder-decoder model.
ConSeC
CONtinuous SEnse Comprehension (ConSeC) is a novel approach to Word Sense Disambiguation: leveraging an extractive re-framing of this task as a text extraction problem, we introduce a feedback loop strategy that allows the disambiguation of a target word to be conditioned not only on its context but also on the explicit senses assigned to nearby words. Using this novel approach, ConSeC sets a new state of the art both in English and Multilingual WSD!
WikiNEuRal: Combined Neural and Knowledge-based Silver Data Creation for Multilingual NER
We exploit the texts of Wikipedia and introduce a new methodology based on the effective combination of knowledge-based approaches and neural models, together with a novel domain adaptation technique, to produce high-quality training corpora for NER in multiple languages.
Recent Trends in Word Sense Disambiguation: A Survey
An up-to-date survey on what’s in modern Word Sense Disambiguation, covering training and test data, as well as automatic classification systems with a focus on the recent trend of hybridization between supervised and knowledge-based algorithms.
Framing Word Sense Disambiguation as a Multi-Label Problem for Model-Agnostic Knowledge Integration
A simple approach to take into account multiple sense annotations for a target word in context. Framing Word Sense Disambiguation as a multi-label classification problem also provides an effective way to seamlessly integrate relational knowledge into a model.
Conception
Language-independent vector representations of concepts which place multilinguality at their core while retaining explicit relationships between concepts. Conception achieves state-of-the-art performance in multilingual and cross-lingual word similarity and English Word Sense Disambiguation.
XL-WiC
A large multilingual benchmark for the Word in Context task, featuring gold standards in 12 new languages from varied language families and with different degrees of resource availability. XL-WiC opens room for evaluating the lexical-semantic capabilities of neural models on different scenarios such as zero-shot and cross-lingual transfer.
BabelPic
The first multimodal dataset with a focus on non-concrete nominal and verbal concepts which is also linked to WordNet and BabelNet. BabelPic is enhanced with a methodology for the automatic extension of its coverage to any BabelNet synsets.
VerbAtlas
A novel large-scale manually-crafted semantic resource for wide-coverage, intelligible and scalable Semantic Role Labeling. Its goal is to manually cluster WordNet synsets that share similar semantics into a set of semantically-coherent frames.
SensEmBERT
A knowledge-based approach for producing sense embeddings in multiple languages that lie in a space comparable with that of BERT contextualized word representations.
LSTMEmbed: Learning Word and Sense Representations from a LargeSemantically Annotated Corpus with Long Short-Term Memories
A study on the capabilities of bidirectional LSTM models to learn representations of word senses from semantically annotated corpora.
Train-O-Matic
A knowledge-based approach for producing large amount of sense-annotated corpora in virtually more than 200 languages. Train-O-Matic paved the way to supervised Word Sense Disambiguation in languages other than English where manually-annotated data are not available.

People

Rocco Tripodi
Simone Conia
Andrea Di Fabio
Najla Kalach
Federico Martelli
Giuliano Panzironi
Martina Piromalli
Valentina Pyatkin
Gabriele Tola
Alessandro Zinnai