Publications
This is a list of my scientific and technical publications. (My publications in humour, poetry, and recreational linguistics are in a separate list.)
This list is also available as a BibTeX file.

Overview of JOKER 2023 Automatic Wordplay Analysis Task 1 – pun detection.
In Mohammad Aliannejadi, Guglielmo Faggioli, Nicola Ferro, and Michalis Vlachos, editors, Working Notes of CLEF 2023 – Conference and Labs of the Evaluation Forum, volume 3497 of CEUR Workshop Proceedings (ISSN 1613-0073), pages 1785–1803, October 2023.
This paper presents details of
Task 1 of the JOKER-2023 Track, which aims to detect sentences in
English, French, and Spanish that contain wordplay. With applications in
humour generation, sentiment analysis, conversational agents, content
filtering, and linguistic creativity, this task is still challenging despite
significant recent progress in information retrieval and natural language
processing. Building on the lessons learned from last year's edition of the
JOKER track, our overall goal is to foster progress in the automatic
interpretation, generation, and translation of wordplay in English, Spanish,
and French. In this paper, we define our task and describe our approaches to
corpus creation and evaluation in the three languages. We then present an
overview of the participating systems, including summaries of their
approaches and a comparison of their performance.
@inproceedings{ermakova2023overviewtask1,
author = {Liana Ermakova and Tristan Miller and Anne-Gwenn
Bosser and Victor Manuel {Palma Preciado} and Grigori Sidorov and Adam
Jatowt},
editor = {Mohammad Aliannejadi and Guglielmo Faggioli and
Nicola Ferro and Michalis Vlachos},
title = {Overview of {JOKER} 2023 {Automatic} {Wordplay}
{Analysis} {Task}~1~-- Pun Detection},
booktitle = {{Working}
{Notes} of {CLEF}~2023~-- {Conference} and {Labs} of the {Evaluation}
{Forum}},
volume = {3497},
pages = {1785--1803},
series = {CEUR Workshop Proceedings},
month = oct,
year = {2023},
issn = {1613-0073},
}
Overview of JOKER 2023 Automatic Wordplay Analysis Task 2 – pun location and interpretation.
In Mohammad Aliannejadi, Guglielmo Faggioli, Nicola Ferro, and Michalis Vlachos, editors, Working Notes of CLEF 2023 – Conference and Labs of the Evaluation Forum, volume 3497 of CEUR Workshop Proceedings (ISSN 1613-0073), pages 1804–1817, October 2023.
This paper presents an overview
of Task 2 of the JOKER-2023 track on automatic wordplay analysis. The
goal of the JOKER track series is to bring together linguists, translators,
and computer scientists to foster progress in the automatic interpretation,
generation, and translation of wordplay. Task 2 is focussed on pun
location and interpretation. Automatic pun interpretation is important for
advancing natural language understanding, enabling humor generation, aiding
in translation and cross-linguistic understanding, enhancing information
retrieval, and contributing to the field of computational creativity. In this
overview, we present the general setup of the shared task we organized as
part of the CLEF-2023 evaluation campaign, the participants' approaches, and
the quantitative results.
@inproceedings{ermakova2023overviewtask2,
author = {Liana Ermakova and Tristan Miller and Anne-Gwenn
Bosser and Victor Manuel {Palma Preciado} and Grigori Sidorov and Adam
Jatowt},
editor = {Mohammad Aliannejadi and Guglielmo Faggioli and
Nicola Ferro and Michalis Vlachos},
title = {Overview of {JOKER} 2023 {Automatic} {Wordplay}
{Analysis} {Task}~2~-- Pun Location and Interpretation},
booktitle = {{Working}
{Notes} of {CLEF}~2023~-- {Conference} and {Labs} of the {Evaluation}
{Forum}},
volume = {3497},
pages = {1804--1817},
series = {CEUR Workshop Proceedings},
month = oct,
year = {2023},
issn = {1613-0073},
}
Overview of JOKER 2023 Automatic Wordplay Analysis Task 3 – pun translation.
In Mohammad Aliannejadi, Guglielmo Faggioli, Nicola Ferro, and Michalis Vlachos, editors, Working Notes of CLEF 2023 – Conference and Labs of the Evaluation Forum, volume 3497 of CEUR Workshop Proceedings (ISSN 1613-0073), pages 1818–1827, October 2023.
This paper provides a
comprehensive overview of Task 3 of the JOKER-2023 track. The
overarching objective of the JOKER track series is to facilitate
collaboration among linguists, translators, and computer scientists to
advance the development of automatic interpretation, generation, and
translation of wordplay. Task 3 specifically concentrates on the
automatic translation of puns from English into French and Spanish. In this
overview, we outline the overall structure of the shared task that we
organized as part of the CLEF-2023 evaluation campaign. We discuss the
approaches employed by the participants and present and analyze the results
they achieved.
@inproceedings{ermakova2023overviewtask3,
author = {Liana Ermakova and Tristan Miller and Anne-Gwenn
Bosser and Victor Manuel {Palma Preciado} and Grigori Sidorov and Adam
Jatowt},
editor = {Mohammad Aliannejadi and Guglielmo Faggioli and
Nicola Ferro and Michalis Vlachos},
title = {Overview of {JOKER} 2023 {Automatic} {Wordplay}
{Analysis} {Task}~3~-- Pun Translation},
booktitle = {{Working}
{Notes} of {CLEF}~2023~-- {Conference} and {Labs} of the {Evaluation}
{Forum}},
volume = {3497},
pages = {1818--1827},
series = {CEUR Workshop Proceedings},
month = oct,
year = {2023},
issn = {1613-0073},
}
Overview of JOKER – CLEF-2023 track on automatic wordplay analysis.
In Avi Arampatzis, Evangelos Kanoulas, Theodora Tsikrika, Stefanos Vrochidis, Anastasia Giachanou, Dan Li, Mohammad Aliannejadi, Michalis Vlachos, Guglielmo Faggioli, and Nicola Ferro, editors, Experimental IR Meets Multilinguality, Multimodality, and Interaction: Proceedings of the Fourteenth International Conference of the CLEF Association (CLEF 2023), volume 14163 of Lecture Notes in Computer Science (ISSN 0302-9743), pages 397–415, Cham, September 2023. Springer. ISBN 978-3-031-42448-9. DOI: 10.1007/978-3-031-42448-9_26.
The goal of the JOKER track series is
to bring together linguists, translators, and computer scientists to foster
progress on the automatic interpretation, generation, and translation of
wordplay. Building on lessons learned from last year's edition, JOKER-2023
held three shared tasks aligned with the human approaches to the translation
of wordplay, or more specifically of puns in English, French, and Spanish:
detection, location and interpretation, and finally translation. In this
paper, we define these three tasks and describe our approaches to corpus
creation and evaluation. We then present an overview of the participating
systems, including summaries of their approaches and a comparison of their
performance. As in JOKER-2022, this year's track also solicited contributions
making further use of our data (an “unshared task”), which we
also report on.
@inproceedings{ermakova2023overview,
author = {Liana Ermakova and Tristan Miller and Anne-Gwenn
Bosser and Victor Manuel {Palma Preciado} and Grigori Sidorov and Adam
Jatowt},
editor = {Avi Arampatzis and Evangelos Kanoulas and Theodora
Tsikrika and Stefanos Vrochidis and Anastasia Giachanou and Dan Li and
Mohammad Aliannejadi and Michalis Vlachos and Guglielmo Faggioli and Nicola
Ferro},
title = {Overview of {JOKER} -- {CLEF}-2023 Track on Automatic
Wordplay Analysis},
booktitle = {Experimental {IR} Meets Multilinguality,
Multimodality, and Interaction: Proceedings of the {Fourteenth}
{International} {Conference} of the {CLEF} {Association} ({CLEF}
2023)},
volume = {14163},
pages = {397--415},
series = {Lecture Notes in Computer Science},
month = sep,
year = {2023},
publisher = {Springer},
address = {Cham},
isbn = {978-3-031-42448-9},
issn = {0302-9743},
doi = {10.1007/978-3-031-42448-9_26},
}
The JOKER Corpus: English–French parallel data for multilingual wordplay recognition.
In SIGIR '23: Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2796–2806, New York, NY, July 2023. Association for Computing Machinery. ISBN 978-1-4503-9408-6. DOI: 10.1145/3539618.3591885.
Despite recent advances in information retrieval and
natural language processing, rhetorical devices that exploit ambiguity or
subvert linguistic rules remain a challenge for such systems. However,
corpus-based analysis of wordplay has been a perennial topic of scholarship
in the humanities, including literary criticism, language education, and
translation studies. The immense data-gathering effort required for these
studies points to the need for specialized text retrieval and classification
technology, and consequently for appropriate test collections. In this paper,
we introduce and analyze a new dataset for research and applications in the
retrieval and processing of wordplay. Developed for the JOKER track at CLEF
2023, our annotated corpus extends and improves upon past English wordplay
detection datasets in several ways. First, we introduce hundreds of
additional positive examples; second, we provide French translations for the
examples; and third, we provide negative examples with characteristics
closely matching those of the positive examples. This last feature helps
ensure that AI models learn to effectively distinguish wordplay from
non-wordplay, and not simply texts differing in length, style, or vocabulary.
Our test collection thus represents a step towards wordplay-aware
multilingual information retrieval.
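To give a concrete flavour of the length-matching idea behind these negative examples, here is a minimal Python sketch; the matching heuristic and the data it would run on are simplified assumptions for illustration, not the actual corpus-construction procedure.

def match_negatives(positives, candidates, tolerance=5):
    """For each positive (wordplay) text, pick a negative candidate of similar
    character length so that classifiers cannot rely on superficial length cues."""
    matched = []
    pool = list(candidates)
    for pos in positives:
        best = min(pool, key=lambda c: abs(len(c) - len(pos)), default=None)
        if best is not None and abs(len(best) - len(pos)) <= tolerance:
            matched.append((pos, best))
            pool.remove(best)
    return matched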
@inproceedings{ermakova2023joker,
author = {Liana Ermakova and Anne-Gwenn Bosser and Adam Jatowt
and Tristan Miller},
title = {The {JOKER} {Corpus}: {English}--{French} Parallel
Data for Multilingual Wordplay Recognition},
booktitle = {{SIGIR}
'23: Proceedings of the 46th {International} {ACM} {SIGIR} {Conference} on
{Research} and {Development} in {Information} {Retrieval}},
pages = {2796--2806},
month = jul,
year = {2023},
publisher = {Association for Computing Machinery},
address = {New York, NY},
isbn = {978-1-4503-9408-6},
doi = {10.1145/3539618.3591885},
}
La interacción entre el hombre y la máquina en la traducción de juegos de palabras [Human–computer interaction in pun translation].
In Laura Mejías-Climent and Julio de los Reyes Lozano, editors, La traducción audiovisual a través de la traducción automática y la posedición: prácticas actuales y futuras, pages 37–60. Comares, Granada, July 2023. ISBN 978-84-1369-525-9. Translated by Lorena Pérez Macías.
@incollection{kolb2023interaccion,
author = {Waltraud Kolb and Tristan Miller},
editor = {Laura Mejías-Climent and de los Reyes Lozano,
Julio},
title = {La interacción entre el hombre y la máquina en la
traducción de juegos de palabras [{Human}--Computer Interaction in Pun
Translation]},
booktitle = {La
traducción audiovisual a través de la traducción automática y la
posedición: prácticas actuales y futuras},
pages = {37--60},
month = jul,
year = {2023},
publisher = {Comares},
address = {Granada},
isbn = {978-84-1369-525-9},
note = {Translated by Lorena Pérez Macías.},
}
Science for fun: The CLEF 2023 JOKER track on automatic wordplay analysis.
In Jaap Kamps, Lorraine Goeuriot, Fabio Crestani, Maria Maistro, Hideo Joho, Brian Davis, Cathal Gurrin, Udo Kruschwitz, and Annalina Caputo, editors, Advances in Information Retrieval: 45th European Conference on Information Retrieval, ECIR 2023, Dublin, Ireland, April 2–6, Proceedings, Part III, volume 13982 of Lecture Notes in Computer Science (ISSN 0302-9743), pages 546–556, Berlin, Heidelberg, April 2023. Springer. ISBN 978-3-031-28241-6. DOI: 10.1007/978-3-031-28241-6_63.
Understanding and translating humorous wordplay
often requires recognition of implicit cultural references, knowledge of word
formation processes, and discernment of double meanings – issues which
pose challenges for humans and computers alike. This paper introduces the
CLEF 2023 JOKER track, which takes an interdisciplinary approach to the
creation of reusable test collections, evaluation metrics, and methods for
the automatic processing of wordplay. We describe the track's interconnected
shared tasks for the detection, location, interpretation, and translation of
puns. We also describe associated data sets and evaluation methodologies, and
invite contributions making further use of our data.
@inproceedings{ermakova2023science,
author = {Liana Ermakova and Tristan Miller and Anne-Gwenn
Bosser and Victor Manuel {Palma Preciado} and Grigori Sidorov and Adam
Jatowt},
editor = {Jaap Kamps and Lorraine Goeuriot and Fabio Crestani
and Maria Maistro and Hideo Joho and Brian Davis and Cathal Gurrin and Udo
Kruschwitz and Annalina Caputo},
title = {Science for Fun: The {CLEF} 2023 {JOKER} Track on
Automatic Wordplay Analysis},
booktitle = {Advances
in Information Retrieval: 45th {European} {Conference} on {Information}
{Retrieval}, {ECIR} 2023, {Dublin}, {Ireland}, {April} 2--6, Proceedings,
Part~{III}},
volume = {13982},
pages = {546--556},
series = {Lecture Notes in Computer Science},
month = apr,
year = {2023},
publisher = {Springer},
address = {Berlin, Heidelberg},
isbn = {978-3-031-28241-6},
issn = {0302-9743},
doi = {10.1007/978-3-031-28241-6_63},
}
Towards a survey of meaning representation.
Dagstuhl Reports, 11(8):29, 2022. ISSN 2192-5283.
Following the working group on “What is missing in
ML&AI to understand jokes?”, we discussed the possibility of surveying the
expressiveness of existing models of meaning representation, contrasted with
what existing theories in cognitive science predict about the relevant
cognitive activities and processes. Spatial stimuli activate a variety of
spatial cells in the hippocampus, forming a cognitive map or collage in
memory and producing spatial descriptions in language. We need to survey
existing models of Mental Spatial Representation (MSR) in the literature of
cognitive psychology. On the other hand, we need to analyse the vector
embeddings of spatial entities and relations in large-scale pre-trained world
models, and to identify the gap between MSR and vector embeddings via machine
learning.
@article{dong2022towards,
author = {Tiansi Dong and Anthony Cohn and Christian Hempelmann
and Kanishka Misra and Jens Lehmann and Alexander Mehler and Tristan Miller
and Siba Mohsen and Roberto Navigli and Julia Rayz and Stefan Wrobel and Ron
Sun and Volker Tresp},
title = {Towards a Survey of Meaning
Representation},
journal = {Dagstuhl Reports},
volume = {11},
number = {8},
pages = {29},
year = {2022},
issn = {2192-5283},
}
Overview of JOKER@CLEF 2022: Automatic wordplay and humour translation workshop.
In Alberto Barrón-Cedeño, Giovanni Da San Martino, Mirko Degli Esposti, Fabrizio Sebastiani, Craig Macdonald, Gabriella Pasi, Allan Hanbury, Martin Potthast, Guglielmo Faggioli, and Nicola Ferro, editors, Experimental IR Meets Multilinguality, Multimodality, and Interaction: Proceedings of the Thirteenth International Conference of the CLEF Association (CLEF 2022), volume 13390 of Lecture Notes in Computer Science (ISSN 0302-9743), pages 447–469, Cham, 2022. Springer. ISBN 978-3-031-13642-9. DOI: 10.1007/978-3-031-13643-6_27.
While humour and wordplay are among
the most intensively studied problems in the field of translation studies,
they have been almost completely ignored in machine translation. This is
partly because most AI-based translation tools require a quality and quantity
of training data (e.g., parallel corpora) that has historically been lacking
for humour and wordplay. The goal of the JOKER@CLEF 2022 workshop was to
bring together translators and computer scientists to work on an evaluation
framework for wordplay, including data and metric development, and to foster
work on automatic methods for wordplay translation. To this end, we defined
three pilot tasks: (1) classify and explain instances of wordplay,
(2) translate single terms containing wordplay, and (3) translate
entire phrases containing wordplay (punning jokes). This paper describes and
discusses each of these pilot tasks, as well as the participating systems and
their results.
@inproceedings{ermakova2022overview,
author = {Liana Ermakova and Tristan Miller and Fabio Regattin
and Anne-Gwenn Bosser and Claudine Borg and Élise Mathurin and Gaëlle Le
Corre and Sílvia Araújo and Radia Hannachi and Julien Boccou and Albin
Digue and Aurianne Damoy and Benoît Jeanjean},
editor = {Alberto Barrón-Cedeño and Giovanni Da San Martino
and Mirko Degli Esposti and Fabrizio Sebastiani and Craig Macdonald and
Gabriella Pasi and Allan Hanbury and Martin Potthast and Guglielmo Faggioli
and Nicola Ferro},
title = {Overview of {JOKER@CLEF} 2022: Automatic Wordplay and
Humour Translation Workshop},
booktitle = {Experimental {IR} Meets Multilinguality,
Multimodality, and Interaction: Proceedings of the {Thirteenth}
{International} {Conference} of the {CLEF} {Association} ({CLEF}
2022)},
volume = {13390},
pages = {447--469},
series = {Lecture Notes in Computer Science},
year = {2022},
publisher = {Springer},
address = {Cham},
isbn = {978-3-031-13642-9},
issn = {0302-9743},
doi = {10.1007/978-3-031-13643-6_27},
}
Human–computer interaction in pun translation.
In James Luke Hadley, Kristiina Taivalkoski-Shilov, Carlos S. C. Teixeira, and Antonio Toral, editors, Using Technologies for Creative-Text Translation, pages 66–88. Routledge, 2022. ISBN 9781003094159. DOI: 10.4324/9781003094159-4.
We present and evaluate PunCAT, an interactive electronic
tool for the translation of puns. Following the strategies known to be
applied in pun translation, PunCAT automatically translates each sense of the
pun separately; it then allows the user to explore the semantic fields of
these translations in order to help construct a plausible target-language
solution that maximizes the semantic correspondence to the original. Our
evaluation is based on an empirical pilot study in which the participants
translated puns from a variety of published sources from English into German,
with and without PunCAT. We aimed to answer the following questions: Does the
tool support, improve, or constrain the translation process, and if so, in
what ways? And what are the tool's main benefits and drawbacks as perceived
and described by the participants? Our analysis of the translators' cognitive
processes gives us insight into their decision-making strategies and how they
interacted with the tool. We find clear evidence that PunCAT effectively
supports the translation process in terms of stimulating brainstorming and
broadening the translator's pool of solution candidates. We have also
identified a number of directions in which the tool could be adapted to
better suit translators' work processes.
@incollection{kolb2022human,
author = {Waltraud Kolb and Tristan Miller},
editor = {James Luke Hadley and Kristiina Taivalkoski-Shilov
and Carlos S. C. Teixeira and Antonio Toral},
title = {Human--Computer Interaction in Pun
Translation},
booktitle = {Using
Technologies for Creative-Text Translation},
pages = {66--88},
year = {2022},
publisher = {Routledge},
isbn = {9781003094159},
doi = {10.4324/9781003094159-4},
}
What is missing in ML&AI to understand jokes?
Dagstuhl Reports, 11(8):32, 2022. ISSN 2192-5283.
Why can't current Machine Learning and AI (ML&AI)
techniques understand jokes as we humans do? What is missing? The
knowledge needed to understand jokes is in neither the joke texts
nor the neural networks. Acquiring and reasoning with commonsense
knowledge is still an open problem for Machine Learning and AI. Meaning
representation based on embeddings is insufficient; we need meaning
representation formats that go beyond vector representations. Vectors are
only shadows. Information processing and meaning understanding are embodied.
The discussion guides us to develop novel embodied ML&AI techniques to
understand spatial jokes first.
@article{mehler2022what,
author = {Alexander Mehler and Tiansi Dong and Thomas Liebig
and Tristan Miller and Siba Mohsen and Sven Naumann},
title = {What Is Missing in {ML}\&{AI} to Understand
Jokes?},
journal = {Dagstuhl Reports},
volume = {11},
number = {8},
pages = {32},
year = {2022},
issn = {2192-5283},
}
Can we diagram the understanding of humour?
Dagstuhl Reports, 11(8):33, 2022. ISSN 2192-5283.
Cartoons can be understood without language. That is, a
suitably arranged scene of simple objects, with no accompanying text, is
often enough to make us laugh – evidence that thinking (mental
activity) happens before language. This raises the question of non-linguistic
diagrammatic representation of spatial humour, along with the mechanism of
neural computation. In particular, we raise the following questions: (1) How can
we diagrammatically formalise spatial humour? (2) How can these diagrammatic
formalisms be processed by neural networks? (3) How can this neural
computation deliver high-level schemata that are similar to the
script-opposition semantic theory of humour? The spatial knowledge encoded in
the scene can activate the necessary spatial and non-spatial knowledge. By
what neural associative mechanism or process of reasoning do we put this all
together to “get” the joke? During the seminar, we aimed to make
some headway towards establishing (1) exactly what sort of
scene-specific and common-sense knowledge is required to understand any given
cartoon, (2) what part of this knowledge could in principle be acquired
by existing machine learning (ML) techniques, and which could be acquired or
encoded through symbolic structures, (3) what activation process
acquires the rest of the knowledge required to interpret the humour, and
(4) whether there is a unified representation that could represent this
knowledge in a computer’s working memory.
@article{miller2022can,
author = {Tristan Miller and Anthony Cohn and Tiansi Dong and
Christian Hempelmann and Siba Mohsen and Julia Rayz},
title = {Can We Diagram the Understanding of
Humour?},
journal = {Dagstuhl Reports},
volume = {11},
number = {8},
pages = {33},
year = {2022},
issn = {2192-5283},
}
Remembering Netizens: An interview with Ronda Hauben, co-author of Netizens: On the history and impact of Usenet and the Internet (1997).
Internet Histories: Digital Technology, Culture and Society, 7(1):76–98, 2022. ISSN 2470-1483. DOI: 10.1080/24701475.2022.2123120.
Netizens, Michael and Ronda Hauben's
foundational treatise on Usenet and the Internet, was first published in
print 25 years ago. In this piece, we trace the history and impact of the
book and of Usenet itself, contextualising them within the contemporary and
modern-day scholarship on virtual communities, online culture, and Internet
history. We discuss the Net as a tool of empowerment, and touch on the
social, technical, and economic issues related to the maintenance of shared
network infrastructures and to the preservation and commodification of Usenet
archives. Our interview with Ronda Hauben offers a retrospective look at the
development of online communities, their impact, and how they are studied.
She recounts her own introduction to the online world, as well as the impetus
and writing process for Netizens. She presents Michael Hauben's conception of
“netizens” as contributory citizens of the Net (rather than mere
users of it) and the “electronic commons” they built up, and argues that
this collaborative and collectivist model has been overwhelmed and endangered
by the privatisation and commercialisation of the Internet and its
communities.
@article{miller2022remembering,
author = {Tristan Miller and Camille Paloque-Bergès and Avery
Dame-Griff},
title = {Remembering {Netizens}: {An} Interview with {Ronda}
{Hauben}, Co-Author of {Netizens}: {On} the History and Impact of {Usenet}
and the {Internet} (1997)},
journal = {Internet Histories: Digital Technology, Culture and
Society},
volume = {7},
number = {1},
pages = {76--98},
year = {2022},
issn = {2470-1483},
doi = {10.1080/24701475.2022.2123120},
}
Overview of the CLEF 2022 JOKER Task 2: Translate wordplay in named entities.
In Guglielmo Faggioli, Nicola Ferro, Allan Hanbury, and Martin Potthast, editors, Proceedings of the Working Notes of CLEF 2022 – Conference and Labs of the Evaluation Forum, Bologna, Italy, September 5th to 8th, 2022, volume 3180 of CEUR Workshop Proceedings (ISSN 1613-0073), pages 1666–1680, August 2022.
Onomastic wordplay has been
widely used as a rhetorical device by novelists, poets, and playwrights, from
character names in Shakespeare and other classic literature to named entities
in Pokémon, Harry Potter, Asterix, and video games. The translation of such
wordplay is problematic both for humans and algorithms due to its ambiguity
and unorthodox morphology. In this paper, we present an overview of Pilot
Task 2 of the JOKER@CLEF 2022 track, where participants had to
translate wordplay in named entities from English into French. For this, we
constructed a parallel corpus of wordplay in named entities from movies, video
games, advertising slogans, literature, etc. Five teams participated in the
task. The methods employed by participants were based on state-of-the-art
transformer models, which have the advantage of subword tokenisation. The
participants' models were pre-trained on large corpora and fine-tuned on the
JOKER training set. We observed that in many cases the models provided the
exact official translations, suggesting that they were pre-trained on
corpora containing the source texts used in the JOKER corpus. Those
translations that differed from the official ones only rarely contained
wordplay.
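For readers curious what such transformer-based translation looks like in practice, the following is a minimal sketch using the Hugging Face transformers library with a small general-purpose model (t5-small); it is an illustration only, not one of the participants' fine-tuned systems, and the example sentence is invented.

from transformers import pipeline

translator = pipeline("translation_en_to_fr", model="t5-small")
result = translator("Professor Sprout teaches Herbology at Hogwarts.")
# A generic model will usually miss the wordplay in the character's name.
print(result[0]["translation_text"])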
@inproceedings{ermakova2022overviewtask2,
author = {Liana Ermakova and Tristan Miller and Julien Boccou
and Albin Digue and Aurianne Damoy and Paul Campen},
editor = {Guglielmo Faggioli and Nicola Ferro and Allan Hanbury
and Martin Potthast},
title = {Overview of the {CLEF}~2022 {JOKER} {Task}~2:
Translate Wordplay in Named Entities},
booktitle = {Proceedings of the {Working} {Notes} of
{CLEF}~2022~-- {Conference} and {Labs} of the {Evaluation} {Forum},
{Bologna}, {Italy}, {September} 5th to 8th, 2022},
volume = {3180},
pages = {1666--1680},
series = {CEUR Workshop Proceedings},
month = aug,
year = {2022},
issn = {1613-0073},
}
Overview of the CLEF 2022 JOKER Task 1: Classify and explain instances of wordplay.
In Guglielmo Faggioli, Nicola Ferro, Allan Hanbury, and Martin Potthast, editors, Proceedings of the Working Notes of CLEF 2022 – Conference and Labs of the Evaluation Forum, Bologna, Italy, September 5th to 8th, 2022, volume 3180 of CEUR Workshop Proceedings (ISSN 1613-0073), pages 1641–1665, August 2022.
As a multidisciplinary field of
study, humour remains one of the most difficult aspects of intercultural
communication. Understanding humour often involves understanding implicit
cultural references and/or double meanings, which raises the question of how
to detect and classify instances of this complex phenomenon. This paper
provides an overview of Pilot Task 1 of the CLEF 2022 JOKER track, where
participants had to classify and explain instances of wordplay. We introduce
a new classification of wordplay and a new annotation scheme for wordplay
interpretation suitable both for phrase-based wordplay and wordplay in named
entities. We describe the collection of our data, our task setup, and the
evaluation procedure, and we give a brief overview of the participating
teams' approaches and results.
@inproceedings{ermakova2022overviewtask1,
author = {Liana Ermakova and Fabio Regattin and Tristan Miller
and Anne-Gwenn Bosser and Sílvia Araújo and Claudine Borg and Gaëlle Le
Corre and Julien Boccou and Albin Digue and Aurianne Damoy and Paul Campen
and Orlane Puchalski},
editor = {Guglielmo Faggioli and Nicola Ferro and Allan Hanbury
and Martin Potthast},
title = {Overview of the {CLEF}~2022 {JOKER} {Task}~1:
Classify and Explain Instances of Wordplay},
booktitle = {Proceedings of the {Working} {Notes} of
{CLEF}~2022~-- {Conference} and {Labs} of the {Evaluation} {Forum},
{Bologna}, {Italy}, {September} 5th to 8th, 2022},
volume = {3180},
pages = {1641--1665},
series = {CEUR Workshop Proceedings},
month = aug,
year = {2022},
issn = {1613-0073},
}
Overview of the CLEF 2022 JOKER Task 3: Pun translation from English into French.
In Guglielmo Faggioli, Nicola Ferro, Allan Hanbury, and Martin Potthast, editors, Proceedings of the Working Notes of CLEF 2022 – Conference and Labs of the Evaluation Forum, Bologna, Italy, September 5th to 8th, 2022, volume 3180 of CEUR Workshop Proceedings (ISSN 1613-0073), pages 1681–1700, August 2022.
The translation of puns is one
of the most challenging issues for translators and for this reason has become
an intensively studied phenomenon in the field of translation studies.
Translation technology aims to partially or even totally automate the
translation process, but relatively little attention has been paid to the use
of computers for the translation of wordplay. The CLEF 2022 JOKER track
aims to build a multilingual corpus of wordplay and evaluation metrics in
order to advance the automation of creative-language translation. This paper
provides an overview of the track's Pilot Task 3, where the goal is to
translate entire phrases containing wordplay (particularly puns). We describe
the data collection, the task setup, the evaluation procedure, and the
participants' results. We also cover a side product of our project, a
homogeneous monolingual corpus for wordplay detection in French.
@inproceedings{ermakova2022overviewtask3,
author = {Liana Ermakova and Fabio Regattin and Tristan Miller
and Anne-Gwenn Bosser and Claudine Borg and Benoît Jeanjean and Élise
Mathurin and Gaëlle Le Corre and Radia Hannachi and Sílvia Araújo and
Julien Boccou and Albin Digue and Aurianne Damoy},
editor = {Guglielmo Faggioli and Nicola Ferro and Allan Hanbury
and Martin Potthast},
title = {Overview of the {CLEF}~2022 {JOKER} {Task}~3: Pun
Translation from {English} into {French}},
booktitle = {Proceedings of the {Working} {Notes} of
{CLEF}~2022~-- {Conference} and {Labs} of the {Evaluation} {Forum},
{Bologna}, {Italy}, {September} 5th to 8th, 2022},
volume = {3180},
pages = {1681--1700},
series = {CEUR Workshop Proceedings},
month = aug,
year = {2022},
issn = {1613-0073},
}
CLEF Workshop JOKER: Automatic wordplay and humour translation.
In Matthias Hagen, Suzan Verberne, Craig Macdonald, Christin Seifert, Krisztian Balog, Kjetil Nørvåg, and Vinay Setty, editors, Advances in Information Retrieval: 44th European Conference on IR Research, ECIR 2022, Stavanger, Norway, April 10–14, 2022, Proceedings, Part II, Lecture Notes in Computer Science, pages 355–363, Berlin, Heidelberg, April 2022. Springer. ISBN 978-3-030-99738-0. DOI: 10.1007/978-3-030-99739-7_45.
Humour remains one of the most difficult aspects of
intercultural communication: understanding humour often requires
understanding implicit cultural references and/or double meanings, and this
raises the question of the (un)translatability of humour. Wordplay is a
common source of humour in literature, journalism, and advertising due to its
attention-getting, mnemonic, playful, and subversive character. The
translation of humour and wordplay is therefore in high demand. Modern
translation depends heavily on technological aids, yet few works have treated
the automation of humour and wordplay translation and the creation of humour
corpora. The goal of the JOKER workshop is to bring together translators and
computer scientists to work on an evaluation framework for creative language,
including data and metric development, and to foster work on automatic
methods for wordplay translation. We propose three pilot tasks:
(1) classify and explain instances of wordplay, (2) translate
single words containing wordplay, and (3) translate entire phrases
containing wordplay.
@inproceedings{ermakova2022clef,
author = {Liana Ermakova and Tristan Miller and Orlane
Puchalski and Fabio Regattin and Élise Mathurin and Sílvia Araújo and
Anne-Gwenn Bosser and Claudine Borg and Monika Bokiniec and Gaëlle Le Corre
and Benoît Jeanjean and Radia Hannachi and Ġorġ Mallia and Gordan Matas
and Mohamed Saki},
editor = {Matthias Hagen and Suzan Verberne and Craig Macdonald
and Christin Seifert and Krisztian Balog and Kjetil Nørvåg and Vinay
Setty},
title = {{CLEF} {Workshop} {JOKER}: Automatic Wordplay and
Humour Translation},
booktitle = {Advances
in Information Retrieval: 44th {European} {Conference} on {IR} {Research},
{ECIR} 2022, {Stavanger}, {Norway}, {April} 10--14, 2022, Proceedings, Part
{II}},
pages = {355--363},
series = {Lecture Notes in Computer Science},
month = apr,
year = {2022},
publisher = {Springer},
address = {Berlin, Heidelberg},
isbn = {978-3-030-99738-0},
issn = {0302-9743},
doi = {10.1007/978-3-030-99739-7_45},
}
End-to-end style-conditioned poetry generation: What does it take to learn from examples alone?
In Proceedings of the 5th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (LaTeCH-CLfL 2021), pages 57–66, November 2021. DOI: 10.18653/v1/2021.latechclfl-1.7.
In this work, we design an end-to-end model for poetry
generation based on conditioned recurrent neural network (RNN) language
models whose goal is to learn stylistic features (poem length, sentiment,
alliteration, and rhyming) from examples alone. We show this model
successfully learns the ‘meaning’ of length and sentiment, as we can
control it to generate longer or shorter as well as more positive or more
negative poems. However, the model does not grasp sound phenomena like
alliteration and rhyming, but instead exploits low-level statistical cues.
Possible reasons include the size of the training data, the relatively low
frequency and difficulty of these sublexical phenomena as well as model
biases. We show that more recent GPT-2 models also have problems learning
sublexical phenomena such as rhyming from examples alone.
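As a rough illustration of what “conditioned RNN language model” means here, the sketch below (PyTorch) concatenates a vector of style features to every token embedding before the recurrent layer; it is a simplified assumption of the general idea, not the architecture evaluated in the paper.

import torch
import torch.nn as nn

class ConditionedLM(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, style_dim=4, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim + style_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens, style):
        # tokens: (batch, seq_len); style: (batch, style_dim), e.g. hypothetical
        # length, sentiment, alliteration, and rhyme scores.
        emb = self.embed(tokens)
        style_rep = style.unsqueeze(1).expand(-1, emb.size(1), -1)
        hidden, _ = self.rnn(torch.cat([emb, style_rep], dim=-1))
        return self.out(hidden)  # next-token logits at each position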
@inproceedings{woeckener2021end,
author = {J{\"{o}}rg W{\"{o}}ckener and Thomas Haider and
Tristan Miller and The-Khang Nguyen and Thanh Tung Linh Nguyen and Minh Vu
Pham and Jonas Belouadi and Steffen Eger},
title = {End-to-end Style-Conditioned Poetry Generation:
{What} Does It Take to Learn from Examples Alone?},
booktitle = {Proceedings of the 5th {Joint} {SIGHUM} {Workshop} on
{Computational} {Linguistics} for {Cultural} {Heritage}, {Social} {Sciences},
{Humanities} and {Literature} ({LaTeCH}-{CLfL} 2021)},
pages = {57--66},
month = nov,
year = {2021},
doi = {10.18653/v1/2021.latechclfl-1.7},
}
SemEval-2021 Task 12: Learning with disagreements.
In Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021), pages 338–347, August 2021. ISBN 978-1-954085-70-1. DOI: 10.18653/v1/2021.semeval-1.41.
Disagreement between coders is ubiquitous in virtually
all datasets annotated with human judgements in both natural language
processing and computer vision. However, most supervised machine learning
methods assume that a single preferred interpretation exists for each item,
which is at best an idealization. The aim of the SemEval-2021 shared task on
Learning with Disagreements (Le-wi-Di) was to provide a unified testing
framework for methods for learning from data containing multiple and possibly
contradictory annotations, covering the best-known datasets containing
information about disagreements for interpreting language and classifying
images. In this paper we describe the shared task and its results.
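One simple way to learn from multiple, possibly contradictory annotations is to keep the full distribution of coder judgements as a soft label and score predictions against it. The sketch below is purely illustrative and does not describe any particular Le-wi-Di submission.

import numpy as np

def soft_label(annotations, num_classes):
    # annotations: integer labels assigned by different coders to one item
    counts = np.bincount(annotations, minlength=num_classes)
    return counts / counts.sum()

def soft_cross_entropy(predicted_probs, target_dist, eps=1e-12):
    return -float(np.sum(target_dist * np.log(predicted_probs + eps)))

# Example: three coders label a binary item as 1, 1, 0.
target = soft_label([1, 1, 0], num_classes=2)           # -> [0.33, 0.67]
loss = soft_cross_entropy(np.array([0.4, 0.6]), target)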
@inproceedings{uma2021semeval,
author = {Alexandra Uma and Tommaso Fornaciari and Anca
Dumitrache and Tristan Miller and Jon Chamberlain and Barbara Plank and Edwin
Simpson and Massimo Poesio},
title = {{SemEval}-2021 {Task}~12: Learning with
Disagreements},
booktitle = {Proceedings of the 15th {International} {Workshop} on
{Semantic} {Evaluation} ({SemEval}-2021)},
pages = {338--347},
month = aug,
year = {2021},
isbn = {978-1-954085-70-1},
doi = {10.18653/v1/2021.semeval-1.41},
}
Dmitri Borgmann's rotas square articles.
Notes and Queries, 67(3):431–432, September 2020. ISSN 0029-3970. DOI: 10.1093/notesj/gjaa113.
In 1979 and 1980, Word Ways: The Journal of
Recreational Linguistics printed a series of articles on the early history,
religious symbolism, and cultural significance of the rotas square, an
ancient Latin-language palindromic word square. The articles were attributed
to Dmitri A. Borgmann, the noted American writer on wordplay and former
editor of Word Ways. While they attracted little attention at the time, some
35 years after their publication (and 29 years after Borgmann's death),
questions began to be raised about their authorship. There is much internal
and external evidence that, taken together, compellingly supports the notion
that Borgmann did not write the articles himself. This paper surveys this
evidence and solicits help in identifying the articles' original
source.
@article{miller2020dmitri,
author = {Tristan Miller},
title = {{Dmitri Borgmann's} Rotas Square
Articles},
journal = {Notes and Queries},
volume = {67},
number = {3},
pages = {431--432},
month = sep,
year = {2020},
issn = {0029-3970},
doi = {10.1093/notesj/gjaa113},
}
GPP, the generic preprocessor.
Journal of Open Source Software, 5(51), July 2020. ISSN 2475-9066. DOI: 10.21105/joss.02400.
In computer science, a preprocessor (or macro processor)
is a tool that programmatically alters its input, typically on the basis of
inline annotations, to produce data that serves as input for another program.
Preprocessors are used in software development and document processing
workflows to translate or extend programming or markup languages, as well as
for conditional or pattern-based generation of source code and text. Early
preprocessors were relatively simple string replacement tools that were tied
to specific programming languages and application domains, and while these
have since given rise to more powerful, general-purpose tools, these often
require the user to learn and use complex macro languages with their own
syntactic conventions. In this paper, we present GPP, an extensible,
general-purpose preprocessor whose principal advantage is that its syntax and
behaviour can be customized to suit any given preprocessing task. This makes
GPP of particular benefit to research applications, where it can be easily
adapted for use with novel markup, programming, and control languages.
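The toy Python sketch below illustrates what a preprocessor does in general, replacing inline annotations with programmatically generated text; it is a conceptual illustration only and does not reproduce GPP's actual syntax or feature set.

import re

MACROS = {"NAME": "GPP", "ROLE": "generic preprocessor"}  # hypothetical macro table

def preprocess(text, macros=MACROS):
    # Replace inline annotations of the form @macro{KEY} with their values.
    return re.sub(r"@macro\{(\w+)\}",
                  lambda m: macros.get(m.group(1), m.group(0)), text)

print(preprocess("@macro{NAME} is a @macro{ROLE}."))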
@article{miller2020gpp,
author = {Tristan Miller and Denis Auroux},
title = {{GPP}, the Generic Preprocessor},
journal = {Journal of Open Source Software},
volume = {5},
number = {51},
month = jul,
year = {2020},
issn = {2475-9066},
doi = {10.21105/joss.02400},
}
Don't shun the pun: On the requirements and constraints for preserving ambiguity in the (machine) translation of humour.
In Mehrdad Sabetzadeh, Andreas Vogelsang, Sallam Abualhaija, Markus Borg, Fabiano Dalpiaz, Maya Daneva, Nelly C. Fernández, Xavier Franch, Davide Fucci, Vincenzo Gervasi, Eduard Groen, Renata Guizzardi, Andrea Herrmann, Jennifer Horkoff, Luisa Mich, Anna Perini, and Angelo Susi, editors, Joint Proceedings of REFSQ-2020 Workshops, Doctoral Symposium, Live Studies Track, and Poster Track co-located with the 26th International Conference on Requirements Engineering: Foundation for Software Quality (REFSQ 2020), volume 2584 of CEUR Workshop Proceedings (ISSN 1613-0073), March 2020.
How do we know when a translation is good? This seemingly
simple question has long dogged human practitioners of translation, and has
arguably taken on even greater importance in today’s world of fully
automatic, end-to-end machine translation systems. Much of the difficulty in
assessing translation quality is that different translations of the same text
may be made for different purposes, each of which entails a unique set of
requirements and constraints. This difficulty is compounded by ambiguities in
the source text, which must be identified and then preserved or eliminated
according to the needs of the translation and the (apparent) intent of the
source text. In this talk, I survey the state of the art in linguistics,
computational linguistics, translation, and machine translation as it relates
to the notion of linguistic ambiguity in general, and intentional humorous
ambiguity in particular. I describe the various constraints and requirements
of different types of translations and provide examples of how various
automatic and interactive techniques from natural language processing can be
used to detect and then resolve or preserve linguistic ambiguities according
to these constraints and requirements. In the vein of the
“Translator’s Amanuensis” proposed by Martin Kay, I outline
some specific proposals concerning how the hitherto disparate work in the
aforementioned fields can be connected with a view to producing
“machine-in-the-loop” computer-assisted translation (CAT) tools
to assist human translators in selecting and implementing pun translation
strategies in furtherance of the translation requirements. Throughout the
talk, I will attempt to draw links with how this research relates to the
requirements engineering community.
@inproceedings{miller2020dont,
author = {Tristan Miller},
editor = {Mehrdad Sabetzadeh and Andreas Vogelsang and Sallam
Abualhaija and Markus Borg and Fabiano Dalpiaz and Maya Daneva and Nelly C.
Fernández and Xavier Franch and Davide Fucci and Vincenzo Gervasi and Eduard
Groen and Renata Guizzardi and Andrea Herrmann and Jennifer Horkoff and Luisa
Mich and Anna Perini and Angelo Susi},
title = {Don't Shun the Pun: {On} the Requirements and
Constraints for Preserving Ambiguity in the (Machine) Translation of
Humour},
booktitle = {Joint
Proceedings of REFSQ-2020 Workshops, Doctoral Symposium, Live Studies Track,
and Poster Track co-located with the 26th International Conference on
Requirements Engineering: Foundation for Software Quality (REFSQ
2020)},
volume = {2584},
series = {CEUR Workshop Proceedings},
month = mar,
year = {2020},
issn = {1613-0073},
}
Predicting the humorousness of tweets using Gaussian process preference learning.
Procesamiento del Lenguaje Natural, 64:37–44, March 2020. ISSN 1135-5948. DOI: 10.26342/2020-64-4.
Most humour processing systems to date
make at best discrete, coarse-grained distinctions between the comical and
the conventional, yet such notions are better conceptualized as a broad
spectrum. In this paper, we present a probabilistic approach, a variant of
Gaussian process preference learning (GPPL), that learns to rank and rate the
humorousness of short texts by exploiting human preference judgments and
automatically sourced linguistic annotations. We apply our system, which is
similar to one that had previously shown good performance on English-language
one-liners annotated with pairwise humorousness annotations, to the
Spanish-language data set of the HAHA@IberLEF2019 evaluation campaign. We
report system performance for the campaign's two subtasks, humour detection
and funniness score prediction, and discuss some issues arising from the
conversion between the numeric scores used in the HAHA@IberLEF2019 data and
the pairwise judgment annotations required for our method.
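The conversion issue mentioned above can be illustrated with a small sketch: numeric funniness scores are turned into pairwise preference judgements by comparing items whose scores differ by at least some margin. The margin and data below are invented for illustration; the actual conversion used for the HAHA@IberLEF2019 data may differ.

from itertools import combinations

def scores_to_pairs(scores, min_gap=0.5):
    # scores: dict mapping a text id to a numeric funniness score.
    # Returns (preferred, other) pairs where the score gap is large enough.
    pairs = []
    for a, b in combinations(scores, 2):
        if scores[a] - scores[b] >= min_gap:
            pairs.append((a, b))
        elif scores[b] - scores[a] >= min_gap:
            pairs.append((b, a))
    return pairs

print(scores_to_pairs({"t1": 3.2, "t2": 1.0, "t3": 2.9}))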
@article{miller2020predicting,
author = {Tristan Miller and Do Dinh, Erik-L{\^{a}}n and Edwin
Simpson and Iryna Gurevych},
title = {Predicting the Humorousness of Tweets Using
{Gaussian} Process Preference Learning},
journal = {Procesamiento del Lenguaje Natural},
volume = {64},
pages = {37--44},
month = mar,
year = {2020},
issn = {1135-5948},
doi = {10.26342/2020-64-4},
}
Reinhold Aman, 1936–2019.
Humor: International Journal of Humor Research, 32(1):1–5, February 2020. ISSN 0933-1719. DOI: 10.1515/humor-2019-0085.
@article{miller2020reinhold,
author = {Tristan Miller},
title = {Reinhold {Aman}, 1936--2019},
journal = {Humor: International Journal of Humor
Research},
volume = {32},
number = {1},
pages = {1--5},
month = feb,
year = {2020},
issn = {0933-1719},
doi = {10.1515/humor-2019-0085},
}
Reinhold Aman (1936–2019).
The LINGUIST List, 30.4729, December 2019.
@article{miller2019reinhold,
author = {Tristan Miller},
title = {Reinhold {Aman} (1936--2019)},
journal = {The {LINGUIST} List},
volume = {30.4729},
month = dec,
year = {2019},
}
The punster's amanuensis: The proper place of humans and machines in the translation of wordplay.
In Proceedings of the Second Workshop on Human-Informed Translation and Interpreting Technology (HiT-IT 2019), pages 57–64, September 2019. DOI: 10.26615/issn.2683-0078.2019_007.
The translation of wordplay is one of the most
extensively researched problems in translation studies, but it has attracted
little attention in the fields of natural language processing and machine
translation. This is because today's language technologies treat anomalies
and ambiguities in the input as things that must be resolved in favour of a
single “correct” interpretation, rather than preserved and
interpreted in their own right. But if computers cannot yet process such
creative language on their own, can they at least provide specialized support
to translation professionals? In this paper, I survey the state of the art
relevant to computational processing of humorous wordplay and put forth a
vision of how existing theories, resources, and technologies could be adapted
and extended to support interactive, computer-assisted translation.
@inproceedings{miller2019punsters,
author = {Tristan Miller},
title = {The Punster's Amanuensis: {The} Proper Place of
Humans and Machines in the Translation of Wordplay},
booktitle = {Proceedings of the {Second} {Workshop} on
{Human-Informed} {Translation} and {Interpreting} {Technology} ({HiT}-{IT}
2019)},
pages = {57--64},
month = sep,
year = {2019},
issn = {2683-0078},
doi = {10.26615/issn.2683-0078.2019_007},
}
OFAI–UKP at HAHA@IberLEF2019: Predicting the humorousness of tweets using Gaussian process preference learning.
In Miguel Ángel García Cumbreras, Julio Gonzalo, Eugenio Martínez Cámara, Raquel Martínez Unanue, Paolo Rosso, Jorge Carrillo de Albornoz, Soto Montalvo, Luis Chiruzzo, Sandra Collovini, Yoan Guitiérrez, Salud Jiménez Zafra, Martin Krallinger, Manuel Montes y Gómez, Reynier Ortega-Bueno, and Aiala Rosá, editors, Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2019), volume 2421 of CEUR Workshop Proceedings (ISSN 1613-0073), pages 180–190, August 2019.
Most humour processing systems to date make at best
discrete, coarse-grained distinctions between the comical and the
conventional, yet such notions are better conceptualized as a broad spectrum.
In this paper, we present a probabilistic approach, a variant of Gaussian
process preference learning (GPPL), that learns to rank and rate the
humorousness of short texts by exploiting human preference judgments and
automatically sourced linguistic annotations. We apply our system, which had
previously shown good performance on English-language one-liners annotated
with pairwise humorousness annotations, to the Spanish-language data set of
the HAHA@IberLEF2019 evaluation campaign. We report system performance for
the campaign's two subtasks, humour detection and funniness score prediction,
and discuss some issues arising from the conversion between the numeric
scores used in the HAHA@IberLEF2019 data and the pairwise judgment
annotations required for our method.
@inproceedings{miller2019ofaiukp,
author = {Tristan Miller and Do Dinh, Erik-L{\^{a}}n and Edwin
Simpson and Iryna Gurevych},
editor = {García Cumbreras, Miguel Ángel and Julio Gonzalo
and Martínez Cámara, Eugenio and Martínez Unanue, Raquel and Paolo Rosso
and Jorge Carrillo-de-Albornoz and Soto Montalvo and Luis Chiruzzo and Sandra
Collovini and Yoan Guitiérrez and Jiménez Zafra, Salud and Martin
Krallinger and Manuel Montes-y-Gómez and Reynier Ortega-Bueno and Aiala
Rosá},
title = {{OFAI}--{UKP} at {HAHA}@{IberLEF}2019: {Predicting}
the Humorousness of Tweets Using {Gaussian} Process Preference
Learning},
booktitle = {Proceedings of the {Iberian} {Languages} {Evaluation}
{Forum} ({IberLEF} 2019)},
volume = {2421},
pages = {180--190},
series = {CEUR Workshop Proceedings},
month = aug,
year = {2019},
issn = {1613-0073},
}
Predicting humorousness and metaphor novelty with Gaussian process preference learning.
In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019), pages 5716–5728, July 2019. ISBN 978-1-950737-48-2. DOI: 10.18653/v1/P19-1572.
The inability to quantify key aspects
of creative language is a frequent obstacle to natural language
understanding. To address this, we introduce novel tasks for evaluating the
creativeness of language—namely, scoring and ranking text by
humorousness and metaphor novelty. To sidestep the difficulty of assigning
discrete labels or numeric scores, we learn from pairwise comparisons between
texts. We introduce a Bayesian approach for predicting humorousness and
metaphor novelty using Gaussian process preference learning (GPPL),
which achieves a Spearman's ρ of 0.56 against gold using word
embeddings and linguistic features. Our experiments show that given sparse,
crowdsourced annotation data, ranking using GPPL outperforms best–worst
scaling. We release a new dataset for evaluating humor containing 28,210
pairwise comparisons of 4,030 texts, and make our software freely
available.
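For reference, a rank correlation of the kind reported above can be computed as follows; the numbers here are toy values, not the paper's data.

from scipy.stats import spearmanr

gold      = [2.1, 0.4, 3.3, 1.7, 2.8]
predicted = [1.9, 0.7, 3.0, 1.2, 2.5]
rho, p_value = spearmanr(gold, predicted)
print(f"Spearman's rho = {rho:.2f}")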
@inproceedings{simpson2019predicting,
author = {Edwin Simpson and Do Dinh, Erik-L{\^{a}}n and Tristan
Miller and Iryna Gurevych},
title = {Predicting Humorousness and Metaphor Novelty with
{Gaussian} Process Preference Learning},
booktitle = {Proceedings of the 57th {Annual} {Meeting} of the
{Association} for {Computational} {Linguistics} ({ACL} 2019)},
pages = {5716--5728},
month = jul,
year = {2019},
isbn = {978-1-950737-48-2},
doi = {10.18653/v1/P19-1572},
}
Detecting humorous images by caption analysis.
In Proceedings of the 2019 Conference of the International Society for Humor Studies, June 2019.
The automatic recognition of verbal humour has
become an established work area in natural language processing (NLP), but the
detection of humour in visual media is still in its infancy. In this paper,
we describe and evaluate NLP methods for detecting humorous images by
analyzing descriptive captions. We present a data set of 40 scenes manually
annotated with English-language captions and funniness scores, as well as
various knowledge-based and data-driven methods that use the captions alone
to predict the funniness of the associated scene. Our knowledge-based
methods, inspired by (verbal) humour-theoretic notions of incongruity and
surprise, use semantic frames, selectional preferences for verb dependencies,
and/or n-gram frequencies, while our data-driven methods include bag-of-words
models and pre-trained word embeddings used as features in various machine
learning classifiers: naïve Bayes, support vector machine (SVM), random
forest, and a multilayer perceptron. On our data, the bag-of-words model with
an SVM achieves the best classification performance, approximating the human
upper bound. Our analysis of false negatives indicates that the element of
incongruity is absent, or at least not obvious, in many funny scenes or their
descriptive captions.
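A baseline in the spirit of the bag-of-words-plus-SVM system described above can be sketched in a few lines of scikit-learn; the captions and labels here are invented placeholders rather than items from the 40-scene data set.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

captions = ["a dog wearing sunglasses drives a car",
            "a man waits at a bus stop",
            "a cat in a tiny business suit chairs a meeting",
            "a woman reads a book in a park"]
funny = [1, 0, 1, 0]  # 1 = funny scene, 0 = conventional

clf = make_pipeline(CountVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(captions, funny)
print(clf.predict(["a hamster pilots a passenger plane"]))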
@inproceedings{miller2019detecting,
author = {Tristan Miller and Malou Ockenfels and Yevgeny
Puzikov},
title = {Detecting Humorous Images by Caption
Analysis},
booktitle = {Proceedings of the 2019 Conference of the
International Society for Humor Studies},
month = jun,
year = {2019},
}
A Bayesian approach for predicting the humorousness of one-liners.
In Proceedings of the 2019 Conference of the International Society for Humor Studies, June 2019.
Humour is an essential aspect of human communication
that computational methods have yet to master. Most natural language
processing systems to date make at best discrete, coarse-grained distinctions
between the comical and the conventional, yet such notions are better
conceptualized as a broad spectrum. We therefore introduce the novel task of
automatically quantifying and ranking short texts by humorousness, and
present a probabilistic approach that learns to do this by examining human
preference judgments. We evaluate our system on a crowdsourced data set of
nearly 30,000 pairwise comparisons of over 4000 one-liners. We find that it
correlates well with best–worst scaling (BWS) when pairwise labels are
abundant, and outperforms BWS when they are sparse. And unlike BWS, because
our method exploits word embeddings and shallow text features, it can make
accurate predictions even for previously unseen texts.
@inproceedings{miller2019bayesian,
author = {Tristan Miller and Edwin Simpson and Erik-Lân {Do
Dinh} and Iryna Gurevych},
title = {A {Bayesian} Approach for Predicting the Humorousness
of One-liners},
booktitle = {Proceedings of the 2019 Conference of the
International Society for Humor Studies},
month = jun,
year = {2019},
}
A streamlined method for sourcing discourse-level argumentation annotations from the crowd.
In Proceedings of the 17th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2019), volume 1, pages 1790–1796, June 2019. ISBN 978-1-950737-13-0. DOI: 10.18653/v1/N19-1177.
The study of argumentation and the
development of argument mining tools depend on the availability of annotated
data, which is challenging to obtain in sufficient quantity and quality. We
present a method that breaks down a popular but relatively complex
discourse-level argument annotation scheme into a simpler, iterative
procedure that can be applied even by untrained annotators. We apply this
method in a crowdsourcing setup and report on the reliability of the
annotations obtained. The source code for a tool implementing our annotation
method, as well as the sample data we obtained (4909 gold-standard
annotations across 982 documents), are freely released to the research
community. These are intended to serve the needs of qualitative research into
argumentation, as well as of data-driven approaches to argument mining.
@inproceedings{miller2019streamlined,
author = {Tristan Miller and Maria Sukhareva and Iryna
Gurevych},
title = {A Streamlined Method for Sourcing Discourse-level
Argumentation Annotations from the Crowd},
booktitle = {Proceedings of the 17th {Annual} {Conference} of the
{North} {American} {Chapter} of the {Association} for {Computational}
{Linguistics}: Human Language Technologies ({NAACL}-{HLT}
2019)},
volume = {1},
pages = {1790--1796},
month = jun,
year = {2019},
isbn = {978-1-950737-13-0},
doi = {10.18653/v1/N19-1177},
}
Cross-topic argument mining from heterogeneous sources.
In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP 2018), pages 3664–3674, October 2018. ISBN 978-1-948087-84-1. DOI: 10.18653/v1/D18-1402.
Argument mining is a core technology
for automating argument search in large document collections. Despite its
usefulness for this task, most current approaches are designed for use only
with specific text types and fall short when applied to heterogeneous texts.
In this paper, we propose a new sentential annotation scheme that is reliably
applicable by crowd workers to arbitrary Web texts. We source annotations for
over 25,000 instances covering eight controversial topics. We show that
integrating topic information into bidirectional long short-term memory
networks outperforms vanilla BiLSTMs by more than 3 percentage points in
F1 in two- and three-label cross-topic settings. We also show that these
results can be further improved by leveraging additional data for topic
relevance using multi-task learning.
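The idea of integrating topic information into a BiLSTM can be sketched as follows (PyTorch); this is a minimal illustration under simplifying assumptions, not the exact architecture evaluated in the paper.

import torch
import torch.nn as nn

class TopicAwareBiLSTM(nn.Module):
    def __init__(self, vocab_size, emb_dim=100, hidden=128, num_labels=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.classify = nn.Linear(4 * hidden, num_labels)

    def forward(self, sentence_ids, topic_ids):
        # Encode sentence and topic with the same BiLSTM and use the final
        # forward/backward hidden states as fixed-size representations.
        _, (s_h, _) = self.encoder(self.embed(sentence_ids))
        _, (t_h, _) = self.encoder(self.embed(topic_ids))
        s_vec = torch.cat([s_h[0], s_h[1]], dim=-1)   # (batch, 2 * hidden)
        t_vec = torch.cat([t_h[0], t_h[1]], dim=-1)
        return self.classify(torch.cat([s_vec, t_vec], dim=-1))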
@inproceedings{stab2018bcross-topic,
author = {Christian Stab and Tristan Miller and Benjamin
Schiller and Pranav Rai and Iryna Gurevych},
title = {Cross-topic Argument Mining from Heterogeneous
Sources},
booktitle = {Proceedings of the 2018 {Conference} on {Empirical}
{Methods} in {Natural} {Language} {Processing} ({EMNLP} 2018)},
pages = {3664--3674},
month = oct,
year = {2018},
isbn = {978-1-948087-84-1},
doi = {10.18653/v1/D18-1402},
}
ArgumenText: Searching for arguments in heterogeneous sources.
In Proceedings of the 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Demonstrations (NAACL-HLT 2018), pages 21–25, June 2018. ISBN 978-1-948087-28-5. DOI: 10.18653/v1/N18-5005.
Argument mining is a core technology for enabling
argument search in large corpora. However, most current approaches fall short
when applied to heterogeneous texts. In this paper, we present an argument
retrieval system capable of retrieving sentential arguments for any given
controversial topic. By analyzing the highest-ranked results extracted from
Web sources, we found that our system covers 89% of arguments found in
expert-curated lists of arguments from an online debate portal, and also
identifies additional valid arguments.
@inproceedings{stab2018argumentext,
author = {Christian Stab and Johannes Daxenberger and Chris
Stahlhut and Tristan Miller and Benjamin Schiller and Christopher Tauchmann
and Steffen Eger and Iryna Gurevych},
title = {{ArgumenText}: Searching for Arguments in
Heterogeneous Sources},
booktitle = {Proceedings of the 16th {Annual} {Conference} of the
{North} {American} {Chapter} of the {Association} for {Computational}
{Linguistics}: Human Language Technologies: Demonstrations ({NAACL}-{HLT}
2018)},
pages = {21--25},
month = jun,
year = {2018},
isbn = {978-1-948087-28-5},
doi = {10.18653/v1/N18-5005},
}
Cross-topic argument mining from heterogeneous sources using attention-based neural networks.
ArXiv e-prints, 1802.05758, February 2018.
Argument mining is a core technology for automating
argument search in large document collections. Despite its usefulness for
this task, most current approaches to argument mining are designed for use
only with specific text types and fall short when applied to heterogeneous
texts. In this paper, we propose a new sentential annotation scheme that is
reliably applicable by crowd workers to arbitrary Web texts. We source
annotations for over 25,000 instances covering eight controversial topics.
The results of cross-topic experiments show that our attention-based neural
network generalizes best to unseen topics and outperforms vanilla BiLSTM
models by 6% in accuracy and 11% in F-score.
@article{stab2018cross-topic,
author = {Christian Stab and Tristan Miller and Iryna
Gurevych},
title = {Cross-topic Argument Mining from Heterogeneous
Sources Using Attention-based Neural Networks},
journal = {{ArXiv} e-prints},
volume = {1802.05758},
month = feb,
year = {2018},
}
SemEval-2017 Task 7: Detection and interpretation of English puns.
In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 58–68, August 2017. ISBN 978-1-945626-55-5. DOI: 10.18653/v1/S17-2005.
A pun is a form of wordplay in which a word suggests
two or more meanings by exploiting polysemy, homonymy, or phonological
similarity to another word, for an intended humorous or rhetorical effect.
Though a recurrent and expected feature in many discourse types, puns stymie
traditional approaches to computational lexical semantics because they
violate their one-sense-per-context assumption. This paper describes the
first competitive evaluation for the automatic detection, location, and
interpretation of puns. We describe the motivation for these tasks, the
evaluation methods, and the manually annotated data set. Finally, we present
an overview and discussion of the participating systems' methodologies,
resources, and results.
@inproceedings{miller2017semeval,
author = {Tristan Miller and Christian F. Hempelmann and Iryna
Gurevych},
title = {{SemEval}-2017 {Task}~7: {Detection} and
Interpretation of {English} Puns},
booktitle = {Proceedings of the 11th {International} {Workshop} on
{Semantic} {Evaluation} ({SemEval}-2017)},
pages = {58--68},
month = aug,
year = {2017},
isbn = {978-1-945626-55-5},
doi = {10.18653/v1/S17-2005},
}
Puns: Taxonomy and phonology.
In Salvatore Attardo, editor, The Routledge Handbook of Language and Humor, Routledge Handbooks in Linguistics, pages 95–108. Routledge, New York, NY, February 2017. ISBN 978-1-138-84306-6. DOI: 10.4324/9781315731162-8.
@incollection{hempelmann2017taxonomy,
author = {Christian F. Hempelmann and Tristan
Miller},
editor = {Salvatore Attardo},
title = {Puns: Taxonomy and Phonology},
booktitle = {The
{Routledge} Handbook of Language and Humor},
pages = {95--108},
series = {Routledge Handbooks in Linguistics},
month = feb,
year = {2017},
publisher = {Routledge},
address = {New York, NY},
isbn = {978-1-138-84306-6},
doi = {10.4324/9781315731162-8},
}
CNN- and LSTM-based claim classification in online user comments.
In Proceedings of the 26th International Conference on Computational Linguistics: Technical Papers (COLING 2016), pages 2740–2751, December 2016. ISBN 978-4-87974-702-0.
When processing arguments in online user interactive
discourse, it is often necessary to determine their bases of support. In this
paper, we describe a supervised approach, based on deep neural networks, for
classifying the claims made in online arguments. We conduct experiments using
convolutional neural networks (CNNs) and long short-term memory networks
(LSTMs) on two claim data sets compiled from online user comments. Using
different types of distributional word embeddings, but without incorporating
any rich, expensive set of features, we achieve a significant improvement
over the state of the art for one data set (which categorizes arguments as
factual vs. emotional), and performance comparable to the state of the art on
the other data set (which categorizes claims according to their
verifiability). Our approach has the advantages of using a generalized,
simple, and effective methodology that works for claim categorization on
different data sets and tasks.
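A minimal sketch of the CNN variant described above (illustrative only; the filter sizes, dimensions, and pooling choices are assumptions, and in practice pre-trained word embeddings would be loaded into the embedding layer rather than learned from scratch):
import torch
import torch.nn as nn
import torch.nn.functional as F

class CNNClaimClassifier(nn.Module):
    """Hypothetical convolutional classifier over word embeddings."""
    def __init__(self, vocab_size, num_labels, emb_dim=300,
                 num_filters=100, kernel_sizes=(3, 4, 5)):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, num_filters, k) for k in kernel_sizes])
        self.out = nn.Linear(num_filters * len(kernel_sizes), num_labels)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len)
        x = self.emb(token_ids).transpose(1, 2)      # (batch, emb_dim, seq_len)
        pooled = [F.relu(conv(x)).max(dim=2).values  # max-over-time pooling
                  for conv in self.convs]
        return self.out(torch.cat(pooled, dim=1))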
@inproceedings{guggilla2016cnn,
author = {Chinnappa Guggilla and Tristan Miller and Iryna
Gurevych},
title = {{CNN}- and {LSTM}-based Claim Classification in
Online User Comments},
booktitle = {Proceedings of the 26th {International} {Conference}
on {Computational} {Linguistics}: Technical Papers ({COLING}
2016)},
pages = {2740--2751},
month = dec,
year = {2016},
isbn = {978-4-87974-702-0},
}
Sense-annotating a lexical substitution data set with Ubyline.
In Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Asunción Moreno, Jan Odijk, and Stelios Piperidis, editors, Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016), pages 828–835. European Language Resources Association, May 2016. ISBN 978-2-9517408-9-1.
We describe the construction of
GLASS, a newly sense-annotated version of the German lexical substitution
data set used at the GermEval 2015: LexSub shared task. Using the two
annotation layers, we conduct the first known empirical study of the
relationship between manually applied word senses and lexical substitutions.
We find that synonymy and hypernymy/hyponymy are the only semantic relations
directly linking targets to their substitutes, and that substitutes in the
target's hypernymy/hyponymy taxonomy closely align with the synonyms of a
single GermaNet synset. Despite this, these substitutes account for a
minority of those provided by the annotators. The results of our analysis
accord with those of a previous study on English-language data (albeit with
automatically induced word senses), leading us to suspect that the
sense–substitution relations we discovered may be of a universal
nature. We also tentatively conclude that relatively cheap lexical
substitution annotations can be used as a knowledge source for automatic WSD.
Also introduced in this paper is Ubyline, the web application used to produce
the sense annotations. Ubyline presents an intuitive user interface optimized
for annotating lexical sample data, and is readily adaptable to sense
inventories other than GermaNet.
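The relation analysis can be approximated against the English WordNet via NLTK (a rough analogue only; the paper works with GermaNet, whose API differs, and the helper below is a hypothetical simplification):
from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')

def substitution_relation(target_synset, substitute_lemma):
    """Classify a lexical substitute as synonym, hypernym/hyponym,
    or other with respect to a given sense of the target word."""
    if substitute_lemma.lower() in {l.lower() for l in target_synset.lemma_names()}:
        return "synonym"
    for related in target_synset.hypernyms() + target_synset.hyponyms():
        if substitute_lemma.lower() in {l.lower() for l in related.lemma_names()}:
            return "hypernym/hyponym"
    return "other"

# Example: "automobile" is a synonym of the first sense of "car".
print(substitution_relation(wn.synsets("car")[0], "automobile"))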
@inproceedings{miller2016sense-annotating,
author = {Tristan Miller and Mohamed Khemakhem and Eckart de
Castilho, Richard and Iryna Gurevych},
editor = {Nicoletta Calzolari and Khalid Choukri and Thierry
Declerck and Marko Grobelnik and Bente Maegaard and Joseph Mariani and
Asunci{\'{o}}n Moreno and Jan Odijk and Stelios Piperidis},
title = {Sense-annotating a Lexical Substitution Data Set with
{Ubyline}},
booktitle = {Proceedings of the 10th {International} {Conference}
on {Language} {Resources} and {Evaluation} ({LREC} 2016)},
pages = {828--835},
month = may,
year = {2016},
publisher = {European
Language Resources Association},
isbn = {978-2-9517408-9-1},
}
Adjusting sense representations
for word sense disambiguation and automatic pun interpretation.
Dr.-Ing. thesis, Department of Computer Science, Technische Universität
Darmstadt, April 2016.
@phdthesis{miller2016adjusting,
author = {Tristan Miller},
title = {Adjusting Sense Representations for Word Sense
Disambiguation and Automatic Pun Interpretation},
type = {{Dr.-Ing.}\ thesis},
month = apr,
year = {2016},
school = {Department of Computer Science, Technische
Universit{\"{a}}t Darmstadt},
}
Towards the automatic detection and identification of English puns.
European Journal of Humour Research, 4(1):59–75, January 2016. ISSN 2307-700X. DOI: 10.7592/EJHR2016.4.1.miller.
Lexical polysemy, a fundamental characteristic of all
human languages, has long been regarded as a major challenge to machine
translation, human–computer interaction, and other applications of
computational natural language processing (NLP). Traditional approaches to
automatic word sense disambiguation (WSD) rest on the assumption that there
exists a single, unambiguous communicative intention underlying every word in
a document. However, writers sometimes intend for a word to be interpreted as
simultaneously carrying multiple distinct meanings. This deliberate use of
lexical ambiguity — i.e., punning — is a particularly
common source of humour, and therefore has important implications for how NLP
systems process documents and interact with users. In this paper we make a
case for research into computational methods for the detection of puns in
running text and for the isolation of the intended meanings. We discuss the
challenges involved in adapting principles and techniques from WSD to
humorously ambiguous text, and outline our plans for evaluating WSD-inspired
systems in a dedicated pun identification task. We describe the compilation
of a large manually annotated corpus of puns and present an analysis of its
properties. While our work is principally concerned with simple puns which
are monolexemic and homographic (i.e., exploiting single words which have
different meanings but are spelled identically), we touch on the challenges
involved in processing other types.
@article{miller2016towards,
author = {Tristan Miller and Mladen
Turkovi{\'{c}}},
title = {Towards the Automatic Detection and Identification of
{English} Puns},
journal = {European Journal of Humour Research},
volume = {4},
number = {1},
pages = {59--75},
month = jan,
year = {2016},
issn = {2307-700X},
doi = {10.7592/EJHR2016.4.1.miller},
}
GermEval 2015: LexSub – A shared task for German-language lexical substitution.
In Proceedings of GermEval 2015: LexSub, pages 1–9, September 2015.
Lexical substitution is a task in which participants
are given a word in a short context and asked to provide a list of synonyms
appropriate for that context. This paper describes GermEval 2015: LexSub, the
first shared task for automated lexical substitution on German-language text.
We describe the motivation for this task, the evaluation methods, and the
manually annotated data set used to train and test the participating systems.
Finally, we present an overview and discussion of the participating systems'
methodologies, resources, and results.
@inproceedings{miller2015germeval,
author = {Tristan Miller and Darina Benikova and Sallam
Abualhaija},
title = {{GermEval} 2015: {LexSub}~-- {A} Shared Task for
{German}-language Lexical Substitution},
booktitle = {Proceedings of {GermEval} 2015:
{LexSub}},
pages = {1--9},
month = sep,
year = {2015},
}
Automatic disambiguation of English puns.
In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (ACL–IJCNLP 2015), volume 1, pages 719–729, July 2015. ISBN 978-1-941643-72-3. DOI: 10.3115/v1/P15-1070.
Traditional approaches to word sense disambiguation
(WSD) rest on the assumption that there exists a single, unambiguous
communicative intention underlying every word in a document. However, writers
sometimes intend for a word to be interpreted as simultaneously carrying
multiple distinct meanings. This deliberate use of lexical
ambiguity—i.e., punning—is a particularly common source of
humour. In this paper we describe how traditional, language-agnostic WSD
approaches can be adapted to “disambiguate” puns, or rather to
identify their double meanings. We evaluate several such approaches on a
manually sense-annotated corpus of English puns and observe performance
exceeding that of some knowledge-based and supervised baselines.
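One way to picture the adaptation: a standard overlap-based (Lesk-style) scorer is run as usual, but instead of committing to a single best sense it returns the two top-scoring senses as the candidate pair of pun meanings. The sketch below uses NLTK's English WordNet and is an illustrative simplification, not one of the evaluated systems.
from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')

def top_two_senses(pun_word, context_words):
    """Score each WordNet sense of pun_word by gloss/context overlap and
    return the two best senses, i.e. a candidate pair of pun meanings."""
    context = {w.lower() for w in context_words}
    scored = []
    for synset in wn.synsets(pun_word):
        gloss = set(synset.definition().lower().split())
        for example in synset.examples():
            gloss |= set(example.lower().split())
        scored.append((len(gloss & context), synset))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [synset for _, synset in scored[:2]]

print(top_two_senses("interest",
                     "the banker lost interest in his work".split()))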
@inproceedings{miller2015automatic,
author = {Tristan Miller and Iryna Gurevych},
title = {Automatic Disambiguation of {English}
Puns},
booktitle = {Proceedings of the 53rd {Annual} {Meeting} of the
{Association} for {Computational} {Linguistics} and the 7th {International}
{Joint} {Conference} on {Natural} {Language} {Processing} ({ACL}--{IJCNLP}
2015)},
volume = {1},
pages = {719--729},
month = jul,
year = {2015},
isbn = {978-1-941643-72-3},
doi = {10.3115/v1/P15-1070},
}
A255436: Number of distinct, connected, order-n subgraphs of the infinite knight's graph.
In The On-line Encyclopedia of Integer Sequences. February 2015.
We present an integer sequence $a(n)$ corresponding to the
number of distinct graphs of order $n$ where the vertices can be mapped to
different squares of a chessboard such that the connected pairs of vertices
are a knight's move apart.
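The underlying definitions can be made concrete with a small sketch (an illustration only; it tests knight's-move adjacency and connectedness of a placement, and does not attempt the enumeration and canonicalisation needed to compute the sequence itself):
from collections import deque

def knight_adjacent(a, b):
    """True if squares a and b (given as (file, rank) pairs) are a knight's move apart."""
    dx, dy = abs(a[0] - b[0]), abs(a[1] - b[1])
    return {dx, dy} == {1, 2}

def is_connected_placement(squares):
    """Check that the given squares induce a connected subgraph
    of the (infinite) knight's graph."""
    squares = list(squares)
    seen, queue = {squares[0]}, deque([squares[0]])
    while queue:
        current = queue.popleft()
        for other in squares:
            if other not in seen and knight_adjacent(current, other):
                seen.add(other)
                queue.append(other)
    return len(seen) == len(squares)

# Three squares forming a connected path of knight's moves:
print(is_connected_placement([(0, 0), (1, 2), (2, 4)]))   # True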
@incollection{A255436,
author = {Tristan Miller},
title = {A255436: Number of Distinct, Connected, Order-n
Subgraphs of the Infinite Knight's Graph},
booktitle = {The
On-line Encyclopedia of Integer Sequences},
month = feb,
year = {2015},
}
An analysis of ambiguity in English puns.
In International Humour Symposium [of the 4th Hungarian Interdisciplinary Humour Conference]: Programme and Abstracts, Komárno, Slovakia, November 2014. J. Selye University, Faculty of Education, Department of Modern Philology.
Punning is a common source of verbal humour in which
a word is used to evoke two or more distinct meanings simultaneously. The
present work describes and analyzes a large corpus of English homographic
puns manually annotated with senses from WordNet. We discuss the challenges
in developing and applying the annotation scheme, introduce our annotation
support tools, and present an analysis of selected morphological, syntactic,
and semantic properties of the annotated examples. Particular focus is placed
on the implications for computational approaches to detection of puns and
identification of their opposing meanings.
@inproceedings{miller2014analysis,
author = {Tristan Miller},
title = {An Analysis of Ambiguity in {English}
Puns},
booktitle = {International Humour Symposium [of the 4th Hungarian
Interdisciplinary Humour Conference]: Programme and Abstracts},
month = nov,
year = {2014},
publisher = {J. Selye
University, Faculty of Education, Department of Modern
Philology},
address = {Kom{\'{a}}rno, Slovakia},
}
A language-independent sense clustering approach for enhanced WSD.
In Josef Ruppenhofer and Gertrud Faaß, editors, Proceedings of the 12th Konferenz zur Verarbeitung natürlicher Sprache (KONVENS 2014), pages 11–21. Universitätsverlag Hildesheim, October 2014. ISBN 978-3-934105-46-1.
We present a method for clustering
word senses of a lexical-semantic resource by mapping them to those of
another sense inventory. This is a promising way of reducing polysemy in
sense inventories and consequently improving word sense disambiguation
performance. In contrast to previous approaches, we use Dijkstra-WSA, a
parameterizable alignment algorithm which is largely resource- and
language-agnostic. To demonstrate this, we apply our technique to GermaNet,
the German equivalent to WordNet. The GermaNet sense clusterings we induce
through alignments to various collaboratively constructed resources achieve a
significant boost in accuracy, even though our method is far less complex and
less dependent on language-specific knowledge than past approaches.
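The clustering step can be pictured with a hedged sketch (this is not Dijkstra-WSA, which computes the underlying alignment; the data structures and identifiers below are assumptions): senses of a lemma that align to the same sense in the other resource are merged into one cluster.
from collections import defaultdict

def cluster_senses(lemma_senses, alignment):
    """Group a lemma's senses by the target sense they are aligned to.

    lemma_senses: iterable of sense IDs in the source resource
    alignment:    dict mapping source sense ID -> target sense ID (or None)
    """
    clusters = defaultdict(set)
    for sense in lemma_senses:
        target = alignment.get(sense)
        # unaligned senses stay in singleton clusters keyed by themselves
        clusters[target if target is not None else sense].add(sense)
    return list(clusters.values())

# Hypothetical example: two source senses mapped to one target sense.
alignment = {"bank-1": "wikt:bank#institution",
             "bank-2": "wikt:bank#institution",
             "bank-3": "wikt:bank#riverside"}
print(cluster_senses(["bank-1", "bank-2", "bank-3"], alignment))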
@inproceedings{matuschek2014language,
author = {Michael Matuschek and Tristan Miller and Iryna
Gurevych},
editor = {Josef Ruppenhofer and Gertrud Faa{\ss}},
title = {A Language-independent Sense Clustering Approach for
Enhanced {WSD}},
booktitle = {Proceedings of the 12th {Konferenz} zur
{Verarbeitung} {nat{\"{u}}rlicher} {Sprache} ({KONVENS} 2014)},
pages = {11--21},
month = oct,
year = {2014},
publisher = {Universit{\"{a}}tsverlag Hildesheim},
isbn = {978-3-934105-46-1},
}
WordNet–Wikipedia–Wiktionary: Construction of a three-way alignment.
In Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asunción Moreno, Jan Odijk, and Stelios Piperidis, editors, Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC 2014), pages 2094–2100. European Language Resources Association, May 2014. ISBN 978-2-9517408-8-4.
The coverage and quality of conceptual information
contained in lexical semantic resources is crucial for many tasks in natural
language processing. Automatic alignment of complementary resources is one
way of improving this coverage and quality; however, past attempts have
always been between pairs of specific resources. In this paper we establish
some set-theoretic conventions for describing concepts and their alignments,
and use them to describe a method for automatically constructing $n$-way
alignments from arbitrary pairwise alignments. We apply this technique to the
production of a three-way alignment from previously published
WordNet–Wikipedia and WordNet–Wiktionary alignments. We then
present a quantitative and informal qualitative analysis of the aligned
resource. The three-way alignment was found to have greater coverage, an
enriched sense representation, and coarser sense granularity than both the
original resources and their pairwise alignments, though this came at the
cost of accuracy. An evaluation of the induced word sense clusters in a word
sense disambiguation task showed that they were no better than random
clusters of equivalent granularity. However, use of the alignments to enrich
a sense inventory with additional sense glosses did significantly improve the
performance of a baseline knowledge-based WSD algorithm.
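The join underlying the n-way construction can be sketched as follows (a simplification of the set-theoretic formulation in the paper; the sense identifiers are hypothetical):
def three_way_alignment(wn_wp, wn_wikt):
    """Join WordNet-Wikipedia and WordNet-Wiktionary pairwise alignments
    (each a set of (wordnet_id, other_id) pairs) on the WordNet sense."""
    wp_by_wn, wikt_by_wn = {}, {}
    for wn_id, wp_id in wn_wp:
        wp_by_wn.setdefault(wn_id, set()).add(wp_id)
    for wn_id, wikt_id in wn_wikt:
        wikt_by_wn.setdefault(wn_id, set()).add(wikt_id)
    # A three-way concept exists wherever a WordNet sense occurs in both alignments.
    return {wn_id: (wp_by_wn[wn_id], wikt_by_wn[wn_id])
            for wn_id in wp_by_wn.keys() & wikt_by_wn.keys()}

# Hypothetical sense identifiers:
print(three_way_alignment({("wn:dog.n.01", "wp:Dog")},
                          {("wn:dog.n.01", "wikt:dog#1")}))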
@inproceedings{miller2014wordnet,
author = {Tristan Miller and Iryna Gurevych},
editor = {Nicoletta Calzolari and Khalid Choukri and Thierry
Declerck and Hrafn Loftsson and Bente Maegaard and Joseph Mariani and
Asunci{\'{o}}n Moreno and Jan Odijk and Stelios Piperidis},
title = {{WordNet}--{Wikipedia}--{Wiktionary}: Construction of
a Three-way Alignment},
booktitle = {Proceedings of the 9th {International} {Conference}
on {Language} {Resources} and {Evaluation} ({LREC} 2014)},
pages = {2094--2100},
month = may,
year = {2014},
publisher = {European
Language Resources Association},
isbn = {978-2-9517408-8-4},
}
DKPro WSD: A generalized UIMA-based framework for word sense disambiguation.
In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (System Demonstrations) (ACL 2013), pages 37–42, August 2013.
Implementations of word sense disambiguation (WSD)
algorithms tend to be tied to a particular test corpus format and sense
inventory. This makes it difficult to test their performance on new data
sets, or to compare them against past algorithms implemented for different
data sets. In this paper we present DKPro WSD, a freely licensed,
general-purpose framework for WSD which is both modular and extensible. DKPro
WSD abstracts the WSD process in such a way that test corpora, sense
inventories, and algorithms can be freely swapped. Its UIMA-based
architecture makes it easy to add support for new resources and algorithms.
Related tasks such as word sense induction and entity linking are also
supported.
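DKPro WSD itself is a Java/UIMA framework, so the sketch below is only a Python illustration of the swappable-components idea, not the actual API: the corpus, the sense inventory, and the disambiguation algorithm are independent pieces that can be exchanged without touching the others.
class SenseInventory:
    """Interface: map a lemma to its candidate sense IDs."""
    def senses(self, lemma):
        raise NotImplementedError

class WSDAlgorithm:
    """Interface: choose one of the candidate senses for a word in context."""
    def disambiguate(self, lemma, context, inventory):
        raise NotImplementedError

def run_pipeline(corpus, inventory, algorithm):
    """Corpus, sense inventory, and algorithm are independent components,
    so any of them can be swapped without changing the others."""
    return [(lemma, algorithm.disambiguate(lemma, context, inventory))
            for lemma, context in corpus]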
@inproceedings{miller2013dkpro,
author = {Tristan Miller and Nicolai Erbs and Hans-Peter Zorn
and Torsten Zesch and Iryna Gurevych},
title = {{DKPro} {WSD}: {A} Generalized {UIMA}-based Framework
for Word Sense Disambiguation},
booktitle = {Proceedings of the 51st {Annual} {Meeting} of the
{Association} for {Computational} {Linguistics} (System Demonstrations)
({ACL} 2013)},
pages = {37--42},
month = aug,
year = {2013},
}
Using distributional similarity for lexical expansion in knowledge-based word sense disambiguation.
In Martin Kay and Christian Boitet, editors, Proceedings of the 24th International Conference on Computational Linguistics (COLING 2012), pages 1781–1796, December 2012.
We explore the contribution of distributional
information for purely knowledge-based word sense disambiguation.
Specifically, we use a distributional thesaurus, computed from a large parsed
corpus, for lexical expansion of context and sense information. This bridges
the lexical gap that is seen as the major obstacle for word overlap–based
approaches. We apply this mechanism to two traditional knowledge-based methods
and show that distributional information significantly improves
disambiguation results across several data sets. This improvement exceeds the
state of the art for disambiguation without sense frequency information—a
situation which is especially encountered with new domains or languages for
which no sense-annotated corpus is available.
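The expansion mechanism can be sketched roughly as follows (an illustrative simplification; the distributional thesaurus is represented as a plain dictionary with hypothetical neighbours):
def expand(words, thesaurus, top_n=3):
    """Add the top distributional neighbours of each word to the bag of words."""
    expanded = set(words)
    for word in words:
        expanded.update(thesaurus.get(word, [])[:top_n])
    return expanded

def overlap_score(context_words, gloss_words, thesaurus):
    """Lesk-style overlap computed on the expanded bags of words."""
    return len(expand(context_words, thesaurus) & expand(gloss_words, thesaurus))

# Hypothetical thesaurus entries:
thesaurus = {"money": ["cash", "funds", "currency"],
             "deposit": ["payment", "cash", "instalment"]}
print(overlap_score({"deposit", "cheque"}, {"money", "account"}, thesaurus))
Without the expansion the two bags of words in this example share no terms; with it they overlap, which is the lexical gap being bridged.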
@inproceedings{miller2012using,
author = {Tristan Miller and Chris Biemann and Torsten Zesch
and Iryna Gurevych},
editor = {Martin Kay and Christian Boitet},
title = {Using Distributional Similarity for Lexical Expansion
in Knowledge-based Word Sense Disambiguation},
booktitle = {Proceedings of the 24th {International} {Conference}
on {Computational} {Linguistics} ({COLING} 2012)},
pages = {1781--1796},
month = dec,
year = {2012},
}
Exploiting latent semantic relations in highly linked hypertext for information retrieval in wikis.
In Galia Angelova, Kalina Bontcheva, Ruslan Mitkov, Nicolas Nicolov, and Nikolai Nikolov, editors, Proceedings of the 7th International Conference on Recent Advances in Natural Language Processing (RANLP 2009), pages 241–245. ACM Press, September 2009.
Good hypertext writing style mandates
that link texts clearly indicate the nature of the link target. While this
guideline is routinely ignored in HTML, the lightweight markup languages used
by wikis encourage or even force hypertext authors to use semantically
appropriate link texts. This property of wiki hypertext makes it an ideal
candidate for processing with latent semantic analysis, a factor analysis
technique for finding latent transitive relations among natural-language
documents. In this study, we design, implement, and test an LSA-based
information retrieval system for wikis. Instead of a full-text index, our
system indexes only link texts and document titles. Nevertheless, its
precision exceeds that of a popular full-text search engine, and is
comparable to that of PageRank-based systems such as Google.
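A minimal sketch of the indexing idea using scikit-learn (a tooling assumption; the original system predates these libraries): each page is represented only by its title and link texts, the resulting term-document matrix is projected with truncated SVD, and queries are ranked by cosine similarity in the latent space.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# One pseudo-document per wiki page: its title plus its link texts (hypothetical data).
docs = ["Latent semantic analysis factor analysis information retrieval",
        "PageRank link analysis search engine",
        "Wiki lightweight markup hypertext"]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)
lsa = TruncatedSVD(n_components=2).fit(tfidf)
doc_vectors = lsa.transform(tfidf)

query = vectorizer.transform(["semantic search"])
scores = cosine_similarity(lsa.transform(query), doc_vectors)[0]
print(scores.argsort()[::-1])   # document indices, best match first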
@inproceedings{miller2009exploiting,
author = {Tristan Miller and Bertin Klein and Elisabeth
Wolf},
editor = {Galia Angelova and Kalina Bontcheva and Ruslan Mitkov
and Nicolas Nicolov and Nikolai Nikolov},
title = {Exploiting Latent Semantic Relations in Highly Linked
Hypertext for Information Retrieval in Wikis},
booktitle = {Proceedings of the 7th {International} {Conference}
on {Recent} {Advances} in {Natural} {Language} {Processing} ({RANLP}
2009)},
pages = {241--245},
month = sep,
year = {2009},
publisher = {ACM
Press},
}
Word completion with latent semantic analysis.
In Yuan Yan Tang, S. Patrick Wang, G. Lorette, Daniel So Yeung, and Hong Yan, editors, Proceedings of the 18th International Conference on Pattern Recognition (ICPR 2006), volume 1, pages 1252–1255. IEEE Press, August 2006. ISBN 978-0-7695-2521-1. DOI: 10.1109/ICPR.2006.1191.
Current word completion tools rely mostly on statistical
or syntactic knowledge. Can using semantic knowledge improve the completion
task? We propose a language-independent word completion algorithm which uses
latent semantic analysis (LSA) to model the semantic context of the word
being typed. We find that a system using this algorithm alone achieves
keystroke savings of 56% and a hit rate of 42%. This represents
improvements of 4.3% and 12%, respectively, over existing
approaches.
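The two evaluation measures can be made concrete with a small sketch (hedged: the exact definitions used in the paper may differ in detail; these are the usual formulations):
def keystroke_savings(chars_without_completion, chars_with_completion):
    """Proportion of keystrokes saved by accepting completions."""
    return 1.0 - chars_with_completion / chars_without_completion

def hit_rate(offered_lists, intended_words):
    """Fraction of words for which the intended word was among the suggestions."""
    hits = sum(word in suggestions
               for suggestions, word in zip(offered_lists, intended_words))
    return hits / len(intended_words)

# Example: 100 characters of text entered with only 44 keystrokes,
# and the intended word suggested at 2 of 3 completion prompts.
print(keystroke_savings(100, 44))                       # 0.56
print(hit_rate([["the", "they"], ["cat"], ["dog"]],
               ["they", "cap", "dog"]))                 # about 0.67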
@inproceedings{miller2006word,
author = {Tristan Miller and Elisabeth Wolf},
editor = {Yuan Yan Tang and S. Patrick Wang and G. Lorette and
Daniel So Yeung and Hong Yan},
title = {Word Completion with Latent Semantic
Analysis},
booktitle = {Proceedings of the 18th {International} {Conference}
on {Pattern} {Recognition} ({ICPR} 2006)},
volume = {1},
pages = {1252--1255},
month = aug,
year = {2006},
publisher = {IEEE
Press},
isbn = {978-0-7695-2521-1},
issn = {1051-4651},
doi = {10.1109/ICPR.2006.1191},
}
On the use of topic models for word completion.
In Tapio Salakoski, Filip Ginter, Sampo Pyysalo, and Tapio Pahikkala, editors, Advances in Natural Language Processing: 5th International Conference on NLP, FinTAL 2006 Turku, Finland, August 23–25, 2006 Proceedings, volume 4139 of Lecture Notes in Computer Science (ISSN 0302-9743), pages 500–511. Springer, August 2006. ISBN 978-3-540-37334-6. DOI: 10.1007/11816508_50.
We investigate the use of topic models, such as
probabilistic latent semantic analysis (PLSA) and latent Dirichlet allocation
(LDA), for word completion tasks. The advantage of using these models for
such an application is twofold. On the one hand, they allow us to exploit
semantic or contextual information when predicting candidate words for
completion. On the other hand, these probabilistic models have been found to
outperform classical latent semantic analysis (LSA) for modeling text
documents. We describe a word completion algorithm that takes into account
the semantic context of the word being typed. We also present evaluation
metrics to compare different models being used in our study. Our experiments
validate our hypothesis of using probabilistic models for semantic analysis
of text documents and their application in word completion tasks.
@inproceedings{wolf2006use,
author = {Elisabeth Wolf and Shankar Vembu and Tristan
Miller},
editor = {Tapio Salakoski and Filip Ginter and Sampo Pyysalo
and Tapio Pahikkala},
title = {On the Use of Topic Models for Word
Completion},
booktitle = {Advances
in Natural Language Processing: 5th International Conference on {NLP},
{FinTAL} 2006 {Turku}, {Finland}, {August} 23--25, 2006
Proceedings},
volume = {4139},
pages = {500--511},
series = {Lecture Notes in Computer Science},
month = aug,
year = {2006},
publisher = {Springer},
isbn = {978-3-540-37334-6},
issn = {0302-9743},
doi = {10.1007/11816508_50},
}
Creare splendide slide con LaTeX: Un'introduzione al pacchetto HA-prosper [Producing beautiful slides with LaTeX: An introduction to the HA-prosper package].
Pluto Journal, (47), May 2006. Translated by Gabriele Zucchetta.
This article presents HA-prosper, a LaTeX package for creating
sophisticated slides. We describe its features and illustrate them with some
usage examples. We also discuss the advantages of the LaTeX approach compared
with the presentation programs typically bundled with today's office
suites.
@article{miller2006producing,
author = {Tristan Miller},
title = {Creare splendide slide con {\LaTeX}: Un'introduzione
al pacchetto {HA}-prosper [{Producing} Beautiful Slides with {\LaTeX}: {An}
Introduction to the {HA}-prosper Package]},
journal = {Pluto Journal},
number = {47},
month = may,
year = {2006},
note = {Translated by Gabriele Zucchetta},
}
Impressions from PracTeX'05.
TUGboat: The Communications of the TeX Users Group, 26(1):31–32, 2005. ISSN 0896-3207.
@article{flom2005bimpressions,
author = {Peter Flom and Tristan Miller},
title = {Impressions from {Prac}{\TeX}'05},
journal = {{TUGboat}: The Communications of the {\TeX}{} {Users}
{Group}},
volume = {26},
number = {1},
pages = {31--32},
year = {2005},
issn = {0896-3207},
}
Biblet: A portable BibTeX bibliography style for generating highly customizable XHTML.
TUGboat: The Communications of the TeX Users Group, 26(1):85–96, 2005. ISSN 0896-3207.
We present Biblet, a set of BibTeX bibliography styles
(bst) which generate XHTML from BibTeX databases. Unlike other BibTeX to
XML/HTML converters, Biblet is written entirely in the native BibTeX style
language and therefore works “out of the box” on any system that
runs BibTeX. Features include automatic conversion of LaTeX symbols to HTML
or Unicode entities; customizable graphical hyperlinks to PostScript, PDF,
DVI, LaTeX, and HTML resources; support for nonstandard but common fields
such as day, isbn, and abstract; hideable text blocks; and output of the
original BibTeX entry for sharing citations. Biblet's highly structured XHTML
output means that bibliography appearance can be drastically altered
simply by specifying a Cascading Style Sheet (CSS), or easily postprocessed
with third-party XML, HTML, or text processing tools. We compare and contrast
Biblet to other common converters, describe basic usage of Biblet, give
examples of how to produce custom-formatted bibliographies, and provide a
basic overview of Biblet internals for those wishing to modify the style file
itself.
@article{miller2005biblet,
author = {Tristan Miller},
title = {Biblet: {A} Portable {\BibTeX}\ Bibliography Style
for Generating Highly Customizable {XHTML}},
journal = {{TUGboat}: The Communications of the {\TeX}{} {Users}
{Group}},
volume = {26},
number = {1},
pages = {85--96},
year = {2005},
issn = {0896-3207},
}
Using the RPM Package Manager for (La)TeX packages.
TUGboat: The Communications of the TeX Users Group, 26(1):17–28, 2005. ISSN 0896-3207.
RPM is a package management system which provides a
uniform, automated way for users to install, upgrade, and uninstall programs.
Because RPM is the default software distribution format for many operating
systems (particularly GNU/Linux), users may find it useful to manage their
library of TeX-related packages using RPM. This article explains how to
produce RPM files for TeX software, either for personal use or for public
distribution. We also explain how a (La)TeX user can find, install, and
remove TeX-related RPM packages.
@article{miller2005using,
author = {Tristan Miller},
title = {Using the {RPM} {Package} {Manager} for {(La)\TeX}{}
Packages},
journal = {{TUGboat}: The Communications of the {\TeX}{} {Users}
{Group}},
volume = {26},
number = {1},
pages = {17--28},
year = {2005},
issn = {0896-3207},
}
Attention-based information retrieval using eye tracker data.
In Peter Clark and Guus Schreiber, editors, Proceedings of the 3rd International Conference on Knowledge Capture (K-CAP 2005), pages 209–210, New York, NY, September 2005. ACM. ISBN 978-1-59593-163-4. DOI: 10.1145/1088622.1088672.
We describe eFISK, an automated
keyword extraction system which unobtrusively measures the user's attention
in order to isolate and identify those areas of a written document the reader
finds of greatest interest. Attention is measured by use of eye-tracking
hardware consisting of a desk-mounted infrared camera which records various
data about the user's eye. The keywords thus identified are subsequently used
in the back end of an information retrieval system to help the user find
other documents which contain information of interest to him. Unlike
traditional IR techniques, which compare documents simply on the basis of the
terms they have in common, our system also accounts for the weights users
implicitly attach to certain words or sections of the source document. We
describe a task-based user study which compares the utility of standard
relevance feedback techniques to the keywords and keyphrases discovered by
our system in finding other relevant documents from a corpus.
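The weighting idea can be illustrated with a hedged sketch (not the eFISK implementation; the fixation-based weighting scheme below is an assumption): terms the reader fixated on longer contribute more strongly to the relevance-feedback query.
from collections import Counter

def attention_weighted_query(document_terms, fixation_ms):
    """Build a query vector where each term's weight is its frequency
    scaled by the (normalised) time the reader spent fixating on it."""
    counts = Counter(document_terms)
    longest = max(fixation_ms.values(), default=1)
    return {term: count * (1.0 + fixation_ms.get(term, 0) / longest)
            for term, count in counts.items()}

# Hypothetical fixation durations in milliseconds:
print(attention_weighted_query(
    ["eye", "tracker", "data", "eye"],
    {"eye": 900, "tracker": 300}))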
@inproceedings{miller2005attention-based,
author = {Tristan Miller and Stefan Agne},
editor = {Peter Clark and Guus Schreiber},
title = {Attention-based Information Retrieval Using Eye
Tracker Data},
booktitle = {Proceedings of the 3rd {International} {Conference}
on {Knowledge} {Capture} ({K-CAP} 2005)},
pages = {209--210},
month = sep,
year = {2005},
publisher = {ACM},
address = {New York, NY},
isbn = {978-1-59593-163-4},
doi = {10.1145/1088622.1088672},
}
Impressions from PracTeX'05.
The PracTeX Journal, 2(3), July 2005. ISSN 1556-6994.
@article{flom2005impressions,
author = {Peter Flom and Tristan Miller},
title = {Impressions from {Prac}{\TeX}'05},
journal = {The {Prac}{\TeX}{} Journal},
volume = {2},
number = {3},
month = jul,
year = {2005},
issn = {1556-6994},
}
Security issues for pervasive personalized communication systems.
In Dieter Hutter and Markus Ullmann, editors, Security in Pervasive Computing: Second International Conference, SPC 2005, Boppard, Germany, April 6–8, 2005. Proceedings, volume 3450 of Lecture Notes in Computer Science (ISSN 0302-9743), pages 56–62. Springer, April 2005. ISBN 3-540-25521-4. DOI: 10.1007/978-3-540-32004-3_7.
Technological progress allows us to equip any mobile
phone with new functionalities, such as storing personalized information
about its owner and using the corresponding personal profile for enabling
communication to persons whose mobile phones represent similar profiles.
However, this raises very specific security issues, in particular relating to
the use of Bluetooth technology. Herein we consider such scenarios and
related problems in privacy and security matters. We analyze in which respect
certain design approaches may fail or succeed at solving these problems. We
concentrate on methods for designing the user-related part of the
communication service appropriately in order to enhance
confidentiality.
@inproceedings{klein2005security,
author = {Bertin Klein and Tristan Miller and Sandra
Zilles},
editor = {Dieter Hutter and Markus Ullmann},
title = {Security Issues for Pervasive Personalized
Communication Systems},
booktitle = {Security
in Pervasive Computing: Second International Conference, {SPC} 2005,
{Boppard}, {Germany}, {April} 6--8, 2005. Proceedings},
volume = {3450},
pages = {56--62},
series = {Lecture Notes in Computer Science},
month = apr,
year = {2005},
publisher = {Springer},
isbn = {3-540-25521-4},
issn = {0302-9743},
doi = {10.1007/978-3-540-32004-3_7},
}
Producing beautiful slides with LaTeX: An introduction to the HA-prosper package.
The PracTeX Journal, 2(1), April 2005. ISSN 1556-6994.
In this paper, we present HA-prosper, a LaTeX package for creating
overhead slides. We describe the features of the package and give examples of
their use. We also discuss what advantages there are to producing slides with
LaTeX versus the presentation software typically bundled with today's office
suites.
@article{miller2005producing,
author = {Tristan Miller},
title = {Producing Beautiful Slides with {\LaTeX}: {An}
Introduction to the {HA}-prosper Package},
journal = {The Prac{\TeX}{} Journal},
volume = {2},
number = {1},
month = apr,
year = {2005},
issn = {1556-6994},
}
Latent semantic analysis and the construction of coherent extracts.
In Nicolas Nicolov, Kalina Botcheva, Galia Angelova, and Ruslan Mitkov, editors, Recent Advances in Natural Language Processing III, volume 260 of Current Issues in Linguistic Theory (CILT) (ISSN 0304-0763), pages 277–286. John Benjamins, Amsterdam/Philadelphia, 2004. ISBN 1-58811-618-2. DOI: 10.1075/cilt.260.31mil.
We describe a language-neutral automatic summarization
system which aims to produce coherent extracts. It builds an initial extract
composed solely of topic sentences, and then recursively fills in the topical
lacunae by providing linking material between semantically dissimilar
sentences. While experiments with human judges did not prove a statistically
significant increase in textual coherence with the use of a latent semantic
analysis module, we found a strong positive correlation between coherence and
overall summary quality.
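The gap-filling step can be sketched roughly as follows (a simplification that assumes sentence vectors, e.g. from an LSA projection, are already available, and that omits the recursion over newly created gaps):
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def fill_gaps(extract, sentence_vectors, all_sentences, threshold=0.3):
    """Wherever two adjacent extract sentences are semantically dissimilar,
    insert the source sentence most similar to both of them.
    sentence_vectors maps each sentence to its vector representation."""
    result = [extract[0]]
    for prev, nxt in zip(extract, extract[1:]):
        if cosine(sentence_vectors[prev], sentence_vectors[nxt]) < threshold:
            bridge = max(
                (s for s in all_sentences if s not in (prev, nxt)),
                key=lambda s: cosine(sentence_vectors[s], sentence_vectors[prev])
                              + cosine(sentence_vectors[s], sentence_vectors[nxt]))
            result.append(bridge)
        result.append(nxt)
    return result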
@incollection{miller2004latent,
author = {Tristan Miller},
editor = {Nicolas Nicolov and Kalina Botcheva and Galia
Angelova and Ruslan Mitkov},
title = {Latent Semantic Analysis and the Construction of
Coherent Extracts},
booktitle = {Recent
Advances in Natural Language Processing {III}},
volume = {260},
pages = {277--286},
series = {Current Issues in Linguistic Theory
(CILT)},
year = {2004},
publisher = {John
Benjamins},
address = {Amsterdam/Philadelphia},
isbn = {1-58811-618-2},
issn = {0304-0763},
doi = {10.1075/cilt.260.31mil},
}
Essay assessment with latent semantic analysis.
Journal of Educational Computing Research, 29(4):495–512, December 2003. ISSN 0735-6331. DOI: 10.2190/W5AR-DYPW-40KX-FL99.
Latent semantic analysis (LSA) is an automated,
statistical technique for comparing the semantic similarity of words or
documents. In this paper, I examine the application of LSA to automated essay
scoring. I compare LSA methods to earlier statistical methods for assessing
essay quality, and critically review contemporary essay-scoring systems built
on LSA, including the Intelligent Essay Assessor, Summary Street, State the
Essence, Apex, and Select-a-Kibitzer. Finally, I discuss current avenues of
research, including LSA's application to computer-measured readability
assessment and to automatic summarization of student essays.
@article{miller2003essay,
author = {Tristan Miller},
title = {Essay Assessment with Latent Semantic
Analysis},
journal = {Journal of Educational Computing
Research},
volume = {29},
number = {4},
pages = {495--512},
month = dec,
year = {2003},
issn = {0735-6331},
doi = {10.2190/W5AR-DYPW-40KX-FL99},
}
Latent semantic analysis and the construction of coherent extracts.
In Galia Angelova, Kalina Bontcheva, Ruslan Mitkov, Nicolas Nicolov, and Nikolai Nikolov, editors, Proceedings of the 4th International Conference on Recent Advances in Natural Language Processing (RANLP 2003), pages 270–277, September 2003. ISBN 954-90906-6-3.
We describe a language-neutral automatic summarization
system which aims to produce coherent extracts. It builds an initial extract
composed solely of topic sentences, and then recursively fills in the topical
lacunae by providing linking material between semantically dissimilar
sentences. While experiments with human judges did not prove a statistically
significant increase in textual coherence with the use of a latent semantic
analysis module, we found a strong positive correlation between coherence and
overall summary quality.
@inproceedings{miller2003latent,
author = {Tristan Miller},
editor = {Galia Angelova and Kalina Bontcheva and Ruslan Mitkov
and Nicolas Nicolov and Nikolai Nikolov},
title = {Latent Semantic Analysis and the Construction of
Coherent Extracts},
booktitle = {Proceedings of the 4th {International} {Conference}
on {Recent} {Advances} in {Natural} {Language} {Processing} ({RANLP}
2003)},
pages = {270--277},
month = sep,
year = {2003},
isbn = {954-90906-6-3},
}
Efficient defeasible reasoning systems.
International Journal on Artificial Intelligence Tools, 10(4):483–501, December 2001. ISSN 0218-2130. DOI: 10.1142/S0218213001000623.
For many years, the non-monotonic reasoning community
has focussed on highly expressive logics. Such logics have turned out to be
computationally expensive, and have given little support to the practical use
of non-monotonic reasoning. In this work we discuss defeasible logic, a
less-expressive but more efficient non-monotonic logic. We report on two new
implemented systems for defeasible logic: a query answering system employing
a backward-chaining approach, and a forward-chaining implementation that
computes all conclusions. Our experimental evaluation demonstrates that the
systems can deal with large theories (up to hundreds of thousands of rules).
We show that defeasible logic has linear complexity, which contrasts markedly
with most other non-monotonic logics and helps to explain the impressive
experimental results. We believe that defeasible logic, with its efficiency
and simplicity, is a good candidate to be used as a modelling language for
practical applications, including modelling of regulations and business
rules.
@article{maher2001efficient,
author = {Michael J. Maher and Allan Rock and Grigoris Antoniou
and David Billington and Tristan Miller},
title = {Efficient Defeasible Reasoning Systems},
journal = {International Journal on Artificial Intelligence
Tools},
volume = {10},
number = {4},
pages = {483--501},
month = dec,
year = {2001},
issn = {0218-2130},
doi = {10.1142/S0218213001000623},
}
Essay assessment with latent semantic analysis.
Technical Report CSRG-440, Department of Computer Science, University of
Toronto, May 2001.
@techreport{miller2001essay,
author = {Tristan Miller},
title = {Essay Assessment with Latent Semantic
Analysis},
number = {{CSRG-440}},
type = {Technical Report},
month = may,
year = {2001},
institution =
{Department of Computer Science, University of
Toronto},
}
Efficient defeasible reasoning systems.
In Proceedings of the 12th IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2000), pages 384–392. IEEE Press, November 2000. ISBN 0-7695-0909-6. DOI: 10.1109/TAI.2000.889898.
For many years, the non-monotonic reasoning community
has focussed on highly expressive logics. Such logics have turned out to be
computationally expensive, and have given little support to the practical use
of non-monotonic reasoning. In this work we discuss defeasible logic, a
less-expressive but more efficient non-monotonic logic. We report on two new
implemented systems for defeasible logic: a query answering system employing
a backward-chaining approach, and a forward-chaining implementation that
computes all conclusions. Our experimental evaluation demonstrates that the
systems can deal with large theories (up to hundreds of thousands of rules).
We show that defeasible logic has linear complexity, which contrasts markedly
with most other non-monotonic logics and helps to explain the impressive
experimental results. We believe that defeasible logic, with its efficiency
and simplicity, is a good candidate to be used as a modelling language for
practical applications, including modelling of regulations and business
rules.
@inproceedings{maher2000efficient,
author = {Michael J. Maher and Allan Rock and Grigoris Antoniou
and David Billington and Tristan Miller},
title = {Efficient Defeasible Reasoning Systems},
booktitle = {Proceedings of the 12th {IEEE} {International}
{Conference} on {Tools} with {Artificial} {Intelligence} ({ICTAI}
2000)},
pages = {384--392},
month = nov,
year = {2000},
publisher = {IEEE
Press},
isbn = {0-7695-0909-6},
issn = {1082-3409},
doi = {10.1109/TAI.2000.889898},
}
A well-behaved algorithm for simulating dependence structures of Bayesian networks.
International Journal of Applied Mathematics, 1(8):923–932, 1999. ISSN 1311-1728.
Automatic generation of Bayesian
network (BN) structures (directed acyclic graphs) is an important step in
experimental study of algorithms for inference in BNs and algorithms for
learning BNs from data. Previously known simulation algorithms do not
guarantee connectedness of generated structures or even successful
generation according to a user specification. We propose a simple, efficient
and well-behaved algorithm for automatic generation of BN structures. The
performance of the algorithm is demonstrated experimentally.
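The kind of guarantee at issue can be illustrated with a hedged sketch (not the paper's algorithm; just one simple way to generate a weakly connected DAG with a bounded number of parents):
import random

def random_connected_dag(num_nodes, max_parents=3, extra_edge_prob=0.2):
    """Generate a weakly connected DAG as a list of (parent, child) edges.

    Nodes are labelled 0..num_nodes-1 in topological order; every node after
    the first gets at least one earlier parent, which guarantees both
    acyclicity and (weak) connectedness."""
    edges = []
    for child in range(1, num_nodes):
        parents = {random.randrange(child)}          # ensures connectedness
        for candidate in range(child):
            if len(parents) >= max_parents:
                break
            if candidate not in parents and random.random() < extra_edge_prob:
                parents.add(candidate)
        edges.extend((parent, child) for parent in sorted(parents))
    return edges

print(random_connected_dag(6))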
@article{xiang1999wellbehaved,
author = {Yang Xiang and Tristan Miller},
title = {A Well-behaved Algorithm for Simulating Dependence
Structures of {Bayesian} Networks},
journal = {International Journal of Applied
Mathematics},
volume = {1},
number = {8},
pages = {923--932},
year = {1999},
issn = {1311-1728},
}