Preface
- I recently re-read this paper, so I am sharing my notes here.
- The paper's main contributions are the Qulac dataset and a question-selection pipeline: first retrieve relevant questions, then select from the retrieved questions, and finally use the selected question and its answer for retrieval augmentation.
- Original paper: Asking Clarifying Questions in Open-Domain Information-Seeking Conversations
Motivation
Formulate the task of selecting and asking clarifying questions in open-domain information-seeking conversational systems
Intro
If the system is not sufficiently confident to present the results to the user, it then starts the process of asking clarifying questions.
- generate a list of candidate questions related to the query
- select one candidate question and ask it from the user
- Based on the answer, the system retrieves new documents and repeats the process
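The ask-clarify-retrieve loop above can be sketched as follows. All functions here (`retrieve_documents`, `select_question`, the confidence heuristic, etc.) are hypothetical stand-ins for illustration, not the paper's actual models:

```python
def retrieve_documents(query, history):
    # Stub: pretend retrieval confidence grows as clarification context accumulates.
    confidence = 0.4 + 0.3 * len(history)
    return ["doc_a", "doc_b"], confidence

def generate_candidate_questions(query):
    # Stub candidate question bank for the query.
    return ["Do you mean the animal?", "Do you mean the car brand?"]

def select_question(candidate_questions, history):
    # Pick the first candidate not already asked.
    asked = {q for q, _ in history}
    return next(q for q in candidate_questions if q not in asked)

def ask_user(question):
    # Stub answer; a real system would wait for the user's reply.
    return "yes, the animal"

def clarification_loop(query, threshold=0.9, max_turns=3):
    history = []
    docs, confidence = retrieve_documents(query, history)
    # Keep asking clarifying questions until the system is confident enough.
    while confidence < threshold and len(history) < max_turns:
        question = select_question(generate_candidate_questions(query), history)
        history.append((question, ask_user(question)))
        docs, confidence = retrieve_documents(query, history)
    return docs, history

docs, history = clarification_loop("jaguar")
```

With these stubs the system asks two clarifying questions before its confidence crosses the threshold and it presents the results.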
Contribution
Collect Qulac, built on top of the TREC Web Track 2009-2012 collections
- consists of over 10K question-answer pairs for 198 TREC topics comprising 762 facets
Propose a retrieval framework consisting of three main components:
- question retrieval
- question selection
- document retrieval
Problem Statement
A Facet-Based Offline Evaluation Protocol
The design of an offline evaluation protocol is challenging because conversation requires online interaction between a user and a system.
- To circumvent this problem, we substitute the Question Generation Model in Figure 2 with a large bank of questions, assuming that it consists of all possible questions in the collection
We build our evaluation protocol on top of the TREC Web track’s data
- TREC has released 200 search topics, each of which is either “ambiguous” or “faceted”
Define
- $T=\{t_1,\cdots,t_n\}$: the set of topics (queries)
- $F=\{f_1,\cdots,f_n\}$: the set of facets, with $f_i=\{f_1^i,\cdots,f_{m_i}^i\}$ defining the different facets of $t_i$, where $m_i$ denotes the number of facets for $t_i$
- $Q=\{q_1,q_2,\ldots,q_n\}$: the set of clarifying questions belonging to every topic, where $q_i=\{q_1^i,q_2^i,\ldots,q_{z_i}^i\}$ consists of all clarifying questions that belong to $t_i$; $z_i$ is the number of clarifying questions for $t_i$
- $A(t,f,q)\rightarrow a$: a function that returns answer $a$ for a given topic $t$, facet $f$, and question $q$
- our aim is to provide the users’ answers to all clarifying questions considering all topics and their corresponding facets
- hence, to enable offline evaluation, $A$ is required to return an answer for all possible values of $t$, $f$, and $q$
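The protocol's notation can be illustrated with a toy instantiation; the topics, facets, questions, and answers below are made up, and $A$ is modeled as a simple lookup table over pre-collected triplets:

```python
topics = ["t1"]
facets = {"t1": ["f1", "f2"]}        # facets F of each topic
questions = {"t1": ["q1", "q2"]}     # clarifying questions Q per topic

# A(t, f, q) -> a: pre-collected answers for every triplet,
# which is what makes offline evaluation possible.
answers = {
    ("t1", "f1", "q1"): "yes",
    ("t1", "f1", "q2"): "no",
    ("t1", "f2", "q1"): "no",
    ("t1", "f2", "q2"): "yes",
}

def A(t, f, q):
    return answers[(t, f, q)]

# Offline evaluation requires A to cover every (t, f, q) combination.
complete = all((t, f, q) in answers
               for t in topics
               for f in facets[t]
               for q in questions[t])
```

Completeness of the lookup table is exactly the requirement that $A$ return an answer for all possible values of $t$, $f$, and $q$.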
Data Collection
We follow a four-step strategy to build Qulac
- define the topics and their corresponding facets
- collect a number of candidate clarifying questions ($Q$) for each query through crowdsourcing
- assess the relevance of the questions to each facet and collect new questions for those facets that require more specific questions
- collect the answers for every query-facet-question triplet, modeling $A$
Topic and Facets
We choose the TREC Web track’s topics as the basis for Qulac.
- In other words, we take the topics of TREC Web track 09-12 as initial user queries.
We then break each topic into its facets and assume that each facet describes the information need of a different user (i.e., it is a topic).
- The average number of facets per topic is 3.85 ± 1.05; therefore, the initial 198 TREC topics lead to 762 topic-facet pairs in Qulac.
- For each topic-facet pair, we take the relevance judgements associated with the respective facet.
Clarifying Questions
To collect clarifying questions, we designed a Human Intelligence Task (HIT) on Amazon Mechanical Turk:
- we asked human annotators to generate questions for a given query based on the results they observed on a commercial search engine as well as query auto-complete suggestions.
Question Verification and Addition
We appointed two expert annotators to address two main concerns:
- How good are the collected clarifying questions?
  - We instructed the annotators to read all the collected questions of each topic, marking invalid and duplicate questions. Moreover, we asked them to match a question to a facet if the question was relevant to the facet.
  - A question was considered relevant to a facet if its answer would address the facet.
- Are all facets addressed by at least one clarifying question?
  - We asked the annotators to generate an additional question for the facets that needed more specific questions.
Answers
We designed another HIT in which we collected answers to the questions for every facet.
- they were required to write the answer to one clarifying question that was presented to them
Quality check: The checks were done manually on 10% of submissions per worker
Selecting Clarifying Questions
We propose a conversational search system that is able to select and ask clarifying questions and rank documents based on the user’s responses.
Context: for a given topic $t$, let $h=\{(q_1,a_1),\cdots,(q_{|h|},a_{|h|})\}$ be the history of clarifying questions and their corresponding answers exchanged between the user and the system.
- the ultimate goal is to predict $q$, i.e., the next question that the system should ask the user
Question Retrieval Model
We now describe our BERT Language Representation based Question Retrieval model, called BERT-LeaQuR
- We aim to maximize the recall of the retrieved questions, i.e., retrieve all relevant clarifying questions for a given query within the top $k$ questions
BERT-LeaQuR estimates the probability $p(R=1|t,q)$
- $R$ is a binary random variable indicating whether the question $q$ should be retrieved ($R=1$) or not ($R=0$)
- $t$ and $q$ denote the query (topic) and the candidate clarifying question, respectively
The question relevance probability in the BERT-LeaQuR model is estimated as $p(R=1|t,q)=\psi(\phi_T(t),\phi_Q(q))$, where $\phi_T$ and $\phi_Q$ are the topic and question representation functions.
- $\psi$ is the matching component that takes the aforementioned representations and produces a question retrieval score
Implement:
We use BERT to learn $\phi_T$ and $\phi_Q$
- pre-trained for the language modeling task on Wikipedia and fine-tuned on Qulac for 3 epochs
The component $\psi$ is modeled as a fully-connected feed-forward network with output dimensionality of 2
- activation function: ReLU
- a softmax function is applied on the output layer to compute the probability of each label (i.e., relevant or non-relevant).
Loss function: cross-entropy
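A rough numpy sketch of the matching component $\psi$: random vectors stand in for the BERT representations $\phi_T(t)$ and $\phi_Q(q)$, and the weights are untrained, so this only shows the forward-pass shape of the model, not the fine-tuned encoder:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, hidden = 8, 16                 # illustrative dimensions

phi_t = rng.normal(size=dim)        # stand-in for phi_T(t), the topic representation
phi_q = rng.normal(size=dim)        # stand-in for phi_Q(q), the question representation

# Fully-connected feed-forward network with output dimensionality 2.
W1 = rng.normal(size=(2 * dim, hidden))
W2 = rng.normal(size=(hidden, 2))

def psi(t_vec, q_vec):
    x = np.concatenate([t_vec, q_vec])
    h = np.maximum(0, x @ W1)                      # ReLU activation
    logits = h @ W2
    logits = logits - logits.max()                 # numerically stable softmax
    probs = np.exp(logits) / np.exp(logits).sum()  # [p(non-relevant), p(relevant)]
    return probs

p = psi(phi_t, phi_q)
```

The softmax over the 2-dimensional output yields the relevant/non-relevant probabilities that the cross-entropy loss is computed against.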
Question Selection Model
We introduce a Neural Question Selection Model (NeuQS), which selects questions with a focus on maximizing precision at the top of the ranked list.
- The main challenge in the question selection task is to predict whether a question has diverged from the query and conversation context:
  - given negative answer(s) to previous question(s), the model needs to diverge from the history
  - given a positive answer, questions on the same topic that ask for more details are preferred
NeuQS outputs a relevance score for a given query $t$, question $q$, and conversation context $h$, using:
- context representation $\phi_H(h)$
- retrieval representation $\eta(t,h,q)$
- query performance representation $\sigma(t,h,q)$
Implement:
The context representation component $\phi_H$ is implemented as follows:
- $\phi_{QA}(q,a)$ is an embedding function of a question $q$ and answer $a$.
The retrieval representation $\eta(t,h,q)\in\mathbb{R}^k$ is implemented by interpolating the retrieval scores of the query, context, and question (see Section 5.3); the scores of the top $k$ retrieved documents (from the document retrieval model) are used.
The query performance prediction (QPP) representation component $\sigma(t,h,q)\in\mathbb{R}^k$ consists of the performance prediction scores of the ranked documents at different ranking positions (for a maximum of $k$ ranked documents).
- We employ the $\sigma$ QPP model for this component [28]. We take the representations from the [CLS] layer of the pre-trained uncased BERT-Base model.
To model the function $\gamma$, we concatenate $\phi_T(t)$, $\phi_H(h)$, $\phi_Q(q)$, $\eta(t,h,q)$, and $\sigma(t,h,q)$ and feed them into a fully-connected feed-forward network with two hidden layers. We use ReLU as the activation function in the hidden layers.
Loss function: cross-entropy
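The forward pass of $\gamma$ can be sketched as below; all representation vectors, weights, and dimensions are random illustrative stand-ins, not the trained model:

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, hidden = 8, 5, 16                # illustrative dimensions

phi_t = rng.normal(size=d)             # query representation phi_T(t)
phi_h = rng.normal(size=d)             # context representation phi_H(h)
phi_q = rng.normal(size=d)             # question representation phi_Q(q)
eta   = rng.normal(size=k)             # retrieval representation eta(t,h,q)
sigma = rng.normal(size=k)             # QPP representation sigma(t,h,q)

# Two hidden layers, then a 2-way output (relevant / non-relevant).
W1 = rng.normal(size=(3 * d + 2 * k, hidden))
W2 = rng.normal(size=(hidden, hidden))
W3 = rng.normal(size=(hidden, 2))

def gamma(reps):
    x = np.concatenate(reps)           # concatenate all five representations
    h1 = np.maximum(0, x @ W1)         # ReLU, hidden layer 1
    h2 = np.maximum(0, h1 @ W2)        # ReLU, hidden layer 2
    logits = h2 @ W3
    logits = logits - logits.max()     # numerically stable softmax
    return np.exp(logits) / np.exp(logits).sum()

score = gamma([phi_t, phi_h, phi_q, eta, sigma])
```

As with BERT-LeaQuR, the 2-way softmax output is what the cross-entropy loss is applied to during training.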
Document Retrieval Model
We describe the model we use to retrieve documents given a query, conversation context, and the current clarifying question together with the user’s answer.
We use the KL-divergence retrieval model [24] based on the language modeling framework [30] with Dirichlet prior smoothing [52], where we linearly interpolate two likelihood models: one based on the original query, and one based on the questions and their respective answers.
We use the document retrieval model for two purposes:
- ranking documents after the user answers a clarifying question
- ranking documents of a candidate question as part of NeuQS (which does not see the answer)
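A toy sketch of the interpolated scoring idea: Dirichlet-smoothed language-model scores for the original query and for the question/answer terms are mixed linearly. The corpus, the weight `alpha`, and interpolating log-scores (rather than the likelihood models themselves) are illustrative simplifications, not the paper's exact formulation:

```python
from collections import Counter
import math

# Tiny made-up corpus for illustration.
collection = ["jaguar car price", "jaguar animal habitat wild"]
doc = "jaguar animal habitat".split()   # the document being scored

coll_counts = Counter(w for d in collection for w in d.split())
coll_len = sum(coll_counts.values())

def dirichlet_logprob(term, doc_terms, mu=10.0):
    # Dirichlet-smoothed term log-probability under the document model.
    tf = doc_terms.count(term)
    p_coll = coll_counts.get(term, 0) / coll_len
    return math.log((tf + mu * p_coll) / (len(doc_terms) + mu))

def score(query_terms, qa_terms, alpha=0.5):
    # Linear interpolation of the query model and the question/answer model.
    s_query = sum(dirichlet_logprob(t, doc) for t in query_terms)
    s_qa = sum(dirichlet_logprob(t, doc) for t in qa_terms)
    return alpha * s_query + (1 - alpha) * s_qa

s = score(["jaguar"], ["animal", "habitat"])
```

Here the answer terms "animal habitat" pull the score toward the animal-related document, which is the intended effect of folding clarification answers into retrieval.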
Experiment
Qulac-T: split the train/validation/test sets based on topics.
Qulac-F: split the data based on facets.
We expand Qulac to include multiple artificially generated conversation turns:
- For each instance, we consider all possible combinations of questions to be asked as the conversation context.
- The total number of unique conversational contexts is 75,200; a model should select questions for 75,200 contexts from all 907,366 candidate questions.
Note: the set of candidate clarifying questions for each multi-turn example consists of the ones that have not yet appeared in the context.
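The multi-turn expansion can be sketched with `itertools`: every combination of previously asked questions forms a context, and the candidate set for the next turn excludes questions already in the context. Treating contexts as unordered question sets and using a three-question bank are simplifying assumptions here:

```python
from itertools import combinations

questions = ["q1", "q2", "q3"]   # hypothetical question bank for one topic

# Every combination of already-asked questions forms a conversation context
# (the empty tuple is the first-turn context with no history).
contexts = []
for turn in range(len(questions)):
    contexts.extend(combinations(questions, turn))

def candidates(context):
    # Candidate questions are those that have not yet appeared in the context.
    return [q for q in questions if q not in context]
```

With three questions this yields 1 + 3 + 3 = 7 contexts; applying the same expansion across all Qulac topics is what produces the 75,200 contexts reported above.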
Results
BERT-LeaQuR is able to outperform all baselines.
- Oracle question selection: performance
  - an oracle model is aware of the answers to the questions
  - selecting the best questions (the BestQuestion model) helps the model achieve a substantial improvement
- Oracle question selection: impact of topic type and length
  - the relative improvement of BestQuestion is negatively correlated with the number of query terms, suggesting that shorter queries require clarification in more cases
  - the average ∆MRR for ambiguous topics is 0.3858, compared with an average ∆MRR of 0.2898 for faceted topics
- Question selection
  - Table 3 presents the results of the document retrieval model taking into account a selected question together with its answer
  - the obtained improvements highlight the necessity and effectiveness of asking clarifying questions in a conversational search system
- Impact of clarifying questions on facets
  - the performance for 45% of the facets is improved by asking clarifying questions, whereas the performance for 19% is worse
- Case study
  - Failure:
    - with additional information: the retrieval model is not able to take advantage of the additional information when it has no terms in common with the relevant documents (the first two rows)
    - with no additional information (the third row)
    - we think a more effective document retrieval model could potentially improve performance
  - Success:
    - the system asks open questions and gets useful feedback
    - some keywords, like “biography”, match the relevant documents
    - the question/answer contains some facet terms