Preface
- I recently re-read this paper, so I am sharing my notes here.
- The paper's main contributions are the Qulac dataset and a question-selection pipeline: first retrieve relevant questions, then select from the retrieved questions, and finally use the selected question and its answer for retrieval augmentation.
- Original paper: Asking Clarifying Questions in Open-Domain Information-Seeking Conversations
Motivation
Formulate the task of selecting and asking clarifying questions in open-domain information-seeking conversational systems
Intro
If the system is not sufficiently confident to present the results to the user, it then starts the process of asking clarifying questions.
- generate a list of candidate questions related to the query
- select one candidate question and ask it from the user
- Based on the answer, the system retrieves new documents and repeats the process
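The ask-clarify-retrieve loop above can be sketched as follows. All functions here (`retrieve_documents`, `select_question`, the confidence heuristic, etc.) are hypothetical stand-ins for illustration, not the paper's actual models:

```python
def retrieve_documents(query, history):
    # Stub: pretend retrieval confidence grows as clarification context accumulates.
    confidence = 0.4 + 0.3 * len(history)
    return ["doc_a", "doc_b"], confidence

def generate_candidate_questions(query):
    # Stub candidate question bank for the query.
    return ["Do you mean the animal?", "Do you mean the car brand?"]

def select_question(candidate_questions, history):
    # Pick the first candidate not already asked.
    asked = {q for q, _ in history}
    return next(q for q in candidate_questions if q not in asked)

def ask_user(question):
    # Stub answer; a real system would wait for the user's reply.
    return "yes, the animal"

def clarification_loop(query, threshold=0.9, max_turns=3):
    history = []
    docs, confidence = retrieve_documents(query, history)
    # Keep asking clarifying questions until the system is confident enough.
    while confidence < threshold and len(history) < max_turns:
        question = select_question(generate_candidate_questions(query), history)
        history.append((question, ask_user(question)))
        docs, confidence = retrieve_documents(query, history)
    return docs, history

docs, history = clarification_loop("jaguar")
```

With these stubs the system asks two clarifying questions before its confidence crosses the threshold and it presents the results.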
Contribution
Collect Qulac, built on top of the TREC Web Track 2009-2012 collections
- consists of over 10K question-answer pairs for 198 TREC topics comprising 762 facets
Propose a retrieval framework consisting of three main components:
- question retrieval
- question selection
- document retrieval
Problem Statement
A Facet-Based Offline Evaluation Protocol
The design of an offline evaluation protocol is challenging because conversation requires online interaction between a user and a system.
- To circumvent this problem, we substitute the Question Generation Model in Figure 2 with a large bank of questions, assuming that it consists of all possible questions in the collection
We build our evaluation protocol on top of the TREC Web track’s data
- TREC has released 200 search topics, each of which is either “ambiguous” or “faceted”
Define
- $T=\{t_1,\cdots,t_n\}$: the set of topics (queries)
- $F=\{f_1,\cdots,f_n\}$: the set of facets, with $f_i=\{f_1^i,\cdots,f_{m_i}^i\}$ defining the different facets of $t_i$, where $m_i$ denotes the number of facets for $t_i$
- $Q=\{q_1,q_2,\ldots,q_n\}$: the set of clarifying questions belonging to every topic, where $q_i=\{q_1^i,q_2^i,\ldots,q_{z_i}^i\}$ consists of all clarifying questions that belong to $t_i$; $z_i$ is the number of clarifying questions for $t_i$
- $A(t,f,q)\rightarrow a$: a function that returns answer $a$ for a given topic $t$, facet $f$, and question $q$
- our aim is to provide the users’ answers to all clarifying questions considering all topics and their corresponding facets
- hence, to enable offline evaluation, $A$ is required to return an answer for all possible values of $t$, $f$, and $q$
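The protocol's notation can be illustrated with a toy instantiation; the topics, facets, questions, and answers below are made up, and $A$ is modeled as a simple lookup table over pre-collected triplets:

```python
topics = ["t1"]
facets = {"t1": ["f1", "f2"]}        # facets F of each topic
questions = {"t1": ["q1", "q2"]}     # clarifying questions Q per topic

# A(t, f, q) -> a: pre-collected answers for every triplet,
# which is what makes offline evaluation possible.
answers = {
    ("t1", "f1", "q1"): "yes",
    ("t1", "f1", "q2"): "no",
    ("t1", "f2", "q1"): "no",
    ("t1", "f2", "q2"): "yes",
}

def A(t, f, q):
    return answers[(t, f, q)]

# Offline evaluation requires A to cover every (t, f, q) combination.
complete = all((t, f, q) in answers
               for t in topics
               for f in facets[t]
               for q in questions[t])
```

Completeness of the lookup table is exactly the requirement that $A$ return an answer for all possible values of $t$, $f$, and $q$.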
Data Collection
We follow a four-step strategy to build Qulac
- define the topics and their corresponding facets
- collect a number of candidate clarifying questions ($Q$) for each query through crowdsourcing
- assess the relevance of the questions to each facet and collect new questions for those facets that require more specific questions
- collect the answers for every query-facet-question triplet, modeling $A$
Topic and Facets
We choose the TREC Web track’s topics as the basis for Qulac.
- In other words, we take the topics of TREC Web track 09-12 as initial user queries.
We then break each topic into its facets and assume that each facet describes the information need of a different user (i.e., it is a topic).
- The average number of facets per topic is 3.85 ± 1.05; therefore, the initial 198 TREC topics lead to 762 topic-facet pairs in Qulac.
- For each topic-facet pair, we take the relevance judgements associated with the respective facet.
Clarifying Questions
To collect clarifying questions, we designed a Human Intelligence Task (HIT) on Amazon Mechanical Turk:
- we asked human annotators to generate questions for a given query based on the results they observed on a commercial search engine as well as query auto-complete suggestions.
Question Verification and Addition
We appointed two expert annotators to address two main concerns:
- How good are the collected clarifying questions?
  - We instructed the annotators to read all the collected questions of each topic, marking invalid and duplicate questions. Moreover, we asked them to match a question to a facet if the question was relevant to the facet.
  - A question was considered relevant to a facet if its answer would address the facet.
- Are all facets addressed by at least one clarifying question?
  - We asked the annotators to generate an additional question for the facets that needed more specific questions.
Answers
We designed another HIT in which we collected answers to the questions for every facet.
- they were required to write the answer to one clarifying question that was presented to them
Quality check: The checks were done manually on 10% of submissions per worker
Selecting Clarifying Questions
We propose a conversational search system that is able to select and ask clarifying questions and rank documents based on the user’s responses.
Context: for a given topic $t$, let $h=\{(q_1,a_1),\cdots,(q_{|h|},a_{|h|})\}$ be the history of clarifying questions and their corresponding answers exchanged between the user and the system.
- the ultimate goal is to predict $q$, i.e., the next question that the system should ask the user
Question Retrieval Model
We now describe our BERT Language Representation based Question Retrieval model, called BERT-LeaQuR
- We aim to maximize the recall of the retrieved questions, i.e., retrieve all relevant clarifying questions for a given query within the top $k$ questions
BERT-LeaQuR estimates the probability $p(R=1|t,q)$
- $R$ is a binary random variable indicating whether the question $q$ should be retrieved ($R=1$) or not ($R=0$)
- $t$ and $q$ denote the query (topic) and the candidate clarifying question, respectively
The question relevance probability in the BERT-LeaQuR model is estimated as $p(R=1|t,q)=\psi(\phi_T(t),\phi_Q(q))$, where $\phi_T$ and $\phi_Q$ are the topic and question representation functions.
- $\psi$ is the matching component that takes the aforementioned representations and produces a question retrieval score
Implement:
We use BERT to learn $\phi_T$ and $\phi_Q$
- pre-trained for the language modeling task on Wikipedia and fine-tuned on Qulac for 3 epochs
The component $\psi$ is modeled as a fully-connected feed-forward network with output dimensionality of 2
- activation function: ReLU
- a softmax function is applied on the output layer to compute the probability of each label (i.e., relevant or non-relevant).
Loss function: cross-entropy
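A rough numpy sketch of the matching component $\psi$: random vectors stand in for the BERT representations $\phi_T(t)$ and $\phi_Q(q)$, and the weights are untrained, so this only shows the forward-pass shape of the model, not the fine-tuned encoder:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, hidden = 8, 16                 # illustrative dimensions

phi_t = rng.normal(size=dim)        # stand-in for phi_T(t), the topic representation
phi_q = rng.normal(size=dim)        # stand-in for phi_Q(q), the question representation

# Fully-connected feed-forward network with output dimensionality 2.
W1 = rng.normal(size=(2 * dim, hidden))
W2 = rng.normal(size=(hidden, 2))

def psi(t_vec, q_vec):
    x = np.concatenate([t_vec, q_vec])
    h = np.maximum(0, x @ W1)                      # ReLU activation
    logits = h @ W2
    logits = logits - logits.max()                 # numerically stable softmax
    probs = np.exp(logits) / np.exp(logits).sum()  # [p(non-relevant), p(relevant)]
    return probs

p = psi(phi_t, phi_q)
```

The softmax over the 2-dimensional output yields the relevant/non-relevant probabilities that the cross-entropy loss is computed against.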
Question Selection Model
We introduce a Neural Question Selection Model (NeuQS), which selects questions with a focus on maximizing precision at the top of the ranked list.
- The main challenge in the question selection task is to predict whether a question has diverged from the query and conversation context:
  - given negative answer(s) to previous question(s), the model needs to diverge from the history
  - given a positive answer, questions on the same topic that ask for more details are preferred
NeuQS outputs a relevance score for a given query $t$, question $q$, and conversation context $h$, using:
- context representation $\phi_H(h)$
- retrieval representation $\eta(t,h,q)$
- query performance representation $\sigma(t,h,q)$
Implement:
The context representation component $\phi_H$ is implemented as follows:
- $\phi_{QA}(q,a)$ is an embedding function of a question $q$ and answer $a$.
The retrieval representation $\eta(t,h,q)\in\mathbb{R}^k$ is implemented by interpolating the retrieval scores of the query, context, and question (see Section 5.3); the scores of the top $k$ retrieved documents (from the document retrieval model) are used.
The query performance prediction (QPP) representation component $\sigma(t,h,q)\in\mathbb{R}^k$ consists of the performance prediction scores of the ranked documents at different ranking positions (for a maximum of $k$ ranked documents).
- We employ the $\sigma$ QPP model for this component [28]. We take the representations from the [CLS] layer of the pre-trained uncased BERT-Base model.
To model the function $\gamma$, we concatenate $\phi_T(t)$, $\phi_H(h)$, $\phi_Q(q)$, $\eta(t,h,q)$, and $\sigma(t,h,q)$ and feed them into a fully-connected feed-forward network with two hidden layers. We use ReLU as the activation function in the hidden layers.
Loss function: cross-entropy
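The forward pass of $\gamma$ can be sketched as below; all representation vectors, weights, and dimensions are random illustrative stand-ins, not the trained model:

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, hidden = 8, 5, 16                # illustrative dimensions

phi_t = rng.normal(size=d)             # query representation phi_T(t)
phi_h = rng.normal(size=d)             # context representation phi_H(h)
phi_q = rng.normal(size=d)             # question representation phi_Q(q)
eta   = rng.normal(size=k)             # retrieval representation eta(t,h,q)
sigma = rng.normal(size=k)             # QPP representation sigma(t,h,q)

# Two hidden layers, then a 2-way output (relevant / non-relevant).
W1 = rng.normal(size=(3 * d + 2 * k, hidden))
W2 = rng.normal(size=(hidden, hidden))
W3 = rng.normal(size=(hidden, 2))

def gamma(reps):
    x = np.concatenate(reps)           # concatenate all five representations
    h1 = np.maximum(0, x @ W1)         # ReLU, hidden layer 1
    h2 = np.maximum(0, h1 @ W2)        # ReLU, hidden layer 2
    logits = h2 @ W3
    logits = logits - logits.max()     # numerically stable softmax
    return np.exp(logits) / np.exp(logits).sum()

score = gamma([phi_t, phi_h, phi_q, eta, sigma])
```

As with BERT-LeaQuR, the 2-way softmax output is what the cross-entropy loss is applied to during training.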
Document Retrieval Model
We describe the model we use to retrieve documents given a query, conversation context, and the current clarifying question together with the user’s answer.
We use the KL-divergence retrieval model [24] based on the language modeling framework [30] with Dirichlet prior smoothing [52], where we linearly interpolate two likelihood models: one based on the original query, and one based on the questions and their respective answers.
We use the document retrieval model for two purposes:
- ranking documents after the user answers a clarifying question
- ranking documents of a candidate question as part of NeuQS (which does not see the answer)
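A toy sketch of the interpolated scoring idea: Dirichlet-smoothed language-model scores for the original query and for the question/answer terms are mixed linearly. The corpus, the weight `alpha`, and interpolating log-scores (rather than the likelihood models themselves) are illustrative simplifications, not the paper's exact formulation:

```python
from collections import Counter
import math

# Tiny made-up corpus for illustration.
collection = ["jaguar car price", "jaguar animal habitat wild"]
doc = "jaguar animal habitat".split()   # the document being scored

coll_counts = Counter(w for d in collection for w in d.split())
coll_len = sum(coll_counts.values())

def dirichlet_logprob(term, doc_terms, mu=10.0):
    # Dirichlet-smoothed term log-probability under the document model.
    tf = doc_terms.count(term)
    p_coll = coll_counts.get(term, 0) / coll_len
    return math.log((tf + mu * p_coll) / (len(doc_terms) + mu))

def score(query_terms, qa_terms, alpha=0.5):
    # Linear interpolation of the query model and the question/answer model.
    s_query = sum(dirichlet_logprob(t, doc) for t in query_terms)
    s_qa = sum(dirichlet_logprob(t, doc) for t in qa_terms)
    return alpha * s_query + (1 - alpha) * s_qa

s = score(["jaguar"], ["animal", "habitat"])
```

Here the answer terms "animal habitat" pull the score toward the animal-related document, which is the intended effect of folding clarification answers into retrieval.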
Experiment
Qulac-T: split the train/validation/test sets based on topics.
Qulac-F: split the data based on facets.
We expand Qulac to include multiple artificially generated conversation turns:
- For each instance, we consider all possible combinations of questions to be asked as the conversation context.
- The total number of unique conversational contexts is 75,200; a model should select questions for 75,200 contexts from all 907,366 candidate questions.
Note: the set of candidate clarifying questions for each multi-turn example consists of the ones that have not yet appeared in the context.
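The multi-turn expansion can be sketched with `itertools`: every combination of previously asked questions forms a context, and the candidate set for the next turn excludes questions already in the context. Treating contexts as unordered question sets and using a three-question bank are simplifying assumptions here:

```python
from itertools import combinations

questions = ["q1", "q2", "q3"]   # hypothetical question bank for one topic

# Every combination of already-asked questions forms a conversation context
# (the empty tuple is the first-turn context with no history).
contexts = []
for turn in range(len(questions)):
    contexts.extend(combinations(questions, turn))

def candidates(context):
    # Candidate questions are those that have not yet appeared in the context.
    return [q for q in questions if q not in context]
```

With three questions this yields 1 + 3 + 3 = 7 contexts; applying the same expansion across all Qulac topics is what produces the 75,200 contexts reported above.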
Results
BERT-LeaQuR is able to outperform all baselines.
- Oracle question selection: performance
  - an oracle model is aware of the answers to the questions
  - selecting the best questions (the BestQuestion model) helps the model achieve a substantial improvement
- Oracle question selection: impact of topic type and length
  - the relative improvement of BestQuestion is negatively correlated with the number of query terms, suggesting that shorter queries require clarification in more cases
  - the average ∆MRR for ambiguous topics is 0.3858, compared with an average ∆MRR of 0.2898 for faceted topics
- Question selection
  - Table 3 presents the results of the document retrieval model taking into account a selected question together with its answer
  - the obtained improvements highlight the necessity and effectiveness of asking clarifying questions in a conversational search system
- Impact of clarifying questions on facets
  - the performance for 45% of the facets is improved by asking clarifying questions, whereas the performance for 19% is worse
- Case study
  - Failure:
    - with additional information: the retrieval model is not able to take advantage of the additional information when it has no terms in common with the relevant documents (the first two rows)
    - with no additional information (the third row)
    - we think a more effective document retrieval model could potentially improve performance
  - Success:
    - the system asks open questions and gets useful feedback
    - some keywords, like “biography”, match the relevant documents
    - the question/answer contains some facet terms