A survey of consumer health question answering systems

Anuradha Welivita, Pearl Pu

First published: 27 November 2023


Consumers are increasingly using the web to find answers to their health-related queries. Unfortunately, they often struggle with formulating the questions, further compounded by the burden of having to traverse long documents returned by the search engine to look for reliable answers. To ease these burdens for users, automated consumer health question answering systems try to simulate a human professional by refining the queries and giving the most pertinent answers. This article surveys state-of-the-art approaches, resources, and evaluation methods used for automatic consumer health question answering. We summarize the main achievements in the research community and industry, discuss their strengths and limitations, and finally come up with recommendations to further improve these systems in terms of quality, engagement, and human-likeness.


In traditional web search, the user inputs a keyword-based query to a search engine, and the search engine returns a list of documents, which likely contain the answer the user is seeking. However, having to traverse through a long list of records to find the desired answer is cognitively demanding. Sometimes, users need to reformulate the query several times until what they seek matches the domain’s vocabulary. Automatic question answering (QA) addresses these problems by providing direct and concise answers to user queries expressed in natural language. Additionally, it checks spelling mistakes and reformulates the queries to reduce semantic ambiguities. However, this task is challenging as natural language is often ambiguous. Additionally, constructing a response requires a detailed understanding of the question asked, expert domain knowledge, and automatic ways to generate text using, for example, language generation models (Datla, Arora et al. 2017).

Automatic QA can be general or domain-specific, also known as open- and restricted-domain QA. Open-domain QA answers factoid or non-factoid questions from a wide range of domains, while restricted-domain QA answers questions from a specialized area, using domain-specific linguistic resources that enable more precise answers to a given question (Olvera-Lobo and Gutiérrez-Artacho 2011). Both open- and restricted-domain QA can take on either single- or multiturn conversational manners. Single-turn QA focuses on answering questions in a one-off fashion, that is, one question and one answer at a time. In this, the context of one problem is not carried over to another. Therefore, the system understands each question separately. Recently, there has been a push towards multiturn conversational QA, where users can interactively engage with a system through follow-up questions, which are dependent on the previous context. Conversational QA is more natural and reflects the way humans seek information. This survey focuses on both single-turn and conversational QA systems restricted to the consumer health domain due to their relative complexity and room for improvement, as discussed in the following sections.

The general public increasingly consults knowledge resources, especially the world wide web, prior or subsequent to a doctor’s visit, to obtain information about a disease, contraindications of a treatment, or side effects of a drug. However, they often struggle to find reliable and concise answers to their questions, partially due to the complexity of traversing the long health-related text they receive as results. Consumer Health Question Answering (CHeQA) addresses this concern by providing reliable and succinct answers to consumer health questions. This task is more challenging than open-domain QA due to the specialized jargon used in healthcare and the knowledge gap between consumers and medical professionals (Zweigenbaum 2005). Moreover, the unavailability of large-scale healthcare Q&A datasets contributes to the challenges. Thus, despite the popularity of open-domain QA systems (Chen et al. 2017; Tay, Tuan, and Hui 2018; Ferrucci et al. 2010; Huang, Choi, and Yih 2019; Zhu, Zeng, and Huang 2018), consumer health QA methods are relatively rare.

The medical QA task organized by the TREC 2017 LiveQA track (Abacha et al. 2017) was the first to provide an open benchmark to compare single-turn CHeQA systems. The task was organized in the scope of the CHeQA project of the U.S. National Library of Medicine (NLM) to address automatic answering of consumer health questions received by the U.S. NLM. It motivated a number of research groups such as Carnegie Mellon University’s Open Advancement of Question Answering (CMU-OAQA) (Wang and Nyberg 2017), East China Normal University Institute of Computer Application (ECNU-ICA) (An et al. 2017), Philips Research North America (PRNA) (Datla, Arora et al. 2017), and Carnegie Mellon University’s LiveMedQA (CMU-LiveMedQA) (Yang et al. 2017) to develop systems that focus on answering consumer health questions.

There are many studies surveying QA systems in general (Gao, Galley, and Li 2019; Mishra and Jain 2016; Bouziane et al. 2015) and studies that address biomedical or clinical QA (Sharma et al. 2015; Dodiya and Jain 2013; Bauer and Berleant 2012; Athenikos and Han 2010). A recent survey by Montenegro, da Costa, and da Rosa Righi (2019) reviews conversational agents in health. In this survey, they identify an agent application taxonomy, current challenges, and research gaps associated with healthcare conversational agents. However, most of the agents reviewed there are built for specific use cases such as virtual counseling, monitoring fitness, and assisting patients in hospitals. Hence, a comprehensive survey analyzing the approaches and resources that assist in CHeQA is lacking in the literature. Consumer health questions are different from medical professionals’ questions due to the vocabulary gap between consumers and healthcare providers. An example illustrating this scenario is shown in Table 1. An ideal system should map vague and colloquial terms such as “pimples” or “pizza-face” to the same medical concept, “acne” or “acne vulgaris,” enabling the system to provide answers accurately by mining medical text. This poses an additional challenge for systems designed to answer consumer health questions. The scope for extending those systems in terms of usability is also broader compared to clinical QA.

TABLE 1. Two example conversations showing that the same question is phrased by two different users using different terminology.

Consumer_A: I’ve been having pimples for a long time. What can I do to get rid of them?
Agent: Treatment for acne depends on how severe it is. If you just have a few blackheads, whiteheads, and spots …
Consumer_B: I’ve been having a pizza face since I was a child. What types of remedies are available?
Agent: Treatment for acne depends on how severe it is. If you just have a few blackheads, whiteheads, and spots …

To fill this gap, we select 11 single-turn and conversational QA systems developed since 2012, restricted to the domain of consumer health. We review those systems in terms of their approach, system architecture, strengths, and limitations. We include both research-based systems and commercial systems from the industry. Figure 1 summarizes our selection criteria and the systems surveyed. We also review the available resources used to address the linguistic challenges specific to this domain and the evaluation metrics utilized to measure their success. Finally, we develop a set of recommendations and guidelines for developing systems that are potentially more engaging and human-like. We target this survey primarily toward NLP and IR audiences interested in applying these technologies and recommendations for the advancement of consumer healthcare.

FIGURE 1. Selection criteria that led to choosing consumer health question answering systems for the survey. The criteria highlighted in blue lead to CHeQA approaches.


We group these 11 systems into three categories in terms of the approach followed in answering consumer health questions.

  1. Traditional information retrieval-based approaches
  2. Question similarity/entailment-based approaches
  3. Knowledge graph-based approaches

The systems that utilize each of the above approaches to answer consumer health questions are indicated in Figure 2. Some hybrid methods also exist, which combine the strengths of more than one of the above approaches (Demner-Fushman, Mrabet, and Ben Abacha 2020; Yang et al. 2017). Such systems are indicated with circles with two colors in Figure 2. The following subsections describe the above approaches and individual systems that utilize these approaches in detail.

FIGURE 2. Classification of consumer health question answering systems from research laboratories. Those circles with two colors are hybrid systems combining more than one approach to answer consumer health questions.

Table 2 summarizes both research-based and commercial CHeQA systems in terms of the document collection used, resources utilized, evaluation datasets, and performance scores.

TABLE 2. Summary of research and commercial consumer health question answering systems.

System | Principal approach used | Document collection | Other datasets or resources utilized | Evaluation datasets | Performance scores
R&D systems
CMU-OAQA (Wang and Nyberg 2017) | Traditional information retrieval | World Wide Web, Yahoo! Answers | None | TREC LiveQA 2017 medical Q&A test dataset (104 Q&A pairs) | 0.637 (Average score)
PRNA (Datla, Arora et al. 2017) | Knowledge graph | Knowledge graph composed of Wikipedia articles | NLP engine (Hasan et al. 2014) | TREC LiveQA’17 Med | 0.490 (Average score)
ECNU-ICA (An et al. 2017) | Question similarity | Yahoo! Answers, Answers.com | None | TREC LiveQA’17 Med | 0.402 (Average score)
Question entailment approach (Abacha and Demner-Fushman 2019a) | Question entailment | MedQuAD dataset (47,457 medical Q&A pairs curated from 12 trusted NIH websites) | Terrier search engine (terrier.org); Stanford Natural Language Inference (SNLI) (Bowman et al. 2015) and multiNLI (Williams, Nangia, and Bowman 2018) sentence pair datasets; Quora, Clinical-QE (Abacha and Demner-Fushman 2016), and SemEval-cQA (Nakov et al. 2016) question pair datasets | TREC LiveQA’17 Med | 0.827 (Average score), 0.311 MAP@10, 0.333 MRR@10
CMU-LiveMedQA (Yang et al. 2017) | Traditional IR + knowledge graph | TREC LiveQA’17 medical Q&A development dataset (634 Q&A pairs) | NLTK Bird toolkit (Bird 2006); and GARD Question Decomposition Dataset (Roberts, Masterton et al. 2014) | TREC LiveQA’17 Med | 0.353 (Average score)
CHiQA (Demner-Fushman, Mrabet, and Ben Abacha 2020) | Traditional IR + question similarity | MedlinePlus (medlineplus.gov) | MetaMap Lite | Simple Alexa questions (104 Q&A pairs) | 2.336 (Average score), 0.766 MAP@10, and 0.866 MRR@10
 |  |  |  | TREC LiveQA’17 Med development and test datasets (634 and 104 Q&A pairs, respectively) | 1.308 (Average score), 0.445 MAP@10, 0.516 MRR@10
enquireMe (Wong, Thangarajah, and Padgham 2012) | Question similarity | A database having 80K Q&A pairs from Yahoo! Answers | None | 150 definitional questions extracted from WebMD (Olvera-Lobo and Gutiérrez-Artacho 2011) | 94.00% (Accuracy)
 |  |  |  | 274 questions (1st dataset extended with contextual follow-up questions) | 86.86% (Accuracy)
 |  |  |  | TREC 2004 cross-domain QA dataset with 65 series of questions having 351 questions altogether (Voorhees 2004) | 87.80% (Accuracy)
Commercial systems
Cleveland Clinic | Question similarity | Existing physician-reviewed content | Unknown | Unknown | Unknown
Florence | Unknown | MedlinePlus and Wikipedia | Unknown | Unknown | Unknown
Ask WebMD | Unknown | Physician-reviewed responses in WebMD | Unknown | Unknown | Unknown
MedWhat | Unknown | Content from NIH and Center for Disease Control and Prevention (CDC), PubMed peer-reviewed journals, and Wikipedia articles with quality references | Unknown | Unknown | Unknown

Traditional information retrieval-based approaches

Information retrieval is considered to be the most dominant form of information access over unstructured data. Since most consumer health information is available as unstructured documents, this approach works efficiently to answer consumer health questions allowing consumers to query information using natural language. Traditional information retrieval-based approaches to CHeQA formulate queries by analyzing question text and retrieving candidate answers from collections of health-related documents using TF-IDF-based scoring methods. They employ neural or non-neural scoring techniques or weighted combinations to rank the candidate answers and select the best among them. Figure 3 illustrates the generic architecture or pipeline of a traditional information retrieval-based QA system.

FIGURE 3. Architecture of a traditional information retrieval-based question answering system.
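The retrieval step of such a pipeline can be illustrated with a minimal sketch. It assumes a toy three-passage collection (invented here for illustration), a smoothed IDF formula, and cosine similarity as the scoring function; none of the surveyed systems is claimed to use exactly this configuration, and the function names are hypothetical.

```python
import math
import re
from collections import Counter

# Toy collection, invented for illustration only.
DOCS = [
    "Acne treatment depends on how severe the acne is.",
    "Headache can be caused by stress or dehydration.",
    "Diabetes symptoms include increased thirst and frequent urination.",
]

def tokenize(text):
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Return per-document term counts and smoothed inverse document frequencies."""
    counts = [Counter(tokenize(doc)) for doc in docs]
    n_docs = len(docs)
    doc_freq = Counter(term for c in counts for term in c)
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return counts, idf

def weigh(term_counts, idf):
    """TF-IDF vector as a dict; terms unseen in the collection get weight 0."""
    return {t: tf * idf.get(t, 0.0) for t, tf in term_counts.items()}

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(question, docs, top_k=3):
    """Rank documents against the question; returns (doc_index, score) pairs."""
    counts, idf = build_index(docs)
    query = weigh(Counter(tokenize(question)), idf)
    scored = [(cosine(query, weigh(c, idf)), i) for i, c in enumerate(counts)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(i, score) for score, i in scored[:top_k]]
```

Because the sketch matches surface terms only, a colloquial term such as "pimples" contributes nothing toward matching a passage that says "acne", which is the vocabulary mismatch that query expansion is meant to reduce.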

Despite the many advantages of this approach, the enormous increase in health-related information on the web may pose many challenges, such as validating the reliability of the answers retrieved. Since no well-defined structure or semantics are involved in this process, the answers retrieved are often approximate. One of the significant downsides to this approach is the vocabulary mismatch between consumer health questions and medical documents. Automatic query expansion methods such as Latent Semantic Indexing or methods with a built-in automatic thesaurus can address this concern. Further research on this area can make this approach more robust and reliable.


The CMU-OAQA system (Wang and Nyberg 2017) uses a traditional information retrieval-based approach to answering both open-domain and consumer health questions. It follows three steps: (1) retrieving relevant web pages (Clue Retrieval); (2) extracting candidate answers and ranking (Answer Ranking); and (3) generating the final answer by concatenating the highest-ranked candidate answers (Answer Passage Tiling). The system retrieves relevant web pages by formulating queries using the question text and issuing them to search engines such as Bing Web Search and community QA websites such as Yahoo! Answers. Then, candidate answers (the title/body/answer tuples that represent either conceptional questions or answer texts) are extracted from the web pages and are ranked using a heuristically weighted combination of (a) optimized BM25 similarity scoring over the title and body texts; (b) an attentional encoder–decoder recurrent neural network model, which estimates the relevance of a candidate answer text given a question text; and (c) a neural dual entailment-based question paraphrase identification model, which predicts the relevance between the input question and titles of answer candidates. Finally, a subset of the highest-ranked candidate answers is selected using a greedy algorithm and concatenated to produce the final answer. This approach achieved the highest average score of 0.637 at the TREC LiveQA 2017 medical subtask, compared to the median score of 0.431. Human assessors graded the answers retrieved by the system on a 4-point Likert scale (1: incorrect; 2: incorrect but related; 3: correct but incomplete; 4: correct and complete); these grades were mapped to scores of 0–3 and averaged over the questions in the TREC LiveQA 2017 medical Q&A test dataset, which consists of 104 medical questions and associated answers.
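The ranking and tiling steps can be sketched as follows. This is only an illustration under stated assumptions: the three component scores are taken as precomputed numbers in [0, 1], the weights are hypothetical rather than the heuristic values used by the authors, and the character budget that drives the greedy selection is an assumption, since the selection criterion is not detailed here.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    text: str
    bm25: float        # lexical score, assumed already normalized to [0, 1]
    relevance: float   # stand-in for the encoder-decoder relevance estimate
    paraphrase: float  # stand-in for the question-title paraphrase score

# Hypothetical weights; the actual heuristic weights are not given in the text.
WEIGHTS = (0.4, 0.3, 0.3)

def combined_score(candidate, weights=WEIGHTS):
    """Weighted linear combination of the three component scores."""
    w_bm25, w_rel, w_para = weights
    return (w_bm25 * candidate.bm25
            + w_rel * candidate.relevance
            + w_para * candidate.paraphrase)

def tile_answer(candidates, max_chars=1000):
    """Greedily concatenate the highest-scoring candidates within a character budget."""
    ranked = sorted(candidates, key=combined_score, reverse=True)
    chosen, used = [], 0
    for candidate in ranked:
        extra = len(candidate.text) + (1 if chosen else 0)  # +1 for the joining space
        if used + extra <= max_chars:
            chosen.append(candidate.text)
            used += extra
    return " ".join(chosen)
```

A linear combination keeps each component's contribution easy to inspect, which is convenient when the weights are tuned by hand.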

Knowledge graph-based approaches

Knowledge graph-based approaches build and maintain a knowledge graph consisting of health-related information such that it preserves the medical domain-specific concept hierarchy. They extract medical concepts from the question text using a semantic parser and traverse along the knowledge graph to find the correct answer. Figure 4 illustrates the generic architecture or pipeline of a knowledge-graph-based QA system.

FIGURE 4. Architecture of a knowledge-graph-based question answering system.

The main advantage of this approach over traditional and question similarity-based methods is that knowledge graphs are better able to explain the answers produced by the system. If question attributes such as the question focus and question type can be identified correctly, this method can be more reliable than traditional and question similarity-based approaches since it retrieves answers from a prebuilt, validated knowledge graph that maintains up-to-date health information. This approach is known to perform strongly on simple questions (Chakraborty et al. 2021). However, querying a knowledge graph becomes difficult as questions grow more complex, for example when they involve aggregates, constraints, and longer relation paths. Handling such questions is considered the next set of challenges associated with knowledge-graph-based QA (Chakraborty et al. 2021).


The PRNA system (Datla, Arora et al. 2017) uses a knowledge graph composed of Wikipedia articles to answer consumer health questions. The link structure inside Wikipedia is converted into a knowledge graph where the nodes represent Wikipedia pages, hyperlinked concepts, and redirect pages, and the edges represent the relationships between them (Datla, Hasan et al. 2017). It identifies the most appropriate section that answers the question by extracting medical entities and classifying the question into one of 23 question types defined in the TREC LiveQA 2017 medical subtask. It uses an NLP engine (Hasan et al. 2014) to extract medical entities such as diseases, medications, and symptoms and normalize them so that they can be mapped to their corresponding Wikipedia pages. The feature extraction algorithm described in Sarker and Gonzalez (2015) is used to identify the question type. If the question has multiple subquestion types, it answers each subquestion by identifying the appropriate section in Wikipedia that matches the subquestion type. The PRNA system was ranked second in the TREC LiveQA 2017 medical subtask with an average score of 0.490.

Question similarity/entailment-based approaches

Question similarity/entailment-based approaches assume that the answer to a given question is the best answer corresponding to the most similar or the most entailed question that already has associated answers. Thus, instead of retrieving the best matching answer(s) for a given question, this approach retrieves similar questions that already have associated answers using either search in Q&A websites or TF-IDF-based scoring methods. Then it selects the best answer by ranking the retrieved questions based on the question similarity or the question entailment score calculated using neural or non-neural scoring techniques. Figure 5 illustrates the generic architecture or pipeline of a question similarity/entailment-based QA system.

FIGURE 5. Architecture of a question similarity/entailment-based question answering system.

Since similar questions found online or in curated databases may follow similar conversation patterns that we see in real consumer health questions, and the number of similar questions posted online or received by health-related agencies keeps on increasing, this approach has the potential to match better to the consumer’s language than traditional or knowledge graph methods. It eliminates the need to explicitly identify question attributes such as the question focus and question type as required by knowledge graph methods and solves the challenges associated with both traditional and knowledge graph methods in question understanding and answer extraction. Thus, current systems tend to focus increasingly on this approach to answer consumer health questions.


The ECNU-ICA system (An et al. 2017) uses a question similarity-based QA approach based on the hypothesis that the answer to the most similar question can better answer the original question than the others. It uses this approach to answer both open-domain and consumer health questions. First, it retrieves semantically similar candidate Q&A pairs for a given question by searching in community QA websites such as Yahoo! Answers and Answers.com. Then it uses two Parameter Sharing Long Short-Term Memory (PS-LSTM) networks trained on similar questions from Yahoo! Answers and Answers.com to semantically represent the original question and each question from the pool of candidate questions to learn the similarity between them. This representation is combined with keyword information from the questions and is fed into a simple metric function to calculate the similarity. These candidate Q&A pairs are then reranked by combining the similarity scores of each question pair obtained in the previous step with external information such as the source and the order in the community QA website from where they were retrieved. The system responds with the answer of the highest-ranked candidate Q&A pair as the best answer. Finally, it judges whether the answer is eligible by restricting the answer length to fewer than 1000 characters (as required in the TREC LiveQA 2017 medical subtask) and removing the unreadable characters from the answer text. This approach achieved an average score of 0.402 in the TREC LiveQA 2017 medical subtask.
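The reranking and eligibility check can be sketched as below. The sketch assumes the question similarity is already available as a number (it stands in for the PS-LSTM output), and the source priors, the mixing weight `alpha`, and the fallback to the next candidate when an answer is too long are all hypothetical choices made for illustration.

```python
# Hypothetical trust priors per source site; not taken from the original system.
SOURCE_PRIOR = {"yahoo_answers": 1.0, "answers_com": 0.9}

def rerank_score(similarity, source, position, alpha=0.8):
    """Mix question similarity with external information (source and rank position).

    position is the 0-based order in which the site returned the Q&A pair.
    """
    external = SOURCE_PRIOR.get(source, 0.5) / (1 + position)
    return alpha * similarity + (1 - alpha) * external

def clean(text):
    """Drop non-printable characters (note: this also removes newlines and tabs)."""
    return "".join(ch for ch in text if ch.isprintable())

def best_answer(candidates, max_len=1000):
    """Return the cleaned answer of the best-ranked eligible candidate, or None."""
    ranked = sorted(
        candidates,
        key=lambda c: rerank_score(c["similarity"], c["source"], c["position"]),
        reverse=True,
    )
    for candidate in ranked:
        answer = clean(candidate["answer"])
        if 0 < len(answer) < max_len:  # assumed fallback: skip ineligible answers
            return answer
    return None
```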

Question-entailment approach

The Question-Entailment approach, developed by Abacha and Demner-Fushman (2019a), first recognizes entailed questions that already have associated answers. A question A is said to entail a question B if every answer to question B is also a correct answer to question A. For a given question, it uses the Terrier search engine to retrieve candidate questions from a dataset (MedQuAD) with over 47,000 medical Q&A pairs curated from 12 trusted NIH websites. To improve the performance, the questions of the dataset are indexed without the associated answers. The question focus, synonyms of the question focus, the question type, and the terms that triggered the question type are also indexed with each question to avoid shortcomings of query expansion (e.g., incorrect or irrelevant synonyms, increased execution time). After retrieving the set of candidate questions, it applies a logistic regression-based RQE (Recognizing Question Entailment) classifier, which is trained on the Stanford Natural Language Inference (SNLI) (Bowman et al. 2015) and multiNLI (Williams, Nangia, and Bowman 2018) sentence pair datasets and the Quora, Clinical-QE (Abacha and Demner-Fushman 2016), and SemEval-cQA (Nakov et al. 2016) question pair datasets, to filter out the nonentailed questions and rerank the remaining candidates. Finally, the answer to the highest-ranked candidate question is retrieved and presented as the best answer. This approach obtained an average score of 0.827 over the TREC LiveQA 2017 medical Q&A test dataset, exceeding the TREC LiveQA 2017 medical subtask’s best results by 29.8%.
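The filter-and-rerank step can be sketched with a small logistic model. The features (token overlap, matching question type, matching focus), the weights, the bias, and the 0.5 threshold are all hypothetical and untrained; they stand in for the RQE classifier only to show how a logistic score can both discard nonentailed candidates and order the rest.

```python
import math

# Hypothetical, untrained parameters chosen for illustration.
WEIGHTS = [2.0, 1.5, 2.5]
BIAS = -3.0

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def features(question, candidate):
    """Token overlap (Jaccard), same question type, same question focus."""
    q_tokens = set(question["text"].lower().split())
    c_tokens = set(candidate["text"].lower().split())
    overlap = len(q_tokens & c_tokens) / len(q_tokens | c_tokens)
    same_type = 1.0 if question["type"] == candidate["type"] else 0.0
    same_focus = 1.0 if question["focus"] == candidate["focus"] else 0.0
    return [overlap, same_type, same_focus]

def entailment_probability(question, candidate):
    """Logistic score standing in for the RQE classifier's output."""
    z = BIAS + sum(w * x for w, x in zip(WEIGHTS, features(question, candidate)))
    return sigmoid(z)

def filter_and_rerank(question, candidates, threshold=0.5):
    """Drop candidates below the threshold, then sort the rest by probability."""
    scored = [(entailment_probability(question, c), c) for c in candidates]
    kept = [pair for pair in scored if pair[0] >= threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [candidate for _, candidate in kept]
```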

Hybrid approaches to CHeQA

Hybrid approaches to CHeQA combine traditional IR-based, question similarity/entailment-based, and knowledge graph-based approaches to build a complete CHeQA system. The idea is to leverage the merits of different approaches to construct a more robust and efficient QA system. For example, knowledge graphs can provide more precise answers to consumer health questions over free text. However, the natural incompleteness of knowledge graphs may limit the question scope that the system can answer. Therefore, a hybrid approach that exploits both a knowledge graph and traditional information retrieval over free text may provide a better solution for answering consumer health questions.


The Consumer Health Information and Question Answering (CHiQA) system (Demner-Fushman, Mrabet, and Ben Abacha 2020), developed by the U.S. National Library of Medicine (NLM), retrieves answers for consumer health questions by searching MedlinePlus, an online information service produced by the U.S. NLM, using the focus/foci (main topic(s) of interest such as disease, drug, and diet) and the type (task for which information is needed) of the question as search terms. A study of whether the question focus and question type of a consumer request can be used to find accurate answers from MedlinePlus and other online resources demonstrated that the extracted terms provide enough information to find authoritative answers for over 60% of the questions. Therefore, this approach was utilized in the development of CHiQA (Deardorff et al. 2017). It uses the MetaMap Lite tool to identify PICO (P: Problem/Patient/Population; I: Intervention/Indicator; C: Comparison; O: Outcome) elements in the question, and Support Vector Machine (SVM) and Long Short-Term Memory (LSTM) based approaches to identify the question focus/foci and type. Figure 6 presents the developer view of the CHiQA system, which shows the MetaMap Lite analysis of the question, the question foci and type recognized by SVM and LSTM/CNN approaches, and the answers generated. This system was evaluated on simple Alexa questions generated by MedlinePlus staff in a pilot development of Alexa skills and on the TREC LiveQA 2017 medical Q&A dataset. It achieved an average score of 1.308 over the TREC LiveQA medical questions, which is a 58.16% increase over the average score obtained by the question entailment approach by Abacha and Demner-Fushman (2019a).
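The relative improvements quoted in this survey can be checked with a short calculation. The helper name below is hypothetical; the scores are the average scores reported above (0.637 for the best TREC LiveQA 2017 entry, 0.827 for the question entailment approach, and 1.308 for CHiQA).

```python
def relative_increase(new, old):
    """Percentage increase of `new` over `old`."""
    return (new - old) / old * 100.0

# Question entailment approach versus the best TREC LiveQA 2017 medical score.
entailment_gain = round(relative_increase(0.827, 0.637), 1)

# CHiQA versus the question entailment approach.
chiqa_gain = round(relative_increase(1.308, 0.827), 2)

print(entailment_gain)  # 29.8
print(chiqa_gain)       # 58.16
```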

FIGURE 6. The developer view of the CHiQA system, showing analysis of the question, question focus, and type recognized by SVM and LSTM/CNN approaches and the answers generated using traditional information retrieval-based approach. Figure credit: Demner-Fushman (2018).


The CMU-LiveMedQA system (Yang et al. 2017) builds and maintains a knowledge graph that organizes information for each medical entity as a tree to preserve the medical domain-specific concept hierarchy, with the entity as root and its attributes as leaves. The medical entities are organized in an array. It uses the NLTK Bird toolkit (Bird 2006) and TF-IDF scores to infer the question focus (the medical entity in focus); a text classification convolutional neural network (CNN) trained on the TREC LiveQA 2017 Medical Q&A Development Dataset and the GARD Question Decomposition Dataset (Roberts, Kilicoglu et al. 2014) to infer the question type (e.g., Information, Cause, and Treatment); and a structure-aware searcher that makes use of regular expressions and a Lucene BM25 search engine to map the question focus to a medical entity, and an attribute lookup table to map the question types to entity attributes in the knowledge graph. For example, the question “How do I know if I have diabetes?” can be formulated as a query for retrieving the “symptom” attribute associated with the entity “Diabetes.” It assumes that each question has exactly one such medical entity with one or more attributes associated with it.
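The entity-and-attribute lookup described above can be sketched as follows. To keep the example self-contained, keyword patterns replace the CNN type classifier and a whole-word match replaces the TF-IDF and BM25 focus matching; the entity entries are placeholders, and the "Symptom" type label and all names are hypothetical.

```python
import re

# Each entity is the root of a small tree whose leaves are attribute texts.
# The attribute texts are placeholders, not real medical content.
ENTITIES = {
    "diabetes": {"information": "<overview>", "cause": "<causes>",
                 "symptom": "<symptoms>", "treatment": "<treatments>"},
    "acne": {"information": "<overview>", "cause": "<causes>",
             "symptom": "<symptoms>", "treatment": "<treatments>"},
}

# Attribute lookup table: question type -> entity attribute.
ATTRIBUTE_FOR_TYPE = {"Information": "information", "Cause": "cause",
                      "Symptom": "symptom", "Treatment": "treatment"}

# Keyword rules standing in for the trained question-type classifier.
TYPE_PATTERNS = [
    ("Symptom", re.compile(r"how do i know|symptom")),
    ("Cause", re.compile(r"\bcauses?\b|\bwhy\b")),
    ("Treatment", re.compile(r"treat|cure|get rid of")),
]

def question_type(question):
    text = question.lower()
    for label, pattern in TYPE_PATTERNS:
        if pattern.search(text):
            return label
    return "Information"

def question_focus(question):
    """Return the first known entity mentioned (single-entity assumption)."""
    text = question.lower()
    for entity in ENTITIES:
        if re.search(r"\b" + re.escape(entity) + r"\b", text):
            return entity
    return None

def answer(question):
    """Map the question to (entity, attribute, text), or None if no entity matches."""
    entity = question_focus(question)
    if entity is None:
        return None
    attribute = ATTRIBUTE_FOR_TYPE[question_type(question)]
    return entity, attribute, ENTITIES[entity][attribute]
```

Because `question_focus` returns a single entity, a question comparing two drugs would lose one of them, which mirrors the limitation of the single-entity assumption.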

At the TREC LiveQA 2017 medical subtask, the CMU-LiveMedQA system received an average score of 0.353, below the median score of 0.431. The authors conclude that three main factors were responsible for its low performance. (1) Since the CNN model was trained on a small dataset, it was not robust enough to accurately classify the question type. (2) Inaccurate matches of the medical entities were caused by using either the hard match or the BM25 searcher. (3) The assumption that each question has exactly one medical entity did not hold in questions on drug interactions, which usually involve comparing two entities. The authors suggest that the system’s performance can be improved by (1) training a more robust type prediction model using more training data; and (2) relaxing the single entity assumption to multiple entities.

Multiturn conversational CHeQA

All of the CHeQA approaches described so far are based on the single-turn QA setting. This assumes that consumers can compose a complicated query in natural language and receive a concise answer in just one turn. However, in most cases, because consumers lack experience in the healthcare domain, they fail to describe their needs in one shot. Thus, multiturn conversational CHeQA agents, which allow consumers to query interactively, are preferred in this context. Users often start with a vaguely expressed information need and gradually refine it over the course of the interaction sequence without having to compose complicated queries in the beginning. Although this approach is more desirable, it presents new challenges, such as managing context and resolving coreferences and ellipses. Hence, in addition to a core QA component, a conversational CHeQA agent is equipped with other strategies to manage a conversation. Compared to single-turn CHeQA approaches, multiturn conversational approaches to CHeQA are relatively limited. The following subsection describes one such method in detail.


enquireMe (Wong, Thangarajah, and Padgham 2012) is a contextual QA system, which allows users to engage in conversations to get their health questions answered. An example interaction with enquireMe is shown in Figure 7. The interaction shows how the system can manage the context and resolve coreferences in answering a set of related questions about “headache” in a conversational manner.

FIGURE 7. An example interaction with enquireMe (reads from bottom to top). Figure credit: Wong, Thangarajah, and Padgham (2012).

enquireMe follows a question similarity-based approach to answer consumer health questions. It uses a weight decay model to resolve coreferences and manage the context of the conversation. First, it extracts noun phrases (keyphrases) from the input question and assigns them weights based on how accurately their occurrences can be modeled using the Poisson distribution. The weights represent the amount of content-bearingness of the keyphrases. A decay model is used to exponentially decay the weights of keyphrases over time to maintain the interaction context, which is crucial to a contextual QA system. The weighted keyphrases extracted from current and previous inputs are used to retrieve candidate Q&A pairs using a simple structured query from a database containing over 80,000 Q&A pairs curated from community QA websites such as Yahoo! Answers. It uses a special scoring algorithm to score and rank the Q&A pairs based on the number of overlapping key phrases and other criteria such as the number of user votes and whether the pair has been previously used as a response in the conversation context. The highest-ranked Q&A pair is used to generate the final answer.
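The decay of keyphrase weights across turns can be sketched as follows. The initial weights are passed in as given numbers (the Poisson-based weighting is not reproduced), and the decay rate, the rule of keeping the larger weight when a keyphrase recurs, and the overlap-based scoring are assumptions made for illustration.

```python
import math

class ContextMemory:
    """Keeps keyphrase weights and decays them exponentially at every turn."""

    def __init__(self, decay_rate=0.5):
        self.decay_rate = decay_rate  # hypothetical value
        self.weights = {}

    def update(self, keyphrase_weights):
        """Decay earlier keyphrases by exp(-decay_rate), then merge the new turn."""
        factor = math.exp(-self.decay_rate)
        self.weights = {k: w * factor for k, w in self.weights.items()}
        for keyphrase, weight in keyphrase_weights.items():
            # Assumed merge rule: a repeated keyphrase keeps its larger weight.
            self.weights[keyphrase] = max(weight, self.weights.get(keyphrase, 0.0))

def score_pair(pair_keyphrases, context_weights):
    """Score a stored Q&A pair by the total weight of keyphrases it shares with the context."""
    return sum(w for k, w in context_weights.items() if k in pair_keyphrases)
```

With this scheme, a pair that mentions both an earlier topic and the current keyphrase outranks a pair that mentions only the current one, which is how the earlier context keeps influencing retrieval.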

In addition, to resolve coreferences, it uses an algorithm that, for each pronoun in the input question, iterates through the previously weighted keyphrases and finds the highest-weighted context word that is both a noun and a seed concept used by the system administrator to extract Q&A pairs. Thus, this approach is limited by the coverage of the list of seed concepts.
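A minimal sketch of this pronoun resolution step is given below. It assumes that the context keyphrases are already nouns (so the noun check is skipped), handles only a small hypothetical pronoun list, and substitutes the same antecedent for every pronoun in the question.

```python
import re

# Small illustrative pronoun list; a real system would need a fuller treatment.
PRONOUNS = ("it", "they", "them")
PRONOUN_PATTERN = re.compile(r"\b(" + "|".join(PRONOUNS) + r")\b", re.IGNORECASE)

def resolve_pronouns(question, context_weights, seed_concepts):
    """Replace pronouns with the highest-weighted context keyphrase that is a seed concept."""
    eligible = {k: w for k, w in context_weights.items() if k in seed_concepts}
    if not eligible:
        return question  # nothing to resolve against: coverage limit of the seed list
    antecedent = max(eligible, key=eligible.get)
    return PRONOUN_PATTERN.sub(antecedent, question)
```

The early return shows the coverage limitation directly: a context keyphrase that is not in the seed list can never be chosen as the antecedent, however highly it is weighted.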

enquireMe was evaluated on three datasets: (1) 150 definitional questions extracted from WebMD (Olvera-Lobo and Gutiérrez-Artacho 2011); (2) 274 questions including questions extended from the 1st dataset to include contextual follow-up questions; and (3) TREC 2004 cross-domain QA dataset (Voorhees 2004). It was able to achieve precision scores of 94.00, 86.86, and 87.80% on the three evaluation datasets, respectively.


In this section, we discuss commercially available applications for CHeQA. The inner workings of these systems are not easily accessible to the general public. However, we summarize the publicly shared information about them, including the knowledge sources they use and their features. We also show some example Q&A sessions conducted using these applications to understand their capabilities better.

Cleveland Clinic Health Q&A

Cleveland Clinic Health Q&A (my.clevelandclinic.org/mobile-apps/health-q-a) is a single-turn CHeQA mobile app available on the App Store. It uses existing physician-reviewed content to answer consumer health questions. It searches a database of more than 10,000 Q&A pairs for the keywords in the input question to find the best matching answer. A user can either type or speak the question into the mobile device. The app then provides a list of similar questions for which answers already exist, from which the user can select. Figure 8 shows an example of a question posed to the app and the answer received.

FIGURE 8. An example of a question posed to the Cleveland Clinic Health Q&A mobile app and the answer received. (A) The list of related questions to the question typed/spoken. (B) The answer received by selecting the first question suggested by the app.


Florence (florence.chat) is an online personal health assistant designed to answer consumer health questions and perform other tasks such as reminding the user to take medicine and keeping track of the user’s health. Users can ask questions by sending instant messages to Florence via Facebook Messenger, Skype, and Kik and receive answers. Florence uses content from MedlinePlus, the U.S. Food and Drug Administration, and Wikipedia to generate answers. Figure 9 shows an example of a conversation with Florence via Facebook Messenger.

FIGURE 9. An example of a conversation with Florence via Facebook Messenger.

Ask WebMD

Ask WebMD (www.amazon.com/WebMD-Health-Corp/dp/B01MRM361G) is a skill integrated with Amazon Alexa and Google Assistant that answers basic health-related questions (e.g., treatments for common ailments, definitions of basic diseases, and side effects of drugs) in a conversational manner. It was launched by WebMD, the leading source of health information in the United States. It draws on a broad range of physician-reviewed WebMD content on health conditions, drugs, and their side effects to answer questions. It also provides an option to request additional information, which directs the user to a WebMD page for more details. Table 3 shows an example of a conversation with Ask WebMD integrated with Amazon Alexa.

TABLE 3. An example of a conversation with Ask WebMD integrated with Amazon Alexa.

WebMD: Welcome to WebMD. I can tell you about health conditions, drugs, or even side effects. What would you like to know?
User: What is Insomnia?
WebMD: Insomnia is trouble falling asleep or staying asleep. A lack of sleep can leave you tired during the day. Medical conditions like depression, arthritis, and heartburn can cause insomnia. Stress, medicines, temperature, and noise can also keep you awake. Good sleep habits like going to bed at the same time each night can help. If that doesn’t work, your doctor can prescribe sleep medicines. Would you like to know more about this?
WebMD: Is there something else you would like to know?
WebMD: Okay, what would you like to know?
User: What are the side effects of Asprin?
WebMD: Asprin may upset your stomach or cause heartburn. You are at a higher risk for bleeding if you combine Asprin with other nonsteroidal anti-inflammatory drugs, or NSAIDs. Would you like to know more about this?
WebMD: Is there something else you would like to know?
User: No, thank you.
WebMD: Thank you for using WebMD. Goodbye.


MedWhat (medwhat.com) is a conversational assistant that can answer consumer health questions. It was developed by healthcare and data science experts from Stanford and is available as a mobile app in the Microsoft Store. Users can ask MedWhat questions either by typing or by speaking into the mobile device. It uses content from trusted sources such as NIH and the Centers for Disease Control and Prevention (CDC), peer-reviewed journals in PubMed, and Wikipedia articles that use quality references to answer questions. It provides personalized answers by taking into account the user’s health profile and medical history. It tracks and manages the context throughout the conversation, enabling users to ask follow-up questions. Table 4 shows an example of a conversation carried out with MedWhat.

TABLE 4. An example chat with MedWhat, taken from a demonstration session of the MedWhat app at Microsoft Accelerator’s Machine Learning Demo Day 2016.

MedWhat: Good afternoon User! What medical or health question can I help you with?
User: I have a cold.
MedWhat: Sorry to hear that User. But just to make sure it’s only a cold, do you also have fever?
User: No, I don’t have fever.
MedWhat: Good, if it were the flu you would also have a fever. Would you like to know how to treat your cold?
User: Yes, please.
MedWhat: Ok. I see you are 30. For an adult, a cold can be treated with over-the-counter medications or decongestants. Do not take antibiotics. However, User, I see in your health record, you have high blood pressure. Can you confirm this is correct?
User: Yes, I have high blood pressure.
MedWhat: In that case decongestants are not recommended for you because they can cause even high blood pressure. You can take over-the-counter Tylenol or get a prescription for Relenza or Tamiflu.
User: Are there natural ways to take care of my cold?
MedWhat: Yes, you can drink chicken soup, take Vitamic C, eat fruits and vegetables, drink tea and honey to increase your immune system.
User: Great, and how many Tynols can I take in a day?
MedWhat: Up to 6 pills per day or 3000 mg. Your cold symptoms should go away within a week. If you still feel sick after 7 days, you might have to see a doctor. Would you like me to follow up in 7 days to see how you feel?
User: Yes, do that please.
MedWhat: Great, I will check on you in 7 days and will monitor your condition tomorrow and throughout the week with daily follow-up questions. Feel better User.


This section discusses the resources such as the various publicly available datasets and other linguistic tools intended to assist the development of automated CHeQA approaches. They can be grouped as follows:

  1. Q&A Datasets
  2. Semantically Annotated Question Datasets
  3. Entailed Questions Datasets
  4. Ranked Q&A Datasets
  5. Spelling Correction Datasets
  6. Language-Specific Datasets
  7. Other Linguistic Tools

Figure 10 shows the summary of the different types of resources we review in this article. The following subsections describe those resources in detail.

FIGURE 10. Classification of resources used for consumer health question answering.

Q&A datasets

These datasets serve as benchmarks to train and evaluate CHeQA approaches. They can be utilized as document collections from which content can be retrieved to answer consumer health questions. Q&A datasets having semantic annotations (e.g., question focus, question type) also help in training and evaluating question understanding methods.

TREC LiveQA 2017 medical Q&A dataset

This benchmark dataset was introduced by Abacha et al. (2017) at the TREC LiveQA 2017 medical subtask. It consists of two training datasets and a testing dataset with questions and their respective reference answers. The two training datasets contain 634 Q&A pairs altogether, which are constructed from FAQs on trusted websites of the U.S. National Institutes of Health (NIH). The (sub)questions in the dataset are annotated with one of four question foci (Disease; Drug; Treatment; and Exam) and 23 question types (e.g., Treatment, Cause, Indication, and Dosage). Candidate answers for the questions in the first training dataset were retrieved by automatically matching the CHQs with the FAQs based on the question focus and type, but only the manually validated Q&A pairs were retained for training. Answers in the second training dataset were retrieved manually by librarians using PubMed and web search engines.

The testing dataset contains 104 questions received by the U.S. National Library of Medicine (NLM) along with reference answers, which are manually collected from trusted sources such as NIH websites. The (sub)questions span across five question foci (Problem; Drug Supplement; Food; Procedure Device; and Substance) and 26 question types. This dataset is accessible publicly on GitHub (github.com/abachaa/LiveQA_MedicalTask_TREC2017).

MedQuAD (Medical Question Answering Dataset)

This dataset introduced by Abacha and Demner-Fushman (2019a) contains 47,457 medical Q&A pairs generated from 12 trusted NIH websites (e.g., cancer.gov, niddk.nih.gov, GARD, MedlinePlus Health Topics). Handcrafted patterns specific to each website were used to automatically generate Q&A pairs based on the document structure and the section titles. Each question in the dataset is annotated with one of three question foci (Diseases, Drugs, and Other) and 37 question types (e.g., Treatment, Diagnosis, and Side Effects). The questions are additionally annotated with synonyms of the question focus, its UMLS Concept Unique Identifier (CUI), and its UMLS semantic type. The dataset is accessible publicly on GitHub (github.com/abachaa/MedQuAD).

Medication QA dataset

This dataset introduced by Ben Abacha et al. (2019) is a gold standard corpus for answering CHQs about medications. It consists of 674 real consumer questions received by the U.S. NLM regarding medications and associated answers extracted from websites such as MedlinePlus, DailyMed, MayoClinic, NIH or U.S. government websites, academic institutions’ websites, and other websites returned by Google search. Each question is manually annotated with the question focus (name of the drug the question is about) and type (e.g., Information, Dose, Usage, and Interaction). The dataset is publicly available on GitHub (github.com/abachaa/Medication_QA_MedInfo2019).

Semantically annotated question datasets

These datasets consist of CHQs annotated with question attributes such as question type, focus, and named entities. These datasets can be utilized to train and evaluate methods for question understanding, such as question type classifiers, focus recognizers, and question decomposition classifiers (Kilicoglu et al. 2018) that can assist in CHeQA.

NLM-CHQA corpus

This corpus introduced by Kilicoglu et al. (2018) is a two-part corpus containing semantically annotated consumer health questions. The first part, CHQA-email, consists of 1740 email requests received by the NLM customer service regarding consumer health. The second part, CHQA-web, consists of 874 relatively shorter questions posed to the MedlinePlus search engine as queries. Each (sub)question is manually annotated with its named entities, question focus and category (e.g., question focus “leg cramps” belongs to the category “Problem”; the two together are termed the question theme), and question type and trigger (e.g., the word “test” triggers the question type “Diagnosis”). This information is arranged in a representation called a question frame. Each (sub)question is allowed to have more than one focus (e.g., a reaction between two drugs). The questions in CHQA-email are associated with 33 question types (e.g., Information, Cause, Diagnosis, and Dosage), whereas the questions in CHQA-web are associated with 26 question types (created by merging some of the question types in CHQA-email that do not frequently occur in CHQA-web). The corpus is accessible publicly at bionlp.nlm.nih.gov.
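To make the question frame representation more concrete, the following Python sketch models it as a small data structure. The class and field names are assumptions chosen for illustration and do not reflect the corpus's actual file format; the example combines the “leg cramps”/“Problem” theme and the “test”/“Diagnosis” trigger mentioned above into a single hypothetical question.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Theme:
    """A question theme: the focus together with its category."""
    focus: str     # e.g. "leg cramps"
    category: str  # e.g. "Problem"

@dataclass
class QuestionFrame:
    """Illustrative container for the annotations attached to one (sub)question."""
    question_type: str  # e.g. "Diagnosis"
    trigger: str        # word that signals the question type, e.g. "test"
    themes: List[Theme] = field(default_factory=list)         # one or more foci
    named_entities: List[str] = field(default_factory=list)   # annotated entity mentions

# Hypothetical frame for a made-up question such as "Is there a test for leg cramps?"
frame = QuestionFrame(
    question_type="Diagnosis",
    trigger="test",
    themes=[Theme(focus="leg cramps", category="Problem")],
    named_entities=["leg cramps"],
)
```

Keeping `themes` as a list reflects the fact that a (sub)question may have more than one focus, for instance a question about a reaction between two drugs.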

GARD Question Decomposition Dataset

This dataset introduced by Roberts, Masterton et al. (2014) contains 1467 consumer-generated requests available on the Genetic and Rare Diseases Information Center (GARD) website regarding disease conditions. Each request is decomposed into subquestions, which are annotated with 13 different question types (e.g., Anatomy, Cause, Complication, and Diagnosis). Additionally, each request is annotated with one or more focus diseases. This process has resulted in 2937 annotated subquestions altogether. The dataset is intended to help train and evaluate automatic techniques for decomposing complex medical questions and recognizing question focus and type. The dataset is publicly available at the U.S. NLM website (lhncbc.nlm.nih.gov/project/consumer-health-question-answering).

CHQA-NER corpus

This corpus introduced by Kilicoglu et al. (2016) contains 1548 consumer health questions received by the U.S. NLM about diseases and drugs, which are manually annotated with biomedical named entities that belong to 15 broad categories (e.g., Anatomy, Problem, Diagnostic Procedure, and Drug Supplement). The dataset is intended to help in training and evaluating methods for biomedical Named Entity Recognition (NER) in CHQs and forming a basis for recognizing question types, concepts, semantic relations, and question frames. The corpus is publicly available at the U.S. NLM website (lhncbc.nlm.nih.gov/project/consumer-health-question-answering).

Entailed questions datasets

These datasets contain CHQs with an associated list of similar/entailed questions. They can be used to train and evaluate methods for identifying the entailment between two questions (as described in Abacha and Demner-Fushman (2019a)) and develop CHeQA systems that work by using question entailment or similarity to answer CHQs.

RQE (Recognizing Question Entailment) Dataset

This dataset introduced by Abacha and Demner-Fushman (2016) contains 8588 clinical question–question pairs with a label indicating whether or not the questions entail each other. The RQE test dataset contains 302 question pairs, each pair consisting of a question received by the U.S. NLM and a question from NIH FAQs. Examples of entailed and nonentailed question pairs from the RQE test dataset are shown in Table 5. The dataset is publicly accessible on GitHub (github.com/abachaa/RQE_Data_AMIA2016).

TABLE 5. Entailed and nonentailed question pairs from RQE test dataset (Abacha and Demner-Fushman 2016).

Entailed question pair:
– sepsis. Can sepsis be prevented. Can someone get this from a hospital? (CHQ)
– How is sepsis treated? (FAQ)
Nonentailed question pair:
– sepsis. Can sepsis be prevented. Can someone get this from a hospital? (CHQ)
– Have any medicines been developed specifically to treat sepsis? (FAQ)

Ranked Q&A datasets

These datasets contain CHQs with an associated list of answers ranked according to relevance/accuracy. They help in training and evaluating methods that filter and rank retrieved answers for a given question.

MEDIQA 2019 QA dataset

This dataset introduced by Ben Abacha, Shivade, and Demner-Fushman (2019) consists of two training sets containing medical questions and the associated answers retrieved by CHiQA (chiqa.nlm.nih.gov). The first training set consists of 104 CHQs covering different types of questions about diseases and drugs (TREC LiveQA 2017 medical test questions) and the associated answers. The second training set contains 104 simple questions about the most frequent diseases (dataset named Alexa) and the associated answers. The validation and testing datasets consist of similar types of questions about diseases and drugs, and their answers generated by CHiQA (Demner-Fushman, Mrabet, and Ben Abacha 2020). Each answer in the training, validation, and testing datasets is annotated with the system rank, which corresponds to CHiQA’s rank, and the reference rank, which corresponds to the correct rank. The answers in training and validation datasets are annotated additionally with the reference score, which corresponds to the manual judgment/rating of the answer (4: Excellent, 3: Correct but Incomplete, 2: Related, 1: Incorrect). The dataset is accessible publicly on GitHub (github.com/abachaa/MEDIQA2019/tree/master/MEDIQA_Task3_QA).

Spelling correction datasets

These datasets contain CHQs that are annotated and corrected for spelling errors. They can be utilized to train and evaluate methods for detecting and correcting common spelling errors in consumer language.

CHQA Spelling Correction Dataset

This dataset introduced by Kilicoglu et al. (2015a) contains 472 CHQs received by the U.S. NLM, which are manually annotated and corrected for spelling errors (orthographic and punctuation errors). One thousand and eight spelling errors are annotated on a total of 1075 tokens. It also contains annotations on whether the error occurred in a focus element or in an element important for extracting the semantic frame of the question. The dataset is publicly available at the U.S. NLM website (lhncbc.nlm.nih.gov/project/consumer-health-question-answering).

Language-specific datasets

Language-specific datasets that assist in CHeQA contain CHQs or Q&A pairs in a specified language. They can be used to develop language-specific CHeQA systems.


This dataset introduced by Guo, Na, and Li (2018) is a Chinese health question corpus annotated according to a two-layered classification schema consisting of 29 question types (e.g., Diagnosis, Treatment, Anatomy and physiology, Epidemiology, and Healthy lifestyle). The dataset consists of two parts: the first part contains 2000 questions related to hypertension; and the second part includes 3000 questions, which are randomly selected from five Chinese health websites within six broad sections: internal medicine, surgery, obstetrics and gynecology, pediatrics, infectious diseases, and traditional Chinese medicine. The questions are manually annotated with 7101 tags altogether according to a well-defined classification schema and annotation rules that relate subtopics with question types. The dataset is accessible at the Qcorp website (www.phoc.org.cn/healthqa/qcorp).

Bahasa Indonesia consumer health Q&A corpus

This corpus introduced by Hakim et al. (2017) contains 86,731 consumer health Q&A pairs, collected from five Indonesian Q&A websites in the health domain, in which doctors provide answers. Each question is annotated with one of 13 question categories based on the medical specialization to which a question belongs (e.g., Obstetrics and Gynecology, Nutrition, and General Health). The questions are classified using two complementary approaches: a dictionary-based approach and a supervised machine learning approach. Unfortunately, an online link to this resource cannot be found.

Other CHeQA resources


Unified Medical Language System (UMLS) is a set of files and software that brings together many health and biomedical vocabularies and standards to enable interoperability between computer systems (Bodenreider 2004). The UMLS comprises three parts: Metathesaurus, Specialist Lexicon, and Semantic Network. The Metathesaurus, the largest component of the UMLS, is a large biomedical thesaurus organized by concept or meaning that links similar names for the same concept from nearly 200 different vocabularies. The Specialist Lexicon is a large syntactic lexicon of general English that includes many biomedical terms. The Semantic Network consists of a set of broad Semantic Types that provide a consistent categorization of concepts represented in the UMLS Metathesaurus and a set of useful Semantic Relations between Semantic Types. Previous studies (Tolentino et al. 2007; Kilicoglu et al. 2015b; Jimeno Yepes and Aronson 2012; Stevenson and Guo 2010) have utilized the UMLS as a domain-specific source of dictionary terms and meanings to correct spelling errors and disambiguate ambiguous words in biomedical text (e.g., the word “cold” can have several possible meanings including “common cold” (disease), “cold sensation” (symptom), and “cold temperature” (symptom), according to the UMLS Metathesaurus). The UMLS also includes a Consumer Health Vocabulary (CHV), which is intended to help consumer health applications translate technical terms into consumer-friendly language. The UMLS resources can be publicly accessed via www.nlm.nih.gov/research/umls.


Spell Checker for Consumer Language (CSpell) (Lu et al. 2019) is a generic, configurable, real-time, open-source, and distributable standalone tool intended to correct spelling errors in CHQs. It can handle various spelling errors, including nonword errors, real-word errors, word boundary infractions, punctuation errors, and combinations of the above. It combines dual embeddings within Word2Vec for context-dependent corrections with dictionary-based corrections in a two-stage ranking system. It also has splitters and handlers to correct word boundary infractions. It has achieved F1 scores of 80.93% and 69.17% for spelling error detection and correction, respectively. The CSpell software and its testing dataset are available at umlslex.nlm.nih.gov/cSpell.


This section summarizes the evaluation methods used in the literature to assess the performance of CHeQA approaches. They can be used as standard metrics to evaluate single-turn and conversational approaches to answer CHQs.

Evaluation of single-turn CHeQA approaches

Human judgment was used to evaluate the performance of the single-turn CHeQA approaches submitted to the TREC LiveQA 2017 medical subtask. The task’s testing dataset to assess those approaches contains 104 CHQs received by the U.S. NLM. The questions were chosen such that they cover a wide range of question types (26) and have a slightly different distribution than the training questions so that the scalability of the approaches can be evaluated. The subquestions and the question focus and type annotations were not provided initially to the participants. Reference answers to the questions were collected manually from trusted sources such as NIH websites. Question paraphrases/interpretations were then generated by assessors from the National Institute of Standards and Technology (NIST) after reading both the original question and the reference answers. Question paraphrases, along with the reference answers, were used to assess the participants’ responses.

An answer produced by an approach was stored only if it met a time limit of 1 min and a length limit of 1000 characters. The stored answers were assessed by human assessors (one assessor per question) from NIST on a 4-point Likert scale (1: incorrect; 2: incorrect but related; 3: correct but incomplete; 4: correct and complete). The following scoring scheme was used to compute seven measures for comparing the performance of the approaches in the task.

  • avgScore [0–3 range]: the average score over all questions after transferring 1–4 level grades to 0–3 scores, treating a 1-level grade answer the same as a nonanswered question. This is the main score used to rank the approaches.
  • succ@i+: the number of questions with score i or above (i ∈ {2, 3, 4}) divided by the total number of questions.
  • prec@i+: the number of questions with score i or above (i ∈ {2, 3, 4}) divided by the number of questions answered by the system.
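The seven measures listed above (avgScore plus succ@i+ and prec@i+ for i = 2, 3, 4) can be computed as in the following minimal Python sketch. The function name and the convention of representing an unanswered question as `None` are assumptions made for illustration; this is not the official TREC scoring script, and it assumes a nonempty list of grades.

```python
from typing import Dict, List, Optional

def liveqa_measures(grades: List[Optional[int]]) -> Dict[str, float]:
    """grades has one entry per test question: a 1-4 grade, or None if unanswered."""
    total = len(grades)
    answered = [g for g in grades if g is not None]
    # Map grades 1-4 to scores 0-3; an unanswered question scores 0, like a grade of 1.
    scores = [(g - 1) if g is not None else 0 for g in grades]
    measures = {"avgScore": sum(scores) / total}
    for i in (2, 3, 4):
        hits = sum(1 for g in answered if g >= i)
        measures[f"succ@{i}+"] = hits / total  # relative to all questions
        measures[f"prec@{i}+"] = hits / len(answered) if answered else 0.0  # relative to answered ones
    return measures

# Five questions, one of which the system did not answer.
example = liveqa_measures([4, 3, 1, None, 2])
```

For this example the scores are 3, 2, 0, 0, and 1, so avgScore is 1.2, succ@2+ is 3/5, and prec@2+ is 3/4, which shows how the two ratios diverge once some questions go unanswered.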

The above measures were used to assess the top answer retrieved for each test question. The following metrics are also used to evaluate the quality of top-k answers retrieved by CHeQA systems (Abacha and Demner-Fushman 2019a).

  • Mean average precision (MAP): the mean of the average precision scores over all questions. This is given by Equation (1), where $Q$ is the number of questions and $\mathrm{AvgP}_i$ is the average precision of the $i$th question.

    $$\mathrm{MAP} = \frac{1}{Q} \sum_{i=1}^{Q} \mathrm{AvgP}_i \tag{1}$$

    The average precision of a question is calculated as given by Equation (2), where $K$ is the number of correct answers and $\mathrm{rank}_n$ is the rank of the $n$th correct answer.

    $$\mathrm{AvgP} = \frac{1}{K} \sum_{n=1}^{K} \frac{n}{\mathrm{rank}_n} \tag{2}$$

  • Mean reciprocal rank (MRR): the average of the reciprocal ranks over all questions. This is given by Equation (3), where $Q$ is the number of questions and $\mathrm{rank}_i$ is the rank of the first correct answer for the $i$th question.

    $$\mathrm{MRR} = \frac{1}{Q} \sum_{i=1}^{Q} \frac{1}{\mathrm{rank}_i} \tag{3}$$
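As a worked illustration of the MAP and MRR definitions above, the following Python sketch computes both from, for each question, the list of ranks at which correct answers were retrieved. The function names and the input representation are assumptions; treating a question with no correct answer as contributing zero is also an assumption of this sketch rather than something the definitions specify.

```python
from typing import List

def average_precision(correct_ranks: List[int]) -> float:
    """AvgP = (1/K) * sum over n of n / rank_n, where rank_n is the rank of the nth correct answer."""
    ranks = sorted(correct_ranks)
    if not ranks:
        return 0.0  # assumption: no correct answer retrieved
    return sum(n / rank for n, rank in enumerate(ranks, start=1)) / len(ranks)

def mean_average_precision(per_question_ranks: List[List[int]]) -> float:
    """MAP = mean of the per-question average precision values."""
    return sum(average_precision(r) for r in per_question_ranks) / len(per_question_ranks)

def mean_reciprocal_rank(per_question_ranks: List[List[int]]) -> float:
    """MRR = mean of 1 / (rank of the first correct answer) over all questions."""
    total = sum(1 / min(r) for r in per_question_ranks if r)
    return total / len(per_question_ranks)

# Question 1 has correct answers at ranks 1 and 3; question 2 has one at rank 2.
ranks = [[1, 3], [2]]
map_score = mean_average_precision(ranks)
mrr_score = mean_reciprocal_rank(ranks)
```

Here the average precision values are (1/1 + 2/3)/2 = 5/6 and 1/2, giving a MAP of 2/3, while the reciprocal ranks are 1 and 1/2, giving an MRR of 0.75.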

Evaluation of multiturn CHeQA approaches

The input to a contextual QA system such as enquireMe (Wong, Thangarajah, and Padgham 2012) is a sequence of evolving expressions related to some common information need. The evaluation methods should test how well such approaches track the context from one question to the next and resolve coreferences to maintain a coherent and focused dialog. The TREC 2004 QA dataset (Voorhees 2004), which consists of 65 series of questions, where each question in a series asks for some information regarding a common target, is meant to evaluate these abilities in open-domain contextual QA systems. Table 6 shows a sample question series from this dataset.

TABLE 6. A sample question series from TREC 2004 QA task track (Voorhees 2004).

Hale Bopp comet
FACTOID: When was the comet discovered?
FACTOID: How often does it approach the earth?
LIST: In what countries was the comet visible on its last turn?

However, to the best of our knowledge, there are no publicly available datasets comprising series of related questions regarding consumer health. To overcome this limitation, the authors of enquireMe extended the one-off questions from the definitional questions dataset used in Olvera-Lobo and Gutiérrez-Artacho (2011) with contextual follow-up questions to evaluate the contextual QA ability of their system. Questions on medical conditions were extended with the questions “What causes it?” and “What are its treatments?”. The remaining questions on medical procedures, treatment options, medical devices, and drugs were extended with the question “What are its uses?”. A human judgment-based evaluation approach similar to the process followed in the TREC LiveQA 2017 medical subtask can then be applied to judge the appropriateness of the answers generated.

Future conversational CHeQA approaches may gain capabilities such as asking follow-up questions to refine user queries, engaging users in mixed-initiative dialogs, and generating emotion-aware, human-like responses. Evaluating such approaches may require the methods used to evaluate existing task-oriented and chit-chat-based dialog agents. We direct readers to the survey by Gao, Galley, and Li (2019), which discusses human- and simulation-based evaluation methods for such agents.


This section discusses the limitations of existing CHeQA approaches and how addressing them will help people without medical knowledge access health information more naturally and intuitively.

Use of large and pre-trained language models

The introduction of Large Language Models (LLMs) such as GPT-3 (Brown et al. 2020), GPT-4 (OpenAI 2023), PaLM (Chowdhery et al. 2023), LaMDA (Thoppilan et al. 2022), and LLaMA (Touvron et al. 2023), and of Pre-trained Language Models (PLMs) such as BERT (Devlin et al. 2019), RoBERTa (Liu et al. 2019), T5 (Raffel et al. 2020), ALBERT (Lan et al. 2019), and XLNet (Yang et al. 2019), has substantially advanced the state of the art in a number of NLP tasks, including QA. One can use few-shot or even zero-shot prompting to get LLMs to answer consumer health questions, whereas PLMs may need to be fine-tuned for CHeQA on domain-specific but comparatively smaller datasets. Both kinds of models have shown high performance on QA tasks, as evidenced by the following studies.

A recent study by Beilby and Hammarberg (2023) examined the use of ChatGPT (powered by GPT-3.5) to answer patient questions about fertility. Expert evaluation of the responses found that ChatGPT generates high-quality answers to patient questions with little evidence of commercial bias, which suggests that ChatGPT may be a useful tool for patients seeking factual and unbiased information. A cross-sectional study by Ayers et al. (2023), which evaluated ChatGPT’s responses to public health questions, reports that ChatGPT consistently provided evidence-based answers.

All the top-scoring systems on the SQuAD 2.0 (Rajpurkar et al. 2016) and CoQA (Conversational Question Answering) (Reddy, Chen, and Manning 2019) leaderboards for open-domain QA are based on BERT. These systems are rapidly approaching human performance on the SQuAD and CoQA datasets. Wen et al. (2020) adapt BERT for clinical why-question answering (why-QA): they train BERT with varying data sources to perform SQuAD 2.0 style why-QA on clinical notes and show that, with sufficient domain customization, BERT can achieve accuracy close to 70% on clinical why-questions. Lee et al. (2020) propose BioBERT, a domain-specific language model obtained by pretraining BERT on large-scale biomedical corpora to facilitate biomedical text mining. It shows close to 12% MRR improvement over other state-of-the-art models for biomedical QA.

The above work implies that with proper prompting and/or sufficient fine-tuning, both LLMs and PLMs can be adapted for CHeQA even with a limited amount of training data. This is a significant advantage over other approaches, since it requires less training time and less training data. Because of their ability to generalize, these models can be particularly well suited to consumer health questions, which are often ill-formed and may use colloquial terms to describe medical concepts. Existing CHeQA systems sometimes fail to understand consumer health questions that include such colloquial terms. Figure 11A illustrates such a situation: the CHiQA system fails to interpret the question “How to get rid of pimples?” because it does not include any terminology that matches the medical concepts in its document collection. In contrast, the same question phrased with the correct medical terms, “How to treat acne?”, retrieves correct results, as seen in Figure 11B. We believe that the use of LLMs and PLMs with sufficient fine-tuning can help CHeQA systems generalize better to consumers’ terminology.

FIGURE 11. Two example situations: (A) the CHiQA system fails to correctly interpret the consumer health question “How to get rid of pimples?”, which does not include proper medical terms, and (B) the system gives the correct results when the same question is rephrased as “How to treat acne?”, which includes proper medical terminology.

However, a risk associated with these systems is the inaccuracy of the generated answers. A study by Hulman et al. (2023), which compared the answers given by ChatGPT to diabetes-related questions against answers given by humans, reports that ChatGPT’s answers to two out of 10 questions contained misinformation. In this respect, the quality of the answers provided by the more traditional CHeQA approaches described in this survey, whose answers come from validated scientific content, may be higher than that of answers provided by systems based on generic LLMs or on PLMs fine-tuned on smaller sets of domain-specific data. Thus, future work should carefully evaluate the ability of LLMs and PLMs to generate accurate and reliable answers to CHQs compared to traditional CHeQA approaches.

Mixed-initiative conversational QA

Mixed-initiative interaction is a computer–human interaction in which either the computer or the human can take the initiative and decide which step to take next (Allen, Guinn, and Horvitz 1999). It is a crucial property of effective dialog systems that interact seamlessly with humans to perform complex tasks (Hearst et al. 1999). It enables both the system and the user to interrupt a conversation and ask questions to clarify anything unclear. It is especially important for CHeQA systems, mainly because of the linguistic gap between consumer health vocabulary and medical vocabulary.

Consumer health questions contain many ambiguities due to consumers’ lack of knowledge and experience in the healthcare domain. A consumer may (1) use ambiguous terms, which have more than one meaning, for example, the term “cold” can have several possible meanings, including “common cold” (disease), “cold sensation” (symptom), and “cold temperature” (symptom), according to the UMLS Metathesaurus (Humphreys et al. 1998); (2) describe something in their own words without using the correct medical term; or (3) not clearly convey what information they require (e.g., causes, prevention, treatment options, or maintenance) regarding a medical condition. This underlines the importance of asking follow-up questions to clarify such ambiguities before returning an answer. Because the returned answer, or a clarification question asked by the system, may itself be technically complex, it is important that the user can also interrupt the conversation and ask about anything they do not understand. The conversation with the commercial CHeQA agent MedWhat, illustrated in Table 4, is an example of a mixed-initiative conversation in which both the user and the agent interrupt the conversation to obtain additional information and clear up doubts.

Enabling this type of interaction requires, in addition to context management, co-referencing, and ellipsis resolution capabilities, additional components such as a dialog manager. The typical architecture of a mixed-initiative conversational QA agent is shown in Figure 12. The dialog manager is responsible for understanding the user’s intent, tracking the dialog state, and deciding the next response based on the dialog policy. At each turn, the agent receives a natural language utterance as input and selects an action as output. The action space may consist of a set of questions for clarifying any ambiguity detected in the user utterance or requesting any missing information, and an action for providing an answer to a question. This typically reflects a task-oriented dialog system, where the dialog is conducted to assist users in obtaining a precise answer to a question by enabling them to convey their exact information need effectively through follow-up questions (Gao, Galley, and Li 2019). For example, KB-InfoBot (Dhingra et al. 2017) is an end-to-end trainable task-oriented dialog agent for querying a movie knowledge base using natural language. It can ask users easy-to-answer questions to help them search in the knowledge base. An example of an interaction with this bot is shown in Figure 13.

FIGURE 12. The typical architecture of a mixed-initiative conversational question answering agent.

FIGURE 13. An example interaction between a user looking for a movie and the KB-InfoBot. Figure credit: Dhingra et al. (2017).
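The action selection performed by the dialog manager described above can be illustrated with a minimal rule-based sketch. It is not the architecture of Figure 12 or of any surveyed system: the slot names, the `KNOWN_TOPICS` inventory (apart from the senses of “cold” quoted earlier), and the keyword matching are assumptions made purely for illustration, and a real agent would use learned intent detection and a learned dialog policy.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical topic inventory: topic -> candidate senses.
# The senses of "cold" follow the example given in the text.
KNOWN_TOPICS = {
    "cold": ["common cold (disease)", "cold sensation (symptom)",
             "cold temperature (symptom)"],
    "diabetes": ["diabetes"],
}
# Information needs named in the text.
INFO_NEEDS = ("causes", "prevention", "treatment", "maintenance")


@dataclass
class DialogState:
    """Slots filled so far across turns (illustrative slot names)."""
    topic: Optional[str] = None
    sense: Optional[str] = None
    info_need: Optional[str] = None


def track_state(state, utterance):
    """Fill empty slots from the latest utterance using toy keyword matching."""
    for token in re.findall(r"[a-z]+", utterance.lower()):
        if state.topic is None and token in KNOWN_TOPICS:
            state.topic = token
            senses = KNOWN_TOPICS[token]
            if len(senses) == 1:  # unambiguous topic: sense is known
                state.sense = senses[0]
        if state.info_need is None and token in INFO_NEEDS:
            state.info_need = token
    return state


def select_action(state):
    """Dialog policy: return (action_type, content) for the next system turn."""
    if state.topic is None:
        return ("request", "What condition or symptom is your question about?")
    if state.sense is None:
        options = "; ".join(KNOWN_TOPICS[state.topic])
        return ("clarify", f"By '{state.topic}', do you mean: {options}?")
    if state.info_need is None:
        return ("request", "Are you asking about " + ", ".join(INFO_NEEDS) + "?")
    return ("answer", f"retrieve {state.info_need} for {state.sense}")
```

For an utterance such as “What is the treatment for a cold?”, this policy first selects a clarification action, because “cold” has several senses, and selects the answer action only once the sense slot has been filled.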

Other strategies have also been developed in the recent past to give QA systems the ability to ask follow-up questions. For example, Wu, Li, and Lee (2015) proposed an Entropy Minimization Dialog Management (EMDM) strategy to facilitate querying a knowledge base using natural language. Table 7 shows an example of an interaction between a user looking for a song and an agent that uses the EMDM strategy to answer. To narrow down the search space, the agent always requests the value of the attribute that has maximum entropy over the remaining entries in the knowledge base. Such techniques can be readily utilized by CHeQA systems to make the QA experience more robust and engaging.

TABLE 7. An example interaction between a user looking for a song and an agent that uses the Entropy Minimization Dialog Management (EMDM) strategy (Wu, Li, and Lee 2015).

System: What can I do for you?
User: I would like a song by Maggie Chiang
System: What kind of song?
User: A popular one
System: Who is the lyricist of that song?
User: Peggy Hsu
System: On what album does that song appear?
User: It is on “My Room”
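The attribute-selection rule described for EMDM, asking for the attribute whose value has maximum entropy over the remaining entries, can be sketched as follows. This is a simplified reading of that one sentence, not the implementation of Wu, Li, and Lee (2015); the function names and the toy knowledge base with placeholder values are assumptions.

```python
import math
from collections import Counter


def attribute_entropy(entries, attribute):
    """Shannon entropy (in bits) of an attribute's values over the entries."""
    counts = Counter(entry[attribute] for entry in entries)
    total = len(entries)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def next_attribute_to_ask(entries, unknown_attributes):
    """Pick the not-yet-specified attribute with maximum entropy."""
    return max(unknown_attributes, key=lambda a: attribute_entropy(entries, a))


def filter_entries(entries, attribute, value):
    """Keep only the entries consistent with the user's answer."""
    return [e for e in entries if e[attribute] == value]


# Toy knowledge base with placeholder values (purely illustrative).
songs = [
    {"title": "s1", "lyricist": "L1", "album": "A1", "genre": "pop"},
    {"title": "s2", "lyricist": "L2", "album": "A1", "genre": "pop"},
    {"title": "s3", "lyricist": "L3", "album": "A2", "genre": "pop"},
    {"title": "s4", "lyricist": "L4", "album": "A2", "genre": "rock"},
]

attribute = next_attribute_to_ask(songs, ["lyricist", "album", "genre"])
remaining = filter_entries(songs, attribute, "L3")
```

In this toy knowledge base, the lyricist takes four distinct values (2 bits of entropy), the album two equally likely values (1 bit), and the genre is skewed (about 0.81 bits), so the agent asks for the lyricist first. Any answer then leaves a single candidate, which is why a high-entropy attribute narrows the search space fastest.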

Empathetic responses

A recent trend towards making domain-specific conversational agents more engaging is embedding more human-like properties such as personality traits and emotional intelligence. Recent work focuses more on such social qualities of domain-specific conversational agents than on their functional aspects (Xu et al. 2017; Liao et al. 2016; Yan et al. 2017). For example, Xu et al. (2017) propose a social media customer service agent capable of generating responses that show empathy to help users cope with emotional situations. Their proposal is motivated by a content analysis of social media customer requests, which reveals that more than 40% of the requests are emotional. Table 8 shows examples of empathetic responses generated by this agent for emotional user inputs.

TABLE 8. Examples of emotional responses generated by the social media customer service agent (Xu et al. 2017).

Customer: Your customer service is horrible! You don’t even know how to deliver packages!
Agent: I’m sorry to hear that. What was the delivery date provided in your order?
Customer: Can’t wait to travel with you next week for the 1st time of my life with you 😉
Agent: We can’t wait to see you! We’re excited to have you on board with us soon!

A study conducted to evaluate medical students’ emotional intelligence in India (Sundararajan and Gopichandran 2018) shows that positive emotions such as empathy, comfort, and rapport have a positive influence on the doctor–patient relationship. It has also been found that emotion-awareness increases user satisfaction and enhances the system–user interaction (McDuff and Czerwinski 2018). The ability to identify emotions and respond in an empathetic manner makes the agents more engaging and human-like. Also, to compensate for potential errors that can happen while answering consumer questions, a reply such as “I’m sorry” can make consumers feel less frustrated with the agent. Thus, embedding empathy into the design of CHeQA systems can be useful for the acceptance and success of these technologies.

Generating emotion-aware and empathetic responses has gained increasing attention in research. For example, Ghandeharioun et al. (2019) introduce EMMA, an EMotion-aware mHealth Agent, which provides emotionally appropriate micro-activities for mental wellness in an empathetic manner. When suggesting micro-interventions, they use a rule-based approach with scripted, emotion-enriched phrases that are appropriate for the user’s mood. The results of a 2-week-long human-subject experiment with 39 participants show that EMMA is perceived as likable. Xie, Svikhnushina, and Pu (2020) describe an end-to-end Multi-turn Emotionally Engaging Dialog model (MEED), capable of recognizing emotions and generating emotionally appropriate responses. They use a GRU-based Seq2Seq dialog model consisting of a hierarchical mechanism to track the conversation history combined with an additional emotion RNN to process the emotional information in each history utterance. Huo et al. (2020) introduce TERG, a Topic-aware Emotional Response Generator, which performs well in generating emotional responses relevant to the discussed topic. They utilize an encoder–decoder model with two modules: one to control the emotion of the response, and the other to enhance topic relevance. TERG shows substantial improvements over several baseline methods in both automatic and manual evaluation. Such approaches can be adopted by both single-turn and conversational QA systems to deliver answers in an emotionally appropriate and empathetic manner. Automated techniques can be designed to separately identify emotional and informational requests made by consumers and route these requests to separate modules to be handled appropriately. However, the lack of large-scale doctor–patient or consumer health-related emotion datasets limits the use of such approaches in the consumer health domain. The development of emotion-labeled datasets of consumer health-related conversations will facilitate future research on embedding empathy in consumer health QA systems.
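The idea of separately identifying emotional and informational requests and routing them to dedicated modules can be sketched as below. The cue lexicons, the module names, and the keyword matching are all hypothetical placeholders; given the dataset limitations noted above, a practical system would more likely rely on a classifier trained on emotion-labeled consumer health data.

```python
import re

# Hypothetical cue lexicons, for illustration only.
EMOTION_CUES = {"worried", "scared", "afraid", "anxious", "frustrated", "upset"}
QUESTION_CUES = {"what", "how", "why", "when", "which", "can", "should", "does"}


def classify_request(utterance):
    """Flag whether an utterance carries emotional and/or informational content."""
    tokens = set(re.findall(r"[a-z']+", utterance.lower()))
    return {
        "emotional": bool(tokens & EMOTION_CUES),
        "informational": bool(tokens & QUESTION_CUES)
        or utterance.strip().endswith("?"),
    }


def route(utterance):
    """Return the (hypothetical) modules that should handle the utterance."""
    labels = classify_request(utterance)
    modules = []
    if labels["emotional"]:
        modules.append("empathy_module")
    if labels["informational"]:
        modules.append("qa_module")
    return modules or ["fallback_module"]
```

An utterance can be routed to both modules, so that an answer retrieved by the QA module is delivered together with an empathetic acknowledgement rather than instead of one.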

Controllability and interpretability

Neural network approaches are more widely used for conversational QA than traditional rule-based methods because they generalize and adapt better. An inherent limitation, however, is that the generated responses are unpredictable and difficult to control. Several neural response generation approaches attempt to gain control over the generated responses by conditioning them on manually specified dialog acts or emotion labels (Zhou et al. 2017; Zhou and Wang 2017; Hu et al. 2018; Song et al. 2019) or by using loss functions based on heuristics such as minimizing or maximizing affective dissonance between prompts and responses (Asghar et al. 2018). These models claim to generate more appropriate responses than purely data-driven models. But the primary concern with such manually specified labels and handcrafted heuristics is their practicality.

Xu, Wu, and Wu (2018) attempt to avoid the need to explicitly condition the response on a manually specified label by using a joint network of dialog act selection and response generation. They first select a dialog act from a policy network according to the dialog context and feed that into the generation network, which generates a response based on both dialog history and the input dialog act. It is thus possible to generate more controlled and interpretable responses without the need for manually crafted rules.
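The two-stage control flow described above, in which a dialog act is first selected from the context and the response is then generated conditioned on it, can be sketched with simple stand-ins. The rules and templates below are hypothetical substitutes for the policy and generation networks of Xu, Wu, and Wu (2018); the sketch only shows how exposing the selected act makes each response inspectable.

```python
# Hypothetical response templates, one per dialog act.
TEMPLATES = {
    "inform": "Here is what I found for your question: '{last}'",
    "ask_clarification": "Could you tell me more about '{last}'?",
    "acknowledge": "Thank you for explaining. What would you like to know?",
}


def select_dialog_act(history):
    """Stand-in for the policy network: choose an act from the dialog context."""
    last = history[-1].strip()
    if last.endswith("?"):
        return "inform"
    if len(last.split()) < 4:  # very short statement: likely underspecified
        return "ask_clarification"
    return "acknowledge"


def generate_response(history, act):
    """Stand-in for the generation network: conditioned on history and act."""
    return TEMPLATES[act].format(last=history[-1].strip())


def respond(history):
    """Return the selected act together with the response it produced."""
    act = select_dialog_act(history)
    return act, generate_response(history, act)
```

Because `respond` returns the act alongside the text, a developer can log, audit, or override the act before generation, which is the sense in which such a design is more controllable than a single end-to-end generator.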

It is vital in the healthcare domain to generate explainable responses to avoid potential mishaps and the risk of generating an inappropriate response. A survey by Tjoa and Guan (2019) defines explainability or interpretability as the ability to (1) explain the decisions made, (2) uncover patterns within the inner mechanism of an algorithm, and (3) present the system with coherent models or mathematics. Unfortunately, the black-box nature of neural models remains unresolved, and many machine decisions are poorly understood (Tjoa and Guan 2019). Both controllability and interpretability, when embedded into deep learning methods, would better establish accountability and increase CHeQA systems’ reliability.

Other concerns

Reliability versus abundance

Most of the CHeQA approaches surveyed in this article use the World Wide Web and community QA forums such as Yahoo! Answers, Answers.com, and Quora as knowledge sources to answer consumer health questions. The abundance of health-related articles and previously answered consumer health questions is the most likely reason for their selection. However, the credibility of these sources is often unknown. The question entailment approach (Abacha and Demner-Fushman 2019a), which has achieved the current best average score at the TREC LiveQA 2017 medical subtask, highlights that restricting the answer sources to only reliable collections (as they have done) improves the QA performance since such sources contain more relevant answers to the questions asked. Also, Abacha and Demner-Fushman (2019b), in their work studying the factors behind the complexity of consumer health questions, provide empirical evidence supporting the role of reliable information sources in building efficient CHeQA systems, even though it contradicts the widespread tendency of relying on big data for health-related QA. Hence, to extract answers and prepare Q&A pairs for training and testing, we recommend restricting the sources to reliable knowledge bases such as MEDLINE records (www.nlm.nih.gov/bsd/medline.html) and credible websites such as NIH websites, WebMD (www.webmd.com), MayoClinic (www.mayoclinic.org), and WHO (www.who.int).
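One simple way to apply this recommendation when collecting answer candidates is a domain allowlist. The sketch below is an assumption about how such a filter might look, not a component of any surveyed system; the allowlist contains only the domains of the sources named above, and the helper names are illustrative.

```python
from urllib.parse import urlparse

# Domains of the credible sources named in the text (nih.gov also covers
# www.nlm.nih.gov, which hosts MEDLINE).
TRUSTED_DOMAINS = {"nih.gov", "webmd.com", "mayoclinic.org", "who.int"}


def is_trusted(url):
    """True if the URL's host is a trusted domain or one of its subdomains.

    The URL must include a scheme (e.g. https://) for the host to be parsed.
    """
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in TRUSTED_DOMAINS)


def filter_trusted(urls):
    """Keep only candidate answer sources from trusted domains."""
    return [u for u in urls if is_trusted(u)]
```

Matching on the exact domain or a dot-prefixed suffix, rather than on a plain substring, avoids accepting look-alike hosts whose names merely contain a trusted domain.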

Reliability versus naturalness

enquireMe (Wong, Thangarajah, and Padgham 2012) claims that the colloquial and nontechnical nature of community Q&A pairs allows the system to generate more natural responses compared to extracting sentences or paragraphs from other forms of web content. However, as discussed in the previous subsection, this compromises the reliability of the answers provided, which is the more crucial property since, unlike medical professionals, consumers are not able to validate the information they receive. Hence, attention should be given to generating natural responses from credible information sources. Translating medical terms into their corresponding Consumer Health Vocabulary (CHV) terms is one approach worth exploring. Qenam et al. (2017) discuss ways of simplifying text using CHV to generate patient-centered radiology reports. They use the MetaMap tool (metamap.nlm.nih.gov) to link terms in radiology reports to CHV concepts. Such techniques can be utilized to overcome the medical language barriers that limit the use of credible information sources.
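The term-translation idea can be illustrated with a dictionary-based substitution. This sketch does not use MetaMap or the actual CHV resource: the three-entry mapping is a hand-written, hypothetical stand-in built from commonly used lay equivalents, and the function name is illustrative.

```python
import re

# Hypothetical stand-in for a medical-term -> consumer-term mapping;
# not taken from the actual CHV resource.
LAY_TERMS = {
    "myocardial infarction": "heart attack",
    "hypertension": "high blood pressure",
    "pyrexia": "fever",
}


def simplify(text, mapping=LAY_TERMS):
    """Replace known medical terms with lay equivalents (case-insensitive)."""
    # Try longer terms first so multiword terms are matched as a whole.
    terms = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: mapping[m.group(0).lower()], text)
```

Plain string substitution ignores word sense and sentence-initial capitalization, which is one reason concept-linking tools of the kind used by Qenam et al. (2017) are preferable for real text.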


In this article, we surveyed single-turn and multiturn conversational approaches developed in the recent past for answering consumer health questions. Many challenges remain, however. We reviewed resources developed to address some of these challenges, as well as evaluation methods and benchmarks for single-turn and multiturn CHeQA systems. We discussed how limited generalizability, lack of empathy, lack of mixed-initiative interaction, and limited interpretability are among the limitations of existing CHeQA systems, and how addressing these limitations can help people convey their information needs more effectively and offer a more natural and trust-inspiring interaction experience.


The authors declare that there is no conflict.



  • Anuradha Welivita is a postdoctoral researcher working in the Human-Computer Interaction (HCI) Group in the School of Computer and Communication Sciences at the Swiss Federal Institute of Technology in Lausanne (EPFL). She obtained her Ph.D. in Computer Science from the same university. Her research interests lie mainly in the area of developing empathetic conversational agents to support people in open-domain as well as therapeutic settings. She is also interested in developing and analyzing language resources, mainly human discourse. She also has expertise in designing and conducting large-scale human computation experiments.
  • Pearl Pu currently leads the Human–Computer Interaction (HCI) Group in the School of Computer and Communication Sciences at the Swiss Federal Institute of Technology in Lausanne (EPFL). Her research interests include human–computer interaction, recommender technology, language models for empathetic dialog generation, and AI and ethics. She is a member of the steering committee of the ACM International Conference on Recommender Systems, a distinguished speaker for ACM, and served on the editorial boards of several highly recognized scientific journals. She is a recipient of 14 Research Awards from the Swiss National Science Foundation, three Technology Innovation Awards from the Swiss Government, and a Research Career Award from US National Science Foundation. She also co-founded three startup companies, for which she received the 2008 Rising Star Award from Sina.com and the 2014 Worldwide Innovation Challenge Award from the French President. She was made a fellow of EurAI (European Association for Artificial Intelligence) in 2021.