A question answering (QA) system is a computer program capable of answering questions posed by humans in natural language. It may construct its answers by querying a structured repository of knowledge or information, usually a knowledge base. More commonly, QA systems pull answers from an unstructured collection of natural language documents, which may include a set of reference documents, web pages, or even a set of Wikipedia pages.
Natural language processing (NLP) methods are used to process both the question and the indexed documents or text corpus from which answers are extracted. Most QA systems use the World Wide Web as their corpus of knowledge. However, these systems do not yield a human-like answer; instead they employ narrow approaches (keyword-based techniques, templates, etc.) to produce a list of document extracts that are likely to contain the answer.
Generally, QA systems include a question classifier unit that determines the type of question and the type of answer expected. The system then applies progressively more complex NLP methods to steadily reduce the volume of text: first it retrieves document sets or extracts that are likely to contain the answer, and then it selects small text fragments that contain strings of the same type as the expected answer. Finally, every candidate answer found is assigned a score.
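The pipeline described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the cue rules in `ANSWER_TYPE_RULES`, the regular expressions in `TYPE_PATTERNS`, the stopword list, and the keyword-overlap score are all hypothetical stand-ins for the trained classifiers and heavier NLP components a real system would use.

```python
import re

# Hypothetical cue rules mapping a question opening to an expected answer type.
ANSWER_TYPE_RULES = [
    (re.compile(r"^\s*(when|what year|which year)\b", re.I), "DATE"),
    (re.compile(r"^\s*(who|whom)\b", re.I), "PERSON"),
    (re.compile(r"^\s*how (many|much)\b", re.I), "NUMBER"),
]

# Crude, illustrative patterns for strings of each answer type.
TYPE_PATTERNS = {
    "DATE": re.compile(r"\b(?:1[0-9]{3}|20[0-9]{2})\b"),   # four-digit years only
    "PERSON": re.compile(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b"),  # two capitalised words
    "NUMBER": re.compile(r"\b\d+(?:\.\d+)?\b"),
}

STOPWORDS = {"a", "an", "the", "on", "in", "of", "is", "was", "did",
             "to", "when", "who", "what", "how"}


def classify_question(question):
    """Question classifier: return the expected answer type."""
    for pattern, answer_type in ANSWER_TYPE_RULES:
        if pattern.search(question):
            return answer_type
    return "OTHER"


def keywords(text):
    """Lowercased content words of a text, without stopwords."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower())
            if w not in STOPWORDS}


def answer_question(question, passages):
    """Filter passages, extract type-matching strings, and score them."""
    pattern = TYPE_PATTERNS.get(classify_question(question))
    if pattern is None:
        return []
    q_words = keywords(question)
    scored = []
    for passage in passages:
        # Step 1: drop passages that share no keyword with the question.
        overlap = len(q_words & keywords(passage))
        if overlap == 0:
            continue
        # Step 2: keep only strings of the expected answer type.
        for match in pattern.finditer(passage):
            # Step 3: score each candidate by the keyword overlap.
            scored.append((match.group(0), overlap))
    return sorted(scored, key=lambda item: -item[1])


passages = [
    "Apollo 11 landed on the Moon in 1969.",
    "The Moon orbits Earth.",
    "Neil Armstrong was born in 1930.",
]
print(answer_question("When did Apollo 11 land on the Moon?", passages))
# [('1969', 3)]
```

Each stage shrinks the text the next stage has to look at, which is the point of the progressive filtering: cheap keyword matching runs over everything, while type matching and scoring only run on the passages that survive.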
Types of questions encountered by a question answering system:
The two main categories of questions depend on the domain of the application for which the system is built:
- Closed domain: questions are specific to a particular domain, which makes it possible to exploit the specific structure of the domain documents or to create an ontology. This makes them an easier NLP task than open-domain questions, because the questions are restricted to a particular set and are therefore more predictable.
- Open domain: questions can be about nearly anything and rely on a general representation of world knowledge. It is worth mentioning that the dimensionality of the available data is far larger than for closed-domain questions, which makes them more complex.
Another way to categorize questions is by how the answer is deduced: some questions are based on facts, some are hypothetical in nature, and so on. Let's look at these more closely, since they help us understand the wider variety of questions encountered and think about different ways of answering them.
- Fact-based or factoid questions: these are simple questions based on facts, such as when a person was born, when we first landed on the Moon, or who was the first person to step on the Moon. These are basically the questions answered by commercial applications like Siri, Cortana, etc.
- Narrative questions: these questions are part of narrative speech, such as what people thought of a book. Their answers are often generalized over a larger audience, for example how people cope with a new law. Another way to think of them is that they don’t have one particular answer; the answer may vary from person to person and over time. I view them more as a statistical hypothesis, and answer them using statistical learning methods combined with the facts deduced.
Approaches to building question answering systems:
- Information retrieval (IR) based approaches: the simple kind we are all familiar with. For example, Google Search finds the answer in a large repository of organized (indexed) documents. It is basically a query-based system that extracts the relevant keywords from the query in order to identify the related indexed documents; this process also allows the relationships between the documents and the query to be explored. Finally, the system processes the retrieved set of documents to generate a set of candidate answers, and applies a ranking scheme to pick the best answer available to it.
- Knowledge-based approaches: the program generates the answer by understanding the sentence or question. OK, that was vague. Here, a semantic representation is created from the query based on the entities present, quantitative data, and specific information pertaining to space and time. The deduced information is then mapped to various structured resources or databases, which may include ontology knowledge bases, scientific databases, geospatial databases, etc.
- Hybrid approaches: a combination of the IR and knowledge-based approaches, and the most commonly used nowadays. There is a variety of techniques for implementing hybrid approaches, but the core concepts are based on the two approaches above. Example implementations include Siri, Cortana, Google Assistant, IBM Watson, etc. In general, the IR system is used to generate the possible candidate answers, and knowledge-based approaches are used to score the candidate answers based on the semantic representation deduced.
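To make the three approaches more concrete, here is a rough Python sketch of a hybrid system: an IR step retrieves documents and proposes candidate answers, and a knowledge-based step scores them. Everything in it is an assumption made for illustration, not how Siri, Watson, or any real product works: the TF-IDF-style ranking, the tiny `KB` dictionary of facts, the `RELATION_CUES` table, and the fixed boost of 2.0 are all hypothetical, and candidates are limited to four-digit years to keep the example short.

```python
import math
import re
from collections import Counter

# Hypothetical structured knowledge base: (entity, relation) -> value.
KB = {
    ("neil armstrong", "birth_year"): "1930",
    ("apollo 11", "landing_year"): "1969",
}
# Hypothetical cue words that signal a relation in the question.
RELATION_CUES = {"born": "birth_year", "land": "landing_year",
                 "landed": "landing_year"}


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Inverted index: term -> {doc_id: term frequency}."""
    index = {}
    for doc_id, text in enumerate(documents):
        for term, tf in Counter(tokenize(text)).items():
            index.setdefault(term, {})[doc_id] = tf
    return index


def retrieve(query, documents, index, top_k=2):
    """IR step: rank documents by a simple TF-IDF sum over query terms."""
    scores = Counter()
    for term in set(tokenize(query)):
        postings = index.get(term, {})
        if not postings:
            continue
        idf = math.log(1 + len(documents) / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    return [doc_id for doc_id, _ in scores.most_common(top_k)]


def parse_question(question):
    """Knowledge-based step: reduce the question to (entity, relation)."""
    words = tokenize(question)
    text = " ".join(words)
    entity = next((e for e, _ in KB if e in text), None)
    relation = next((r for cue, r in RELATION_CUES.items() if cue in words),
                    None)
    return entity, relation


def hybrid_answer(question, documents, index):
    """IR proposes candidate years; the knowledge base re-scores them."""
    candidates = Counter()
    for rank, doc_id in enumerate(retrieve(question, documents, index)):
        for year in re.findall(r"\b\d{4}\b", documents[doc_id]):
            candidates[year] += 1.0 / (rank + 1)  # higher-ranked docs count more
    kb_value = KB.get(parse_question(question))
    if kb_value in candidates:
        candidates[kb_value] += 2.0  # boost candidates the KB agrees with
    return candidates.most_common()


documents = [
    "Apollo 11 landed on the Moon in 1969.",
    "Neil Armstrong was born in 1930 and flew on Apollo 11.",
    "The Moon orbits Earth.",
]
index = build_index(documents)
print(hybrid_answer("When did Apollo 11 land on the Moon?", documents, index))
# [('1969', 3.0), ('1930', 0.5)]
```

Using `retrieve` on its own gives a pure IR system, and looking up `KB.get(parse_question(question))` on its own gives a pure knowledge-based one. The hybrid only lets the knowledge base boost candidates that the IR step already found, so a gap in the knowledge base degrades the ranking rather than removing the answer.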