Dunietz et al. 2020 - To Test Machine Comprehension, Start by Defining Comprehension

1 Motivation

Models based on BERT (devlin18_bert) are matching and even exceeding human performance on machine reading comprehension (MRC) benchmarks, yet these models still make baffling errors that humans would never make and do not generalize well.

2 Argument: existing machine reading comprehension datasets do not effectively capture comprehension

MRC datasets contain natural-language question-answer pairs grounded in a passage to be comprehended (see the sketch after this list). In past approaches, the questions, answers, and passages have been:

  • handwritten by humans
    • These include questions which require multi-hop reasoning
  • collected from the wild, e.g. Reddit
  • taken from tests for humans, e.g. trivia bowl questions
  • generated by machine
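
As a concrete illustration of the shape of this data, here is a minimal sketch of a single MRC example as a data structure; the field names and example content are assumptions for illustration, not any particular dataset's schema:

  from dataclasses import dataclass

  @dataclass
  class MRCExample:
      # One MRC item: a passage plus a question whose answer should be
      # recoverable from that passage. Field names are illustrative only.
      passage: str
      question: str
      answer: str   # typically a span of the passage or a short free-form string
      source: str   # e.g. "handwritten", "collected from the wild", "machine-generated"

  example = MRCExample(
      passage="The banana sat on the counter, slowly turning from green to yellow.",
      question="What color is the banana at the end?",
      answer="yellow",
      source="handwritten",
  )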

Main contention: performing well on difficult questions does not necessarily indicate good understanding. Many datasets omit questions that humans would find obvious, e.g. "what color is a banana?" A system's sophistication matters less than its ability to comprehend a passage in a given context.

3 Approach: questions about stories

As the comprehension context, the authors propose narrative stories. For this purpose they define a template of understanding (ToU): a baseline set of concepts that a reading comprehension system should attend to:

  • temporal relationships
  • causal relationships
  • spatial relationships
  • motivational relationships

In this way, for narrative fiction, they present a systematic way to probe a machine's understanding of both the passage and the world it describes.
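
The template can be read as a lightweight schema for the questions a story should support. Below is a minimal sketch of that schema, assuming a simple enum-plus-dataclass representation; the names and example questions are illustrative, not the paper's implementation:

  from dataclasses import dataclass
  from enum import Enum

  class ToUDimension(Enum):
      # The four dimensions of the template of understanding.
      SPATIAL = "spatial"            # where entities are and how they move
      TEMPORAL = "temporal"          # when events happen and in what order
      CAUSAL = "causal"              # what causes or enables what
      MOTIVATIONAL = "motivational"  # why agents act as they do

  @dataclass
  class ToUQuestion:
      # A comprehension probe tagged with the dimension it targets.
      dimension: ToUDimension
      question: str

  probes = [
      ToUQuestion(ToUDimension.TEMPORAL, "Did she leave before or after the storm began?"),
      ToUQuestion(ToUDimension.CAUSAL, "What caused the bridge to collapse?"),
      ToUQuestion(ToUDimension.SPATIAL, "Where was the letter hidden?"),
      ToUQuestion(ToUDimension.MOTIVATIONAL, "Why did the narrator lie about the map?"),
  ]

A system probed this way must answer questions along all four dimensions, including ones a human reader would find obvious.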

4 Useful Links

Bibliography

  • [devlin18_bert] Devlin, Chang, Lee, & Toutanova, BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding, CoRR, (2018). link.