Prasanna 2020 – When BERT Plays The Lottery, All Tickets Are Winning
Notes for prasanna-etal-2020-bert
1 pruning BERT
- BERT can be pruned by removing either:
  - the weights with smallest magnitude (call this magnitude pruning)
  - the self-attention heads of least importance (call this structured pruning)
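Magnitude pruning can be sketched in a few lines. This is a hypothetical numpy illustration (not the paper's code): zero out the given fraction of entries with smallest absolute value.

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of entries with smallest |value|."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    if k == 0:
        return weights.copy()
    # k-th smallest magnitude becomes the pruning threshold
    threshold = np.partition(flat, k - 1)[k - 1]
    mask = np.abs(weights) > threshold
    return weights * mask

w = np.array([[0.5, -0.1], [0.02, -0.9]])
pruned = magnitude_prune(w, 0.5)  # drops the two smallest-magnitude entries
```

In BERT the same masking is applied per linear layer; here a single matrix stands in for one layer's weights.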
2 methods
- fine-tune BERT on downstream tasks, with a task-specific top-layer network
- iteratively prune, checking after each step that performance is maintained
- investigate whether the surviving heads/weights are stable across:
  - random initializations of the task-specific top layer
  - different tasks
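The iterative prune-and-check loop above can be sketched as follows. The callables `evaluate` and `prune_one_step`, and the toy "model" of head-importance scores, are stand-ins invented for illustration, not the paper's implementation:

```python
def iterative_prune(model, evaluate, prune_one_step, tolerance=0.9):
    """Prune one unit at a time until performance drops below
    `tolerance` * baseline, then return the last acceptable model."""
    baseline = evaluate(model)
    while True:
        candidate = prune_one_step(model)
        if evaluate(candidate) < tolerance * baseline:
            return model  # keep the last model that met the bar
        model = candidate

# Toy stand-ins: the "model" is a dict of head-importance scores,
# pruning removes the least important head, and "performance" is
# the total retained importance.
toy_model = {"h1": 0.5, "h2": 0.3, "h3": 0.15, "h4": 0.05}
evaluate = lambda m: sum(m.values())

def prune_one_step(m):
    m = dict(m)
    del m[min(m, key=m.get)]
    return m

surviving = iterative_prune(toy_model, evaluate, prune_one_step)
```

With these toy numbers, pruning h4 keeps performance at 95% of baseline, but pruning h3 as well would drop it to 80%, so the loop stops with three heads surviving.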
3 Key takeaways from this paper:
- the heads that survive structured pruning do not seem to encode much linguistic/syntactic information
- the "bad" subnetworks that get pruned away can be fine-tuned to roughly the same performance as the surviving ones, suggesting they were doing redundant work (hence "all tickets are winning")
- this points away from a theory that BERT is composed of modules with individual jobs, and towards a theory that language processing is distributed across many heads
Bibliography
- [prasanna-etal-2020-bert] Prasanna, S., Rogers, A., & Rumshisky, A. (2020). "When BERT Plays the Lottery, All Tickets Are Winning". In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 3208-3229. Association for Computational Linguistics.