Events

Filtered Corpus Training (FiCT) Shows that Language Models can Generalize from Indirect Evidence

Shane Steinert-Threlkeld

Tuesday, February 18, 2025
11:45 a.m.–1:15 p.m.

Humanities Center Room D

A central question in the cognitive sciences has been which features of the human conceptual system are built in and which are (and, more generally, can be) learned from experience. In this talk, I will introduce a simple method for training language models called filtered corpus training and argue that it can help shed light on these debates concerning which forms of inductive bias are necessary for language learning. The method trains language models (LMs) on corpora with certain linguistic constructions entirely filtered out of the training data, and then measures the ability of LMs to perform linguistic generalization on the basis of indirect evidence. We apply the method to both LSTM and Transformer LMs (of roughly comparable size), developing filtered corpora that target a wide range of linguistic phenomena. Our results show that while Transformers are better qua LMs (as measured by perplexity), both models perform equally and surprisingly well on linguistic generalization measures, suggesting that they are capable of generalizing from indirect evidence. A deeper dive into one phenomenon---negative polarity items---also shows that LMs learn to base their judgments in this domain on the semantic concept of monotonicity, in a way not dissimilar to how NPIs are known to be processed in human language.
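To make the procedure concrete, here is a minimal sketch (not the speaker's actual code) of the filtering step: every sentence containing a target construction is removed from the corpus before LM training. The target words below are a hypothetical toy proxy for negative polarity items.

```python
# Illustrative sketch of filtered corpus training's filtering step.
# TARGET_WORDS is a hypothetical stand-in for a real construction detector
# (e.g., one that flags negative polarity items such as "any" or "ever").
TARGET_WORDS = {"any", "ever"}

def contains_target(sentence: str) -> bool:
    """Return True if the sentence contains any target word."""
    tokens = sentence.lower().split()
    return any(tok.strip(".,!?") in TARGET_WORDS for tok in tokens)

def filter_corpus(corpus: list[str]) -> list[str]:
    """Keep only sentences that lack the target construction."""
    return [s for s in corpus if not contains_target(s)]

corpus = [
    "She has not seen any movies.",
    "He rarely ever complains.",
    "The cat sat on the mat.",
]
filtered = filter_corpus(corpus)
# filtered keeps only "The cat sat on the mat."
```

An LM trained on the filtered corpus can then be evaluated on held-out minimal pairs involving the filtered construction, testing whether it generalizes from indirect evidence alone.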

This event will take place in person and via Zoom. If participating online, please register in advance:

https://rochester.zoom.us/meeting/register/tJcsd-iuqD8tG9SEvQTZ9m9LcikHj4bJCYHE