Low-Resource Languages and the Limits of Scale
Large models improve as they see more language data, but many communities are poorly represented in the data that makes the benefits of scale appear universal.
Low-resource language modeling is often described as a data scarcity problem. That description is incomplete. The issue is not only that some languages have fewer digitized documents. It is also that available text may be domain-skewed, externally curated, outdated, politically sensitive, or disconnected from the way people speak and sign in daily life.
Scale can amplify this asymmetry. Languages with abundant web text receive better tokenization, broader pretraining, more evaluation sets, and more commercial feedback. Languages with limited data are often folded into multilingual systems whose aggregate performance hides local failure. The result is a model that appears multilingual while remaining unreliable for the people most dependent on it.
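One way to see how an aggregate score can hide local failure is to report accuracy per language alongside the pooled figure. The sketch below is illustrative only: the function names, the language labels "high" and "low", and the counts are invented for the example and are not results from any real system.

```python
from collections import defaultdict


def per_language_accuracy(records):
    """Return accuracy for each language.

    records: iterable of (language, is_correct) pairs.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for lang, ok in records:
        totals[lang] += 1
        correct[lang] += int(ok)
    return {lang: correct[lang] / totals[lang] for lang in totals}


def aggregate_accuracy(records):
    """Return pooled accuracy over all examples, ignoring language."""
    records = list(records)
    return sum(int(ok) for _, ok in records) / len(records)


# Hypothetical evaluation set (made-up counts): 900 examples in a
# well-resourced language, 100 in a low-resource one.
records = (
    [("high", True)] * 855 + [("high", False)] * 45
    + [("low", True)] * 40 + [("low", False)] * 60
)

pooled = aggregate_accuracy(records)
by_language = per_language_accuracy(records)
worst_language = min(by_language, key=by_language.get)

print(f"pooled accuracy: {pooled:.3f}")
for lang, acc in sorted(by_language.items()):
    print(f"{lang}: {acc:.3f}")
print(f"worst-served language: {worst_language}")
```

Because the pooled figure is weighted by example count, the language with more test data dominates it. Reporting the worst-served language next to the pooled number keeps that gap visible.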
Sign languages make the limits clearer. They are not visual encodings of nearby spoken languages. They have their own grammar, regional variation, and community norms. Modeling them requires attention to video, motion, facial expression, spatial reference, annotation labor, and consent. Treating sign-language modeling as ordinary translation with a different input channel misses the research problem.
Community governance is central. Small language communities may face extractive data collection, misrepresentation, or the release of tools that standardize one dialect at the expense of others. Research practice must ask who controls datasets, who benefits from model release, and what harms may follow from making language data searchable or inferable at scale.
The technical agenda includes better low-resource adaptation, data-efficient evaluation, speech and sign representation, and uncertainty reporting when systems are operating outside reliable coverage. It also includes refusal behavior, because a model should be able to state when it lacks adequate evidence for a language or communicative context.
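Refusal behavior of this kind can be sketched as a simple gate that checks whether a language has enough validated evaluation evidence before an answer is returned. Everything here is an assumption made for illustration: the `Decision` type, the `decide` function, the thresholds, and the placeholder language codes do not come from any particular system, and real thresholds would need to be set with the communities concerned.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    answer: bool   # True if the system should respond
    reason: str    # explanation that can be shown to the user


def decide(lang, confidence, validated_examples,
           min_examples=500, min_confidence=0.8):
    """Decide whether to answer or abstain for a given language.

    lang: language code of the request (placeholder codes below).
    confidence: model confidence in [0, 1] for this output.
    validated_examples: dict mapping language code to the number of
        evaluation examples checked for that language (hypothetical).
    """
    n = validated_examples.get(lang, 0)
    if n < min_examples:
        return Decision(
            False,
            f"insufficient evaluation evidence for '{lang}' "
            f"({n} validated examples)",
        )
    if confidence < min_confidence:
        return Decision(
            False,
            f"confidence {confidence:.2f} is below {min_confidence:.2f}",
        )
    return Decision(True, "within validated coverage")


# Hypothetical coverage table with placeholder language codes.
coverage = {"lang_a": 12000, "lang_b": 120}

print(decide("lang_a", 0.93, coverage))
print(decide("lang_b", 0.97, coverage))  # abstains despite high confidence
print(decide("lang_c", 0.99, coverage))  # abstains: language not covered
```

The coverage check runs before the confidence check on purpose: a model can be confidently wrong in a language it has barely been evaluated on, so high confidence alone is not treated as adequate evidence.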
Accessible intelligence cannot be achieved by assuming that scale will eventually absorb every community. It requires methods designed for uneven data, accountable collection, and respectful collaboration with the people whose languages are at stake.
Contact the research office
For collaboration proposals or questions about low-resource and sign-language modeling, contact the institute.
research@compaccess.edu.kg