Data Skeptic

Very Large Corpora and Zipf's Law

Author: Various
Narrator: Various
Publisher: Podcast
Duration: 0:24:11

Synopsis

The earliest efforts to apply machine learning to natural language tended to convert every token (every word, more or less) into a unique feature. While techniques like stemming may have cut down the number of unique tokens, researchers always faced a high-dimensional problem. The Naive Bayes algorithm was celebrated in NLP applications for its ability to process high-dimensional data efficiently. Of course, other algorithms were applied to natural language tasks as well. While different algorithms had different strengths and weaknesses on different NLP problems, an early paper titled Scaling to Very Very Large Corpora for Natural Language Disambiguation popularized one somewhat surprising idea. For many NLP tasks, simply providing a larger corpus of examples not only improved accuracy but also revealed that, asymptotically, some algorithms gained more than others from working on very, very large corpora. Although not explicitly about NLP, the noteworthy paper The Unreasonable Effect
