Why Frequency-Based Learning?

Learning a new language is most efficient when you focus on the words you will actually encounter. Linguistic research shows that a small number of words account for the vast majority of everyday text.

In Turkish, roughly 1,400 words cover about 80% of what you read in news and literature. Master these first, and you'll recognize roughly four out of every five words you encounter.
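To make the idea of coverage concrete, here is a minimal sketch of how such a figure can be computed from a frequency table. The function name and the toy counts are invented for illustration. They are not taken from our corpus or our codebase.

```python
from collections import Counter

def words_needed_for_coverage(counts: Counter, target: float) -> int:
    """Return how many top-ranked words are needed to cover `target` of all tokens."""
    total = sum(counts.values())
    if total == 0:
        return 0
    covered = 0
    # Walk the words from most to least frequent, accumulating their share.
    for rank, (_, freq) in enumerate(counts.most_common(), start=1):
        covered += freq
        if covered / total >= target:
            return rank
    return len(counts)

# Toy counts (made up for illustration, not real corpus data).
toy_counts = Counter({"ve": 50, "bir": 25, "bu": 12, "gelmek": 8, "dolar": 5})
needed = words_needed_for_coverage(toy_counts, 0.8)
print(needed)
```

On this toy table the three most frequent words account for 87 of 100 tokens, so three words are enough to pass the 80% mark.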

Our Corpus

This dictionary is built from a 6-million-word corpus of real Turkish text, combining:

This breadth ensures the vocabulary reflects general Turkish rather than a single editorial voice or time period.

How It Works

Corpus Sources → Lemmatizer → Frequency Analysis → Ranked Word List

Lemmatization. Turkish is an agglutinative language — a single root word can take dozens of suffixes. We use Zemberek-NLP, a morphological analyzer for Turkish, to reduce inflected forms back to their dictionary lemmas. This means "geldim", "geliyorum", and "gelecek" all count toward the root verb "gelmek" (to come).
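The effect of lemmatization on the counts can be sketched as follows. This is not Zemberek's API: the `TOY_LEMMAS` lookup table and the `lemmatize` helper are hypothetical stand-ins for a real morphological analyzer, used only to show how inflected forms are folded into one entry before counting.

```python
from collections import Counter

# Hypothetical stand-in for a morphological analyzer: a hand-written table.
# A real analyzer derives the lemma from the word's morphology instead.
TOY_LEMMAS = {
    "geldim": "gelmek",
    "geliyorum": "gelmek",
    "gelecek": "gelmek",
}

def lemmatize(token: str) -> str:
    """Map a token to its lemma, falling back to the token itself."""
    return TOY_LEMMAS.get(token, token)

def lemma_frequencies(tokens):
    """Count lemmas rather than surface forms."""
    return Counter(lemmatize(t) for t in tokens)

# Tokens are assumed to be lowercased already.
tokens = ["geldim", "geliyorum", "gelecek", "ev", "ev"]
ranked = lemma_frequencies(tokens).most_common()
print(ranked)
```

Without the lemmatization step, each of the three verb forms would appear once and rank below "ev". With it, "gelmek" is counted three times and moves to the top of the list.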

Sentence-level disambiguation. Zemberek analyzes each sentence as a whole, correctly identifying ambiguous words from context — for example, recognizing "dolar" as the currency rather than a verb form.

Noise filtering. We remove newspaper names, boilerplate text, URL fragments, and other noise to keep the list focused on meaningful vocabulary.
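A filter of this kind might look like the sketch below. The blocklist entry and the URL pattern are assumptions made for illustration. The actual rules in the project may differ.

```python
import re

# Hypothetical blocklist: "ornekgazete" is a placeholder, not a real newspaper.
BLOCKLIST = {"ornekgazete"}

# Matches bare URL pieces ("http", "https", "www", "com"), anything
# containing "://", and anything starting with "www.".
URL_LIKE = re.compile(r"^(?:https?|www|com)$|://|^www\.")

def is_noise(token: str) -> bool:
    """Return True for tokens that should not reach the frequency list."""
    return token in BLOCKLIST or URL_LIKE.search(token) is not None

def filter_noise(tokens):
    """Keep only the tokens that are not flagged as noise."""
    return [t for t in tokens if not is_noise(t)]

tokens = ["gelmek", "www", "ornekgazete", "https://ornek.example", "dolar"]
clean = filter_noise(tokens)
print(clean)
```

Here only "gelmek" and "dolar" survive. The other three tokens are dropped as a URL fragment, a blocklisted name, and a full URL.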

Coming Soon: Premium Anki Deck

We're building a comprehensive Anki flashcard deck based on this frequency data:

Sign up on the homepage to get notified when it launches.

Methodology Notes

This is an active project. The frequency list is useful for guiding study priorities, and we're continuously improving the data:

Open Source

The full source code — including the scraper, analyzer, and web frontend — is available on GitHub.

Technology