Motivation

Learning a new language is most efficient when you focus on the words you will actually encounter. Research in linguistics shows that a relatively small number of words account for the vast majority of everyday text. In Turkish, roughly 1,000 words cover about 80% of what you read in the news.

This project was built to answer a simple question: which Turkish words should a learner study first? Instead of relying on textbook vocabulary lists, it analyzes real Turkish news articles to produce a frequency-ranked word list grounded in how the language is actually used today.

How It Works

The dictionary is generated by a three-stage pipeline that scrapes Turkish news articles, normalizes their words, and computes the frequency data served to the web interface.

RSS Feeds → Scraper → Raw Articles
Raw Articles → Lemmatizer → Normalized Words
Normalized Words → Analyzer → Frequency Data → Web Interface

1. Scraping. A Python scraper fetches articles from Turkish news RSS feeds (such as TRT Haber), extracts the article text, and stores it locally.
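
A minimal sketch of this stage, assuming the feedparser, requests, and BeautifulSoup libraries; the feed URL is a placeholder and the paragraph-based extraction is a simplification, since real article pages need site-specific selectors.

    import feedparser                  # RSS/Atom parsing
    import requests                    # HTTP fetching
    from bs4 import BeautifulSoup      # HTML text extraction

    # Placeholder URL; the project pulls from Turkish outlets such as TRT Haber.
    FEED_URLS = ["https://example.com/turkish-news/rss"]

    def fetch_articles(feed_urls):
        """Yield (title, text) for every article linked from the given feeds."""
        for feed_url in feed_urls:
            feed = feedparser.parse(feed_url)
            for entry in feed.entries:     # entries expose .title and .link
                html = requests.get(entry.link, timeout=10).text
                soup = BeautifulSoup(html, "html.parser")
                # Crude extraction: join all paragraph text; a real scraper
                # would use a site-specific selector to skip menus and ads.
                text = " ".join(p.get_text(" ", strip=True)
                                for p in soup.find_all("p"))
                yield entry.title, text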

2. Lemmatization. Turkish is an agglutinative language: a single root can take dozens of suffixes, so an inflected form like evlerinizden ("from your houses") must be reduced to its lemma ev ("house") before counting. The analyzer uses Zeyrek, a morphological analyzer for Turkish, to map inflected forms back to their dictionary lemmas. A stemmer serves as a fallback for words Zeyrek does not recognize.
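
A sketch of this step under two assumptions: that Zeyrek's MorphAnalyzer.lemmatize returns a list of (token, lemmas) pairs as in its documentation, and that the unnamed fallback stemmer is the Turkish stemmer from the snowballstemmer package.

    import re

    import snowballstemmer   # assumed fallback; the project does not name its stemmer
    import zeyrek

    analyzer = zeyrek.MorphAnalyzer()
    stemmer = snowballstemmer.stemmer("turkish")

    def lemmatize_word(word):
        """Map one lowercase token to a dictionary form."""
        results = analyzer.lemmatize(word)   # assumed shape: [(token, [lemma, ...])]
        if results and results[0][1]:
            lemma = results[0][1][0].lower()
            if lemma != "unk":               # 'Unk' marker assumed for unparseable tokens
                return lemma
        return stemmer.stemWord(word)        # fallback for unrecognized words

    def normalize(text):
        """Lowercase, tokenize, and lemmatize; rough tokenization for a sketch."""
        for token in re.findall(r"[a-zçğıöşü]+", text.lower()):
            yield lemmatize_word(token)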

3. Frequency analysis. The pipeline counts occurrences of each lemma, ranks them by frequency, and calculates cumulative coverage percentages. It also filters noise words (connectors, particles) to keep the list focused on meaningful vocabulary.
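
A sketch of the counting stage built on the normalize() generator above; the stop-word set is a tiny illustrative sample, not the project's actual noise filter.

    from collections import Counter

    # Illustrative sample of Turkish connectors/particles; the real filter is larger.
    STOPWORDS = {"ve", "bir", "bu", "da", "de", "ile", "için"}

    def build_frequency_list(texts):
        """Count lemmas across all texts and attach cumulative coverage."""
        counts = Counter()
        for text in texts:
            counts.update(w for w in normalize(text) if w not in STOPWORDS)

        total = sum(counts.values())
        rows, running = [], 0
        for rank, (lemma, count) in enumerate(counts.most_common(), start=1):
            running += count
            rows.append({
                "rank": rank,
                "lemma": lemma,
                "count": count,
                # Share of all counted words covered by the list up to this rank.
                "coverage": round(100 * running / total, 2),
            })
        return rows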

The resulting frequency list is served as a static JSON file and rendered by a vanilla JavaScript frontend with real-time search filtering.
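
Tying the sketches together, the static file might be produced like this; the file name and field layout are assumptions, since the write-up only says the output is a static JSON file that the frontend filters client-side.

    import json

    articles = (text for _, text in fetch_articles(FEED_URLS))
    with open("frequency.json", "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps Turkish characters readable in the output.
        json.dump(build_frequency_list(articles), f, ensure_ascii=False, indent=2)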

Technology

The scraper and analyzer are written in Python, with Zeyrek handling morphological analysis and a stemmer as fallback. The frontend is vanilla JavaScript that reads the static JSON frequency file, so no server-side code runs at view time.

Source Code

The full source code — including the scraper, analyzer, and web frontend — is available on GitHub.