Releases: aviiciii/tamil-word-frequency
v2.2.0
Release Description - Version 2.2.0
We are pleased to announce another significant release of our language model, introducing an additional dataset to further enhance its capabilities. Here are the key details of this release:
- Dataset Addition: In this release, we have integrated another valuable dataset sourced from TamilCorpus. This dataset brings a substantial increase in the volume of available text data, providing a more comprehensive and diverse collection of Tamil words.
- Summary of Total Dataset: With the inclusion of the new dataset, our language model now encompasses an extensive vocabulary and offers enhanced language processing capabilities. Here is a summary of the total dataset included in this release:
  - Total Count of Words: 4,591,656
  - Words with Frequency > 5: 699,092
  - Words with Frequency > 100: 88,350
  - Words with Frequency > 1,000: 14,743
These statistics reflect the breadth and depth of the dataset, enabling the language model to understand and generate more accurate responses across a wide range of topics and contexts.
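Summary counts like the ones above can be derived from a word-frequency CSV roughly as follows. This is a minimal sketch, assuming a two-column file with a header row (`word,frequency`); the column layout, the sample rows, and the helper names `read_rows` and `count_above_thresholds` are illustrative and not taken from the repository.

```python
import csv
import io


def read_rows(handle):
    """Yield (word, frequency) pairs from a two-column CSV with a header row."""
    reader = csv.reader(handle)
    next(reader, None)  # skip the header row (assumed to be present)
    for row in reader:
        if len(row) >= 2:
            yield row[0], int(row[1])


def count_above_thresholds(rows, thresholds=(5, 100, 1000)):
    """Return {threshold: number of words whose frequency exceeds it}."""
    counts = {t: 0 for t in thresholds}
    for _word, freq in rows:
        for t in thresholds:
            if freq > t:
                counts[t] += 1
    return counts


# Tiny in-memory sample standing in for the real CSV file (hypothetical data).
sample = io.StringIO("word,frequency\nஅம்மா,1200\nவீடு,150\nமரம்,6\nபூ,2\n")
summary = count_above_thresholds(read_rows(sample))
print(summary)  # {5: 3, 100: 2, 1000: 1}
```

For a file on disk, the same functions can be fed from `open(path, newline="", encoding="utf-8")` instead of the `io.StringIO` sample.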
v2.1.0
Release Description - Version 2.1.0
We are excited to announce a new release of our language model with several significant improvements and additions. Here are the key changes in this release:
- Integration of a Large Dataset: To enhance the accuracy and coverage of our language model, we have incorporated an additional large dataset. This dataset provides a more comprehensive collection of words and their frequencies, allowing the model to offer improved performance and results.
- Dataset Management with CSV: We have transitioned to using CSV files for maintaining and managing the dataset. This change provides a structured and efficient approach to handling the large volume of word-frequency data. The CSV format is easy to read and compatible with a wide range of data analysis tools.
- Expanded Word Count: With the inclusion of the new dataset, the total count of words in our language model has significantly increased. The model now encompasses a comprehensive vocabulary of 3,080,012 words, enabling it to understand and generate more precise responses across a wide range of topics.
- Filtering Words by Frequency: We have implemented a frequency-based filtering mechanism to identify and focus on the words that occur most often. In this release, we provide the number of words above each frequency threshold:
  - Words with Frequency > 5: 464,197
  - Words with Frequency > 100: 55,496
  - Words with Frequency > 1,000: 8,724

  These statistics can be useful for various applications, such as text analysis, language processing, and statistical modeling.
- Directory Refactoring: As part of continuous improvement and organization, we have refactored the directory structure of our project. This restructuring enhances the overall maintainability and readability of the codebase, enabling easier navigation and future development.
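One way to fold an additional dataset into an existing word-frequency table and write the result back to CSV is sketched below. It assumes that frequencies for the same word are simply summed across sources and that the output uses a `word,frequency` header; both are assumptions, as are the function names and sample data, since the release notes do not describe the actual merge procedure.

```python
import csv
import io
from collections import Counter


def merge_frequencies(*tables):
    """Sum per-word frequencies across any number of {word: count} tables."""
    merged = Counter()
    for table in tables:
        merged.update(table)  # Counter.update adds counts instead of replacing them
    return merged


def write_csv(freqs, handle):
    """Write words in descending frequency order with a header row."""
    writer = csv.writer(handle)
    writer.writerow(["word", "frequency"])  # assumed column names
    for word, freq in freqs.most_common():
        writer.writerow([word, freq])


# Hypothetical sample tables standing in for the existing and the new dataset.
existing = {"வீடு": 150, "மரம்": 6}
new_source = {"வீடு": 50, "பூ": 2}

merged = merge_frequencies(existing, new_source)
out = io.StringIO()
write_csv(merged, out)
print(out.getvalue())
```

Keeping the merged table in a `Counter` makes it cheap to add further sources later, since each one is just another `update` call.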
v1.1.0
Release Description: Word Filtering and Frequency Processing
This release introduces a new feature that allows for efficient filtering and processing of words and their frequencies in a CSV file. The focus is on filtering non-Tamil words and removing trailing punctuation from a large dataset.
Key Features:
Word Filtering: The system filters out approximately 10,000 non-Tamil words from the input CSV file. By leveraging language-specific characteristics, the algorithm accurately identifies and removes words that do not belong to the Tamil language.
Trailing Punctuation Removal: Around 25,000 words in the dataset have trailing punctuation marks, which can affect subsequent analysis and natural language processing tasks. The system removes this trailing punctuation, giving cleaner and more meaningful word representations.
Output Format: The processed words and their corresponding frequencies are saved in a CSV file, enhancing compatibility and ease of use for further analysis and integration into downstream applications.
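The two cleaning steps described above could look roughly like this. It is a sketch under stated assumptions, not the project's actual algorithm: a word counts as Tamil only if every character lies in the Unicode Tamil block (U+0B80 to U+0BFF), punctuation is detected through Unicode general categories starting with `P`, punctuation is stripped before the Tamil check, and counts of words that become identical after stripping are summed. The function names and sample data are hypothetical.

```python
import unicodedata

# Unicode Tamil block; using it as the language test is an assumption.
TAMIL_START, TAMIL_END = 0x0B80, 0x0BFF


def strip_trailing_punctuation(word):
    """Remove punctuation characters (Unicode category P*) from the end of a word."""
    end = len(word)
    while end > 0 and unicodedata.category(word[end - 1]).startswith("P"):
        end -= 1
    return word[:end]


def is_tamil_word(word):
    """True if the word is non-empty and every character is in the Tamil block."""
    return bool(word) and all(TAMIL_START <= ord(ch) <= TAMIL_END for ch in word)


def clean(freqs):
    """Strip trailing punctuation, drop non-Tamil words, and sum merged counts."""
    cleaned = {}
    for word, freq in freqs.items():
        word = strip_trailing_punctuation(word)
        if is_tamil_word(word):
            cleaned[word] = cleaned.get(word, 0) + freq
    return cleaned


# Hypothetical raw entries, including punctuation variants and a non-Tamil word.
raw = {"வீடு": 10, "வீடு,": 3, "hello": 5, "மரம்.": 2, "...": 1}
print(clean(raw))  # {'வீடு': 13, 'மரம்': 2}
```

A stricter or looser test (for example, allowing zero-width joiners or digits) would change which entries survive, so the range check is worth adapting to the data at hand.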
Dataset:
No changes; same as https://github.com/aviiciii/tamil-word-frequency/releases/tag/v1.0.0
v1.0.0
The first release contains lists of the top Tamil words by frequency, in the following sizes:
- 100
- 250
- 1000
- 5000
- 10,000
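Top-N lists of these sizes can be cut from a word-frequency table in a few lines. The sketch below assumes the table is held as a `{word: count}` mapping; the `top_n` helper and the sample data are illustrative only.

```python
import heapq

# List sizes published in this release.
SIZES = (100, 250, 1000, 5000, 10000)


def top_n(freqs, n):
    """Return the n most frequent (word, count) pairs, highest count first."""
    return heapq.nlargest(n, freqs.items(), key=lambda item: item[1])


# Hypothetical sample table; the real data would be loaded from the dataset.
freqs = {"அம்மா": 1200, "வீடு": 150, "மரம்": 6, "பூ": 2}

# With fewer words than n, nlargest simply returns everything it has.
top_lists = {n: top_n(freqs, n) for n in SIZES}
print(top_n(freqs, 2))  # [('அம்மா', 1200), ('வீடு', 150)]
```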
Datasets used in this release: