nTamil - Tamil Corpus

The nTamil project aims to create a comprehensive, high-quality collection of Tamil text data for natural language processing (NLP), especially for training LLMs, and for linguistic research.

List of works added to this Repo

Tamil Wikipedia articles up to 01.06.2024 (CC BY-SA 4.0)
Charles University English-Tamil Parallel Corpus (CC BY-NC-SA 3.0)
OSCAR 23.01 Tamil metadata (CC BY 4.0)
Project Madurai (open to use and distribute)
Tamil Wikisource books (CC BY-SA 4.0)
Tamil Mann Nationalized Books (CC BY-SA 4.0)
Leipzig Corpus
CC-100 Corpus
AI4Bharat (CC0)
Alpca-ora Translated for Tamil (GPL-3.0)
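
Since the sources listed above come from web crawls, parallel corpora, and translated datasets, some lines may contain little or no Tamil text. The following is a minimal sketch, not part of this repository, of how one might keep only mostly-Tamil lines from a local clone. It assumes the data is stored as UTF-8 `.txt` files, which this README does not confirm; the function names `tamil_ratio` and `iter_tamil_lines` and the 0.5 threshold are hypothetical choices made for illustration. The only fixed fact used is that the Unicode Tamil block spans U+0B80 to U+0BFF.

```python
from pathlib import Path

# Unicode Tamil block (U+0B80..U+0BFF).
TAMIL_START, TAMIL_END = 0x0B80, 0x0BFF


def tamil_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters in the Tamil block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    tamil = sum(1 for c in chars if TAMIL_START <= ord(c) <= TAMIL_END)
    return tamil / len(chars)


def iter_tamil_lines(root: Path, min_ratio: float = 0.5):
    """Yield (path, line) for mostly-Tamil lines in *.txt files under root.

    Assumption: corpus files are UTF-8 plain text with a .txt extension.
    """
    for path in sorted(root.rglob("*.txt")):
        with path.open(encoding="utf-8", errors="replace") as fh:
            for line in fh:
                line = line.strip()
                if line and tamil_ratio(line) >= min_ratio:
                    yield path, line
```

A call such as `iter_tamil_lines(Path("Tamil-Corpus"))` would walk a hypothetical local clone directory. The threshold is a judgment call: a higher value drops more mixed-language lines, which may or may not be desirable for parallel data.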

Folders and files (52 commits)

Ai4Bharat
Articles
Books
News
Other Datasets
Project Madurai
Web Crawls
LICENSE
Links to 2218 Books in TVA
README.md

velkadamban/Tamil-Corpus