The nTamil project aims to build a comprehensive, high-quality collection of Tamil text data for natural language processing (NLP), especially large language models (LLMs), and for linguistic research. It draws on the following sources:
- Tamil Wikipedia articles up to 01.06.2024 (CC BY-SA 4.0)
- Charles University English-Tamil Parallel Corpus (CC BY-NC-SA 3.0)
- OSCAR 23.01 Tamil metadata (CC BY 4.0)
- Project Madurai (open to use and distribute)
- Tamil Wikisource books (CC BY-SA 4.0)
- Tamil Mann Nationalized Books (CC BY-SA 4.0)
- Leipzig Corpus
- CC-100 Corpus
- AI4Bharat (CC0)
- Alpaca-Orca translated to Tamil (GPL-3.0)