velkadamban/Tamil-Corpus

nTamil - Tamil Corpus

The nTamil project aims to build a comprehensive, high-quality collection of Tamil text data for natural language processing (NLP), especially for training large language models (LLMs) and for linguistic research.

Works included in this repository

  • Tamil Wikipedia articles up to 01.06.2024 (CC BY-SA 4.0)
  • Charles University English-Tamil Parallel Corpus (CC BY-NC-SA 3.0)
  • OSCAR 23.01 Tamil metadata (CC BY 4.0)
  • Project Madurai (open to use and distribute)
  • Tamil Wikisource books (CC BY-SA 4.0)
  • Tamil Mann Nationalized Books (CC BY-SA 4.0)
  • Leipzig Corpus
  • CC-100 Corpus
  • AI4Bharat (CC0)
  • Alpaca-Orca translated into Tamil (GPL-3.0)
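
Because the works above carry different licenses, it can help to record each source's license next to its name before combining them. The following is a minimal sketch, not part of the repository: the `Source` class, the `SOURCES` list, and both helper functions are hypothetical names introduced for illustration, and the entries are a hand-copied subset of the list above, with `None` where no license is stated.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Source:
    """One corpus entry: its name and the license string given in the list."""
    name: str
    license: Optional[str]  # None where the list states no license


# Hypothetical manifest: a subset of the list above, copied by hand.
SOURCES = [
    Source("Tamil Wikipedia articles", "CC BY-SA 4.0"),
    Source("Charles University English-Tamil Parallel Corpus", "CC BY-NC-SA 3.0"),
    Source("Project Madurai", "Open to use and distribute"),
    Source("Leipzig Corpus", None),
    Source("CC-100 Corpus", None),
]


def is_noncommercial(source: Source) -> bool:
    """Return True when the license string contains the Creative Commons NC clause."""
    if source.license is None:
        return False
    # Split "CC BY-NC-SA 3.0" into tokens: CC, BY, NC, SA, 3.0
    tokens = source.license.upper().replace("-", " ").split()
    return "NC" in tokens


def missing_license(sources: List[Source]) -> List[str]:
    """Return the names of sources with no recorded license."""
    return [s.name for s in sources if s.license is None]


if __name__ == "__main__":
    print("Non-commercial only:", [s.name for s in SOURCES if is_noncommercial(s)])
    print("License not stated:", missing_license(SOURCES))
```

A check like this only reflects the license strings as written in the list. The terms published by each upstream source remain authoritative.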
