PRALEKHA is a large-scale document-level benchmark for Cross-Lingual Document Alignment (CLDA) evaluation, comprising over 3 million aligned document pairs across 11 Indic languages and English, with 1.5 million being English-centric. We propose a comprehensive evaluation framework introducing Document Alignment Coefficient (DAC), a novel metric designed for fine-grained document alignment. Unlike existing approaches that use pooled document-level embeddings, DAC aligns smaller chunks within documents and computes similarity based on the ratio of aligned chunks. This method improves over baseline pooling approaches, especially in noisy scenarios, with 15–20% precision and 5–10% F1 score gains.
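The chunk-alignment idea behind DAC can be sketched as follows. This is an illustrative approximation, not the paper's exact procedure: the greedy one-to-one matching, the `threshold` value, and normalizing by the larger chunk count are assumptions made for the sketch.

```python
import numpy as np

def dac_score(src_chunks, tgt_chunks, threshold=0.8):
    """Sketch of a Document Alignment Coefficient: greedily align chunks
    by cosine similarity and return the ratio of aligned chunks."""
    # Normalize rows so plain dot products become cosine similarities.
    src = src_chunks / np.linalg.norm(src_chunks, axis=1, keepdims=True)
    tgt = tgt_chunks / np.linalg.norm(tgt_chunks, axis=1, keepdims=True)
    sim = src @ tgt.T
    aligned, used = 0, set()
    for i in range(sim.shape[0]):
        j = int(np.argmax(sim[i]))
        # Count a chunk as aligned only if its best match is strong
        # enough and not already claimed by another chunk.
        if sim[i, j] >= threshold and j not in used:
            aligned += 1
            used.add(j)
    # Normalize by the longer document so length mismatch is penalized.
    return aligned / max(sim.shape)
```

Two identical documents score 1.0; documents whose chunks are mutually dissimilar score near 0.0, which is what lets DAC separate noisy pairs that pooled document embeddings blur together.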
Follow these steps to set up the environment and get started with the pipeline:
Clone this repository to your local system:
```bash
git clone https://github.com/AI4Bharat/Pralekha.git
cd Pralekha
```
Create and activate a new Conda environment for this project:
```bash
conda create -n pralekha python=3.9 -y
conda activate pralekha
```
Install the required Python packages:
```bash
pip install -r requirements.txt
```
The pipeline expects a directory structure in the following format:
- A main directory containing language subdirectories named using their 3-letter ISO codes (e.g., `eng` for English, `hin` for Hindi, `tam` for Tamil, etc.)
- Each language subdirectory will contain `.txt` documents named in the format `{doc_id}.txt`, where `doc_id` serves as the unique identifier for each document.
Below is an example of the expected directory structure:
```
data/
├── eng/
│   ├── tech-innovations-2023.txt
│   ├── sports-highlights-day5.txt
│   ├── press-release-456.txt
│   ├── ...
├── hin/
│   ├── daily-briefing-april.txt
│   ├── market-trends-yearend.txt
│   ├── इंडिया-न्यूज़123.txt
│   ├── ...
├── tam/
│   ├── kollywood-review-movie5.txt
│   ├── 2023-pilgrimage-guide.txt
│   ├── கடலோர-மாநில-செய்தி.txt
│   ├── ...
...
```
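Before running the pipeline, it can help to sanity-check that your data follows this layout. The helper below is a small sketch (not part of the repository) that walks the main directory and counts `.txt` documents per language subdirectory:

```python
from pathlib import Path

def validate_corpus(root):
    """Check the expected layout <root>/<iso3>/<doc_id>.txt and
    return a per-language document count (illustrative helper)."""
    root = Path(root)
    counts = {}
    for lang_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        # Language directories must use 3-letter ISO codes.
        assert len(lang_dir.name) == 3, f"not a 3-letter ISO code: {lang_dir.name}"
        counts[lang_dir.name] = len(list(lang_dir.glob("*.txt")))
    return counts
```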
To process documents into granular shards, use the `doc2granular-shards.sh` script. This script splits documents into chunks of varying granularities:

- `G = 1` → Sentence-level
- `G = 2, 4, 8` → Chunk-level (2, 4, or 8 sentences per chunk)

Run the script:

```bash
bash doc2granular-shards.sh
```
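Conceptually, the sharding step groups consecutive sentences into chunks of `G` sentences. A minimal sketch of that grouping (the actual script's sentence splitting and output format may differ):

```python
def shard_document(sentences, granularity):
    """Group consecutive sentences into chunks of `granularity` sentences.
    With granularity=1 each chunk is a single sentence (sentence level)."""
    return [sentences[i:i + granularity]
            for i in range(0, len(sentences), granularity)]
```

A trailing chunk may be shorter than `G` when the document length is not a multiple of the granularity.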
Generate embeddings for your dataset using one of the two supported models: LaBSE or SONAR.

```bash
bash create_embeddings.sh
```

Choose the desired model by editing the script as needed. Both models can be run sequentially or independently by enabling/disabling the respective sections.
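Whichever model you choose, embedding a large corpus typically proceeds in fixed-size batches. The helper below is a generic sketch of that pattern; `encode_fn` is a hypothetical stand-in for a LaBSE or SONAR encoder call, not an API from this repository:

```python
import numpy as np

def embed_in_batches(texts, encode_fn, batch_size=32):
    """Embed texts in fixed-size batches and stack the results.
    `encode_fn` maps a list of strings to a 2-D array of embeddings."""
    parts = []
    for start in range(0, len(texts), batch_size):
        parts.append(encode_fn(texts[start:start + batch_size]))
    return np.vstack(parts)
```

Batching keeps memory bounded on large shards while still filling the GPU with work on each call.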
The final step is to execute the pipeline based on your chosen method:
For the baseline approaches:

```bash
bash run_baseline_pipeline.sh
```

For the proposed DAC approach:

```bash
bash run_dac_pipeline.sh
```
Each pipeline comes with a variety of configurable parameters, allowing you to tailor the process to your specific requirements. Review and edit the scripts before running to ensure they match your desired configuration.
This dataset is released under the CC BY 4.0 license.
If you use this repository or our models, please cite our work:
```bibtex
@article{suryanarayanan2024pralekha,
  title={Pralekha: An Indic Document Alignment Evaluation Benchmark},
  author={Suryanarayanan, Sanjay and Song, Haiyue and Khan, Mohammed Safi Ur Rahman and Kunchukuttan, Anoop and Khapra, Mitesh M and Dabre, Raj},
  journal={arXiv preprint arXiv:2411.19096},
  year={2024}
}
```
For any questions or feedback, please contact:
- Raj Dabre (raj.dabre@cse.iitm.ac.in)
- Sanjay Suryanarayanan (sanj.ai@outlook.com)
- Haiyue Song (haiyue.song@nict.go.jp)
- Mohammed Safi Ur Rahman Khan (safikhan2000@gmail.com)
Please get in touch with us for any copyright concerns.