This is a prototype of language detection for short messages (Twitter) with 99.1% accuracy for 17 languages.
ldig can also be used successfully on longer documents. This version was initiated in the context of research conducted at the University of Corsica on the automatic processing of less-resourced languages, in particular Corsican. The results reported in Kevers (2022) showed an average accuracy between 99.10% and 99.71% for 18 languages (17 official EU languages + Corsican).
The motivations for this fork are:
- adaptation of ldig to Python 3
- addition of alphabets (Greek and Cyrillic) not supported in the original version
You can use ldig with the provided models or retrain a new model from your own data.
Standard use with provided models:
- Extract a model directory:
  tar xf models/[select model archive]
- Detect:
  ldig.py -m [model directory] [text data file]
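For convenience, detection can also be wrapped in a small script. Below is a minimal sketch, assuming a hypothetical extracted model directory named model.latin and an input file sample_input.txt in the format described further down; it only wraps the documented command line.

```python
import subprocess

# Minimal sketch: "model.latin" and "sample_input.txt" are placeholder names,
# not files shipped with this repository.
model_dir = "model.latin"
input_file = "sample_input.txt"

# Equivalent to: ldig.py -m [model directory] [text data file]
subprocess.run(["python3", "ldig.py", "-m", model_dir, input_file], check=True)
```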
Train new models (a consolidated example of these steps is sketched after the list):
- Compile the maxsubst executable (if not already done):
  cd maxsubst
  g++ -Icybozulib/include maxsubst.cpp -o maxsubst
- Prepare your data. Learning data must be placed in a file formatted as follows:
  CorrectLabel [TAB] Metadata [TAB] Text
- Initialisation:
  python3 ldig.py -m [ModelDir] -x [MaxSubStBin] --init [LearnCorpusFile]
  Several options are available:
  --ff=[LowerLimitOfFrequency] : threshold of feature frequency
  -n [NgramUpperBound] : n-gram upper bound
- Learning:
  python3 ldig.py -m [ModelDir] --learn [TxtCorpusFile] -e [LearningRate]
  Several options are available:
  -r [RegularizationConstant] : regularization constant
  --wr [NumWholeRegularizations] : number of whole regularizations
- Optimisation (optional):
  python3 ldig.py -m [ModelDir] --shrink
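The initialisation, learning and optional shrinking steps can be chained in a single script. The sketch below is only an illustration under assumed names (new_model, learn_corpus.txt, maxsubst/maxsubst) and an example learning rate; it uses nothing beyond the command-line options documented above.

```python
import subprocess

# Assumed paths and example values -- adjust to your own setup.
model_dir = "new_model"
corpus = "learn_corpus.txt"
maxsubst_bin = "maxsubst/maxsubst"
learning_rate = "0.1"

def run(args):
    """Print and execute one command, stopping on failure."""
    print("+", " ".join(args))
    subprocess.run(args, check=True)

# Initialisation: create the model directory and extract features from the corpus.
run(["python3", "ldig.py", "-m", model_dir, "-x", maxsubst_bin, "--init", corpus])

# Learning: one training pass over the corpus with the chosen learning rate.
run(["python3", "ldig.py", "-m", model_dir, "--learn", corpus, "-e", learning_rate])

# Optimisation (optional): shrink the model with the documented --shrink option.
run(["python3", "ldig.py", "-m", model_dir, "--shrink"])
```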
As input data, each "document" (a tweet or any other text) is one line in a text file, in the following format:
[label]\t[some metadata separated '\t']\t[text without '\t']
[label] is a language name such as en, de, fr and so on. Metadata is optional, but the tab character has to be present. (ldig doesn't use the metadata or the label for detection, of course :D)
The output of ldig is in the following format:
[correct label]\t[detected label]\t[original metadata and text]
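To make these formats concrete, here is a minimal sketch with hypothetical labels, metadata and texts; it writes a small input file and parses detection results line by line, assuming the detector's output has been redirected to sample_output.txt.

```python
# Hypothetical example documents in the input format [label]\t[metadata]\t[text].
rows = [
    ("en", "tweet_001", "good morning everyone"),
    ("fr", "tweet_002", "bonjour tout le monde"),
]

with open("sample_input.txt", "w", encoding="utf-8") as f:
    for label, metadata, text in rows:
        f.write(f"{label}\t{metadata}\t{text}\n")

# Each output line is [correct label]\t[detected label]\t[original metadata and text].
with open("sample_output.txt", encoding="utf-8") as f:
    for line in f:
        correct, detected, rest = line.rstrip("\n").split("\t", 2)
        print(f"expected={correct} detected={detected} rest={rest}")
```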
ldig has an estimation tool:
./server.py -m [model directory]
Open http://localhost:48000 and enter the target text into the text area. ldig then outputs the language probabilities and the feature parameters found in the text.
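If you prefer to launch everything from a script, the following sketch simply starts the server with an assumed model directory name and opens the page in a browser; it makes no assumption about the server's internals.

```python
import subprocess
import time
import webbrowser

# "model.latin" is a placeholder for an extracted model directory.
server = subprocess.Popen(["python3", "server.py", "-m", "model.latin"])
time.sleep(2)  # give the server a moment to start listening on port 48000
webbrowser.open("http://localhost:48000")
```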
Supported Languages (original ldig models):
- cs Czech
- da Danish
- de German
- en English
- es Spanish
- fi Finnish
- fr French
- id Indonesian
- it Italian
- nl Dutch
- no Norwegian
- pl Polish
- pt Portuguese
- ro Romanian
- sv Swedish
- tr Turkish
- vi Vietnamese
Supported Languages (with Laurent Kevers models, data available at /~https://github.com/lkevers/ldig-models-TAL62-3)
The models have to be generated from the data following the documented procedure.
These models are designed to support 17 official languages of the European Union, plus Corsican.
- bg / bul - Bulgarian
- co / cos - Corsican
- cs / ces - Czech
- da / dan - Danish
- de / deu - German
- el / ell - Greek
- en / eng - English
- fi / fin - Finnish
- fr / fra - French
- hu / hun - Hungarian
- it / ita - Italian
- lt / lit - Lithuanian
- nl / nld - Dutch
- pl / pol - Polish
- pt / por - Portuguese
- ro / ron - Romanian
- es / spa - Spanish
- sv / swe - Swedish
- Blog Articles about ldig
- Laurent Kevers publications using ldig-python3:
- KEVERS, L. (2022). L’identification de langue, un outil au service du corse et de l’évaluation des ressources linguistiques. Traitement Automatique des Langues, 62(3). Numéro spécial "Diversité linguistique". https://www.atala.org/content/tal_62_3_-0
- KEVERS, L. & RETALI -MEDORI , S. (2020). Towards a Corsican Basic Language Resource Kit. In Proceedings of the 12th Language Resources and Evaluation Conference (p. 2726- 2735). Marseille, France : European Language Resources Association. https://www.aclweb.org/anthology/2020.lrec-1.332
- For more information about NLP and Corsican: https://bdlc.univ-corse.fr/tal/
- (c) 2011-2012 Nakatani Shuyo / Cybozu Labs Inc. All rights reserved.
- (c) 2021 Laurent Kevers / University of Corsica (changes made for ldig-python3)
- All code and resources are available under the MIT License.