This is a prototype of language detection for short messages (Twitter) with 99.1% accuracy for 17 languages.
ldig can also be used successfully on longer documents. This version was initiated in the context of research conducted at the University of Corsica on the automatic processing of less-resourced languages, in particular Corsican. The results reported in Kevers (2022) showed an average accuracy between 99.10% and 99.71% for 18 languages (17 official EU languages + Corsican).
The motivations for this fork are:
- adaptation of ldig to Python 3
- addition of alphabets (Greek and Cyrillic) not supported in the original version
You can use ldig with the provided models or retrain a new model from your own data.
Standard use with provided models:
- Extract a model directory:
  tar xf models/[select model archive]
- Detect:
  ldig.py -m [model directory] [text data file]
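For convenience, detection can also be wrapped in a small script. Below is a minimal sketch, assuming a hypothetical extracted model directory named model.latin and an input file sample_input.txt in the format described further down; it only wraps the documented command line.

```python
import subprocess

# Minimal sketch: "model.latin" and "sample_input.txt" are placeholder names,
# not files shipped with this repository.
model_dir = "model.latin"
input_file = "sample_input.txt"

# Equivalent to: ldig.py -m [model directory] [text data file]
subprocess.run(["python3", "ldig.py", "-m", model_dir, input_file], check=True)
```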
Train new models (a consolidated example of these steps is sketched after the list):
- Compile the maxsubst executable (if not already done):
  cd maxsubst
  g++ -Icybozulib/include maxsubst.cpp -o maxsubst
- Prepare your data. Learning data must be placed in a file formatted as follows:
  CorrectLabel [TAB] Metadata [TAB] Text
- Initialisation:
  python3 ldig.py -m [ModelDir] -x [MaxSubStBin] --init [LearnCorpusFile]
  Several options are available:
  --ff=[LowerLimitOfFrequency] : threshold of feature frequency
  -n [NgramUpperBound] : n-gram upper bound
- Learning:
  python3 ldig.py -m [ModelDir] --learn [TxtCorpusFile] -e [LearningRate]
  Several options are available:
  -r [RegularizationConstant] : regularization constant
  --wr [NumWholeRegularizations] : number of whole regularizations
- Optimisation (optional):
  python3 ldig.py -m [ModelDir] --shrink
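The initialisation, learning and optional shrinking steps can be chained in a single script. The sketch below is only an illustration under assumed names (new_model, learn_corpus.txt, maxsubst/maxsubst) and an example learning rate; it uses nothing beyond the command-line options documented above.

```python
import subprocess

# Assumed paths and example values -- adjust to your own setup.
model_dir = "new_model"
corpus = "learn_corpus.txt"
maxsubst_bin = "maxsubst/maxsubst"
learning_rate = "0.1"

def run(args):
    """Print and execute one command, stopping on failure."""
    print("+", " ".join(args))
    subprocess.run(args, check=True)

# Initialisation: create the model directory and extract features from the corpus.
run(["python3", "ldig.py", "-m", model_dir, "-x", maxsubst_bin, "--init", corpus])

# Learning: one training pass over the corpus with the chosen learning rate.
run(["python3", "ldig.py", "-m", model_dir, "--learn", corpus, "-e", learning_rate])

# Optimisation (optional): shrink the model with the documented --shrink option.
run(["python3", "ldig.py", "-m", model_dir, "--shrink"])
```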
As input data, each "document" (a tweet or any other text) is one line in a text file, in the following format:
[label]\t[some metadata separated '\t']\t[text without '\t']
[label] is a language name such as en, de, fr and so on. Metadata is optional, but the tab character has to be present. (ldig doesn't use the metadata or the label for detection, of course :D)
The output of ldig is in the following format:
[correct label]\t[detected label]\t[original metadata and text]
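To make these formats concrete, here is a minimal sketch with hypothetical labels, metadata and texts; it writes a small input file and parses detection results line by line, assuming the detector's output has been redirected to sample_output.txt.

```python
# Hypothetical example documents in the input format [label]\t[metadata]\t[text].
rows = [
    ("en", "tweet_001", "good morning everyone"),
    ("fr", "tweet_002", "bonjour tout le monde"),
]

with open("sample_input.txt", "w", encoding="utf-8") as f:
    for label, metadata, text in rows:
        f.write(f"{label}\t{metadata}\t{text}\n")

# Each output line is [correct label]\t[detected label]\t[original metadata and text].
with open("sample_output.txt", encoding="utf-8") as f:
    for line in f:
        correct, detected, rest = line.rstrip("\n").split("\t", 2)
        print(f"expected={correct} detected={detected} rest={rest}")
```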
ldig has an estimation tool:
./server.py -m [model directory]
Open http://localhost:48000 and enter the target text into the text area. ldig then outputs the language probabilities and the feature parameters found in the text.
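If you prefer to launch everything from a script, the following sketch simply starts the server with an assumed model directory name and opens the page in a browser; it makes no assumption about the server's internals.

```python
import subprocess
import time
import webbrowser

# "model.latin" is a placeholder for an extracted model directory.
server = subprocess.Popen(["python3", "server.py", "-m", "model.latin"])
time.sleep(2)  # give the server a moment to start listening on port 48000
webbrowser.open("http://localhost:48000")
```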
Supported Languages (original ldig models):
- cs Czech
- da Danish
- de German
- en English
- es Spanish
- fi Finnish
- fr French
- id Indonesian
- it Italian
- nl Dutch
- no Norwegian
- pl Polish
- pt Portuguese
- ro Romanian
- sv Swedish
- tr Turkish
- vi Vietnamese
Supported Languages (with Laurent Kevers models, data available at /~https://github.com/lkevers/ldig-models-TAL62-3)
The models have to be generated from the data following the documented procedure.
These models are designed to support 17 official languages of the European Union, plus Corsican.
- bg / bul - Bulgarian
- co / cos - Corsican
- cs / ces - Czech
- da / dan - Danish
- de / deu - German
- el / ell - Greek
- en / eng - English
- fi / fin - Finnish
- fr / fra - French
- hu / hun - Hungarian
- it / ita - Italian
- lt / lit - Lithuanian
- nl / nld - Dutch
- pl / pol - Polish
- pt / por - Portuguese
- ro / ron - Romanian
- es / spa - Spanish
- sv / swe - Swedish
- Blog Articles about ldig
- Laurent Kevers publications using ldig-python3:
- KEVERS, L. (2022). L’identification de langue, un outil au service du corse et de l’évaluation des ressources linguistiques. Traitement Automatique des Langues, 62(3). Numéro spécial "Diversité linguistique". https://www.atala.org/content/tal_62_3_-0
- KEVERS, L. & RETALI -MEDORI , S. (2020). Towards a Corsican Basic Language Resource Kit. In Proceedings of the 12th Language Resources and Evaluation Conference (p. 2726- 2735). Marseille, France : European Language Resources Association. https://www.aclweb.org/anthology/2020.lrec-1.332
- For more information about NLP and Corsican: https://bdlc.univ-corse.fr/tal/
- (c) 2011-2012 Nakatani Shuyo / Cybozu Labs Inc. All rights reserved.
- (c) 2021 Laurent Kevers / University of Corsica (changes made for ldig-python3)
- All code and resources are available under the MIT License.