Bengali Language Model

This Bengali language model is built with fastai's ULMFiT and is ready for prediction and classification tasks.

Installation

pip install bnlm

Dependencies

Use PyTorch >=1.0.0 and <=1.3.0.
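
A quick way to confirm that an environment falls inside this range is to compare the installed PyTorch version against the two bounds. The following is only a sketch: the helper names `parse_version` and `is_supported` are hypothetical and not part of bnlm, and the parser looks only at the leading major.minor.patch digits, ignoring build suffixes such as `+cu100`.

```python
import re

# Bounds taken from the dependency line above (both inclusive).
MIN_TORCH = (1, 0, 0)
MAX_TORCH = (1, 3, 0)


def parse_version(version):
    """Return the leading major.minor.patch of a version string as ints."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def is_supported(version):
    """True if the version lies within the stated range, bounds included."""
    return MIN_TORCH <= parse_version(version) <= MAX_TORCH


try:
    import torch
    print("PyTorch", torch.__version__, "supported:", is_supported(torch.__version__))
except ImportError:
    print("PyTorch is not installed")
```

If the check reports an unsupported version, installing a PyTorch release inside the range before installing bnlm is the safer order.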

Evaluation Result

Language Model

Accuracy: 48.26% on the validation dataset
Perplexity: ~22.79
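
For readers who want to relate the perplexity figure to a training loss: under the usual definition, perplexity is the exponential of the mean per-token cross-entropy loss (in nats). The README does not state how its figure was computed, so the sketch below simply assumes that definition and derives the loss that a perplexity of 22.79 would imply.

```python
import math


def perplexity(cross_entropy_loss):
    """Perplexity from a mean per-token cross-entropy loss measured in nats."""
    return math.exp(cross_entropy_loss)


reported_perplexity = 22.79  # figure quoted in this README
implied_loss = math.log(reported_perplexity)  # inverse of the definition above

print(f"implied validation loss: {implied_loss:.3f} nats per token")
print(f"perplexity recovered from that loss: {perplexity(implied_loss):.2f}")
```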

Features and API

Download pretrained Model

To start, download the pretrained language model and the SentencePiece model:

from bnlm.bnlm import download_models

download_models()
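
The examples below refer to `model` and `model/bn_spm.model`, which suggests that the downloaded files end up in a `model` directory under the current working directory. That location is an inference from those paths, not documented behaviour, and the helper `sentencepiece_model_present` is hypothetical. Under that assumption, a small check before calling the API might look like this:

```python
from pathlib import Path


def sentencepiece_model_present(model_dir="model", filename="bn_spm.model"):
    """Return True if the SentencePiece model file exists inside model_dir."""
    return (Path(model_dir) / filename).is_file()


if sentencepiece_model_present():
    print("Found model/bn_spm.model")
else:
    print("model/bn_spm.model not found; run download_models() first")
```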

Predict N Words

predict_n_words takes three parameters as input:

input_sen (your incomplete input text)
N (number of words to predict)
model_path (path to the pretrained model)

from bnlm.bnlm import BengaliTokenizer
from bnlm.bnlm import predict_n_words
model_path = 'model'
input_sen = "আমি বাজারে"
output = predict_n_words(input_sen, 3, model_path)
print("Word Prediction: ", output)

Get Sentence Encoding

from bnlm.bnlm import BengaliTokenizer
from bnlm.bnlm import get_sentence_encoding
model_path = 'model'
sp_model = "model/bn_spm.model"
input_sentence = "আমি ভাত খাই।"
encoding = get_sentence_encoding(input_sentence, model_path, sp_model)
print("sentence encoding is: ", encoding)

Get Embedding Vectors

from bnlm.bnlm import BengaliTokenizer
from bnlm.bnlm import get_embedding_vectors
model_path = 'model'
sp_model = "model/bn_spm.model"
input_sentence = "আমি ভাত খাই।"
embed = get_embedding_vectors(input_sentence, model_path, sp_model)
print("sentence embedding is : ", embed)

Sentence Similarity

from bnlm.bnlm import BengaliTokenizer
from bnlm.bnlm import get_sentence_similarity
model_path = 'model'
sp_model = "model/bn_spm.model"
sentence_1 = "সে খুব করে কথা বলে।"
sentence_2 = "তার কথা খুবেই মিষ্টি।"
sim = get_sentence_similarity(sentence_1, sentence_2, model_path, sp_model)
print("Similarity is: %0.2f"%sim)

# Output: Similarity is: 0.72
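
The README does not say how the similarity score is computed. A common choice for comparing two sentence encodings is cosine similarity, so the sketch below shows that measure on toy vectors. It is an illustration of the general technique, not bnlm's implementation; `cosine_similarity` is a hypothetical helper and the vectors are made up rather than real model output.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy vectors standing in for two sentence encodings (not real model output).
enc_1 = [1.0, 2.0, 3.0]
enc_2 = [2.0, 4.0, 6.5]
print("Similarity is: %0.2f" % cosine_similarity(enc_1, enc_2))
```

Cosine similarity ranges from -1 to 1 and ignores vector length, so two encodings pointing in the same direction score 1 regardless of their magnitudes.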

Get Similar Sentences

get_similar_sentences takes four parameters:

input_sentence (the sentence to find similar sentences for)
N (number of sentences to predict)
model_path (path to the pretrained model)
sp_model (path to the pretrained SentencePiece model)

from bnlm.bnlm import BengaliTokenizer
from bnlm.bnlm import get_similar_sentences

model_path = 'model'
sp_model = "model/bn_spm.model"

input_sentence = "আমি বাংলায় গান গাই।"
sen_pred = get_similar_sentences(input_sentence, 3, model_path, sp_model)
print(sen_pred)
# output: ['আমি বাংলায় গান গাই ।', 'আমি ইংরেজিতে গান গাই।', 'আমি বাংলায় গানও গাই।']

Classification

Upcoming.

Training

To train with your own corpus, follow this repository.
