BNLP is a natural language processing toolkit for the Bengali language. It supports tokenizing Bengali text, embedding Bengali words and documents, part-of-speech tagging, named entity recognition, and text cleaning.
- Tokenization
- Embeddings
- Part of speech tagging
- Named Entity Recognition
- Text Cleaning
- Corpus
- Letters, vowels, punctuation, stopwords
- Command Line Interface (CLI)
pip install bnlp_toolkit
Or upgrade:
pip install -U bnlp_toolkit
- Python: 3.8, 3.9, 3.10, 3.11
- OS: Linux, Windows, Mac
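To confirm that the running interpreter falls within the versions listed above, a check along these lines may help. This is a minimal sketch: the `SUPPORTED` set and the `is_supported` helper are illustrative names and are not part of BNLP.

```python
import sys

# Minor versions copied from the list above; update if that list changes.
SUPPORTED = {(3, 8), (3, 9), (3, 10), (3, 11)}


def is_supported(version_info=sys.version_info):
    """Return True if the (major, minor) pair is in the supported set."""
    return (version_info[0], version_info[1]) in SUPPORTED


status = "listed as supported" if is_supported() else "not listed as supported"
print(f"Python {sys.version_info[0]}.{sys.version_info[1]} is {status}")
```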
git clone https://github.com/sagorbrur/bnlp.git
cd bnlp
python setup.py install
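After installing, one way to check that the package is visible to the current environment is to query its metadata with the standard library. This sketch assumes the distribution is registered under the same name used in the `pip install` command (`bnlp_toolkit`); `installed_version` is a hypothetical helper, not a BNLP function.

```python
from importlib import metadata


def installed_version(dist_name="bnlp_toolkit"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
print(version if version else "bnlp_toolkit is not installed in this environment")
```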
from bnlp import BasicTokenizer
tokenizer = BasicTokenizer()
raw_text = "আমি বাংলায় গান গাই।"
tokens = tokenizer(raw_text)
print(tokens)
# output: ["আমি", "বাংলায়", "গান", "গাই", "।"]

BNLP provides a command-line interface for quick text processing without writing Python code.
# Tokenize text
bnlp tokenize "আমি বাংলায় গান গাই।"
# Output: ['আমি', 'বাংলায়', 'গান', 'গাই', '।']
# Named Entity Recognition
bnlp ner "সজীব ওয়াজেদ জয় ঢাকায় থাকেন।"
# Part-of-Speech Tagging
bnlp pos "আমি ভাত খাই।"
# Get word embeddings (similar words)
bnlp embedding "বাংলা" --similar
# Clean text
bnlp clean "[email protected] আমি বাংলায়" --remove-email
# Download models
bnlp download all # Download all models
bnlp download word2vec # Download specific model
# List available models
bnlp list-models
# Access corpus data
bnlp corpus stopwords
bnlp corpus letters

| Command | Description |
|---|---|
| `tokenize` | Tokenize Bengali text (supports: basic, nltk, sentencepiece) |
| `ner` | Named Entity Recognition |
| `pos` | Part-of-Speech tagging |
| `embedding` | Word embeddings (supports: word2vec, fasttext, glove) |
| `clean` | Text cleaning and normalization |
| `download` | Download pre-trained models |
| `list-models` | List all available models |
| `corpus` | Access Bengali corpus data (stopwords, letters, digits, etc.) |
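The commands in the table can also be driven from a script. The sketch below wraps `bnlp tokenize` with the standard `subprocess` module, using only the `--json`, `--type`, and `--sentence` flags that this README documents. The helper names are hypothetical, and the shape of the JSON output is an assumption: the code only assumes that the command prints a single JSON document to stdout.

```python
import json
import shutil
import subprocess


def build_tokenize_command(text, tokenizer_type=None, sentence=False):
    """Build the argument list for `bnlp tokenize` with JSON output."""
    cmd = ["bnlp", "tokenize", text, "--json"]
    if tokenizer_type:
        cmd += ["--type", tokenizer_type]
    if sentence:
        cmd.append("--sentence")
    return cmd


def run_tokenize(text, **options):
    """Run the CLI and parse its output; return None if `bnlp` is not on PATH."""
    if shutil.which("bnlp") is None:
        return None
    result = subprocess.run(
        build_tokenize_command(text, **options),
        capture_output=True,
        encoding="utf-8",
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    # Assumption: stdout contains exactly one JSON document.
    return json.loads(result.stdout)
```

Passing the arguments as a list rather than a shell string avoids quoting problems with Bengali text and punctuation.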
# Get help
bnlp --help
bnlp tokenize --help
# Output as JSON
bnlp tokenize "আমি বাংলায় গান গাই।" --json
# Use different tokenizer
bnlp tokenize "আমি বাংলায় গান গাই।" --type nltk
# Sentence tokenization
bnlp tokenize "আমি বাংলায় গান গাই। তুমি কি গাও?" --type nltk --sentence
# Get similar words with custom count
bnlp embedding "বাংলা" --similar --topn 5

Full documentation is available here.
If you are using a previous version of bnlp, check the documentation archive.
See CONTRIBUTING.md for details.
- Semantics Lab
- All the developers who are contributing to enrich Bengali NLP.