Bengali Natural Language Processing (BNLP)

BNLP is a natural language processing toolkit for the Bengali language. It helps you tokenize Bengali text, embed Bengali words and documents, tag parts of speech, recognize named entities, and clean Bengali text for NLP purposes.

Features

Tokenization
Embeddings
Part of speech tagging
- CRF-based POS tagging
Named Entity Recognition
- CRF-based NER
Text Cleaning
Corpus
- Letters, vowels, punctuation, stopwords
Command Line Interface (CLI)

Installation

PIP installer

pip install bnlp_toolkit

Or upgrade an existing installation

pip install -U bnlp_toolkit

Python: 3.8, 3.9, 3.10, 3.11
OS: Linux, Windows, Mac
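
Before installing, it can help to confirm that the interpreter falls in the supported range and to see whether the package is already present. The following is a minimal sketch using only the standard library. The helper names are illustrative and not part of BNLP; the distribution name `bnlp_toolkit` is taken from the pip command above.

```python
import sys
from importlib import metadata

# Supported minor versions of Python 3, as listed in this README.
SUPPORTED_MINORS = {8, 9, 10, 11}


def is_supported_python(version=None):
    """Return True if (major, minor) is one of the versions listed above."""
    major, minor = (version or sys.version_info)[:2]
    return major == 3 and minor in SUPPORTED_MINORS


def installed_version(dist_name="bnlp_toolkit"):
    """Return the installed version string, or None if it is not installed."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


print("Supported Python:", is_supported_python())
print("Installed bnlp_toolkit version:", installed_version())
```

If the second line prints `None`, the package is not installed in the current environment.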

Build from source

git clone https://github.com/sagorbrur/bnlp.git
cd bnlp
python setup.py install

Sample Usage

from bnlp import BasicTokenizer

tokenizer = BasicTokenizer()

raw_text = "আমি বাংলায় গান গাই।"
tokens = tokenizer(raw_text)
print(tokens)
# output: ["আমি", "বাংলায়", "গান", "গাই", "।"]
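
To make the shape of that output concrete, here is a standard-library sketch of punctuation-aware splitting. It is not BNLP's implementation: the function name `simple_tokenize` and the punctuation set are assumptions chosen for illustration, and `BasicTokenizer` may handle many cases this sketch does not.

```python
import re

# Either a run of characters that are neither whitespace nor punctuation,
# or a single punctuation mark. The punctuation set is an assumption for
# this sketch; "।" is the Bengali danda (full stop).
TOKEN_PATTERN = re.compile(r"[^\s।,;:?!]+|[।,;:?!]")


def simple_tokenize(text):
    """Split text into word tokens and standalone punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


raw_text = "আমি বাংলায় গান গাই।"
tokens = simple_tokenize(raw_text)
print(tokens)
```

For the sample sentence this yields five tokens, with the danda separated from the final word.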

Command Line Interface

BNLP provides a command-line interface for quick text processing without writing Python code.

Basic Usage

# Tokenize text
bnlp tokenize "আমি বাংলায় গান গাই।"
# Output: ['আমি', 'বাংলায়', 'গান', 'গাই', '।']

# Named Entity Recognition
bnlp ner "সজীব ওয়াজেদ জয় ঢাকায় থাকেন।"

# Part-of-Speech Tagging
bnlp pos "আমি ভাত খাই।"

# Get word embeddings (similar words)
bnlp embedding "বাংলা" --similar

# Clean text
bnlp clean "[email protected] আমি বাংলায়" --remove-email

# Download models
bnlp download all          # Download all models
bnlp download word2vec     # Download specific model

# List available models
bnlp list-models

# Access corpus data
bnlp corpus stopwords
bnlp corpus letters
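
A common use of a stopword list such as the one printed by `bnlp corpus stopwords` is filtering tokens before further processing. The sketch below shows the idea in plain Python. The three-word stopword set is an illustrative assumption, not BNLP's actual list, and `remove_stopwords` is a hypothetical helper.

```python
# Illustrative stopword set (assumption): in real use, substitute the list
# printed by `bnlp corpus stopwords`.
EXAMPLE_STOPWORDS = {"আমি", "ও", "এবং"}


def remove_stopwords(tokens, stopwords):
    """Keep only the tokens that are not stopwords, preserving their order."""
    return [token for token in tokens if token not in stopwords]


tokens = ["আমি", "বাংলায়", "গান", "গাই", "।"]
filtered = remove_stopwords(tokens, EXAMPLE_STOPWORDS)
print(filtered)
```

Using a set for the stopwords keeps each membership check constant-time, which matters once the list grows to a few hundred entries.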

CLI Commands

Command	Description
`tokenize`	Tokenize Bengali text (supports: basic, nltk, sentencepiece)
`ner`	Named Entity Recognition
`pos`	Part-of-Speech tagging
`embedding`	Word embeddings (supports: word2vec, fasttext, glove)
`clean`	Text cleaning and normalization
`download`	Download pre-trained models
`list-models`	List all available models
`corpus`	Access Bengali corpus data (stopwords, letters, digits, etc.)

CLI Options

# Get help
bnlp --help
bnlp tokenize --help

# Output as JSON
bnlp tokenize "আমি বাংলায় গান গাই।" --json

# Use different tokenizer
bnlp tokenize "আমি বাংলায় গান গাই।" --type nltk

# Sentence tokenization
bnlp tokenize "আমি বাংলায় গান গাই। তুমি কি গাও?" --type nltk --sentence

# Get similar words with custom count
bnlp embedding "বাংলা" --similar --topn 5
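
The `--json` flag makes the CLI convenient to call from a script. The following sketch wraps `bnlp tokenize` with `subprocess`, using only the flags shown above. The helper names are hypothetical, and because this README does not describe the structure of the JSON output, the sketch simply returns whatever the JSON parser yields.

```python
import json
import shutil
import subprocess


def build_tokenize_command(text, tokenizer_type=None, as_json=True):
    """Build the argument list for `bnlp tokenize` from flags shown in this README."""
    command = ["bnlp", "tokenize", text]
    if tokenizer_type is not None:
        command += ["--type", tokenizer_type]
    if as_json:
        command.append("--json")
    return command


def run_tokenize(text, tokenizer_type=None):
    """Run the CLI and return the parsed JSON, or None if it is unavailable or fails."""
    if shutil.which("bnlp") is None:
        return None  # CLI not installed or not on PATH
    result = subprocess.run(
        build_tokenize_command(text, tokenizer_type),
        capture_output=True,
        encoding="utf-8",
        check=False,
    )
    if result.returncode != 0:
        return None
    try:
        return json.loads(result.stdout)
    except json.JSONDecodeError:
        return None  # output was not valid JSON


print(build_tokenize_command("আমি বাংলায় গান গাই।", tokenizer_type="nltk"))
print(run_tokenize("আমি বাংলায় গান গাই।"))
```

Passing the arguments as a list avoids shell quoting problems with Bengali text and punctuation.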

Documentation

Full documentation is available here.

If you are using a previous version of bnlp, check the documentation archive.

Contributor Guide

Check the CONTRIBUTING.md page for details.

Thanks To

Semantics Lab
All the developers who are contributing to enrich Bengali NLP.
