ParsBERT: Transformer-based Model for Persian Language Understanding

ParsBERT is a monolingual language model based on Google’s BERT architecture. It is pre-trained on large Persian corpora covering various writing styles and numerous subjects (e.g., scientific, novels, news), comprising more than 3.9M documents, 73M sentences, and 1.3B words.

Paper presenting ParsBERT: arXiv:2005.12515

All the models (downstream tasks) are uncased and trained with whole word masking (coming soon, stay tuned).
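To illustrate what whole word masking means, here is a minimal sketch. It assumes WordPiece-style tokens in which a `##` prefix marks a continuation piece, and it masks each word with an independent probability. The function name `whole_word_mask` and its parameters are hypothetical, and the real BERT pre-training recipe (a fixed masking budget plus random and unchanged replacements) is simplified away.

```python
import random


def whole_word_mask(tokens, mask_prob=0.15, mask_token="[MASK]", seed=None):
    """Mask whole words: a word is a token plus its following '##' pieces."""
    rng = random.Random(seed)

    # Group token indices into words; '##' marks a continuation piece.
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])

    # Decide once per word, then mask every piece of a selected word.
    masked = list(tokens)
    for word in words:
        if rng.random() < mask_prob:
            for i in word:
                masked[i] = mask_token
    return masked


tokens = ["pars", "##bert", "is", "mono", "##lingual"]
print(whole_word_mask(tokens, mask_prob=0.5, seed=0))
```

The point of the grouping step is that a word is never partially masked: either all of its pieces are replaced or none are.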


Introduction

ParsBERT was trained on a massive amount of public corpora (Persian Wikidumps, MirasText) and six other manually crawled text datasets from various types of websites (BigBang Page: scientific; Chetor: lifestyle; Eligasht: itinerary; Digikala: digital magazine; Ted Talks: general conversational; Books: novels, storybooks, and short stories from old to contemporary eras).

As part of the ParsBERT methodology, extensive pre-processing combining POS tagging and WordPiece segmentation was carried out to bring the corpora into a proper format.
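The WordPiece step can be sketched as a greedy longest-match-first segmentation of each word. This is only an illustration under stated assumptions: the vocabulary below is a hypothetical toy set (the real ParsBERT vocabulary is learned from the corpora), the function name `wordpiece` is made up for this example, and the POS tagging part of the pre-processing is not shown.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first segmentation of a single word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry '##'
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary, for illustration only.
vocab = {"pars", "##bert", "token", "##ize", "##r"}
print(wordpiece("parsbert", vocab))   # ['pars', '##bert']
print(wordpiece("tokenizer", vocab))  # ['token', '##ize', '##r']
```

The same pattern is visible in the tokenizer output shown under "How to use", where a word missing from the vocabulary is split into a head piece and a `##` continuation.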

ParsBERT Demo

ParsBERT Playground

Evaluation

ParsBERT is evaluated on three downstream NLP tasks: Sentiment Analysis (SA), Text Classification (TC), and Named Entity Recognition (NER). Because existing resources for these tasks were insufficient, two large datasets for SA and two for TC were manually composed; they are available for public use and benchmarking. ParsBERT outperformed all other language models, including multilingual BERT and other hybrid deep learning models, on all tasks, improving the state-of-the-art performance in Persian language modeling.

Results

The following tables summarize the F1 scores obtained by ParsBERT compared with other models and architectures.

Sentiment Analysis (SA) task

Dataset                     ParsBERT   mBERT   DeepSentiPers
Digikala User Comments      81.74*     80.74   -
SnappFood User Comments     88.12*     87.87   -
SentiPers (Multi Class)     71.11*     -       69.33
SentiPers (Binary Class)    92.13*     -       91.98

Text Classification (TC) task

Dataset              ParsBERT   mBERT
Digikala Magazine    93.59*     90.72
Persian News         97.19*     95.79

Named Entity Recognition (NER) task

Dataset   ParsBERT   mBERT   MorphoBERT   Beheshti-NER   LSTM-CRF   Rule-Based CRF   BiLSTM-CRF
PEYMA     93.10*     86.64   -            90.59          -          84.00            -
ARMAN     98.79*     95.89   89.9         84.03          86.55      -                77.45

If you have tested ParsBERT on a public dataset and want to add your results to the tables above, open a pull request or contact us. Please also make your code available online so we can add it as a reference.
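For readers who want to compare their own results with the F1 scores above, the following is a minimal sketch of how an F1 score is computed from prediction counts. The README does not state which averaging was used for the reported numbers, so the macro average shown here is an assumption, and the function names are hypothetical. The tables report percentages, so multiply the result by 100 to compare.

```python
def f1_score(tp, fp, fn):
    """F1 is the harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-label F1 scores (assumed averaging)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        scores.append(f1_score(tp, fp, fn))
    return sum(scores) / len(scores) if scores else 0.0


gold = ["pos", "neg", "pos", "neg"]
pred = ["pos", "neg", "neg", "neg"]
print(round(100 * macro_f1(gold, pred), 2))
```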

How to use

TensorFlow 2.0

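# Load the configuration, tokenizer, and TensorFlow weights from the Hugging Face Hub.
from transformers import AutoConfig, AutoTokenizer, TFAutoModel

config = AutoConfig.from_pretrained("HooshvareLab/bert-base-parsbert-uncased")
tokenizer = AutoTokenizer.from_pretrained("HooshvareLab/bert-base-parsbert-uncased")
# Use the TensorFlow class imported above; plain AutoModel is the PyTorch class and is not imported here.
model = TFAutoModel.from_pretrained("HooshvareLab/bert-base-parsbert-uncased")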

text = "ما در هوشواره معتقدیم با انتقال صحیح دانش و آگاهی، همه افراد می‌توانند از ابزارهای هوشمند استفاده کنند. شعار ما هوش مصنوعی برای همه است."
tokenizer.tokenize(text)

>>> ['ما', 'در', 'هوش', '##واره', 'معتقدیم', 'با', 'انتقال', 'صحیح', 'دانش', 'و', 'اگاهی', '،', 'همه', 'افراد', 'میتوانند', 'از', 'ابزارهای', 'هوشمند', 'استفاده', 'کنند', '.', 'شعار', 'ما', 'هوش', 'مصنوعی', 'برای', 'همه', 'است', '.']

PyTorch

from transformers import AutoConfig, AutoTokenizer, AutoModel

config = AutoConfig.from_pretrained("HooshvareLab/bert-base-parsbert-uncased")
tokenizer = AutoTokenizer.from_pretrained("HooshvareLab/bert-base-parsbert-uncased")
model = AutoModel.from_pretrained("HooshvareLab/bert-base-parsbert-uncased")
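Once the model is loaded, a common next step is to turn a sentence into a single vector. The sketch below does this by averaging the last hidden states of the non-padding tokens. Mean pooling is an assumption made for illustration, not something this README prescribes, and the helper names `mean_pool` and `embed` are hypothetical. Calling `embed` requires `torch` and `transformers` to be installed and downloads the model on first use.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose attention-mask entry is 1."""
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    if not kept:
        raise ValueError("attention mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


def embed(text, model_name="HooshvareLab/bert-base-parsbert-uncased"):
    """Return one sentence vector for `text` as a list of floats."""
    # Imported lazily so that mean_pool stays usable without these packages.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()

    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    hidden = outputs.last_hidden_state[0].tolist()  # one vector per token
    mask = inputs["attention_mask"][0].tolist()     # 1 marks a real token
    return mean_pool(hidden, mask)
```

For a single unpadded sentence the mask is all ones, so the masking only matters once several sentences are batched and padded to a common length.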

NLP Tasks Tutorial

  • Named Entity Recognition
  • Sentiment Analysis (soon)
  • Text Classification (soon)
  • Topic Modeling (soon)
  • Question Answering (soon)

Cite

Please cite the following paper in your publication if you are using ParsBERT in your research:

@article{ParsBERT,
    title={ParsBERT: Transformer-based Model for Persian Language Understanding},
    author={Mehrdad Farahani and Mohammad Gharachorloo and Marzieh Farahani and Mohammad Manthouri},
    journal={ArXiv},
    year={2020},
    volume={abs/2005.12515}
}

Acknowledgments

We hereby express our gratitude to the TensorFlow Research Cloud (TFRC) program for providing us with the necessary computation resources. We also thank Hooshvare Research Group for facilitating dataset gathering and scraping online text resources.

Contributors

Releases

Release v0.1 (May 27, 2019)

This is the first version of our ParsBERT, based on BERT-Base.