Dense Passage Retrieval (Context Encoder) for Bahasa Indonesia. It can be used for semantic similarity search tasks.

firqaaa/DPR-bahasa

pipeline_tag: feature-extraction
tags: feature-extraction, transformers
license: apache-2.0
language: id
metrics: accuracy, f1, precision, recall
datasets: squad_v2

indo-dpr-ctx_encoder-single-squad-base

Indonesian Dense Passage Retrieval context encoder trained on a translated SQuAD v2.0 dataset in DPR format.

Evaluation

Class          Precision  Recall  F1-Score  Support
hard_negative  0.9963     0.9963  0.9963    183090
positive       0.8849     0.8849  0.8849    5910

Metric            Value
Accuracy          0.9928
Macro Average     0.9406
Weighted Average  0.9928

Note: This report is for evaluation on the dev set, after 12000 batches.
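The two averages can be checked against the per-class rows above. The sketch below assumes they are the usual macro (unweighted mean) and support-weighted mean of the per-class scores; because precision, recall, and F1 are identical within each class in this report, the same arithmetic applies to all three. The variable names are illustrative only.

```python
# Per-class scores and supports copied from the evaluation table.
scores = {"hard_negative": 0.9963, "positive": 0.8849}
support = {"hard_negative": 183090, "positive": 5910}

# Macro average: plain mean over classes, ignoring class size.
macro_avg = sum(scores.values()) / len(scores)

# Weighted average: each class score weighted by its support.
total = sum(support.values())
weighted_avg = sum(scores[c] * support[c] for c in scores) / total

print(round(macro_avg, 4), round(weighted_avg, 4))
```

The large gap between the two averages reflects the class imbalance: hard negatives outnumber positives by roughly 31 to 1, so the weighted figure is dominated by the hard_negative class.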

Usage

from transformers import DPRContextEncoder, DPRContextEncoderTokenizer

# Load the tokenizer and context encoder from the Hugging Face Hub.
tokenizer = DPRContextEncoderTokenizer.from_pretrained('firqaaa/indo-dpr-ctx_encoder-single-squad-base')
model = DPRContextEncoder.from_pretrained('firqaaa/indo-dpr-ctx_encoder-single-squad-base')
# Tokenize the text and return PyTorch tensors.
input_ids = tokenizer("Ibukota Indonesia terletak dimana?", return_tensors='pt')["input_ids"]
# pooler_output holds one embedding vector per input text.
embeddings = model(input_ids).pooler_output
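Once texts are embedded, DPR-style retrieval ranks passages by the dot product between a query vector and each passage vector. The following is a minimal, dependency-free sketch of that ranking step. The 3-dimensional vectors are toy stand-ins for rows of `pooler_output`, and `rank_passages` is a hypothetical helper, not part of `transformers`.

```python
from typing import List, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def rank_passages(
    query_vec: Sequence[float],
    passage_vecs: Sequence[Sequence[float]],
) -> List[Tuple[int, float]]:
    """Return (passage_index, score) pairs sorted by descending dot product."""
    scored = [(i, dot(query_vec, p)) for i, p in enumerate(passage_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors (assumption): real embeddings would come from the encoder.
query = [1.0, 0.0, 1.0]
passages = [
    [0.5, 0.5, 0.0],
    [1.0, 0.0, 2.0],
    [0.0, 1.0, 0.0],
]

ranking = rank_passages(query, passages)
for index, score in ranking:
    print(index, score)
```

With real embeddings, the same ranking is usually done as a single matrix product or with a vector index rather than a Python loop.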

You can also use it with Haystack as follows:

from haystack.nodes import DensePassageRetriever
from haystack.document_stores import InMemoryDocumentStore

retriever = DensePassageRetriever(document_store=InMemoryDocumentStore(),
                                  query_embedding_model="firqaaa/indo-dpr-ctx_encoder-single-squad-base",
                                  passage_embedding_model="firqaaa/indo-dpr-ctx_encoder-single-squad-base",
                                  max_seq_len_query=64,
                                  max_seq_len_passage=256,
                                  batch_size=16,
                                  use_gpu=True,
                                  embed_title=True,
                                  use_fast_tokenizers=True)
