HF Multimodal

BM-K/KoSimCSE-roberta-multitask

https://github.com/BM-K/Sentence-Embedding-is-all-you-need

Korean-Sentence-Embedding

Korean sentence embedding repository. You can download the pre-trained models and run inference right away; the repository also provides an environment for training your own models.

Quick tour

import torch
from transformers import AutoModel, AutoTokenizer

def cal_score(a, b):
    # Cosine similarity scaled to a 0-100 range; accepts single vectors or batches.
    if len(a.shape) == 1: a = a.unsqueeze(0)
    if len(b.shape) == 1: b = b.unsqueeze(0)
    a_norm = a / a.norm(dim=1)[:, None]
    b_norm = b / b.norm(dim=1)[:, None]
    return torch.mm(a_norm, b_norm.transpose(0, 1)) * 100

model = AutoModel.from_pretrained('BM-K/KoSimCSE-roberta-multitask')
tokenizer = AutoTokenizer.from_pretrained('BM-K/KoSimCSE-roberta-multitask')

sentences = ['치타가 들판을 가로 질러 먹이를 쫓는다.',
             '치타 한 마리가 먹이 뒤에서 달리고 있다.',
             '원숭이 한 마리가 드럼을 연주한다.']

inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
embeddings, _ = model(**inputs, return_dict=False)  # (last_hidden_state, pooler_output)

# Compare the first sentence's [CLS] embedding against the other two.
score01 = cal_score(embeddings[0][0], embeddings[1][0])
score02 = cal_score(embeddings[0][0], embeddings[2][0])
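Because cal_score normalizes its inputs and uses a matrix multiply, it also accepts whole batches, so all pairwise similarities can be computed in one call. A minimal sketch, assuming the embeddings and sentences variables from the quick tour above (the printing is purely illustrative):

# Pairwise similarities for the whole batch, using each sentence's [CLS]
# embedding (position 0 of the last hidden state).
cls_embeddings = embeddings[:, 0]            # shape: (num_sentences, hidden_size)
similarity_matrix = cal_score(cls_embeddings, cls_embeddings)

for sentence, row in zip(sentences, similarity_matrix.tolist()):
    print(sentence, ['%.2f' % s for s in row])

The two cheetah sentences should score markedly higher with each other than either does with the monkey sentence.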

Performance

  • Semantic Textual Similarity test set results
| Model | AVG | Cosine Pearson | Cosine Spearman | Euclidean Pearson | Euclidean Spearman | Manhattan Pearson | Manhattan Spearman | Dot Pearson | Dot Spearman |
|---|---|---|---|---|---|---|---|---|---|
| KoSBERT (SKT) | 77.40 | 78.81 | 78.47 | 77.68 | 77.78 | 77.71 | 77.83 | 75.75 | 75.22 |
| KoSBERT | 80.39 | 82.13 | 82.25 | 80.67 | 80.75 | 80.69 | 80.78 | 77.96 | 77.90 |
| KoSRoBERTa | 81.64 | 81.20 | 82.20 | 81.79 | 82.34 | 81.59 | 82.20 | 80.62 | 81.25 |
| KoSentenceBART | 77.14 | 79.71 | 78.74 | 78.42 | 78.02 | 78.40 | 78.00 | 74.24 | 72.15 |
| KoSentenceT5 | 77.83 | 80.87 | 79.74 | 80.24 | 79.36 | 80.19 | 79.27 | 72.81 | 70.17 |
| KoSimCSE-BERT (SKT) | 81.32 | 82.12 | 82.56 | 81.84 | 81.63 | 81.99 | 81.74 | 79.55 | 79.19 |
| KoSimCSE-BERT | 83.37 | 83.22 | 83.58 | 83.24 | 83.60 | 83.15 | 83.54 | 83.13 | 83.49 |
| KoSimCSE-RoBERTa | 83.65 | 83.60 | 83.77 | 83.54 | 83.76 | 83.55 | 83.77 | 83.55 | 83.64 |
| KoSimCSE-BERT-multitask | 85.71 | 85.29 | 86.02 | 85.63 | 86.01 | 85.57 | 85.97 | 85.26 | 85.93 |
| KoSimCSE-RoBERTa-multitask | 85.77 | 85.08 | 86.12 | 85.84 | 86.12 | 85.83 | 86.12 | 85.03 | 85.99 |
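As background on how columns such as Cosine Pearson/Spearman are typically produced in STS evaluation: embed both sentences of each test pair, score the pair with a similarity function (cosine here), and correlate the predicted scores with the human gold labels. The sketch below is illustrative only; pairs and gold_scores are hypothetical placeholders, not the repository's actual evaluation data or script.

from scipy import stats

# Hypothetical STS-style pairs with human-annotated gold scores (typically 0-5).
pairs = [(sentences[0], sentences[1]),
         (sentences[0], sentences[2]),
         (sentences[1], sentences[2])]
gold_scores = [4.5, 0.5, 0.5]  # placeholder labels

predicted = []
for s1, s2 in pairs:
    batch = tokenizer([s1, s2], padding=True, truncation=True, return_tensors="pt")
    emb, _ = model(**batch, return_dict=False)
    predicted.append(cal_score(emb[0][0], emb[1][0]).item())

# "Cosine Pearson" / "Cosine Spearman" correlate predicted similarities with gold labels.
print(stats.pearsonr(predicted, gold_scores)[0])
print(stats.spearmanr(predicted, gold_scores)[0])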
