[Language Model] BERTopic


데이터 세상 2021. 6. 14. 21:09


BERTopic

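A topic modeling technique that uses BERT embeddings and class-based TF-IDF (c-TF-IDF) to create dense clusters, yielding easily interpretable topics while keeping the important words in the topic descriptions.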

https://maartengr.github.io/BERTopic/index.html


Algorithm

Sentence Transformer (SBERT)

a Python framework for state-of-the-art sentence, text and image embeddings

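To find two similar sentences with BERT, both sentences must be fed into a single BERT model together before their similarity can be scored.
So with 10,000 sentences, ranking all pairs takes 10000 * 9999 / 2 = 49,995,000 inference passes.
For clustering or search, each sentence is usually mapped into a vector space instead; with BERT this is done by averaging the output vectors or by taking the output of the [CLS] token.
However, the sentence embeddings obtained this way are often worse than simply averaging the GloVe vectors of the words.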
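The two pooling strategies mentioned above can be sketched on toy numbers. This is a minimal illustration, not BERT itself: `token_outputs` and `mask` are made-up values standing in for an encoder's per-token outputs and attention mask, and the helper names `mean_pool` and `cls_pool` are hypothetical.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors, ignoring padded positions (mask == 0)."""
    dim = len(token_vectors[0])
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


def cls_pool(token_vectors):
    """Use the output at position 0, where BERT places the [CLS] token."""
    return list(token_vectors[0])


# Toy encoder outputs (assumed values): 4 positions x 3 dimensions.
token_outputs = [
    [1.0, 0.0, 2.0],  # [CLS]
    [3.0, 2.0, 0.0],
    [2.0, 4.0, 1.0],
    [9.0, 9.0, 9.0],  # padding, excluded by the mask
]
mask = [1, 1, 1, 0]

mean_vec = mean_pool(token_outputs, mask)  # one vector for the sentence
cls_vec = cls_pool(token_outputs)          # another candidate sentence vector
print(mean_vec, cls_vec)
```

Either vector can serve as a fixed-size sentence representation; the point of the paragraph above is that, without further training, neither tends to work well for similarity tasks.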

SBERT applies siamese and triplet network structures to the BERT network.
This makes large-scale similarity comparison, clustering, and information retrieval feasible, which plain BERT could not handle in practice.
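Why independent sentence embeddings help at scale can be sketched as follows. The embeddings here are made-up 2-dimensional vectors rather than real SBERT outputs, and the names `cosine`, `embeddings`, and `best_pair` are assumptions for illustration only. The idea is that each sentence is encoded once, after which every pair is compared with a cheap vector operation.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


n_sentences = 10_000
# Pairwise scoring with one model call per sentence pair.
pairwise_passes = math.comb(n_sentences, 2)
# Embedding approach: one model call per sentence.
embedding_passes = n_sentences
print(pairwise_passes, embedding_passes)

# Hypothetical precomputed sentence embeddings.
embeddings = {
    "s1": [1.0, 0.0],
    "s2": [2.0, 0.0],  # same direction as s1
    "s3": [0.0, 1.0],  # orthogonal to s1 and s2
}

# Rank all pairs by cosine similarity without calling the model again.
ranked = sorted(
    combinations(embeddings, 2),
    key=lambda pair: cosine(embeddings[pair[0]], embeddings[pair[1]]),
    reverse=True,
)
best_pair = ranked[0]
print(best_pair)
```

In a real setting the vectors would come from a sentence-embedding model, and the pairwise step would typically be a single matrix operation rather than a Python loop.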

STS(Semantic Textual Similarity) benchmark (STSb)

https://paperswithcode.com/sota/semantic-textual-similarity-on-sts-benchmark

Full list of pre-trained models

https://www.sbert.net/docs/pretrained_models.html

UMAP

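UMAP (Uniform Manifold Approximation and Projection) is a dimensionality reduction algorithm. BERTopic uses it to reduce the dimensionality of the sentence embeddings before clustering.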

Related post: [NLP/Text Categorization] - Dimensionality Reduction (yumdata.tistory.com)

HDBSCAN

https://hdbscan.readthedocs.io/en/latest/index.html


import hdbscan

# min_cluster_size: the smallest grouping you want to treat as a cluster.
# cluster_selection_method: "eom" (excess of mass, the default) or "leaf".
hdbscan_model = hdbscan.HDBSCAN(min_cluster_size=20, cluster_selection_method="eom")

c-TF-IDF (Class-based TF-IDF)

https://github.com/MaartenGr/cTFIDF
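As a rough illustration of the idea, the sketch below treats all documents of one class (cluster) as a single document and scores each word with W(x, c) = tf(x, c) * log(1 + A / f(x)), where tf(x, c) is the frequency of word x in class c, f(x) is its frequency across all classes, and A is the average number of words per class. This is the formulation described in the BERTopic documentation as far as I know; the linked repository may use a different variant, and the length normalization of tf, the whitespace tokenizer, and the function names `c_tf_idf` and `top_words` are assumptions made here.

```python
import math
from collections import Counter


def c_tf_idf(class_docs):
    """class_docs maps a class label to its list of documents (strings)."""
    # Join the documents of each class and count words per class.
    class_counts = {
        label: Counter(" ".join(docs).lower().split())
        for label, docs in class_docs.items()
    }
    # f(x): frequency of each word across all classes.
    total_counts = Counter()
    for counts in class_counts.values():
        total_counts.update(counts)
    # A: average number of words per class.
    avg_words = sum(total_counts.values()) / len(class_counts)

    scores = {}
    for label, counts in class_counts.items():
        n_words = sum(counts.values())
        scores[label] = {
            # Length-normalized tf (an assumption) times the class-based idf.
            word: (count / n_words) * math.log(1 + avg_words / total_counts[word])
            for word, count in counts.items()
        }
    return scores


def top_words(scores, label, k=2):
    """Return the k highest-scoring words for one class."""
    ordered = sorted(scores[label].items(), key=lambda kv: kv[1], reverse=True)
    return [word for word, _ in ordered[:k]]


# Toy classes (made-up data).
toy = {
    "sports": ["the team won the match", "the team lost"],
    "cooking": ["bake the bread", "the bread is warm"],
}
scores = c_tf_idf(toy)
print(top_words(scores, "sports", 1), top_words(scores, "cooking", 1))
```

In the toy data, "the" is the most frequent word in the sports class, yet "team" scores higher because "the" also appears in the other class. That is the behavior that makes the top-scoring words usable as topic descriptions.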


plotly

https://plotly.com/

