[NLP] Gensim

데이터 세상 2021. 4. 11. 18:05

Gensim

Python library for topic modelling, document indexing and similarity retrieval with large corpora

A Python package for natural language processing.
It implements methods often used for topic modeling, such as Latent Dirichlet Allocation (LDA) and Random Projection (RP).
Later versions also added embedding methods such as Word2Vec and Doc2Vec.

pypi.org/project/gensim/

Installing Gensim

pip install gensim

gensim.models.Phrases

Use gensim.models.Phrases to add bigram and trigram detection.

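import gensim

# doc_words: a list of tokenized documents (each document is a list of tokens)
# min_count: words and bigrams whose total count is below this value are ignored
# threshold: the default is 10.0; the smaller the value, the more readily two tokens
#         are joined into a new token.
#         Adjust the value and check whether the resulting phrases look right.
#         To merge tokens into compounds whenever possible, pass a value such as 0.01.
# delimiter: the string that joins the tokens when they are merged into a new word

bigram = gensim.models.Phrases(doc_words, min_count=1, threshold=2)
bigram_mod = gensim.models.phrases.Phraser(bigram)
# Train the trigram model on the bigram-merged corpus
trigram = gensim.models.Phrases(bigram[doc_words], min_count=1, threshold=3)
trigram_mod = gensim.models.phrases.Phraser(trigram)
# Merge bigrams first, then apply the trigram model to the bigram output
doc_words = [bigram_mod[d_word] for d_word in doc_words]
doc_words = [trigram_mod[d_word] for d_word in doc_words]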

License: Attribution, NonCommercial, NoDerivatives
