[NLP] NLTK(Natural Language Toolkit)

데이터 세상 2021. 3. 10. 23:01

NLTK(Natural Language Toolkit)

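A Python package for natural language processing and document analysis, originally developed for teaching
Used as a tool for preprocessing and analyzing English text
Ships with more than 50 corpus resources that can be used to analyze English text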

Terminologies

English	Korean	Description
Document	문서	-
Corpus	말뭉치	A set of documents
Token	토큰	Meaningful elements in a text, such as words, phrases, or symbols
Morpheme	형태소	Smallest meaningful unit in a language
POS	품사	Part of speech (e.g., noun)

Installing NLTK

pip install nltk

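Corpus

A collection of sample documents assembled for natural language analysis
The corpus subpackage provides a variety of research corpora
Each resource has to be downloaded with nltk.download before it can be used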

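import nltk

# for tokenizing (word_tokenize, sent_tokenize)
# Note: recent NLTK releases may ask for differently named resources
# (for example 'punkt_tab'); if a LookupError names one, download that instead.
nltk.download("punkt")

# for lemmatizing (WordNetLemmatizer)
nltk.download("wordnet")

# for tag set documentation (nltk.help.upenn_tagset)
nltk.download('tagsets')

# for pos_tag
nltk.download('averaged_perceptron_tagger')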
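
Calling nltk.download on every run repeats work once the data is installed. The sketch below checks whether a resource can already be found and downloads only what is missing. The helper names (is_installed, ensure_resources) and the resource paths in RESOURCES are assumptions made for illustration; the paths follow the usual layout of the nltk_data directory and may differ between NLTK versions.

```python
import nltk

# Download id -> path under nltk_data (assumed conventional locations).
RESOURCES = {
    "punkt": "tokenizers/punkt",
    "wordnet": "corpora/wordnet",
    "tagsets": "help/tagsets",
    "averaged_perceptron_tagger": "taggers/averaged_perceptron_tagger",
}


def is_installed(resource_path):
    """Return True if NLTK can locate the resource on its search path."""
    try:
        nltk.data.find(resource_path)
        return True
    except LookupError:
        return False


def ensure_resources(resources=RESOURCES):
    """Download only the resources that are not installed yet."""
    for name, path in resources.items():
        if not is_installed(path):
            nltk.download(name, quiet=True)


# Report what is missing without touching the network.
missing = [name for name, path in RESOURCES.items() if not is_installed(path)]
print("missing resources:", missing)
```

Calling ensure_resources() once at program start would then fetch whatever the report lists as missing.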

※ Downloading NLTK data when using Docker

Add the following to the Dockerfile

# Download the required packages with nltk.downloader
RUN python -W ignore -m nltk.downloader punkt
RUN python -W ignore -m nltk.downloader stopwords
RUN python -W ignore -m nltk.downloader wordnet
RUN python -W ignore -m nltk.downloader averaged_perceptron_tagger

# If a path error occurs when accessing the NLTK data
RUN mv /root/nltk_data /usr/local/share/nltk_data
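
When a path error does occur, it can help to look at where NLTK is actually searching. The sketch below prints the search path and registers one extra directory at runtime. The directory /opt/nltk_data is a hypothetical location chosen for illustration, not something the original setup uses.

```python
import os

import nltk

# Directories NLTK searches for data, in order.
for directory in nltk.data.path:
    status = "exists" if os.path.isdir(directory) else "missing"
    print(f"{directory} ({status})")

# Hypothetical extra location; adjust to wherever the data was copied.
custom_dir = "/opt/nltk_data"
if custom_dir not in nltk.data.path:
    nltk.data.path.append(custom_dir)
```

Setting the NLTK_DATA environment variable in the image is another way to point NLTK at a data directory without moving files.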

Tokenizing

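Splitting a text into units according to some criterion
- For example, splitting a sentence into words, or a whole document into sentences
- In Korean, individual phonemes (jamo) such as 'ㄱ', 'ㄴ', 'ㅏ', 'ㅗ' can also be tokens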
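
As a rough illustration of jamo-level tokens, the sketch below decomposes Hangul syllables with the standard library only (no NLTK involved). It relies on Unicode NFD normalization, which yields conjoining jamo code points; these are different code points from the compatibility jamo such as 'ㄱ' written above, although they represent the same sounds. The function name to_jamo is made up for this example.

```python
import unicodedata


def to_jamo(text):
    """Decompose precomposed Hangul syllables into conjoining jamo (NFD)."""
    return list(unicodedata.normalize("NFD", text))


jamo = to_jamo("한글")
print(len(jamo))                                  # number of jamo tokens
print([unicodedata.name(ch) for ch in jamo])      # Unicode name of each jamo
# Recomposing with NFC restores the original syllables.
print(unicodedata.normalize("NFC", "".join(jamo)) == "한글")
```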

tokenizer

A function that splits a string into tokens

word tokenizer

Returns a list in which words and special characters are separate items

from nltk.tokenize import word_tokenize

sentence = "Welcome to NLTK!! This is for testing the nltk tokenizer."
print(word_tokenize(sentence))

>>
['Welcome', 'to', 'NLTK', '!', '!', 'This', 'is', 'for', 'testing', 'the', 'nltk', 'tokenizer', '.']
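
Later steps often drop the punctuation tokens and lowercase the rest before counting. The sketch below does this on the token list printed above, copied in by hand so that it runs without any NLTK data. It is one possible normalization, not something word_tokenize does itself.

```python
from collections import Counter

# Token list copied from the word_tokenize output shown above.
tokens = ['Welcome', 'to', 'NLTK', '!', '!', 'This', 'is', 'for',
          'testing', 'the', 'nltk', 'tokenizer', '.']

# Keep alphabetic tokens only and lowercase them.
words = [t.lower() for t in tokens if t.isalpha()]
counts = Counter(words)

print(words)
print(counts.most_common(1))
```

After lowercasing, 'NLTK' and 'nltk' are counted as the same word.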

sentence tokenizer

from nltk.tokenize import sent_tokenize

sentence = "Welcome to NLTK!! This is for testing the nltk tokenizer."
print(sent_tokenize(sentence))

>>
['Welcome to NLTK!!', 'This is for testing the nltk tokenizer.']
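
For comparison, a naive splitter written with the standard re module is sketched below. It is not how sent_tokenize works; it only splits after '.', '!' or '?' when whitespace follows. The function name naive_sent_split and the second test sentence are made up for illustration.

```python
import re


def naive_sent_split(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip())


print(naive_sent_split("Welcome to NLTK!! This is for testing the nltk tokenizer."))
print(naive_sent_split("Dr. Smith arrived. He was late."))
```

The first call gives the same two sentences as above, while the second wrongly splits after "Dr.". Abbreviations like this are the kind of case a trained sentence tokenizer is meant to handle better than a fixed rule.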

Regular expression tokenizer

from nltk.tokenize import RegexpTokenizer

# Raw string so that \w is passed to the regex engine unchanged
retokenize = RegexpTokenizer(r"[\w]+")
sentence = "Welcome to NLTK!! This is for testing the nltk tokenizer."
print(retokenize.tokenize(sentence))

>>
['Welcome', 'to', 'NLTK', 'This', 'is', 'for', 'testing', 'the', 'nltk', 'tokenizer']
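
Two further regex-based variants are sketched below; neither needs downloaded data. With gaps=True the pattern describes the separators rather than the tokens, and wordpunct_tokenize keeps runs of punctuation as tokens of their own. The variable name whitespace_tokenizer is chosen for this example.

```python
from nltk.tokenize import RegexpTokenizer, wordpunct_tokenize

sentence = "Welcome to NLTK!! This is for testing the nltk tokenizer."

# gaps=True: the pattern matches what lies between tokens (here, whitespace).
whitespace_tokenizer = RegexpTokenizer(r"\s+", gaps=True)
print(whitespace_tokenizer.tokenize(sentence))

# Splits into alphanumeric runs and punctuation runs.
print(wordpunct_tokenize(sentence))
```

Splitting on whitespace leaves 'NLTK!!' and 'tokenizer.' intact, whereas wordpunct_tokenize yields '!!' as a single token, unlike the two separate '!' tokens from word_tokenize.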

Morphological Analysis

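Morpheme
- In linguistics, the smallest unit of language that carries meaning
Morphological analysis
- The task of identifying linguistic properties of a word, such as its root, prefixes, suffixes, and part of speech, and using them to find or process its morphemes

Stemming

Removes suffixes and endings from inflected words to reach a common base form for words that share a meaning
Because it only strips endings by rule, the result is not always the exact dictionary form of the word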

from nltk.stem import PorterStemmer, LancasterStemmer

st1 = PorterStemmer()
st2 = LancasterStemmer()

words = ["fly", "flies", "flying", "flew", "flown"]

print("Porter Stemmer   :", [st1.stem(w) for w in words])
print("Lancaster Stemmer:", [st2.stem(w) for w in words])

>>
Porter Stemmer   : ['fli', 'fli', 'fli', 'flew', 'flown']
Lancaster Stemmer: ['fly', 'fli', 'fly', 'flew', 'flown']
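
One common use of stems is as grouping keys, so that inflected forms fall into the same bucket. The sketch below groups the same word list by its Porter stem; the variable names are chosen for this example.

```python
from collections import defaultdict

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["fly", "flies", "flying", "flew", "flown"]

# Map each stem to the original words that reduce to it.
groups = defaultdict(list)
for word in words:
    groups[stemmer.stem(word)].append(word)

print(dict(groups))
```

Consistent with the output above, the regular forms share the stem 'fli', while the irregular forms 'flew' and 'flown' stay in groups of their own.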

Lemmatizing

Normalizes different forms of a word with the same meaning to their dictionary form (lemma)
Specifying the part of speech gives a more accurate lemma

from nltk.stem import WordNetLemmatizer

lm = WordNetLemmatizer()
words = ["fly", "flies", "flying", "flew", "flown"]
print([lm.lemmatize(w, pos="v") for w in words])

>>
['fly', 'fly', 'fly', 'fly', 'fly']
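
The pos argument takes WordNet's single-letter codes ('n', 'v', 'a', 'r'), while pos_tag produces Penn Treebank tags such as 'VBZ'. A small mapping is a common way to bridge the two. The helper penn_to_wordnet below is an illustrative sketch, and the tagged pairs are written by hand rather than produced by a tagger.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS code, defaulting to noun."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # noun, and the fallback for everything else


# Hand-written (word, Penn tag) pairs for illustration.
tagged = [("flies", "VBZ"), ("larger", "JJR"), ("desks", "NNS"), ("swiftly", "RB")]
print([(word, penn_to_wordnet(tag)) for word, tag in tagged])
```

With the wordnet resource installed, each pair could then be passed on as lm.lemmatize(word, pos=penn_to_wordnet(tag)).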

POS (Part-Of-Speech) Tagging

A part of speech is a category of words grouped by grammatical function, form, or meaning

nltk.help.upenn_tagset

Shows a detailed description of a tag

import nltk
# nltk.download('tagsets')
nltk.help.upenn_tagset("VB")
>>
VB: verb, base form
    ask assemble assess assign assume atone attention avoid bake balkanize
    bank begin behold believe bend benefit bevel beware bless boil bomb
    boost brace break bring broil brush build ...

pos_tag

Attaches a part-of-speech tag to each word token and returns (word, tag) tuples

from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize

# nltk.download('averaged_perceptron_tagger')
sentence = "Welcome to NLTK!! This is for testing the nltk tokenizer."
tagged_list = pos_tag(word_tokenize(sentence))
print(tagged_list)

>>
[('Welcome', 'VB'), ('to', 'TO'), ('NLTK', 'NNP'), ('!', '.'), ('!', '.'), 
('This', 'DT'), ('is', 'VBZ'), ('for', 'IN'), ('testing', 'VBG'), ('the', 'DT'), 
('nltk', 'JJ'), ('tokenizer', 'NN'), ('.', '.')]
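
Once tokens are tagged, plain list operations are enough to select words by tag. The sketch below works on the tagged list printed above, copied in by hand so that it runs without the tagger model; it picks out noun tokens and counts how often each tag occurs.

```python
from collections import Counter

# Tagged list copied from the pos_tag output shown above.
tagged_list = [('Welcome', 'VB'), ('to', 'TO'), ('NLTK', 'NNP'), ('!', '.'),
               ('!', '.'), ('This', 'DT'), ('is', 'VBZ'), ('for', 'IN'),
               ('testing', 'VBG'), ('the', 'DT'), ('nltk', 'JJ'),
               ('tokenizer', 'NN'), ('.', '.')]

# All noun tags (NN, NNS, NNP, NNPS) start with "NN".
nouns = [word for word, tag in tagged_list if tag.startswith("NN")]
tag_counts = Counter(tag for _, tag in tagged_list)

print(nouns)
print(tag_counts.most_common(2))
```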

untag

Removes the tags from a list of tagged tuples, leaving only the words

from nltk.tag import pos_tag, untag
from nltk.tokenize import word_tokenize

sentence = "Welcome to NLTK!! This is for testing the nltk tokenizer."
tagged_list = pos_tag(word_tokenize(sentence))
print(untag(tagged_list))

>>
['Welcome', 'to', 'NLTK', '!', '!', 'This', 'is', 'for', 'testing', 'the', 'nltk', 'tokenizer', '.']

Treating the same token with different parts of speech as different tokens

Join each original token with its tag to form a new token name

from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize

sentence = "Welcome to NLTK!! This is for testing the nltk tokenizer."

def tokenizer(doc):
    # Tag the document that was passed in, then join each (word, tag) pair as "word/TAG"
    return ["/".join(p) for p in pos_tag(word_tokenize(doc))]

print(tokenizer(sentence))

>>
['Welcome/VB', 'to/TO', 'NLTK/NNP', '!/.', '!/.', 'This/DT', 'is/VBZ', 
'for/IN', 'testing/VBG', 'the/DT', 'nltk/JJ', 'tokenizer/NN', './.']
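
The joined "word/TAG" strings can be turned back into tuples when the tag is needed again. The sketch below uses str2tuple and tuple2str from nltk.tag on a few strings copied from the output above; no downloaded data is required.

```python
from nltk.tag import str2tuple, tuple2str

# A few "word/TAG" strings copied from the output above.
joined = ['Welcome/VB', 'to/TO', 'NLTK/NNP', '!/.', './.']

# str2tuple splits at the last separator, so './.' becomes ('.', '.').
pairs = [str2tuple(s) for s in joined]
print(pairs)

# tuple2str is the inverse operation.
print([tuple2str(p) for p in pairs])
```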

Tag	Description	Example
CC	Coordinating conjunction
CD	Cardinal number
DT	Determiner
EX	Existential there
FW	Foreign word
IN	Preposition or subordinating conjunction
JJ	Adjective	large
JJR	Adjective, comparative	larger
JJS	Adjective, superlative	largest
LS	List item marker
MD	Modal	could, will
NN	Noun, singular or mass	cat, tree
NNS	Noun, plural	desks
NNP	Proper noun, singular	Sarah
NNPS	Proper noun, plural	Indians, Americans
PDT	Predeterminer	all, both, half
POS	Possessive ending	parent's
PRP	Personal pronoun	hers, herself, him, himself
PRP$	Possessive pronoun	her, his, mine, my, our
RB	Adverb	occasionally, swiftly
RBR	Adverb, comparative	greater
RBS	Adverb, superlative	biggest
RP	Particle	about
SYM	Symbol
TO	to	to (preposition or infinitive marker)
UH	Interjection	goodbye
VB	Verb, base form	ask
VBD	Verb, past tense	pleased
VBG	Verb, gerund or present participle	judging
VBN	Verb, past participle	reunified
VBP	Verb, non-3rd person singular present	wrap
VBZ	Verb, 3rd person singular present	bases
WDT	Wh-determiner	that, what
WP	Wh-pronoun	who
WP$	Possessive wh-pronoun
WRB	Wh-adverb	how

License: Attribution, NonCommercial, NoDerivatives
