[NLTK] Installing and using nltk, a natural language processing module

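nltk is a module widely used in natural language processing.


It tokenizes text, that is, it splits a sentence into its individual words,

and using it looks like this: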

[Reference: https://www.reddit.com/r/pythontips/comments/4mu9qq/word_count_using_text_mining_module_nltk_natural/ ]


>>> from nltk.corpus import stopwords
>>> from nltk.tokenize import RegexpTokenizer
>>> zen = """
... The Zen of Python, by Tim Peters
... Beautiful is better than ugly.
... Explicit is better than implicit.
... Simple is better than complex.
... Complex is better than complicated.
... Flat is better than nested.
... Sparse is better than dense.
... Readability counts.
... Special cases aren't special enough to break the rules.
... Although practicality beats purity.
... Errors should never pass silently.
... Unless explicitly silenced.
... In the face of ambiguity, refuse the temptation to guess.
... There should be one-- and preferably only one --obvious way to do it.
... Although that way may not be obvious at first unless you're Dutch.
... Now is better than never.
... Although never is often better than *right* now.
... If the implementation is hard to explain, it's a bad idea.
... If the implementation is easy to explain, it may be a good idea.
... Namespaces are one honking great idea -- let's do more of those!
... """
>>> # zen_no_punc is used further down; the r'\w+' pattern is an assumption
>>> # (keep runs of word characters, which drops the punctuation)
>>> tokenizer = RegexpTokenizer(r'\w+')
>>> zen_no_punc = tokenizer.tokenize(zen)


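The imports at the top (or the first use of stopwords) sometimes raise an error,

typically because the nltk data has not been downloaded yet. In that case, do the following.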


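Start python from cmd, import nltk, and call its download function (nltk.download()). The NLTK downloader window opens.

Picking packages one by one is tedious, so you can just download everything.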
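
If downloading everything is more than you want, the examples in this post only need the stopwords corpus (RegexpTokenizer itself needs no extra data). The following is a minimal sketch of fetching just that corpus, assuming nltk itself is already installed (for example with pip install nltk). The helper name ensure_stopwords is hypothetical, not part of nltk.

```python
def ensure_stopwords():
    """Download the stopwords corpus only if it is not already present."""
    # Imported inside the function so the helper can be defined
    # even on a machine where nltk is not set up yet.
    import nltk

    try:
        # nltk.data.find raises LookupError when the resource is missing
        nltk.data.find("corpora/stopwords")
    except LookupError:
        # Fetch only the stopwords corpus instead of every package
        nltk.download("stopwords")
```

Calling ensure_stopwords() once before the first stopwords.words() call should be enough; later calls skip the download because the corpus is then found locally.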


Now it can be used like this. Word extraction (stop words such as "a" and "the" are filtered out by the stopwords check):

>>> set(w.title() for w in zen_no_punc if w.lower() not in stopwords.words())
{'Although', 'Ambiguity', 'Bad', 'Beats', 'Beautiful', 'Better', 'Break', 'Cases', 'Complex', 'Complicated', 'Counts', 'Dense', 'Dutch', 'Easy', 'Enough', 'Errors', 'Explain', 'Explicit', 'Explicitly' ... etc}


You can also count how often each word occurs:


>>> from collections import Counter

>>> word_count_dict = Counter(w.title() for w in zen_no_punc if w.lower() not in stopwords.words())
>>> word_count_dict.most_common()
[('Better', 8), ('One', 3), ('Never', 3), ('Although', 3), ('Idea', 3), ('Way', 2), ('Implementation', 2), ('Explain', 2), ('Complex', 2), ('May', 2), ('Unless', 2), ('Obvious', 2), ('Special', 2), ('Preferably', 1), ('Guess', 1), ('Errors', 1), ('Dense', 1), ('Temptation', 1), ('Great', 1), ('Peters', 1), ('Tim', 1), ('Enough', 1) ... etc
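
The snippets above call stopwords.words() once per token, and with no argument it returns the stop words of every language in the corpus, so they get slow on longer texts. Below is a minimal sketch of the same count with the stop words built once as a set. The names count_words and nltk_word_counts are hypothetical helpers, and both the restriction to English and the r"\w+" pattern are assumptions.

```python
from collections import Counter


def count_words(tokens, stop_words):
    """Count title-cased tokens whose lowercase form is not a stop word."""
    stop_words = set(stop_words)  # build once; set lookups are fast
    return Counter(t.title() for t in tokens if t.lower() not in stop_words)


def nltk_word_counts(text):
    """Wire count_words to nltk (needs nltk and its stopwords corpus)."""
    from nltk.corpus import stopwords
    from nltk.tokenize import RegexpTokenizer

    # Assumed pattern: keep runs of word characters, dropping punctuation
    tokens = RegexpTokenizer(r"\w+").tokenize(text)
    # Assumption: English only, instead of every language in the corpus
    return count_words(tokens, stopwords.words("english"))
```

With nltk set up, nltk_word_counts(zen).most_common() would play the role of word_count_dict.most_common() above, although restricting the stop words to English may change which words survive the filter.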

