The nltk module is widely used in natural language processing.
It splits a sentence into individual words (tokenization).
Example usage:
>>> from nltk.corpus import stopwords
>>> from nltk.tokenize import RegexpTokenizer
>>> zen = """
... The Zen of Python, by Tim Peters
... Beautiful is better than ugly.
... Explicit is better than implicit.
... Simple is better than complex.
... Complex is better than complicated.
... Flat is better than nested.
... Sparse is better than dense.
... Readability counts.
... Special cases aren't special enough to break the rules.
... Although practicality beats purity.
... Errors should never pass silently.
... Unless explicitly silenced.
... In the face of ambiguity, refuse the temptation to guess.
... There should be one-- and preferably only one --obvious way to do it.
... Although that way may not be obvious at first unless you're Dutch.
... Now is better than never.
... Although never is often better than *right* now.
... If the implementation is hard to explain, it's a bad idea.
... If the implementation is easy to explain, it may be a good idea.
... Namespaces are one honking great idea -- let's do more of those!
... """
The imports at the top sometimes raise an error because the required corpus data is missing.
In that case, do the following:
From cmd, start python, import nltk, and call the download() function; a download manager window opens.
The easiest fix is to simply download everything.
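The same download can be triggered without the GUI window; a minimal sketch (the resource names here are the standard NLTK package identifiers, and downloading only what this post needs is a lighter alternative to downloading everything):

```python
import nltk

# Downloads only the stopword lists used later in this post.
# nltk.download() with no argument opens the interactive download window;
# nltk.download('all') fetches every package, as described above.
nltk.download('stopwords')
```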
Now it can be used like this to extract words (stopwords such as "a" and "the" are excluded automatically).
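The tokenization step that produces the `zen_no_punc` list used below is not shown in the captured output; a minimal sketch, assuming a `RegexpTokenizer` pattern that keeps only runs of word characters (which also strips punctuation):

```python
from nltk.tokenize import RegexpTokenizer

zen = "Beautiful is better than ugly. Explicit is better than implicit."

# r'\w+' matches runs of word characters, so punctuation is dropped
tokenizer = RegexpTokenizer(r'\w+')
zen_no_punc = tokenizer.tokenize(zen)
print(zen_no_punc[:4])  # → ['Beautiful', 'is', 'better', 'than']
```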
The words can also be counted:
>>> from collections import Counter
>>> word_count_dict = Counter(w.title() for w in zen_no_punc
...                           if w.lower() not in stopwords.words())
>>> word_count_dict.most_common()
[('Better', 8), ('One', 3), ('Never', 3), ('Although', 3), ('Idea', 3),
 ('Way', 2), ('Implementation', 2), ('Explain', 2), ('Complex', 2),
 ('May', 2), ('Unless', 2), ('Obvious', 2), ('Special', 2),
 ('Preferably', 1), ('Guess', 1), ('Errors', 1), ('Dense', 1),
 ('Temptation', 1), ('Great', 1), ('Peters', 1), ('Tim', 1),
 ('Enough', 1), ...]