Toy Project 1 (8) - Text preprocessing

Before tokenization, the cleaned csv file is split into x_data and y_target. In this project Term1 holds the False data and Term2 holds the True data, so the split was done with that in mind.
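The code for this split is not shown in the post, so the following is only a minimal sketch of one way it could look. It assumes the csv has two text columns named Term1 and Term2 (the names come from the text above; the column layout, the inline sample rows, and the helper name split_data_target are hypothetical) and labels Term1 rows 0 (False) and Term2 rows 1 (True).

```python
import csv
import io

# Hypothetical stand-in for the cleaned csv file; the real file is not shown in the post.
sample_csv = io.StringIO(
    "Term1,Term2\n"
    "false sentence A,true sentence A\n"
    "false sentence B,true sentence B\n"
)

def split_data_target(csv_file):
    """Return (data, target): Term1 texts labeled 0, Term2 texts labeled 1."""
    data, target = [], []
    for row in csv.DictReader(csv_file):
        if row["Term1"]:          # skip empty cells
            data.append(row["Term1"])
            target.append(0)      # False
        if row["Term2"]:
            data.append(row["Term2"])
            target.append(1)      # True
    return data, target

data, target = split_data_target(sample_csv)
print(list(zip(data, target)))
```

Keeping data and target as two parallel lists means they can be passed straight to train_test_split.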

After separating data and target, split them again into train data for training and test data for evaluation.

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(data, target, test_size=0.2, shuffle=True, stratify=target, random_state=7)
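As a quick check of what stratify=target does, the sketch below runs the same call on a made-up balanced toy dataset (the texts and labels are placeholders, not the project's data) and counts the labels on each side of the split.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Placeholder data: 10 "False" (0) and 10 "True" (1) samples.
data = [f"doc {i}" for i in range(20)]
target = [0] * 10 + [1] * 10

x_train, x_test, y_train, y_test = train_test_split(
    data, target, test_size=0.2, shuffle=True, stratify=target, random_state=7
)

# With stratify, both splits keep the 50/50 class ratio of the full dataset.
train_counts = Counter(y_train)
test_counts = Counter(y_test)
print(train_counts, test_counts)
```

Without stratify, a small or imbalanced dataset can end up with a test set whose class ratio differs noticeably from the training set.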

Next, import Mecab and tokenize the documents.
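The tokenize function in the listing at the end of this post joins each (morpheme, tag) pair returned by mecab.pos() into a single 'morpheme/TAG' string. Mecab needs a separate installation, so this sketch uses hand-written pairs in place of real mecab.pos() output; the pairs and the helper name join_pos_pairs are illustrative only, and actual Mecab output may differ.

```python
def join_pos_pairs(pairs):
    """Format (morpheme, tag) pairs as 'morpheme/TAG' strings."""
    return ['/'.join(pair) for pair in pairs]

# Hand-written example pairs, NOT actual Mecab output.
sample_pairs = [("영화", "NNG"), ("가", "JKS"), ("좋", "VA"), ("다", "EF")]

tokens = join_pos_pairs(sample_pairs)
print(tokens)  # ['영화/NNG', '가/JKS', '좋/VA', '다/EF']
```

Keeping the tag attached to the morpheme lets the same surface form with different parts of speech count as different tokens.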

Check which word tokens occur most frequently.
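The listing in this post gets the frequencies from text.vocab().most_common(...). NLTK's FreqDist is a subclass of collections.Counter, so the same idea can be sketched with the standard library alone. The tokenized documents here are made up for illustration.

```python
from collections import Counter

# Made-up tokenized documents in the 'morpheme/TAG' format.
docs = [
    ["영화/NNG", "좋/VA", "다/EF"],
    ["영화/NNG", "재미/NNG", "없/VA", "다/EF"],
    ["영화/NNG", "다/EF"],
]

# Flatten all documents into one token stream and count occurrences.
freq = Counter(token for doc in docs for token in doc)

# The two most frequent tokens with their counts.
top2 = freq.most_common(2)
print(top2)
```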

Select the high-frequency words. As the word embedding method, a simple count vector is used.
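A count vector has one slot per selected word and stores how many times that word appears in the document. The sketch below mirrors term_frequency from the listing in this post on a tiny hand-picked vocabulary; the words and the sample document are invented.

```python
# Tiny invented vocabulary standing in for the 10,000 selected words.
selected_words = ["영화/NNG", "좋/VA", "없/VA"]

def term_frequency(doc):
    """Return one count per vocabulary word, in vocabulary order."""
    return [doc.count(word) for word in selected_words]

doc = ["영화/NNG", "좋/VA", "영화/NNG", "다/EF"]
vector = term_frequency(doc)
print(vector)  # [2, 1, 0]
```

Tokens outside the vocabulary ('다/EF' here) are simply ignored, and every document maps to a vector of the same length.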

That completes the preprocessing. The code used is below.

import json
import os
from pprint import pprint

from konlpy.tag import Mecab

mecab = Mecab()

def tokenize(doc):
    # Join each (morpheme, POS tag) pair into a single 'morpheme/TAG' token
    return ['/'.join(t) for t in mecab.pos(doc)]

train_path = '/Users/jason/Test/data/train_docs.json'
test_path = '/Users/jason/Test/data/test_docs.json'

if os.path.isfile(train_path):
    # Reuse the cached tokenization results if they already exist
    with open(train_path, encoding="utf-8") as f:
        train_docs = json.load(f)
    with open(test_path, encoding="utf-8") as f:
        test_docs = json.load(f)
else:
    train_docs = [tokenize(row) for row in x_train]
    test_docs = [tokenize(row) for row in x_test]
    # Save as JSON files
    with open(train_path, 'w', encoding="utf-8") as make_file:
        json.dump(train_docs, make_file, ensure_ascii=False, indent="\t")
    with open(test_path, 'w', encoding="utf-8") as make_file:
        json.dump(test_docs, make_file, ensure_ascii=False, indent="\t")
        
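import nltk
import matplotlib.pyplot as plt
from matplotlib import rc
%matplotlib inline

# The original listing uses `text` without defining it. Assumed reconstruction:
# flatten the tokenized training documents and wrap them in nltk.Text.
tokens = [t for d in train_docs for t in d]
text = nltk.Text(tokens)

plt.rcParams['axes.unicode_minus'] = False  # render minus signs correctly with the Korean font
rc('font', family='AppleGothic')  # Korean-capable font on macOS

plt.figure(figsize=(20, 10))
text.plot(50)  # plot the 50 most frequent tokens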

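# Use the 10,000 most frequent tokens as the vocabulary
selected_words = [f[0] for f in text.vocab().most_common(10000)]

def term_frequency(doc):
    # Count vector: how often each vocabulary token occurs in the document
    return [doc.count(word) for word in selected_words]

# Note: this overwrites the raw-text x_train / x_test with count vectors
x_train = [term_frequency(d) for d in train_docs]
x_test = [term_frequency(d) for d in test_docs]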

import numpy as np

X_train = np.asarray(x_train).astype('float32')
X_test = np.asarray(x_test).astype('float32')

Y_train = np.asarray(y_train).astype('float32')
Y_test = np.asarray(y_test).astype('float32')
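
The term_frequency function above rescans the whole document once per vocabulary word, which can get slow with 10,000 selected words. One possible alternative, which is not what this post uses, is to count each document once with collections.Counter and then look up each word. The name term_frequency_fast and the sample data are hypothetical.

```python
from collections import Counter

def term_frequency_fast(doc, vocabulary):
    """Count the document once, then read off one count per vocabulary word."""
    counts = Counter(doc)
    # Counter returns 0 for words that never occur in the document.
    return [counts[word] for word in vocabulary]

# Invented vocabulary and document for illustration.
vocabulary = ["영화/NNG", "좋/VA", "없/VA"]
doc = ["영화/NNG", "좋/VA", "영화/NNG", "다/EF"]

fast = term_frequency_fast(doc, vocabulary)
slow = [doc.count(word) for word in vocabulary]  # the approach used above
print(fast == slow)  # True
```

Both versions produce the same vector; the difference is only in how many passes over the document are needed.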
