자연어 전처리 인코딩, 패딩

자연어 VOCABULARY 만들기

자연어 처리에서는 텍스트를 숫자로 바꾸는 다양한 기법들이 있습니다.
그러한 기법들을 사용하기 위해 첫 단계로 각 단어를 고유한 정수에 맵핑(mapping)시키는 전처리 작업입니다.
Vocab을 설명하기 위해서는 특정 단어가 많이 있어야(제거 해야할 필요성이 있음) 보여 줄 수 있기에 여러 문장, 단어를 사용

import nltk
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

text = 'This is a good place. I want to go climbing right now. I dont know where this place is'

# 자연어 처리를 위한 패키지 다운받기(한국어는 다른 패키지 사용해야 합니다)
nltk.download('punkt')
nltk.download('stopwords')

# 문장 나누기(텍스트 데이터 -> 문장 단위로 토큰화)
sentences = sent_tokenize(text)
print(sentences)

# 단어 토큰화(문장 단위 -> 토큰(단어)단위 토큰화)
word_tokenize(sentences[0])

# 출력 결과
# ['This is a good place.', 'I want to go climbing right now.', 'I dont know where this place is']
#
# ['This', 'is', 'a', 'good', 'place', '.']



# 단어 사전 만들기
vocab = {}
preprocessed_sentences = []

stop_words = set(stopwords.words('english'))

for sentence in sentences:
    tokenized_sentence = word_tokenize(sentence)
    result = []

    for word in tokenized_sentence: 
        word = word.lower() 
        if word not in stop_words : # 단어 토큰화 된 결과에 대해서 불용어를 제거한다.(보통 따로 추가로 단어장 만들어서 추가로 제거)
            if len(word) > 2: 
                result.append(word)
                if word not in vocab:
                    vocab[word] = 0 
                vocab[word] += 1
    preprocessed_sentences.append(result) 
print(preprocessed_sentences)

# [['good', 'place'], ['want', 'climbing', 'right'], ['dont', 'know', 'place']]

print('VOCAB :',vocab)
# VOCAB : {'good': 1, 'place': 2, 'want': 1, 'climbing': 1, 'right': 1, 'dont': 1, 'know': 1}


# 이제 빈도수 별로 인코딩을 하는데 Counter 함수를 많이 이용합니다.
from collections import Counter
all_words_list = sum(preprocessed_sentences, []) # 2차원리스트 -> 1차원리스트
vocab = Counter(all_words_list)

print(vocab)
# Counter({'place': 2, 'good': 1, 'want': 1, 'climbing': 1, 'right': 1, 'dont': 1, 'know': 1})

패딩처리 하기(Padding)

컴퓨터는 길이가 전부 동일한 문서들에 대해 하나의 행렬로 보고, 한꺼번에 묶어서 처리할 수 있습니다.
다시 말해 병렬 연산을 위해서 문장의 길이를 통일 시켜주는 작업이 필요합니다.
파이토치의 from torch.nn.utils.rnn import pad_sequence 또는
케라스의 from tensorflow.keras.preprocessing.sequence import pad_sequences를 이용하여 패딩하면 됩니다.

print(preprocessed_sentences)
# [['good', 'place'], ['want', 'climbing', 'right'], ['dont', 'know', 'place']]

from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer

# 토크나이저
tokenizer = Tokenizer()
tokenizer.fit_on_texts(preprocessed_sentences)
encoded = tokenizer.texts_to_sequences(preprocessed_sentences)
print(encoded)
[[2, 1], [3, 4, 5], [6, 7, 1]]

padded = pad_sequences(encoded)
print(padded) #기본 옵션으로 패딩
# array([[2, 1, 0],
#        [3, 4, 5],
#        [6, 7, 1]], dtype=int32)


padded = pad_sequences(encoded, padding='post') #padding의 기본 옵션이 'post'이고 뒤에 0을 붙인다는 의미입니다. 
padded
# array([[2, 1, 0, 0, 0],
#        [3, 4, 5, 0, 0],
#        [6, 7, 1, 0, 0]], dtype=int32)

한국어 처리는 nltk가 아닌 보통 다른 패키지를 konly같은 한국어용 버전을 따로 다운받아서 사용함

import konlpy
from konlpy.tag import Mecab

###################
mecab = Mecab()
sentence = '안녕하세요 테스트용 텍스트 입니다.'
temp_X = mecab.morphs(sentence)
temp_X

# 출력 결과
['안녕', '하', '세요', '테스트', '용', '텍스트', '입니다', '.']