Padding

When sentences or documents differ in length, they have to be brought to a common length so that they can be processed in parallel as a single batch.

Padding fixes one common size for all vectors and fills the extra positions created in the shorter vectors with 0.
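As a rough illustration of the idea, here is a minimal sketch in plain Python that does not depend on any library. The helper name `pad_with_zeros` is hypothetical and exists only for this example; it appends zeros at the end, which is one of two possible placements.

```python
def pad_with_zeros(sequences):
    """Pad every sequence with trailing zeros up to the longest length."""
    max_len = max(len(seq) for seq in sequences)
    # Each shorter sequence gets (max_len - its length) zeros appended.
    return [seq + [0] * (max_len - len(seq)) for seq in sequences]


sequences = [[1, 5], [1, 8, 5], [2, 4, 3, 2]]
padded = pad_with_zeros(sequences)
print(padded)
# [[1, 5, 0, 0], [1, 8, 5, 0], [2, 4, 3, 2]]
```

Using 0 as the filler is safe here because the integer encoding used later in this post starts word indices at 1, so 0 never collides with a real word.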

# Keras

In Keras, the pad_sequences() function makes it easy to pad sequences.

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

sentences = [['barber', 'person'], ['barber', 'good', 'person'],
             ['barber', 'huge', 'person'], ['knew', 'secret'],
             ['secret', 'kept', 'huge', 'secret'], ['huge', 'secret'],
             ['barber', 'kept', 'word'], ['barber', 'kept', 'word'],
             ['barber', 'kept', 'secret'],
             [
                 'keeping', 'keeping', 'huge', 'secret', 'driving', 'barber',
                 'crazy'
             ], ['barber', 'went', 'huge', 'mountain']]

# Integer encoding
tokenizer = Tokenizer()
tokenizer.fit_on_texts(sentences)  # build the vocabulary, ordered by word frequency

encoded = tokenizer.texts_to_sequences(sentences)  # map every word in the sequences to its integer index
# [[1, 5], [1, 8, 5], [1, 3, 5], [9, 2], [2, 4, 3, 2], [3, 2], [1, 4, 6], 
#  [1, 4, 6], [1, 4, 2], [7, 7, 3, 2, 10, 1, 11], [1, 12, 3, 13]]

padded = pad_sequences(encoded)  # pad; zeros go at the front by default
padded

array([[ 0,  0,  0,  0,  0,  1,  5],
       [ 0,  0,  0,  0,  1,  8,  5],
       [ 0,  0,  0,  0,  1,  3,  5],
       [ 0,  0,  0,  0,  0,  9,  2],
       [ 0,  0,  0,  2,  4,  3,  2],
       [ 0,  0,  0,  0,  0,  3,  2],
       [ 0,  0,  0,  0,  1,  4,  6],
       [ 0,  0,  0,  0,  1,  4,  6],
       [ 0,  0,  0,  0,  1,  4,  2],
       [ 7,  7,  3,  2, 10,  1, 11],
       [ 0,  0,  0,  1, 12,  3, 13]])

By default the zeros are added at the front. To add them at the end instead, pass padding='post' as an argument.

padded = pad_sequences(encoded, padding = 'post')
padded

array([[ 1,  5,  0,  0,  0,  0,  0],
       [ 1,  8,  5,  0,  0,  0,  0],
       [ 1,  3,  5,  0,  0,  0,  0],
       [ 9,  2,  0,  0,  0,  0,  0],
       [ 2,  4,  3,  2,  0,  0,  0],
       [ 3,  2,  0,  0,  0,  0,  0],
       [ 1,  4,  6,  0,  0,  0,  0],
       [ 1,  4,  6,  0,  0,  0,  0],
       [ 1,  4,  2,  0,  0,  0,  0],
       [ 7,  7,  3,  2, 10,  1, 11],
       [ 1, 12,  3, 13,  0,  0,  0]])

The maxlen argument sets the length that every sequence is padded or cut to.

padded = pad_sequences(encoded, padding = 'post', maxlen = 5)
padded

array([[ 1,  5,  0,  0,  0],
       [ 1,  8,  5,  0,  0],
       [ 1,  3,  5,  0,  0],
       [ 9,  2,  0,  0,  0],
       [ 2,  4,  3,  2,  0],
       [ 3,  2,  0,  0,  0],
       [ 1,  4,  6,  0,  0],
       [ 1,  4,  6,  0,  0],
       [ 1,  4,  2,  0,  0],
       [ 3,  2, 10,  1, 11],
       [ 1, 12,  3, 13,  0]])

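Sentences longer than maxlen lose part of their content. The 7-token sentence [7, 7, 3, 2, 10, 1, 11] was cut to [3, 2, 10, 1, 11], so with the default settings the tokens are dropped from the front.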
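The truncation seen in the output above can be sketched in plain Python. `truncate` is a hypothetical helper written for this post, not a Keras function. pad_sequences exposes a comparable `truncating` argument, which is 'pre' by default and can be set to 'post' to drop tokens from the end instead.

```python
def truncate(seq, maxlen, truncating="pre"):
    """Keep at most maxlen tokens (assumes maxlen >= 1).

    'pre' drops tokens from the front, 'post' drops them from the back.
    """
    if truncating == "pre":
        return seq[-maxlen:]
    return seq[:maxlen]


long_sentence = [7, 7, 3, 2, 10, 1, 11]
kept_pre = truncate(long_sentence, 5)
kept_post = truncate(long_sentence, 5, truncating="post")
print(kept_pre)   # [3, 2, 10, 1, 11]
print(kept_post)  # [7, 7, 3, 2, 10]
```

Which end to cut depends on the data: if the important words tend to appear at the start of a sentence, dropping from the back may lose less information.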

Before training, analyze the sentence data to find out the maximum number of word tokens a sentence contains.

If maxlen is set too large, most of each row is empty padding and resources are wasted. If it is set too small, part of the input data is cut off and lost.
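One way to make that choice less of a guess is to look at the length distribution first. The sketch below uses the encoded sentences from this post; `length_stats` and `coverage` are hypothetical helper names, and what counts as an acceptable coverage is an assumption that depends on the task.

```python
def length_stats(sequences):
    """Return the maximum and mean sequence length."""
    lengths = [len(seq) for seq in sequences]
    return {"max": max(lengths), "mean": sum(lengths) / len(lengths)}


def coverage(sequences, maxlen):
    """Fraction of sequences that fit within maxlen without truncation."""
    fitting = sum(len(seq) <= maxlen for seq in sequences)
    return fitting / len(sequences)


encoded = [[1, 5], [1, 8, 5], [1, 3, 5], [9, 2], [2, 4, 3, 2], [3, 2],
           [1, 4, 6], [1, 4, 6], [1, 4, 2], [7, 7, 3, 2, 10, 1, 11],
           [1, 12, 3, 13]]

stats = length_stats(encoded)
print(stats["max"])               # 7
print(round(stats["mean"], 2))    # 3.27
print(round(coverage(encoded, 5), 2))  # 0.91
```

Here maxlen = 5 keeps 10 of the 11 sentences intact while making each row two positions shorter than the maximum length of 7.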

Reference: wikidocs.net/83544
