Classification - [8.1] Word2vec Negative Sampling

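- Word2Vec applies a softmax to the output layer's values to turn them into probabilities. It then compares them with a one-hot target such as [1 0 0 0] and backpropagates the error to update the weight matrices.
However, if the vocab has 10,000 words, computing the denominator of the softmax is expensive: the model must take a dot product with every word in the vocab and then exponentiate each result.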
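To make the cost concrete, here is a minimal NumPy sketch of a full softmax over the vocab. The sizes (`vocab_size = 10_000`, `embed_size = 100`) and the names `W_out`, `h`, and `full_softmax` are assumptions for illustration, not values from the post.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_size = 10_000, 100  # assumed sizes, for illustration only

# One output vector per vocab word, plus the hidden vector of one center word
W_out = rng.normal(scale=0.1, size=(vocab_size, embed_size))
h = rng.normal(scale=0.1, size=embed_size)

def full_softmax(h, W_out):
    scores = W_out @ h           # vocab_size dot products
    scores = scores - scores.max()  # shift for numerical stability
    exp_scores = np.exp(scores)  # vocab_size exponentials
    return exp_scores / exp_scores.sum()

probs = full_softmax(h, W_out)
print(probs.shape, probs.sum())
```

Every training example pays for all `vocab_size` dot products and exponentials, even though only one entry of the target is 1.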

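- If we cut the unnecessary computation and shorten training time, we can train on more data. Then we could train on paragraphs rather than single sentences, and the range of context the model learns for each word would grow.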

- To address this problem, let's look at Negative Sampling, proposed in the paper "Distributed Representations of Words and Phrases and their Compositionality".

Negative Sampling

Let's walk through how training with negative sampling works.

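(1) The image below shows the original word2vec training setup. Inputs and outputs are built from center words and their surrounding (context) words. Because this is the Skip-gram variant, the dataset is shaped so that the center word is used to predict the context words. Negative sampling instead assigns labels, as on the right side of the image below: each (center word, context word) pair gets the positive label 1.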

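After labeling the positive pairs this way, we need to add negative data. The negative data are words that are unrelated to the sentence above, paired with the center word and labeled 0. The right side of the image below shows the final form of the negative sampling dataset.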
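The labeling scheme above can be sketched roughly as follows. This is a toy illustration, not the paper's implementation: the function name `build_sgns_pairs`, the example sentence, the window size, and the choice to draw negatives uniformly from words outside the current window are all assumptions.

```python
import random

def build_sgns_pairs(tokens, window=2, num_negatives=2, seed=0):
    """Return (center, other_word, label) triples: 1 = positive, 0 = negative."""
    rng = random.Random(seed)
    vocab = sorted(set(tokens))
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context_words = [tokens[j] for j in range(lo, hi) if j != i]
        # Words that are neither the center nor inside its window
        candidates = [w for w in vocab if w != center and w not in context_words]
        for ctx in context_words:
            pairs.append((center, ctx, 1))  # positive pair
            k = min(num_negatives, len(candidates))
            for neg in rng.sample(candidates, k):
                pairs.append((center, neg, 0))  # negative pair
    return pairs

tokens = "the quick brown fox jumps over the lazy dog".split()
pairs = build_sgns_pairs(tokens)
for triple in pairs[:6]:
    print(triple)
```

Each positive pair is followed by `num_negatives` negative pairs for the same center word, so the model sees both kinds of example side by side.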

How many negative samples should we add per positive pair? The paper suggests 5-20 for small datasets and 2-5 for large datasets.

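And how do we choose the negative samples? Picking them uniformly at random works, but the paper proposes sampling according to word frequency. A word is drawn as a negative sample with probability P(w_i) = f(w_i)^(3/4) / Σ_j f(w_j)^(3/4): each word's frequency is raised to the 3/4 power and divided by the sum of those values over all words. So the more frequent a word is, the more likely it is to be drawn as a negative sample, although the 3/4 power damps that advantage compared with using raw frequencies.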
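As a small sketch of this frequency-based sampling, the snippet below builds the 3/4-power distribution from word counts and draws negatives from it. The toy corpus and the variable names (`unigram`, `neg_probs`) are assumptions for illustration.

```python
import numpy as np
from collections import Counter

tokens = "the cat sat on the mat and the dog sat on the rug".split()
counts = Counter(tokens)
words = sorted(counts)
freq = np.array([counts[w] for w in words], dtype=float)

unigram = freq / freq.sum()           # raw relative frequency
weights = freq ** 0.75                # frequency raised to the 3/4 power
neg_probs = weights / weights.sum()   # negative-sampling distribution

for w, p_raw, p_neg in zip(words, unigram, neg_probs):
    print(f"{w:>4}  raw={p_raw:.3f}  neg={p_neg:.3f}")

# Draw 5 negative samples according to the 3/4-power distribution
rng = np.random.default_rng(0)
negatives = rng.choice(words, size=5, p=neg_probs)
print(negatives)
```

In this toy corpus "the" remains the most likely negative, but its probability is lower than its raw relative frequency, while rare words gain a little.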

(2) Once the data is ready, we build the model.

Source: https://mangastorytelling.tistory.com/entry/Skip-Gram-with-Negative-Sampling-SGNS

This differs from the original word2vec model. The first notable point is that input 1 and input 2 each get their own, separate embedding layer.

A sparse input of the form [ 1 0 0 0 ] passes through its embedding layer and becomes a dense vector such as [ 0.2 0.1 0.32 ].

The dot product of the two word embeddings gives the prediction, and the loss is computed against the label. After that, the embedding vectors are updated only for the words involved.
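The forward pass and update described above can be sketched in plain NumPy. This is a simplified illustration that assumes plain SGD with a fixed learning rate and a binary cross-entropy loss; the names `W_word`, `W_context`, and `sgns_step` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_size, lr = 10, 3, 0.1  # assumed toy sizes and learning rate

W_word = rng.normal(scale=0.1, size=(vocab_size, embed_size))     # embedding for input 1
W_context = rng.normal(scale=0.1, size=(vocab_size, embed_size))  # embedding for input 2

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(word, context, label):
    """One SGD step on a single (word, context, label) example."""
    v, u = W_word[word].copy(), W_context[context].copy()
    pred = sigmoid(v @ u)  # prediction = sigmoid of the dot product
    loss = -(label * np.log(pred) + (1 - label) * np.log(1 - pred))
    grad = pred - label    # gradient of the loss w.r.t. the dot product
    W_word[word] -= lr * grad * u        # only this row of W_word changes
    W_context[context] -= lr * grad * v  # only this row of W_context changes
    return loss

before_word = W_word.copy()
before_context = W_context.copy()
sgns_step(word=1, context=2, label=1)  # positive pair
sgns_step(word=1, context=7, label=0)  # negative pair

changed_word = np.flatnonzero(np.any(W_word != before_word, axis=1))
changed_context = np.flatnonzero(np.any(W_context != before_context, axis=1))
print(changed_word, changed_context)
```

Only the rows indexed by the words in the two examples are touched; every other embedding vector is left unchanged.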

Questions

Q1: When training with Negative Sampling, are the word weight matrix values updated only for the N positive words and the 5~10 negative words? Is that why it needs less computation than the original Word2vec?

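A1: First, it is true that only the weight matrix rows for the positive and negative words are updated. However, in the original Word2vec, too, only the input word's row of the input-side weight matrix is updated, so this is hard to call a property unique to negative sampling.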

Let's build a simple Word2vec model and test this.

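import numpy as np
from keras.layers import Dense, Input
from keras.models import Model

# Toy word2vec: 10-word vocab, 3-dimensional hidden (embedding) layer
input_layer = Input(shape=(10,))
hidden_layer = Dense(3, use_bias=False)(input_layer)
output_layer = Dense(10, activation='softmax', use_bias=False)(hidden_layer)

model = Model(inputs=input_layer, outputs=output_layer)
model.summary()
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# One-hot inputs (both are word 1) and one-hot targets (words 7 and 8).
# NumPy arrays are used so Keras does not treat the nested lists as multiple inputs.
train_x = np.array([[0, 1, 0, 0, 0, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]], dtype='float32')
train_y = np.array([[0, 0, 0, 0, 0, 0, 0, 1, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]], dtype='float32')

# Input-to-hidden weights before training
embeddings = model.layers[1].get_weights()
print(embeddings)

model.fit(train_x, train_y, epochs=5)

# The same weights after training
embeddings = model.layers[1].get_weights()
print(embeddings)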

This builds a simple word2vec model. After training it on the one-hot train_x and train_y, we inspect the weight matrix between the input and hidden layers.

In the result, only the values in the second row, which corresponds to the input word, have changed.

Now let's run a similar small experiment with negative sampling.

import numpy as np
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Embedding, Reshape, Activation, Input, Dot

vocab_size = 10
embed_size = 3

# Center-word input with its own embedding table
w_inputs = Input(shape=(1,), dtype='int32')
word_embedding = Embedding(vocab_size, embed_size)(w_inputs)

# Context-word input with a separate embedding table
c_inputs = Input(shape=(1,), dtype='int32')
context_embedding = Embedding(vocab_size, embed_size)(c_inputs)

# Dot product of the two embeddings, squashed to a probability by a sigmoid
dot_product = Dot(axes=2)([word_embedding, context_embedding])
dot_product = Reshape((1,))(dot_product)
output = Activation('sigmoid')(dot_product)

model = Model(inputs=[w_inputs, c_inputs], outputs=output)
model.summary()
model.compile(loss='binary_crossentropy', optimizer='adam')

# Two positive (word, context) pairs: (1, 2) and (2, 3)
word = np.array([[1], [2]], dtype='int32')
context = np.array([[2], [3]], dtype='int32')
train_y = np.array([[1], [1]], dtype='float32')

# Before training: layers[2] is the word embedding, layers[3] the context embedding
# (check model.summary() if the layer order differs in your Keras version)
print(model.layers[2].get_weights())
print(model.layers[3].get_weights())

model.fit([word, context], train_y, epochs=2)

# After training
print(model.layers[2].get_weights())
print(model.layers[3].get_weights())

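The third layer holds the first embedding matrix (the word embedding). In its values, only the rows for the words passed in as word, words 1 and 2, are updated. In the fourth layer (the context embedding), only the rows for the context words, words 2 and 3, are updated.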

Q2: Then what is the benefit of negative sampling?

A2: As mentioned above, the original word2vec model has to compute a softmax before it can backpropagate. When the vocab is very large, applying the softmax can take a large amount of computation.

Negative sampling, however, recasts the task as binary classification, and computing the output takes only the sampled pairs rather than every word in the vocab, so it can be more efficient.
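For reference, the two objectives can be written side by side, using the notation of the cited paper: v is the input vector and v' the output vector of a word, w_I the input (center) word, w_O the output (context) word, W the vocab size, k the number of negative samples, and P_n(w) the noise distribution the negatives are drawn from.

```latex
% Full softmax: the denominator sums over all W words in the vocab
p(w_O \mid w_I) =
  \frac{\exp\left({v'_{w_O}}^{\top} v_{w_I}\right)}
       {\sum_{w=1}^{W} \exp\left({v'_{w}}^{\top} v_{w_I}\right)}

% Negative sampling: one positive term plus k sampled negative terms
\log \sigma\left({v'_{w_O}}^{\top} v_{w_I}\right)
  + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}
    \left[ \log \sigma\left(-{v'_{w_i}}^{\top} v_{w_I}\right) \right]
```

The softmax needs W dot products per training example, while the negative sampling objective needs only 1 + k.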
