Rubini

Data Scientist @Tekclan

In this blog we are going to learn what word2vec is and its types, why word2vec differs from conventional methods such as count vectorization and one-hot encoding, and the architecture of skip-gram.

Overview of word2vec

In general, people think word2vec simply converts a word into a vector (i.e., into numbers). But it does not just convert a word into a numerical format that carries no meaning for that word; it gives the word a meaning in vector space.

A famous example that shows what word2vec actually does:
“King – Man + Woman = Queen”
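As a rough sketch, this analogy can be checked with the gensim library, assuming a pretrained word2vec file is available locally (the filename below is only a placeholder, not part of this post):

```python
# Sketch only: assumes gensim is installed and a pretrained binary file
# exists at the placeholder path below (e.g. Google's news vectors).
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)  # placeholder path

# king - man + woman ~= queen
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# Expected to print something like [('queen', 0.71...)]
```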

Why word2vec

We need to convert words into numerical values so that a computer can understand them. To do this conversion we turn to NLP, which gives us many techniques, and with those techniques we can convert text according to our own needs.

Word2vec

Word2vec was created by Tomas Mikolov and his team at Google. It is an algorithm that takes a corpus file as input and outputs a vector representation of each word in the corpus.

Types:

  • Skip-gram

  • CBOW (Continuous Bag of Words)

Skip-gram and CBOW look similar; the only difference is that skip-gram uses the target word to predict the context words (i.e., the neighbouring words), whereas CBOW predicts the target word from the context words.
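As a hedged illustration, the snippet below trains both variants on a toy corpus using the gensim library (assuming gensim 4.x is installed; the toy sentence and parameters are only for demonstration). The sg parameter switches between skip-gram (sg=1) and CBOW (sg=0).

```python
# Minimal sketch using gensim 4.x (an assumption, not part of the original post).
from gensim.models import Word2Vec

# Toy corpus: a list of tokenised sentences.
sentences = [
    ["natural", "language", "processing", "and", "machine",
     "learning", "is", "fun", "and", "exciting"],
]

# sg=1 trains the skip-gram variant, sg=0 trains CBOW.
skipgram_model = Word2Vec(sentences, vector_size=100, window=2, sg=1, min_count=1)
cbow_model = Word2Vec(sentences, vector_size=100, window=2, sg=0, min_count=1)

# Each word now has a dense vector of length 100.
print(skipgram_model.wv["natural"].shape)  # (100,)
```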

Skip-gram architecture

Skip-gram is a simple neural network with one hidden layer. It takes input from the corpus file, feeds it into the hidden layer, and outputs a vector for the context words. To process the corpus file in a neural network we need to convert the words into zeros and ones. For this we use a technique called “one-hot encoding”, which simply represents each word with zeros and ones. For example:

Let's take the document “natural language processing and machine learning is fun and exciting”.

The corpus now contains the list of unique words. Let's one-hot encode the word “natural”: every word other than “natural” gets a zero (a short code sketch follows Table 1).

Word          One-hot value (target = “natural”)
natural       1
language      0
processing    0
and           0
machine       0
learning      0
is            0
fun           0
exciting      0

Table 1: One-hot encoding
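A minimal sketch of this encoding in plain Python/NumPy (the vocabulary order below is assumed for illustration):

```python
# Minimal one-hot encoding sketch (vocabulary order is assumed for illustration).
import numpy as np

vocab = ["natural", "language", "processing", "and", "machine",
         "learning", "is", "fun", "exciting"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a vector of zeros with a single one at the word's index."""
    vector = np.zeros(len(vocab))
    vector[word_to_index[word]] = 1.0
    return vector

print(one_hot("natural"))  # [1. 0. 0. 0. 0. 0. 0. 0. 0.]
```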

As in the table above, we repeat this for every word in the corpus: the target word is set to one and all other words are set to zero. Once the one-hot encoding is done, we pass the input to the hidden layer, which has no activation function. The hidden layer's output is then passed to the output layer, where a softmax function produces a probability for each word.
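To make the architecture concrete, here is a hedged NumPy sketch of a single forward pass; the random weight matrices and the embedding size of 10 are illustrative assumptions, not trained values:

```python
# Illustrative forward pass of the skip-gram network (weights are random, not trained).
import numpy as np

vocab_size = 9      # number of unique words in the toy corpus
embedding_dim = 10  # assumed hidden-layer size, for illustration only

rng = np.random.default_rng(0)
W_in = rng.normal(size=(vocab_size, embedding_dim))   # input -> hidden weights
W_out = rng.normal(size=(embedding_dim, vocab_size))  # hidden -> output weights

x = np.zeros(vocab_size)
x[0] = 1.0                      # one-hot vector for the target word "natural"

hidden = x @ W_in               # no activation function in the hidden layer
scores = hidden @ W_out         # raw scores for every word in the vocabulary

# softmax turns the scores into a probability of each word being a context word
probs = np.exp(scores) / np.exp(scores).sum()
print(probs.round(3))
```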

In skip-gram, to decide which words are the target and the context we need to know the window size. The window size is a parameter that tells how many neighbouring words on each side are used to train the model. For example:

If window size = 2

Sentence = “natural language processing and machine learning is fun and exciting”

Target        Context
natural       language, processing
language      natural, processing, and
processing    natural, language, and, machine
and           language, processing, machine, learning
machine       processing, and, learning, is
learning      and, machine, is, fun
is            machine, learning, fun, exciting
fun           learning, is, exciting
exciting      is, fun

Table 2: Target and context words for window size = 2
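A hedged sketch of how such (target, context) pairs can be generated in plain Python; the word list below follows the unique-word corpus used in the table above:

```python
# Generate (target, context) pairs for a given window size.
words = ["natural", "language", "processing", "and", "machine",
         "learning", "is", "fun", "exciting"]
window_size = 2

pairs = []
for i, target in enumerate(words):
    # Neighbours up to `window_size` positions to the left and right.
    start = max(0, i - window_size)
    end = min(len(words), i + window_size + 1)
    for j in range(start, end):
        if j != i:
            pairs.append((target, words[j]))

for target, context in pairs:
    print(target, "->", context)
```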

Now we have a good understanding of skip-gram. Let's look at some of its demerits.

  • For a large corpus, the number of words increases, so the number of neurons also increases. Computing the softmax over the full vocabulary in the output layer therefore becomes expensive.

  • The time required to train the model is high.

Why name skip-gram?

We all know about n-grams. An n-gram processes nearby words in sequence. For example:

Sentence: “I like flower”

Unigram: “I”,”like”,”flower”
Bigram: “I like”,”like flower”
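A small illustrative sketch of generating these n-grams in Python:

```python
# Generate unigrams and bigrams from a sentence.
sentence = "I like flower"
tokens = sentence.split()

unigrams = tokens
bigrams = [" ".join(tokens[i:i + 2]) for i in range(len(tokens) - 1)]

print(unigrams)  # ['I', 'like', 'flower']
print(bigrams)   # ['I like', 'like flower']
```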

Skip-gram does the same, but it skips over the target word and processes the context words as n-grams. That is why we call it skip-gram.