Target:
- Basics first, then key methods used in NLP: recurrent networks, attention, Transformers, etc.
- A big-picture understanding of human languages and the difficulties in understanding and producing them.
- An understanding of, and the ability to build, systems (in PyTorch) for some of the major problems in NLP: word meaning, dependency parsing, machine translation, question answering.
Assignments and Project
Main question: how do we represent a word as a vector?
Word2vec (Mikolov et al. 2013) is a framework for learning word vectors.
Definition
- Objective function (to be minimized): $$ J(\theta)=-\frac{1}{T} \log L(\theta)=-\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \leq j \leq m \\ j \neq 0}} \log P\left(w_{t+j} \mid w_{t} ; \theta\right) $$
- How do we calculate $P(w_{t+j} \mid w_{t};\theta)$?
- Use two vectors per word $w$:
  - $v_w$ when $w$ is a ==center word==
  - $u_w$ when $w$ is a ==context word==
- Then for a center word $c$ and a context word $o$, the prediction function is: $$ P(o \mid c)=\frac{\exp \left(u_{o}^{T} v_{c}\right)}{\sum_{w \in V} \exp \left(u_{w}^{T} v_{c}\right)} $$
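As a minimal sketch (mine, not the course's code), the prediction function above is just a softmax over the dot products $u_w^T v_c$ between the center vector and every context ("outside") vector; the function and variable names below are placeholders:

```python
import numpy as np

def p_context_given_center(o, c, U, V):
    """P(o | c): U, V are |V| x d arrays of context and center vectors; o, c are word indices."""
    scores = U @ V[c]                              # u_w^T v_c for every word w in the vocab
    scores = scores - scores.max()                 # subtract max for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()  # softmax
    return probs[o]

# Tiny usage example with random vectors (shapes are only for illustration):
U = np.random.randn(1000, 50)
V = np.random.randn(1000, 50)
print(p_context_given_center(o=3, c=7, U=U, V=V))
```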
Gradient $$ \begin{aligned} \mathcal{U}_{\text{new}} &\leftarrow \mathcal{U}_{\text{old}}-\alpha \nabla_{\mathcal{U}} J \\ \mathcal{V}_{\text{new}} &\leftarrow \mathcal{V}_{\text{old}}-\alpha \nabla_{\mathcal{V}} J \end{aligned} $$
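To make the update concrete, here is a rough NumPy sketch (my own, assuming the full-softmax loss $J = -\log P(o \mid c)$ on a single (center, context) pair); the gradient expressions follow directly from differentiating the softmax:

```python
import numpy as np

def sgd_step(o, c, U, V, lr=0.025):
    """One SGD step on a single (center, context) = (c, o) pair.
    U, V: |V| x d arrays of context and center vectors (updated in place)."""
    scores = U @ V[c]
    scores -= scores.max()
    probs = np.exp(scores) / np.exp(scores).sum()   # P(w | c) for all w

    # dJ/dv_c = sum_w P(w|c) u_w  -  u_o
    grad_vc = U.T @ probs - U[o]
    # dJ/du_w = (P(w|c) - 1[w == o]) * v_c   for every w
    delta = probs.copy()
    delta[o] -= 1.0
    grad_U = np.outer(delta, V[c])

    U -= lr * grad_U          # U_new <- U_old - alpha * grad_U
    V[c] -= lr * grad_vc      # only the center word's row of V changes
```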
- But this package is not actually used much in deep learning.
Gensim word vector visualization notebook
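For reference, a minimal example of the kind of exploration the Gensim notebook does; the vector file name below is a placeholder, not a file these notes provide:

```python
from gensim.models import KeyedVectors

# Load pre-trained vectors in word2vec text format (file name is hypothetical).
wv = KeyedVectors.load_word2vec_format("vectors.word2vec.txt")

print(wv.most_similar("banana"))                 # nearest neighbours by cosine similarity
print(wv.most_similar(positive=["king", "woman"],
                      negative=["man"]))         # analogy: king - man + woman
```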
Review
- $J(\theta)$ is a function of all windows in the corpus (often billions!)
- So $\nabla_{\theta}J(\theta)$ is very expensive to compute
- Instead, use stochastic gradient descent (SGD): iteratively take gradients at each such window
- But in each window, we only have at most $2m + 1$ words, so $\nabla_{\theta} J_{t}(\theta)$ is very sparse!
- We might only update the word vectors that actually appear!
- Solution: either you need sparse-matrix update operations to only update certain rows of the full embedding matrices $U$ and $V$, or you need to keep around a hash for word vectors (see the sketch after this list).
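One possible way to realize "only update the rows that actually appear" (an assumption on my part, not something this section prescribes) is PyTorch's sparse embedding gradients. The sketch uses negative sampling rather than the full softmax, since with the full softmax the gradient for $U$ touches every row:

```python
import torch
import torch.nn as nn

vocab_size, dim = 50_000, 100
V = nn.Embedding(vocab_size, dim, sparse=True)   # center-word vectors
U = nn.Embedding(vocab_size, dim, sparse=True)   # context-word vectors
opt = torch.optim.SparseAdam(list(V.parameters()) + list(U.parameters()), lr=1e-3)

center = torch.tensor([17])                      # hypothetical word indices
context = torch.tensor([42])
negatives = torch.randint(0, vocab_size, (5,))   # 5 sampled negative words

v_c = V(center)                                          # (1, dim)
pos = torch.sigmoid((U(context) * v_c).sum(-1))          # score of the true context word
neg = torch.sigmoid(-(U(negatives) * v_c).sum(-1))       # scores of the negatives
loss = -(torch.log(pos).sum() + torch.log(neg).sum())
loss.backward()      # sparse gradients: only the indexed rows are touched
opt.step()
```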
Word2vec algorithm family (Skip-grams)
If you have millions of word vectors and do distributed computing, it is important to not have to send gigantic updates around!
- Word vectors will be row vectors
Why two vectors?
- Easier optimization. Average both at the end (see the short sketch below)
- But you can implement the algorithm with just one vector per word … and it helps
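A tiny illustration of "average both at the end"; shapes and values are placeholders:

```python
import numpy as np

U = np.random.randn(10_000, 100)   # context ("outside") vectors, one row per word
V = np.random.randn(10_000, 100)   # center vectors
word_vectors = (U + V) / 2         # single vector per word used downstream
```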
Two model variants:
- Skip-gram (SG): predict context ("outside") words (position-independent) given the center word
- Continuous Bag of Words (CBOW): predict the center word from a (bag of) context words

We presented the Skip-gram model!
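To tie the pieces together, here is a compact, hedged PyTorch sketch of the Skip-gram variant described above: a batch of center-word indices is scored against every vocabulary word via $u_w^T v_c$, and cross-entropy over those logits corresponds to the softmax $P(o \mid c)$ objective from earlier. Class and variable names are my own:

```python
import torch
import torch.nn as nn

class SkipGram(nn.Module):
    def __init__(self, vocab_size, dim=100):
        super().__init__()
        self.center = nn.Embedding(vocab_size, dim)    # V: center-word vectors
        self.context = nn.Embedding(vocab_size, dim)   # U: context-word vectors

    def forward(self, center_ids):
        v_c = self.center(center_ids)                  # (batch, dim)
        return v_c @ self.context.weight.T             # (batch, |V|) logits u_w^T v_c

model = SkipGram(vocab_size=10_000)
opt = torch.optim.SGD(model.parameters(), lr=0.05)

centers = torch.tensor([3, 3, 7])                      # hypothetical (center, context) pairs
contexts = torch.tensor([5, 9, 2])
loss = nn.functional.cross_entropy(model(centers), contexts)
loss.backward()
opt.step()
```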