Target:
- Basics first, then key methods used in NLP: recurrent networks, attention, Transformers, etc.
- A big-picture understanding of human languages and the difficulties in understanding and producing them.
- An understanding of, and the ability to build, systems (in PyTorch) for some of the major problems in NLP: word meaning, dependency parsing, machine translation, question answering.
Assignments and Project
Main question: how do we represent a word as a vector?
Word2vec (Mikolov et al. 2013) is a framework for learning word vectors.
Definition
- Objective function (to be minimized): $$ J(\theta)=-\frac{1}{T} \log L(\theta)=-\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \leq j \leq m \\ j \neq 0}} \log P\left(w_{t+j} \mid w_{t} ; \theta\right) $$
- How do we calculate $P(w_{t+j} \mid w_{t};\theta)$?
- Use two vectors per word $w$:
  - $v_w$ when $w$ is a ==center word==
  - $u_w$ when $w$ is a ==context word==
- Then for a center word $c$ and a context word $o$, the prediction function is: $$ P(o \mid c)=\frac{\exp \left(u_{o}^{T} v_{c}\right)}{\sum_{w \in V} \exp \left(u_{w}^{T} v_{c}\right)} $$
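As a minimal sketch (mine, not the course's code), the prediction function above is just a softmax over the dot products $u_w^T v_c$ between the center vector and every context ("outside") vector; the function and variable names below are placeholders:

```python
import numpy as np

def p_context_given_center(o, c, U, V):
    """P(o | c): U, V are |V| x d arrays of context and center vectors; o, c are word indices."""
    scores = U @ V[c]                              # u_w^T v_c for every word w in the vocab
    scores = scores - scores.max()                 # subtract max for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()  # softmax
    return probs[o]

# Tiny usage example with random vectors (shapes are only for illustration):
U = np.random.randn(1000, 50)
V = np.random.randn(1000, 50)
print(p_context_given_center(o=3, c=7, U=U, V=V))
```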
Gradient $$ \begin{aligned} \mathcal{U}_{\text{new}} &\leftarrow \mathcal{U}_{\text{old}}-\alpha \nabla_{\mathcal{U}} J \\ \mathcal{V}_{\text{new}} &\leftarrow \mathcal{V}_{\text{old}}-\alpha \nabla_{\mathcal{V}} J \end{aligned} $$
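To make the update concrete, here is a rough NumPy sketch (my own, assuming the full-softmax loss $J = -\log P(o \mid c)$ on a single (center, context) pair); the gradient expressions follow directly from differentiating the softmax:

```python
import numpy as np

def sgd_step(o, c, U, V, lr=0.025):
    """One SGD step on a single (center, context) = (c, o) pair.
    U, V: |V| x d arrays of context and center vectors (updated in place)."""
    scores = U @ V[c]
    scores -= scores.max()
    probs = np.exp(scores) / np.exp(scores).sum()   # P(w | c) for all w

    # dJ/dv_c = sum_w P(w|c) u_w  -  u_o
    grad_vc = U.T @ probs - U[o]
    # dJ/du_w = (P(w|c) - 1[w == o]) * v_c   for every w
    delta = probs.copy()
    delta[o] -= 1.0
    grad_U = np.outer(delta, V[c])

    U -= lr * grad_U          # U_new <- U_old - alpha * grad_U
    V[c] -= lr * grad_vc      # only the center word's row of V changes
```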
- But this package is not actually used much in deep learning.
Gensim word vector visualization notebook
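For reference, a minimal example of the kind of exploration the Gensim notebook does; the vector file name below is a placeholder, not a file these notes provide:

```python
from gensim.models import KeyedVectors

# Load pre-trained vectors in word2vec text format (file name is hypothetical).
wv = KeyedVectors.load_word2vec_format("vectors.word2vec.txt")

print(wv.most_similar("banana"))                 # nearest neighbours by cosine similarity
print(wv.most_similar(positive=["king", "woman"],
                      negative=["man"]))         # analogy: king - man + woman
```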
Review
- $J(\theta)$ is a function of all windows in the corpus (often billions!)
- So $\nabla_{\theta}J(\theta)$ is very expensive to compute
- Instead, use stochastic gradient descent (SGD): iteratively take gradients at each such window
- But in each window, we only have at most $2m + 1$ words, so $\nabla_{\theta} J_{t}(\theta)$ is very sparse!
- We might only update the word vectors that actually appear!
- Solution: either you need sparse-matrix update operations to only update certain rows of the full embedding matrices $U$ and $V$, or you need to keep around a hash for word vectors (see the sketch after this list).
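One possible way to realize "only update the rows that actually appear" (an assumption on my part, not something this section prescribes) is PyTorch's sparse embedding gradients. The sketch uses negative sampling rather than the full softmax, since with the full softmax the gradient for $U$ touches every row:

```python
import torch
import torch.nn as nn

vocab_size, dim = 50_000, 100
V = nn.Embedding(vocab_size, dim, sparse=True)   # center-word vectors
U = nn.Embedding(vocab_size, dim, sparse=True)   # context-word vectors
opt = torch.optim.SparseAdam(list(V.parameters()) + list(U.parameters()), lr=1e-3)

center = torch.tensor([17])                      # hypothetical word indices
context = torch.tensor([42])
negatives = torch.randint(0, vocab_size, (5,))   # 5 sampled negative words

v_c = V(center)                                          # (1, dim)
pos = torch.sigmoid((U(context) * v_c).sum(-1))          # score of the true context word
neg = torch.sigmoid(-(U(negatives) * v_c).sum(-1))       # scores of the negatives
loss = -(torch.log(pos).sum() + torch.log(neg).sum())
loss.backward()      # sparse gradients: only the indexed rows are touched
opt.step()
```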
Word2vec algorithm family (Skip-grams)
If you have millions of word vectors and do distributed computing, it is important to not have to send gigantic updates around!
- Word vectors will be row vectors
Why two vectors?
- Easier optimization. Average both at the end (see the short sketch below)
- But you can implement the algorithm with just one vector per word … and it helps
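A tiny illustration of "average both at the end"; shapes and values are placeholders:

```python
import numpy as np

U = np.random.randn(10_000, 100)   # context ("outside") vectors, one row per word
V = np.random.randn(10_000, 100)   # center vectors
word_vectors = (U + V) / 2         # single vector per word used downstream
```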
Two model variants:
- Skip-gram (SG): predict context ("outside") words (position-independent) given the center word
- Continuous Bag of Words (CBOW): predict the center word from a (bag of) context words

We presented the Skip-gram model!
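To tie the pieces together, here is a compact, hedged PyTorch sketch of the Skip-gram variant described above: a batch of center-word indices is scored against every vocabulary word via $u_w^T v_c$, and cross-entropy over those logits corresponds to the softmax $P(o \mid c)$ objective from earlier. Class and variable names are my own:

```python
import torch
import torch.nn as nn

class SkipGram(nn.Module):
    def __init__(self, vocab_size, dim=100):
        super().__init__()
        self.center = nn.Embedding(vocab_size, dim)    # V: center-word vectors
        self.context = nn.Embedding(vocab_size, dim)   # U: context-word vectors

    def forward(self, center_ids):
        v_c = self.center(center_ids)                  # (batch, dim)
        return v_c @ self.context.weight.T             # (batch, |V|) logits u_w^T v_c

model = SkipGram(vocab_size=10_000)
opt = torch.optim.SGD(model.parameters(), lr=0.05)

centers = torch.tensor([3, 3, 7])                      # hypothetical (center, context) pairs
contexts = torch.tensor([5, 9, 2])
loss = nn.functional.cross_entropy(model(centers), contexts)
loss.backward()
opt.step()
```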