Transformers (Part 2)
Illustrating how a Transformer works from the very basics.
This covers self-attention, multi-headed attention, the intuition behind the dot product and the query, key, and value vectors, and BERT masked language modelling. (Illustrative code sketches of the attention mechanism and positional encoding are included after the timestamps below.)
Part 1 here: https://www.youtube.com/watch?v=iBamMr2WEsQ
ChatGPT video here: https://www.youtube.com/watch?v=wA8rjKueB3Q
Slides can be found at:
https://github.com/tanchongmin/TensorFlow-Implementations
0:14 Recap (Word Embedding, Feed Forward Neural Network)
4:00 Intuition behind Query, Key, Value
14:35 Mapping of dimensions to Q, K, V
20:50 Intuition behind Dot Product
27:11 Intuition behind scaling of softmax
32:35 Adding up the final values after attention
36:28 Multi-headed attention
43:33 Visualizing 2 attention heads
46:13 Skip-connection and Layer Normalization
50:58 Intuition behind Positional Encoding
54:20 BERT
1:01:40 Final Comments
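
Below is a minimal NumPy sketch of scaled dot-product attention, the mechanism discussed in the Q, K, V and dot-product chapters. It is not taken from the video or the slides; the sizes, weight matrices, and function names are illustrative assumptions.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # dot products of queries with keys, scaled by sqrt(d_k)
    weights = softmax(scores, axis=-1)  # each row sums to 1: how strongly each token attends to the others
    return weights @ V                  # weighted sum of value vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                                   # 4 tokens, embedding dimension 8 (assumed sizes)
W_q, W_k, W_v = (rng.normal(size=(8, 4)) for _ in range(3))   # learned projections (random here for illustration)
Q, K, V = X @ W_q, X @ W_k, X @ W_v                           # map each token embedding to query, key, value
out = scaled_dot_product_attention(Q, K, V)                   # shape (4, 4): one attended output per token

Multi-headed attention runs several such computations in parallel, each with its own W_q, W_k, W_v, and concatenates the per-head outputs before a final linear projection.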
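
And a minimal sketch of the sinusoidal positional encoding covered at 50:58, following the standard formulation from "Attention Is All You Need"; seq_len and d_model are assumed toy values, not taken from the video.

import numpy as np

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]                   # token positions 0 .. seq_len-1
    i = np.arange(d_model // 2)[None, :]                # index of each (sin, cos) dimension pair
    angles = pos / np.power(10000.0, 2 * i / d_model)   # lower dimension pairs oscillate faster
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                        # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)                        # odd dimensions use cosine
    return pe                                           # added element-wise to the token embeddings

pe = positional_encoding(seq_len=10, d_model=8)         # d_model must be even in this simple sketch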