CoderEthan学习站

Contents of this page

02-some_detials

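Don't forget the residual connection: in the forward pass it carries the original input through unchanged, and in the backward pass the skip path contributes a gradient of 1, so parameters keep receiving updates
The difference between placing the residual connection before and after
BERT produces dynamic (context-dependent) word embeddings
Why BERT is not suited to text generation
Is BERT's masking truly random (during training)?
KV cache (past_key_values)
Differences and connections between BERT and the Transformer
BERT and other large language models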
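
The first item above states that a residual connection contributes a gradient of 1 in the backward pass. A minimal sketch of why that matters, assuming a toy stack of scalar linear layers (the function `chain_gradient` and the slope value 0.01 are hypothetical, chosen only for illustration): for y = x + F(x) the derivative is 1 + F'(x), so the per-layer factors in the chain rule stay close to 1 instead of shrinking toward 0.

```python
def chain_gradient(local_slopes, residual):
    """Derivative of the output w.r.t. the input for a stack of scalar layers.

    Each layer is either plain, f(x) = w * x, or residual, f(x) = x + w * x.
    By the chain rule the total derivative is the product of per-layer factors.
    """
    grad = 1.0
    for w in local_slopes:
        # A plain layer contributes w; a residual layer contributes 1 + w,
        # where the 1 comes from the identity (skip) path.
        grad *= (1.0 + w) if residual else w
    return grad


# Hypothetical setup: 10 layers whose local derivatives are all very small.
slopes = [0.01] * 10
plain_grad = chain_gradient(slopes, residual=False)
residual_grad = chain_gradient(slopes, residual=True)

print(f"plain stack gradient:    {plain_grad:.3e}")
print(f"residual stack gradient: {residual_grad:.4f}")
```

In this toy setting the plain stack's gradient is the product of ten small numbers and effectively vanishes, while the residual stack's gradient stays on the order of 1. Real networks use matrices and nonlinearities, so this is only an intuition for the scalar case.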

Copyright © 2024-present Ethan.Liu