Deep Dive into LLMs like ChatGPT

- [Deep Dive into LLMs like ChatGPT](https://www.youtube.com/watch?v=7xTGNNLPyMI) # Pretraining - **pretraining & inference** : 1. download and preprocess the internet - 需要大量文档，large diversity & high quality，来获取足够多的知识 - [fineweb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) 2. tokenization - 将 txt 重新拆分呈现为一维的 symbols 序列 - 减少 vocabulary（symbol set） size - symbol/token 可以看作是 unique id - [Tiktokenizer](https://tiktokenizer.vercel.app/) 3. neural network training - ![](https://img.jonahgao.com/oss/note/2025p2/dive_llm_neural_taining.png) - **输入**：windows of token, window 长度不固定，一般有最大限制(max context length) - **输出**：prediction，预测下一个token是什么 - 从训练文本我们知道正确答案是什么，通过对神经网络进行 tuning，使得正确答案的概率更高 - internals: - ![](https://img.jonahgao.com/oss/note/2025p2/dive_llm_neural_internals.png) - 参数化的数学函数，无 memory，stateless - 不断更新、调整参数/weights，让 predictions 跟 training set 一致 - [Transformer Neural Net 3D visualizer](https://bbycroft.net/llm) - inference - 根据模型生成新的数据，predict one token at a time - ![400](https://img.jonahgao.com/oss/note/2025p2/dive_llm_neural_inference.png) - 每次都是按概率投硬币，生成下一个 token。 - llm.c Let's Reproduce GPT-2：[https://github.com/karpathy/llm.c/discussions/677](https://github.com/karpathy/llm.c/discussions/677) - **base model**: internet document simulator - 模型示例： - [openai/gpt-2](https://github.com/openai/gpt-2) - Llama 3 - a release of a model 包括： - 运行 Transformer 的代码（例如 200 行 python 代码） - Transformer 的 parameters（例如 1.6 亿个数字） - 测试：Hyperbolic, for inference of base model: https://app.hyperbolic.xyz - base model 不是一个 assistant，只是一个 ==token autocomplete==，并且是一个 stochastic system（随机系统）。 - 并不是很有用，predict 也并不完全可信（只是对 internet documents 的 recollection）。 - 1.6 亿个 parameters 可以看作是对 internet 的一种==有损压缩== - 通过 parameters 存储了大量 knowledge - knowledge 不是精确的，而是 vegue、probabilistic 和 statistical 的 - 训练文本中质量高的、出现次数多的，更有可能被 recite（例如维基百科） - **hallucination**：例如输入模型训练时间点之后的内容，模型会猜测输出不真实的信息 - 具备一定的应用能力： - Few-shot prompting & in context learning ability：通过 prompt 使 base model 变成 assistant - shot：给 AI 提供一些 examples > [!summary] > - **pretraining stage**：将 internet documents 拆分为 tokens，通过神经网络来 predict token sequences > - **base model** 是 pretraining stage 的产出，具备一定应用能力，但可以做到更好（通过 post-training）； --- # Post-training Supervised Finetuning - 相比 pretraining 训练成本更低，但也极其重要，将 model 转换成 assistant。 - pre-training 需要3 个月 vs post-training 3 小时 - 基于 data set of **conversations**（来自人工标注）继续训练神经网络 - 让模型学会如何在 inference 时回应 human queries - **tokenization** of conversations - 协议/格式：将 coversation 编解码为 token - 将 structured object 转换为一维 tokens - 加入新的 special token 表示一轮会话的开始、角色、结束等 - ![600](https://img.jonahgao.com/oss/note/2025p2/dive_llm_conversation_tokenization.png) - 将 conversations 转换为 tokens 后，后续的流程就跟 pre-training 一样了，包括训练和推理。 - 推理时构造如下的 token prefix ： - ![500](https://img.jonahgao.com/oss/note/2025p2/dive_llm_conversation_inference.png) - ==a statistical simulation of a human labeler== - [InstructGPT](https://arxiv.org/abs/2203.02155) - fine tune LLM on conversations - 人工标注，构造 conversations - prompt + ideal assistant response - helpful，truthful，harmless - conversations 的生成： - 人工标注：[OpenAssistant Conversations Dataset](https://huggingface.co/datasets/OpenAssistant/oasst1) - LLM 可以用于生成 conversations > [!summary] Post-training > - 与 pre-training 的区别：训练数据集不同，来自 conversations --- # LLM Psychology ## Hallucinations - ![500](https://img.jonahgao.com/oss/note/2025p2/dive_llm_hallucinations.png) - 对于不知道的知识进行 sample from probabilieis，consistent with the sytle of the answer in its training set - best guess - meta Llama 的处理： 1. knowledge probing technique: 识别出 model 知道什么和不知道什么 - 给定一段话，让另一个 LLM 从中提取生成 factual questions（问题 + 正确答案），对 LLM 进行测试其是否知道 2. 在 training set 中添加 examples：对于 model 不知道的事情正确的回答就是不知道 - 添加 conversations：factual question，答案是 i don’t known ## Tools - Allow the model to search - 是解决 Hallucinations 的一种方式。 - 示例： - Human："Who is Orson Kovacs" - Assistant: "<SEARCH_START> Who is Orson Kovacs <SEARCH_END>" - 引入新的格式/协议，模型可以输出特殊的 token（SEARCH），模型识别到特殊 tokens 后停止继续生成并转去搜索，将搜索的结果加入 context window - context window 可以看作是模型的 working memory。context window 中的 data 可以被模型访问到，可以 feed 给 neural network。 - 在 training set 中加入 example，让模型学会使用工具（web search），模型决定何时去 search。 ## Vague recollection vs Working memory - Knowledge in the parameters == Vague recollection - 类似于你一个月前读过的东西 - Knowledge in the tokens of context window == Working memory ## Knowledge of self - Users：What model are you? Who built you? - AI 没有自我身份的认知（identify），只是 token simulator。 - 解决： - hardcoded dataset：给出此类问题的正确答案 - [allenai/olmo-2-hard-coded](https://huggingface.co/datasets/allenai/olmo-2-hard-coded) - system message：invisible messages，加入模型的 identify ---- # Computational capabilities ## Models need tokens to rethink - ![](https://img.jonahgao.com/oss/note/2025p2/dive_llm_computational_spread.png) - 右边的更好 - 创建了 intermediate calculations，much easier for the model，it‘s not too much work per token，可以处理 single forward pass of network 无法解决的问题 - ==spread out== its compuatation over the tokens, ask models to create intermediate results. - every single token is only spending finite amount of computation on the model. - 也就是说 inference 一个 token 的计算能力是**有限**的（single forward pass of a network），避免只靠一个 token 计算一个复杂的问题。 - 使用 code **tool** - 避免模型 try to do it all in their memory ```prompt Emily buys 23 apples and 177 oranges. Each orange costs $4. The total cost of all the fruit is $869. What is the cost of each apple? Use code. ``` > [!summary] Models need tokens to rethink > 让模型有机会执行更多 forward passes of network ## Models can't count - ```prompt How many dots are below? ................................................................................................. ``` - 模型不是很擅长计数，原因也是与上面类似的，在单个 token inference 时要求了太多计算。 - in a single token, it has to count the number of dots in its context window. It has to do that in the **single forward pass** of a network. ## Models are not good with spellings ```prompt Print back the following string, but only print every 3rd character, starting from the first one. ubiquitous ``` - Models see tokens (text chunks), not individual letters! --- # Post-Training: Reinforcement Learning - training pipleline 的第三步。last major stage of training。 - stages of learning a textbook: 1. exposition <=> pretraining - background knowledge - base model 2. worked problems <=> supervised finetuning - problem + demonstrated solution, for imitation - sft model - 模拟 human expert 3. practice problems <=> reinforcement learning - prompts to practice, trial & error until you reach the correct answer - 知道最终答案，但不知道 solution，尝试去练习 solution - 动机： - we are not in a good postition to create these token sequences for the LLM - 认知不同，人类并不一定知道哪些 token sequences 对 LLM 来说是更好的 - let the LLM to dicover the token sequences that work for it, what token sequences reliably gets to the answer - ![400](https://img.jonahgao.com/oss/note/2025p2/dive_llm_rl1.png) 1. 给定 prompt 和 final answer，不断尝试让 LLM 生成 solutions 2. 挑选正确且短的 solution，鼓励 LLM 去生成这类 solutions - 短或者有其他好的属性的 solution 3. 鼓励：基于这些 solutions 做训练 > [!summary] > - Reinforcement Learning 的 data sets 不来自人工标注，而是来自 LLM 自身生成的 solutions。 - RL 的开发目前还处于早期阶段，在该领域内还没有标准化。 - [DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning](https://arxiv.org/abs/2501.12948) - thinking model - thinking and trying different ways, giving higher accuracy - AlphaGo - ![400](https://img.jonahgao.com/oss/note/2025p2/dive_llm_go.png) - Supervised Learning 只是模仿人类顶尖的 player，永远无法超越他们 - 而 RL 通过自我对抗可以做到超越 - learning in unverifiable domains - 没有固定答案，LLM 无法自动评价 solutions（score） - prompt："write a joke about pelicans" - RLHF（Reinforcement Learning From Human Feedback） - Naive approach: - Run RL as usual, of 1,000 updates of 1,000 prompts of 1,000 rollouts. - cost: 1,000,000,000 scores from humans - RHLF approach: 1. Take 1,000 prompts, get 5 rollouts, order them from best to worst (cost: 5,000 scores from humans) 2. Train a neural net simulator of human preferences ("reward model") 3. Run RL as usual, but using the simulator instead of actual humans - 专门训练一个 reward model，模拟人类评价 - 该模型的输出是 a single number: score - consistent with human orderings - **upside**: - run RL in arbitrary domains - improves the performance of the model, possibly due to the "discriminator - generator gap" - 对 human labbers 来说，辨别/评价比生成更简单，准确度更高 - **dowside**： - We are doing RL with respect to a ==lossy== simulation of humans. It might be misleading! - RL discovers ways to "game" the model. - get high score in a fake way - 不能像普通 RL 一样无限次运行 --- # Previewing of Things to Come - Multimodal - audio, images, video, natural conversations - tasks -> agents - long, coherent, error-correcting contexts - pervasive, invisible - computer-using - test-time tranining?, etc - 模型训练完后就是固定的了，唯一有变动的是 tokens in the context windows - in-context learning, dynamically adjustable - 只靠增大 context windows 不适合 long running tasks ---- # Where to Keep Track of them - reference https://lmarena.ai/ - subscribe to https://buttondown.com/ainews - X / Twitter --- # Where To Find Them - Proprietary models: on the respective websites of the LLM providers - Open weights models (DeepSeek, Llama): an inference provider, e.g. TogetherAI - Run them locally! LMStudio --- # Summary | | Pre-Training | Post-Training(SuperVised Finetuning) | Post-Training(Reinforcement Learning) | | -------- | ------------------------ | ------------------------------------ | ------------------------------------- | | data-set | internet | conversations of human labbers | LLM self generated solutions | | product | base model | sft model | RL model | | function | internet token simulator | imitation of human experts | thinking and cognitive strategies |