DeepSeek LLM
Longtermism
1 DeepSeek LLM
Yes, this paper is old by now; it was released in 2024, which is ancient in AI terms.
Yes, there are further advances in the DeepSeek line beyond this paper. Here is why I'm writing about it anyway:
- To solidify my own understanding.
- This is the first paper released by DeepSeek, and I want to track the evolution of the lab's thinking through the papers they release, hopefully building a bigger picture of some areas of model training that I care about.
So with that in mind, beware! I am not covering every detail of this paper; these are just my public notes and annotations on the paper itself.
The paper:
DeepSeek LLM: Scaling Open-Source Language Models with Longtermism
2 Introduction
In the introduction, the authors argue that the community is focused on training fixed-size models and is neglecting scaling-law exploration. They note that earlier work reached varying conclusions about how model and data size should scale with increased compute budgets. In this paper, they examine scaling laws for batch size and learning rate and find a consistent trend with model size, and they conclude that the choice of dataset affects scaling behavior. The paper also covers how they pre-trained, post-trained, and aligned the model. This was, of course, back in the days when DPO was considered state-of-the-art in open source.
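The kind of scaling-law study described above boils down to fitting power laws of the form hyperparameter = a · C^b against compute budget C. Here is a toy sketch of that fitting procedure; all the numbers are synthetic and purely illustrative, not values from the paper.

```python
import numpy as np

# Synthetic data: an "optimal batch size" that doubles with each
# decade of compute. Real studies get these points by sweeping
# hyperparameters at several compute budgets.
compute = np.array([1e17, 1e18, 1e19, 1e20])        # FLOPs (made up)
opt_batch = np.array([0.8e6, 1.6e6, 3.2e6, 6.4e6])  # tokens (made up)

# Fit log(opt_batch) = log(a) + b * log(compute), i.e. a power law.
b, log_a = np.polyfit(np.log(compute), np.log(opt_batch), 1)
a = np.exp(log_a)

pred = a * compute ** b
print(b)  # doubling per decade of compute -> b = log10(2) ≈ 0.301
```

The same log-log regression works for learning rate; the paper's contribution is doing these fits carefully and reporting the trend with model scale.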
They mention pre-training on 2 trillion tokens, along with some training choices such as a multi-step learning rate scheduler. They also mention using about a million instances for supervised fine-tuning (SFT).
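A multi-step scheduler is simple to sketch: linear warmup to the peak, then discrete drops at fixed fractions of the token budget. The milestone fractions and decay factors below (drops to ~31.6% and 10% of peak at 80% and 90% of tokens) are my reading of the paper; the warmup length is an arbitrary placeholder.

```python
def multi_step_lr(tokens_seen, total_tokens, peak_lr,
                  warmup_tokens=1_000_000,          # placeholder value
                  milestones=(0.8, 0.9),            # fractions of token budget
                  factors=(0.316, 0.1)):            # multipliers on peak LR
    """Multi-step LR schedule: linear warmup, then step drops.
    Milestones/factors are assumptions, not quoted from the paper."""
    if tokens_seen < warmup_tokens:
        return peak_lr * tokens_seen / warmup_tokens
    frac = tokens_seen / total_tokens
    lr = peak_lr
    for m, f in zip(milestones, factors):
        if frac >= m:
            lr = peak_lr * f
    return lr
```

One appeal of this shape over cosine decay is that intermediate checkpoints from the long flat phase can be reused for continued training.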
3 Pre-training
3.1 Data
The paper mentions data richness and diversity, though I have yet to find a solid definition of "richness" anywhere. They describe deduplication, filtering, and remixing, and specifically frame filtering as a means of enhancing the density of information. As in other papers, deduplicating across multiple dumps of Common Crawl eliminated roughly four times as many documents as deduplicating a single dump.
The filtering stage is what I was most curious about. They developed criteria for document-quality assessment, involving analysis of linguistic and semantic attributes to provide a view of the data from both individual and global perspectives, they say. However, I did not find much detail about that data curation recipe.
They also do remixing, as seen in other state-of-the-art recipes, primarily to increase the presence of under-represented domains. The aim seems to be enhancing diversity, though it is not clear whether this translates to downstream-task performance.
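The paper does not give its remixing recipe, but a common stand-in for up-weighting under-represented domains is temperature-scaled sampling: raise each domain's raw probability to a power below one and renormalize. The function name and the `temperature` knob here are my own illustration.

```python
def remix_weights(domain_counts, temperature=0.5):
    """Temperature-scaled domain sampling weights: temperature < 1
    flattens the distribution, boosting rare domains. A generic
    technique, not DeepSeek's published recipe."""
    total = sum(domain_counts.values())
    scaled = {d: (c / total) ** temperature for d, c in domain_counts.items()}
    z = sum(scaled.values())
    return {d: s / z for d, s in scaled.items()}
```

For example, a domain holding 1% of raw documents ends up sampled several times more often at temperature 0.5, while the relative ordering of domains is preserved.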
They say they use byte-level byte-pair encoding (BBPE) with pre-tokenization that prevents merging tokens from different character categories, and they split numbers into individual digits. The vocabulary reserves a large number of conventional tokens.
They train the tokenizer on their multilingual corpus and include 15 special tokens, while also preserving space for future tokens and padding the vocabulary size for computational efficiency.
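The two pre-tokenization behaviors above (no merges across character categories, digits split individually) can be sketched with a single regex pass before BPE merges would be applied. The regex is my own simplification for illustration, not DeepSeek's actual tokenizer rules.

```python
import re

# Each alternative is one "category": a single digit, a run of letters,
# a run of other symbols, or a run of whitespace. Because digits match
# one at a time, numbers are split into individual digits, and no
# pre-token mixes categories, so BPE merges can't cross them.
PRETOK = re.compile(r"\d|[A-Za-z]+|[^\sA-Za-z\d]+|\s+")

def pretokenize(text):
    return PRETOK.findall(text)

print(pretokenize("loss=0.42"))
# -> ['loss', '=', '0', '.', '4', '2']
```

Digit splitting keeps numeric tokenization uniform: "42" and "420" share digit tokens instead of each getting an arbitrary merged token.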
3.2 Architecture and Training
The model follows LLaMA (Touvron et al.). While maintaining LLaMA's core architecture, they use 95 layers for their 67B model and 30 layers for the 7B, noting in the paper that they deliberately opted for depth over width. Another variation is grouped-query attention (GQA), introduced by Ainslie et al. in 2023; Meta itself did not adopt it until Llama 2.
They also mention changes to the initialization, a slightly higher peak learning rate, and the use of FP32 for gradient accumulation. They additionally experiment with a small 1.6B-parameter model.
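GQA's trick is letting several query heads share one key/value head, shrinking the KV cache. A minimal NumPy sketch of the mechanism (an illustration only, not DeepSeek's implementation):

```python
import numpy as np

def gqa(q, k, v, n_groups):
    """Grouped-query attention: n_heads query heads share n_groups
    key/value heads (n_groups < n_heads saves KV-cache memory).
    Shapes: q is (n_heads, T, d); k and v are (n_groups, T, d).
    No causal mask, single sequence -- mechanism sketch only."""
    n_heads, T, d = q.shape
    reps = n_heads // n_groups              # query heads per KV group
    k = np.repeat(k, reps, axis=0)          # expand KV heads to match queries
    v = np.repeat(v, reps, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    w = np.exp(scores - scores.max(-1, keepdims=True))   # stable softmax
    w /= w.sum(-1, keepdims=True)
    return w @ v                            # (n_heads, T, d)
```

With n_groups equal to n_heads this reduces to standard multi-head attention; with n_groups of 1 it is multi-query attention, so GQA interpolates between the two.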
4 A list of all DeepSeek papers since inception
4.1 2023
| # | date | type | title / arXiv / repo | short note |
|---|---|---|---|---|
| 1 | 25 Oct 2023 | paper | DreamCraft3D: Hierarchical 3-D Generation… | first DeepSeek-branded paper (3-D gen.) |
| 2 | 02 Dec 2023 | paper + repo | DeepSeek LLM – Scaling Open-Source LMs + deepseek-ai/DeepSeek-LLM | 7B & 67B base + chat |
4.2 2024
| # | date | type | title / arXiv / repo | short note |
|---|---|---|---|---|
| 3 | 11 Jan 2024 | paper | DeepSeekMoE | MoE up-scaling study |
| 4 | 25 Jan 2024 | paper + repo | DeepSeek-Coder + deepseek-ai/DeepSeek-Coder | 1.3-33B, 87 prog. langs |
| 5 | 05 Feb 2024 | paper | DeepSeekMath | math-specialised 7B |
| 6 | 08 Mar 2024 | paper + repo | DeepSeek-VL + deepseek-ai/DeepSeek-VL | vision-language model |
| 7 | 07 May 2024 | paper + repo | DeepSeek-V2 + deepseek-ai/DeepSeek-V2 | MoE w/ MLA, 236 B total |
| 8 | 23 May 2024 | paper | DeepSeek-Prover | LLM theorem prover |
| 9 | 17 Jun 2024 | paper + repo | DeepSeek-Coder-V2 + deepseek-ai/DeepSeek-Coder-V2 | 128 k ctx, 338 langs |
| 10 | 02 Jul 2024 | paper | Expert-Specialised Fine-Tuning… | MoE fine-tuning tricks |
| 11 | 15 Aug 2024 | paper | DeepSeek-Prover-V1.5 | RL + MCTS prover |
| 12 | 26 Aug 2024 | paper | Fire-Flyer AI-HPC | hw/sw co-design report |
| 13 | 28 Aug 2024 | paper | Auxiliary-Loss-Free Load Balancing… | MoE balancing |
| 14 | 17 Oct 2024 | paper + repo | Janus + deepseek-ai/Janus | unified VL understanding & gen |
| 15 | 12 Nov 2024 | paper | JanusFlow | AR + rectified-flow VL model |
| 16 | 13 Dec 2024 | paper + repo | DeepSeek-VL2 + deepseek-ai/DeepSeek-VL2 | MoE vision-language |
| 17 | 27 Dec 2024 | paper + repo | DeepSeek-V3 + deepseek-ai/DeepSeek-V3 | 671 B MoE, FP8 training |
4.3 2025
| # | date | type | title / arXiv / repo | short note |
|---|---|---|---|---|
| 18 | 22 Jan 2025 | paper + repo | DeepSeek-R1 + deepseek-ai/DeepSeek-R1 | reasoning model (RL) |
| 19 | 29 Jan 2025 | paper | Janus-Pro | data- & model-scaled Janus |
| 20 | 11 Feb 2025 | paper | CodeI/O | code I/O pattern distillation |
| 21 | 16 Feb 2025 | paper | Native Sparse Attention | hw-aligned sparse attn |
| 22 | 03 Apr 2025 | paper | Inference-Time Scaling… | reward-model scaling |
| 23 | 30 Apr 2025 | paper | DeepSeek-Prover-V2 | RL sub-goal prover |
| 24 | 14 May 2025 | paper | Insights into DeepSeek-V3… | hw-scaling reflections |
| 25 | 24-28 Feb 2025 | repo batch | DeepSeek Open-Source Week – five infra repos released (no papers) | |
| 26 | 20 Oct 2025 | paper | DeepSeek-OCR | context optical compression |