DeepSeek LLM
Longtermism
1 DeepSeek LLM
Yes, this paper is old by now; it was released in 2024, which is ancient in AI terms.
Yes, there are further advances in the DeepSeek line beyond this paper. Here is why I'm writing about it anyway:
- To solidify my own understanding.
- This is the first paper released by DeepSeek, and I want to track the evolution of the lab's thinking through the papers they release, hopefully building a bigger picture of some areas of model training that I care about.
So with that in mind, beware! I am not covering every detail of this paper; these are just my public notes and annotations on the paper itself.
The paper:
DeepSeek LLM: Scaling Open-Source Language Models with Longtermism
2 Introduction
In the introduction, the authors argue that the community is focused on training fixed-size models and is neglecting scaling-law exploration. They note that earlier work reached varying conclusions about how model and data size should scale with increased compute budgets. In this paper, they examine scaling laws for batch size and learning rate and find a consistent trend with model size, and they conclude that the choice of dataset affects scaling behavior. The paper also covers how they pre-trained, post-trained, and aligned the model. This was, of course, back in the days when DPO was considered state-of-the-art in open source.
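The kind of scaling-law study described above boils down to fitting power laws of the form hyperparameter = a · C^b against compute budget C. Here is a toy sketch of that fitting procedure; all the numbers are synthetic and purely illustrative, not values from the paper.

```python
import numpy as np

# Synthetic data: an "optimal batch size" that doubles with each
# decade of compute. Real studies get these points by sweeping
# hyperparameters at several compute budgets.
compute = np.array([1e17, 1e18, 1e19, 1e20])        # FLOPs (made up)
opt_batch = np.array([0.8e6, 1.6e6, 3.2e6, 6.4e6])  # tokens (made up)

# Fit log(opt_batch) = log(a) + b * log(compute), i.e. a power law.
b, log_a = np.polyfit(np.log(compute), np.log(opt_batch), 1)
a = np.exp(log_a)

pred = a * compute ** b
print(b)  # doubling per decade of compute -> b = log10(2) ≈ 0.301
```

The same log-log regression works for learning rate; the paper's contribution is doing these fits carefully and reporting the trend with model scale.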
They mention pre-training on 2 trillion tokens, along with some training choices such as a multi-step learning rate scheduler. They also mention using about a million instances for supervised fine-tuning (SFT).
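A multi-step scheduler is simple to sketch: linear warmup to the peak, then discrete drops at fixed fractions of the token budget. The milestone fractions and decay factors below (drops to ~31.6% and 10% of peak at 80% and 90% of tokens) are my reading of the paper; the warmup length is an arbitrary placeholder.

```python
def multi_step_lr(tokens_seen, total_tokens, peak_lr,
                  warmup_tokens=1_000_000,          # placeholder value
                  milestones=(0.8, 0.9),            # fractions of token budget
                  factors=(0.316, 0.1)):            # multipliers on peak LR
    """Multi-step LR schedule: linear warmup, then step drops.
    Milestones/factors are assumptions, not quoted from the paper."""
    if tokens_seen < warmup_tokens:
        return peak_lr * tokens_seen / warmup_tokens
    frac = tokens_seen / total_tokens
    lr = peak_lr
    for m, f in zip(milestones, factors):
        if frac >= m:
            lr = peak_lr * f
    return lr
```

One appeal of this shape over cosine decay is that intermediate checkpoints from the long flat phase can be reused for continued training.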
3 Pre-training
3.1 Data
The paper mentions data richness and diversity, though I have yet to find a solid definition of "richness" anywhere. They describe deduplication, filtering, and remixing, and specifically frame filtering as a means of enhancing the density of information. As in other papers, deduplicating across multiple dumps of Common Crawl eliminated roughly four times as many documents as deduplicating a single dump.
The filtering stage is what I was most curious about. They developed criteria for document-quality assessment, involving analysis of linguistic and semantic attributes to provide a view of the data from both individual and global perspectives, they say. However, I did not find much detail about that data curation recipe.
They also do remixing, as seen in other state-of-the-art recipes, primarily to increase the presence of under-represented domains. The aim seems to be enhancing diversity, though it is not clear whether this translates to downstream-task performance.
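The paper does not give its remixing recipe, but a common stand-in for up-weighting under-represented domains is temperature-scaled sampling: raise each domain's raw probability to a power below one and renormalize. The function name and the `temperature` knob here are my own illustration.

```python
def remix_weights(domain_counts, temperature=0.5):
    """Temperature-scaled domain sampling weights: temperature < 1
    flattens the distribution, boosting rare domains. A generic
    technique, not DeepSeek's published recipe."""
    total = sum(domain_counts.values())
    scaled = {d: (c / total) ** temperature for d, c in domain_counts.items()}
    z = sum(scaled.values())
    return {d: s / z for d, s in scaled.items()}
```

For example, a domain holding 1% of raw documents ends up sampled several times more often at temperature 0.5, while the relative ordering of domains is preserved.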
They say they use byte-level byte-pair encoding (BBPE) with pre-tokenization that prevents merging tokens from different character categories, and they split numbers into individual digits. The vocabulary reserves a large number of conventional tokens.
They train the tokenizer on their multilingual corpus and include 15 special tokens, while also preserving space for future tokens and padding the vocabulary size for computational efficiency.
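The two pre-tokenization behaviors above (no merges across character categories, digits split individually) can be sketched with a single regex pass before BPE merges would be applied. The regex is my own simplification for illustration, not DeepSeek's actual tokenizer rules.

```python
import re

# Each alternative is one "category": a single digit, a run of letters,
# a run of other symbols, or a run of whitespace. Because digits match
# one at a time, numbers are split into individual digits, and no
# pre-token mixes categories, so BPE merges can't cross them.
PRETOK = re.compile(r"\d|[A-Za-z]+|[^\sA-Za-z\d]+|\s+")

def pretokenize(text):
    return PRETOK.findall(text)

print(pretokenize("loss=0.42"))
# -> ['loss', '=', '0', '.', '4', '2']
```

Digit splitting keeps numeric tokenization uniform: "42" and "420" share digit tokens instead of each getting an arbitrary merged token.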
3.2 Architecture and Training
The model follows LLaMA (Touvron et al.). While maintaining LLaMA's core architecture, they use 95 layers for their 67B model and 30 layers for the 7B, noting in the paper that they deliberately opted for depth over width. Another variation is grouped-query attention (GQA), introduced by Ainslie et al. in 2023; Meta itself did not adopt it until Llama 2.
They also mention changes to the initialization, a slightly higher peak learning rate, and the use of FP32 for gradient accumulation. They additionally experiment with a small 1.6B-parameter model.
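GQA's trick is letting several query heads share one key/value head, shrinking the KV cache. A minimal NumPy sketch of the mechanism (an illustration only, not DeepSeek's implementation):

```python
import numpy as np

def gqa(q, k, v, n_groups):
    """Grouped-query attention: n_heads query heads share n_groups
    key/value heads (n_groups < n_heads saves KV-cache memory).
    Shapes: q is (n_heads, T, d); k and v are (n_groups, T, d).
    No causal mask, single sequence -- mechanism sketch only."""
    n_heads, T, d = q.shape
    reps = n_heads // n_groups              # query heads per KV group
    k = np.repeat(k, reps, axis=0)          # expand KV heads to match queries
    v = np.repeat(v, reps, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    w = np.exp(scores - scores.max(-1, keepdims=True))   # stable softmax
    w /= w.sum(-1, keepdims=True)
    return w @ v                            # (n_heads, T, d)
```

With n_groups equal to n_heads this reduces to standard multi-head attention; with n_groups of 1 it is multi-query attention, so GQA interpolates between the two.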
4 A list of all DeepSeek papers since inception
4.1 2023
| # | date | type | title / arXiv / repo | short note |
|---|---|---|---|---|
| 1 | 25 Oct 2023 | paper | DreamCraft3D: Hierarchical 3-D Generation… | first DeepSeek-branded paper (3-D gen.) |
| 2 | 02 Dec 2023 | paper + repo | DeepSeek LLM – Scaling Open-Source LMs + deepseek-ai/DeepSeek-LLM | 7B & 67B base + chat |
4.2 2024
| # | date | type | title / arXiv / repo | short note |
|---|---|---|---|---|
| 3 | 11 Jan 2024 | paper | DeepSeekMoE | MoE up-scaling study |
| 4 | 25 Jan 2024 | paper + repo | DeepSeek-Coder + deepseek-ai/DeepSeek-Coder | 1.3-33B, 87 prog. langs |
| 5 | 05 Feb 2024 | paper | DeepSeekMath | math-specialised 7B |
| 6 | 08 Mar 2024 | paper + repo | DeepSeek-VL + deepseek-ai/DeepSeek-VL | vision-language model |
| 7 | 07 May 2024 | paper + repo | DeepSeek-V2 + deepseek-ai/DeepSeek-V2 | MoE w/ MLA, 236 B total |
| 8 | 23 May 2024 | paper | DeepSeek-Prover | LLM theorem prover |
| 9 | 17 Jun 2024 | paper + repo | DeepSeek-Coder-V2 + deepseek-ai/DeepSeek-Coder-V2 | 128 k ctx, 338 langs |
| 10 | 02 Jul 2024 | paper | Expert-Specialised Fine-Tuning… | MoE fine-tuning tricks |
| 11 | 15 Aug 2024 | paper | DeepSeek-Prover-V1.5 | RL + MCTS prover |
| 12 | 26 Aug 2024 | paper | Fire-Flyer AI-HPC | hw/sw co-design report |
| 13 | 28 Aug 2024 | paper | Auxiliary-Loss-Free Load Balancing… | MoE balancing |
| 14 | 17 Oct 2024 | paper + repo | Janus + deepseek-ai/Janus | unified VL understanding & gen |
| 15 | 12 Nov 2024 | paper | JanusFlow | AR + rectified-flow VL model |
| 16 | 13 Dec 2024 | paper + repo | DeepSeek-VL2 + deepseek-ai/DeepSeek-VL2 | MoE vision-language |
| 17 | 27 Dec 2024 | paper + repo | DeepSeek-V3 + deepseek-ai/DeepSeek-V3 | 671 B MoE, FP8 training |
4.3 2025
| # | date | type | title / arXiv / repo | short note |
|---|---|---|---|---|
| 18 | 22 Jan 2025 | paper + repo | DeepSeek-R1 + deepseek-ai/DeepSeek-R1 | reasoning model (RL) |
| 19 | 29 Jan 2025 | paper | Janus-Pro | data- & model-scaled Janus |
| 20 | 11 Feb 2025 | paper | CodeI/O | code I/O pattern distillation |
| 21 | 16 Feb 2025 | paper | Native Sparse Attention | hw-aligned sparse attn |
| 22 | 03 Apr 2025 | paper | Inference-Time Scaling… | reward-model scaling |
| 23 | 30 Apr 2025 | paper | DeepSeek-Prover-V2 | RL sub-goal prover |
| 24 | 14 May 2025 | paper | Insights into DeepSeek-V3… | hw-scaling reflections |
| 25 | 24-28 Feb 2025 | repo batch | DeepSeek Open-Source Week – five infra repos released (no papers) | |
| 26 | 20 Oct 2025 | paper | DeepSeek-OCR | context optical compression |