I’ve been meaning to dive into the RL efforts by the Kimi team. I came across the Kimi K2 paper a while ago, which eventually led me down a path to the k1.5 paper, where they go deeper into the training and infra efforts.
Here are my abridged notes and thoughts on the RL formulation proposed by them.
Some key things to look out for in this blog:
Some notation clarifications before we dive in:
$\mathcal{D}=\left\{\left(x_i, y_i^*\right)\right\}_{i=1}^n$ refers to the dataset of input problems paired with their ground-truth answers.
$x$ is the problem, $y^*$ is the ground truth.
$z_j \sim \pi_\theta(\cdot \mid x, z_1, \dots, z_{j-1})$ refers to the $j$-th intermediate reasoning step, i.e. a chunk of tokens (a part of the entire CoT) sampled conditioned on the problem and all previous steps. This doesn’t necessarily correspond to an explicit “Step A, Step B” structure in the chain of thought; think of $z_j$ as an intermediate piece of that sequence.
$y \sim \pi_\theta( \cdot \mid x, z_1, \dots, z_m)$ refers to the final output/answer from the model, once the chain-of-thought sequence ends.
$s = (x, z_{1:|s|})$ refers to a partial solution generated by the model, where $|s|$ is the number of steps generated so far. We use this notation at times because it corresponds to a node of the tree $T$.
$r$ refers to a reward model that can assign a score to a given problem and solution pair $(x, y)$ against the ground truth. For the rest of the blog, $r(x, y, y^*) \in \{0, 1\}$, i.e. a binary correct/incorrect signal.
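To make the notation concrete, here is a minimal sketch of a single rollout and the binary reward. The `policy` object and its `sample_step`, `is_done`, and `sample_answer` methods are hypothetical placeholders, not the paper's actual code.

```python
# Minimal sketch of the notation above; the policy interface is hypothetical.

def rollout(policy, x, max_steps=64):
    """Sample a chain of thought z_1..z_m, then the final answer y."""
    z = []
    for _ in range(max_steps):
        z_j = policy.sample_step(x, z)   # z_j ~ pi_theta(. | x, z_1, ..., z_{j-1})
        z.append(z_j)
        if policy.is_done(x, z):         # the model signals the CoT has ended
            break
    y = policy.sample_answer(x, z)       # y ~ pi_theta(. | x, z_1, ..., z_m)
    return z, y

def reward(x, y, y_star):
    """Binary reward r(x, y, y*) in {0, 1}; here a naive exact-match check."""
    return int(str(y).strip() == str(y_star).strip())

# Dataset D = {(x_i, y_i*)}_{i=1..n} of problems and ground truths:
# for x, y_star in D:
#     z, y = rollout(policy, x)
#     r = reward(x, y, y_star)
```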
What is this tree $T$? Each node of $T$ can be seen as a partial generation from the model, i.e. a checkpoint in the generation (think beam search). The authors refer to this as a search tree that allows us to diversify generations and get feedback from critic models.
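Purely for intuition (the next paragraph explains why the paper ultimately drops this construction), a node of such a search tree could be sketched like this, using the same hypothetical `policy.sample_step` interface as above:

```python
from dataclasses import dataclass, field

# Illustrative only: a node of the search tree T holds a partial solution
# s = (x, z_{1:|s|}) and can branch into several candidate next steps.
@dataclass
class Node:
    x: str                                       # the problem
    steps: list = field(default_factory=list)    # z_1 .. z_|s|, the partial CoT
    children: list = field(default_factory=list)

    def expand(self, policy, k=4):
        """Branch this partial solution by sampling k candidate next steps."""
        for _ in range(k):
            z_next = policy.sample_step(self.x, self.steps)   # hypothetical call
            self.children.append(Node(self.x, self.steps + [z_next]))
        return self.children   # a critic model could then score/prune these
```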
However, instead of formalizing the problem with a search-tree-based approach, they reformulate it so that the model's previous generations can guide its current direction. They note that ‘thoughts’ and ‘feedback’, i.e. the generations from the base model and from the critic model, can both be seen as intermediate ‘reasoning’ steps. So instead of the back-and-forth between generation and feedback, they want to train the model to generate new thoughts while reflecting on its earlier ones within the same generation. Thus, in their setup there is no search tree $T$, only a long-context CoT produced by the model.
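Under that reformulation, a rollout is just one long token stream in which new thoughts, reflection on earlier thoughts, and the final answer all coexist. A rough sketch, again with a hypothetical token-level `policy` interface:

```python
# Sketch of the reformulation: no explicit tree and no external critic in the
# loop. "Thoughts" and "feedback" are simply more tokens in one long CoT.

def long_cot_rollout(policy, x, max_tokens=8192):
    tokens = []
    while len(tokens) < max_tokens:
        t = policy.sample_token(x, tokens)   # hypothetical token-level sampling
        tokens.append(t)
        if t == policy.eos_token:            # generation ends after the answer
            break
    return tokens                            # contains z_1..z_m followed by y
```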
The paper optimizes the following relative entropy regularized policy optimization problem.
Let’s first break down what the policy optimization objective looks like.