Data Formats | Twinkle

Message

A message represents a single turn of a model conversation. It is defined as follows:


from typing import Any, Dict, List, Literal, Optional, TypedDict, Union

class FunctionCall(TypedDict, total=False):
    name: str
    arguments: Union[str, Dict[str, Any]]

class ToolCall(TypedDict, total=False):
    id: str
    type: Literal['function']
    function: FunctionCall

class Message(TypedDict, total=False):
    role: Literal['system', 'user', 'assistant', 'tool']
    type: str
    content: Union[str, List[Dict[str, str]]]
    tool_calls: List[ToolCall]
    tool_call_id: str
    reasoning_content: str
    images: Optional[List[Union[str, Any]]]
    videos: Optional[List[Union[str, Any]]]
    audios: Optional[List[Union[str, Any]]]

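In essence, a Message is a Dict. It contains several fields; the ones most relevant to developers are:

role: the role of the message, one of 'system', 'user', 'assistant', or 'tool'.
- system: system instruction message; appears only as the first message (index 0)
- user: user input message
- assistant: message produced by the model
- tool: result of a tool call, fed to the model in the same way as a user message
content: the message body. If it carries multimodal information, it must contain placeholders:
- <image>: image placeholder
- <video>: video placeholder
- <audio>: audio placeholder

<image>The picture shows a meadow with three rabbits on it.

tool_calls: the list of tool calls that the model emits for the caller, usually parsed from the content of the corresponding assistant message.
- ToolCall follows the OpenAI chat-completion protocol: the outer layer is {type: "function", function: {...}}; inside function, name is the tool name, and arguments should be a dict when the chat template is rendered (a JSON string is also accepted at dispatch time).
images: the original images contained in the message
videos: the original videos contained in the message
audios: the original audio clips contained in the message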
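
As an illustration of the fields above, the following sketch builds a short conversation that contains one tool call. To stay self-contained it re-declares trimmed-down copies of the TypedDicts rather than importing them from twinkle. The tool name get_weather, the id call_0, the message contents, and the helper normalize_arguments are all made up for this example.

```python
import json
from typing import Any, Dict, List, Literal, TypedDict, Union


# Trimmed-down local copies of the definitions above (only the fields used here).
class FunctionCall(TypedDict, total=False):
    name: str
    arguments: Union[str, Dict[str, Any]]


class ToolCall(TypedDict, total=False):
    id: str
    type: Literal['function']
    function: FunctionCall


class Message(TypedDict, total=False):
    role: Literal['system', 'user', 'assistant', 'tool']
    content: Union[str, List[Dict[str, str]]]
    tool_calls: List[ToolCall]
    tool_call_id: str


def normalize_arguments(call: ToolCall) -> Dict[str, Any]:
    """Hypothetical helper: return arguments as a dict, decoding a JSON string if needed."""
    args = call['function'].get('arguments', {})
    return json.loads(args) if isinstance(args, str) else args


messages: List[Message] = [
    {'role': 'system', 'content': 'You are a helpful assistant.'},  # only at index 0
    {'role': 'user', 'content': 'What is the weather in Paris?'},
    {
        'role': 'assistant',
        'content': '',
        'tool_calls': [{
            'id': 'call_0',  # hypothetical id
            'type': 'function',
            'function': {'name': 'get_weather', 'arguments': '{"city": "Paris"}'},
        }],
    },
    # The tool result refers back to the call through tool_call_id.
    {'role': 'tool', 'tool_call_id': 'call_0', 'content': '{"temp_c": 18}'},
]

call = messages[2]['tool_calls'][0]
print(call['function']['name'], normalize_arguments(call))
```

Accepting both a dict and a JSON string in one helper mirrors the note above that arguments may arrive in either form.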

Trajectory

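After dataset ETL, the raw data structure passed to a Template is the Trajectory. The name follows agentic RL terminology and stands for what the model actually did over a multi-turn conversation.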

class Trajectory(TypedDict, total=False):
    messages: List[Message]
    tools: List[Tool]
    user_data: List[Tuple[str, Any]]

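messages: a list of Message objects representing the multi-turn conversation the model actually carried out, usually alternating between user and assistant.
tools: all tools available to the model for this call
user_data: user-defined data, such as the label used in KTO training

For preference-alignment training such as DPO, the preprocessor returns data in the form {'positive': List[Trajectory], 'negative': List[Trajectory]}.

Trajectory is the standard interface in twinkle for all dataset preprocessing output and all template input. Data is converted from the raw dataset to Trajectory, and then to InputFeature.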
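
A minimal sketch of that first conversion step is shown below. It is not twinkle's preprocessor API: the functions to_trajectory and to_preference_pair and the column names query, response, chosen, and rejected are hypothetical, and Tool is stood in by a plain dict because its definition is not shown in this document.

```python
from typing import Any, Dict, List, Tuple, TypedDict

# Local stand-ins so the sketch runs without twinkle installed.
Message = Dict[str, Any]
Tool = Dict[str, Any]  # assumption: the real Tool type is not shown here


class Trajectory(TypedDict, total=False):
    messages: List[Message]
    tools: List[Tool]
    user_data: List[Tuple[str, Any]]


def to_trajectory(row: Dict[str, str]) -> Trajectory:
    """Map one raw row (hypothetical 'query'/'response' columns) to a Trajectory."""
    return Trajectory(messages=[
        {'role': 'user', 'content': row['query']},
        {'role': 'assistant', 'content': row['response']},
    ])


def to_preference_pair(row: Dict[str, str]) -> Dict[str, List[Trajectory]]:
    """Build the DPO-style dict from hypothetical 'chosen'/'rejected' columns."""
    def make(answer: str) -> Trajectory:
        return to_trajectory({'query': row['query'], 'response': answer})
    return {'positive': [make(row['chosen'])], 'negative': [make(row['rejected'])]}


# KTO-style sample: the label travels in user_data as a (key, value) tuple.
kto = to_trajectory({'query': '1 + 1 = ?', 'response': '2'})
kto['user_data'] = [('label', True)]

pair = to_preference_pair({'query': '1 + 1 = ?', 'chosen': '2', 'rejected': '3'})
print(kto)
print(pair['negative'][0]['messages'][-1]['content'])
```

Because every key is optional (total=False), a plain supervised sample only needs messages; tools and user_data are added when a task calls for them.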

Model Input

twinkle represents model input with the InputFeature class, which is compatible with model structures such as transformers and megatron.

InputType = Union[List[List[int]], List[int], np.ndarray, Any]

class InputFeature(TypedDict, total=False):
    # Text-related fields
    input_ids: InputType
    attention_mask: InputType
    position_ids: InputType
    labels: InputType

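InputFeature is essentially a Dict. Its contents come from the output of the Template component.

input_ids: the token list obtained by rendering the List[Message] through the template
attention_mask: the attention mask
position_ids: position IDs, which also serve to tell samples apart
labels: the training labels, already shifted left by one token

With packing or padding_free, fields such as input_ids are built by concatenating the lists of several samples. In multimodal scenarios, InputFeature carries additional multimodal fields.

InputFeature is the standard interface in twinkle for all template output and all model input.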
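
The left shift of labels and the effect of packing can be sketched with plain lists. This is an illustration and not twinkle's Template code: build_feature and pack are hypothetical helpers, the token ids are arbitrary, masking prompt tokens with -100 follows the common PyTorch ignore_index convention, and restarting position_ids at 0 for each packed sample is an assumption about how samples are told apart.

```python
from typing import Dict, List

IGNORE_INDEX = -100  # assumption: PyTorch's default ignore_index for cross-entropy


def build_feature(prompt_ids: List[int], response_ids: List[int]) -> Dict[str, List[int]]:
    """Build one sample in which only the response tokens are supervised."""
    input_ids = prompt_ids + response_ids
    # Unshifted targets: mask the prompt, keep the response.
    targets = [IGNORE_INDEX] * len(prompt_ids) + response_ids
    # Shift left by one so labels[i] is the token to predict after input_ids[i].
    labels = targets[1:] + [IGNORE_INDEX]
    return {
        'input_ids': input_ids,
        'attention_mask': [1] * len(input_ids),
        'position_ids': list(range(len(input_ids))),
        'labels': labels,
    }


def pack(features: List[Dict[str, List[int]]]) -> Dict[str, List[int]]:
    """Concatenate samples field by field; position_ids restart at 0 per sample."""
    packed: Dict[str, List[int]] = {key: [] for key in features[0]}
    for feature in features:
        for key, value in feature.items():
            packed[key].extend(value)
    return packed


a = build_feature([11, 12], [13, 14])
b = build_feature([21], [22, 23])
packed = pack([a, b])
print(a['labels'])
print(packed['position_ids'])
```

With pre-shifted labels the loss can compare logits and labels position by position, without another shift inside the model.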

Sampling Output

The sampling data formats describe the input parameters and the returned results of the sampling process.

SamplingParams

Sampling parameters control how the model samples.

@dataclass
class SamplingParams:
    max_tokens: Optional[int] = None
    seed: Optional[int] = None
    stop: Union[str, Sequence[str], Sequence[int], None] = None
    temperature: float = 1.0
    top_k: int = -1
    top_p: float = 1.0
    repetition_penalty: float = 1.0

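max_tokens: the maximum number of tokens to generate
seed: the random seed
stop: the stop sequences; may be a string, a sequence of strings, or a sequence of token ids
temperature: the temperature, which controls the randomness of sampling; 0 means greedy decoding
top_k: the Top-K sampling parameter; -1 disables it
top_p: the Top-P (nucleus) sampling parameter
repetition_penalty: the repetition penalty coefficient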
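
To make the effect of temperature, top_k, and top_p concrete, here is a generic pure-Python sketch of how these three knobs are commonly applied to one step's logits. It is not the code path of twinkle, vLLM, or transformers; filtered_probs is a hypothetical function, and engines differ in details such as the order of filtering and tie handling.

```python
import math
from typing import List


def filtered_probs(logits: List[float], temperature: float = 1.0,
                   top_k: int = -1, top_p: float = 1.0) -> List[float]:
    """Return the distribution a sampler would draw from after filtering."""
    if temperature == 0:
        # Greedy: all probability mass on the highest-scoring token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]

    # Temperature-scaled softmax (subtract the max for numerical stability).
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Token indices sorted from most to least likely.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]  # top_k = -1 leaves the candidate set untouched
    candidate_mass = sum(probs[i] for i in order)

    # Nucleus: keep the smallest prefix whose renormalized mass reaches top_p.
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i] / candidate_mass
        if cumulative >= top_p:
            break

    kept_mass = sum(probs[i] for i in kept)
    return [probs[i] / kept_mass if i in kept else 0.0 for i in range(len(probs))]


logits = [2.0, 1.0, 0.5, -1.0]
print(filtered_probs(logits, temperature=0))
print(filtered_probs(logits, temperature=0.7, top_k=2))
```

Lower temperatures sharpen the distribution before filtering, so the same top_p keeps fewer tokens as temperature decreases.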

Conversion methods

SamplingParams provides conversion methods that adapt it to different inference engines:

# Convert to vLLM's SamplingParams
vllm_params = params.to_vllm(num_samples=4, logprobs=True, prompt_logprobs=0)

# Convert to keyword arguments for transformers' generate
gen_kwargs = params.to_transformers(tokenizer=tokenizer)

SampleResponse

SampleResponse is the data structure in which a sampler returns its results.

@dataclass
class SampleResponse:
    trajectories: List[Trajectory]
    logprobs: Optional[List[List[float]]] = None
    prompt_logprobs: Optional[List[List[float]]] = None
    stop_reason: Optional[List[StopReason]] = None

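trajectories: the list of trajectories produced by sampling
logprobs: the log-probabilities of the generated tokens
prompt_logprobs: the log-probabilities of the prompt tokens
stop_reason: why generation stopped, either "length" (the maximum length was reached) or "stop" (a stop sequence was encountered)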
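
As a rough sketch of how these fields might be consumed, the example below sums token log-probabilities per sample and flags truncated generations. It re-declares SampleResponse locally so it runs without twinkle. It assumes that entry i of logprobs and stop_reason belongs to trajectories[i], and that StopReason is a literal of the two values listed above; neither is confirmed by this document. The summarize helper and all numbers are made up.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Literal, Optional

Trajectory = Dict[str, Any]  # stand-in for the TypedDict defined earlier
StopReason = Literal['length', 'stop']  # assumption based on the two values above


@dataclass
class SampleResponse:
    trajectories: List[Trajectory]
    logprobs: Optional[List[List[float]]] = None
    prompt_logprobs: Optional[List[List[float]]] = None
    stop_reason: Optional[List[StopReason]] = None


def summarize(response: SampleResponse) -> List[Dict[str, Any]]:
    """Per-sample summary: total log-probability and whether output was cut off."""
    n = len(response.trajectories)
    # Optional fields may be None, so fall back to empty placeholders.
    logprobs = response.logprobs or [[] for _ in range(n)]
    reasons = response.stop_reason or [None] * n
    return [
        {
            'sequence_logprob': sum(token_logprobs),
            'truncated': reason == 'length',
        }
        for token_logprobs, reason in zip(logprobs, reasons)
    ]


# Hand-written values standing in for real sampler output.
response = SampleResponse(
    trajectories=[{'messages': []}, {'messages': []}],
    logprobs=[[-0.5, -0.25], [-1.0, -2.0, -0.5]],
    stop_reason=['stop', 'length'],
)
summary = summarize(response)
print(summary)
```

A truncated sample (stop_reason "length") often deserves separate handling, for example when scoring rollouts, since the answer may be incomplete.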

Usage example:

from twinkle.data_format import SamplingParams, SampleResponse
from twinkle.sampler import vLLMSampler

sampler = vLLMSampler(model_id='ms://Qwen/Qwen3.5-4B')
params = SamplingParams(max_tokens=512, temperature=0.7, top_p=0.9)
# `trajectories` is a List[Trajectory] prepared beforehand
response: SampleResponse = sampler.sample(trajectories, sampling_params=params, num_samples=4)

# Access the generated trajectories (Trajectory is a TypedDict, so index by key)
for traj in response.trajectories:
    print(traj['messages'])

Model Output

Detailed type definitions for model output.

OutputType

OutputType defines the data types supported for model output:

OutputType = Union[np.ndarray, 'torch.Tensor', List[Any]]

It accepts NumPy arrays, PyTorch tensors, or lists of any type.

ModelOutput

ModelOutput is the standard class Twinkle uses to represent model output. It is compatible with model structures such as transformers and megatron.

class ModelOutput(TypedDict, total=False):
    logits: OutputType
    loss: OutputType

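ModelOutput is essentially a Dict. Its fields come from the model's output and from the loss computation.

logits: usually of shape [BatchSize, SequenceLength, VocabSize]; used together with labels to compute the loss
loss: the computed loss value

ModelOutput is the standard interface in Twinkle for all model output.

Usage example:

from twinkle.data_format import ModelOutput

# Inside the model's forward method
def forward(self, inputs):
    ...
    return ModelOutput(
        logits=logits,
        loss=loss
    )

Note: ModelOutput is defined with TypedDict, so it is a plain dict at runtime while still providing type hints to static type checkers.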
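
The runtime behavior described in the note can be checked with a small standalone sketch. ModelOutput is re-declared locally here, with OutputType simplified to a list so that neither NumPy nor PyTorch is required; the values are arbitrary.

```python
from typing import Any, List, TypedDict

OutputType = List[Any]  # simplified; the real alias also allows np.ndarray and torch.Tensor


class ModelOutput(TypedDict, total=False):
    logits: OutputType
    loss: OutputType


out = ModelOutput(logits=[[0.1, 0.9]])

# At runtime the result is an ordinary dict.
print(type(out) is dict)

# total=False makes every key optional, so probe with `in` or .get().
print('loss' in out, out.get('loss'))

# Nothing validates keys or value types at runtime; a static type checker
# would flag the next line, but Python itself accepts it.
out['extra'] = 'not declared'  # type: ignore
print(sorted(out))
```

In practice this means a forward pass that computes no loss can simply omit the loss key, and callers should check for its presence before reading it.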