🚀 Overview
Exploration in complex environments is a key challenge for autonomous agents. Traditional reinforcement learning approaches rely heavily on trial-and-error, often generating redundant trajectories and converging slowly. In contrast, humans efficiently leverage past experiences to guide future actions, learning faster and more systematically.
The Self-Navigating framework adopts this principle by enabling agents to internalize and reuse prior experiences. It shifts exploration from unguided trial-and-error to structured, knowledge-driven self-improvement, improving both learning efficiency and policy quality.
At the core of the framework, the Experience Manager oversees all aspects of experience handling, including:
- Experience Pool Management – Constructing and updating the experience pool with new trajectories and summaries.
- Experience Mode Control – Determining whether to add experiences during rollouts and whether to remove experience information during training.
- Rollout & Training Context Management – Providing relevant historical context during rollouts and maintaining experience-stripped training messages.
- Training Loss Management – Aggregating and processing losses with respect to experience-based adjustments for stable learning.
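The skeleton below shows one way these four responsibilities could be grouped on the manager. It is purely illustrative: the first three method names come from the snippets later in this section, while process_loss is a hypothetical stand-in (the actual loss path is the module-level het_compute_token_on_off_policy_loss shown in Feature 5).

```python
# Illustrative skeleton only; not the real implementation.
class ExperienceManager:
    def submit_summary_task(self, trajectories):
        """Experience Pool Management: summarize trajectories into the pool."""

    def allocate_train_mode(self, tasks):
        """Experience Mode Control: per-task keep/discard decisions."""

    def allocate_add_exp(self, rollout_n, exp_mode):
        """Rollout Context Management: decide which rollouts get experience."""

    def process_loss(self, losses, exp_mask):
        """Training Loss Management: combine on/off-policy contributions."""
```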
🧩 Core Features
The Self-Navigating framework enhances reinforcement learning by transforming how agents create, reuse, and refine experiences.
It introduces structured mechanisms that make exploration more efficient, context-aware, and self-evolving.
At its core are two classes:
- ExperienceManager – handles experience scheduling and allocation.
- ExperienceWorker – manages context injection during rollout and cleanup during training.
1. Dynamic Experience Allocation
Purpose: Dynamically decide how much and when to use experience during both training and rollout stages.
How it works: This module performs two levels of adaptive allocation:
- Task-Level Allocation
- Determines whether each training task should keep or discard experience.
- Controlled by train_sample_mode:
  - "allkeep" → all tasks retain experience
  - "alldiscard" → all tasks discard experience
  - "hybrid" → keep ratio controlled by train_sample_keepratio
- Key code:

```python
# Class: ExperienceManager
# Function: allocate_train_mode()
mode_to_ratio = {
    "allkeep": 1.0,
    "alldiscard": 0.0,
    "hybrid": self.train_sample_keepratio
}
keep_ratio = mode_to_ratio.get(
    self.train_sample_mode, self.train_sample_keepratio
)
keep_count = int(len(tasks) * keep_ratio)
exp_modes = ['keep'] * keep_count + ['discard'] * (len(tasks) - keep_count)
```
- Rollout-Level Allocation
- Determines the proportion of rollouts within one task that will include experience.
- Controlled by val_rollout_mode and train_rollout_mode:
  - "woexp" → no rollout uses experience (pure exploration)
  - "all" → all rollouts use experience (fully guided)
  - "mixed" → partially guided; the share of experience-guided rollouts is determined by rollout_ratio
- The parameter rollout_ratio only takes effect when val_rollout_mode / train_rollout_mode = "mixed". For example, rollout_ratio=0.3 means 30% of rollouts will include experience, while the remaining 70% proceed without it.
```python
# Class: ExperienceManager
# Function: allocate_add_exp()
add_exp_choices = {
    "woexp": [False] * rollout_n,
    "mixed": sorted(
        [i < round(rollout_n * self.rollout_expratio) for i in range(rollout_n)],
        key=lambda _: random.random()  # randomly shuffle the True/False flags
    ),
    "all": [True] * rollout_n
}[exp_mode]
```
✅ Effect:
- train_sample_mode controls how experience is used in training samples.
- val_rollout_mode / train_rollout_mode define the exploration regime (woexp / mixed / all).
- rollout_ratio refines mixed mode by determining how many rollouts reuse experience.

Together, they enable dynamic balancing between exploration and exploitation; a runnable sketch of both allocation levels follows.
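To make the two allocation levels concrete, here is a small, self-contained sketch that reimplements both policies as free functions. It simplifies the class methods above (free functions instead of methods, and random.shuffle in place of the sorted-by-random-key idiom):

```python
import random

def allocate_train_mode(num_tasks, train_sample_mode, keep_ratio=1.0):
    # Task-level allocation: decide per task whether to keep or discard experience.
    ratio = {"allkeep": 1.0, "alldiscard": 0.0, "hybrid": keep_ratio}[train_sample_mode]
    keep_count = int(num_tasks * ratio)
    return ["keep"] * keep_count + ["discard"] * (num_tasks - keep_count)

def allocate_add_exp(rollout_n, exp_mode, rollout_expratio=0.0):
    # Rollout-level allocation: decide which rollouts within one task get experience.
    flags = {
        "woexp": [False] * rollout_n,
        "all": [True] * rollout_n,
        "mixed": [i < round(rollout_n * rollout_expratio) for i in range(rollout_n)],
    }[exp_mode]
    if exp_mode == "mixed":
        random.shuffle(flags)  # randomize which rollouts are experience-guided
    return flags

print(allocate_train_mode(10, "hybrid", keep_ratio=0.4))
# ['keep', 'keep', 'keep', 'keep', 'discard', 'discard', ..., 'discard']
print(allocate_add_exp(8, "mixed", rollout_expratio=0.5))
# e.g. [True, False, True, True, False, False, True, False]  (order is random)
```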
2. Asynchronous Experience Summarization
Purpose: Convert raw trajectories into summarized experiences asynchronously, ensuring continuous learning without blocking.
How it works:
- Periodically triggered by training steps (updated_freq).
- Executes summarization jobs via a background thread pool (ThreadPoolExecutor).
- Stores summarized results in the shared experience pool.
```python
# Class: ExperienceManager
# Function: submit_summary_task()
summary_task = self.thread_pool.submit(
    self.em_client.call_summarizer,
    trajectories=trajectories,
    workspace_id=reme_config.workspace_id
)
```
✅ Effect: The experience pool grows in parallel with training, keeping the agent’s knowledge base continuously updated.
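The surrounding scheduling logic is not shown in the snippet above; the sketch below illustrates one plausible shape for it. SummaryScheduler, on_train_step, and drain are hypothetical names used for illustration; only call_summarizer and updated_freq come from this document.

```python
from concurrent.futures import ThreadPoolExecutor

class SummaryScheduler:
    """Hypothetical wrapper showing one way the periodic trigger could work."""

    def __init__(self, em_client, workspace_id, updated_freq, max_workers=4):
        self.em_client = em_client        # client exposing call_summarizer
        self.workspace_id = workspace_id
        self.updated_freq = updated_freq  # 0 disables periodic updates
        self.pool = ThreadPoolExecutor(max_workers=max_workers)
        self.pending = []

    def on_train_step(self, step, trajectories):
        # Fire a background summarization job every `updated_freq` steps;
        # training continues without waiting for the result.
        if self.updated_freq > 0 and step % self.updated_freq == 0:
            future = self.pool.submit(
                self.em_client.call_summarizer,
                trajectories=trajectories,
                workspace_id=self.workspace_id,
            )
            self.pending.append(future)

    def drain(self):
        # Optionally block on outstanding jobs, e.g. at the end of training.
        for future in self.pending:
            future.result()
        self.pending.clear()
```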
3. Context-Aware Rollout Management
Purpose: Make rollouts context-aware by injecting relevant past experiences into prompts.
How it works:
- Retrieves the top-K related experiences via EMClient.
- Formats and prepends them to the rollout message.
- Enhances the rollout context without modifying the underlying task.
```python
# Class: ExperienceWorker
# Function: manage_rollout_context()
history_experience = self.em_client.call_context_generator(
    trajectory=trajectory,
    retrieve_top_k=reme_config.retrieve_top_k,
    workspace_id=reme_config.workspace_id
)
formatted_experience = self.experience_template.format(history_experience)
new_content = formatted_experience + trajectory.steps[-1]["content"]
trajectory.steps[-1]["content"] = new_content
```
✅ Effect: Each rollout benefits from relevant prior knowledge, reducing redundant exploration.
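As a quick illustration of what the injection produces, the standalone example below applies the default experience_template to a made-up retrieved experience and task prompt:

```python
experience_template = (
    "\n\nSome Related Experience to help you to complete the task:"
    "<EXP>{}</EXP>\n\n"
)

# Made-up retrieved experience and task prompt, for illustration only.
history_experience = "When searching a sorted grid, skip rows whose first cell exceeds the target."
task_prompt = "Find the target value in the grid and report its coordinates."

# Mirror manage_rollout_context: prepend the formatted experience to the prompt.
new_content = experience_template.format(history_experience) + task_prompt
print(new_content)
```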
4. Training Context Management
Purpose: Ensure training messages remain clean by removing injected experience when not needed.
How it works:
- Detects experience templates in training messages using regex.
- Removes them when the task's allocated mode is "discard" (see allocate_train_mode above), while retaining the extracted text for analysis.
```python
# Class: ExperienceWorker
# Function: manage_training_context()
pattern = re.escape(self.experience_template).replace(r'\{\}', '(.*?)')
cleaned_message = re.sub(pattern, '', message, flags=re.DOTALL)
```
✅ Effect: Guarantees that training data stay consistent with the current experience policy.
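Because deriving a regex from a template containing a {} placeholder is easy to get wrong, the self-contained example below walks through the escape-then-reopen trick and shows both the cleaned message and the recovered experience text (the message contents are invented for illustration):

```python
import re

experience_template = (
    "\n\nSome Related Experience to help you to complete the task:"
    "<EXP>{}</EXP>\n\n"
)

# A training message that still contains injected experience.
message = experience_template.format("Prefer BFS over DFS on wide, shallow maps.") \
          + "Plan a route from A to B."

# Escape the whole template, then turn the escaped "{}" back into a capture group.
pattern = re.escape(experience_template).replace(r'\{\}', '(.*?)')

extracted = re.findall(pattern, message, flags=re.DOTALL)
cleaned_message = re.sub(pattern, '', message, flags=re.DOTALL)

print(extracted)         # ['Prefer BFS over DFS on wide, shallow maps.']
print(cleaned_message)   # 'Plan a route from A to B.'
```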
5. Training Loss Processing
Purpose: Ensure stable policy updates when mixing on-policy rollouts and off-policy experience replays, allowing the agent to leverage past trajectories without destabilizing learning.
How it works:
- This module computes a heterogeneous PPO loss that combines:
  - On-policy loss: derived from fresh rollouts.
  - Off-policy loss: derived from experience-augmented samples.
- An experience mask (exp_mask) distinguishes the two. Each loss is clipped and optionally adjusted for negative advantages. Finally, the losses are combined and aggregated according to loss_agg_mode.
```python
# Function: het_compute_token_on_off_policy_loss()

# 1️⃣ Compute policy ratio and approximate KL divergence
negative_approx_kl = log_prob - old_log_prob  # difference between new and old log-probabilities
ratio = torch.exp(negative_approx_kl)  # policy ratio r_t = exp(log_pi_new - log_pi_old)
ppo_kl = verl_F.masked_mean(-negative_approx_kl, response_mask)  # approximate KL divergence

# 2️⃣ Compute on-policy losses (exp_mask = 0)
on_pg_losses, _, _ = compute_pg_losses(cliprange_low, cliprange_high)
# Mask out experience tokens so that only fresh rollouts contribute
on_pg_loss = verl_F.masked_mean(on_pg_losses, (1.0 - exp_mask) * response_mask)

# 3️⃣ Compute off-policy losses (exp_mask = 1)
off_pg_losses, _, _ = compute_pg_losses(off_cliprange_low, off_cliprange_high)
# Mask to include only experience tokens
off_pg_loss = verl_F.masked_mean(off_pg_losses, exp_mask * response_mask)
# Ensure numerical stability when a batch contains no experience tokens
off_pg_loss = torch.tensor(0.0) if off_pg_loss.isnan().item() else off_pg_loss

# 4️⃣ Combine both losses using the experience mask
pg_losses = off_pg_losses * exp_mask + on_pg_losses * (1.0 - exp_mask)
# Aggregate token-level losses according to the selected mode (e.g., "token-mean")
pg_loss = agg_loss(loss_mat=pg_losses, loss_mask=response_mask, loss_agg_mode=loss_agg_mode)
```
✅ Effect:
- exp_mask cleanly separates on-policy and off-policy contributions.
- off_cliprange_low / off_cliprange_high enforce a separate trust region for off-policy samples, keeping updates stable (see the standalone sketch below).
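The sketch below reimplements just the core masked combination in plain PyTorch as a runnable illustration. It is a simplification, not the framework's het_compute_token_on_off_policy_loss: the helper names are invented, the clip values are arbitrary, and the negative-advantage adjustment and NaN guard are omitted.

```python
import torch

def masked_mean(values, mask, eps=1e-8):
    # Mean over positions where mask == 1.
    return (values * mask).sum() / (mask.sum() + eps)

def mixed_ppo_loss(log_prob, old_log_prob, advantages,
                   response_mask, exp_mask,
                   cliprange=0.2, off_cliprange=0.4):
    ratio = torch.exp(log_prob - old_log_prob)  # r_t = exp(log_pi_new - log_pi_old)

    def clipped_losses(eps):
        # Standard PPO clipped surrogate, computed token-wise.
        return torch.maximum(
            -advantages * ratio,
            -advantages * torch.clamp(ratio, 1.0 - eps, 1.0 + eps),
        )

    on_losses = clipped_losses(cliprange)       # fresh rollouts
    off_losses = clipped_losses(off_cliprange)  # experience-augmented samples

    # exp_mask selects off-policy tokens; its complement selects on-policy ones.
    losses = off_losses * exp_mask + on_losses * (1.0 - exp_mask)
    return masked_mean(losses, response_mask)   # "token-mean"-style aggregation

# Toy example: batch of 2 sequences, 4 response tokens each; sequence 0 is
# off-policy (experience-augmented), sequence 1 is on-policy.
B, T = 2, 4
loss = mixed_ppo_loss(
    torch.randn(B, T), torch.randn(B, T), torch.randn(B, T),
    response_mask=torch.ones(B, T),
    exp_mask=torch.tensor([[1.0] * T, [0.0] * T]),
)
print(loss)
```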
⚙️ Key Parameters & Configuration
Rollout Modes
train_rollout_mode (str)
: Controls how experiences are used during training rollouts.
Options:
- "woexp" → rollouts without any experience guidance (pure exploration)
- "mixed" → partially inject experiences based on rollout_ratio
- "all" → all rollouts include retrieved experiences
Default: "woexp".
val_rollout_mode (str)
: Controls how experiences are used during validation/test rollouts.
Same options as train_rollout_mode. Typically set to "woexp" for unbiased evaluation.
Default: "woexp".
rollout_ratio (float, range: [0.0, 1.0])
: When rollout mode is "mixed", this ratio determines the proportion of rollouts that include experience.
Example: 0.3 means 30% experience-guided, 70% exploratory.
Default: 0.0.
Training Sample Processing
train_sample_mode (str)
: Defines whether to keep or discard experience context in training samples after rollout.
Options:
- "allkeep" → all samples retain experience information
- "alldiscard" → strip experience from all samples (model learns from raw reasoning)
- "hybrid" → selective retention based on train_sample_keepratio
Default: "alldiscard".
train_sample_keepratio (float, range: [0.0, 1.0])
: When train_sample_mode="hybrid", controls the task-level proportion of samples that retain experience.
Default: 1.0.
Experience Retrieval & Injection
experience_template (str)
: Template string for wrapping retrieved experiences before injecting into prompts.
The {} placeholder is replaced with formatted experience content.
Example: "\n\nSome Related Experience to help you to complete the task:<EXP>{}</EXP>\n\n"
retrieve_top_k (int)
: Number of most relevant experiences to retrieve per query when enable_context_generator=True.
Default: 3.
ReMe Service Configuration
base_url (str)
: HTTP endpoint of the ReMe service (Reflective Memory Engine).
Handles experience summarization, storage, and retrieval.
Example: "http://127.0.0.1:8001".
workspace_id (str)
: Unique identifier for the experience workspace. Different workspaces maintain isolated experience pools.
Default: "default".
enable_summarizer (bool)
: Whether to activate automatic experience summarization from rollout trajectories.
When True, raw reasoning traces are distilled into reusable experience snippets.
Default: False.
enable_context_generator (bool)
: Whether to enable experience retrieval and context injection during rollouts.
Must be True for experience-guided rollouts to function.
Default: False.
updated_freq (int)
: Frequency (in training steps) to refresh the experience pool.
Set to 0 to disable periodic updates.
Default: 0.
Initialization & Utilities
init_exp_before_training (bool)
: Whether to pre-populate the experience pool before training begins.
Useful for warm-start scenarios with prior knowledge.
Default: False.
init_exp_only (bool)
: If True, only initializes the experience pool without starting training.
Useful for pre-computing embeddings or validating summarization quality.
Default: False.
summary_batch_size (int)
: Batch size for processing experience summarization requests.
Larger values improve throughput but require more memory.
Default: 8.
🧭 Quick Start & Recommended Configuration
Step 1: Set Up ReMe Service
- Ensure you have completed the ReMe installation;
- Ensure you have modified the environment configuration;
- Start the ReMe service with the following command:
```bash
cd path_to_reme/ReMe
reme \
  config=default \
  backend=http \
  thread_pool_max_workers=256 \
  http.host="127.0.0.1" \
  http.port=8001 \
  http.limit_concurrency=256 \
  llm.default.model_name=qwen-max-2025-01-25 \
  embedding_model.default.model_name=text-embedding-v4 \
  vector_store.default.backend=local \
  op.rerank_memory_op.params.enable_llm_rerank=false \
  flow.summary_task_memory.flow_content="trajectory_preprocess_op >> (success_extraction_op|failure_extraction_op|comparative_extraction_op) >> memory_validation_op >> memory_deduplication_op >> update_vector_store_op"
```
Step 2: Recommended Configuration
Add the following configuration to your experiment settings:
```yaml
exp_manager:
  val_rollout_mode: "woexp"       # rollout mode on dev/test set: ["mixed", "all", "woexp"]
  train_rollout_mode: "mixed"     # rollout mode on train set: ["mixed", "all", "woexp"]
  rollout_ratio: 0.5              # ratio of experience-guided rollouts within a group
  train_sample_mode: "alldiscard" # sample mode after train rollout: ["allkeep", "alldiscard", "hybrid"]
  train_sample_keepratio: 0.0     # task-level ratio for keeping experience
  experience_template: "\n\nSome Related Experience to help you to complete the task:<EXP>{}</EXP>\n\n" # template for inserting experience
  init_exp_before_training: True  # whether to init experience pool before training
  init_exp_only: False            # whether to only init experience pool
  summary_batch_size: 8           # batch size for experience summarization
  reme:
    base_url: "http://127.0.0.1:8001" # base URL for ReMe service
    workspace_id: "default"           # workspace ID for ReMe
    enable_summarizer: True           # whether to enable experience summarizer
    enable_context_generator: True    # whether to enable context generator
    retrieve_top_k: 3                 # number of top experiences to retrieve
    updated_freq: 0                   # experience pool update frequency (in steps; 0 = disabled)
```
Step 3: Understanding Different Usage Modes
The experience pool system supports multiple usage modes to fit different experimental needs. Configure the parameters according to the table below:
Mode Descriptions
| Usage Mode | Description |
|---|---|
| No Pool | Training without any experience pool (baseline mode) |
| Init Only | Initialize experience pool from trajectories only, without using it during training |
| Init + Train | Initialize experience pool before training and use it throughout the training process |
| Init + Train + Update | Initialize experience pool, use during training, and continuously update it with new experiences |
| Exist + Train | Use an existing experience pool during training without initialization or updates |
| Exist + Train + Update | Use an existing experience pool during training and continuously update it with new experiences |
Configuration Matrix
| Parameter | No Pool | Init Only | Init + Train | Init + Train + Update | Exist + Train | Exist + Train + Update |
|---|---|---|---|---|---|---|
| updated_freq | 0 | 0 | 0 | k != 0 | 0 | k != 0 |
| init_exp_only | False | True | False | False | False | False |
| init_exp_before_training | False | True | True | True | False | False |
| enable_summarizer | False | True | False | True | False | True |
| enable_context_generator | False | False | True | True | True | True |
| workspace_id | - | New ID | New ID | New ID | Exist ID | Exist ID |
Configuration Notes
- updated_freq: Set to a non-zero value (e.g., 100) to enable periodic experience pool updates during training; 0 disables updates.
- workspace_id:
  - Use a new workspace ID to create a fresh experience pool.
  - Specify an existing workspace ID to reuse a previously created pool.
- When using vector_store.default.backend=local, the experience pool is saved at: ReMe/local_vector_store/{workspace_id}.jsonl
- Recommended starting point: For most use cases, we recommend starting with Init + Train mode; an example configuration is shown below.
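Reading off the Init + Train column of the matrix above, a starting configuration might look like the following (only mode-relevant fields are shown; the workspace ID is a placeholder, and the nesting mirrors the recommended configuration in Step 2):

```yaml
exp_manager:
  init_exp_before_training: True  # build the pool from trajectories before training
  init_exp_only: False            # then proceed to training
  reme:
    workspace_id: "my_new_workspace"  # placeholder: use a fresh ID for a new pool
    enable_summarizer: False          # pool is not updated during training
    enable_context_generator: True    # experiences are injected into rollouts
    updated_freq: 0                   # no periodic pool updates
```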