Overview
This benchmark evaluates Tool Memory effectiveness by comparing agent performance with and without tool memory across multiple epochs. The experiment uses mock search tools whose performance characteristics vary with query complexity.
Experimental Setup
Mock Search Tools
Three LLM-based mock search tools with different performance profiles:
| Tool | Simple Queries | Medium Queries | Complex Queries |
|---|---|---|---|
| SearchToolA | ⭐⭐⭐ Fast, high success (90%) | ❌ Poor (20% success) | ⚠️ Weak (50% success) |
| SearchToolB | ⚠️ Over-engineered (30%) | ⭐⭐⭐ Optimal (90% success) | ⚠️ Limited (50% success) |
| SearchToolC | ⚠️ Overkill (30%) | ⚠️ Excessive (40%) | ⭐⭐⭐ Best (90% success) |
Performance Characteristics:
- success_rate: Probability of successful execution (vs "Service busy" error)
- relevance_ratio: Probability of returning relevant results (vs random content)
- extra_time: Simulated latency (currently 0 in implementation)
Each tool uses LLM to classify query complexity and generate appropriate responses.
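For intuition, a minimal sketch of how such a mock tool could behave is shown below. The class name, profile numbers, and method signature are illustrative assumptions; the real tools additionally use an LLM to classify complexity and generate the response text.

```python
import random

class MockSearchTool:
    """Illustrative mock tool: probabilistic success and relevance per query complexity."""

    # Hypothetical per-complexity profile; the real tools encode tables like the one above.
    PROFILES = {
        "simple":   {"success_rate": 0.9, "relevance_ratio": 0.9},
        "moderate": {"success_rate": 0.2, "relevance_ratio": 0.5},
        "complex":  {"success_rate": 0.5, "relevance_ratio": 0.5},
    }

    def search(self, query: str, complexity: str) -> str:
        profile = self.PROFILES[complexity]
        if random.random() > profile["success_rate"]:
            raise RuntimeError("Service busy")        # simulated execution failure
        if random.random() > profile["relevance_ratio"]:
            return "unrelated filler content"         # simulated irrelevant result
        return f"relevant answer for: {query}"        # simulated relevant result
```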
Query Dataset
Source: cookbook/tool_memory/query.json
- Train Set: 20 queries per complexity × 3 levels = 60 queries
- Test Set: 20 queries per complexity × 3 levels = 60 queries
- Complexity Levels: simple, moderate, complex
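The exact JSON layout is defined by cookbook/tool_memory/query.json; a plausible loading sketch is shown below, with the top-level keys (train/test splits keyed by complexity level) assumed rather than confirmed.

```python
import json

def load_queries(path="cookbook/tool_memory/query.json", test_mode=False):
    """Load train/test queries grouped by complexity (field names are assumptions)."""
    with open(path) as f:
        data = json.load(f)
    limit = 5 if test_mode else 20   # test mode uses 5 queries per complexity level
    train, test = [], []
    for level in ("simple", "moderate", "complex"):
        train.extend(data["train"][level][:limit])
        test.extend(data["test"][level][:limit])
    return train, test
```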
Benchmark Workflow
Single Epoch Process
Each epoch consists of 5 steps:
Step 1: Train without Memory
# Execute all train queries on TRAIN_WORKSPACE
# Agent selects tools without historical guidance
run_use_mock_search(TRAIN_WORKSPACE, train_queries, prompt_template)
# Add results to memory and get scored results
train_scored_results = add_tool_call_results(TRAIN_WORKSPACE, train_results)
Step 2: Test without Memory
# Execute all test queries on TEST_WORKSPACE (fresh workspace)
# Baseline performance without tool memory
run_use_mock_search(TEST_WORKSPACE, test_queries, prompt_template)
# Add results to memory (will be cleared in Step 4)
test_scored_results = add_tool_call_results(TEST_WORKSPACE, test_results)
Step 3: Summarize Tool Memory
# Summarize tool performance from TRAIN_WORKSPACE
summarize_tool_memory(TRAIN_WORKSPACE, "SearchToolA,SearchToolB,SearchToolC")
# Retrieve formatted tool memory content
memories = retrieve_tool_memory(TRAIN_WORKSPACE, tool_names)
The summarization produces memory content including:
- Best/worst use cases per tool
- Statistical metrics (avg score, success rate, token cost, time cost)
- Usage recommendations
Step 4: Test with Memory
# Clear TEST_WORKSPACE to start fresh
delete_workspace(TEST_WORKSPACE)
# Inject tool memory into prompt
prompt_with_memory = f"Tool Information\n{memories}\nMust select one tool to answer\nQuery\n{query}"
# Execute test queries with memory guidance
run_use_mock_search(TEST_WORKSPACE, test_queries, prompt_with_memory)
# Add results and get scored results
test_scored_results_with_memory = add_tool_call_results(TEST_WORKSPACE, test_results)
Step 5: Compare Results
# Generate comparison table
print_comparison_table([train_no_memory_stats, test_no_memory_stats, test_with_memory_stats])
# Calculate improvements (baseline: test without memory)
improvements = calculate_improvements(test_no_memory_stats, test_with_memory_stats)
print_improvements(improvements)
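print_comparison_table is a helper in the benchmark script; one way it might be implemented with tabulate is sketched below (the stats field names are assumptions).

```python
from tabulate import tabulate

def print_comparison_table(stats_list):
    """Render per-scenario stats in the grid layout shown in the sample output (sketch)."""
    rows = [[s["scenario"], s["total_calls"], f"{s['avg_score']:.3f}"] for s in stats_list]
    print(tabulate(rows, headers=["Scenario", "Total Calls", "Avg Score"], tablefmt="grid"))
```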
Multi-Epoch Execution
# Run benchmark with 3 epochs
python cookbook/tool_memory/run_reme_tool_bench.py
# Test mode (5 queries per complexity level)
main(test_mode=True, run_epoch=3)
# Full mode (20 queries per complexity level)
main(test_mode=False, run_epoch=3)
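Conceptually, main() drives the five-step flow once per epoch. A sketch under the assumption that each step is wrapped in a helper function (the helper names below are hypothetical):

```python
import json

def main(test_mode=False, run_epoch=3):
    """Run the five-step epoch flow repeatedly and persist per-epoch results (sketch)."""
    train_queries, test_queries = load_queries(test_mode=test_mode)
    all_results = []
    for epoch in range(1, run_epoch + 1):
        train_stats = run_train_without_memory(train_queries)            # Step 1
        test_stats = run_test_without_memory(test_queries)               # Step 2
        memories = summarize_and_retrieve_tool_memory()                  # Step 3
        memory_stats = run_test_with_memory(test_queries, memories)      # Step 4
        print_comparison_table([train_stats, test_stats, memory_stats])  # Step 5
        all_results.append({"epoch": epoch,
                            "improvements": calculate_improvements(test_stats, memory_stats)})
    with open("tool_memory_benchmark_results.json", "w") as f:
        json.dump(all_results, f, indent=2)
    return all_results
```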
Key Components
1. Tool Selection: UseMockSearchOp
# Agent uses LLM to select appropriate tool
tool_call = await self.select_tool(query, [SearchToolA(), SearchToolB(), SearchToolC()])
# Execute selected tool and record results
result = ToolCallResult(
    create_time=timestamp,
    tool_name=tool_call.name,
    input={"query": query},
    output=content,
    token_cost=token_cost,
    success=success,
    time_cost=time_cost
)
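The select_tool call above is where the LLM choice happens. A plausible sketch follows; the prompt wording, llm client interface, and tool name/description attributes are assumptions, not the actual reme_ai code.

```python
async def select_tool(self, query, tools):
    """Ask the LLM to pick one tool by name; fall back to the first tool (sketch)."""
    descriptions = "\n".join(f"- {t.name}: {t.description}" for t in tools)
    prompt = (f"Available tools:\n{descriptions}\n\n"
              f"Query: {query}\n"
              f"Reply with the single best tool name.")
    answer = await self.llm.chat(prompt)   # assumed async LLM client interface
    return next((t for t in tools if t.name in answer), tools[0])
```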
2. Tool Call Result Evaluation
Results are automatically evaluated and scored:
- score: 0.0 (failure/irrelevant) or 1.0 (complete success)
- success: Tool execution status
- summary: Brief description
- evaluation: Detailed assessment
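A deliberately simplified stand-in showing how those four fields relate is given below; the real evaluation is performed by the service when results are added, so this is illustrative only.

```python
def score_tool_call(result, is_relevant):
    """Illustrative stand-in for the automatic judge: maps to the fields listed above."""
    if not result.success:
        return {"score": 0.0, "success": False,
                "summary": "tool call failed",
                "evaluation": "execution error (e.g. 'Service busy')"}
    if not is_relevant:
        return {"score": 0.0, "success": True,
                "summary": "irrelevant result",
                "evaluation": "output did not address the query"}
    return {"score": 1.0, "success": True,
            "summary": "relevant result",
            "evaluation": "output answered the query"}
```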
3. Tool Memory Schema
ToolMemory(
    workspace_id="workspace_id",
    memory_type="tool",
    when_to_use="Brief usage scenario description",
    content="Detailed performance analysis and recommendations",
    score=0.85,
    tool_call_results=[...],  # list of ToolCallResult
    metadata={"tool_name": "SearchToolA"}
)
Evaluation Metrics
Per-Scenario Metrics
- Avg Score: Average quality score (0.0-1.0)
- Total Calls: Number of tool invocations
- Success Rate: Percentage of successful executions
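Computed from a list of scored results, these metrics reduce to the following (a sketch; result field names are assumptions):

```python
def compute_scenario_stats(scenario, scored_results):
    """Per-scenario metrics used in the comparison table (sketch over scored results)."""
    scores = [r["score"] for r in scored_results]
    return {
        "scenario": scenario,
        "total_calls": len(scored_results),
        "avg_score": sum(scores) / len(scores),
        "success_rate": 100.0 * sum(1 for r in scored_results if r["success"]) / len(scored_results),
    }
```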
Improvement Calculation
improvement_percentage = ((with_memory_score - without_memory_score) / without_memory_score) * 100
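Applied to the sample epoch below, this gives (0.817 - 0.633) / 0.633 × 100 ≈ +29.07%.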
Expected Results
Hypothesis
Tool Memory should enable the agent to:
1. Select optimal tools based on query complexity
2. Improve average score by 10-30% on the test set
3. Increase consistency across multiple epochs
Sample Output
==================================================================================================
BENCHMARK RESULTS COMPARISON
==================================================================================================
Note: Avg Score = average quality score
+------------------------------+--------------+-----------+
| Scenario                     | Total Calls  | Avg Score |
+==============================+==============+===========+
| Epoch1 - Train (No Memory)   | 60           | 0.650     |
+------------------------------+--------------+-----------+
| Epoch1 - Test (No Memory)    | 60           | 0.633     |
+------------------------------+--------------+-----------+
| Epoch1 - Test (With Memory)  | 60           | 0.817     |
+------------------------------+--------------+-----------+
==================================================================================================
IMPROVEMENTS WITH TOOL MEMORY (Baseline: Test without memory)
==================================================================================================
Average Score : +29.07% ↑
==================================================================================================
Running the Benchmark
Prerequisites
pip install requests python-dotenv loguru tabulate
Start API Server
# Start ReMe API server
python reme_ai/app.py --port 8002
Execute Benchmark
# Full benchmark (3 epochs, 60+60 queries per epoch)
python cookbook/tool_memory/run_reme_tool_bench.py
# Quick test (3 epochs, 15+15 queries per epoch)
# Modify main() call: main(test_mode=True, run_epoch=3)
Output Files
- tool_memory_benchmark_results.json: Complete benchmark results
- Console output: Real-time progress and comparison tables
API Endpoints Used
- /use_mock_search: Execute tool selection and search
  - Input: workspace_id, query
  - Output: ToolCallResult JSON
- /add_tool_call_result: Add results to memory and get evaluation scores
  - Input: workspace_id, tool_call_results (list)
  - Output: memory_list with scored results
- /summary_tool_memory: Summarize tool performance
  - Input: workspace_id, tool_names (comma-separated)
  - Output: Updated ToolMemory with content
- /retrieve_tool_memory: Retrieve formatted tool memory
  - Input: workspace_id, tool_names
  - Output: Markdown-formatted memory content
- /vector_store: Delete workspace
  - Input: workspace_id, action: "delete"
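For illustration, calling one of these endpoints from Python might look like the following; the use of POST with a JSON body and the exact response shape are assumptions beyond the input/output fields listed above.

```python
import requests

BASE_URL = "http://localhost:8002"   # port used when starting reme_ai/app.py above

def retrieve_tool_memory(workspace_id, tool_names, timeout=120):
    """Fetch markdown-formatted tool memory for the given tools (illustrative call)."""
    response = requests.post(
        f"{BASE_URL}/retrieve_tool_memory",
        json={"workspace_id": workspace_id, "tool_names": tool_names},
        timeout=timeout,
    )
    response.raise_for_status()
    return response.json()
```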
Concurrency Control
- Max workers: 4 parallel queries
- Rate limiting: 1 second delay between submissions
- Timeout: 120 seconds per API call
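A sketch of this policy with concurrent.futures is shown below; the run_one callable is a placeholder for a single query execution, which itself applies the 120-second API timeout.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_queries_concurrently(queries, run_one, max_workers=4, submit_delay=1.0):
    """Up to 4 parallel queries with a 1-second delay between submissions (sketch)."""
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = []
        for query in queries:
            futures.append(executor.submit(run_one, query))
            time.sleep(submit_delay)            # rate limiting between submissions
        for future in as_completed(futures):
            results.append(future.result())
    return results
```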
References
- Tool Memory Schema: reme_ai/schema/memory.py
- Mock Tools Implementation: reme_ai/agent/tools/mock_search_tools.py
- LLM-based Search Op: reme_ai/agent/tools/llm_mock_search_op.py
- Tool Selection Op: reme_ai/agent/tools/use_mock_search_op.py