This article is a translated version of the Chinese original.
Overview
DeepFinance is a training recipe for financial deep-research Agents built on top of the AgentJet framework. The goal: use GRPO reinforcement learning to teach an LLM to autonomously call financial tools, gather data from multiple sources, cross-validate it, and finally produce structured, well-cited investment research reports.
Unlike traditional SFT, DeepFinance does not rely on human-written "ground truth" answers as training supervision. Instead, it designs a multi-dimensional reward system that serves as the RL training signal — letting the model explore optimal report-writing strategies on its own, guided by feedback from 5 orthogonal scoring dimensions.
Training loop:
Financial question → Agent calls tools to collect data → Generates research report → Multi-dimensional Judge scoring → GRPO policy update → Next rollout
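The "GRPO policy update" step scores each rollout relative to its group. A minimal sketch of that group-relative advantage computation (the helper name `grpo_advantages` is hypothetical, not AgentJet's actual API; only the mean-centering and std-normalization reflect GRPO's defining idea):

```python
def grpo_advantages(rewards):
    """Group-relative advantage for one query's group of rollouts:
    each rollout's reward minus the group mean, divided by the group's
    standard deviation (epsilon avoids division by zero)."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]

# With NUM_REPEAT=4 rollouts for one query, better-than-average reports
# get positive advantages and worse ones get negative advantages:
advs = grpo_advantages([0.2, 0.4, 0.6, 0.8])
```

Because the baseline is the group itself, no separate value model is needed; the relative ranking of sibling rollouts is the training signal.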
Pipeline
The training pipeline is composed of 4 core modules:
| Module | File | Responsibility |
|---|---|---|
| Reader | `deep_finance_reader.py` | Loads JSON training data, assembles System Prompt + User Query |
| Workflow | `deep_finance.py` | Defines the multi-turn ReAct Agent logic and maintains chat history |
| Judge | `deep_finance_judge.py` + `judge/` | Multi-dimensional reward scoring (the core innovation) |
| Config | `deep_finance.yaml` / `*.sh` | Training hyperparameters, reward weights, environment configuration |
┌─────────────────────────────────────────────────────────────┐
│ AgentJet Training Framework │
│ │
│ ┌──────────────┐ ┌──────────────────────┐ │
│ │ DeepFinance │ │ ExampleDeepResearch │ │
│ │ Reader │───>│ Protocol (Workflow) │ │
│ │ Data load + │ │ Multi-turn ReAct │ │
│ │ Prompt assemb.│ └──────────┬───────────┘ │
│ └──────────────┘ │ │
│ v │
│ ┌────────────────────────┐ │
│ │ EnvService (FinWorld) │ │
│ │ 19 financial tools+MCP│ │
│ │ MongoDB caching │ │
│ └────────────┬───────────┘ │
│ │ │
│ v │
│ ┌────────────────────────┐ │
│ │ DeepFinanceJudge │ │
│ │ Multi-dim reward │ │
│ │ (built on OpenJudge) │ │
│ └────────────┬───────────┘ │
│ │ │
│ v │
│ ┌────────────────────────┐ │
│ │ GRPO Trainer (verl) │ │
│ │ Multi-node Ray cluster│ │
│ └────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
Workflow Design
Two-Stage Deep-Research Procedure
The Agent's System Prompt (prompt/finance_analyst_prompt.md) requires the model to follow a two-stage research method:
Stage 1: Outline first, then investigate
- Identify the type of user question (single-stock analysis / sector study / event interpretation / macro analysis / stock screening).
- Output a research outline first (H1/H2 headings + Key Questions per section). No tool calls in this stage.
- Investigate section-by-section along the outline, summarizing after every round of tool calls.
Stage 2: Deep analysis and report generation
- Once data is sufficient, generate a Markdown research report grounded in real data.
- If evidence is found lacking during writing, the Agent is allowed 1–2 additional tool-call rounds to gather more support.
- End the report with a `[TASK_COMPLETED]` marker.
Citation Convention
The Agent is required to follow an academic-paper-style citation format:
- Every key factual sentence must end with a citation marker `[n]`.
- The report must include a `## References` section at the end.
- Citations must be traceable to actual tool-returned data — fabrication is forbidden.
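The end-of-sentence citation rule lends itself to a mechanical check. A minimal sketch (the regex and helper name are illustrative, not the repo's actual checker):

```python
import re

# Matches a [n] marker at the end of a sentence, optionally followed
# by a Western or Chinese full stop.
CITATION_RE = re.compile(r"\[(\d+)\]\s*[.。]?\s*$")

def ends_with_citation(sentence):
    """True if the sentence ends with a [n] citation marker."""
    return bool(CITATION_RE.search(sentence.strip()))
```

For example, `ends_with_citation("Revenue grew 12% YoY [3].")` is true, while the same sentence without the marker fails the check.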
Tool Suite
DeepFinance integrates 19 financial tools, exposed to the Agent over the MCP (Model Context Protocol) channel via EnvService. They cover the full data needs of financial research.
| Category | Tool | Function |
|---|---|---|
| Entity & Computation | `extract_entities_code` | Extract financial entities from natural language and look up codes |
| | `history_calculate` | A-share historical price analysis (natural-language Q&A) |
| General Capability | `dashscope_search` | Internet search |
| | `execute_code` | Python code execution |
| | `execute_shell` | Shell command execution |
| Tonghuashun Data | `crawl_ths_company` | Listed-company basic profile |
| | `crawl_ths_holder` | Shareholder research |
| | `crawl_ths_operate` | Operations analysis |
| | `crawl_ths_finance` | Financial analysis |
| | `crawl_ths_worth` | Earnings forecasts |
| | `crawl_ths_news` | News & announcements |
| | `crawl_ths_concept` | Concept / thematic info |
| | `crawl_ths_equity` | Equity structure |
| | `crawl_ths_capital` | Capital operations |
| | `crawl_ths_position` | Major-holder positions |
| | `crawl_ths_bonus` | Dividends & financing |
| | `crawl_ths_event` | Major corporate events |
| | `crawl_ths_field` | Industry comparison |
Tool-call rules:
- At most 3 tools per turn, encouraging multi-round progressive investigation.
- The Agent must search to confirm information (e.g. ticker codes) before drilling down.
- After every round of tool calls, summarize first, then decide on the next research direction.
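The per-turn cap in the rules above can be enforced with a simple guard over the model's proposed calls (a sketch; the constant and function names are illustrative, and the real enforcement lives in the workflow code):

```python
MAX_TOOLS_PER_TURN = 3

def cap_tool_calls(proposed_calls, limit=MAX_TOOLS_PER_TURN):
    """Keep at most `limit` tool calls in one turn; dropped calls force
    the Agent to continue investigating over multiple rounds instead of
    front-loading everything into a single turn."""
    return proposed_calls[:limit]
```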
Reward Design
This is DeepFinance's core innovation. We design 5 orthogonal scoring dimensions (Graders) and combine them via configurable weights into the final reward, plus an additional tool-call penalty.
Overall Formula
reward = Σ_i (w_i × grader_score_i) + tool_penalty
Where the grader weights are normalized (Σw_i = 1), and tool_penalty is an additional adjustment.
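The weighted combination can be sketched as follows (a minimal illustration; the actual implementation lives in `deep_finance_judge.py` and may differ in detail, e.g. in how weights are normalized):

```python
def combine_reward(scores, weights, tool_penalty=0.0):
    """Weighted sum of grader scores plus the tool-call penalty.

    `scores` and `weights` are dicts keyed by grader name; weights are
    renormalized to sum to 1, matching the convention above.
    """
    total_w = sum(weights.values())
    reward = sum(weights[k] / total_w * scores[k] for k in weights)
    return reward + tool_penalty

# Example with the default weights from the shell scripts:
weights = {"rm": 0.5, "presentation": 0.2, "grounding": 0.1, "ebtu": 0.2}
scores = {"rm": 0.8, "presentation": 0.6, "grounding": 1.0, "ebtu": 0.5}
reward = combine_reward(scores, weights)
# 0.5*0.8 + 0.2*0.6 + 0.1*1.0 + 0.2*0.5 = 0.72
```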
The 5 Scoring Dimensions
| Dimension | Name | Evaluates | Core Question |
|---|---|---|---|
| Analytical Adequacy | RM Gallery | Overall report quality | Is the analysis thorough? Is the reasoning sound? |
| Presentation Quality | PresentationQuality | Layout and structure of the report | Is it pleasant to read? Is information easy to extract? |
| Citation Compliance | Grounding | Citation coverage and authenticity | Do all key facts have citations? Are the citations real? |
| Evidence Traceability | EBTU | Evidence anchoring of atomic claims | Can each number/fact be traced back to original tool-returned data? |
| Citation-Logic Audit | Audit | Logical entailment of citations | Do the citations actually support the corresponding statements? Any inflation/fabrication? |
Default weight configuration (tunable in the shell scripts):
RM_WEIGHT=0.5 # Analytical adequacy
PRESENTATION_QUALITY_WEIGHT=0.2 # Presentation quality
GROUNDING_WEIGHT=0.1 # Citation compliance
EBTU_WEIGHT=0.2 # Evidence traceability (optional)
AUDIT_WEIGHT=0.0 # Citation-logic audit (optional)
1) Analytical Adequacy (RM Gallery)
Goal: Evaluate the analytical depth, coverage, and reasoning of the report — answering "is the analysis any good?"
Mechanism: Uses the finance_composition evaluator. An independent Judge LLM (qwen-max) compares the generated report against a reference answer.
Evaluation aspects (sharded by financial domain):
- Analytical depth: how deeply the core question has been investigated.
- Coverage: whether multiple analytical angles are covered (fundamentals, financials, valuation, industry, news, …).
- Reasoning: completeness of the reasoning chain, soundness of conclusions.
I/O:
- Input: User Query + Agent-generated report + reference answer.
- Output: a normalized score in [0, 1].
2) Presentation Quality
Goal: Evaluate user experience and information architecture — answering "is it well-laid-out and easy to read?"
Strictly does NOT evaluate: factual correctness, citation accuracy, content depth (those are handled by other graders).
8 sub-metrics (1/3/5 scale):
| Category | Metric | 5-point Standard |
|---|---|---|
| Scan-ability | A1 Conclusion-first | Standalone abstract / TL;DR at the top — main conclusion is visible without scrolling |
| | A2 Structural Navigation | Clear hierarchy (H1/H2/H3); long pieces have explicit signposts |
| | A3 Visual Emphasis | Bold/italics used precisely to highlight core insights; high signal-to-noise |
| Information Structuring | B1 Decomposing Dense Info | Complex data presented via tables / nested lists at a glance |
| | B2 Comparison & Alignment | A vs B / past vs present uses tables, with horizontally comparable dimensions |
| | B3 Consistency & Rendering | Uniform formatting, clean Markdown rendering |
| Editorial Clarity | C1 Visible Argument Chain | Logical chain is visible (claim → evidence → conclusion); citation anchors clear |
| | C2 Risks & Actions | Standalone section listing risks/limitations and next-step recommendations |
Scoring: each sub-metric is rated on the 1/3/5 scale, and the ratings are combined into the dimension score.
Anti-gaming: empty tables, meaningless repeated lists, and format-for-the-sake-of-format all score a flat 1 point.
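One plausible way to aggregate the eight 1/3/5 sub-ratings into a [0, 1] dimension score is a rescaled mean. This is an illustrative assumption, not necessarily the grader's exact rule (which is defined in the `judge/` code):

```python
def presentation_score(sub_scores):
    """Map eight 1/3/5 sub-metric ratings onto [0, 1] via a rescaled mean.
    NOTE: illustrative assumption, not the repo's confirmed formula."""
    assert all(s in (1, 3, 5) for s in sub_scores)
    mean = sum(sub_scores) / len(sub_scores)
    return (mean - 1) / 4  # 1 -> 0.0, 3 -> 0.5, 5 -> 1.0
```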
3) Citation Compliance (Grounding)
Goal: Evaluate citation coverage and authenticity — answering "do all the key facts have sources, and are the citations real?"
Procedure:
- Extract User Query, Evidence (tool calls + returns), and the final report from the conversation trace.
- An LLM auditor identifies all "key factual sentences" (containing numbers / dates / financial metrics / definitive statements).
- Check whether each key sentence ends with a citation marker `[n]`.
- Check that each citation has a valid entry in the References section (a valid URL or a complete no-url record).
- Check whether the citation content is consistent with Evidence (to detect fake citations).
Output fields:
- `total_key_facts`: total number of key factual sentences.
- `cited_key_facts`: number of those that end with a citation.
- `fake_count`: citations clearly contradicting the evidence.
- `missing_count`: key facts lacking a citation.
- `invalid_reference_nums`: malformed reference numbers.
Scoring:
citation_coverage = cited_key_facts / total_key_facts # citation coverage
grounding_score = 1 - fake_count / cited_key_facts # citation authenticity
final_score = 0.5 × coverage + 0.5 × grounding # combined score
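The formulas above can be transcribed directly into code; the zero-division guards below are an added assumption for robustness, not stated in the original:

```python
def grounding_reward(total_key_facts, cited_key_facts, fake_count):
    """Combine citation coverage and citation authenticity 50/50,
    per the Grounding scoring formulas above."""
    if total_key_facts == 0:
        return 0.0  # guard: no key facts identified (added assumption)
    coverage = cited_key_facts / total_key_facts
    authenticity = (1 - fake_count / cited_key_facts) if cited_key_facts else 0.0
    return 0.5 * coverage + 0.5 * authenticity

# e.g. 10 key facts, 8 cited, 1 fake citation:
# coverage = 0.8, authenticity = 0.875, final = 0.8375
```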
4) Evidence Traceability (EBTU – Evidence-Backed Trace Units)
Goal: Audit each "atomic claim" in the report for evidence anchoring — answering "can every number, every fact, be traced back to data returned by a tool?"
Core principle: Evidence-first. The auditor must produce evidence anchors (step + quote) before issuing a verdict; reasoning backwards from a conclusion to "find" evidence is forbidden.
Audit procedure:
- Extract all atomic claims (Trace Units) from the report and tag each with a type (numeric / temporal / event / comparison / causal / …).
- Tag hardness: `hard` (definitive fact) vs `soft` (explicitly marked as speculation/hypothesis).
- For each claim, find anchors in Evidence:
  - Precise to a step number and an in-line quote (≤ 120 chars).
  - Numbers / dates must literally appear in the Evidence text.
- Issue a verdict:

| Verdict | Meaning |
|---|---|
| `supported` | The anchor directly supports the claim |
| `contradicted` | The anchor explicitly conflicts with the claim |
| `no_evidence` | No support in Evidence, and the claim is presented as definitive |
| `speculative_ok` | The claim is explicitly speculative/hypothetical, not disguised fact |
| `unclear` | Evidence is related but insufficient to support or refute |

- Tag the issue type: `entity_mismatch` / `time_mismatch` / `value_mismatch` / `scope_mismatch` / `logic_leap` / `over_precision` / `missing_anchor`.
Scoring (deterministic, computed in Python — not produced by the LLM):
base = (supported - 1.4×contradicted - 0.9×no_evidence - 0.4×unclear) / hard_units
misattrib_factor = max(0, 1 - 0.7 × misattrib_rate) # misattribution penalty
selection_factor = min(1, extracted_units / expected) # coverage factor
cov_factor = 0.65 + 0.35 × digit_coverage # number/date coverage
score = base × misattrib_factor × selection_factor × cov_factor
Key design point: the LLM only emits structured outputs (claim extraction + anchor labels + verdicts); the score itself is computed deterministically by code, avoiding the instability of LLM self-scoring.
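That deterministic computation can be transcribed from the formulas above; the function signature and the final clamp to [0, 1] are assumptions added for illustration:

```python
def ebtu_score(supported, contradicted, no_evidence, unclear,
               hard_units, misattrib_rate, extracted_units, expected,
               digit_coverage):
    """Deterministic EBTU score from LLM-labeled verdict counts.
    Transcribes the published formulas; the [0, 1] clamp is an
    added assumption."""
    base = (supported - 1.4 * contradicted - 0.9 * no_evidence
            - 0.4 * unclear) / hard_units
    misattrib_factor = max(0.0, 1 - 0.7 * misattrib_rate)   # misattribution penalty
    selection_factor = min(1.0, extracted_units / expected)  # coverage factor
    cov_factor = 0.65 + 0.35 * digit_coverage                # number/date coverage
    score = base * misattrib_factor * selection_factor * cov_factor
    return max(0.0, min(1.0, score))
```

A report whose hard claims are all `supported`, with full number/date coverage and no misattribution, scores 1.0; each `contradicted` hard claim costs more than an uncited one, matching the 1.4 vs 0.9 coefficients.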
Tool-Call Penalty
On top of the weighted score, an extra tool-call penalty encourages the Agent to actively gather data with tools:
| Tool-call count | Penalty |
|---|---|
| 0 | -1.0 |
| 1–2 | -0.5 |
| ≥ 3 | 0.0 (no penalty) |
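The penalty table maps directly to a small function (the name `tool_call_penalty` is illustrative):

```python
def tool_call_penalty(num_tool_calls):
    """Penalty encouraging active tool use, per the table above:
    writing a report with no tool evidence is punished hardest."""
    if num_tool_calls == 0:
        return -1.0
    if num_tool_calls <= 2:
        return -0.5
    return 0.0  # >= 3 calls: no penalty
```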
Quick Start
Environment Setup
- Install AgentJet and its dependencies:
- Configure the `.env` file (API keys, model paths, data paths, etc.):
# Example .env
MODEL_PATH=/path/to/Qwen3-8B
TRAIN_DATA_PATH=/path/to/train.json
VAL_DATA_PATH=/path/to/val.json
TRAIN_REF_ANS_PATH=/path/to/train_ref_answers.json
VAL_REF_ANS_PATH=/path/to/val_ref_answers.json
CKPT_SAVE_PATH=/path/to/checkpoints
OPENJUDGE_API_KEY=your_api_key
RM_API_KEY=your_api_key
- Start EnvService (the financial-tools service).
Single-Node Debugging Mode
The debug script runs with `--backbone="debug"`, which is ideal for validating the workflow and debugging.
Multi-Node Training Mode
# Submit on PAI-DLC or any multi-node environment
bash tutorial/example_deep_finance/deep_finance.sh
This script will:
- Generate the config file dynamically from the YAML template.
- Start Ray Head + the training task on the master node.
- Worker nodes automatically join the Ray cluster.
Key Parameters
| Parameter | Default | Description |
|---|---|---|
| `NUM_REPEAT` | 4 | Group size — number of rollouts per query |
| `NUM_STEPS` | 6 | Max number of interaction rounds per sample |
| `TRAIN_BATCH_SIZE` | 32 | Training batch size |
| `RM_WEIGHT` | 0.5 | Analytical-adequacy weight |
| `PRESENTATION_QUALITY_WEIGHT` | 0.25 | Presentation-quality weight |
| `GROUNDING_WEIGHT` | 0.25 | Citation-compliance weight |
| `EBTU_WEIGHT` | 0.0 | Evidence-traceability weight (optional) |
| `AUDIT_WEIGHT` | 0.0 | Citation-logic-audit weight (optional) |
Experimental Results