This article is a translated version of the Chinese original.

Overview

DeepFinance is a training recipe for financial deep-research Agents built on top of the AgentJet framework. The goal: use GRPO reinforcement learning to teach an LLM to autonomously call financial tools, gather data from multiple sources, cross-validate it, and finally produce structured, well-cited investment research reports.

Unlike traditional SFT, DeepFinance does not rely on human-written "ground truth" answers as training supervision. Instead, it designs a multi-dimensional reward system that serves as the RL training signal — letting the model explore optimal report-writing strategies on its own, guided by feedback from 5 orthogonal scoring dimensions.

Training loop:

Financial question → Agent calls tools to collect data → Generates research report → Multi-dimensional Judge scoring → GRPO policy update → Next rollout
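
The "GRPO" in this loop refers to Group Relative Policy Optimization: each query is rolled out several times, and each rollout's reward is compared against the group average to form its advantage. A minimal sketch of that normalization step, assuming plain list inputs (function names are illustrative, not AgentJet APIs):

import statistics

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantage: subtract the group mean, scale by group std."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# With NUM_REPEAT=4 rollouts per query (see Key Parameters below):
# rewards    = [judge(rollout(query)) for _ in range(4)]  # judge/rollout are stand-ins
# advantages = grpo_advantages(rewards)                   # drives the policy update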

Pipeline

The training pipeline is composed of 4 core modules:

| Module | File | Responsibility |
| --- | --- | --- |
| Reader | deep_finance_reader.py | Loads JSON training data, assembles System Prompt + User Query |
| Workflow | deep_finance.py | Defines the multi-turn ReAct Agent logic and maintains chat history |
| Judge | deep_finance_judge.py + judge/ | Multi-dimensional reward scoring (the core innovation) |
| Config | deep_finance.yaml / *.sh | Training hyperparameters, reward weights, environment configuration |

┌─────────────────────────────────────────────────────────────┐
│                    AgentJet Training Framework               │
│                                                             │
│  ┌───────────────┐   ┌──────────────────────┐               │
│  │ DeepFinance   │   │  ExampleDeepResearch │               │
│  │ Reader        │──>│  Protocol (Workflow) │               │
│  │ Data load +   │   │  Multi-turn ReAct    │               │
│  │ Prompt assemb.│   └──────────┬───────────┘               │
│  └───────────────┘              │                           │
│                                 v                           │
│                    ┌────────────────────────┐               │
│                    │  EnvService (FinWorld) │               │
│                    │  19 financial tools+MCP│               │
│                    │  MongoDB caching       │               │
│                    └────────────┬───────────┘               │
│                                 │                           │
│                                 v                           │
│                    ┌────────────────────────┐               │
│                    │  DeepFinanceJudge      │               │
│                    │  Multi-dim reward      │               │
│                    │  (built on OpenJudge)  │               │
│                    └────────────┬───────────┘               │
│                                 │                           │
│                                 v                           │
│                    ┌────────────────────────┐               │
│                    │  GRPO Trainer (verl)   │               │
│                    │  Multi-node Ray cluster│               │
│                    └────────────────────────┘               │
└─────────────────────────────────────────────────────────────┘

Workflow Design

Two-Stage Deep-Research Procedure

The Agent's System Prompt (prompt/finance_analyst_prompt.md) requires the model to follow a two-stage research method:

Stage 1: Outline first, then investigate

  1. Identify the type of user question (single-stock analysis / sector study / event interpretation / macro analysis / stock screening).
  2. Output a research outline first (H1/H2 headings + Key Questions per section). No tool calls in this stage.
  3. Investigate section-by-section along the outline, summarizing after every round of tool calls.
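
For illustration, a Stage 1 outline for a single-stock question might look like the following (the headings and Key Questions are hypothetical, not taken from the training data):

# <Company> (<ticker>) Investment Analysis
## 1. Company Fundamentals
Key Questions: What is the core business model? How is revenue composed?
## 2. Financial Performance
Key Questions: What are the revenue/profit trends? Are margins stable?
## 3. Valuation
Key Questions: How does the current P/E compare with industry peers?
## 4. Risks
Key Questions: What policy, competition, or concentration risks apply?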

Stage 2: Deep analysis and report generation

  1. Once data is sufficient, generate a Markdown research report grounded in real data.
  2. If evidence is found lacking during writing, the Agent is allowed 1–2 additional tool-call rounds to gather more support.
  3. End the report with a [TASK_COMPLETED] marker.

Citation Convention

The Agent is required to follow an academic-paper-style citation format:

  • Every key factual sentence must end with a citation marker [n].
  • The report must include a ## References section at the end.
  • Citations must be traceable to actual tool-returned data — fabrication is forbidden.
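
A hypothetical report fragment that satisfies all three rules (the numbers and sources are made up for illustration):

Q3 revenue grew 18.2% year-over-year, driven mainly by the direct-sales channel [1]. Gross margin held at 91.5%, in line with the prior quarter [2].

## References
[1] crawl_ths_finance: quarterly income statement (tool return, no URL)
[2] crawl_ths_finance: profitability analysis (tool return, no URL)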

Tool Suite

DeepFinance integrates 19 financial tools, exposed to the Agent over the MCP (Model Context Protocol) channel via EnvService. They cover the full data needs of financial research.

| Category | Tool | Function |
| --- | --- | --- |
| Entity & Computation | extract_entities_code | Extract financial entities from natural language and look up codes |
| | history_calculate | A-share historical price analysis (natural-language Q&A) |
| General Capability | dashscope_search | Internet search |
| | execute_code | Python code execution |
| | execute_shell | Shell command execution |
| Tonghuashun Data | crawl_ths_company | Listed-company basic profile |
| | crawl_ths_holder | Shareholder research |
| | crawl_ths_operate | Operations analysis |
| | crawl_ths_finance | Financial analysis |
| | crawl_ths_worth | Earnings forecasts |
| | crawl_ths_news | News & announcements |
| | crawl_ths_concept | Concept / thematic info |
| | crawl_ths_equity | Equity structure |
| | crawl_ths_capital | Capital operations |
| | crawl_ths_position | Major-holder positions |
| | crawl_ths_bonus | Dividends & financing |
| | crawl_ths_event | Major corporate events |
| | crawl_ths_field | Industry comparison |

Tool-call rules:

  • At most 3 tools per turn, encouraging multi-round progressive investigation.
  • The Agent must search to confirm information (e.g. ticker codes) before drilling down.
  • After every round of tool calls, summarize first, then decide on the next research direction.
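
Since the tools are exposed over MCP, a single tool call in a round is a standard JSON-RPC tools/call request. A sketch of what one call might carry (the argument schema of crawl_ths_finance is an assumption):

# One MCP tool call within a round (at most 3 per turn).
# NOTE: the "arguments" schema below is hypothetical.
request = {
    "jsonrpc": "2.0",
    "id": 7,
    "method": "tools/call",
    "params": {
        "name": "crawl_ths_finance",
        "arguments": {"stock_code": "600519"},
    },
}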

Reward Design

This is DeepFinance's core innovation. We design 5 orthogonal scoring dimensions (Graders) and combine them via configurable weights into the final reward, plus an additional tool-call penalty.

Overall Formula

final_reward = Σ(w_i × grader_i_score) + tool_penalty

Where the grader weights are normalized (Σw_i = 1), and tool_penalty is an additional adjustment.
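
In code form, the combination is a plain weighted sum. A minimal sketch, assuming grader scores are already normalized to [0, 1] (the names here are illustrative):

def final_reward(grader_scores: dict[str, float],
                 weights: dict[str, float],
                 tool_penalty: float) -> float:
    """Weighted sum of grader scores plus the tool-call penalty."""
    assert abs(sum(weights.values()) - 1.0) < 1e-6  # Σw_i = 1
    weighted = sum(weights[name] * grader_scores[name] for name in weights)
    return weighted + tool_penalty  # penalty defined in the Tool-Call Penalty section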

The 5 Scoring Dimensions

| Dimension | Name | Evaluates | Core Question |
| --- | --- | --- | --- |
| Analytical Adequacy | RM Gallery | Overall report quality | Is the analysis thorough? Is the reasoning sound? |
| Presentation Quality | PresentationQuality | Layout and structure of the report | Is it pleasant to read? Is information easy to extract? |
| Citation Compliance | Grounding | Citation coverage and authenticity | Do all key facts have citations? Are the citations real? |
| Evidence Traceability | EBTU | Evidence anchoring of atomic claims | Can each number/fact be traced back to original tool-returned data? |
| Citation-Logic Audit | Audit | Logical entailment of citations | Do the citations actually support the corresponding statements? Any inflation/fabrication? |

Default weight configuration (tunable in the shell scripts):

RM_WEIGHT=0.5                       # Analytical adequacy
PRESENTATION_QUALITY_WEIGHT=0.2     # Presentation quality
GROUNDING_WEIGHT=0.1                # Citation compliance
EBTU_WEIGHT=0.2                     # Evidence traceability (optional)
AUDIT_WEIGHT=0.0                    # Citation-logic audit (optional)

1) Analytical Adequacy (RM Gallery)

Goal: Evaluate the analytical depth, coverage, and reasoning of the report — answering "is the analysis any good?"

Mechanism: Uses the finance_composition evaluator. An independent Judge LLM (qwen-max) compares the generated report against a reference answer.

Evaluation aspects (split by financial domain):

  • Analytical depth: how deeply the core question has been investigated.
  • Coverage: whether multiple analytical angles are covered (fundamentals, financials, valuation, industry, news, …).
  • Reasoning: completeness of the reasoning chain, soundness of conclusions.

I/O:

  • Input: User Query + Agent-generated report + reference answer.
  • Output: a normalized score in [0, 1].

2) Presentation Quality

Goal: Evaluate user experience and information architecture — answering "is it well-laid-out and easy to read?"

Strictly does NOT evaluate: factual correctness, citation accuracy, content depth (those are handled by other graders).

8 sub-metrics (1/3/5 scale):

| Category | Metric | 5-point Standard |
| --- | --- | --- |
| Scan-ability | A1 Conclusion-first | Standalone abstract / TL;DR at the top; main conclusion visible without scrolling |
| | A2 Structural Navigation | Clear hierarchy (H1/H2/H3); long pieces have explicit signposts |
| | A3 Visual Emphasis | Bold/italics used precisely to highlight core insights; high signal-to-noise |
| Information Structuring | B1 Decomposing Dense Info | Complex data presented via tables / nested lists at a glance |
| | B2 Comparison & Alignment | A vs B / past vs present uses tables, with horizontally comparable dimensions |
| | B3 Consistency & Rendering | Uniform formatting, clean Markdown rendering |
| Editorial Clarity | C1 Visible Argument Chain | Logical chain is visible (claim → evidence → conclusion); citation anchors clear |
| | C2 Risks & Actions | Standalone section listing risks/limitations and next-step recommendations |

Scoring:

score = Σ(8 sub-scores) / 40    # normalized to [0, 1]

Anti-gaming: empty tables, meaningless repeated lists, and formatting for its own sake all receive a flat 1 point.
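
The normalization is a direct sum over the rubric. A one-function sketch (the 1/3/5 constraint mirrors the scale above):

def presentation_score(sub_scores: list[int]) -> float:
    """Eight sub-metrics (A1-A3, B1-B3, C1-C2), each scored 1, 3, or 5."""
    assert len(sub_scores) == 8 and all(s in (1, 3, 5) for s in sub_scores)
    return sum(sub_scores) / 40  # maximum is 8 × 5 = 40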


3) Citation Compliance (Grounding)

Goal: Evaluate citation coverage and authenticity — answering "do all the key facts have sources, and are the citations real?"

Procedure:

  1. Extract User Query, Evidence (tool calls + returns), and the final report from the conversation trace.
  2. An LLM auditor identifies all "key factual sentences" (containing numbers / dates / financial metrics / definitive statements).
  3. Check whether each key sentence ends with a citation marker [n].
  4. Check that each citation has a valid entry in the References section (valid URL or a complete no-url record).
  5. Check whether the citation content is consistent with Evidence (detect fake citations).

Output fields:

  • total_key_facts: total number of key factual sentences.
  • cited_key_facts: number of those that end with a citation.
  • fake_count: citations clearly contradicting the evidence.
  • missing_count: key facts lacking a citation.
  • invalid_reference_nums: malformed reference numbers.

Scoring:

citation_coverage = cited_key_facts / total_key_facts              # citation coverage
grounding_score = 1 - fake_count / cited_key_facts                 # citation authenticity
final_score = 0.5 × citation_coverage + 0.5 × grounding_score      # combined score
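
Transcribing the three formulas into one function, with names matching the output fields above (the handling of zero-citation edge cases is an assumption):

def grounding_reward(total_key_facts: int,
                     cited_key_facts: int,
                     fake_count: int) -> float:
    """0.5 × citation coverage + 0.5 × citation authenticity."""
    if total_key_facts == 0 or cited_key_facts == 0:
        return 0.0  # assumption: no key facts or no citations scores zero
    citation_coverage = cited_key_facts / total_key_facts
    grounding_score = 1 - fake_count / cited_key_facts
    return 0.5 * citation_coverage + 0.5 * grounding_score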

4) Evidence Traceability (EBTU – Evidence-Backed Trace Units)

Goal: Audit each "atomic claim" in the report for evidence anchoring — answering "can every number, every fact, be traced back to data returned by a tool?"

Core principle: Evidence-first. The auditor must produce evidence anchors (step + quote) before issuing a verdict; reasoning backwards from a conclusion to "find" evidence is forbidden.

Audit procedure:

  1. Extract all atomic claims (Trace Units) from the report and tag each with a type (numeric / temporal / event / comparison / causal / …).
  2. Tag hardness: hard (definitive fact) vs soft (explicitly marked as speculation/hypothesis).
  3. For each claim, find anchors in Evidence:
    • Precise to a step number and an in-line quote (≤ 120 chars).
  4. Numbers / dates must literally appear in the Evidence text.
  5. Issue a verdict:

| Verdict | Meaning |
| --- | --- |
| supported | The anchor directly supports the claim |
| contradicted | The anchor explicitly conflicts with the claim |
| no_evidence | No support in Evidence, and the claim is presented as definitive |
| speculative_ok | The claim is explicitly speculative/hypothetical, not disguised fact |
| unclear | Evidence is related but insufficient to support or refute |

  6. Tag the issue type: entity_mismatch / time_mismatch / value_mismatch / scope_mismatch / logic_leap / over_precision / missing_anchor.

Scoring (deterministic, computed in Python — not produced by the LLM):

base = (supported - 1.4×contradicted - 0.9×no_evidence - 0.4×unclear) / hard_units
misattrib_factor = max(0, 1 - 0.7 × misattrib_rate)     # misattribution penalty
selection_factor = min(1, extracted_units / expected)   # coverage factor
cov_factor = 0.65 + 0.35 × digit_coverage               # number/date coverage
score = base × misattrib_factor × selection_factor × cov_factor

Key design point: the LLM only emits structured outputs (claim extraction + anchor labels + verdicts); the score itself is computed deterministically by code, avoiding the instability of LLM self-scoring.
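
A direct transcription of that deterministic computation (the denominator guards and the final clamp to [0, 1] are assumptions; all counts come from the auditor's structured verdicts):

def ebtu_score(supported: int, contradicted: int, no_evidence: int,
               unclear: int, hard_units: int, misattrib_rate: float,
               extracted_units: int, expected: int, digit_coverage: float) -> float:
    """Deterministic EBTU score computed from the LLM's structured output."""
    base = (supported - 1.4 * contradicted - 0.9 * no_evidence
            - 0.4 * unclear) / max(hard_units, 1)
    misattrib_factor = max(0.0, 1 - 0.7 * misattrib_rate)            # misattribution penalty
    selection_factor = min(1.0, extracted_units / max(expected, 1))  # coverage factor
    cov_factor = 0.65 + 0.35 * digit_coverage                        # number/date coverage
    return max(0.0, min(1.0, base * misattrib_factor * selection_factor * cov_factor))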


Tool-Call Penalty

On top of the weighted score, an extra tool-call penalty encourages the Agent to actively gather data with tools:

| Tool-call count | Penalty |
| --- | --- |
| 0 | -1.0 |
| 1–2 | -0.5 |
| ≥ 3 | 0.0 (no penalty) |
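
As a function, a direct reading of the table:

def tool_call_penalty(n_tool_calls: int) -> float:
    """Penalty added to the weighted reward; zero once the Agent calls ≥ 3 tools."""
    if n_tool_calls == 0:
        return -1.0
    if n_tool_calls <= 2:
        return -0.5
    return 0.0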

Quick Start

Environment Setup

  1. Install AgentJet and its dependencies:

cd /path/to/AgentJet
bash install.sh

  2. Configure the .env file (API keys, model paths, data paths, etc.):

# Example .env
MODEL_PATH=/path/to/Qwen3-8B
TRAIN_DATA_PATH=/path/to/train.json
VAL_DATA_PATH=/path/to/val.json
TRAIN_REF_ANS_PATH=/path/to/train_ref_answers.json
VAL_REF_ANS_PATH=/path/to/val_ref_answers.json
CKPT_SAVE_PATH=/path/to/checkpoints
OPENJUDGE_API_KEY=your_api_key
RM_API_KEY=your_api_key

  3. Start EnvService (the financial-tools service).

Single-Node Debugging Mode

bash tutorial/example_deep_finance/deep_finance_single.sh

This script runs with --backbone="debug", which is ideal for validating the workflow and debugging.

Multi-Node Training Mode

# Submit on PAI-DLC or any multi-node environment
bash tutorial/example_deep_finance/deep_finance.sh

This script will:

  1. Generate the config file dynamically from the YAML template.
  2. Start Ray Head + the training task on the master node.
  3. Worker nodes automatically join the Ray cluster.

Key Parameters

| Parameter | Default | Description |
| --- | --- | --- |
| NUM_REPEAT | 4 | Group size: number of rollouts per query |
| NUM_STEPS | 6 | Max number of interaction rounds per sample |
| TRAIN_BATCH_SIZE | 32 | Training batch size |
| RM_WEIGHT | 0.5 | Analytical-adequacy weight |
| PRESENTATION_QUALITY_WEIGHT | 0.25 | Presentation-quality weight |
| GROUNDING_WEIGHT | 0.25 | Citation-compliance weight |
| EBTU_WEIGHT | 0.0 | Evidence-traceability weight (optional) |
| AUDIT_WEIGHT | 0.0 | Citation-logic-audit weight (optional) |

Experimental Results
