🌍 Appworld Experiment

We tested ReMe on Appworld using qwen3-8b:

Method pass@1 pass@2 pass@4
without ReMe 0.083 0.140 0.228
with ReMe 0.109 (+2.6%) 0.175 (+3.5%) 0.281 (+5.3%)

Pass@K measures the probability that at least one of the K generated samples successfully completes the task ( score=1).
The current experiment uses an internal AppWorld environment, which may have slight differences.

You can find more details on reproducing the experiment in quickstart.md.

🧊 Frozenlake Experiment

without ReMe with ReMe

GIF 1

GIF 2

We tested on 100 random frozenlake maps using qwen3-8b:

Method pass rate
without ReMe 0.66
with ReMe 0.72 (+6.0%)

You can find more details on reproducing the experiment in quickstart.md.

🔧 BFCL-V3 Experiment

We tested ReMe on BFCL-V3 multi-turn-base (randomly split 50train/150val) using qwen3-8b:

Method pass@1 pass@2 pass@4
without ReMe 0.2472 0.2733 0.2922
with ReMe 0.3061 (+5.89%) 0.3500 (+7.67%) 0.3888 (+9.66%)