🌍 Appworld Experiment
We tested ReMe on Appworld using qwen3-8b:
Method | pass@1 | pass@2 | pass@4 |
---|---|---|---|
without ReMe | 0.083 | 0.140 | 0.228 |
with ReMe | 0.109 (+2.6%) | 0.175 (+3.5%) | 0.281 (+5.3%) |
Pass@K measures the probability that at least one of the K generated samples successfully completes the task (
score=1).
The current experiment uses an internal AppWorld environment, which may have slight differences.
You can find more details on reproducing the experiment in quickstart.md.
🧊 Frozenlake Experiment
without ReMe | with ReMe |
---|---|
We tested on 100 random frozenlake maps using qwen3-8b:
Method | pass rate |
---|---|
without ReMe | 0.66 |
with ReMe | 0.72 (+6.0%) |
You can find more details on reproducing the experiment in quickstart.md.
🔧 BFCL-V3 Experiment
We tested ReMe on BFCL-V3 multi-turn-base (randomly split 50train/150val) using qwen3-8b:
Method | pass@1 | pass@2 | pass@4 |
---|---|---|---|
without ReMe | 0.2472 | 0.2733 | 0.2922 |
with ReMe | 0.3061 (+5.89%) | 0.3500 (+7.67%) | 0.3888 (+9.66%) |