Auto-Research
Twinkle Auto is a terminal-based intelligent training assistant that lets you control, monitor, and debug ML training through natural language. It combines a chat-driven AI agent with an automated health monitor that can detect and fix training failures autonomously.
Architecture Overview
┌──────────────────────────────────────────────────────────┐
│ TwinkleAuto (asyncio chat loop) │
│ │
│ Components: │
│ AgentLoop ─── LLM tool-calling loop │
│ TrainingMonitor ─── periodic health check & auto-fix │
│ LocalConnection ─── file-system based communication │
│ SkillManager ─── async plugin loading │
└──────────────────────────────────────────────────────────┘
Installation & Launch
Auto is part of the twinkle-client package:
pip install twinkle-client
Command-Line Usage
# Basic launch (uses default local Ollama endpoint)
twinkle-auto
# Specify LLM backend
twinkle-auto --llm-base-url http://localhost:11434/v1 --llm-model qwen3.5
# Attach to an existing training run
twinkle-auto --run-id my-grpo-run
# Use a remote API (e.g., OpenAI-compatible)
twinkle-auto --llm-base-url https://api.example.com/v1 --llm-api-key sk-xxx --llm-model gpt-4o
# Enable debug logging
twinkle-auto --verbose
Or run as a Python module:
python -m twinkle_client.auto
CLI Options
| Option | Env Var | Default | Description |
|---|---|---|---|
--run-id, -r | TWINKLE_AUTO_RUN_ID | None | Attach to an existing training run |
--llm-base-url | TWINKLE_LLM_BASE_URL | http://localhost:11434/v1 | LLM API base URL |
--llm-model | TWINKLE_LLM_MODEL | qwen3.5 | LLM model name |
--llm-api-key | TWINKLE_LLM_API_KEY | not-needed | LLM API key |
--verbose, -v | TWINKLE_AUTO_VERBOSE | False | Enable DEBUG logging |
--version, -V | — | — | Show version and exit |
Chat Agent
The core of Auto is an LLM-powered tool-calling agent (AgentLoop) that processes natural language commands through an OpenAI-compatible API. The agent maintains conversation history with automatic pruning (last 50 messages) and supports up to 10 tool-calling rounds per interaction.
What You Can Say
Training lifecycle:
- “List my training runs”
- “Start a new GRPO training with Qwen3.5-4B on gsm8k”
- “Pause the current run”
- “Resume training”
- “Stop training”
Server management:
- “Start the server with Qwen3.5-4B and a Qwen3.5-72B sampler on 2 GPUs”
- “Shut down the server”
- “How many GPUs are available?”
Monitoring & analysis:
- “How is the training going?”
- “Show me the reward-related metrics”
- “Zoom into steps 100-200”
- “Reset the chart view”
Search:
- “Search for math datasets”
- “Find Qwen models on ModelScope”
Available Tools
The agent has access to 13 built-in tools:
| Tool | Description |
|---|---|
list_training_runs | List all training runs |
get_training_status | Get detailed status and recent metrics |
start_server | Start Ray cluster + Twinkle Server (idempotent) |
shutdown_server | Shut down server and release GPU resources |
start_training | Create and launch a new training run |
select_run | Switch monitoring to a different run |
pause_training | Pause training (SIGKILL, server retains state) |
resume_training | Resume by re-launching the client script |
stop_training | Stop training (SIGTERM, saves checkpoint) |
update_script | Update training script with version archiving |
list_supported_models | Query server for available models |
search_datasets | Search ModelScope for datasets |
search_models | Search ModelScope for models |
zoom_metrics | Adjust metrics chart view range |
select_metrics | Choose which metrics to display (max 4) |
get_cluster_info | Get GPU/cluster resource info |
Server Startup
The start_server tool automates a multi-step pipeline:
- GPU detection —
nvidia-smihardware scan - GPU allocation — partition GPUs between training model and samplers
- Config generation — auto-create
server_config.yaml - Ray cluster startup — multi-node GPU partitioning with isolated
CUDA_VISIBLE_DEVICES - Server launch — start Twinkle Server as background process
- Health check — poll
/api/v1/healthzuntil ready
Multi-model topology is supported: 1 training model + N sampler/teacher models.
Skills System
Auto supports extensible skill plugins loaded from three sources:
- Bundled skills — shipped inside
twinkle_client/skills/bundled/ - User-local skills —
~/.cache/twinkle/auto/skills/local/ - Community skills — fetched from ModelScope (best-effort, 10s timeout)
Skills are loaded asynchronously after startup and injected into the agent’s system prompt. The agent is usable immediately even before skills finish loading.
Training Monitor (Auto-Fix)
The TrainingMonitor is a background service that runs every 30 seconds, collecting all available signals about the current training run and feeding them to the LLM for analysis.
Collected Signals
- Process status: alive / dead / unknown
- output.log tail: last 1500 chars (prioritizes tracebacks)
- Metrics: recent entries + first-half vs second-half trend analysis
- Stall duration: seconds since last metric was produced
- Current train.py: full script source (for accurate fixes)
Decision Framework
The LLM classifies each check into one of three actions:
| Decision | When | Action |
|---|---|---|
| LGTM | Training progressing normally | No action |
| WARNING | Loss plateau, reward hacking, KL explosion, etc. | Relay observation to user |
| FIX | Script crashed, process dead with traceback | Auto-fix and restart |
Auto-Fix Pipeline
When a FIX is needed:
- LLM outputs diagnosis + complete fixed script
- Monitor archives the old
train.pyastrain_v{N}.py - Writes the fixed script as the new
train.py - Re-launches training via
resume_training - Resets stall tracking for the new attempt
Safety guardrails:
- Max 3 auto-fix attempts per run (prevents infinite retry loops)
- Fix attempts are tracked per
run_id - Snapshot deduplication avoids re-analyzing unchanged states
File-Based Connection
Auto communicates with training processes through the local filesystem:
~/.cache/twinkle/{run_id}/
├── meta.json — run metadata (model_id, config, status, pid)
├── metrics.jsonl — one JSON object per step (incremental)
├── output.log — combined stdout+stderr from training
├── train.py — current active training script
└── train_v{N}.py — archived previous script versions
Training Control Model
In Server Mode, the Twinkle Server retains all model/optimizer state in GPU memory:
- Pause = kill client process (SIGKILL) — server state preserved
- Resume = re-launch client script — seamlessly continues training
- Stop = SIGTERM — triggers checkpoint saving then exits
- Shut down server = releases GPU resources, destroys model state
TrainingRuntime (Script Integration)
Training scripts use TrainingRuntime to integrate with Auto:
from twinkle_client.auto.runtime import TrainingRuntime
rt = TrainingRuntime(run_id='my-grpo-run')
rt.start(model_id='Qwen/Qwen3.5-4B', config={'lr': 1e-5})
rt.register_graceful_shutdown(model, dataloader)
for step, batch in enumerate(dataloader):
# ... training logic ...
rt.log_metrics(step=step, loss=loss, reward=reward, grad_norm=gn, lr=lr)
rt.log(f'Completed step {step}, loss={loss:.4f}')
rt.finish()
Key Methods
| Method | Description |
|---|---|
start(model_id, config, script_path) | Initialize run directory and metadata |
log_metrics(**kwargs) | Write metrics entry to metrics.jsonl |
log(message) | Print log message (captured as output.log) |
get_resume_info() | Get last_step for resuming from checkpoint |
finish(status) | Mark training as finished, close files |
register_graceful_shutdown(model, dataloader) | Register SIGTERM handler that saves checkpoint |
Resume Support
TrainingRuntime automatically saves training progress to meta.json (throttled to every 5 seconds). Scripts can use get_resume_info() to resume from the last saved step:
rt = TrainingRuntime(run_id='my-run')
resume = rt.get_resume_info()
global_step = resume['last_step']
if global_step > 0:
dataloader.skip_consumed_samples(global_step * BATCH_SIZE)
print(f'Resuming from step {global_step}')
Graceful Shutdown
When register_graceful_shutdown() is called, a SIGTERM handler is installed that:
- Saves model checkpoint (LoRA weights + optimizer state)
- Saves dataloader position (
consumed_train_samples) - Logs the checkpoint path
- Marks training as
stoppedand exits
Logging
All logs are written to ./auto.log (current working directory):
- Rotated at 5MB with 3 backups
- No console output — all output goes to the log file
- Use
--verbosefor DEBUG level logging