Auto-Research

Mon, 01 Jan 0001 00:00:00 +0000

Twinkle Auto is a terminal-based intelligent training assistant that lets you control, monitor, and debug ML training through natural language. It combines a chat-driven AI agent with an automated health monitor that can detect and fix training failures autonomously.

Architecture Overview

┌──────────────────────────────────────────────────────────┐
│ TwinkleAuto (asyncio chat loop) │
│ │
│ Components: │
│ AgentLoop ─── LLM tool-calling loop │
│ TrainingMonitor ─── periodic health check & auto-fix │
│ LocalConnection ─── file-system based communication │
│ SkillManager ─── async plugin loading │
└──────────────────────────────────────────────────────────┘

Installation & Launch

Auto is part of the twinkle-client package:

pip install twinkle-client

Command-Line Usage

# Basic launch (uses default local Ollama endpoint)
twinkle-auto

# Specify LLM backend
twinkle-auto --llm-base-url http://localhost:11434/v1 --llm-model qwen3.5

# Attach to an existing training run
twinkle-auto --run-id my-grpo-run

# Use a remote API (e.g., OpenAI-compatible)
twinkle-auto --llm-base-url https://api.example.com/v1 --llm-api-key sk-xxx --llm-model gpt-4o

# Enable debug logging
twinkle-auto --verbose

Or run as a Python module:

python -m twinkle_client.auto

CLI Options

Option	Env Var	Default	Description
`--run-id`, `-r`	`TWINKLE_AUTO_RUN_ID`	None	Attach to an existing training run
`--llm-base-url`	`TWINKLE_LLM_BASE_URL`	`http://localhost:11434/v1`	LLM API base URL
`--llm-model`	`TWINKLE_LLM_MODEL`	`qwen3.5`	LLM model name
`--llm-api-key`	`TWINKLE_LLM_API_KEY`	`not-needed`	LLM API key
`--verbose`, `-v`	`TWINKLE_AUTO_VERBOSE`	`False`	Enable DEBUG logging
`--version`, `-V`	—	—	Show version and exit

Chat Agent

The core of Auto is an LLM-powered tool-calling agent (AgentLoop) that processes natural language commands through an OpenAI-compatible API. The agent maintains conversation history with automatic pruning (last 50 messages) and supports up to 10 tool-calling rounds per interaction.

What You Can Say

Training lifecycle:

“List my training runs”
“Start a new GRPO training with Qwen3.5-4B on gsm8k”
“Pause the current run”
“Resume training”
“Stop training”

Server management:

“Start the server with Qwen3.5-4B and a Qwen3.5-72B sampler on 2 GPUs”
“Shut down the server”
“How many GPUs are available?”

Monitoring & analysis:

“How is the training going?”
“Show me the reward-related metrics”
“Zoom into steps 100-200”
“Reset the chart view”

Search:

“Search for math datasets”
“Find Qwen models on ModelScope”

Available Tools

The agent has access to 13 built-in tools:

Tool	Description
`list_training_runs`	List all training runs
`get_training_status`	Get detailed status and recent metrics
`start_server`	Start Ray cluster + Twinkle Server (idempotent)
`shutdown_server`	Shut down server and release GPU resources
`start_training`	Create and launch a new training run
`select_run`	Switch monitoring to a different run
`pause_training`	Pause training (SIGKILL, server retains state)
`resume_training`	Resume by re-launching the client script
`stop_training`	Stop training (SIGTERM, saves checkpoint)
`update_script`	Update training script with version archiving
`list_supported_models`	Query server for available models
`search_datasets`	Search ModelScope for datasets
`search_models`	Search ModelScope for models
`zoom_metrics`	Adjust metrics chart view range
`select_metrics`	Choose which metrics to display (max 4)
`get_cluster_info`	Get GPU/cluster resource info

Server Startup

The start_server tool automates a multi-step pipeline:

GPU detection — nvidia-smi hardware scan
GPU allocation — partition GPUs between training model and samplers
Config generation — auto-create server_config.yaml
Ray cluster startup — multi-node GPU partitioning with isolated CUDA_VISIBLE_DEVICES
Server launch — start Twinkle Server as background process
Health check — poll /api/v1/healthz until ready

Multi-model topology is supported: 1 training model + N sampler/teacher models.

Skills System

Auto supports extensible skill plugins loaded from three sources:

Bundled skills — shipped inside twinkle_client/skills/bundled/
User-local skills — ~/.cache/twinkle/auto/skills/local/
Community skills — fetched from ModelScope (best-effort, 10s timeout)

Skills are loaded asynchronously after startup and injected into the agent’s system prompt. The agent is usable immediately even before skills finish loading.

Training Monitor (Auto-Fix)

The TrainingMonitor is a background service that runs every 30 seconds, collecting all available signals about the current training run and feeding them to the LLM for analysis.

Collected Signals

Process status: alive / dead / unknown
output.log tail: last 1500 chars (prioritizes tracebacks)
Metrics: recent entries + first-half vs second-half trend analysis
Stall duration: seconds since last metric was produced
Current train.py: full script source (for accurate fixes)

Decision Framework

The LLM classifies each check into one of three actions:

Decision	When	Action
LGTM	Training progressing normally	No action
WARNING	Loss plateau, reward hacking, KL explosion, etc.	Relay observation to user
FIX	Script crashed, process dead with traceback	Auto-fix and restart

Auto-Fix Pipeline

When a FIX is needed:

LLM outputs diagnosis + complete fixed script
Monitor archives the old train.py as train_v{N}.py
Writes the fixed script as the new train.py
Re-launches training via resume_training
Resets stall tracking for the new attempt

Safety guardrails:

Max 3 auto-fix attempts per run (prevents infinite retry loops)
Fix attempts are tracked per run_id
Snapshot deduplication avoids re-analyzing unchanged states

File-Based Connection

Auto communicates with training processes through the local filesystem:

~/.cache/twinkle/{run_id}/
├── meta.json — run metadata (model_id, config, status, pid)
├── metrics.jsonl — one JSON object per step (incremental)
├── output.log — combined stdout+stderr from training
├── train.py — current active training script
└── train_v{N}.py — archived previous script versions

Training Control Model

In Server Mode, the Twinkle Server retains all model/optimizer state in GPU memory:

Pause = kill client process (SIGKILL) — server state preserved
Resume = re-launch client script — seamlessly continues training
Stop = SIGTERM — triggers checkpoint saving then exits
Shut down server = releases GPU resources, destroys model state

TrainingRuntime (Script Integration)

Training scripts use TrainingRuntime to integrate with Auto:

from twinkle_client.auto.runtime import TrainingRuntime

rt = TrainingRuntime(run_id='my-grpo-run')
rt.start(model_id='Qwen/Qwen3.5-4B', config={'lr': 1e-5})
rt.register_graceful_shutdown(model, dataloader)

for step, batch in enumerate(dataloader):
 # ... training logic ...
 rt.log_metrics(step=step, loss=loss, reward=reward, grad_norm=gn, lr=lr)
 rt.log(f'Completed step {step}, loss={loss:.4f}')

rt.finish()

Key Methods

Method	Description
`start(model_id, config, script_path)`	Initialize run directory and metadata
`log_metrics(**kwargs)`	Write metrics entry to `metrics.jsonl`
`log(message)`	Print log message (captured as `output.log`)
`get_resume_info()`	Get `last_step` for resuming from checkpoint
`finish(status)`	Mark training as finished, close files
`register_graceful_shutdown(model, dataloader)`	Register SIGTERM handler that saves checkpoint

Resume Support

TrainingRuntime automatically saves training progress to meta.json (throttled to every 5 seconds). Scripts can use get_resume_info() to resume from the last saved step:

rt = TrainingRuntime(run_id='my-run')
resume = rt.get_resume_info()
global_step = resume['last_step']

if global_step > 0:
 dataloader.skip_consumed_samples(global_step * BATCH_SIZE)
 print(f'Resuming from step {global_step}')

Graceful Shutdown

When register_graceful_shutdown() is called, a SIGTERM handler is installed that:

Saves model checkpoint (LoRA weights + optimizer state)
Saves dataloader position (consumed_train_samples)
Logs the checkpoint path
Marks training as stopped and exits

Logging

All logs are written to ./auto.log (current working directory):

Rotated at 5MB with 3 backups
No console output — all output goes to the log file
Use --verbose for DEBUG level logging

SkillProvider

Mon, 01 Jan 0001 00:00:00 +0000

The skill system allows Twinkle Auto’s agent to dynamically load specialized knowledge from external sources (Git repos, APIs, local files) and inject them into the LLM’s system prompt.

Architecture

Class	Role
Skill	Dataclass holding a single skill’s name, content, and source
SkillProvider	Abstract base class for fetching skills from a source
SkillManager	Orchestrates multiple providers, aggregates skills for prompt injection

Skill Dataclass

@dataclasses.dataclass
class Skill:
 name: str # Short identifier (typically filename without extension)
 content: str # Full markdown content
 source: str # Provider name + relative path for traceability

Creating a Custom Provider

Subclass SkillProvider and implement name and fetch():

from twinkle_client.skills.base import SkillProvider

class MySkillProvider(SkillProvider):

 @property
 def name(self) -> str:
 return 'my-skills'

 async def fetch(self) -> None:
 # Download/clone skill files to self.cache_dir
 # e.g., git clone, API download, file copy
 ...

The default load_skills() scans self.cache_dir for .md files (skipping README, LICENSE, etc.) and returns Skill objects.

SkillManager

from twinkle_client.skills.manager import SkillManager

manager = SkillManager()
manager.register(my_provider)
manager.register(another_provider)

# Fetch and load all skills
skills = await manager.load_all()

# Format for LLM system prompt injection
prompt_section = manager.format_for_prompt()

Key Methods

Method	Description
`register(provider)`	Add a skill provider
`load_all()`	Fetch + load from all providers
`format_for_prompt()`	Render skills as formatted text for system prompt
`get_skill_names()`	List names of loaded skills

Cache Directory

By default, skills are cached at ~/.cache/twinkle/auto/skills/<provider_name>/. Override by passing cache_dir to the provider constructor.

Auto | Twinkle

Auto-Research

Architecture Overview

Installation & Launch

Command-Line Usage

CLI Options

Chat Agent

What You Can Say

Available Tools

Server Startup

Skills System

Training Monitor (Auto-Fix)

Collected Signals

Decision Framework

Auto-Fix Pipeline

File-Based Connection

Training Control Model

TrainingRuntime (Script Integration)

Key Methods

Resume Support

Graceful Shutdown

Logging

SkillProvider

Architecture

Skill Dataclass

Creating a Custom Provider

SkillManager

Key Methods

Cache Directory