Data Annotation Module¶
The Data Annotation Module is built on Label Studio, providing efficient and flexible data annotation solutions for machine learning projects. It supports multiple annotation scenarios and is particularly suitable for data preparation in reward model and dialogue system projects.
1. Overview¶
- Deep Label Studio Integration: Uses Label Studio as the annotation interface, providing an intuitive and user-friendly annotation experience
- Multi-scenario Data Support: Comprehensive support for dialogue annotation, quality scoring, preference ranking, and other annotation tasks
- Quick Deployment: Provides both Docker and pip deployment options with one-click service startup
- Templated Annotation Configuration: Built-in annotation templates like RewardBench for out-of-the-box usage
- Seamless Ecosystem Integration: Deep integration with RM Gallery data processing pipeline for smooth data flow
- Enterprise-level Batch Processing: Supports large-scale batch annotation, export, and management
2. Application Scenarios¶
This module is primarily suitable for the following machine learning data preparation scenarios:
- Reward Model Training Data Annotation - Preparing high-quality preference data for reward model training
- Dialogue System Quality Assessment - Evaluating and improving the output quality of dialogue models
- Preference Learning Data Preparation - Building comparison datasets for preference learning
- Text Classification and Sentiment Analysis - Preparing annotated data for supervised learning tasks
3. Quick Start¶
3.1. Environment Setup¶
Ensure the following dependency is installed (for example, with pip install label-studio==1.17.0):
label_studio==1.17.0
3.2. Start Label Studio Annotation Service¶
Use the following commands to start the annotation service:
# Start using Docker (recommended)
python ./rm_gallery/core/data/annotation/server.py start
# Start using pip
python ./rm_gallery/core/data/annotation/server.py start --use-pip
# Check service status
python ./rm_gallery/core/data/annotation/server.py status
# Stop annotation service
python ./rm_gallery/core/data/annotation/server.py stop
After successful startup, the console will display:
============================================================
🚀 Label Studio Successfully Started!
============================================================
🌐 Web Interface: http://localhost:8080
📧 Username: admin@rmgallery.com
🔐 Password: RM-Gallery
📁 Data Directory: ./log/label_studio_logs
🐳 Deployment: Pip
============================================================
3.3. Verify Service Status¶
Run the status check command to confirm the service is running normally:
==================================================
📊 Label Studio Status
==================================================
🌐 Server URL: http://localhost:8080
🚀 Deployment: PIP
🔌 Port: 8080
✅ Running
📁 Data Dir: ./log/label_studio_logs
👤 Username: admin@rmgallery.com
🔄 Process PIDs: 65727
🔌 Port PIDs: 65727
==================================================
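You can also verify the service programmatically; a minimal sketch, assuming Label Studio's standard /health endpoint is available:
# Optional programmatic health check against the running Label Studio instance
import requests

resp = requests.get("http://localhost:8080/health", timeout=5)
print(resp.status_code, resp.text)  # HTTP 200 indicates the service is up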
3.4. Obtain API Token¶
After completing service startup, follow these steps to obtain the API Token:
- Visit http://localhost:8080 in your browser
- Login using the following credentials:
  - Username: admin@rmgallery.com
  - Password: RM-Gallery
- Click "Organization" in the left navigation bar and go to API Tokens Settings
- Set both Personal Access Tokens and Legacy Tokens to True
- Click the user avatar in the top right corner and select "Account & Settings"
- Copy the token value from the Access Token section - this is the API Token you'll need later
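To confirm the token works before wiring it into the pipeline, you can query the Label Studio REST API directly. A minimal sketch; legacy tokens use the "Token" authorization scheme, while personal access tokens may require "Bearer" instead:
# Quick token sanity check: list projects via the Label Studio API
import requests

API_TOKEN = ""  # paste the token copied above

resp = requests.get(
    "http://localhost:8080/api/projects/",
    headers={"Authorization": f"Token {API_TOKEN}"},
)
print(resp.status_code)  # 200 means the token is accepted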
4. Complete Usage Example¶
The following complete example demonstrates how to use the data annotation module, including the full workflow of data import, project creation, annotation execution, and result export.
"""
Step 1: Import data and create annotation project
"""
from rm_gallery.core.data.annotation.annotation import create_annotator
from rm_gallery.core.data.load.base import create_loader
import rm_gallery.core.data # Core strategy registration
import rm_gallery.gallery.data # Extension strategy registration
# Replace with your actual API Token obtained from Label Studio
API_TOKEN = ""
# Step 1.1: Configure data loading parameters
load_config = {
    "path": "../../../data/reward-bench-2/data/test-00000-of-00001.parquet",  # Replace with actual data path
    "limit": 1000,  # Limit the number of records loaded to avoid loading too much data for initial testing
}
# Step 1.2: Create data loader
loader = create_loader(
    name="rewardbench2",         # Dataset identifier name
    load_strategy_type="local",  # Use local file loading strategy
    data_source="rewardbench2",  # Specify data source format converter
    config=load_config,          # Pass loading configuration parameters
)
# Step 1.3: Execute data loading
print("Loading data...")
dataset = loader.run()
print(f"Data loading completed, {len(dataset.datasamples)} records total")
# Step 1.4: Configure annotation project parameters
annotation_config = {
    "server_url": "http://localhost:8080",  # Label Studio service address
    "api_token": API_TOKEN,  # API access token
    "project_title": "RM Gallery Quality Annotation",  # Project display name
    "template_name": "rewardbench2",  # Use built-in RewardBench2 template
    "project_description": "Data quality annotation project based on RewardBench2 template",
}
# Step 1.5: Create annotation module instance
annotation_module = create_annotator(
    name="rm_gallery_annotation",
    **annotation_config,
)
# Step 1.6: Create annotation project and import data
print("Creating annotation project...")
result = annotation_module.run(dataset, create_new_project=True)
if result:
    project_url = f"{result.metadata['annotation_server_url']}/projects/{result.metadata['annotation_project_id']}"
    print("✅ Annotation project created successfully!")
    print(f"🌐 Project URL: {project_url}")
    print(f"📊 Project ID: {result.metadata['annotation_project_id']}")
else:
    print("❌ Failed to create annotation project, please check configuration and network connection")
"""
Step 2: Perform data annotation
After the project is created successfully, you can:
1. Visit the project URL output above
2. Login using admin@rmgallery.com / RM-Gallery
3. Annotate data in the annotation interface
4. After annotation is complete, run the next step to export annotation results
Note: In actual usage, you need to manually complete the annotation work, then run the export code below.
"""
# This is a placeholder for annotation operations
# Actual annotation work needs to be completed in the Label Studio web interface
print("📝 Please complete data annotation work in the Label Studio interface")
print("💡 After annotation is complete, run the code below to export results")
"""
Step 3: Export annotation results
"""
from rm_gallery.core.data.annotation.annotation import create_annotator
from rm_gallery.core.data.export import create_exporter
import rm_gallery.core.data # Core strategy registration
import rm_gallery.gallery.data # Extension strategy registration
# Use the same API Token as when creating the project
API_TOKEN = ""
# Step 3.1: Recreate annotation module instance
annotation_module = create_annotator(
    name="rm_gallery_annotation",
    template_name="rewardbench2",
    api_token=API_TOKEN,
)
# Step 3.2: Set project ID (obtained from Step 1 output)
annotation_module.project_id = 3 # Replace with actual project_id
# Step 3.3: Export annotation data from Label Studio
print("Exporting annotation data...")
try:
    annotated_dataset = annotation_module.export_annotations_to_dataset()
    print(f"✅ Annotation data exported successfully, {len(annotated_dataset.datasamples)} annotated records total")
except Exception as e:
    print(f"❌ Export failed: {e}")
    annotated_dataset = None
# Step 3.4: Configure file exporter
if annotated_dataset:
    exporter = create_exporter(
        name="annotation_exporter",
        config={
            "output_dir": "./exports",  # Export file storage directory
            "formats": ["jsonl"],  # Multiple export formats are supported
        },
    )

    # Step 3.5: Execute data export
    print("Saving annotation results to file...")
    export_result = exporter.run(annotated_dataset)

    if export_result:
        print("✅ Annotation results saved to ./exports directory")
        print(f"📊 Dataset info: {annotated_dataset.name}")
        print(f"📁 Contains {len(annotated_dataset.datasamples)} annotated records")
    else:
        print("❌ File export failed")

    # Display a partial data preview
    if annotated_dataset.datasamples:
        print("\n📋 Data preview:")
        sample = annotated_dataset.datasamples[0]
        print(f" - Sample ID: {sample.unique_id}")
        print(f" - Annotation status: {sample.metadata.get('annotation_status', 'unknown')}")
        print(f" - Output count: {len(sample.output) if sample.output else 0}")
5. Built-in Annotation Templates¶
The system provides the following pre-configured annotation templates for out-of-the-box usage, located in rm_gallery/gallery/data/annotation/:
Template Name | Template ID | Source | Description
---|---|---|---
RewardBenchAnnotationTemplate | rewardbench | RewardBench | Supports 2-choice quality scoring and ranking
RewardBench2AnnotationTemplate | rewardbench2 | RewardBench2 | Supports 4-choice quality scoring and ranking
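The Template ID is the value you pass as template_name when creating an annotator. For example, to use the 2-choice RewardBench template with the same configuration style as the example in Section 4 (server URL and token are placeholders):
# Select a built-in template by its Template ID
from rm_gallery.core.data.annotation.annotation import create_annotator

API_TOKEN = ""  # token obtained in Section 3.4

annotator = create_annotator(
    name="rewardbench_annotation",
    server_url="http://localhost:8080",
    api_token=API_TOKEN,
    project_title="RewardBench Annotation",
    template_name="rewardbench",  # Template ID from the table above
)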
6. Custom Annotation Template Development¶
If the built-in templates don't meet your annotation needs, you can develop custom templates following these steps:
Step 1: Create Template Class¶
Create a new template file in the rm_gallery/gallery/data/annotation/ directory, inheriting from the BaseAnnotationTemplate base class:
from rm_gallery.core.data.annotation.template import BaseAnnotationTemplate, AnnotationTemplateRegistry
@AnnotationTemplateRegistry.register("custom_template")
class CustomAnnotationTemplate(BaseAnnotationTemplate):
    @property
    def label_config(self) -> str:
        """
        Define the Label Studio annotation interface configuration
        using Label Studio's XML configuration syntax
        """
        return """
        <View>
          <Text name="question" value="$question"/>
          <Choices name="quality" toName="question" choice="single-radio">
            <Choice value="excellent" background="green"/>
            <Choice value="good" background="blue"/>
            <Choice value="fair" background="yellow"/>
            <Choice value="poor" background="red"/>
          </Choices>
          <Rating name="score" toName="question" maxRating="10"/>
          <TextArea name="comments" toName="question"
                    placeholder="Please enter evaluation reason..." rows="3"/>
        </View>
        """

    def process_annotations(self, annotation_data):
        """
        Process annotation data obtained from Label Studio,
        converting the raw annotation data into a structured format
        """
        processed_data = {
            "quality_rating": annotation_data.get("choices", {}).get("quality", {}).get("choices", []),
            "numerical_score": annotation_data.get("rating", {}).get("score", {}).get("rating", 0),
            "textual_feedback": annotation_data.get("textarea", {}).get("comments", {}).get("text", [""])[0],
        }
        return processed_data
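For reference, here is a hypothetical annotation_data payload shaped to match the .get(...) chains above. The exact structure depends on how the annotation module flattens Label Studio results, so treat this as an illustration of the access pattern rather than a guaranteed wire format:
# Hypothetical payload matching the accessors in process_annotations above
annotation_data = {
    "choices": {"quality": {"choices": ["good"]}},
    "rating": {"score": {"rating": 8}},
    "textarea": {"comments": {"text": ["Clear and well structured."]}},
}

# process_annotations(annotation_data) would then return:
# {
#     "quality_rating": ["good"],
#     "numerical_score": 8,
#     "textual_feedback": "Clear and well structured.",
# }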
Step 2: Register Template¶
Import your template class in the rm_gallery/gallery/data/__init__.py file to complete registration:
# Import custom annotation template
from rm_gallery.gallery.data.annotation.custom_template import CustomAnnotationTemplate
Step 3: Use Custom Template¶
When creating an annotation module, specify your custom template name:
annotation_module = create_annotator(
    name="custom_annotation_project",
    template_name="custom_template",  # Use your registered template name
    # ... other configuration parameters
)
7. Data Format Specifications¶
Input Data Requirements¶
- DataSample Standard Format
Output Data Format¶
- Annotation Result Integration: Annotation data is automatically added to the label field of DataSample
- Original Data Protection: Maintains the integrity and original structure of the input data
- Rich Metadata: Includes annotation project ID, annotation status, timestamps, and other tracking information
- Multi-format Export: Supports JSON, JSONL, and other export formats (see the sketch below for reading a JSONL export)
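A minimal sketch for inspecting an exported JSONL file, assuming the exporter writes one JSON object per line into ./exports (the actual filename is chosen by the exporter):
# Read back the annotation results exported in Section 4
import json
from pathlib import Path

export_file = next(Path("./exports").glob("*.jsonl"))  # pick the first JSONL export
with export_file.open() as f:
    records = [json.loads(line) for line in f]

print(f"{len(records)} annotated records in {export_file.name}")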
Data Structure Example¶
# Annotated DataSample structure example
data_sample = DataSample(
    unique_id="sample_001",
    input=[...],   # Original input data
    output=[...],  # Original output data (if any)
    metadata={
        "annotation_status": "completed",
        "annotation_project_id": 123,
        "annotator_id": "user@example.com",
    },
)

# The label field in each output contains annotation results
output.label = {
    "annotation_data": {
        "ratings": {...},     # Rating data
        "choices": {...},     # Choice data
        "text_areas": {...},  # Text input data
    },
    "processed": {...},  # Structured data processed by template
}
8. Troubleshooting Guide¶
8.1. Common Issues and Solutions¶
8.1.1. Label Studio Service Startup Failure¶
Problem: Service cannot start or is inaccessible after startup
Solution Steps:
# Check if port is occupied
lsof -i :8080
# If port is occupied, start with different port
python ./rm_gallery/core/data/annotation/server.py start --port 8081
# View detailed startup logs
python ./rm_gallery/core/data/annotation/server.py start --data-dir ./custom_log --verbose
# Clean previous data directory (use with caution)
rm -rf ./log/label_studio_logs
8.1.2. API Token Acquisition Failure¶
Problem: Cannot find API Token in the interface
Solution Steps:
- Ensure correct login to Label Studio interface
- Check user permission settings, ensure admin privileges
- Enable API Tokens feature in Organization settings
- If still unable to obtain, try recreating user account
8.1.3. Data Import Failure¶
Problem: Project created successfully but data cannot be imported
Solution Steps:
- Check if data format meets requirements
- Verify API Token correctness
- Confirm normal network connection
- Check if data size exceeds limits
8.1.4. Annotation Result Export Exception¶
Problem: Exported data is incomplete or format is abnormal
Solution Steps:
- Confirm all data has been annotated
- Check write permissions for export path
- Verify correct implementation of template's process_annotations method
8.2. Getting Technical Support¶
If you encounter unresolvable issues, you can get help through the following methods:
- View log files: Inspect the log files in the ./log/label_studio_logs/ directory
- Check system status: Run python ./rm_gallery/core/data/annotation/server.py status
- Restart service: Completely stop the service and restart
- Community support: Submit an Issue in the project repository with detailed error information and reproduction steps