data_juicer.core.data.data_validator module
- class data_juicer.core.data.data_validator.DataValidator(config: Dict)
Bases: ABC
Base class for data validation
- abstractmethod validate(dataset: DJDataset) -> None
Validate dataset content
- Parameters:
dataset – The dataset to validate
- Raises:
DataValidationError – If validation fails
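The contract described above (construct with a config dict, implement validate(), raise DataValidationError on failure) can be sketched as follows. This is a minimal illustration that uses stand-in classes rather than importing data_juicer: the stand-in base class storing the config as self.config, the NonEmptyTextValidator subclass, its "text_key" option, and the use of a plain list of dicts in place of DJDataset are all assumptions made for the example.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class DataValidationError(Exception):
    """Stand-in for the documented DataValidationError."""


class DataValidator(ABC):
    """Stand-in mirroring the documented interface (not the library class)."""

    def __init__(self, config: Dict[str, Any]):
        # Assumption: the config dict is simply kept on the instance.
        self.config = config

    @abstractmethod
    def validate(self, dataset: List[Dict[str, Any]]) -> None:
        """Raise DataValidationError if the dataset is invalid."""


class NonEmptyTextValidator(DataValidator):
    """Hypothetical validator: every sample needs a non-empty text field."""

    def validate(self, dataset: List[Dict[str, Any]]) -> None:
        key = self.config.get("text_key", "text")  # hypothetical config key
        for index, sample in enumerate(dataset):
            value = sample.get(key)
            if not isinstance(value, str) or not value.strip():
                raise DataValidationError(
                    f"Sample {index}: field '{key}' is missing or empty"
                )


def is_valid(validator: DataValidator, dataset: List[Dict[str, Any]]) -> bool:
    """Turn the raise-on-failure contract into a boolean for callers."""
    try:
        validator.validate(dataset)
    except DataValidationError:
        return False
    return True
```

Because validate() is abstract, the base class itself cannot be instantiated; only concrete subclasses that implement it can.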
- exception data_juicer.core.data.data_validator.DataValidationError
Bases: Exception
Custom exception for data validation errors
- class data_juicer.core.data.data_validator.DataValidatorRegistry
Bases: object
Registry for data validators
- classmethod get_validator(validator_type: str) -> Type[DataValidator] | None
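The signature above suggests a name-to-class lookup that yields None for an unknown type. The following sketch shows one common way such a registry is built and consumed. It is a stand-in, not the library's code: the register() decorator, the "required_fields" key, and the build_validator() helper are hypothetical names introduced for illustration.

```python
from typing import Callable, Dict, Optional, Type


class DataValidator:
    """Minimal stand-in for the documented base class."""

    def __init__(self, config: dict):
        self.config = config


class ValidatorRegistry:
    """Illustrative registry mapping a type name to a validator class."""

    _validators: Dict[str, Type[DataValidator]] = {}

    @classmethod
    def register(
        cls, validator_type: str
    ) -> Callable[[Type[DataValidator]], Type[DataValidator]]:
        """Hypothetical decorator that records a class under a name."""

        def decorator(validator_cls: Type[DataValidator]) -> Type[DataValidator]:
            cls._validators[validator_type] = validator_cls
            return validator_cls

        return decorator

    @classmethod
    def get_validator(cls, validator_type: str) -> Optional[Type[DataValidator]]:
        """Return the registered class, or None if the name is unknown."""
        return cls._validators.get(validator_type)


@ValidatorRegistry.register("required_fields")  # hypothetical registry key
class RequiredFieldsValidator(DataValidator):
    pass


def build_validator(validator_type: str, config: dict) -> DataValidator:
    """Hypothetical helper: look up a class and instantiate it with config."""
    validator_cls = ValidatorRegistry.get_validator(validator_type)
    if validator_cls is None:
        raise ValueError(f"Unknown validator type: {validator_type}")
    return validator_cls(config)
```

Since the lookup returns a class rather than an instance, callers should check for None before instantiating, as build_validator() does here.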
- class data_juicer.core.data.data_validator.BaseConversationValidator(config: Dict)
Bases: DataValidator
Base class for conversation validators
- class data_juicer.core.data.data_validator.SwiftMessagesValidator(config: Dict)
Bases: BaseConversationValidator
Validator for Swift Messages conversation format.
This validator ensures conversations follow the Swift Messages format with proper message structure and role assignments.
- Parameters:
config (Dict) – Configuration dictionary containing:
  - min_turns (int, optional): Minimum number of messages. Defaults to 1.
  - max_turns (int, optional): Maximum number of messages. Defaults to 100.
  - sample_size (int, optional): Number of samples to validate. Defaults to 100.
- Example Format:
    {
        "messages": [
            {"role": "system", "content": "<system>"},
            {"role": "user", "content": "<query>"},
            {"role": "assistant", "content": "<response>"},
            ...
        ]
    }
- Raises:
DataValidationError – If validation fails due to:
  - Missing ‘messages’ field
  - Invalid message structure
  - Invalid role values
  - Missing content
  - Message count outside allowed range
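The failure conditions listed above can be illustrated with a per-sample check. This is a sketch under stated assumptions, not the validator's actual code: the function name check_swift_sample is hypothetical, the allowed role set is taken only from the example format (the real validator may accept more roles), and sample_size handling is left out.

```python
from typing import Any, Dict

# Assumption: only the roles shown in the example format are allowed.
ALLOWED_ROLES = {"system", "user", "assistant"}


class DataValidationError(Exception):
    """Stand-in for the documented DataValidationError."""


def check_swift_sample(
    sample: Dict[str, Any], min_turns: int = 1, max_turns: int = 100
) -> None:
    """Raise DataValidationError if one sample breaks the messages format."""
    if "messages" not in sample:
        raise DataValidationError("Missing 'messages' field")

    messages = sample["messages"]
    if not isinstance(messages, list):
        raise DataValidationError("'messages' must be a list")

    # Message count must fall inside [min_turns, max_turns].
    if not min_turns <= len(messages) <= max_turns:
        raise DataValidationError(
            f"Message count {len(messages)} outside [{min_turns}, {max_turns}]"
        )

    for index, message in enumerate(messages):
        if not isinstance(message, dict):
            raise DataValidationError(f"Message {index} is not a dict")
        role = message.get("role")
        if not isinstance(role, str) or role not in ALLOWED_ROLES:
            raise DataValidationError(f"Message {index} has invalid role: {role!r}")
        if not isinstance(message.get("content"), str):
            raise DataValidationError(f"Message {index} is missing string content")
```

A full validator would run a check like this over up to sample_size rows of the dataset; how those rows are selected is not specified here.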
- class data_juicer.core.data.data_validator.DataJuicerFormatValidator(config: Dict)
Bases: BaseConversationValidator
Validator for Data-Juicer default conversation format.
This validator ensures conversations follow the Data-Juicer format with proper fields and structure.
- Parameters:
config (Dict) – Configuration dictionary containing:
  - min_turns (int, optional): Minimum number of conversation turns. Defaults to 1.
  - max_turns (int, optional): Maximum number of conversation turns. Defaults to 100.
  - sample_size (int, optional): Number of samples to validate. Defaults to 100.
- Example Format:
    {
        "system": "<system>",  # Optional
        "instruction": "<query-inst>",
        "query": "<query2>",
        "response": "<response2>",
        "history": [  # Optional
            ["<query1>", "<response1>"],
            ...
        ]
    }
- Raises:
DataValidationError – If validation fails due to:
  - Missing required fields
  - Invalid field types
  - Invalid conversation structure
  - Turn count outside allowed range
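As an illustration of the checks listed above, the sketch below validates one sample in the Data-Juicer conversation format. Several points are assumptions rather than documented behavior: the function name check_dj_sample is hypothetical, "instruction", "query", and "response" are treated as required because the example marks only "system" and "history" as optional, and the turn count is taken to be the number of history pairs plus the current query/response pair.

```python
from typing import Any, Dict

REQUIRED_FIELDS = ("instruction", "query", "response")  # assumption, see lead-in


class DataValidationError(Exception):
    """Stand-in for the documented DataValidationError."""


def check_dj_sample(
    sample: Dict[str, Any], min_turns: int = 1, max_turns: int = 100
) -> int:
    """Validate one sample and return its turn count (as counted here)."""
    for field in REQUIRED_FIELDS:
        if field not in sample:
            raise DataValidationError(f"Missing required field: {field}")
        if not isinstance(sample[field], str):
            raise DataValidationError(f"Field '{field}' must be a string")

    if "system" in sample and not isinstance(sample["system"], str):
        raise DataValidationError("Field 'system' must be a string")

    history = sample.get("history", [])
    if not isinstance(history, list):
        raise DataValidationError("Field 'history' must be a list")
    for index, turn in enumerate(history):
        # Each history entry is expected to be a [query, response] pair.
        is_pair = isinstance(turn, (list, tuple)) and len(turn) == 2
        if not is_pair or not all(isinstance(part, str) for part in turn):
            raise DataValidationError(
                f"History entry {index} must be a [query, response] pair of strings"
            )

    # Assumption: past turns plus the current query/response pair.
    turns = len(history) + 1
    if not min_turns <= turns <= max_turns:
        raise DataValidationError(
            f"Turn count {turns} outside [{min_turns}, {max_turns}]"
        )
    return turns
```

Returning the turn count is a convenience for this example only; the documented validate() method returns None.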
- class data_juicer.core.data.data_validator.CodeDataValidator(config: Dict)
Bases: DataValidator
Validator for code data
- validate(dataset: DJDataset) -> None
Validate dataset content
- Parameters:
dataset – The dataset to validate
- Raises:
DataValidationError – If validation fails
- class data_juicer.core.data.data_validator.RequiredFieldsValidator(config: Dict)
Bases: DataValidator
Validator that checks for required fields in dataset.
This validator ensures that specified fields exist in the dataset and optionally checks their types and missing value ratios.
- Parameters:
config (Dict) – Configuration dictionary containing:
  - required_fields (List[str]): List of field names that must exist.
  - field_types (Dict[str, type], optional): Map of field names to expected types.
  - allow_missing (float, optional): Maximum ratio of missing values allowed. Defaults to 0.0.
- Example Config:
    {
        "required_fields": ["field1", "field2"],
        "field_types": {"field1": str, "field2": int},
        "allow_missing": 0.0
    }
- Raises:
DataValidationError – If validation fails
- __init__(config: Dict)
Initialize validator with config
- Parameters:
config – Dict containing:
  - required_fields: List of field names that must exist
  - field_types: Optional map of field names to expected types
  - allow_missing: Optional float giving the maximum allowed ratio of missing values
- validate(dataset: DJDataset) -> None
Validate dataset has required fields with correct types
- Parameters:
dataset – NestedDataset or RayDataset to validate
- Raises:
DataValidationError – If validation fails
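To make the interaction of required_fields, field_types, and allow_missing concrete, here is a minimal sketch that applies the same three checks to a plain list of dict rows. It is not the library implementation: the function name check_required_fields is hypothetical, a list of dicts stands in for NestedDataset or RayDataset, and a value of None is assumed to be what counts as "missing".

```python
from typing import Any, Dict, List


class DataValidationError(Exception):
    """Stand-in for the documented DataValidationError."""


def check_required_fields(rows: List[Dict[str, Any]], config: Dict[str, Any]) -> None:
    """Raise DataValidationError if rows violate the required-fields config."""
    required = config["required_fields"]
    field_types = config.get("field_types", {})
    allow_missing = config.get("allow_missing", 0.0)

    for field in required:
        # 1. The field must exist somewhere in the dataset.
        if not any(field in row for row in rows):
            raise DataValidationError(f"Missing required field: {field}")

        # 2. The ratio of missing values must not exceed allow_missing.
        #    Assumption: an absent key or a None value counts as missing.
        missing = sum(1 for row in rows if row.get(field) is None)
        ratio = missing / len(rows)
        if ratio > allow_missing:
            raise DataValidationError(
                f"Field '{field}' missing ratio {ratio:.2f} exceeds {allow_missing:.2f}"
            )

        # 3. Non-missing values must match the expected type, if one is given.
        expected = field_types.get(field)
        if expected is not None:
            for index, row in enumerate(rows):
                value = row.get(field)
                if value is not None and not isinstance(value, expected):
                    raise DataValidationError(
                        f"Row {index}: field '{field}' is not of type {expected.__name__}"
                    )
```

With the default allow_missing of 0.0, a single missing value in a required field is enough to fail validation; raising the threshold tolerates a bounded share of gaps.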