data_juicer.core.data.data_validator module

class data_juicer.core.data.data_validator.DataValidator(config: Dict)[source]

Bases: ABC

Base class for data validation

__init__(config: Dict)[source]
abstractmethod validate(dataset: DJDataset) → None[source]

Validate dataset content

Parameters:

dataset – The dataset to validate

Raises:

DataValidationError – If validation fails
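The contract above (subclass the base, implement validate, raise DataValidationError on failure) can be illustrated with a minimal sketch. The classes below are local stand-ins that mirror the documented signatures so the example runs without the library; NonEmptyValidator is a hypothetical validator, and a plain list of dicts stands in for DJDataset.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class DataValidationError(Exception):
    """Local stand-in for the documented exception."""


class DataValidator(ABC):
    """Local stand-in mirroring the documented base class."""

    def __init__(self, config: Dict):
        self.config = config

    @abstractmethod
    def validate(self, dataset: Any) -> None:
        """Return None on success, raise DataValidationError on failure."""


class NonEmptyValidator(DataValidator):
    """Hypothetical validator that rejects empty datasets."""

    def validate(self, dataset: List[Dict]) -> None:
        if len(dataset) == 0:
            raise DataValidationError("dataset is empty")
```

Because validate is abstract, the base class itself cannot be instantiated; only subclasses that implement it can.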

exception data_juicer.core.data.data_validator.DataValidationError[source]

Bases: Exception

Custom exception for data validation errors

class data_juicer.core.data.data_validator.DataValidatorRegistry[source]

Bases: object

Registry for data validators

classmethod register(validator_type: str)[source]
classmethod get_validator(validator_type: str) → Type[DataValidator] | None[source]
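
The two signatures above suggest a name-to-class lookup. The following is a sketch of that pattern, assuming register is intended as a class decorator (this page does not say so) and that get_validator returns None for unknown names, as its return type indicates. ValidatorRegistrySketch and DemoValidator are hypothetical names, not part of the library.

```python
from typing import Callable, Dict, Optional, Type


class DataValidator:
    """Minimal local stand-in for the documented base class."""

    def __init__(self, config: Dict):
        self.config = config


class ValidatorRegistrySketch:
    """Illustrative registry; not the library's implementation."""

    _validators: Dict[str, Type[DataValidator]] = {}

    @classmethod
    def register(
        cls, validator_type: str
    ) -> Callable[[Type[DataValidator]], Type[DataValidator]]:
        # Assumption: register is used as a decorator factory.
        def decorator(validator_cls: Type[DataValidator]) -> Type[DataValidator]:
            cls._validators[validator_type] = validator_cls
            return validator_cls

        return decorator

    @classmethod
    def get_validator(cls, validator_type: str) -> Optional[Type[DataValidator]]:
        # Unknown names yield None rather than raising.
        return cls._validators.get(validator_type)


@ValidatorRegistrySketch.register("demo")
class DemoValidator(DataValidator):
    """Hypothetical validator registered under the name 'demo'."""
```

Callers would then look up a class by name and check for None before instantiating it with a config dictionary.
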
class data_juicer.core.data.data_validator.BaseConversationValidator(config: Dict)[source]

Bases: DataValidator

Base class for conversation validators

__init__(config: Dict)[source]
validate(dataset: DJDataset) → None[source]

Base validation for all conversation formats

abstractmethod validate_conversation(data: Dict) → None[source]

Validate specific conversation format
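
The split between validate and validate_conversation resembles a template method: the base class walks the dataset and each subclass checks one record. The sketch below shows one way this could work, assuming validate inspects only the first sample_size records (the sample_size option is documented for the subclasses; how the library actually samples is not stated here). ConversationValidatorSketch and HasTextValidator are hypothetical, and a list of dicts stands in for DJDataset.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class DataValidationError(Exception):
    """Local stand-in for the documented exception."""


class ConversationValidatorSketch(ABC):
    """Illustrative base: shared loop here, per-record check in subclasses."""

    def __init__(self, config: Dict):
        # Default of 100 follows the documented sample_size default.
        self.sample_size = config.get("sample_size", 100)

    def validate(self, dataset: List[Dict]) -> None:
        # Assumption: only the first sample_size records are checked.
        for index, sample in enumerate(dataset[: self.sample_size]):
            try:
                self.validate_conversation(sample)
            except DataValidationError as exc:
                raise DataValidationError(f"sample {index}: {exc}") from exc

    @abstractmethod
    def validate_conversation(self, data: Dict) -> None:
        """Check a single record; raise DataValidationError if invalid."""


class HasTextValidator(ConversationValidatorSketch):
    """Hypothetical subclass requiring a 'text' key in every record."""

    def validate_conversation(self, data: Dict) -> None:
        if "text" not in data:
            raise DataValidationError("missing 'text'")
```
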

class data_juicer.core.data.data_validator.SwiftMessagesValidator(config: Dict)[source]

Bases: BaseConversationValidator

Validator for Swift Messages conversation format.

This validator ensures conversations follow the Swift Messages format with proper message structure and role assignments.

Parameters:

config (Dict) – Configuration dictionary containing:

- min_turns (int, optional): Minimum number of messages. Defaults to 1.
- max_turns (int, optional): Maximum number of messages. Defaults to 100.
- sample_size (int, optional): Number of samples to validate. Defaults to 100.

Example Format:
{
    "messages": [
        {"role": "system", "content": "<system>"},
        {"role": "user", "content": "<query>"},
        {"role": "assistant", "content": "<response>"},
        ...
    ]
}
Raises:

DataValidationError – If validation fails due to:

- Missing ‘messages’ field
- Invalid message structure
- Invalid role values
- Missing content
- Message count outside allowed range

validate_conversation(data: Dict) → None[source]

Validate specific conversation format
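
The failure conditions listed for SwiftMessagesValidator can be sketched as a standalone function. This is an illustration, not the library's code: check_swift_messages is a hypothetical name, the allowed roles are assumed from the example format (the full set accepted by the library is not listed on this page), and min_turns/max_turns are assumed to bound the number of messages as the parameter descriptions state.

```python
from typing import Dict


class DataValidationError(Exception):
    """Local stand-in for the documented exception."""


# Assumption: roles taken from the example format only.
ALLOWED_ROLES = {"system", "user", "assistant"}


def check_swift_messages(data: Dict, min_turns: int = 1, max_turns: int = 100) -> None:
    """Raise DataValidationError if `data` breaks the documented format."""
    if "messages" not in data:
        raise DataValidationError("missing 'messages' field")
    messages = data["messages"]
    if not isinstance(messages, list):
        raise DataValidationError("'messages' must be a list")
    if not min_turns <= len(messages) <= max_turns:
        raise DataValidationError(
            f"message count {len(messages)} outside [{min_turns}, {max_turns}]"
        )
    for index, message in enumerate(messages):
        if not isinstance(message, dict):
            raise DataValidationError(f"message {index} is not a dict")
        role = message.get("role")
        if not isinstance(role, str) or role not in ALLOWED_ROLES:
            raise DataValidationError(f"message {index} has invalid role {role!r}")
        if "content" not in message:
            raise DataValidationError(f"message {index} is missing 'content'")
```
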

class data_juicer.core.data.data_validator.DataJuicerFormatValidator(config: Dict)[source]

Bases: BaseConversationValidator

Validator for Data-Juicer default conversation format.

This validator ensures conversations follow the Data-Juicer format with proper fields and structure.

Parameters:

config (Dict) – Configuration dictionary containing:

- min_turns (int, optional): Minimum number of conversation turns. Defaults to 1.
- max_turns (int, optional): Maximum number of conversation turns. Defaults to 100.
- sample_size (int, optional): Number of samples to validate. Defaults to 100.

Example Format:
{
    "system": "<system>",  # Optional
    "instruction": "<query-inst>",
    "query": "<query2>",
    "response": "<response2>",
    "history": [  # Optional
        ["<query1>", "<response1>"],
        ...
    ]
}
Raises:

DataValidationError – If validation fails due to:

- Missing required fields
- Invalid field types
- Invalid conversation structure
- Turn count outside allowed range

validate_conversation(data: Dict) → None[source]

Validate specific conversation format
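
A comparable sketch for the Data-Juicer format follows. Several points are assumptions rather than documented behavior: instruction, query, and response are treated as required because only system and history are marked optional in the example; every field is expected to be a string; and the turn count is taken as the number of history pairs plus the current query/response pair. check_dj_conversation is a hypothetical name.

```python
from typing import Dict


class DataValidationError(Exception):
    """Local stand-in for the documented exception."""


def check_dj_conversation(data: Dict, min_turns: int = 1, max_turns: int = 100) -> None:
    """Raise DataValidationError if `data` breaks the example format."""
    # Assumption: these three fields are required and must be strings.
    for field in ("instruction", "query", "response"):
        if field not in data:
            raise DataValidationError(f"missing required field {field!r}")
        if not isinstance(data[field], str):
            raise DataValidationError(f"field {field!r} must be a string")
    if "system" in data and not isinstance(data["system"], str):
        raise DataValidationError("field 'system' must be a string")

    history = data.get("history", [])
    if not isinstance(history, list):
        raise DataValidationError("field 'history' must be a list")
    for index, turn in enumerate(history):
        is_pair = isinstance(turn, (list, tuple)) and len(turn) == 2
        if not is_pair or not all(isinstance(part, str) for part in turn):
            raise DataValidationError(
                f"history[{index}] must be a [query, response] pair of strings"
            )

    # Assumption: past turns plus the current query/response pair.
    turns = len(history) + 1
    if not min_turns <= turns <= max_turns:
        raise DataValidationError(
            f"turn count {turns} outside [{min_turns}, {max_turns}]"
        )
```
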

class data_juicer.core.data.data_validator.CodeDataValidator(config: Dict)[source]

Bases: DataValidator

Validator for code data

__init__(config: Dict)[source]
validate(dataset: DJDataset) → None[source]

Validate dataset content

Parameters:

dataset – The dataset to validate

Raises:

DataValidationError – If validation fails

class data_juicer.core.data.data_validator.RequiredFieldsValidator(config: Dict)[source]

Bases: DataValidator

Validator that checks for required fields in dataset.

This validator ensures that specified fields exist in the dataset and optionally checks their types and missing value ratios.

Parameters:

config (Dict) – Configuration dictionary containing:

- required_fields (List[str]): List of field names that must exist.
- field_types (Dict[str, type], optional): Map of field names to expected types.
- allow_missing (float, optional): Maximum ratio of missing values allowed. Defaults to 0.0.

Example Config:
{
    "required_fields": ["field1", "field2"],
    "field_types": {"field1": str, "field2": int},
    "allow_missing": 0.0
}
Raises:

DataValidationError – If validation fails

__init__(config: Dict)[source]

Initialize validator with config

Parameters:

config – Dict containing:

- required_fields: List of field names that must exist
- field_types: Optional map of field names to expected types
- allow_missing: Optional float giving the maximum allowed ratio of missing values

validate(dataset: DJDataset) → None[source]

Validate dataset has required fields with correct types

Parameters:

dataset – NestedDataset or RayDataset to validate

Raises:

DataValidationError – If validation fails
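
How the three RequiredFieldsValidator options interact can be illustrated on plain Python data. The sketch below is not the library's implementation: check_required_fields is a hypothetical helper, a list of dicts stands in for NestedDataset or RayDataset, and a value is counted as missing when the key is absent or its value is None (the library's own definition of "missing" is not given on this page).

```python
from typing import Dict, List, Optional


class DataValidationError(Exception):
    """Local stand-in for the documented exception."""


def check_required_fields(
    rows: List[Dict],
    required_fields: List[str],
    field_types: Optional[Dict[str, type]] = None,
    allow_missing: float = 0.0,
) -> None:
    """Raise DataValidationError if a field is absent, too sparse, or mistyped."""
    field_types = field_types or {}
    for field in required_fields:
        # The field must appear in at least one row (an empty dataset fails here).
        if not any(field in row for row in rows):
            raise DataValidationError(f"required field {field!r} not found")

        # Assumption: absent keys and None values both count as missing.
        missing = sum(1 for row in rows if row.get(field) is None)
        ratio = missing / len(rows)
        if ratio > allow_missing:
            raise DataValidationError(
                f"field {field!r} missing ratio {ratio:.2f} exceeds {allow_missing:.2f}"
            )

        expected = field_types.get(field)
        if expected is not None:
            for row in rows:
                value = row.get(field)
                if value is not None and not isinstance(value, expected):
                    raise DataValidationError(
                        f"field {field!r} expected {expected.__name__}, "
                        f"got {type(value).__name__}"
                    )
```

With allow_missing left at 0.0, a single missing value is enough to fail validation; raising it trades strictness for tolerance of sparse fields.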