data_juicer.core.data.data_validator module

class data_juicer.core.data.data_validator.DataValidator(config: Dict)

Bases: ABC

Base class for data validation

__init__(config: Dict)
abstractmethod validate(dataset: DJDataset) -> None

Validate dataset content

Parameters:

dataset -- The dataset to validate

Raises:

DataValidationError -- If validation fails
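
The abstract interface above can be illustrated with a small subclass. This is a minimal sketch, not the library's code: the DataValidator and DataValidationError classes below are local stand-ins that mirror the documented signatures so the example runs without Data-Juicer installed, a plain list of dicts stands in for DJDataset, and NonEmptyTextValidator and its text_key config key are hypothetical.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class DataValidationError(Exception):
    """Local stand-in for data_juicer's DataValidationError."""


class DataValidator(ABC):
    """Local stand-in mirroring the documented interface."""

    def __init__(self, config: Dict):
        # Assumption: the base class keeps the config for subclasses to read.
        self.config = config

    @abstractmethod
    def validate(self, dataset) -> None:
        """Raise DataValidationError if the dataset is invalid."""


class NonEmptyTextValidator(DataValidator):
    """Hypothetical validator: every sample needs a non-empty text field."""

    def validate(self, dataset: List[Dict]) -> None:
        field = self.config.get("text_key", "text")  # hypothetical config key
        for idx, sample in enumerate(dataset):
            value = sample.get(field)
            if not isinstance(value, str) or not value.strip():
                raise DataValidationError(
                    f"Sample {idx}: empty or missing '{field}'"
                )


validator = NonEmptyTextValidator({"text_key": "text"})
validator.validate([{"text": "hello"}])  # returns None when every sample passes
```

A validator signals failure only by raising, so callers wrap validate() in try/except rather than checking a return value.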

exception data_juicer.core.data.data_validator.DataValidationError

Bases: Exception

Custom exception for data validation errors

class data_juicer.core.data.data_validator.DataValidatorRegistry

Bases: object

Registry for data validators

classmethod register(validator_type: str)
classmethod get_validator(validator_type: str) -> Type[DataValidator] | None
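
The two classmethods above suggest a decorator-based registry. The sketch below shows that pattern under stated assumptions: the class is a local stand-in, and it assumes register() returns a class decorator and that get_validator() returns None for unknown types (the latter matches the documented return annotation). The "my_check" type name and MyCheckValidator class are hypothetical.

```python
from typing import Callable, Dict, Optional


class DataValidatorRegistry:
    """Local stand-in; the real registry may differ in detail."""

    _validators: Dict[str, type] = {}

    @classmethod
    def register(cls, validator_type: str) -> Callable[[type], type]:
        # Assumption: register() is used as a class decorator factory.
        def decorator(validator_cls: type) -> type:
            cls._validators[validator_type] = validator_cls
            return validator_cls

        return decorator

    @classmethod
    def get_validator(cls, validator_type: str) -> Optional[type]:
        # Unknown types yield None instead of raising.
        return cls._validators.get(validator_type)


@DataValidatorRegistry.register("my_check")  # hypothetical type name
class MyCheckValidator:
    def __init__(self, config: Dict):
        self.config = config


validator_cls = DataValidatorRegistry.get_validator("my_check")
unknown = DataValidatorRegistry.get_validator("does_not_exist")
```

Because the lookup can return None, callers should check the result before instantiating the class.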
class data_juicer.core.data.data_validator.BaseConversationValidator(config: Dict)

Bases: DataValidator

Base class for conversation validators

__init__(config: Dict)
validate(dataset: DJDataset) -> None

Base validation for all conversation formats

abstractmethod validate_conversation(data: Dict) -> None

Validate specific conversation format

class data_juicer.core.data.data_validator.SwiftMessagesValidator(config: Dict)

Bases: BaseConversationValidator

Validator for Swift Messages conversation format.

This validator ensures conversations follow the Swift Messages format with proper message structure and role assignments.

Parameters:

config (Dict) -- Configuration dictionary containing:

- min_turns (int, optional): Minimum number of messages. Defaults to 1.
- max_turns (int, optional): Maximum number of messages. Defaults to 100.
- sample_size (int, optional): Number of samples to validate. Defaults to 100.

Example Format:
{
    "messages": [
        {"role": "system", "content": "<system>"},
        {"role": "user", "content": "<query>"},
        {"role": "assistant", "content": "<response>"},
        ...
    ]
}

Raises:

DataValidationError -- If validation fails due to:

- Missing 'messages' field
- Invalid message structure
- Invalid role values
- Missing content
- Message count outside allowed range

validate_conversation(data: Dict) -> None

Validate specific conversation format
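
The failure conditions listed for the Swift Messages format can be sketched as a standalone function. This is an illustration only, not the validator's actual implementation: check_swift_messages and the local DataValidationError are hypothetical stand-ins, and the set of allowed roles is an assumption taken from the example format (the real validator may accept more).

```python
from typing import Dict

# Assumption: only the roles shown in the example format are allowed.
ALLOWED_ROLES = {"system", "user", "assistant"}


class DataValidationError(Exception):
    """Local stand-in for data_juicer's DataValidationError."""


def check_swift_messages(data: Dict, min_turns: int = 1, max_turns: int = 100) -> None:
    """Raise DataValidationError if one sample breaks the documented rules."""
    if "messages" not in data:
        raise DataValidationError("Missing 'messages' field")
    messages = data["messages"]
    if not isinstance(messages, list):
        raise DataValidationError("'messages' must be a list")
    if not min_turns <= len(messages) <= max_turns:
        raise DataValidationError(
            f"Message count {len(messages)} outside [{min_turns}, {max_turns}]"
        )
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            raise DataValidationError(f"Message {i} must be a dict")
        role = msg.get("role")
        if not isinstance(role, str) or role not in ALLOWED_ROLES:
            raise DataValidationError(f"Message {i} has invalid role: {role!r}")
        if "content" not in msg:
            raise DataValidationError(f"Message {i} is missing 'content'")


sample = {
    "messages": [
        {"role": "system", "content": "You are helpful."},
        {"role": "user", "content": "Hi"},
        {"role": "assistant", "content": "Hello!"},
    ]
}
check_swift_messages(sample)  # passes without raising
```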

class data_juicer.core.data.data_validator.DataJuicerFormatValidator(config: Dict)

Bases: BaseConversationValidator

Validator for Data-Juicer default conversation format.

This validator ensures conversations follow the Data-Juicer format with proper fields and structure.

Parameters:

config (Dict) -- Configuration dictionary containing:

- min_turns (int, optional): Minimum number of conversation turns. Defaults to 1.
- max_turns (int, optional): Maximum number of conversation turns. Defaults to 100.
- sample_size (int, optional): Number of samples to validate. Defaults to 100.

Example Format:
{
    "system": "<system>",  # Optional
    "instruction": "<query-inst>",
    "query": "<query2>",
    "response": "<response2>",
    "history": [  # Optional
        ["<query1>", "<response1>"],
        ...
    ]
}

Raises:

DataValidationError -- If validation fails due to:

- Missing required fields
- Invalid field types
- Invalid conversation structure
- Turn count outside allowed range

validate_conversation(data: Dict) -> None

Validate specific conversation format
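
A comparable sketch for the Data-Juicer format follows. It rests on several assumptions that the documentation above does not confirm: that instruction, query, and response are the required fields (only system and history are marked optional in the example), and that the turn count is the number of history pairs plus the current query/response pair. check_dj_conversation and the local DataValidationError are hypothetical stand-ins.

```python
from typing import Dict

REQUIRED_FIELDS = ("instruction", "query", "response")  # assumption, see lead-in


class DataValidationError(Exception):
    """Local stand-in for data_juicer's DataValidationError."""


def check_dj_conversation(data: Dict, min_turns: int = 1, max_turns: int = 100) -> None:
    """Raise DataValidationError if one sample breaks the assumed rules."""
    for field in REQUIRED_FIELDS:
        if field not in data:
            raise DataValidationError(f"Missing required field '{field}'")
        if not isinstance(data[field], str):
            raise DataValidationError(f"Field '{field}' must be a string")
    if "system" in data and not isinstance(data["system"], str):
        raise DataValidationError("Field 'system' must be a string")

    history = data.get("history", [])
    if not isinstance(history, list):
        raise DataValidationError("Field 'history' must be a list")
    for i, turn in enumerate(history):
        is_pair = isinstance(turn, (list, tuple)) and len(turn) == 2
        if not is_pair or not all(isinstance(part, str) for part in turn):
            raise DataValidationError(
                f"History turn {i} must be a [query, response] pair of strings"
            )

    # Assumption: past turns plus the current query/response pair.
    turns = len(history) + 1
    if not min_turns <= turns <= max_turns:
        raise DataValidationError(
            f"Turn count {turns} outside [{min_turns}, {max_turns}]"
        )


sample = {
    "instruction": "Answer briefly.",
    "query": "And tomorrow?",
    "response": "Rain.",
    "history": [["Weather today?", "Sunny."]],
}
check_dj_conversation(sample)  # passes without raising
```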

class data_juicer.core.data.data_validator.CodeDataValidator(config: Dict)

Bases: DataValidator

Validator for code data

__init__(config: Dict)
validate(dataset: DJDataset) -> None

Validate dataset content

Parameters:

dataset -- The dataset to validate

Raises:

DataValidationError -- If validation fails

class data_juicer.core.data.data_validator.RequiredFieldsValidator(config: Dict)

Bases: DataValidator

Validator that checks for required fields in dataset.

This validator ensures that specified fields exist in the dataset and optionally checks their types and missing value ratios.

Parameters:

config (Dict) -- Configuration dictionary containing:

- required_fields (List[str]): List of field names that must exist
- field_types (Dict[str, type], optional): Map of field names to expected types
- allow_missing (float, optional): Maximum ratio of missing values allowed. Defaults to 0.0.

Example Config:
{
    "required_fields": ["field1", "field2"],
    "field_types": {"field1": str, "field2": int},
    "allow_missing": 0.0
}

Raises:

DataValidationError -- If validation fails

__init__(config: Dict)

Initialize validator with config

Parameters:

config -- Dict containing:

- required_fields: List of field names that must exist
- field_types: Optional map of field names to expected types
- allow_missing: Optional maximum allowed ratio of missing values

validate(dataset: DJDataset) -> None

Validate dataset has required fields with correct types

Parameters:

dataset -- NestedDataset or RayDataset to validate

Raises:

DataValidationError -- If validation fails
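
How required_fields, field_types, and allow_missing interact can be sketched on plain Python rows. This is an illustration under assumptions, not the validator's implementation: check_required_fields and the local DataValidationError are hypothetical stand-ins, a list of dicts stands in for NestedDataset or RayDataset, a value counts as missing when it is absent or None, and type checks skip missing values.

```python
from typing import Dict, List


class DataValidationError(Exception):
    """Local stand-in for data_juicer's DataValidationError."""


def check_required_fields(rows: List[Dict], config: Dict) -> None:
    """Raise DataValidationError if rows violate the config."""
    required = config["required_fields"]
    field_types = config.get("field_types", {})
    allow_missing = config.get("allow_missing", 0.0)
    total = len(rows)

    for field in required:
        # Assumption: a value is missing when the key is absent or None.
        missing = sum(1 for row in rows if row.get(field) is None)
        ratio = missing / total if total else 0.0
        if ratio > allow_missing:
            raise DataValidationError(
                f"Field '{field}' missing ratio {ratio:.2f} exceeds {allow_missing}"
            )

        expected = field_types.get(field)
        if expected is None:
            continue
        for i, row in enumerate(rows):
            value = row.get(field)
            if value is not None and not isinstance(value, expected):
                raise DataValidationError(
                    f"Row {i}: field '{field}' is not of type {expected.__name__}"
                )


rows = [
    {"field1": "a", "field2": 1},
    {"field1": "b", "field2": None},
]
config = {
    "required_fields": ["field1", "field2"],
    "field_types": {"field1": str, "field2": int},
    "allow_missing": 0.5,  # tolerate up to half of the values missing
}
check_required_fields(rows, config)  # passes: field2 is missing in exactly 50% of rows
```

With the default allow_missing of 0.0, the same rows would fail because field2 is missing in one row.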