data_juicer.core.data.schema module

class data_juicer.core.data.schema.Schema(column_types: Dict[str, Any], columns: List[str])

Bases: object

Dataset schema representation.

column_types

Mapping of column names to their types

Type:

Dict[str, Any]

columns

List of column names in order

Type:

List[str]

column_types: Dict[str, Any]
columns: List[str]
classmethod from_hf_features(features: Features)
classmethod from_ray_schema(schema)
classmethod map_hf_type_to_python(feature)

Map HuggingFace feature type to Python type.

Recursively maps nested types (e.g., List[str], Dict[str, int]).

Examples

Value('string') -> str
Sequence(Value('int32')) -> List[int]
Dict({'text': Value('string')}) -> Dict[str, Any]

Parameters:

feature – HuggingFace feature type

Returns:

Corresponding Python type
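
The recursion described above can be illustrated with a minimal sketch. It is not the library's implementation: `Value` and `Sequence` below are hypothetical stand-ins for the HuggingFace feature classes (so the snippet runs without `datasets` installed), `map_feature_to_python` is an illustrative name, and the dtype table is an assumption that covers only a few common dtypes.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class Value:
    """Hypothetical stand-in for a HuggingFace Value feature."""
    dtype: str


@dataclass
class Sequence:
    """Hypothetical stand-in for a HuggingFace Sequence feature."""
    feature: Any


# Assumed dtype-name table; the real method may recognise more dtypes.
_PRIMITIVES = {
    "string": str,
    "int32": int,
    "int64": int,
    "float32": float,
    "float64": float,
    "bool": bool,
}


def map_feature_to_python(feature: Any) -> Any:
    """Map a (stand-in) feature to a Python type, recursing into sequences."""
    if isinstance(feature, Value):
        # Unknown dtypes fall back to Any in this sketch.
        return _PRIMITIVES.get(feature.dtype, Any)
    if isinstance(feature, Sequence):
        # Recurse so that nested sequences become nested List[...] types.
        return List[map_feature_to_python(feature.feature)]
    if isinstance(feature, dict):
        # Mirrors the documented example: dict features map to Dict[str, Any].
        return Dict[str, Any]
    return Any


nested = map_feature_to_python(Sequence(Sequence(Value("string"))))
```

Here `nested` is built by two levels of recursion, one per `Sequence` wrapper, which is the behaviour the "recursively maps nested types" note refers to.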

classmethod map_ray_type_to_python(ray_type: DataType)

Map Ray/Arrow data type to Python type.

Parameters:

ray_type – PyArrow DataType

Returns:

Corresponding Python type
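
As a rough illustration of this kind of mapping, the sketch below works on Arrow type names given as strings (for example `"int64"` or `"list<item: string>"`) rather than on real `pyarrow.DataType` objects, so that it runs without `pyarrow` installed. The function name `map_arrow_name_to_python`, the name table, and the fallback to `Any` are all assumptions made for illustration; the actual method may behave differently.

```python
from typing import Any, List

# Assumed table of Arrow type names to Python types (not exhaustive).
_ARROW_NAME_TO_PYTHON = {
    "string": str,
    "large_string": str,
    "int32": int,
    "int64": int,
    "float": float,
    "double": float,
    "bool": bool,
    "binary": bytes,
}


def map_arrow_name_to_python(type_name: str) -> Any:
    """Map an Arrow type name (as a string) to a Python type."""
    name = type_name.strip()
    if name.startswith("list<") and name.endswith(">"):
        inner = name[len("list<"):-1]
        # Drop the field label, e.g. "item: int64" -> " int64".
        inner = inner.split(":", 1)[-1]
        return List[map_arrow_name_to_python(inner)]
    # Unrecognised names fall back to Any in this sketch.
    return _ARROW_NAME_TO_PYTHON.get(name, Any)


list_of_ints = map_arrow_name_to_python("list<item: int64>")
```

Falling back to `Any` keeps the sketch total: every input yields some type, and unsupported Arrow types are simply left unconstrained.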

__init__(column_types: Dict[str, Any], columns: List[str]) -> None
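
To show how the two fields fit together, here is a minimal sketch that assumes the class behaves like a plain two-field container. `SchemaSketch` is a hypothetical stand-in, not the `Schema` class itself, and the column names are invented for the example.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class SchemaSketch:
    """Stand-in mirroring the two documented fields; not the library class."""
    column_types: Dict[str, Any]
    columns: List[str]


schema = SchemaSketch(
    column_types={"text": str, "score": float, "tags": List[str]},
    columns=["text", "score", "tags"],
)

# Look up each column's type in the documented column order.
ordered_types = [(name, schema.column_types[name]) for name in schema.columns]
```

Keeping `columns` as a separate list preserves column order explicitly, independent of how `column_types` is built.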