data_juicer.tools package

Submodules

data_juicer.tools.DJ_mcp_granular_ops module

data_juicer.tools.DJ_mcp_granular_ops.process_parameter(name: str, param: Parameter) → Parameter

Processes a function parameter by converting a jsonargparse.typing.ClosedUnitInterval annotation to a local equivalent annotation.
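As a rough illustration of this kind of annotation swap, the sketch below uses only the standard `inspect` module. It is not the library's implementation: `ClosedUnitIntervalStandIn`, `LocalUnitInterval`, and `swap_annotation` are hypothetical names, and the stand-in class replaces the real jsonargparse type so that the example has no third-party dependency.

```python
import inspect
from typing import Annotated


class ClosedUnitIntervalStandIn(float):
    """Hypothetical stand-in for jsonargparse.typing.ClosedUnitInterval."""


# Hypothetical local annotation: a float documented as lying in [0, 1].
LocalUnitInterval = Annotated[float, "value in the closed interval [0, 1]"]


def swap_annotation(name: str, param: inspect.Parameter) -> inspect.Parameter:
    """Return a copy of `param` with the stand-in annotation replaced.

    `name` is accepted only to mirror the documented signature; this
    sketch does not use it.
    """
    if param.annotation is ClosedUnitIntervalStandIn:
        # Parameter objects are immutable, so replace() builds a new one.
        return param.replace(annotation=LocalUnitInterval)
    return param


ratio_param = inspect.Parameter(
    "ratio",
    inspect.Parameter.POSITIONAL_OR_KEYWORD,
    default=0.5,
    annotation=ClosedUnitIntervalStandIn,
)
converted = swap_annotation("ratio", ratio_param)
```

Other parameters pass through unchanged, and the default value and kind of the converted parameter are preserved.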

data_juicer.tools.DJ_mcp_granular_ops.create_operator_function(op, mcp)

Creates a callable function for a Data-Juicer operator class.

This function dynamically creates a function that can be registered as an MCP tool, with proper signature and documentation based on the operator's __init__ method.
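The general technique of building a function whose signature mirrors a class's __init__ can be sketched with the standard `inspect` module. This is a minimal illustration under stated assumptions, not Data-Juicer's code: `DemoFilter` and `make_tool_function` are hypothetical names, and the sketch leaves out the registration with an MCP server.

```python
import inspect


class DemoFilter:
    """Hypothetical operator used only for this sketch."""

    def __init__(self, min_len: int = 10, max_len: int = 50):
        self.min_len = min_len
        self.max_len = max_len


def make_tool_function(op_cls):
    """Build a plain function whose signature mirrors op_cls.__init__."""
    init_sig = inspect.signature(op_cls.__init__)
    # Drop `self` so callers only see the operator's own arguments.
    params = [p for name, p in init_sig.parameters.items() if name != "self"]
    public_sig = init_sig.replace(parameters=params)

    def tool(*args, **kwargs):
        # Validate the call against the copied signature, then build the operator.
        bound = public_sig.bind(*args, **kwargs)
        bound.apply_defaults()
        return op_cls(**bound.arguments)

    # Copy the metadata that tool registries typically introspect.
    tool.__name__ = op_cls.__name__
    tool.__doc__ = op_cls.__doc__
    tool.__signature__ = public_sig
    return tool


demo_tool = make_tool_function(DemoFilter)
```

Setting `__signature__` is what makes `inspect.signature(demo_tool)` report the operator's parameters instead of `(*args, **kwargs)`.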

data_juicer.tools.DJ_mcp_granular_ops.create_mcp_server(port: str = '8000')

Creates the FastMCP server and registers the tools.

Parameters: port (str, optional) -- Port number. Defaults to "8000".

data_juicer.tools.DJ_mcp_recipe_flow module

data_juicer.tools.DJ_mcp_recipe_flow.get_data_processing_ops(op_type: str | None = None, tags: List[str] | None = None, match_all: bool = True) → dict

Retrieves a list of available data processing operators based on the specified type and tags. Operators are a collection of basic processes that assist in data modification, cleaning, filtering, deduplication, etc.

Should be used with run_data_recipe.

If both tags and op_type are None, all operators are returned.

The following op_type values are supported:

aggregator: Aggregates batched samples, for example into a summary or conclusion.
deduplicator: Detects and removes duplicate samples.
filter: Filters out low-quality samples.
grouper: Groups samples into batched samples.
mapper: Edits and transforms samples.
selector: Selects top samples based on ranking.

The tags parameter specifies the characteristics of the data or the required resources. Available tags are:

Modality Tags:

text: process text data specifically.
image: process image data specifically.
audio: process audio data specifically.
video: process video data specifically.
multimodal: process multimodal data.

Resource Tags:

cpu: only requires CPU resource.
gpu: requires GPU/CUDA resource as well.

Model Tags:

api: equipped with API-based models (e.g. ChatGPT, GPT-4o).
vllm: equipped with models supported by vLLM.
hf: equipped with models from HuggingFace Hub.

Tags are used to refine the search for suitable operators based on specific data processing needs.

Parameters:

op_type -- The type of data processing operator to retrieve. If None, no op_type-based filtering is applied. If specified, must be one of the values listed above. Defaults to None.
tags -- An optional list of tags to filter operators. See the tag list above for options. If None, no tag-based filtering is applied. Defaults to None.
match_all -- If True, only operators matching all specified tags are returned. If False, operators matching any of the specified tags are returned. Defaults to True.

Returns:

A dict containing detailed information about the available operators.
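The difference between match_all=True and match_all=False can be sketched as follows. This is an illustrative stand-alone example, not the library's lookup code: the `DEMO_OPS` catalogue, its operator names, and the `filter_by_tags` helper are all hypothetical.

```python
from typing import Dict, List, Optional

# Hypothetical operator-to-tags catalogue, for illustration only.
DEMO_OPS: Dict[str, List[str]] = {
    "demo_text_filter": ["text", "cpu"],
    "demo_image_mapper": ["image", "gpu", "hf"],
    "demo_text_mapper": ["text", "gpu", "vllm"],
}


def filter_by_tags(
    ops: Dict[str, List[str]],
    tags: Optional[List[str]] = None,
    match_all: bool = True,
) -> List[str]:
    """Return the sorted operator names whose tags satisfy the query."""
    if not tags:
        # No tag filter: every operator qualifies.
        return sorted(ops)
    # all() requires every requested tag; any() requires at least one.
    combine = all if match_all else any
    return sorted(
        name
        for name, op_tags in ops.items()
        if combine(tag in op_tags for tag in tags)
    )


strict = filter_by_tags(DEMO_OPS, ["text", "gpu"])
loose = filter_by_tags(DEMO_OPS, ["text", "gpu"], match_all=False)
```

With this catalogue, `strict` contains only the operator tagged with both "text" and "gpu", while `loose` contains every operator carrying at least one of the two tags.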

data_juicer.tools.DJ_mcp_recipe_flow.run_data_recipe(dataset_path: str, process: list[Dict], export_path: str | None = None, np: int = 1) → str

Runs a data recipe. Use this tool to run one or more Data-Juicer data processing operators. Supported operators and their arguments should be obtained through the get_data_processing_ops tool.

Parameters:

dataset_path -- Path to the dataset to be processed.
process -- List of processing operations to be executed sequentially. Each element is a dictionary with operator name as key and its configuration as value. Multiple operators can be chained.
export_path -- Path to export the processed dataset. Defaults to None, which exports to './outputs' directory.
np -- Number of processes to use. Defaults to 1.

Examples

# First get available filter operators for text data
>>> available_ops = get_data_processing_ops(
...     op_type="filter",
...     tags=["text"]
... )

# Then run a data recipe with selected filters:
# 1. First filter samples with text length 10-50
# 2. Then filter English samples with language confidence score >= 0.8
>>> run_data_recipe(
...     "/path/to/dataset.jsonl",
...     [
...         {
...             "text_length_filter": {
...                 "min_len": 10,
...                 "max_len": 50
...             }
...         },
...         {
...             "language_id_score_filter": {
...                 "lang": "en",
...                 "min_score": 0.8
...             }
...         }
...     ]
... )
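Before handing a process list to run_data_recipe, it can help to check that it has the documented shape. The helper below is a hypothetical sketch, not part of Data-Juicer; it assumes that each element holds exactly one operator name mapped to a configuration dict, which is one reading of the parameter description above.

```python
from typing import Any, Dict, List, Tuple


def iter_recipe_steps(process: List[Dict[str, Any]]) -> List[Tuple[str, Dict[str, Any]]]:
    """Flatten a process list into ordered (operator name, config) pairs."""
    steps = []
    for index, entry in enumerate(process):
        # Assumption of this sketch: one operator per list element.
        if not isinstance(entry, dict) or len(entry) != 1:
            raise ValueError(
                f"step {index} must be a dict with exactly one operator name"
            )
        op_name, op_config = next(iter(entry.items()))
        # Treat a missing configuration (None) as "use the defaults".
        steps.append((op_name, dict(op_config or {})))
    return steps


demo_steps = iter_recipe_steps(
    [
        {"text_length_filter": {"min_len": 10, "max_len": 50}},
        {"language_id_score_filter": {"lang": "en", "min_score": 0.8}},
    ]
)
```

The returned list keeps the order of the input, matching the sequential execution described for the process parameter.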

data_juicer.tools.DJ_mcp_recipe_flow.create_mcp_server(port: str = '8000')

Creates the FastMCP server and registers the tools.

Parameters: port (str, optional) -- Port number. Defaults to "8000".

data_juicer.tools.mcp_tool module

data_juicer.tools.mcp_tool.add_extra_cfg(dj_cfg: Dict) → Dict

Adds extra Data-Juicer config entries.

data_juicer.tools.mcp_tool.execute_op(dj_cfg: Dict)

data_juicer.tools.op_search module

Operator Searcher - A tool for filtering Data-Juicer operators by tags.

class data_juicer.tools.op_search.OPRecord(op_type: str, name: str, desc: str, tags: List[str], sig: Signature, param_desc: str)

Bases: object

A record class for storing operator metadata.

__init__(op_type: str, name: str, desc: str, tags: List[str], sig: Signature, param_desc: str)

to_dict()

data_juicer.tools.op_search.analyze_modality_tag(code, op_prefix)

Analyzes the modality tag for the given code content string. The result should be one of the "Modality Tags" in tagging_mappings.json. The choice is made by finding usages of the {modality}_key attributes and by checking the prefix of the OP name. If multiple modality keys are used, the 'multimodal' tag is returned instead.
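A simplified version of this heuristic can be sketched with a regular expression. The code below is an assumption-laden illustration, not the function's actual source: `guess_modality_tag` is a hypothetical name, the way the OP-name prefix is combined with the key usages is a guess, and returning None when nothing matches is a choice made for this sketch only.

```python
import re
from typing import Optional

MODALITIES = ("text", "image", "audio", "video")


def guess_modality_tag(code: str, op_prefix: str) -> Optional[str]:
    """Guess a modality tag from `{modality}_key` usages and the OP prefix."""
    # Collect every modality whose key attribute appears in the source text.
    used = {m for m in MODALITIES if re.search(rf"\b{m}_key\b", code)}
    # Assumption: a modality-named prefix counts as one more piece of evidence.
    if op_prefix in MODALITIES:
        used.add(op_prefix)
    if len(used) > 1:
        return "multimodal"
    if used:
        return next(iter(used))
    return None  # Sketch-only fallback; the real default is not documented here.


single = guess_modality_tag("value = sample[self.text_key]", "text")
mixed = guess_modality_tag("a = s[self.image_key]; b = s[self.text_key]", "image")
```

The word boundaries in the pattern keep an identifier such as `context_key` from being counted as a use of `text_key`.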

data_juicer.tools.op_search.analyze_resource_tag(code)

Analyzes the resource tag for the given code content string. The result should be one of the "Resource Tags" in tagging_mappings.json. The choice is made from the statement that assigns the _accelerator attribute.
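The idea of reading the `_accelerator` assignment out of source text can be sketched as below. This is a hypothetical simplification: `guess_resource_tag` is not a library function, and both the "cuda" value and the fallback to "cpu" are assumptions made for the example.

```python
import re

# Matches assignments such as `_accelerator = "cuda"` or `_accelerator='cpu'`.
ACCELERATOR_RE = re.compile(r"_accelerator\s*=\s*['\"](\w+)['\"]")


def guess_resource_tag(code: str) -> str:
    """Map an `_accelerator` assignment in `code` to a resource tag."""
    match = ACCELERATOR_RE.search(code)
    # Assumption: only an explicit "cuda" assignment implies the gpu tag.
    if match and match.group(1) == "cuda":
        return "gpu"
    return "cpu"


gpu_tag = guess_resource_tag('class DemoOp:\n    _accelerator = "cuda"\n')
cpu_tag = guess_resource_tag("class DemoOp:\n    pass\n")
```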

data_juicer.tools.op_search.analyze_model_tags(code)

Analyzes the model tags for the given code content string. Each result should be one of the "Model Tags" in tagging_mappings.json. The choice is made by finding the model_type argument in prepare_model method invocations.
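Scanning for the model_type argument might look like the sketch below. It is an illustration only: `guess_model_tags` is a hypothetical helper, and the `MODEL_TYPE_TO_TAG` mapping from model_type strings to tags is an assumption rather than the contents of tagging_mappings.json.

```python
import re
from typing import List

# Assumed mapping from model_type values to model tags (illustrative only).
MODEL_TYPE_TO_TAG = {"api": "api", "vllm": "vllm", "huggingface": "hf"}

# Finds a quoted model_type=... keyword inside a prepare_model(...) call.
# [^)] also matches newlines, so calls spanning several lines are covered.
PREPARE_MODEL_RE = re.compile(
    r"prepare_model\([^)]*?model_type\s*=\s*['\"](\w+)['\"]"
)


def guess_model_tags(code: str) -> List[str]:
    """Return the sorted model tags implied by prepare_model calls in `code`."""
    found = PREPARE_MODEL_RE.findall(code)
    return sorted({MODEL_TYPE_TO_TAG[m] for m in found if m in MODEL_TYPE_TO_TAG})


demo_source = 'self.model_key = prepare_model(\n    model_type="vllm",\n    path=p,\n)'
demo_tags = guess_model_tags(demo_source)
```

The pattern stops at the first closing parenthesis, so a call with nested parentheses before the model_type keyword would be missed; a parser-based approach would be more robust.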

data_juicer.tools.op_search.analyze_tag_with_inheritance(op_cls, analyze_func, default_tags=[], other_parm={})

Generic function that analyzes tags along the inheritance chain of an operator class.
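One way to analyze tags along an inheritance chain is to walk the class's method resolution order and stop at the first class that yields a result. The sketch below shows that pattern under stated assumptions: `first_tags_in_mro` and the demo classes are hypothetical, the analysis callback here receives a class rather than source text, and the actual traversal and fallback rules of the library function may differ.

```python
from typing import Callable, List, Optional


def first_tags_in_mro(
    cls: type,
    analyze_func: Callable[[type], List[str]],
    default_tags: Optional[List[str]] = None,
) -> List[str]:
    """Return tags from the nearest class in the MRO that produces any."""
    for klass in cls.__mro__:
        if klass is object:
            break
        tags = analyze_func(klass)
        if tags:
            return list(tags)
    # None instead of a mutable default avoids sharing one list across calls.
    return list(default_tags or [])


class DemoBase:
    _accelerator = "cuda"


class DemoChild(DemoBase):
    pass


class DemoPlain:
    pass


def demo_analyze(klass: type) -> List[str]:
    # Look only at attributes defined directly on this class, not inherited ones.
    return ["gpu"] if vars(klass).get("_accelerator") == "cuda" else []


inherited = first_tags_in_mro(DemoChild, demo_analyze, ["cpu"])
fallback = first_tags_in_mro(DemoPlain, demo_analyze, ["cpu"])
```

Here `DemoChild` defines nothing itself, so the tag comes from `DemoBase`, while `DemoPlain` falls back to the default.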

data_juicer.tools.op_search.analyze_tag_from_cls(op_cls, op_name)

Analyzes the tags for the OP from the given class.

data_juicer.tools.op_search.extract_param_docstring(docstring)

Extracts parameter descriptions from the docstring of an __init__ method.
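For docstrings written in the reStructuredText `:param name:` style, the extraction can be sketched as follows. This is a hypothetical helper, not the library function: `parse_param_docs` returns a dict for readability, whereas the actual return format of extract_param_docstring is not stated here and may differ.

```python
import re
from typing import Dict, Optional

PARAM_RE = re.compile(r"^\s*:param\s+(\w+):\s*(.*)$")


def parse_param_docs(docstring: Optional[str]) -> Dict[str, str]:
    """Collect `:param name: description` entries, joining wrapped lines."""
    params: Dict[str, str] = {}
    current = None
    for line in (docstring or "").splitlines():
        match = PARAM_RE.match(line)
        stripped = line.strip()
        if match:
            current = match.group(1)
            params[current] = match.group(2).strip()
        elif current and stripped and not stripped.startswith(":"):
            # Continuation line of the parameter currently being read.
            params[current] += " " + stripped
        else:
            # Blank line or another field (e.g. :return:) ends the entry.
            current = None
    return params


demo_doc = "\n".join(
    [
        "Initialization method.",
        "",
        ":param min_len: The minimum",
        "    text length.",
        ":param max_len: The maximum text length.",
        ":return: nothing",
    ]
)
demo_params = parse_param_docs(demo_doc)
```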

class data_juicer.tools.op_search.OPSearcher(specified_op_list: List[str] | None = None)