data_juicer.tools package

Submodules

data_juicer.tools.DJ_mcp_granular_ops module

data_juicer.tools.DJ_mcp_granular_ops.process_parameter(name: str, param: Parameter) -> Parameter

Processes a function parameter by converting a jsonargparse.typing.ClosedUnitInterval annotation to a local equivalent annotation.
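The conversion described above can be sketched with the standard library alone. This is an illustration, not the module's actual code: `ClosedUnitInterval` below is a hypothetical stand-in for the type in `jsonargparse.typing`, and the "local equivalent" `LocalUnitInterval` is an assumed choice of annotation.

```python
import inspect
from typing import Annotated


class ClosedUnitInterval(float):
    """Hypothetical stand-in for jsonargparse.typing.ClosedUnitInterval."""


# Assumed local equivalent: a plain float that carries its range as metadata.
LocalUnitInterval = Annotated[float, "0.0 <= value <= 1.0"]


def process_parameter_sketch(name: str, param: inspect.Parameter) -> inspect.Parameter:
    """Return a copy of `param` with the annotation swapped when it matches."""
    if param.annotation is ClosedUnitInterval:
        # Parameter objects are immutable, so replace() builds a new one.
        return param.replace(annotation=LocalUnitInterval)
    return param


def example(ratio: ClosedUnitInterval = 0.5, text_key: str = "text"):
    """Toy function whose parameters are processed below."""


converted = {
    name: process_parameter_sketch(name, p)
    for name, p in inspect.signature(example).parameters.items()
}
```

Only the annotation changes; the parameter's name, kind, and default are carried over unchanged by `replace()`.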

data_juicer.tools.DJ_mcp_granular_ops.create_operator_function(op, mcp)

Creates a callable function for a Data-Juicer operator class.

This function dynamically creates a function that can be registered as an MCP tool, with proper signature and documentation based on the operator's __init__ method.
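One way to build a function whose signature mirrors an operator's `__init__` is to set `__signature__` on a generic wrapper. The sketch below assumes a hypothetical operator class `TextLengthFilterDemo` and a hypothetical helper `make_tool_function`; it omits the MCP registration step, so it shows only the signature-copying idea and not how Data-Juicer implements it.

```python
import inspect


class TextLengthFilterDemo:
    """Hypothetical operator: keeps samples whose text length is in range."""

    def __init__(self, min_len: int = 10, max_len: int = 50):
        self.min_len = min_len
        self.max_len = max_len


def make_tool_function(op_cls):
    """Build a function whose signature mirrors op_cls.__init__ without self."""
    init_sig = inspect.signature(op_cls.__init__)
    params = [p for n, p in init_sig.parameters.items() if n != "self"]

    def tool(*args, **kwargs):
        # Bind against the mirrored signature to validate the arguments
        # and fill in defaults before constructing the operator.
        bound = tool.__signature__.bind(*args, **kwargs)
        bound.apply_defaults()
        return op_cls(**bound.arguments)

    tool.__name__ = op_cls.__name__
    tool.__doc__ = op_cls.__doc__
    # inspect.signature() honors __signature__, so introspecting tools
    # see the operator's parameters instead of (*args, **kwargs).
    tool.__signature__ = inspect.Signature(params)
    return tool


tool = make_tool_function(TextLengthFilterDemo)
op = tool(min_len=5)
```

A framework that derives a tool schema from `inspect.signature(tool)` would then see `min_len` and `max_len` with their types and defaults.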

data_juicer.tools.DJ_mcp_granular_ops.create_mcp_server(port: str = '8000')

Creates the FastMCP server and registers the tools.

Parameters:

port (str, optional) -- Port number. Defaults to "8000".

data_juicer.tools.DJ_mcp_recipe_flow module

data_juicer.tools.DJ_mcp_recipe_flow.get_data_processing_ops(op_type: str | None = None, tags: List[str] | None = None, match_all: bool = True) -> dict

Retrieves a list of available data processing operators based on the specified type and tags. Operators are a collection of basic processes that assist in data modification, cleaning, filtering, deduplication, etc.

Should be used with run_data_recipe.

If both tags and op_type are None, a list of all operators is returned.

The following op_type values are supported:
  • aggregator: Aggregates batched samples, for example into a summary or conclusion.

  • deduplicator: Detects and removes duplicate samples.

  • filter: Filters out low-quality samples.

  • grouper: Groups samples into batched samples.

  • mapper: Edits and transforms samples.

  • selector: Selects top samples based on ranking.

The tags parameter specifies the characteristics of the data or the required resources. Available tags are:

Modality Tags:
  • text: process text data specifically.

  • image: process image data specifically.

  • audio: process audio data specifically.

  • video: process video data specifically.

  • multimodal: process multimodal data.

Resource Tags:
  • cpu: only requires CPU resource.

  • gpu: requires GPU/CUDA resource as well.

Model Tags:
  • api: equipped with API-based models (e.g. ChatGPT, GPT-4o).

  • vllm: equipped with models supported by vLLM.

  • hf: equipped with models from HuggingFace Hub.

Tags are used to refine the search for suitable operators based on specific data processing needs.

Parameters:
  • op_type -- The type of data processing operator to retrieve. If None, no op_type-based filtering is applied. If specified, must be one of the values listed above. Defaults to None.

  • tags -- An optional list of tags to filter operators. See the tag list above for options. If None, no tag-based filtering is applied. Defaults to None.

  • match_all -- If True, only operators matching all specified tags are returned. If False, operators matching any of the specified tags are returned. Defaults to True.

Returns:

A dict containing detailed information about the available operators.
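The interaction between op_type, tags, and match_all can be illustrated with plain set operations. The registry `DEMO_OPS`, its tag assignments, and the function `select_ops` below are made up for illustration; this is a sketch of the documented filtering semantics under those assumptions, not the tool's implementation or its return format.

```python
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical registry: operator name -> (op_type, tags).
# The tag assignments here are illustrative only.
DEMO_OPS: Dict[str, Tuple[str, Set[str]]] = {
    "text_length_filter": ("filter", {"text", "cpu"}),
    "image_aesthetics_filter": ("filter", {"image", "gpu", "hf"}),
    "clean_html_mapper": ("mapper", {"text", "cpu"}),
}


def select_ops(
    op_type: Optional[str] = None,
    tags: Optional[List[str]] = None,
    match_all: bool = True,
) -> List[str]:
    """Return operator names that pass the type and tag filters."""
    wanted = set(tags or [])
    selected = []
    for name, (kind, op_tags) in DEMO_OPS.items():
        if op_type is not None and kind != op_type:
            continue  # type filter applies only when op_type is given
        if wanted:
            if match_all and not wanted <= op_tags:
                continue  # every requested tag must be present
            if not match_all and not (wanted & op_tags):
                continue  # at least one requested tag must be present
        selected.append(name)
    return selected


text_filters = select_ops(op_type="filter", tags=["text"])
any_match = select_ops(tags=["text", "gpu"], match_all=False)
all_match = select_ops(tags=["text", "gpu"], match_all=True)
```

With this toy registry, asking for both `text` and `gpu` with `match_all=True` returns nothing, while `match_all=False` returns every operator that carries either tag.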

data_juicer.tools.DJ_mcp_recipe_flow.run_data_recipe(dataset_path: str, process: list[Dict], export_path: str | None = None, np: int = 1) -> str

Runs a data recipe. Use this tool to run one or more Data-Juicer data processing operators. Supported operators and their arguments should be obtained through the get_data_processing_ops tool.

Parameters:
  • dataset_path -- Path to the dataset to be processed.

  • process -- List of processing operations to be executed sequentially. Each element is a dictionary with operator name as key and its configuration as value. Multiple operators can be chained.

  • export_path -- Path to export the processed dataset. Defaults to None, which exports to the './outputs' directory.

  • np -- Number of processes to use. Defaults to 1.

Examples

# First get available filter operators for text data
>>> available_ops = get_data_processing_ops(
...     op_type="filter",
...     tags=["text"]
... )

# Then run a data recipe with selected filters:
# 1. First filter samples with text length 10-50
# 2. Then filter English samples with language confidence score >= 0.8
>>> run_data_recipe(
...     "/path/to/dataset.jsonl",
...     [
...         {
...             "text_length_filter": {
...                 "min_len": 10,
...                 "max_len": 50
...             }
...         },
...         {
...             "language_id_score_filter": {
...                 "lang": "en",
...                 "min_score": 0.8
...             }
...         }
...     ]
... )
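Before passing a process list to run_data_recipe, it can help to check that each element has the documented shape. The helper `iter_process_steps` below is hypothetical and not part of Data-Juicer; it assumes each element is a dict with exactly one operator name as its key, which is one reading of the parameter description above.

```python
from typing import Any, Dict, List, Tuple


def iter_process_steps(process: List[Dict[str, Any]]) -> List[Tuple[str, Dict[str, Any]]]:
    """Flatten a process list into ordered (operator_name, config) pairs."""
    steps = []
    for i, entry in enumerate(process):
        # Assumption: one operator per list element.
        if not isinstance(entry, dict) or len(entry) != 1:
            raise ValueError(f"step {i}: expected a dict with exactly one operator name")
        ((op_name, op_cfg),) = entry.items()
        # Treat a missing config (None) as "use the operator's defaults".
        steps.append((op_name, dict(op_cfg or {})))
    return steps


steps = iter_process_steps(
    [
        {"text_length_filter": {"min_len": 10, "max_len": 50}},
        {"language_id_score_filter": {"lang": "en", "min_score": 0.8}},
    ]
)
```

The resulting pairs preserve list order, which matches the statement that operations are executed sequentially.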

data_juicer.tools.DJ_mcp_recipe_flow.create_mcp_server(port: str = '8000')

Creates the FastMCP server and registers the tools.

Parameters:

port (str, optional) -- Port number. Defaults to "8000".

data_juicer.tools.mcp_tool module

data_juicer.tools.mcp_tool.add_extra_cfg(dj_cfg: Dict) -> Dict

Adds extra Data-Juicer (dj) configuration to dj_cfg.

data_juicer.tools.mcp_tool.execute_op(dj_cfg: Dict)

Module contents