Data Juicer
Version: v1.4.2

data_juicer.tools.op_search module

Operator Searcher - A tool for filtering Data-Juicer operators by tags

class data_juicer.tools.op_search.OPRecord(op_type: str, name: str, desc: str, tags: List[str], sig: Signature, param_desc: str)

Bases: object

A record class for storing operator metadata

__init__(op_type: str, name: str, desc: str, tags: List[str], sig: Signature, param_desc: str)
to_dict()
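
To make the shape of such a record concrete, here is a minimal sketch of a metadata record with a `to_dict()` method. It is not the library's code: the class name `OPRecordSketch`, the attribute names, the dictionary keys, and the `demo_filter` function are all hypothetical, and the real `OPRecord.to_dict()` may return different keys.

```python
import inspect
from typing import Any, Dict, List


class OPRecordSketch:
    """Hypothetical stand-in for OPRecord; attribute names are assumptions."""

    def __init__(self, op_type: str, name: str, desc: str, tags: List[str],
                 sig: inspect.Signature, param_desc: str):
        self.type = op_type
        self.name = name
        self.desc = desc
        self.tags = tags
        self.sig = sig
        self.param_desc = param_desc

    def to_dict(self) -> Dict[str, Any]:
        # The Signature is rendered as text so the result is plain data.
        return {
            "type": self.type,
            "name": self.name,
            "desc": self.desc,
            "tags": list(self.tags),
            "sig": str(self.sig),
            "param_desc": self.param_desc,
        }


def demo_filter(min_len: int = 10, max_len: int = 100):
    """Hypothetical callable, used only to obtain a Signature object."""


record = OPRecordSketch(
    op_type="filter",
    name="demo_filter",
    desc="Keeps samples whose length falls within a range.",
    tags=["text", "cpu"],
    sig=inspect.signature(demo_filter),
    param_desc=":param min_len: minimum length",
)
record_dict = record.to_dict()
```

Converting the signature to a string is a design choice of this sketch; it keeps the dictionary easy to serialize.
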

data_juicer.tools.op_search.analyze_modality_tag(code, op_prefix)

Analyze the modality tag for the given code content string. Should be one of the "Modality Tags" in tagging_mappings.json. The choice is made by finding usages of the {modality}_key attributes and by checking the prefix of the OP name. If multiple modality keys are used, the 'multimodal' tag is returned instead.
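
The heuristic described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the library's implementation: the function name `guess_modality_tag`, the `MODALITIES` tuple, and the fallback to `"text"` are all hypothetical, and the real set of tags is defined in tagging_mappings.json.

```python
import re

# Assumed modality names; the authoritative list is in tagging_mappings.json.
MODALITIES = ("text", "image", "audio", "video")


def guess_modality_tag(code: str, op_prefix: str) -> str:
    """Hypothetical sketch: infer a modality tag from source text."""
    # Collect every modality whose `<modality>_key` attribute appears in the code.
    used = {m for m in MODALITIES if re.search(rf"\b{m}_key\b", code)}
    # The OP name prefix counts as additional evidence.
    if op_prefix in MODALITIES:
        used.add(op_prefix)
    if len(used) > 1:
        return "multimodal"
    if len(used) == 1:
        return next(iter(used))
    return "text"  # assumed fallback, not confirmed by the source


single = guess_modality_tag("value = sample[self.image_key]", "image")
mixed = guess_modality_tag("a = s[self.image_key]; b = s[self.text_key]", "image")
```
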

data_juicer.tools.op_search.analyze_resource_tag(code)

Analyze the resource tag for the given code content string. Should be one of the "Resource Tags" in tagging_mappings.json. The choice is made according to the statement that assigns a value to the _accelerator attribute.
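
A minimal sketch of this kind of check is shown below. The function name `guess_resource_tag`, the regular expression, and the tag names `"gpu"` and `"cpu"` are assumptions made for illustration; the actual tag names come from tagging_mappings.json.

```python
import re

# Matches assignments such as `_accelerator = "cuda"` or `self._accelerator = 'cpu'`.
ACCELERATOR_ASSIGNMENT = re.compile(r"""_accelerator\s*=\s*["'](\w+)["']""")


def guess_resource_tag(code: str) -> str:
    """Hypothetical sketch: map an `_accelerator` assignment to a resource tag."""
    match = ACCELERATOR_ASSIGNMENT.search(code)
    if match and match.group(1) == "cuda":
        return "gpu"  # assumed tag name
    return "cpu"  # assumed default when no CUDA assignment is found


gpu_tag = guess_resource_tag('class DemoOp:\n    _accelerator = "cuda"\n')
cpu_tag = guess_resource_tag("class DemoOp:\n    pass\n")
```
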

data_juicer.tools.op_search.analyze_model_tags(code)

Analyze the model tags for the given code content string. Each should be one of the "Model Tags" in tagging_mappings.json. The choice is made by finding the model_type argument in invocations of the prepare_model method.
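
One way to locate `model_type` arguments is to walk the syntax tree of the code string, as in the sketch below. The function name `find_model_types` is hypothetical, and the sketch returns the raw `model_type` strings; how the library maps them to the tags in tagging_mappings.json is not shown here.

```python
import ast
from typing import List


def find_model_types(code: str) -> List[str]:
    """Hypothetical sketch: collect literal model_type values passed to prepare_model."""
    found = set()
    for node in ast.walk(ast.parse(code)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # Accept both `prepare_model(...)` and `module.prepare_model(...)`.
        if isinstance(func, ast.Name):
            called = func.id
        elif isinstance(func, ast.Attribute):
            called = func.attr
        else:
            continue
        if called != "prepare_model":
            continue
        for keyword in node.keywords:
            value = keyword.value
            if (keyword.arg == "model_type"
                    and isinstance(value, ast.Constant)
                    and isinstance(value.value, str)):
                found.add(value.value)
    # Sorted so the result does not depend on traversal order.
    return sorted(found)


sample_code = (
    'a = prepare_model(model_type="huggingface", path="x")\n'
    'b = utils.prepare_model(model_type="api")\n'
)
model_types = find_model_types(sample_code)
```

Only keyword arguments with string literals are recognized in this sketch; positional or computed values are ignored.
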

data_juicer.tools.op_search.analyze_tag_with_inheritance(op_cls, analyze_func, default_tags=[], other_parm={})

General-purpose function for analyzing tags along the inheritance chain of an OP class.
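
The idea of analyzing tags along an inheritance chain can be sketched as below. Several points are assumptions: the function name `analyze_with_inheritance`, the rule that the first class yielding tags wins, and the choice to pass each class (rather than its source code) to the analysis callback, which is done only to keep the example self-contained. The sketch also uses `None` defaults instead of mutable default arguments.

```python
from typing import Callable, List, Optional


def analyze_with_inheritance(
    op_cls: type,
    analyze_func: Callable[[type], List[str]],
    default_tags: Optional[List[str]] = None,
) -> List[str]:
    """Hypothetical sketch: try the class first, then each ancestor in MRO order."""
    for klass in op_cls.__mro__:
        if klass is object:
            break
        tags = analyze_func(klass)
        if tags:
            return tags  # assumed rule: the nearest class with a result wins
    return list(default_tags or [])


def own_accelerator(klass: type) -> List[str]:
    # Looks only at attributes defined directly on this class.
    value = vars(klass).get("_accelerator")
    return [value] if value else []


class BaseOp:
    _accelerator = "cuda"


class ChildOp(BaseOp):
    pass


class PlainOp:
    pass


child_tags = analyze_with_inheritance(ChildOp, own_accelerator, default_tags=["cpu"])
plain_tags = analyze_with_inheritance(PlainOp, own_accelerator, default_tags=["cpu"])
```

Here `ChildOp` defines nothing itself, so the result comes from `BaseOp`, while `PlainOp` falls back to the default tags.
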

data_juicer.tools.op_search.analyze_tag_from_cls(op_cls, op_name)

Analyze the tags for the OP from the given cls.

data_juicer.tools.op_search.extract_param_docstring(docstring)

Extract parameter descriptions from __init__ method docstring.
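
As an illustration of this kind of extraction, the sketch below parses `:param name: description` fields from a docstring. The function name `extract_params` and the dictionary return type are assumptions; the library's function may return the descriptions in a different form, such as a single string.

```python
import re
from typing import Dict, Optional

PARAM_FIELD = re.compile(r":param\s+(\w+):\s*(.*)")


def extract_params(docstring: Optional[str]) -> Dict[str, str]:
    """Hypothetical sketch: map parameter names to their descriptions."""
    params: Dict[str, str] = {}
    current = None
    for raw_line in (docstring or "").splitlines():
        line = raw_line.strip()
        match = PARAM_FIELD.match(line)
        if match:
            current = match.group(1)
            params[current] = match.group(2)
        elif not line or line.startswith(":"):
            # A blank line or another field (e.g. :return:) ends the description.
            current = None
        elif current is not None:
            # Continuation line of a multi-line description.
            params[current] = (params[current] + " " + line).strip()
    return params


demo_docstring = """
Initialization method.

:param min_len: The minimum text length
    kept by this filter.
:param max_len: The maximum text length.
:return: nothing
"""
param_map = extract_params(demo_docstring)
```
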

class data_juicer.tools.op_search.OPSearcher(specified_op_list: List[str] | None = None)

Bases: object

Operator search engine

__init__(specified_op_list: List[str] | None = None)
search(tags: List[str] | None = None, op_type: str | None = None, match_all: bool = True) → List[Dict]

Search operators by criteria.

Parameters:
  • tags: List of tags to match.
  • op_type: Operator type (mapper/filter/etc).
  • match_all: True requires matching all tags; False matches any tag.

Returns: List of matched operator records.
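
The difference between `match_all=True` and `match_all=False` can be illustrated with a small stand-alone sketch. It does not call the library: the function `search_records`, the record layout, and the operator names in `demo_records` are hypothetical, and treating a missing `tags` or `op_type` as "no restriction" is an assumption.

```python
from typing import Dict, List, Optional


def search_records(
    records: List[Dict],
    tags: Optional[List[str]] = None,
    op_type: Optional[str] = None,
    match_all: bool = True,
) -> List[Dict]:
    """Hypothetical sketch of tag-based filtering over plain dict records."""
    wanted = set(tags or [])
    results = []
    for record in records:
        if op_type and record["type"] != op_type:
            continue
        have = set(record["tags"])
        if wanted:
            # match_all: every wanted tag must be present; otherwise any overlap suffices.
            matched = wanted <= have if match_all else bool(wanted & have)
            if not matched:
                continue
        results.append(record)
    return results


# Made-up records for demonstration only.
demo_records = [
    {"type": "filter", "name": "demo_text_filter", "tags": ["text", "cpu"]},
    {"type": "mapper", "name": "demo_image_mapper", "tags": ["image", "gpu"]},
    {"type": "filter", "name": "demo_image_filter", "tags": ["image", "cpu"]},
]

all_match = search_records(demo_records, tags=["image", "cpu"], match_all=True)
any_match = search_records(demo_records, tags=["image", "cpu"], match_all=False)
only_filters = search_records(demo_records, op_type="filter")
```

With these records, requiring all tags returns only `demo_image_filter`, while matching any tag returns all three.
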

data_juicer.tools.op_search.main(tags, op_type)

Copyright © 2024, Data-Juicer Team