data_juicer.core.tracer module

class data_juicer.core.tracer.Tracer(work_dir, op_list_to_trace=None, show_num=10)[source]

Bases: object

Traces how samples change before and after an operator processes them.

The comparison results will be stored in the work directory.

__init__(work_dir, op_list_to_trace=None, show_num=10)[source]

Initialization method.

Parameters:
  • work_dir -- the work directory in which to store the comparison results.

  • op_list_to_trace -- the list of OPs (operators) to trace.

  • show_num -- the maximum number of samples to show in the comparison result files.
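As a rough illustration of how op_list_to_trace could gate tracing, the sketch below uses a hypothetical helper, should_trace, that is not part of data_juicer. It assumes that leaving op_list_to_trace as None means every operator is traced; the documentation above does not state this, so treat it as an assumption.

```python
def should_trace(op_name, op_list_to_trace=None):
    """Hypothetical helper: decide whether an OP should be traced.

    Assumption (not confirmed by the documentation): an empty or missing
    op_list_to_trace means that every OP is traced.
    """
    if not op_list_to_trace:
        return True
    return op_name in set(op_list_to_trace)


# Example OP names, used only for illustration.
candidates = ["clean_links_mapper", "text_length_filter"]
traced = [name for name in candidates
          if should_trace(name, ["text_length_filter"])]
```

Here traced keeps only the OP named in the list, while calling should_trace with no list accepts any name.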

trace_mapper(op_name: str, previous_ds: Dataset, processed_ds: Dataset, text_key: str)[source]

Compare datasets before and after a Mapper.

This mainly shows the sample pairs that differ because of the Mapper's modifications.

Parameters:
  • op_name -- the name of the Mapper OP

  • previous_ds -- dataset before the mapper process

  • processed_ds -- dataset processed by the mapper

  • text_key -- the text key (field) to trace

Returns:
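The kind of comparison trace_mapper describes can be sketched with plain lists of dicts standing in for Dataset objects. This is a minimal illustration, not the library's implementation: the function name diff_mapper_samples and the output keys "original" and "processed" are assumptions, and it presumes the Mapper keeps samples in the same order.

```python
def diff_mapper_samples(previous, processed, text_key, show_num=10):
    """Collect up to show_num before/after pairs whose text changed."""
    pairs = []
    for before, after in zip(previous, processed):
        if len(pairs) >= show_num:
            break
        if before[text_key] != after[text_key]:
            pairs.append({"original": before[text_key],
                          "processed": after[text_key]})
    return pairs


previous = [{"text": "Hello  world"}, {"text": "ok"}, {"text": " trim me "}]
processed = [{"text": "Hello world"}, {"text": "ok"}, {"text": "trim me"}]
changed = diff_mapper_samples(previous, processed, "text", show_num=10)
```

Unchanged samples (the second one here) are skipped, so only the modified pairs are collected.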

trace_batch_mapper(op_name: str, previous_ds: Dataset, processed_ds: Dataset, text_key: str)[source]

Compare datasets before and after a BatchMapper.

This mainly shows the new samples added (augmented) by the BatchMapper.

Parameters:
  • op_name -- the name of the BatchMapper OP

  • previous_ds -- dataset before the mapper process

  • processed_ds -- dataset processed by the mapper

  • text_key -- the text key (field) to trace

Returns:
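One way to picture "new samples augmented by the BatchMapper" is to look for processed samples whose text does not occur in the input. The sketch below uses plain lists of dicts in place of Dataset objects; the function name find_new_samples is hypothetical, and the approach assumes that augmented texts differ from every original text.

```python
def find_new_samples(previous, processed, text_key, show_num=10):
    """Return up to show_num processed samples whose text is not in previous."""
    seen = {sample[text_key] for sample in previous}
    added = []
    for sample in processed:
        if len(added) >= show_num:
            break
        if sample[text_key] not in seen:
            added.append(sample)
    return added


previous = [{"text": "a cat"}, {"text": "a dog"}]
processed = [{"text": "a cat"}, {"text": "a small cat"},
             {"text": "a dog"}, {"text": "a big dog"}]
augmented = find_new_samples(previous, processed, "text")
```

The originals pass through unchanged and only the two added variants are reported.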

trace_filter(op_name: str, previous_ds: Dataset, processed_ds: Dataset)[source]

Compare datasets before and after a Filter.

This mainly shows the samples removed by the Filter.

Parameters:
  • op_name -- the name of the Filter OP

  • previous_ds -- dataset before the filter process

  • processed_ds -- dataset processed by the filter

Returns:
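Finding the samples a Filter removed amounts to walking both datasets in parallel. The sketch below is an illustration under assumptions rather than the library's code: find_removed_samples is a hypothetical name, plain lists stand in for Dataset objects, and the Filter is assumed to preserve the order of the samples it keeps.

```python
def find_removed_samples(previous, processed, show_num=10):
    """Return up to show_num samples present in previous but not in processed.

    Assumes processed is an order-preserving subsequence of previous.
    """
    removed = []
    kept_index = 0
    for sample in previous:
        if len(removed) >= show_num:
            break
        if kept_index < len(processed) and processed[kept_index] == sample:
            kept_index += 1  # this sample survived the filter
            continue
        removed.append(sample)
    return removed


previous = [{"text": "a"}, {"text": "b"}, {"text": "c"}, {"text": "d"}]
processed = [{"text": "a"}, {"text": "c"}]
removed = find_removed_samples(previous, processed)
```

Because the walk relies on order rather than set membership, it also handles datasets that contain repeated samples.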

trace_deduplicator(op_name: str, dup_pairs: dict)[source]

Compare datasets before and after a Deduplicator.

This mainly shows the near-duplicate sample pairs extracted by the Deduplicator. Unlike the mapper and filter trace methods, tracing for a Deduplicator is embedded in the Deduplicator's process method; the mapper and filter trace methods run independently of their operators' process methods.

Parameters:
  • op_name -- the name of the Deduplicator OP

  • dup_pairs -- duplicate sample pairs obtained from deduplicator

Returns:
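The documentation does not describe the layout of dup_pairs, so the sketch below makes an explicit assumption: each key maps to a list of samples judged to be duplicates of one another. The function name summarize_dup_pairs and the output keys "dup1" and "dup2" are likewise hypothetical.

```python
def summarize_dup_pairs(dup_pairs, show_num=10):
    """Turn up to show_num duplicate groups into flat pair records.

    Assumed layout: {key: [sample, sample, ...]}; groups with fewer than
    two samples are skipped because they contain no pair.
    """
    rows = []
    for key in sorted(dup_pairs):
        if len(rows) >= show_num:
            break
        group = dup_pairs[key]
        if len(group) < 2:
            continue
        rows.append({"dup1": group[0], "dup2": group[1]})
    return rows


dup_pairs = {
    "hash_b": [{"text": "Hello world"}, {"text": "Hello world!"}],
    "hash_a": [{"text": "unique"}],
    "hash_c": [{"text": "foo bar"}, {"text": "foo  bar"}],
}
rows = summarize_dup_pairs(dup_pairs, show_num=10)
```

Sorting the keys keeps the output order deterministic across runs.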