data_juicer.ops.mapper.nlpcda_zh_mapper module

class data_juicer.ops.mapper.nlpcda_zh_mapper.NlpcdaZhMapper(sequential: bool = False, aug_num: Annotated[int, Gt(gt=0)] = 1, keep_original_sample: bool = True, replace_similar_word: bool = False, replace_homophone_char: bool = False, delete_random_char: bool = False, swap_random_char: bool = False, replace_equivalent_num: bool = False, *args, **kwargs)

Bases: Mapper

Augments Chinese text samples using the nlpcda library.

This operator applies various augmentation methods to Chinese text: replacing words with similar words, replacing characters with homophones, deleting random characters, swapping adjacent characters, and replacing numbers with equivalent representations. The aug_num parameter controls how many augmented samples are generated. If sequential is True, all enabled augmentation methods are applied to a sample one after another; otherwise, each enabled method generates its own augmented samples independently. The keep_original_sample flag determines whether the original sample is kept or removed. Using 1-3 augmentation methods at a time is recommended, to avoid significantly changing the semantics of the samples. Some augmentation methods may not work on certain special texts, in which case no augmented samples are generated.
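The difference between sequential and independent application can be sketched with two toy string transforms. This is only an illustration under stated assumptions: swap_first_two, drop_last_char, and augment are hypothetical helpers, not nlpcda or Data-Juicer functions, and the sketch produces one output per application (as if aug_num were 1).

```python
from typing import Callable, List


# Hypothetical toy stand-ins for augmentation methods; NOT nlpcda functions.
def swap_first_two(text: str) -> str:
    """Swap the first two characters of the text."""
    return text[1] + text[0] + text[2:] if len(text) >= 2 else text


def drop_last_char(text: str) -> str:
    """Delete the final character of the text."""
    return text[:-1]


def augment(text: str, methods: List[Callable[[str], str]],
            sequential: bool) -> List[str]:
    """Apply the methods either as one chain or independently."""
    if sequential:
        # Chain: each method receives the output of the previous one.
        out = text
        for method in methods:
            out = method(out)
        return [out]
    # Independent: each method is applied to the original text separately.
    return [method(text) for method in methods]


methods = [swap_first_two, drop_last_char]
sequential_result = augment("abcd", methods, sequential=True)
independent_result = augment("abcd", methods, sequential=False)
print(sequential_result)   # ['bac']
print(independent_result)  # ['bacd', 'abc']
```

In sequential mode the two transforms yield a single combined result, while in independent mode each transform contributes its own result.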

__init__(sequential: bool = False, aug_num: Annotated[int, Gt(gt=0)] = 1, keep_original_sample: bool = True, replace_similar_word: bool = False, replace_homophone_char: bool = False, delete_random_char: bool = False, swap_random_char: bool = False, replace_equivalent_num: bool = False, *args, **kwargs)

Initialization method. All augmentation methods use their default parameters. We recommend enabling only 1-3 augmentation methods at a time; otherwise, the semantics of the samples might change significantly. Note: some augmentation methods might not work for certain special texts, so no augmented texts may be generated for them.
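One way to follow the 1-3 methods recommendation is to count the enabled flags before constructing the operator. The sketch below is an assumption-laden illustration: enabled_methods and METHOD_FLAGS are hypothetical names, the operator itself is not imported, and only the flag names are taken from the signature above.

```python
import warnings

# Flag names copied from the constructor signature; the tuple itself is
# a hypothetical helper, not part of Data-Juicer.
METHOD_FLAGS = (
    "replace_similar_word",
    "replace_homophone_char",
    "delete_random_char",
    "swap_random_char",
    "replace_equivalent_num",
)


def enabled_methods(op_kwargs: dict) -> list:
    """Return the augmentation flags set to True, warning if more than 3."""
    names = [name for name in METHOD_FLAGS if op_kwargs.get(name, False)]
    if len(names) > 3:
        warnings.warn(
            f"{len(names)} augmentation methods enabled; 1-3 are recommended "
            "to limit semantic drift."
        )
    return names


# Keyword arguments one might later pass to the operator's constructor.
op_kwargs = {"aug_num": 2, "replace_similar_word": True, "swap_random_char": True}
print(enabled_methods(op_kwargs))
```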

Parameters:

sequential – whether to combine all augmentation methods into a sequence. If True, a sample is augmented by all enabled augmentation methods sequentially. If False, each enabled augmentation method generates its augmented samples independently.
aug_num – number of augmented samples to generate. If sequential is True, a total of aug_num augmented samples are generated. If False, (aug_num * number of enabled augmentation methods) augmented samples are generated.
keep_original_sample – whether to keep the original sample. If False, the final dataset contains only the generated texts and the original texts are removed. Defaults to True.
replace_similar_word – whether to enable the augmentation method that replaces random words in the original texts with similar words. e.g. “这里一共有5种不同的数据增强方法” –> “这边一共有5种不同的数据增强方法”
replace_homophone_char – whether to enable the augmentation method that replaces random characters in the original texts with their homophones. e.g. “这里一共有5种不同的数据增强方法” –> “这里一共有5种不同的濖据增强方法”
delete_random_char – whether to enable the augmentation method that deletes random characters from the original texts. e.g. “这里一共有5种不同的数据增强方法” –> “这里一共有5种不同的数据增强”
swap_random_char – whether to enable the augmentation method that swaps random contiguous characters in the original texts. e.g. “这里一共有5种不同的数据增强方法” –> “这里一共有5种不同的数据强增方法”
replace_equivalent_num – whether to enable the augmentation method that replaces random numbers in the original texts with their equivalent representations. Note: currently this applies only to numbers. e.g. “这里一共有5种不同的数据增强方法” –> “这里一共有伍种不同的数据增强方法”
args – extra positional args
kwargs – extra keyword args
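The sample-count rules described for sequential, aug_num, and keep_original_sample can be worked through as simple arithmetic. The function below is a hypothetical helper written for illustration, not part of Data-Juicer. It returns an upper bound per input sample, since some methods may produce no output for special texts, and it assumes at least one method is enabled.

```python
def max_output_samples(aug_num: int, num_enabled: int, sequential: bool,
                       keep_original_sample: bool = True) -> int:
    """Upper bound on output samples produced from one input sample."""
    if aug_num <= 0:
        raise ValueError("aug_num must be greater than 0")
    if num_enabled < 1:
        # Assumption of this sketch: at least one method is enabled.
        raise ValueError("enable at least one augmentation method")
    # Sequential: aug_num samples in total.
    # Independent: aug_num samples for each enabled method.
    augmented = aug_num if sequential else aug_num * num_enabled
    # The original sample adds one more if it is kept.
    return augmented + (1 if keep_original_sample else 0)


print(max_output_samples(2, 3, sequential=False))  # 7
print(max_output_samples(2, 3, sequential=True))   # 3
```

With aug_num=2 and three enabled methods, independent mode yields at most 6 augmented samples plus the original, while sequential mode yields at most 2 plus the original.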

process_batched(samples)