data_juicer.ops.mapper.nlpaug_en_mapper module

class data_juicer.ops.mapper.nlpaug_en_mapper.NlpaugEnMapper(sequential: bool = False, aug_num: Annotated[int, Gt(gt=0)] = 1, keep_original_sample: bool = True, delete_random_word: bool = False, swap_random_word: bool = False, spelling_error_word: bool = False, split_random_word: bool = False, keyboard_error_char: bool = False, ocr_error_char: bool = False, delete_random_char: bool = False, swap_random_char: bool = False, insert_random_char: bool = False, *args, **kwargs)

Bases: Mapper

Augments English text samples using various methods from the nlpaug library.

This operator applies a series of text augmentation techniques to generate new samples. It supports both word-level and character-level augmentations, such as deleting, swapping, and inserting words or characters. The number of augmented samples can be controlled, and the original samples can be kept or removed. When multiple augmentation methods are enabled, they can be applied sequentially or independently. In sequential mode, each augmented sample is produced by passing the original through all enabled methods in sequence. In independent mode, each enabled method generates its own augmented samples from the original. We recommend using 1-3 augmentation methods at a time to avoid significant changes in sample semantics.
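The difference between the two modes can be sketched with toy stand-ins. The snippet below does not use nlpaug or Data-Juicer: the two augmenter functions and the `augment` helper are hypothetical illustrations written for this page, and the real operator may differ in details such as randomness and output ordering.

```python
import random
from typing import Callable, List

# Hypothetical augmenter type: takes a text and a random source, returns a new text.
Augmenter = Callable[[str, random.Random], str]


def toy_delete_random_word(text: str, rng: random.Random) -> str:
    """Toy stand-in: drop one randomly chosen word."""
    words = text.split()
    if len(words) > 1:
        del words[rng.randrange(len(words))]
    return " ".join(words)


def toy_swap_random_char(text: str, rng: random.Random) -> str:
    """Toy stand-in: swap one randomly chosen pair of adjacent characters."""
    chars = list(text)
    if len(chars) > 1:
        i = rng.randrange(len(chars) - 1)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def augment(
    text: str,
    methods: List[Augmenter],
    aug_num: int = 1,
    sequential: bool = False,
    keep_original_sample: bool = True,
    seed: int = 0,
) -> List[str]:
    """Illustrative only: mimics the documented sequential/independent modes."""
    rng = random.Random(seed)
    out = [text] if keep_original_sample else []
    if sequential:
        # Each augmented sample passes through every enabled method in order.
        for _ in range(aug_num):
            sample = text
            for method in methods:
                sample = method(sample, rng)
            out.append(sample)
    else:
        # Each enabled method produces its own aug_num samples from the original.
        for method in methods:
            for _ in range(aug_num):
                out.append(method(text, rng))
    return out


methods = [toy_delete_random_word, toy_swap_random_char]
print(augment("I love LLM", methods, aug_num=2, sequential=True))
print(augment("I love LLM", methods, aug_num=2, sequential=False))
```

With two methods and `aug_num=2`, the sequential call returns the original plus two augmented texts, while the independent call returns the original plus four.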

__init__(sequential: bool = False, aug_num: Annotated[int, Gt(gt=0)] = 1, keep_original_sample: bool = True, delete_random_word: bool = False, swap_random_word: bool = False, spelling_error_word: bool = False, split_random_word: bool = False, keyboard_error_char: bool = False, ocr_error_char: bool = False, delete_random_char: bool = False, swap_random_char: bool = False, insert_random_char: bool = False, *args, **kwargs)

Initialization method. All augmentation methods use their default parameters. We recommend enabling only 1-3 augmentation methods at a time; otherwise, the semantics of the samples might change significantly.

Parameters:

sequential – whether to combine all augmentation methods into a sequence. If it’s True, a sample is augmented by all enabled augmentation methods sequentially. If it’s False, each enabled augmentation method generates its augmented samples independently.
aug_num – number of augmented samples to generate. If sequential is True, aug_num augmented samples are generated in total. If it’s False, (aug_num * #enabled_aug_methods) augmented samples are generated.
keep_original_sample – whether to keep the original sample. If it’s set to False, the final dataset contains only the generated texts and the original texts are removed. Defaults to True.
delete_random_word – whether to enable the augmentation method that deletes random words from the original texts. e.g. “I love LLM” –> “I LLM”
swap_random_word – whether to enable the augmentation method that swaps random contiguous words in the original texts. e.g. “I love LLM” –> “Love I LLM”
spelling_error_word – whether to enable the augmentation method that simulates spelling errors for words in the original texts. e.g. “I love LLM” –> “Ai love LLM”
split_random_word – whether to enable the augmentation method that splits words randomly with whitespace in the original texts. e.g. “I love LLM” –> “I love LL M”
keyboard_error_char – whether to enable the augmentation method that simulates keyboard errors for characters in the original texts. e.g. “I love LLM” –> “I ;ov4 LLM”
ocr_error_char – whether to enable the augmentation method that simulates OCR errors for characters in the original texts. e.g. “I love LLM” –> “I 10ve LLM”
delete_random_char – whether to enable the augmentation method that deletes random characters from the original texts. e.g. “I love LLM” –> “I oe LLM”
swap_random_char – whether to enable the augmentation method that swaps random contiguous characters in the original texts. e.g. “I love LLM” –> “I ovle LLM”
insert_random_char – whether to enable the augmentation method that inserts random characters into the original texts. e.g. “I love LLM” –> “I ^lKove LLM”
args – extra positional arguments
kwargs – extra keyword arguments
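
The sample-count rule stated for aug_num and keep_original_sample can be checked with a small helper. This is a sketch of the documented arithmetic only: `expected_sample_count` and `METHOD_FLAGS` are hypothetical names, not part of Data-Juicer, and the helper refuses the case where no method is enabled because the documentation above does not describe it.

```python
# The nine augmentation flags listed in the parameter documentation above.
METHOD_FLAGS = (
    "delete_random_word",
    "swap_random_word",
    "spelling_error_word",
    "split_random_word",
    "keyboard_error_char",
    "ocr_error_char",
    "delete_random_char",
    "swap_random_char",
    "insert_random_char",
)


def expected_sample_count(
    aug_num: int = 1,
    sequential: bool = False,
    keep_original_sample: bool = True,
    **flags: bool,
) -> int:
    """Hypothetical helper: output samples per input text, per the documented rule."""
    if aug_num <= 0:
        raise ValueError("aug_num must be greater than 0")
    unknown = set(flags) - set(METHOD_FLAGS)
    if unknown:
        raise ValueError(f"unknown augmentation flags: {sorted(unknown)}")
    enabled = sum(1 for name in METHOD_FLAGS if flags.get(name, False))
    if enabled == 0:
        # Assumption: this sketch does not model a run with no method enabled.
        raise ValueError("enable at least one augmentation method")
    # Sequential: aug_num samples in total. Independent: aug_num per enabled method.
    augmented = aug_num if sequential else aug_num * enabled
    return augmented + (1 if keep_original_sample else 0)


# Two methods, aug_num=2, independent mode, original kept: 2 * 2 + 1 samples.
print(expected_sample_count(aug_num=2, delete_random_word=True, swap_random_char=True))
```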

process_batched(samples)