data_juicer.ops.mapper.sentence_augmentation_mapper module

class data_juicer.ops.mapper.sentence_augmentation_mapper.SentenceAugmentationMapper(hf_model: str = 'Qwen/Qwen2-7B-Instruct', system_prompt: str = None, task_sentence: str = None, max_new_tokens=256, temperature=0.2, top_p=None, num_beams=1, text_key=None, text_key_second=None, *args, **kwargs)

Bases: Mapper

Augments sentences by generating enhanced versions with a Hugging Face model, such as lmsys/vicuna-13b-v1.5 or Qwen/Qwen2-7B-Instruct. The operator is designed for individual sentences rather than full documents, so input text should be at the sentence level for best results. Both the primary and the secondary text key must be specified; the augmented sentence is stored under the secondary key. Generation can be tuned through the temperature, top-p sampling, and beam search size parameters.

__init__(hf_model: str = 'Qwen/Qwen2-7B-Instruct', system_prompt: str = None, task_sentence: str = None, max_new_tokens=256, temperature=0.2, top_p=None, num_beams=1, text_key=None, text_key_second=None, *args, **kwargs)

Initialization method.

Parameters:

hf_model – Hugging Face model id.
system_prompt – system prompt.
task_sentence – the instruction for the current task.
max_new_tokens – the maximum number of new tokens generated by the model.
temperature – controls the randomness of the generated text. The higher the temperature, the more random and creative the output.
top_p – nucleus sampling threshold: the next token is sampled from the smallest set of tokens whose cumulative probability reaches p.
num_beams – beam search size. Larger values explore more candidate sequences, which generally improves output quality at a higher computational cost.
text_key – the key name used to store the first sentence in the text pair. (optional, default='text')
text_key_second – the key name used to store the second sentence in the text pair.
args – extra positional args
kwargs – extra keyword args
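
To make the effect of temperature and top_p concrete, the following is a minimal pure-Python sketch of temperature scaling followed by nucleus (top-p) filtering. It is an illustration only and not how this operator or the underlying Hugging Face model implements sampling; the function names `softmax_with_temperature`, `top_p_filter`, and `sample_token` are hypothetical.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; a lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Renormalize so the kept probabilities sum to 1.
    return {i: probs[i] / mass for i in kept}


def sample_token(logits, temperature=0.2, top_p=None, rng=random):
    """Sample one token index (hypothetical helper, for illustration only)."""
    probs = softmax_with_temperature(logits, temperature)
    if top_p is None:
        candidates = dict(enumerate(probs))
    else:
        candidates = top_p_filter(probs, top_p)
    ids = list(candidates)
    weights = [candidates[i] for i in ids]
    return rng.choices(ids, weights=weights, k=1)[0]
```

With a low temperature such as the default 0.2, most of the probability mass concentrates on the highest-scoring token, so the output is close to greedy decoding; setting top_p below 1 additionally removes the low-probability tail before sampling.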

process_single(sample=None, rank=None)

Processes one sample at a time: sample -> sample.

Parameters: sample – sample to process
Returns: processed sample
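
The sample -> sample contract can be sketched as follows, assuming a sample is a plain dict and the augmented sentence is written under the secondary key. This is not the operator's actual implementation: `augment_sample`, `fake_generate`, and the key name `text_aug` are hypothetical, the model call is replaced by a trivial stand-in, and whether the real method copies or modifies the sample in place is not stated here.

```python
def augment_sample(sample, generate, text_key="text", text_key_second="text_aug"):
    """Return a copy of `sample` with an augmented sentence under text_key_second."""
    out = dict(sample)  # shallow copy so the input sample is left untouched
    out[text_key_second] = generate(sample[text_key])
    return out


def fake_generate(sentence):
    # Stand-in for a model call; it only upper-cases the sentence.
    return sentence.upper()


sample = {"text": "a cat sat on the mat."}
result = augment_sample(sample, fake_generate)
```

Here `result` keeps the original sentence under `text` and carries the generated one under `text_aug`, which mirrors the primary and secondary key roles described for this operator.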