data_juicer.ops.mapper.sentence_split_mapper module

class data_juicer.ops.mapper.sentence_split_mapper.SentenceSplitMapper(lang: str = 'en', *args, **kwargs)

Bases: Mapper

Splits text samples into individual sentences based on the specified language.

This operator uses an NLTK-based tokenizer to split the input text into sentences. The tokenizer language is set at initialization through the lang parameter; use the code that matches the input text (e.g., “en” for English), since accurate sentence splitting depends on it. The original text in each sample is replaced with a list of sentences. Samples are processed in batches for efficiency.
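The batched behavior described above can be sketched as follows. This is a minimal illustration, not the library's implementation: naive_split is a hypothetical regex-based stand-in for the NLTK tokenizer, split_batch is a hypothetical helper, and the column-oriented batch layout (a dict mapping the key "text" to a list of strings) is an assumption.

```python
import re
from typing import Callable, Dict, List


def naive_split(text: str) -> List[str]:
    """Hypothetical stand-in for an NLTK sentence tokenizer.

    Splits on whitespace that follows '.', '!' or '?'. A real tokenizer
    handles abbreviations and language-specific rules; this one does not.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def split_batch(
    samples: Dict[str, List[str]],
    text_key: str = "text",  # assumed name of the text column
    tokenize: Callable[[str], List[str]] = naive_split,
) -> Dict[str, list]:
    """Replace each text in the batch with its list of sentences."""
    out = dict(samples)  # shallow copy so the input batch is left unchanged
    out[text_key] = [tokenize(text) for text in samples[text_key]]
    return out


batch = {"text": ["Hello world. How are you?", "One sentence only."]}
result = split_batch(batch)
print(result["text"])
```

Passing the tokenizer as a parameter keeps the batching logic independent of the language-specific splitter, which mirrors how the operator selects its tokenizer from lang at initialization.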

__init__(lang: str = 'en', *args, **kwargs)

Initialization method.

Parameters:

lang – language code of the text to split into sentences (default: 'en').
args – extra positional arguments
kwargs – extra keyword arguments

process_batched(samples)