data_juicer.ops.mapper.sentence_split_mapper module
- class data_juicer.ops.mapper.sentence_split_mapper.SentenceSplitMapper(lang: str = 'en', *args, **kwargs)
Bases: Mapper
Splits text samples into individual sentences based on the specified language.
This operator uses an NLTK-based tokenizer to split the input text into sentences. The tokenizer language is fixed at initialization through the lang parameter, which defaults to 'en'. The original text in each sample is replaced with a list of sentences. Samples are processed in batches for efficiency. Set lang to the language code that matches the input text (e.g., "en" for English); a mismatched language can produce inaccurate sentence boundaries.
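The batched behaviour described above can be illustrated with a minimal, standard-library-only sketch. It is not the operator's implementation: a naive regular expression stands in for the NLTK tokenizer, and the names split_sentences, process_batched, and the "text" key are assumptions made for this illustration only.

```python
import re

# Naive stand-in for the NLTK tokenizer (assumption): a sentence ends at
# '.', '!' or '?' followed by whitespace. Real tokenizers handle
# abbreviations and language-specific rules that this pattern does not.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text):
    """Return the non-empty sentences found in `text`."""
    return [s for s in _SENTENCE_BOUNDARY.split(text.strip()) if s]


def process_batched(samples, text_key="text"):
    """Replace every text in a column-oriented batch with its sentences.

    `samples` maps column names to lists of values, one entry per sample.
    The input batch is left unmodified; a shallow copy is returned.
    """
    out = dict(samples)
    out[text_key] = [split_sentences(t) for t in samples[text_key]]
    return out


# Hypothetical batch of two samples.
batch = {"text": ["Hello there. How are you?", "One sentence only."]}
result = process_batched(batch)
```

With the actual operator, the equivalent step would start from SentenceSplitMapper(lang="en"); the regex here only approximates what a language-aware tokenizer does.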