data_juicer.ops.mapper.chinese_convert_mapper module

data_juicer.ops.mapper.chinese_convert_mapper.prepare_converter(mode)

class data_juicer.ops.mapper.chinese_convert_mapper.ChineseConvertMapper(mode: str = 's2t', *args, **kwargs)

Bases: Mapper

Mapper to convert Chinese text between Simplified Chinese, Traditional Chinese, and Japanese Kanji.

This operator converts Chinese text according to the specified mode. It supports conversions between Simplified Chinese, Traditional Chinese (including the Taiwan and Hong Kong variants), and Japanese Kanji, using a pre-defined set of conversion rules. The available modes are ‘s2t’ (Simplified to Traditional) and ‘t2s’ (Traditional to Simplified), plus the variant-specific modes ‘s2tw’, ‘tw2s’, ‘s2hk’, ‘hk2s’, ‘s2twp’, ‘tw2sp’, ‘t2tw’, ‘tw2t’, ‘hk2t’, ‘t2hk’, ‘t2jp’, and ‘jp2t’. The operator processes samples in batches and applies the conversion to the text stored under the configured text key of each sample.

__init__(mode: str = 's2t', *args, **kwargs)

Initialization method.

Parameters:
  • mode

    Choose the mode to convert Chinese:

    s2t: Simplified Chinese to Traditional Chinese,

    t2s: Traditional Chinese to Simplified Chinese,

    s2tw: Simplified Chinese to Traditional Chinese (Taiwan Standard),

    tw2s: Traditional Chinese (Taiwan Standard) to Simplified Chinese,

    s2hk: Simplified Chinese to Traditional Chinese (Hong Kong variant),

    hk2s: Traditional Chinese (Hong Kong variant) to Simplified Chinese,

    s2twp: Simplified Chinese to Traditional Chinese (Taiwan Standard) with Taiwanese idiom,

    tw2sp: Traditional Chinese (Taiwan Standard) to Simplified Chinese with Mainland Chinese idiom,

    t2tw: Traditional Chinese to Traditional Chinese (Taiwan Standard),

    tw2t: Traditional Chinese (Taiwan Standard) to Traditional Chinese,

    hk2t: Traditional Chinese (Hong Kong variant) to Traditional Chinese,

    t2hk: Traditional Chinese to Traditional Chinese (Hong Kong variant),

    t2jp: Traditional Chinese Characters (Kyūjitai) to New Japanese Kanji (Shinjitai),

    jp2t: New Japanese Kanji (Shinjitai) to Traditional Chinese Characters (Kyūjitai).

  • args – extra positional arguments

  • kwargs – extra keyword arguments
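
The mode names above follow a regular pattern: a source code, the digit 2, a target code, and an optional trailing "p" for idiom conversion. The sketch below spells that pattern out and rejects unknown modes. It is an illustration only: SUPPORTED_MODES, VARIANTS, and describe_mode are hypothetical names that are not part of this module, and the mapper's own validation may differ.

```python
# Mode names as listed in the parameter description above.
SUPPORTED_MODES = (
    "s2t", "t2s", "s2tw", "tw2s", "s2hk", "hk2s", "s2twp", "tw2sp",
    "t2tw", "tw2t", "hk2t", "t2hk", "t2jp", "jp2t",
)

# Script/variant codes that appear on either side of the "2" in a mode name.
VARIANTS = {
    "s": "Simplified Chinese",
    "t": "Traditional Chinese",
    "tw": "Traditional Chinese (Taiwan Standard)",
    "hk": "Traditional Chinese (Hong Kong variant)",
    "jp": "New Japanese Kanji (Shinjitai)",
}


def describe_mode(mode: str) -> str:
    """Hypothetical helper: validate a mode and spell out what it converts."""
    if mode not in SUPPORTED_MODES:
        raise ValueError(
            f"unsupported mode {mode!r}; expected one of {SUPPORTED_MODES}"
        )
    source, target = mode.split("2", 1)
    # A trailing "p" on the target (s2twp, tw2sp) requests idiom conversion.
    # "jp" is a variant code of its own, so test against the known codes
    # rather than simply checking for a final "p".
    idioms = target not in VARIANTS
    if idioms:
        target = target[:-1]
    suffix = " with idiom conversion" if idioms else ""
    return f"{VARIANTS[source]} to {VARIANTS[target]}{suffix}"
```

For example, describe_mode("s2twp") reports a Simplified to Taiwan Standard conversion with idiom conversion, while a misspelled mode raises ValueError before any text is processed.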

process_batched(samples)
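
A minimal sketch of what batched conversion over a text key can look like. It assumes a batch is a dict mapping column names to equal-length lists and that the text lives under a key named "text"; both are assumptions, not details confirmed by this page. TOY_S2T, toy_convert, and process_batched_sketch are hypothetical stand-ins, and the four-character table merely imitates an 's2t' conversion in place of the operator's full rule set.

```python
# Toy character table for illustration only; the real operator relies on a
# full conversion rule set, not on this handful of entries.
TOY_S2T = str.maketrans({"汉": "漢", "语": "語", "简": "簡", "体": "體"})


def toy_convert(text: str) -> str:
    """Stand-in converter: maps a few Simplified characters to Traditional."""
    return text.translate(TOY_S2T)


def process_batched_sketch(samples, convert=toy_convert, text_key="text"):
    """Convert every string under `text_key`; other columns are untouched."""
    samples[text_key] = [convert(text) for text in samples[text_key]]
    return samples


# A batch of two samples stored column-wise (assumed layout).
batch = {"text": ["简体汉语", "hello"], "id": [1, 2]}
result = process_batched_sketch(batch)
```

Text without convertible characters, such as "hello" here, passes through unchanged, and columns other than the text key are returned as they were.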