data_juicer.ops.mapper.chinese_convert_mapper module
- class data_juicer.ops.mapper.chinese_convert_mapper.ChineseConvertMapper(mode: str = 's2t', *args, **kwargs)
Bases:
Mapper
Mapper to convert Chinese text between Traditional, Simplified, and Japanese Kanji.
This operator converts Chinese text according to the specified mode. It supports conversions between Simplified Chinese, Traditional Chinese (including the Taiwan and Hong Kong variants), and Japanese Kanji. The conversion is performed using a pre-defined set of rules. The available modes are 's2t' for Simplified to Traditional, 't2s' for Traditional to Simplified, and the regional and Japanese variants 's2tw', 'tw2s', 's2hk', 'hk2s', 's2twp', 'tw2sp', 't2tw', 'tw2t', 'hk2t', 't2hk', 't2jp', and 'jp2t'. The operator processes samples in batches and applies the conversion to the specified text key of each sample.
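The batched behaviour described above can be pictured with a minimal sketch. It assumes a column-style batch (a dict that maps the text key to a list of strings) and uses a four-character toy table in place of the operator's real rule set. The names `TOY_S2T`, `toy_convert`, and `process_batched` are hypothetical and are not part of data_juicer.

```python
from typing import Callable, Dict, List

# Toy Simplified -> Traditional table covering four characters only.
# Hypothetical stand-in for the operator's real, much larger rule set.
TOY_S2T = {"汉": "漢", "语": "語", "简": "簡", "体": "體"}


def toy_convert(text: str) -> str:
    """Replace each character found in TOY_S2T and leave the rest unchanged."""
    return "".join(TOY_S2T.get(ch, ch) for ch in text)


def process_batched(samples: Dict[str, List[str]],
                    convert: Callable[[str], str],
                    text_key: str = "text") -> Dict[str, List[str]]:
    """Apply `convert` to every entry stored under `text_key` in the batch."""
    samples[text_key] = [convert(text) for text in samples[text_key]]
    return samples


batch = {"text": ["简体汉语", "hello"]}
result = process_batched(batch, toy_convert)
print(result["text"])  # ['簡體漢語', 'hello']
```

Only the values under the text key are rewritten; any other fields in the batch pass through untouched, and text without convertible characters is returned as-is.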
- __init__(mode: str = 's2t', *args, **kwargs)
Initialization method.
- Parameters:
mode --
Choose the mode to convert Chinese:
s2t: Simplified Chinese to Traditional Chinese,
t2s: Traditional Chinese to Simplified Chinese,
s2tw: Simplified Chinese to Traditional Chinese (Taiwan Standard),
tw2s: Traditional Chinese (Taiwan Standard) to Simplified Chinese,
s2hk: Simplified Chinese to Traditional Chinese (Hong Kong variant),
hk2s: Traditional Chinese (Hong Kong variant) to Simplified Chinese,
s2twp: Simplified Chinese to Traditional Chinese (Taiwan Standard) with Taiwanese idiom,
tw2sp: Traditional Chinese (Taiwan Standard) to Simplified Chinese with Mainland Chinese idiom,
t2tw: Traditional Chinese to Traditional Chinese (Taiwan Standard),
tw2t: Traditional Chinese (Taiwan Standard) to Traditional Chinese,
hk2t: Traditional Chinese (Hong Kong variant) to Traditional Chinese,
t2hk: Traditional Chinese to Traditional Chinese (Hong Kong variant),
t2jp: Traditional Chinese Characters (Kyūjitai) to New Japanese Kanji (Shinjitai),
jp2t: New Japanese Kanji (Shinjitai) to Traditional Chinese Characters (Kyūjitai),
args -- extra positional arguments
kwargs -- extra keyword arguments
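As a rough illustration of how a caller might guard the `mode` argument before constructing the operator, the sketch below checks a value against the fourteen modes listed above. `SUPPORTED_MODES` and `check_mode` are hypothetical helpers written for this example; how the operator itself validates `mode` is not shown in this reference.

```python
# The fourteen mode names documented for ChineseConvertMapper.
SUPPORTED_MODES = (
    "s2t", "t2s", "s2tw", "tw2s", "s2hk", "hk2s", "s2twp",
    "tw2sp", "t2tw", "tw2t", "hk2t", "t2hk", "t2jp", "jp2t",
)


def check_mode(mode: str = "s2t") -> str:
    """Return `mode` unchanged if it is documented, otherwise raise ValueError."""
    if mode not in SUPPORTED_MODES:
        raise ValueError(
            f"Unsupported mode {mode!r}; expected one of {SUPPORTED_MODES}"
        )
    return mode


print(check_mode("tw2sp"))  # tw2sp
```

Failing early on a misspelled mode such as 's2tww' keeps the error close to the configuration that caused it, rather than surfacing later during batch processing.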