utils
BaseTokenizer

Bases: BaseModel, ABC

Base tokenizer class providing a unified tokenization interface.

This abstract base class defines the interface for different tokenization strategies, including the tiktoken and jieba tokenizers.
Source code in rm_gallery/core/utils/tokenizer.py, lines 8–45
preprocess_text(text, to_lower=False)

Preprocess text before tokenization.

Parameters:

Name | Type | Description | Default
---|---|---|---
text | str | Input text | required
to_lower | bool | Whether to convert to lowercase | False

Returns:

Name | Type | Description
---|---|---
str | str | Preprocessed text

Source code in rm_gallery/core/utils/tokenizer.py, lines 31–45
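A minimal usage sketch, assuming the module is importable as rm_gallery.core.utils.tokenizer and that SimpleTokenizer needs no required constructor arguments:

```python
from rm_gallery.core.utils.tokenizer import SimpleTokenizer

tokenizer = SimpleTokenizer()

# With to_lower=True the text is lowercased before tokenization;
# the default (False) preserves the original casing.
text = tokenizer.preprocess_text("Hello World", to_lower=True)
print(text)  # e.g. "hello world"
```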
tokenize(text)

abstractmethod

Tokenize input text into a list of tokens.

Parameters:

Name | Type | Description | Default
---|---|---|---
text | str | Input text to tokenize | required

Returns:

Type | Description
---|---
List[str] | List of token strings

Source code in rm_gallery/core/utils/tokenizer.py, lines 18–29
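Since tokenize is abstract, every concrete tokenizer must implement it. A hypothetical subclass sketch (CharTokenizer is not part of the library) showing the contract:

```python
from typing import List

from rm_gallery.core.utils.tokenizer import BaseTokenizer


class CharTokenizer(BaseTokenizer):
    """Hypothetical subclass that emits one token per character."""

    def tokenize(self, text: str) -> List[str]:
        # Satisfy the abstract contract: return a list of token strings,
        # reusing the shared preprocessing hook first.
        return list(self.preprocess_text(text))


print(CharTokenizer().tokenize("abc"))  # ['a', 'b', 'c']
```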
JiebaTokenizer
Bases: BaseTokenizer
Jieba-based tokenizer for Chinese text processing.
Provides Chinese word segmentation using the jieba library, with optional Chinese-character filtering and text preprocessing.
Source code in rm_gallery/core/utils/tokenizer.py, lines 84–132
tokenize(text)

Tokenize Chinese text using jieba.

Parameters:

Name | Type | Description | Default
---|---|---|---
text | str | Input text to tokenize | required

Returns:

Type | Description
---|---
List[str] | List of token strings

Raises:

Type | Description
---|---
ImportError | If the jieba library is not installed

Source code in rm_gallery/core/utils/tokenizer.py, lines 110–132
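A usage sketch; jieba must be installed (pip install jieba), and chinese_only is assumed here to be a constructor field, mirroring the get_tokenizer factory parameter below:

```python
from rm_gallery.core.utils.tokenizer import JiebaTokenizer

# tokenize raises ImportError if the jieba package is missing.
tokenizer = JiebaTokenizer(chinese_only=False)
print(tokenizer.tokenize("我爱自然语言处理"))
# e.g. ['我', '爱', '自然语言', '处理']
```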
SimpleTokenizer
Bases: BaseTokenizer
Simple whitespace-based tokenizer.
Basic tokenizer that splits text on whitespace. Used as a fallback when other tokenizers are unavailable or fail.
Source code in rm_gallery/core/utils/tokenizer.py, lines 135–155
tokenize(text)

Tokenize text by splitting on whitespace.

Parameters:

Name | Type | Description | Default
---|---|---|---
text | str | Input text to tokenize | required

Returns:

Type | Description
---|---
List[str] | List of token strings

Source code in rm_gallery/core/utils/tokenizer.py, lines 145–155
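A usage sketch, assuming the module path rm_gallery.core.utils.tokenizer:

```python
from rm_gallery.core.utils.tokenizer import SimpleTokenizer

tokenizer = SimpleTokenizer()
print(tokenizer.tokenize("the quick brown fox"))
# ['the', 'quick', 'brown', 'fox']
```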
TiktokenTokenizer
Bases: BaseTokenizer
Tiktoken-based tokenizer supporting multilingual content.
Uses a tiktoken encoding for robust tokenization of Chinese, English, and other languages. Falls back to simple splitting if tiktoken fails.
Source code in rm_gallery/core/utils/tokenizer.py, lines 48–81
tokenize(text)

Tokenize text using the tiktoken encoder.

Parameters:

Name | Type | Description | Default
---|---|---|---
text | str | Input text to tokenize | required

Returns:

Type | Description
---|---
List[str] | List of token strings

Source code in rm_gallery/core/utils/tokenizer.py, lines 61–81
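A usage sketch; tiktoken must be installed (pip install tiktoken), and encoding_name is assumed here to be a constructor field, mirroring the get_tokenizer factory's default of "cl100k_base":

```python
from rm_gallery.core.utils.tokenizer import TiktokenTokenizer

tokenizer = TiktokenTokenizer(encoding_name="cl100k_base")
print(tokenizer.tokenize("Hello, 世界!"))
# Mixed English/Chinese input yields a list of token strings.
```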
get_tokenizer(tokenizer_type='tiktoken', encoding_name='cl100k_base', chinese_only=False, **kwargs)
Factory function to create tokenizer instances.
Parameters:

Name | Type | Description | Default
---|---|---|---
tokenizer_type | str | Type of tokenizer ("tiktoken", "jieba", or "simple") | 'tiktoken'
encoding_name | str | Tiktoken encoding name (for the tiktoken tokenizer) | 'cl100k_base'
chinese_only | bool | Whether to keep only Chinese characters (for the jieba tokenizer) | False
**kwargs | | Additional arguments for tokenizer initialization | {}

Returns:

Name | Type | Description
---|---|---
BaseTokenizer | BaseTokenizer | Tokenizer instance

Raises:

Type | Description
---|---
ValueError | If tokenizer_type is not supported

Source code in rm_gallery/core/utils/tokenizer.py, lines 158–189
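A usage sketch of the factory, following the documented signature:

```python
from rm_gallery.core.utils.tokenizer import get_tokenizer

# Default: tiktoken tokenizer with the cl100k_base encoding.
tokenizer = get_tokenizer()

# Jieba tokenizer that keeps only Chinese characters.
zh_tokenizer = get_tokenizer(tokenizer_type="jieba", chinese_only=True)

# Unsupported tokenizer_type values raise ValueError.
try:
    get_tokenizer(tokenizer_type="unknown")
except ValueError as exc:
    print(exc)
```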