core
ChatMessageConverter
Bases: DataConverter
Specialized converter for chat message data format with conversation structure.
Processes data containing message arrays with role/content pairs for chat-based reward modeling and conversation training.
Expected Input Data Format
{ "messages": [ {"role": "user", "content": "Hello"}, {"role": "assistant", "content": "Hi there!"} ] }
Output: DataSample with structured input messages and empty output for inference
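As a rough illustration of the conversion described above, the following sketch turns a `messages` dictionary into a simplified sample with an empty output. `SimpleSample` and `convert_messages` are hypothetical stand-ins written for this example; they are not the library's `DataSample` or the actual converter, whose ID generation, validation, and metadata handling may differ.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class SimpleSample:
    """Hypothetical stand-in for DataSample (not the rm_gallery class)."""
    input: List[Dict[str, str]]
    output: List[Any] = field(default_factory=list)  # left empty for inference
    metadata: Dict[str, Any] = field(default_factory=dict)


def convert_messages(data_dict: Dict[str, Any],
                     source_info: Dict[str, Any]) -> Optional[SimpleSample]:
    """Build a sample from a {"messages": [...]} dict; return None on bad input."""
    messages = data_dict.get("messages")
    if not isinstance(messages, list) or not messages:
        return None
    turns = []
    for msg in messages:
        if not isinstance(msg, dict) or "role" not in msg or "content" not in msg:
            return None  # malformed message: treat the whole record as failed
        turns.append({"role": str(msg["role"]), "content": str(msg["content"])})
    # Copy the source metadata so later changes to source_info do not leak in.
    return SimpleSample(input=turns, metadata=dict(source_info))


sample = convert_messages(
    {"messages": [{"role": "user", "content": "Hello"},
                  {"role": "assistant", "content": "Hi there!"}]},
    {"source_file_path": "chat.jsonl"},  # assumed metadata key, for illustration
)
```

Returning `None` for malformed records, rather than raising, mirrors the documented behaviour that conversion failures yield `None`.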
Source code in rm_gallery/core/data/load/chat_message.py
convert_to_data_sample(data_dict, source_info)
Convert chat message data dictionary to standardized DataSample format.
Extracts conversation messages from input data and creates a DataSample with structured input for chat-based processing pipelines.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`data_dict` | `Dict[str, Any]` | Raw data containing messages array with role/content pairs | required |
`source_info` | `Dict[str, Any]` | Metadata about data source (file path, dataset name, etc.) | required |
Returns:
Type | Description |
---|---|
`DataSample` | DataSample with structured conversation input and metadata, or None if conversion fails |
Source code in rm_gallery/core/data/load/chat_message.py
ConversationTurnFilter
Bases: BaseOperator
Filter conversations based on the number of turns in the input. A turn is defined as a single message in the conversation.
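The turn-count rule can be sketched independently of the library. The helper below is hypothetical (`filter_by_turns` is not part of `rm_gallery`) and assumes each conversation is a plain list of message dictionaries; it counts every message as one turn and treats both bounds as inclusive, as the parameter descriptions state.

```python
from typing import Any, Dict, List

Conversation = List[Dict[str, Any]]


def filter_by_turns(conversations: List[Conversation],
                    min_turns: int, max_turns: int) -> List[Conversation]:
    """Keep conversations whose message count lies in [min_turns, max_turns]."""
    return [c for c in conversations if min_turns <= len(c) <= max_turns]


short = [{"role": "user", "content": "Hi"}]                    # 1 message
medium = short + [{"role": "assistant", "content": "Hello!"}]  # 2 messages
lengthy = medium * 3                                           # 6 messages

kept = filter_by_turns([short, medium, lengthy], min_turns=2, max_turns=4)
```

Here only `medium` survives: `short` falls below the minimum and `lengthy` exceeds the maximum.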
Source code in rm_gallery/core/data/process/ops/filter/conversation_turn_filter.py
__init__(name, config=None)
Initialize the conversation turn filter.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`name` | `str` | Name of the operator | required |
`min_turns` | | Minimum number of turns required (inclusive) | required |
`max_turns` | | Maximum number of turns allowed (inclusive) | required |
`config` | `Optional[Dict[str, Any]]` | Additional configuration parameters | `None` |
Source code in rm_gallery/core/data/process/ops/filter/conversation_turn_filter.py
process_dataset(items)
Filter conversations based on the number of turns.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`items` | `List[DataSample]` | List of DataSample items to process | required |
Returns:
Type | Description |
---|---|
`List[DataSample]` | List of DataSample items that meet the turn count criteria |
Source code in rm_gallery/core/data/process/ops/filter/conversation_turn_filter.py
GenericConverter
Bases: DataConverter
Generic converter that automatically handles diverse HuggingFace dataset formats.
Acts as a fallback converter when no specific format converter is available. Intelligently extracts input/output pairs from common field names and structures.
Supported Input Patterns
- Fields: prompt, question, input, text, instruction (for input)
- Fields: response, answer, output, completion (for output)
- Messages: array of role/content objects for conversations
Output: DataSample with auto-detected task category and structured data
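The field-name matching listed above might look roughly like the sketch below. The helper names (`first_present`, `extract_pair`), the lookup order, and the choice to check `messages` first are all assumptions made for illustration; the actual converter may prioritise fields differently and also assigns a task category, which is omitted here.

```python
from typing import Any, Dict, Optional, Tuple

# Candidate field names taken from the list above; the order is an assumption.
INPUT_FIELDS = ("prompt", "question", "input", "text", "instruction")
OUTPUT_FIELDS = ("response", "answer", "output", "completion")


def first_present(data: Dict[str, Any], names: Tuple[str, ...]) -> Optional[str]:
    """Return the first non-empty string found under any of the given keys."""
    for name in names:
        value = data.get(name)
        if isinstance(value, str) and value.strip():
            return value
    return None


def extract_pair(data: Dict[str, Any]) -> Optional[Tuple[Any, Optional[str]]]:
    """Return (input, output), or None when no input can be found."""
    messages = data.get("messages")
    if isinstance(messages, list) and messages:
        return messages, None  # conversation: use the message list as input
    prompt = first_present(data, INPUT_FIELDS)
    if prompt is None:
        return None  # nothing recognisable to use as input
    return prompt, first_present(data, OUTPUT_FIELDS)


qa_pair = extract_pair({"question": "2+2?", "answer": "4"})
no_match = extract_pair({"id": 1})
```

A record with no recognised input field yields `None`, matching the documented behaviour when input/output extraction fails.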
Source code in rm_gallery/core/data/load/huggingface.py
convert_to_data_sample(data_dict, source_info)
Convert generic HuggingFace data dictionary to standardized DataSample format.
Automatically detects input/output patterns from common field names, determines task category, and creates appropriate data structure.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`data_dict` | `Dict[str, Any]` | Raw data dictionary from HuggingFace dataset | required |
`source_info` | `Dict[str, Any]` | Source metadata including dataset name, config, split info | required |
Returns:
Type | Description |
---|---|
`DataSample` | DataSample with auto-detected structure and task category, or None if input/output extraction fails |
Source code in rm_gallery/core/data/load/huggingface.py
TextLengthFilter
Bases: BaseOperator
Filter texts based on their length.
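A minimal sketch of an inclusive length filter follows. It assumes length is measured in characters over plain strings; the page does not say which text of a `DataSample` is measured or in what unit, so `filter_by_length` is a hypothetical helper and not the operator's implementation.

```python
from typing import List


def filter_by_length(texts: List[str], min_length: int, max_length: int) -> List[str]:
    """Keep texts whose character count lies in [min_length, max_length]."""
    return [t for t in texts if min_length <= len(t) <= max_length]


candidates = ["hi", "hello", "a much longer sentence"]
kept = filter_by_length(candidates, min_length=3, max_length=10)
```

With these bounds only `"hello"` is kept: `"hi"` is too short and the last string is too long.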
Source code in rm_gallery/core/data/process/ops/filter/text_length_filter.py
__init__(name, config=None)
Initialize the text length filter.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`name` | `str` | Name of the operator | required |
`min_length` | | Minimum text length required (inclusive) | required |
`max_length` | | Maximum text length allowed (inclusive) | required |
`config` | `Optional[Dict[str, Any]]` | Additional configuration parameters | `None` |
Source code in rm_gallery/core/data/process/ops/filter/text_length_filter.py
process_dataset(items)
Filter items based on text length.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`items` | `List[DataSample]` | List of data items to process | required |
Returns:
Type | Description |
---|---|
`List[DataSample]` | Filtered list of items |
Source code in rm_gallery/core/data/process/ops/filter/text_length_filter.py