data_juicer.utils.webdataset_utils module

data_juicer.utils.webdataset_utils.reconstruct_custom_webdataset_format(samples, field_mapping: Dict[str, str] | None = None)

Reconstruct the original dataset in the WebDataset format. Any key can be specified through the field_mapping argument, a dict that maps each target field key in the result format to the corresponding source field key in the original samples.

Parameters:
  • samples – the input samples batch to be reconstructed

  • field_mapping – a dict mapping each target field key to a source field key, used to construct the remaining fields. Defaults to None.
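
The direction of the mapping (target key to source key) is easy to get backwards, so a small standalone sketch may help. It does not call Data-Juicer; remap_fields is a hypothetical stand-in written for illustration, and the batch layout (a dict of lists), the field names image_bytes and caption, and the behavior when no mapping is given are all assumptions rather than documented behavior of reconstruct_custom_webdataset_format.

```python
from typing import Any, Dict, List, Optional


def remap_fields(
    samples: Dict[str, List[Any]],
    field_mapping: Optional[Dict[str, str]] = None,
) -> Dict[str, List[Any]]:
    """Hypothetical illustration only, not the Data-Juicer implementation.

    Each entry in field_mapping reads as {target_key: source_key}.
    """
    if not field_mapping:
        # Assumption: with no mapping there is nothing to rename,
        # so the batch is returned unchanged.
        return samples
    # Build the result keyed by target name, reading from the source name.
    return {target: samples[source] for target, source in field_mapping.items()}


# Assumed batch layout: one list per field, one entry per sample.
batch = {
    "image_bytes": [b"\x89PNG", b"\xff\xd8"],
    "caption": ["a cat", "a dog"],
}

# Target keys on the left, source keys on the right.
mapping = {"jpg": "image_bytes", "txt": "caption"}

result = remap_fields(batch, mapping)
unchanged = remap_fields(batch)
```

Here the target keys jpg and txt follow the common WebDataset habit of naming fields after file extensions; the actual keys required depend on how the dataset will be written and read.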