# Postprocess tools

This folder contains postprocess scripts for additional processing of your dataset after it has been processed by Data-Juicer.

## Usage

### Count tokens for datasets

Use [count_token.py](count_token.py) to count tokens for datasets.

```shell
python tools/postprocess/count_token.py    \
    --data_path                            \
    --text_keys                            \
    --tokenizer_method                     \
    --num_proc

# get help
python tools/postprocess/count_token.py --help
```

- `data_path`: path to the input dataset. Only `jsonl` is supported for now.
- `text_keys`: field keys whose contents are included in the token count.
- `tokenizer_method`: name of the Hugging Face tokenizer to use.
- `num_proc`: number of processes used to count tokens.

### Mix multiple datasets with optional weights

Use [data_mixture.py](data_mixture.py) to mix multiple datasets. The script randomly selects samples from each input dataset, mixes them, and exports the result as a new dataset.

```shell
python tools/postprocess/data_mixture.py    \
    --data_path                             \
    --export_path                           \
    --export_shard_size                     \
    --num_proc

# get help
python tools/postprocess/data_mixture.py --help
```

- `data_path`: a dataset file, a list of dataset files, or a mixture of both, each optionally preceded by a weight. If no weight is given, 1.0 is used by default.
- `export_path`: file name for exporting the mixed dataset; `json` / `jsonl` / `parquet` are supported.
- `export_shard_size`: shard size of the exported dataset files in bytes. If not set, the mixed dataset is exported as a single file.
- `num_proc`: number of processes used to load and export the datasets.
- e.g., `python tools/postprocess/data_mixture.py --data_path ds.jsonl ds_dir ds_file.json`

**Note:** All datasets must share the same meta fields, so that [HuggingFace Datasets](https://huggingface.co/docs/datasets/index) can be used to align their features.

### Deserialize meta fields in jsonl file

This tool is usually used together with [serialize_meta.py](../preprocess/serialize_meta.py) to deserialize the specified field back into its original format.

```shell
python tools/postprocess/deserialize_meta.py    \
    --src_dir                                   \
    --target_dir                                \
    --serialized_key                            \
    --num_proc

# get help
python tools/postprocess/deserialize_meta.py --help
```

- `src_dir`: path to the directory that stores the jsonl files to be converted.
- `target_dir`: path to save the converted jsonl files.
- `serialized_key`: key of the field to be deserialized. Defaults to `source_info`.
- `num_proc` (optional): number of worker processes. Defaults to 1.

**Note:** After deserialization, all fields that were serialized in the original files are placed under `serialized_key`. This ensures that fields generated during Data-Juicer processing do not conflict with the original meta fields.
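For reference, a hypothetical filled-in invocation of the token-counting tool above might look like the following; the dataset path, text key, and tokenizer name are illustrative placeholders, not files or defaults shipped with Data-Juicer.

```shell
# Count tokens in the `text` field of an illustrative processed.jsonl file
# with the GPT-2 tokenizer, using 4 processes.
python tools/postprocess/count_token.py \
    --data_path processed.jsonl \
    --text_keys text \
    --tokenizer_method gpt2 \
    --num_proc 4
```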
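Similarly, a hypothetical invocation of the mixing tool, assuming the optional weights are given as numbers placed right before the corresponding dataset paths; all paths, weights, and sizes below are illustrative.

```shell
# Mix two illustrative datasets with weights 0.5 and 2.0, exporting the result
# as jsonl shards of at most 1 GB each, using 4 processes.
python tools/postprocess/data_mixture.py \
    --data_path 0.5 ds1.jsonl 2.0 ds2.jsonl \
    --export_path mixed.jsonl \
    --export_shard_size 1073741824 \
    --num_proc 4
```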
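Finally, a hypothetical invocation of the deserialization tool, assuming the serialized jsonl files live in `./serialized`; both directory paths are illustrative.

```shell
# Deserialize the `source_info` field of every jsonl file in ./serialized
# and write the converted files to ./deserialized, using 2 processes.
python tools/postprocess/deserialize_meta.py \
    --src_dir ./serialized \
    --target_dir ./deserialized \
    --serialized_key source_info \
    --num_proc 2
```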