data_juicer.analysis.collector module
- class data_juicer.analysis.collector.TextTokenDistCollector(tokenizer)
Bases: object
Tokenizes a given dataset with a specified tokenizer and collects the distribution of its tokens.
- __init__(tokenizer)
Initialization method.
- Parameters:
tokenizer – name of the tokenizer on Hugging Face
- collect(data_path, text_key, num_proc=1) → Categorical
Tokenize the input dataset and collect its token distribution.
- Parameters:
data_path – path to the input dataset.
text_key – field keys whose text is included in the token counts.
num_proc – number of processes used to count tokens.
- Returns:
the token distribution.
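To make "token distribution" concrete, here is a minimal, self-contained sketch of the idea in plain Python. It is not the library's implementation. The helper names `whitespace_tokenize` and `collect_token_distribution` are hypothetical, a whitespace split stands in for the Hugging Face tokenizer, an in-memory list stands in for the dataset at `data_path`, and a plain dict of probabilities stands in for the `Categorical` return value.

```python
from collections import Counter


def whitespace_tokenize(text):
    # Stand-in tokenizer (assumption): the real collector is given the
    # name of a Hugging Face tokenizer instead.
    return text.lower().split()


def collect_token_distribution(samples, text_key, tokenize=whitespace_tokenize):
    """Count the tokens found under `text_key` and normalize to probabilities."""
    counts = Counter()
    for sample in samples:
        counts.update(tokenize(sample[text_key]))

    total = sum(counts.values())
    if total == 0:
        # No tokens at all: return an empty distribution.
        return {}
    return {token: n / total for token, n in counts.items()}


# Hypothetical in-memory dataset; the real method reads from `data_path`.
samples = [
    {"text": "the cat sat"},
    {"text": "the dog sat"},
]
dist = collect_token_distribution(samples, text_key="text")
```

Based on the signatures documented above, the corresponding call on the real class would presumably be `TextTokenDistCollector(tokenizer).collect(data_path, text_key)`, with `num_proc` raised to count tokens in parallel.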