data_juicer.utils.sample module

data_juicer.utils.sample.random_sample(dataset, weight=1.0, sample_number=0, seed=None)

Randomly sample a subset of a dataset, either by weight (a ratio) or by an absolute number of samples. If sample_number is greater than 0, it is used instead of weight.

:param dataset: a HuggingFace dataset
:param weight: sampling ratio applied to the dataset; used only when sample_number is not greater than 0
:param sample_number: number of samples to draw; if greater than 0, it overrides weight
:param seed: random seed for sampling; if None, 42 is used as the default
:return: a subset of dataset
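The precedence between sample_number and weight can be illustrated with a minimal sketch. This is not the library's implementation: random_sample_sketch is a hypothetical stand-in that works on a plain Python list rather than a HuggingFace dataset, and both the truncating rounding of the weighted size and the cap at the dataset length are assumptions made only for this example.

```python
import random


def random_sample_sketch(rows, weight=1.0, sample_number=0, seed=None):
    """Hypothetical stand-in mirroring the documented parameter rules."""
    # Documented behavior: a seed of None falls back to 42.
    if seed is None:
        seed = 42
    # Documented behavior: a positive sample_number takes precedence over weight.
    if sample_number > 0:
        n = sample_number
    else:
        # Assumption: the weighted size is truncated to an integer.
        n = int(len(rows) * weight)
    # Assumption: never request more rows than the list holds.
    n = min(n, len(rows))
    # A dedicated generator keeps the result reproducible for a given seed.
    rng = random.Random(seed)
    return rng.sample(rows, n)


rows = list(range(10))
by_weight = random_sample_sketch(rows, weight=0.5)
by_number = random_sample_sketch(rows, weight=0.5, sample_number=3)
print(len(by_weight), len(by_number))
```

In the second call the weight of 0.5 is ignored because sample_number is positive. Calling the sketch again with the same arguments returns the same rows, since the seed defaults to a fixed value.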