data_juicer.utils.common_utils module

data_juicer.utils.common_utils.stats_to_number(s, reverse=True)

Convert a stats value, which can be a string or a list, to a float.
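
The conversion can be sketched as follows. This is an illustrative stand-in, not the library's implementation: the name stats_to_number_sketch is hypothetical, and both the reduction of a list to its mean and the use of reverse to pick a sentinel on failure are assumptions.

```python
import sys
from statistics import fmean


def stats_to_number_sketch(s, reverse=True):
    """Hypothetical helper: turn a string or a list of numbers into a float."""
    try:
        if isinstance(s, str):
            return float(s)
        if s is None or s == []:
            raise ValueError("empty value")
        if isinstance(s, (list, tuple)):
            # Assumption: a list of stats is summarized by its mean.
            return float(fmean(s))
        return float(s)
    except (TypeError, ValueError):
        # Assumption: unusable values map to a very small or very large
        # sentinel, selected by `reverse`.
        return -sys.maxsize if reverse else sys.maxsize


print(stats_to_number_sketch("3.5"))      # 3.5
print(stats_to_number_sketch([1, 2, 3]))  # 2.0
```

A sentinel at one end of the numeric range keeps unparseable values together when samples are sorted by the resulting number.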

data_juicer.utils.common_utils.dict_to_hash(input_dict: dict, hash_length=None)

Hash a dict to a string of length hash_length.

Parameters:

input_dict – the given dict
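
One way to hash a dict deterministically is sketched below. The name dict_to_hash_sketch, the choice of SHA-256, and the truncation of the hex digest to hash_length characters are assumptions made for illustration; the library may do this differently.

```python
import hashlib


def dict_to_hash_sketch(input_dict: dict, hash_length=None):
    """Hypothetical helper: order-independent hash of a dict."""
    # Sorting the items makes the result independent of insertion order.
    # This requires the keys to be mutually comparable (e.g. all strings).
    dict_string = str(sorted(input_dict.items())).encode()
    hash_value = hashlib.sha256(dict_string).hexdigest()
    if hash_length:
        # Assumption: a shorter hash is a prefix of the full hex digest.
        hash_value = hash_value[:hash_length]
    return hash_value


full = dict_to_hash_sketch({"a": 1, "b": 2})
short = dict_to_hash_sketch({"b": 2, "a": 1}, hash_length=8)
print(len(full), len(short))  # 64 8
```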

data_juicer.utils.common_utils.nested_access(data, path, digit_allowed=True)

Access nested data using a dot-separated path.

Parameters:
  • data – A dictionary or a list to access the nested data from.

  • path – A dot-separated string representing the path to access. This can include numeric indices when accessing list elements.

  • digit_allowed – Whether to convert path components made up of digits from strings to integers, so they can be used as list indices.

Returns:

The value located at the specified path. Raises a KeyError or IndexError if the path does not exist.
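
The lookup described above can be sketched as a loop over the path components. The name nested_access_sketch is hypothetical, and the details (splitting on ".", using str.isdigit to detect indices) are assumptions rather than the library's confirmed behavior.

```python
def nested_access_sketch(data, path, digit_allowed=True):
    """Hypothetical helper: follow a dot-separated path through dicts/lists."""
    for key in path.split("."):
        # Components made only of digits are treated as integer indices
        # when digit_allowed is set; otherwise they remain string keys.
        if digit_allowed and key.isdigit():
            key = int(key)
        data = data[key]  # raises KeyError / IndexError on a missing step
    return data


sample = {"a": {"b": [10, 20, {"c": "x"}]}}
print(nested_access_sketch(sample, "a.b.2.c"))  # x
print(nested_access_sketch({"1": "one"}, "1", digit_allowed=False))  # one
```

With digit_allowed=True, a dict whose keys are digit strings such as "1" cannot be reached in this sketch, because the component is converted to the integer 1 before the lookup.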

data_juicer.utils.common_utils.is_string_list(var)

Return whether var is a list of strings.

Parameters:

var – input variable
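
A check of this kind might look like the sketch below. The name is_string_list_sketch is hypothetical, and treating an empty list as a list of strings is an assumption of this sketch, not documented behavior.

```python
def is_string_list_sketch(var):
    """Hypothetical helper: True only for a list whose items are all str."""
    # all() over an empty list is True, so [] counts as a string list here.
    return isinstance(var, list) and all(isinstance(item, str) for item in var)


print(is_string_list_sketch(["a", "b"]))  # True
print(is_string_list_sketch(["a", 1]))    # False
print(is_string_list_sketch("ab"))        # False
```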

data_juicer.utils.common_utils.avg_split_string_list_under_limit(str_list: list, token_nums: list, max_token_num=None)

Split the string list into several sub-lists such that the total token count of each sub-list is less than max_token_num, while keeping the total token counts of the sub-lists similar.

Parameters:
  • str_list – input string list.

  • token_nums – token count of each string in str_list.

  • max_token_num – max token num of each sub string list.
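
A simple greedy heuristic for this kind of balanced split is sketched below. It is not presented as the library's algorithm: the name avg_split_sketch, the inclusive treatment of the limit, the ValueError on mismatched lengths, and the "half of the next item" balancing rule are all assumptions chosen for illustration.

```python
def avg_split_sketch(str_list, token_nums, max_token_num=None):
    """Hypothetical helper: split into contiguous, roughly equal groups."""
    if max_token_num is None:
        return [str_list]
    if len(str_list) != len(token_nums):
        raise ValueError("str_list and token_nums must have equal length")

    total = sum(token_nums)
    if total <= max_token_num:
        return [str_list]

    # Estimate how many groups are needed and the target size of each.
    group_num = total // max_token_num + 1
    avg_num = total / group_num

    groups, current, current_sum = [], [], 0
    for text, num in zip(str_list, token_nums):
        over_limit = current_sum + num > max_token_num
        # Close the group if adding half of the next item already passes
        # the target average; this keeps group totals near avg_num.
        past_average = current_sum + num / 2 > avg_num
        if current and (over_limit or past_average):
            groups.append(current)
            current, current_sum = [], 0
        current.append(text)
        current_sum += num
    groups.append(current)
    return groups


print(avg_split_sketch(["a", "b", "c", "d"], [4, 4, 4, 4], max_token_num=10))
# [['a', 'b'], ['c', 'd']]
```

A single string whose token count alone exceeds max_token_num still ends up in its own group in this sketch, so the limit is only guaranteed when every individual item fits.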

data_juicer.utils.common_utils.is_float(s)

data_juicer.utils.common_utils.check_op_method_param(method, param_name)

Check whether the given method has a parameter named param_name, or accepts arbitrary keyword arguments through a ** parameter (for example, **kwargs).
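
Such a check can be written with the standard inspect module, as in the sketch below. The name check_param_sketch and the sample functions are hypothetical; only inspect.signature and inspect.Parameter.VAR_KEYWORD are real standard-library APIs.

```python
import inspect


def check_param_sketch(method, param_name):
    """Hypothetical helper: can `method` receive `param_name` as a keyword?"""
    parameters = inspect.signature(method).parameters
    if param_name in parameters:
        return True
    # A ** parameter (VAR_KEYWORD) accepts any keyword argument.
    return any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in parameters.values()
    )


def fixed(sample, rank=None):  # hypothetical example function
    return sample


def flexible(sample, **kwargs):  # hypothetical example function
    return sample


print(check_param_sketch(fixed, "rank"))        # True
print(check_param_sketch(fixed, "context"))     # False
print(check_param_sketch(flexible, "context"))  # True
```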