data_juicer.utils.common_utils module

data_juicer.utils.common_utils.stats_to_number(s, reverse=True)[source]

Convert a stats value, which can be a string or a list, to a float.
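The conversion can be pictured with a minimal sketch. It is not the library's implementation: the name `stats_to_number_sketch` is hypothetical, and both the reduction of a list to its mean and the use of `reverse` to pick a fallback sentinel are assumptions made for illustration.

```python
import sys


def stats_to_number_sketch(s, reverse=True):
    """Hypothetical stand-in: turn a string, number, or list into one number."""
    try:
        if isinstance(s, str):
            return float(s)
        if isinstance(s, (int, float)):
            return float(s)
        if s is None or len(s) == 0:
            raise ValueError("empty value")
        # Assumption: a list of numbers is summarized by its mean.
        return float(sum(s) / len(s))
    except (TypeError, ValueError):
        # Assumption: `reverse` chooses which extreme sentinel to return
        # when the value cannot be converted.
        return -sys.maxsize if reverse else sys.maxsize


print(stats_to_number_sketch("0.5"))
print(stats_to_number_sketch([1, 2, 3]))
```

Returning a sentinel instead of raising keeps unconvertible values sortable, at the cost of hiding malformed inputs.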

data_juicer.utils.common_utils.dict_to_hash(input_dict: dict, hash_length=None)[source]

Hash a dict to a string of length hash_length.

Parameters:

input_dict -- the given dict
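One way to obtain such a hash is sketched below. The helper name `dict_to_hash_sketch`, the choice of SHA-256, and the sorting of items before hashing are assumptions for illustration, not a statement of what the library does.

```python
import hashlib


def dict_to_hash_sketch(input_dict, hash_length=None):
    """Hypothetical stand-in: derive a hex digest string from a dict."""
    # Sort the items so that insertion order does not affect the digest.
    sorted_items = sorted(input_dict.items())
    digest = hashlib.sha256(str(sorted_items).encode()).hexdigest()
    # Assumption: a given hash_length truncates the hex digest.
    if hash_length:
        digest = digest[:hash_length]
    return digest


h1 = dict_to_hash_sketch({"a": 1, "b": 2}, hash_length=8)
h2 = dict_to_hash_sketch({"b": 2, "a": 1}, hash_length=8)
print(h1 == h2, len(h1))
```

Sorting only works in this sketch when the keys are mutually comparable, for example all strings.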

data_juicer.utils.common_utils.nested_access(data, path, digit_allowed=True)[source]

Access nested data using a dot-separated path.

Parameters:
  • data -- A dictionary or a list to access the nested data from.

  • path -- A dot-separated string representing the path to access. This can include numeric indices when accessing list elements.

  • digit_allowed -- Whether to convert digit-only path segments from strings to integers (for example, to index into a list).

Returns:

The value located at the specified path. Raises a KeyError or IndexError if the path does not exist.
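The lookup described above can be sketched as follows. The name `nested_access_sketch` is hypothetical, and the code only illustrates the documented behavior; it is not taken from the library.

```python
def nested_access_sketch(data, path, digit_allowed=True):
    """Hypothetical stand-in: walk a dot-separated path through dicts and lists."""
    for key in path.split("."):
        # Digit-only segments become integers so they can index lists.
        if digit_allowed and key.isdigit():
            key = int(key)
        # Plain indexing lets KeyError / IndexError propagate to the caller.
        data = data[key]
    return data


sample = {"meta": {"tags": ["en", "zh"]}, "ids": {"1": "first"}}
print(nested_access_sketch(sample, "meta.tags.1"))
print(nested_access_sketch(sample, "ids.1", digit_allowed=False))
```

With `digit_allowed=False`, a segment such as `1` stays a string, which matters when a dict uses digit-like string keys.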

data_juicer.utils.common_utils.is_string_list(var)[source]

Return whether var is a list of strings.

Parameters:

var -- input variable
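A check of this kind is commonly written as shown below. The name `is_string_list_sketch` is hypothetical, and how the real function treats an empty list is not stated in this page; in this sketch an empty list counts as a list of strings.

```python
def is_string_list_sketch(var):
    """Hypothetical stand-in: True only for a list whose items are all str."""
    # all() over an empty list is True, so [] passes in this sketch.
    return isinstance(var, list) and all(isinstance(item, str) for item in var)


print(is_string_list_sketch(["a", "b"]))
print(is_string_list_sketch(["a", 1]))
print(is_string_list_sketch("ab"))
```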

data_juicer.utils.common_utils.avg_split_string_list_under_limit(str_list: list, token_nums: list, max_token_num=None)[source]

Split the string list into several sub-lists such that the total token count of each sub-list is less than max_token_num, while keeping the total token counts of the sub-lists similar.

Parameters:
  • str_list -- input string list.

  • token_nums -- token count of each string in str_list.

  • max_token_num -- maximum total token count of each sub-list.
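One possible greedy strategy for such a balanced split is sketched below. This is an illustration under stated assumptions, not the library's algorithm: the name `avg_split_sketch` is hypothetical, the limit is treated as inclusive, and a group is closed once it reaches the average group size.

```python
def avg_split_sketch(str_list, token_nums, max_token_num=None):
    """Hypothetical stand-in: greedy split that targets similar group sizes."""
    total = sum(token_nums)
    # Assumption: without a limit, or when everything fits, keep one group.
    if max_token_num is None or total <= max_token_num:
        return [list(str_list)]

    # Smallest number of groups that can hold the total, and the
    # per-group token count to aim for.
    group_num = total // max_token_num + 1
    target = total / group_num

    groups, current, current_sum = [], [], 0
    for text, num in zip(str_list, token_nums):
        # Close the current group if adding this string would exceed the limit.
        if current and current_sum + num > max_token_num:
            groups.append(current)
            current, current_sum = [], 0
        current.append(text)
        current_sum += num
        # Close the group once it reaches the target size.
        if current_sum >= target:
            groups.append(current)
            current, current_sum = [], 0
    if current:
        groups.append(current)
    return groups


parts = avg_split_sketch(["a", "b", "c", "d"], [3, 3, 3, 3], max_token_num=7)
print(parts)
```

A single string whose token count alone exceeds the limit still ends up in its own group in this sketch, since a string is never split.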

data_juicer.utils.common_utils.is_float(s)[source]

data_juicer.utils.common_utils.check_op_method_param(method, param_name)[source]

Check whether the given method has a parameter named param_name, or accepts arbitrary keyword arguments via a ** parameter.
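A check like this can be built on the standard `inspect` module. The sketch below uses the hypothetical name `check_param_sketch` and illustrates the described behavior; it is not copied from the library.

```python
import inspect


def check_param_sketch(method, param_name):
    """Hypothetical stand-in: does `method` accept `param_name` as a keyword?"""
    parameters = inspect.signature(method).parameters
    # A **kwargs-style parameter accepts any keyword name.
    has_var_keyword = any(
        p.kind == inspect.Parameter.VAR_KEYWORD for p in parameters.values()
    )
    return has_var_keyword or param_name in parameters


def process(sample, rank=None):
    return sample


def process_any(sample, **kwargs):
    return sample


print(check_param_sketch(process, "rank"))
print(check_param_sketch(process, "context"))
print(check_param_sketch(process_any, "context"))
```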