data_juicer.utils.common_utils module
- data_juicer.utils.common_utils.stats_to_number(s, reverse=True)
Convert a stats value, which can be a string or a list, to a float.
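The description above does not say how a list is reduced or what `reverse` controls. The following is a minimal standalone sketch, not the library's code. It assumes that a list is reduced to its mean and that `reverse` selects which extreme sentinel is returned for an unusable value. The name `stats_to_number_sketch` is hypothetical.

```python
import sys
from statistics import fmean


def stats_to_number_sketch(s, reverse=True):
    """Hypothetical stand-in: collapse a string, number, or list stat to a float."""
    try:
        if isinstance(s, str):
            return float(s)              # "0.5" -> 0.5
        if isinstance(s, (int, float)):
            return float(s)              # plain numbers pass through
        if s is None or len(s) == 0:
            raise ValueError("empty value")
        return float(fmean(s))           # assumption: a list is reduced to its mean
    except (TypeError, ValueError):
        # Assumption: unusable values map to an extreme sentinel,
        # very small when reverse=True and very large otherwise.
        return -sys.maxsize if reverse else sys.maxsize


print(stats_to_number_sketch("0.5"))      # 0.5
print(stats_to_number_sketch([1, 2, 3]))  # 2.0
```

Under these assumptions, an empty or missing value always sorts to one end of the range, and `reverse` decides which end.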
- data_juicer.utils.common_utils.dict_to_hash(input_dict: dict, hash_length=None)
Hash a dict to a string of length hash_length.
- Parameters:
input_dict -- the given dict
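One way such a hash could be computed is sketched below. This is an illustration under assumptions, not the library's implementation: it assumes SHA-256 over the sorted items of the dict, and that leaving `hash_length` unset returns the full hex digest. The name `dict_to_hash_sketch` is hypothetical.

```python
import hashlib


def dict_to_hash_sketch(input_dict: dict, hash_length=None) -> str:
    """Hypothetical stand-in: hash a dict to a hex string."""
    # Sort the items so that key insertion order does not change the digest.
    canonical = str(sorted(input_dict.items())).encode("utf-8")
    digest = hashlib.sha256(canonical).hexdigest()
    # Assumption: no hash_length means the full 64-character digest is kept.
    return digest[:hash_length] if hash_length else digest


short = dict_to_hash_sketch({"lang": "en", "min_len": 10}, hash_length=8)
print(len(short))  # 8
```

Sorting the items first makes the result independent of key order, which matters when the hash is used as a cache key for a configuration dict.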
- data_juicer.utils.common_utils.nested_access(data, path, digit_allowed=True)
Access nested data using a dot-separated path.
- Parameters:
data -- A dictionary or a list to access the nested data from.
path -- A dot-separated string representing the path to access. This can include numeric indices when accessing list elements.
digit_allowed -- Whether to convert numeric path segments from strings to integers.
- Returns:
The value located at the specified path. Raises a KeyError or IndexError if the path does not exist.
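The lookup described above can be sketched as follows. This is a standalone illustration, not the library's code. It assumes that a path segment made only of digits is turned into an integer index when `digit_allowed` is true. The name `nested_access_sketch` and the sample data are hypothetical.

```python
def nested_access_sketch(data, path, digit_allowed=True):
    """Hypothetical stand-in: walk nested dicts/lists along a dot-separated path."""
    for key in path.split("."):
        # Assumption: all-digit segments become integer indices when allowed.
        if digit_allowed and key.isdigit():
            key = int(key)
        data = data[key]  # raises KeyError or IndexError if the step is missing
    return data


sample = {"meta": {"tags": ["en", "web"]}, "0": "string key"}
print(nested_access_sketch(sample, "meta.tags.1"))             # web
print(nested_access_sketch(sample, "0", digit_allowed=False))  # string key
```

With `digit_allowed=False` every segment stays a string, which is what allows a dict key such as `"0"` to be reached.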
- data_juicer.utils.common_utils.is_string_list(var)
Return whether var is a list of strings.
- Parameters:
var -- the input variable
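A check of this kind can be written in one line. The sketch below is a standalone illustration, not the library's code, and the name `is_string_list_sketch` is hypothetical. Note that in this sketch an empty list counts as a list of strings.

```python
def is_string_list_sketch(var) -> bool:
    """Hypothetical stand-in: True only for a list whose items are all str."""
    # all() over an empty list is True, so [] passes this check.
    return isinstance(var, list) and all(isinstance(item, str) for item in var)


print(is_string_list_sketch(["a", "b"]))  # True
print(is_string_list_sketch(["a", 1]))    # False
print(is_string_list_sketch("ab"))        # False
```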
- data_juicer.utils.common_utils.avg_split_string_list_under_limit(str_list: list, token_nums: list, max_token_num=None)
Split the string list into several sub-lists such that the total token count of each sub-list is less than max_token_num, while keeping the total token counts of the sub-lists similar.
- Parameters:
str_list -- input string list.
token_nums -- token count of each string in str_list.
max_token_num -- max token num of each sub string list.
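One possible greedy strategy for such a split is sketched below. This is an illustration under assumptions, not the library's algorithm: it assumes that order is preserved, that the number of groups is the smallest that can fit under the limit, and that a group is closed once it reaches the average token count per group. The name `avg_split_sketch` is hypothetical. In this sketch a group may total exactly `max_token_num`, and a single string that exceeds the limit still ends up alone in its own group.

```python
import math


def avg_split_sketch(str_list, token_nums, max_token_num=None):
    """Hypothetical stand-in: split strings into token-balanced groups."""
    if len(str_list) != len(token_nums):
        raise ValueError("str_list and token_nums must have the same length")
    total = sum(token_nums)
    # No limit, or everything already fits: keep a single group.
    if max_token_num is None or total <= max_token_num:
        return [list(str_list)]

    group_num = math.ceil(total / max_token_num)  # fewest groups that can fit
    target = total / group_num                    # average tokens per group

    groups, current, current_sum = [], [], 0
    for text, num in zip(str_list, token_nums):
        # Close the current group if adding this string would exceed the limit.
        if current and current_sum + num > max_token_num:
            groups.append(current)
            current, current_sum = [], 0
        current.append(text)
        current_sum += num
        # Close the group once it reaches the average, to keep groups balanced.
        if current_sum >= target:
            groups.append(current)
            current, current_sum = [], 0
    if current:
        groups.append(current)
    return groups


print(avg_split_sketch(["a", "b", "c", "d"], [3, 3, 3, 3], max_token_num=7))
# [['a', 'b'], ['c', 'd']]
```

Closing a group at the average rather than filling it to the limit is what keeps the groups similar in size; filling greedily to the limit would tend to leave a small final group.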