data_juicer.utils.common_utils module
- data_juicer.utils.common_utils.stats_to_number(s, reverse=True)
Convert a stats value, which can be a string or a list, to a float.
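As a rough illustration of this conversion, the following standalone sketch does not call data_juicer. The name `stats_to_number_sketch` is hypothetical, and two behaviors are assumptions that the docstring does not state: a list is reduced to its mean, and `reverse` selects which extreme sentinel is returned when the value cannot be converted.

```python
import sys
from statistics import fmean


def stats_to_number_sketch(s, reverse=True):
    """Hypothetical stand-in: turn a string or list stats value into a number."""
    try:
        if isinstance(s, str):
            return float(s)
        if isinstance(s, (int, float)):
            return float(s)
        if s is None or len(s) == 0:
            raise ValueError("empty value")
        # Assumption: a list of numbers is summarized by its mean.
        return float(fmean(s))
    except (TypeError, ValueError):
        # Assumption: reverse=True maps failures to a very small value,
        # reverse=False to a very large one.
        return -sys.maxsize if reverse else sys.maxsize


print(stats_to_number_sketch("3.5"))
print(stats_to_number_sketch([1, 2, 3]))
```

A sentinel at one extreme keeps unparsable values at a predictable end when samples are sorted by the stat, which is one plausible reason for a flag like `reverse`.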
- data_juicer.utils.common_utils.dict_to_hash(input_dict: dict, hash_length=None)
Hash a dict to a string of length hash_length.
- Parameters:
input_dict – the given dict
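One way to obtain a stable hash string from a dict is sketched below. This is not taken from the library: the function name `dict_to_hash_sketch`, the choice of SHA-256, the serialization via sorted items, and the truncation behavior of `hash_length` are all assumptions made for illustration.

```python
import hashlib


def dict_to_hash_sketch(input_dict, hash_length=None):
    """Hypothetical stand-in: hash a dict to a hex string."""
    # Sort items so that key insertion order does not change the digest.
    # This assumes the keys are mutually comparable (for example, all strings).
    serialized = str(sorted(input_dict.items())).encode("utf-8")
    digest = hashlib.sha256(serialized).hexdigest()
    # Assumption: hash_length truncates the hex digest; None keeps all 64 chars.
    return digest[:hash_length] if hash_length else digest


print(dict_to_hash_sketch({"lang": "en", "min_len": 10}, hash_length=8))
```

Sorting before hashing means two dicts with the same contents produce the same string regardless of the order in which their keys were inserted.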
- data_juicer.utils.common_utils.nested_access(data, path, digit_allowed=True)
Access nested data using a dot-separated path.
- Parameters:
data – A dictionary or a list to access the nested data from.
path – A dot-separated string representing the path to access. This can include numeric indices when accessing list elements.
digit_allowed – Whether to convert numeric path segments from strings to integers.
- Returns:
The value located at the specified path. A KeyError or IndexError is raised if the path does not exist.
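The dot-separated lookup can be sketched as follows. The function `nested_access_sketch` and the sample `record` are hypothetical, and the sketch assumes that every all-digit segment is converted to an integer whenever `digit_allowed` is true, whatever the type of the container being indexed.

```python
def nested_access_sketch(data, path, digit_allowed=True):
    """Hypothetical stand-in: walk nested dicts and lists along a dotted path."""
    for key in path.split("."):
        # Assumption: all-digit segments become integer indices when allowed.
        if digit_allowed and key.isdigit():
            key = int(key)
        data = data[key]  # raises KeyError or IndexError on a bad path
    return data


record = {"meta": {"tags": ["en", "web"]}, "stats": {"1": "first"}}

print(nested_access_sketch(record, "meta.tags.1"))
print(nested_access_sketch(record, "stats.1", digit_allowed=False))
```

Under this assumption, `digit_allowed=False` is what makes it possible to reach dict keys that happen to be numeric strings, such as `"1"` in the sample record.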
- data_juicer.utils.common_utils.is_string_list(var)
Return whether var is a list of strings.
- Parameters:
var – input variable
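A check of this kind is typically a one-liner. The sketch below uses a hypothetical name, `is_string_list_sketch`, and assumes that an empty list counts as a list of strings, which the docstring does not specify.

```python
def is_string_list_sketch(var):
    """Hypothetical stand-in: True only for a list whose items are all str."""
    # Assumption: an empty list passes, because all() of nothing is True.
    return isinstance(var, list) and all(isinstance(item, str) for item in var)


print(is_string_list_sketch(["a", "b"]))
print(is_string_list_sketch(["a", 1]))
print(is_string_list_sketch("ab"))
```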
- data_juicer.utils.common_utils.avg_split_string_list_under_limit(str_list: list, token_nums: list, max_token_num=None)
Split the string list into several sub-lists such that the total token num of each sub-list is less than max_token_num, while keeping the total token nums of the sub-lists similar.
- Parameters:
str_list – input string list.
token_nums – token num of each string in the list.
max_token_num – max token num of each sub string list.
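A balanced split under a token limit can be approximated with a greedy pass, sketched below. This is an illustrative heuristic and not necessarily the library's algorithm. The name `split_under_limit_sketch` is hypothetical, and the sketch assumes that groups are contiguous, that `max_token_num` is a positive integer, that `max_token_num=None` means no splitting, and that mismatched input lengths fall back to a single group. It keeps each group at or below the limit unless a single string already exceeds it.

```python
def split_under_limit_sketch(str_list, token_nums, max_token_num=None):
    """Hypothetical stand-in: split strings into contiguous, similar-sized groups."""
    # Assumption: no limit, or mismatched inputs, means a single group.
    if max_token_num is None or len(str_list) != len(token_nums):
        return [str_list]
    total = sum(token_nums)
    if total <= max_token_num:
        return [str_list]

    # Fewest groups that can fit under the limit, and the per-group target.
    group_count = total // max_token_num + 1
    target = total / group_count

    groups, current, current_sum = [], [], 0
    for text, num in zip(str_list, token_nums):
        # Close the current group if adding this string would exceed the limit.
        if current and current_sum + num > max_token_num:
            groups.append(current)
            current, current_sum = [], 0
        current.append(text)
        current_sum += num
        # Close the group once it reaches the average target size.
        if current_sum >= target:
            groups.append(current)
            current, current_sum = [], 0
    if current:
        groups.append(current)
    return groups


parts = split_under_limit_sketch(["a", "b", "c", "d"], [3, 3, 3, 3], max_token_num=7)
print(parts)
```

Closing a group at the average target, instead of filling it up to the limit, is what keeps the groups similar in size. A pure fill-to-limit pass would tend to leave a small remainder in the last group.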