data_juicer.utils.nltk_utils module
Utilities for working with NLTK in Data-Juicer.
This module provides utility functions for handling NLTK-specific operations, including pickle security patches and data downloading.
- data_juicer.utils.nltk_utils.ensure_nltk_resource(resource_path, fallback_package=None)
Ensure a specific NLTK resource is available and accessible.
This function attempts to find and load a resource, and if it fails, downloads the specified fallback package.
- Parameters:
resource_path – The path to the resource to check
fallback_package – The package to download if the resource isn’t found
- Returns:
True if the resource is available, False otherwise
- Return type:
bool
- data_juicer.utils.nltk_utils.clean_nltk_cache(packages=None, complete_reset=False)
Clean NLTK model cache.
- Parameters:
packages (list, optional) – List of package names to clean. If None, cleans all package caches.
complete_reset (bool, optional) – If True, deletes all NLTK data. Default is False.
- data_juicer.utils.nltk_utils.patch_nltk_pickle_security()
Patch NLTK’s pickle security restrictions to allow loading models.
NLTK 3.9+ introduced strict pickle security that prevents loading some models. This function patches NLTK to bypass those restrictions while maintaining security.
This should be called once during initialization before any NLTK functions are used.
- data_juicer.utils.nltk_utils.create_physical_resource_alias(source_path, alias_path)
Create a physical file alias for NLTK resources.
This function creates a hard link, symlink, or copy of a source resource to a target alias path. This is useful for problematic resources that might be requested with a path that doesn’t match NLTK’s structure.
- Parameters:
source_path – The full path to the source file
alias_path – The full path where the alias should be created
- Returns:
True if the alias was created successfully, False otherwise
- Return type:
bool
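The hard link, symlink, or copy fallback described for create_physical_resource_alias can be sketched with the standard library alone. This is an illustrative approximation under stated assumptions, not the module's actual code: `make_alias_sketch` is a hypothetical name, the source is assumed to be a regular file, and the order of attempts (hard link, then symlink, then copy) is an assumption.

```python
import os
import shutil
import tempfile


def make_alias_sketch(source_path: str, alias_path: str) -> bool:
    """Hypothetical sketch: try a hard link, then a symlink, then a copy."""
    source_path = os.path.abspath(source_path)
    alias_path = os.path.abspath(alias_path)

    # Assumption: only regular files are aliased.
    if not os.path.isfile(source_path):
        return False

    os.makedirs(os.path.dirname(alias_path), exist_ok=True)

    # Remove a stale alias so that link creation does not fail.
    if os.path.lexists(alias_path):
        try:
            os.remove(alias_path)
        except OSError:
            return False

    # Each strategy takes (source, destination); fall through on failure.
    for make in (os.link, os.symlink, shutil.copy2):
        try:
            make(source_path, alias_path)
            return True
        except OSError:
            continue
    return False


with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "english.pickle")
    with open(source, "w", encoding="utf-8") as fh:
        fh.write("demo")

    alias = os.path.join(tmp, "alias", "english.pickle")
    created = make_alias_sketch(source, alias)
    with open(alias, encoding="utf-8") as fh:
        alias_content = fh.read()

    # A missing source is reported as a failure rather than raising.
    missing = make_alias_sketch(os.path.join(tmp, "nope"), alias)

print(created, alias_content, missing)
```

Trying the hard link first avoids duplicating data on disk. The copy is the last resort for filesystems or platforms where neither kind of link is permitted.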