data_juicer.utils.nltk_utils module

Utilities for working with NLTK in Data-Juicer.

This module provides utility functions for handling NLTK-specific operations, including pickle security patches and data downloading.

data_juicer.utils.nltk_utils.ensure_nltk_resource(resource_path, fallback_package=None)[source]

Ensure a specific NLTK resource is available and accessible.

This function first tries to find and load the resource. If that fails and a fallback package is given, it downloads that package.

Parameters:
  • resource_path – The path to the resource to check

  • fallback_package – The package to download if the resource isn’t found

Returns:

True if the resource is available, False otherwise

Return type:

bool
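
The find-then-download behavior described above can be sketched as follows. This is a minimal illustration, not the module's actual implementation: the name ensure_resource_sketch and the injectable find and download parameters are hypothetical, added so the logic can be exercised without network access. When they are omitted, the sketch assumes NLTK is installed and falls back to nltk.data.find (which raises LookupError for a missing resource) and nltk.download.

```python
def ensure_resource_sketch(resource_path, fallback_package=None,
                           find=None, download=None):
    """Return True if resource_path can be found, downloading a fallback if needed.

    Hypothetical helper: `find` and `download` are not part of the documented API.
    """
    if find is None or download is None:
        # Assumption: NLTK is installed when no substitutes are injected.
        import nltk
        find = find or nltk.data.find
        download = download or nltk.download

    try:
        find(resource_path)  # raises LookupError when the resource is missing
        return True
    except LookupError:
        pass

    if fallback_package is None:
        return False  # nothing to download, so the resource stays unavailable

    download(fallback_package)

    # Check again, because a finished download does not guarantee that
    # the requested path now resolves.
    try:
        find(resource_path)
        return True
    except LookupError:
        return False
```

With NLTK installed, a call such as ensure_resource_sketch("tokenizers/punkt", "punkt") would exercise the same path against the real downloader.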

data_juicer.utils.nltk_utils.clean_nltk_cache(packages=None, complete_reset=False)[source]

Clean NLTK model cache.

Parameters:
  • packages (list, optional) – List of package names to clean. If None, cleans all package caches.

  • complete_reset (bool, optional) – If True, deletes all NLTK data. Default is False.
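
To make the difference between the two parameters concrete, here is a rough sketch of one possible interpretation. It is not the module's code. It assumes the usual NLTK data layout of <data_dir>/<category>/<package> (with optional .zip archives next to unpacked directories), and the name clean_cache_sketch and its explicit data_dir argument are hypothetical.

```python
import shutil
from pathlib import Path


def _remove(path):
    """Delete a file, symlink, or directory tree."""
    if path.is_dir() and not path.is_symlink():
        shutil.rmtree(path)
    else:
        path.unlink()


def clean_cache_sketch(data_dir, packages=None, complete_reset=False):
    """Remove cached packages under data_dir and return the removed paths."""
    root = Path(data_dir)
    removed = []
    if not root.is_dir():
        return removed

    if complete_reset:
        # Assumed meaning of complete_reset: wipe everything in the data directory.
        for child in sorted(root.iterdir()):
            _remove(child)
            removed.append(str(child))
        return removed

    wanted = None if packages is None else set(packages)
    for category in sorted(root.iterdir()):
        if not category.is_dir():
            continue
        for entry in sorted(category.iterdir()):
            # Treat "punkt.zip" and "punkt" as the same package name.
            name = entry.name[:-4] if entry.name.endswith(".zip") else entry.name
            if wanted is None or name in wanted:
                _remove(entry)
                removed.append(str(entry))
    return removed
```

Under these assumptions, packages=None clears every package but leaves the category directories in place, while complete_reset=True empties the data directory entirely.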

data_juicer.utils.nltk_utils.patch_nltk_pickle_security()[source]

Patch NLTK’s pickle security restrictions to allow loading models.

NLTK 3.9+ introduced strict pickle security that prevents loading some models. This function patches NLTK to bypass those restrictions while maintaining security.

Call it once during initialization, before any other NLTK function is used.

data_juicer.utils.nltk_utils.create_physical_resource_alias(source_path, alias_path)[source]

Create a physical file alias for NLTK resources.

This function creates a hard link, symlink, or copy of a source resource to a target alias path. This is useful for problematic resources that might be requested with a path that doesn’t match NLTK’s structure.

Parameters:
  • source_path – The full path to the source file

  • alias_path – The full path where the alias should be created

Returns:

True if the alias was created successfully, False otherwise

Return type:

bool
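
The "hard link, symlink, or copy" fallback chain can be sketched with the standard library alone. This is an illustrative sketch rather than the module's implementation: the name create_alias_sketch is hypothetical, and both the order of attempts and the handling of an already existing alias are assumptions.

```python
import os
import shutil


def create_alias_sketch(source_path, alias_path):
    """Make alias_path point at source_path; return True on success."""
    if not os.path.isfile(source_path):
        return False
    if os.path.exists(alias_path):
        return True  # assumption: an existing alias is left untouched

    source_path = os.path.abspath(source_path)  # keeps a symlink target valid
    os.makedirs(os.path.dirname(os.path.abspath(alias_path)), exist_ok=True)

    # Try the cheapest option first, then fall back to a full copy.
    for make_alias in (os.link, os.symlink, shutil.copy2):
        try:
            make_alias(source_path, alias_path)
            return True
        except (OSError, NotImplementedError):
            continue  # e.g. cross-device link or missing symlink privilege
    return False
```

Hard links cost no extra disk space but only work within one filesystem, and symlinks may need extra privileges on some platforms, so a plain copy serves as the last resort.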

data_juicer.utils.nltk_utils.setup_resource_aliases()[source]

Create physical file aliases for common problematic NLTK resources.

This function creates aliases/copies of resources that have known problematic paths to ensure they can be found regardless of how they’re requested.