data_juicer.ops.common package

Submodules

data_juicer.ops.common.helper_func module

class data_juicer.ops.common.helper_func.UnionFind[source]

Bases: object

__init__()[source]

Initialization method.

find(x)[source]
union(x, y)[source]
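
The page lists only the method names, so the following is a minimal sketch of how a union-find (disjoint-set) structure with `find` and `union` typically behaves. It is not the library's code; `UnionFindSketch` and its `parent` dictionary are hypothetical names used for illustration only.

```python
class UnionFindSketch:
    """Illustrative disjoint-set structure (hypothetical, not the library class)."""

    def __init__(self):
        # Maps each item to its parent; a root is its own parent.
        self.parent = {}

    def find(self, x):
        # Register unseen items as their own root.
        if x not in self.parent:
            self.parent[x] = x
            return x
        # Walk up to the root of the set containing x.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every visited node directly at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, x, y):
        # Merge the two sets by attaching one root under the other.
        root_x, root_y = self.find(x), self.find(y)
        if root_x != root_y:
            self.parent[root_y] = root_x


uf = UnionFindSketch()
uf.union("a", "b")
uf.union("b", "c")
same_group = uf.find("a") == uf.find("c")   # "a" and "c" were linked through "b"
separate = uf.find("a") != uf.find("d")     # "d" was never merged with anything
```

Structures of this kind are commonly used to group items that are linked pairwise, for example near-duplicate documents.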
data_juicer.ops.common.helper_func.strip(document, strip_characters)[source]

Much faster than document.strip(strip_characters) because strip_characters is a set rather than a str and contains many elements (all the emojis).

Parameters:
  • document – document to be processed

  • strip_characters – characters used for stripping document

Returns:

stripped document
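
As a rough illustration of the idea described above, the sketch below trims both ends of a string using set membership tests. It is an assumption about the approach, not the library's implementation, and `strip_sketch` is a hypothetical name.

```python
def strip_sketch(document, strip_characters):
    """Trim leading and trailing characters found in the set `strip_characters`."""
    start, end = 0, len(document)
    # Advance past leading characters that belong to the set.
    while start < end and document[start] in strip_characters:
        start += 1
    # Step back over trailing characters that belong to the set.
    while end > start and document[end - 1] in strip_characters:
        end -= 1
    return document[start:end]


# A small stand-in for a large set of emojis and symbols.
emoji_and_symbols = {" ", "!", "\U0001F600"}
result = strip_sketch(" \U0001F600hello world!! ", emoji_and_symbols)
```

Each membership test against a set takes constant time on average, which is what makes this approach attractive when the set of characters is large.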

data_juicer.ops.common.helper_func.split_on_whitespace(document, new_line=False, tab=False)[source]

Split a document on whitespace. This method also removes consecutive (concatenated) spaces.

Parameters:
  • document – document to be split

  • new_line – whether to split document with ‘\n’

  • tab – whether to split document with ‘\t’

Returns:

word list obtained after splitting document
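
The behaviour described by these parameters can be sketched with a regular expression. This is an illustrative assumption rather than the library's code; `split_on_whitespace_sketch` is a hypothetical name.

```python
import re


def split_on_whitespace_sketch(document, new_line=False, tab=False):
    """Split on spaces, and optionally on newlines and tabs, dropping empty words."""
    separators = [" "]
    if new_line:
        separators.append("\n")
    if tab:
        separators.append("\t")
    # A character class followed by "+" treats a run of separators as one split point.
    pattern = "[" + "".join(separators) + "]+"
    return [word for word in re.split(pattern, document) if word]


spaces_only = split_on_whitespace_sketch("a  b\tc\nd")
all_separators = split_on_whitespace_sketch("a  b\tc\nd", new_line=True, tab=True)
```

With the defaults only spaces act as separators, so the tab and the newline stay inside the second word.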

data_juicer.ops.common.helper_func.split_on_newline_tab_whitespace(document)[source]

This method is used to split the document into different levels of sub-sentences.

First split on “\n”, then on “\t”, then on “ ”.

Parameters:

document – document to be split

Returns:

sentence list obtained after splitting document
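
A minimal sketch of the three-level split, assuming the result is a nested list (lines, then tab-separated parts, then words). The nesting and the name `split_levels_sketch` are assumptions made for illustration.

```python
def split_levels_sketch(document):
    """Split on newlines, then on tabs, then on spaces, keeping the nesting."""
    return [
        [[word for word in part.split(" ") if word] for part in line.split("\t")]
        for line in document.split("\n")
    ]


nested = split_levels_sketch("a b\tc\nd  e")
```

Here `nested` holds two lines: the first has two tab-separated parts, and the second has a single part whose double space has been collapsed.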

data_juicer.ops.common.helper_func.merge_on_whitespace_tab_newline(sentences)[source]

This method is used to merge different levels of sub-sentences into one document. It is the inverse of split_on_newline_tab_whitespace. Removes concatenated separators.

Parameters:

sentences – sentence list to be merged

Returns:

document obtained after merging sub-sentences
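
The merge direction can be sketched as follows, assuming the input is a nested list of lines, tab-separated parts, and words. `merge_levels_sketch` is a hypothetical name and the code is not taken from the library.

```python
def merge_levels_sketch(sentences):
    """Join words with spaces, parts with tabs, and lines with newlines."""
    lines = []
    for sentence in sentences:
        # Skip empty parts so that no doubled tab appears in the output.
        parts = [" ".join(sub) for sub in sentence if sub]
        if parts:
            # Skip empty lines so that no doubled newline appears either.
            lines.append("\t".join(parts))
    return "\n".join(lines)


nested = [[["a", "b"], ["c"]], [[]], [["d", "e"]]]
document = merge_levels_sketch(nested)
```

The empty middle entry is dropped, which is one way to read "removes concatenated separators".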

data_juicer.ops.common.helper_func.words_augmentation(words, group_size, join_char)[source]

Augment words, especially for Chinese (without a space between words) and Vietnamese (with a space between syllables).

Parameters:
  • words – word list to be augmented

  • group_size – the size of word groups that need to be merged

  • join_char – characters to be added between the words of each group

Returns:

word list after augmentation
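
One plausible reading of "word groups" is a sliding window over adjacent words, similar to n-grams. The sketch below follows that assumption; `words_augmentation_sketch` is a hypothetical name and the windowing strategy is not confirmed by this page.

```python
def words_augmentation_sketch(words, group_size, join_char):
    """Join every run of `group_size` adjacent words with `join_char`."""
    return [
        join_char.join(words[i:i + group_size])
        for i in range(len(words) - group_size + 1)
    ]


# Chinese words are joined without a separator.
zh_groups = words_augmentation_sketch(["我", "喜欢", "苹果"], 2, "")
# Vietnamese syllables are joined with a space.
vi_groups = words_augmentation_sketch(["xin", "chào", "bạn"], 2, " ")
```

Grouping adjacent tokens this way lets multi-token words be matched against word lists even when the tokenizer splits them apart.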

data_juicer.ops.common.helper_func.get_words_from_document(document, token_func=None, new_line=True, tab=True)[source]

Get words from a document. Useful to compute ratios, like the stopwords ratio.

Parameters:
  • document – document to be split into words.

  • token_func – tokenizer function; if specified, it is used to split the document into tokens.

  • new_line – whether to use ‘\n’ to split words.

  • tab – whether to use ‘\t’ to split words.

Returns:

word list obtained from document
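
To illustrate the stopwords-ratio use case mentioned above, here is a small sketch. It assumes that whitespace splitting is the fallback when no tokenizer is given; `get_words_sketch` and the stopword set are hypothetical.

```python
import re


def get_words_sketch(document, token_func=None, new_line=True, tab=True):
    """Use `token_func` when given, otherwise split on whitespace."""
    if token_func is not None:
        return token_func(document)
    separators = " " + ("\n" if new_line else "") + ("\t" if tab else "")
    return [w for w in re.split("[" + separators + "]+", document) if w]


words = get_words_sketch("the cat\tsat on\nthe mat")
stopwords = {"the", "on"}  # tiny illustrative stopword list
ratio = sum(w in stopwords for w in words) / len(words)

# A custom tokenizer replaces whitespace splitting entirely.
custom = get_words_sketch("a-b-c", token_func=lambda d: d.split("-"))
```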

data_juicer.ops.common.helper_func.words_refinement(words, lower_case=False, strip_chars=None, use_words_aug=False, words_aug_group_sizes=[2], words_aug_join_char='')[source]

Refine split words. This is not reversible, since the document is split on multiple characters, words are stripped of special characters, and characters are converted to lower case.

Parameters:
  • words – the word list to be refined

  • lower_case – whether to convert words to lowercase

  • strip_chars – characters to be stripped from each word

  • use_words_aug – whether to use word augmentation

  • words_aug_group_sizes – the sizes of word groups that need to be merged

  • words_aug_join_char – characters to be added between the words of each group

Returns:

refined words or word list
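
The parameters above suggest a three-stage pipeline: lowercase, strip, then augment. The sketch below assumes that order and assumes the augmented groups are appended to the refined words; neither detail is confirmed by this page, and `words_refinement_sketch` is a hypothetical name.

```python
def words_refinement_sketch(words, lower_case=False, strip_chars=None,
                            use_words_aug=False, words_aug_group_sizes=(2,),
                            words_aug_join_char=""):
    """Lowercase, strip, and optionally augment a word list (illustrative only)."""
    if lower_case:
        words = [w.lower() for w in words]
    if strip_chars:
        chars = "".join(strip_chars)
        words = [w.strip(chars) for w in words]
        # Words made only of stripped characters become empty and are dropped.
        words = [w for w in words if w]
    if use_words_aug:
        # Assumed behaviour: sliding-window groups are appended to the words.
        grouped = [
            words_aug_join_char.join(words[i:i + size])
            for size in words_aug_group_sizes
            for i in range(len(words) - size + 1)
        ]
        words = words + grouped
    return words


refined = words_refinement_sketch(
    ["Hello,", "World!", "..."],
    lower_case=True,
    strip_chars={",", "!", "."},
    use_words_aug=True,
    words_aug_join_char=" ",
)
```

The original casing and punctuation cannot be recovered from `refined`, which is why the refinement is described as not reversible.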

data_juicer.ops.common.helper_func.get_sentences_from_document(document, model_func=None)[source]

Get sentences from a document.

Parameters:
  • document – document to be split into sentences

  • model_func – sentence model function; if specified, it is used to split the document into sentences.

Returns:

document with the sentences separated by ‘\n’
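
A small sketch of how a sentence model function might be plugged in. The fallback to line splitting when no model is given is an assumption, and both `get_sentences_sketch` and `naive_sentence_model` are hypothetical stand-ins for a real sentence tokenizer.

```python
import re


def get_sentences_sketch(document, model_func=None):
    """Return the document with one sentence per line."""
    if model_func is not None:
        sentences = model_func(document)
    else:
        # Assumed fallback: treat existing lines as sentences.
        sentences = document.splitlines()
    return "\n".join(sentences)


def naive_sentence_model(text):
    """Hypothetical splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


result = get_sentences_sketch("It rains. Stay in! Why not?", naive_sentence_model)
```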

data_juicer.ops.common.helper_func.split_text_by_punctuation(text)[source]

Split text on any Chinese (zh) or English (en) punctuation.

Parameters:

text – text to be split.

Returns:

sub-texts obtained by splitting on any Chinese or English punctuation
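
A minimal sketch of punctuation-based splitting. The Unicode ranges chosen for Chinese punctuation (U+3000 to U+303F and U+FF00 to U+FFEF) are an assumption, and `split_by_punctuation_sketch` is a hypothetical name.

```python
import re
import string

# ASCII punctuation plus CJK symbols/punctuation and full-width forms (assumed ranges).
PUNCTUATION = re.compile(
    "[" + re.escape(string.punctuation) + "\u3000-\u303f\uff00-\uffef]"
)


def split_by_punctuation_sketch(text):
    """Split on punctuation and drop empty or whitespace-only pieces."""
    return [part.strip() for part in PUNCTUATION.split(text) if part.strip()]


parts = split_by_punctuation_sketch("你好，世界！Hello, world. Bye")
```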

data_juicer.ops.common.prompt2prompt_pipeline module

data_juicer.ops.common.prompt2prompt_pipeline.rescale_noise_cfg(noise_cfg, noise_pred_text, guidance_rescale=0.0)[source]

Rescale noise_cfg according to guidance_rescale. Based on findings of [Common Diffusion Noise Schedules and Sample Steps are Flawed](https://arxiv.org/pdf/2305.08891.pdf).

See Section 3.4
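
The rescaling in the cited section matches the standard deviation of the guided prediction to that of the text-conditioned prediction, then blends the result with the unscaled value. The sketch below shows that arithmetic on flat Python lists for a single sample; real pipelines operate on batched tensors, and `rescale_noise_cfg_sketch` is a hypothetical name rather than the function documented here.

```python
import statistics


def rescale_noise_cfg_sketch(noise_cfg, noise_pred_text, guidance_rescale=0.0):
    """Blend the guided noise with a copy rescaled to the text prediction's std."""
    std_text = statistics.stdev(noise_pred_text)
    std_cfg = statistics.stdev(noise_cfg)
    scale = std_text / std_cfg
    return [
        guidance_rescale * (value * scale) + (1.0 - guidance_rescale) * value
        for value in noise_cfg
    ]


noise_pred_text = [0.5, -0.5, 1.0, -1.0]
# Pretend classifier-free guidance inflated the prediction threefold.
noise_cfg = [value * 3.0 for value in noise_pred_text]

unchanged = rescale_noise_cfg_sketch(noise_cfg, noise_pred_text, 0.0)
rescaled = rescale_noise_cfg_sketch(noise_cfg, noise_pred_text, 1.0)
```

With `guidance_rescale=0.0` the input is returned as is; with `1.0` the output has the same standard deviation as the text-conditioned prediction.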

class data_juicer.ops.common.prompt2prompt_pipeline.Prompt2PromptPipeline(vae: AutoencoderKL, text_encoder: CLIPTextModel, text_encoder_2: CLIPTextModelWithProjection, tokenizer: CLIPTokenizer, tokenizer_2: CLIPTokenizer, unet: UNet2DConditionModel, scheduler: KarrasDiffusionSchedulers, image_encoder: CLIPVisionModelWithProjection | None = None, feature_extractor: CLIPImageProcessor | None = None, force_zeros_for_empty_prompt: bool = True, add_watermarker: bool | None = None)[source]

Bases: StableDiffusionXLPipeline

Prompt-to-Prompt pipeline for text-to-image generation using Stable Diffusion. This model inherits from [StableDiffusionXLPipeline]. Check the superclass documentation for the generic methods the library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.).

Parameters:
  • vae ([AutoencoderKL]) – Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations.

  • text_encoder ([CLIPTextModel]) – Frozen text encoder. Stable Diffusion uses the text portion of [CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPTextModel), specifically the [clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) variant.

  • tokenizer (CLIPTokenizer) – Tokenizer of class [CLIPTokenizer](https://huggingface.co/docs/transformers/v4.21.0/en/model_doc/clip#transformers.CLIPTokenizer).

  • unet ([UNet2DConditionModel]) – Conditional U-Net architecture to denoise the encoded image latents.

  • scheduler ([SchedulerMixin]) – A scheduler to be used in combination with unet to denoise the encoded image latents. Can be one of [DDIMScheduler], [LMSDiscreteScheduler], or [PNDMScheduler].

  • safety_checker ([StableDiffusionSafetyChecker]) – Classification module that estimates whether generated images could be considered offensive or harmful. Please refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for details.

  • feature_extractor ([CLIPFeatureExtractor]) – Model that extracts features from generated images to be used as inputs for the safety_checker.

check_inputs(prompt, prompt_2, height, width, callback_steps, negative_prompt=None, negative_prompt_2=None, prompt_embeds=None, negative_prompt_embeds=None, pooled_prompt_embeds=None, negative_pooled_prompt_embeds=None)[source]
register_attention_control(controller)[source]
class data_juicer.ops.common.prompt2prompt_pipeline.P2PCrossAttnProcessor(controller, place_in_unet)[source]

Bases: object

__init__(controller, place_in_unet)[source]
class data_juicer.ops.common.prompt2prompt_pipeline.AttentionControl(attn_res=None)[source]

Bases: ABC

step_callback(x_t)[source]
between_steps()[source]
property num_uncond_att_layers
abstract forward(attn, is_cross: bool, place_in_unet: str)[source]
reset()[source]
__init__(attn_res=None)[source]
data_juicer.ops.common.prompt2prompt_pipeline.create_controller(prompts: List[str], cross_attention_kwargs: Dict, num_inference_steps: int, tokenizer, device, attn_res) AttentionControl[source]
class data_juicer.ops.common.prompt2prompt_pipeline.EmptyControl(attn_res=None)[source]

Bases: AttentionControl

forward(attn, is_cross: bool, place_in_unet: str)[source]
class data_juicer.ops.common.prompt2prompt_pipeline.AttentionStore(attn_res=None)[source]

Bases: AttentionControl

static get_empty_store()[source]
forward(attn, is_cross: bool, place_in_unet: str)[source]
between_steps()[source]
get_average_attention()[source]
reset()[source]
__init__(attn_res=None)[source]
class data_juicer.ops.common.prompt2prompt_pipeline.LocalBlend(prompts: List[str], words: [List[List[str]]], tokenizer, device, threshold=0.3, attn_res=None)[source]

Bases: object

__init__(prompts: List[str], words: [List[List[str]]], tokenizer, device, threshold=0.3, attn_res=None)[source]
class data_juicer.ops.common.prompt2prompt_pipeline.AttentionControlEdit(prompts, num_steps: int, cross_replace_steps: float | Tuple[float, float] | Dict[str, Tuple[float, float]], self_replace_steps: float | Tuple[float, float], local_blend: LocalBlend | None, tokenizer, device, attn_res=None)[source]

Bases: AttentionStore, ABC

step_callback(x_t)[source]
replace_self_attention(attn_base, att_replace)[source]
abstract replace_cross_attention(attn_base, att_replace)[source]
forward(attn, is_cross: bool, place_in_unet: str)[source]
__init__(prompts, num_steps: int, cross_replace_steps: float | Tuple[float, float] | Dict[str, Tuple[float, float]], self_replace_steps: float | Tuple[float, float], local_blend: LocalBlend | None, tokenizer, device, attn_res=None)[source]
class data_juicer.ops.common.prompt2prompt_pipeline.AttentionReplace(prompts, num_steps: int, cross_replace_steps: float, self_replace_steps: float, local_blend: LocalBlend | None = None, tokenizer=None, device=None, attn_res=None)[source]

Bases: AttentionControlEdit

replace_cross_attention(attn_base, att_replace)[source]
__init__(prompts, num_steps: int, cross_replace_steps: float, self_replace_steps: float, local_blend: LocalBlend | None = None, tokenizer=None, device=None, attn_res=None)[source]
class data_juicer.ops.common.prompt2prompt_pipeline.AttentionRefine(prompts, num_steps: int, cross_replace_steps: float, self_replace_steps: float, local_blend: LocalBlend | None = None, tokenizer=None, device=None, attn_res=None)[source]

Bases: AttentionControlEdit

replace_cross_attention(attn_base, att_replace)[source]
__init__(prompts, num_steps: int, cross_replace_steps: float, self_replace_steps: float, local_blend: LocalBlend | None = None, tokenizer=None, device=None, attn_res=None)[source]
class data_juicer.ops.common.prompt2prompt_pipeline.AttentionReweight(prompts, num_steps: int, cross_replace_steps: float, self_replace_steps: float, equalizer, local_blend: LocalBlend | None = None, controller: AttentionControlEdit | None = None, tokenizer=None, device=None, attn_res=None)[source]

Bases: AttentionControlEdit

replace_cross_attention(attn_base, att_replace)[source]
__init__(prompts, num_steps: int, cross_replace_steps: float, self_replace_steps: float, equalizer, local_blend: LocalBlend | None = None, controller: AttentionControlEdit | None = None, tokenizer=None, device=None, attn_res=None)[source]
data_juicer.ops.common.prompt2prompt_pipeline.update_alpha_time_word(alpha, bounds: float | Tuple[float, float], prompt_ind: int, word_inds: Tensor | None = None)[source]
data_juicer.ops.common.prompt2prompt_pipeline.get_time_words_attention_alpha(prompts, num_steps, cross_replace_steps: float | Dict[str, Tuple[float, float]], tokenizer, max_num_words=77)[source]
data_juicer.ops.common.prompt2prompt_pipeline.get_word_inds(text: str, word_place: int, tokenizer)[source]
data_juicer.ops.common.prompt2prompt_pipeline.get_replacement_mapper_(x: str, y: str, tokenizer, max_len=77)[source]
data_juicer.ops.common.prompt2prompt_pipeline.get_replacement_mapper(prompts, tokenizer, max_len=77)[source]
data_juicer.ops.common.prompt2prompt_pipeline.get_equalizer(text: str, word_select: int | Tuple[int, ...], values: List[float] | Tuple[float, ...], tokenizer)[source]
class data_juicer.ops.common.prompt2prompt_pipeline.ScoreParams(gap, match, mismatch)[source]

Bases: object

__init__(gap, match, mismatch)[source]
mis_match_char(x, y)[source]
data_juicer.ops.common.prompt2prompt_pipeline.get_matrix(size_x, size_y, gap)[source]
data_juicer.ops.common.prompt2prompt_pipeline.get_traceback_matrix(size_x, size_y)[source]
data_juicer.ops.common.prompt2prompt_pipeline.global_align(x, y, score)[source]
data_juicer.ops.common.prompt2prompt_pipeline.get_aligned_sequences(x, y, trace_back)[source]
data_juicer.ops.common.prompt2prompt_pipeline.get_mapper(x: str, y: str, tokenizer, max_len=77)[source]
data_juicer.ops.common.prompt2prompt_pipeline.get_refinement_mapper(prompts, tokenizer, max_len=77)[source]

data_juicer.ops.common.special_characters module

Module contents

data_juicer.ops.common.get_sentences_from_document(document, model_func=None)[source]

Get sentences from a document.

Parameters:
  • document – document to be split into sentences

  • model_func – sentence model function; if specified, it is used to split the document into sentences.

Returns:

document with the sentences separated by ‘\n’

data_juicer.ops.common.get_words_from_document(document, token_func=None, new_line=True, tab=True)[source]

Get words from a document. Useful to compute ratios, like the stopwords ratio.

Parameters:
  • document – document to be split into words.

  • token_func – tokenizer function; if specified, it is used to split the document into tokens.

  • new_line – whether to use ‘\n’ to split words.

  • tab – whether to use ‘\t’ to split words.

Returns:

word list obtained from document

data_juicer.ops.common.merge_on_whitespace_tab_newline(sentences)[source]

This method is used to merge different levels of sub-sentences into one document. It is the inverse of split_on_newline_tab_whitespace. Removes concatenated separators.

Parameters:

sentences – sentence list to be merged

Returns:

document obtained after merging sub-sentences

data_juicer.ops.common.split_on_newline_tab_whitespace(document)[source]

This method is used to split the document into different levels of sub-sentences.

First split on “\n”, then on “\t”, then on “ ”.

Parameters:

document – document to be split

Returns:

sentence list obtained after splitting document

data_juicer.ops.common.split_on_whitespace(document, new_line=False, tab=False)[source]

Split a document on whitespace. This method also removes consecutive (concatenated) spaces.

Parameters:
  • document – document to be split

  • new_line – whether to split document with ‘\n’

  • tab – whether to split document with ‘\t’

Returns:

word list obtained after splitting document

data_juicer.ops.common.strip(document, strip_characters)[source]

Much faster than document.strip(strip_characters) because strip_characters is a set rather than a str and contains many elements (all the emojis).

Parameters:
  • document – document to be processed

  • strip_characters – characters used for stripping document

Returns:

stripped document

data_juicer.ops.common.words_augmentation(words, group_size, join_char)[source]

Augment words, especially for Chinese (without a space between words) and Vietnamese (with a space between syllables).

Parameters:
  • words – word list to be augmented

  • group_size – the size of word groups that need to be merged

  • join_char – characters to be added between the words of each group

Returns:

word list after augmentation

data_juicer.ops.common.words_refinement(words, lower_case=False, strip_chars=None, use_words_aug=False, words_aug_group_sizes=[2], words_aug_join_char='')[source]

Refine split words. This is not reversible, since the document is split on multiple characters, words are stripped of special characters, and characters are converted to lower case.

Parameters:
  • words – the word list to be refined

  • lower_case – whether to convert words to lowercase

  • strip_chars – characters to be stripped from each word

  • use_words_aug – whether to use word augmentation

  • words_aug_group_sizes – the sizes of word groups that need to be merged

  • words_aug_join_char – characters to be added between the words of each group

Returns:

refined words or word list

data_juicer.ops.common.split_text_by_punctuation(text)[source]

Split text on any Chinese (zh) or English (en) punctuation.

Parameters:

text – text to be split.

Returns:

sub-texts obtained by splitting on any Chinese or English punctuation