“Bad” Data Exhibition
This document exhibits so-called “bad” data found in diverse datasets while processing them with Data-Juicer. The motivations for this exhibition include:
- It helps users better understand how each OP in Data-Juicer finds these “bad” data in order to improve the “quality” of datasets.
- Datasets can differ substantially, so an OP that works well on some datasets might be useless on others.
- No matter how high-quality people consider a dataset to be (e.g. Wikipedia, Books, …), some “bad” data is always hidden in it.
Involved OPs

| OP | Datasets |
|---|---|

Everyone in the community is welcome to add more examples to this table.
Multimodal Datasets
LCS-558K
The pretraining dataset of LLaVA-1.5.

| item | value |
|---|---|
| from OP | |
| id | 004436823 |
| aspect_ratio | 36.2857142857 |
| caption | the timber in honey amber |
| image | |
| comments | Unaligned image and caption contents |

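
To make the `aspect_ratio` statistic above more concrete, here is a minimal sketch that assumes the ratio is defined as width divided by height. The function names, the example dimensions, and the threshold range are hypothetical and are not taken from Data-Juicer.

```python
def aspect_ratio(width: int, height: int) -> float:
    """Return width / height for an image of the given pixel dimensions."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    return width / height


def keep_by_aspect_ratio(width: int, height: int,
                         min_ratio: float = 0.333,
                         max_ratio: float = 3.0) -> bool:
    """Keep an image only if its aspect ratio lies in [min_ratio, max_ratio].

    The default bounds are illustrative assumptions, not library defaults.
    """
    return min_ratio <= aspect_ratio(width, height) <= max_ratio


# Hypothetical dimensions: a very wide, very short banner-like image.
print(keep_by_aspect_ratio(2540, 70))   # ratio is about 36.29, so it is dropped
print(keep_by_aspect_ratio(640, 480))   # ratio is about 1.33, so it is kept
```

An image this elongated is squashed heavily when resized to a square model input, which is one reason such samples tend to be poorly aligned with their captions.
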

| item | value |
|---|---|
| from OP | |
| id | 005568917 |
| image_height | 5177 |
| caption | the us coast guard’s top five most popular aircrafts infographic |
| image | |
| comments | Images with too large a height or width lose too much information after being processed into model input |


| item | value |
|---|---|
| from OP | |
| id | 003301613 |
| image_width | 3469 |
| caption | color the circle by number pages for children to learn colors |
| image | |
| comments | Images with too large a height or width lose too much information after being processed into model input |


| item | value |
|---|---|
| from OP | |
| id | 002642925 |
| image_size | 2,391 bytes |
| caption | pink gold opal and diamond ring |
| image | |
| comments | Images that are too small in file size might be invalid placeholders without meaningful content |


| item | value |
|---|---|
| from OP | |
| id | 001365521 |
| image_text_matching_score | 0.0008432278991676867 |
| caption | a black bmw m140i sports hatch from a dealer’s garage |
| image | |
| comments | Images with too low an image-text matching score might be invalid placeholders without meaningful content |


| item | value |
|---|---|
| from OP | |
| id | 001135426 |
| alnum_ratio | 0.429825 |
| caption | dakin - laptopruck » » » » » » » » » » » » » » » » |
| image | |
| comments | Texts with too low an alnum ratio might contain extra, meaningless tokens |

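
As a rough illustration of the `alnum_ratio` statistic, the sketch below counts the share of characters that are letters or digits. It assumes a character-level definition; the function name and the example strings are made up for illustration, and the exact computation used by the OP may differ (for instance, it could be token-based).

```python
def alnum_ratio(text: str) -> float:
    """Fraction of characters in `text` that are letters or digits."""
    if not text:
        return 0.0
    return sum(ch.isalnum() for ch in text) / len(text)


# A made-up caption padded with decorative symbols scores lower
# than a plain descriptive caption.
print(alnum_ratio("laptop bag » » » »"))
print(alnum_ratio("pink gold opal and diamond ring"))
```

Spaces and punctuation both count against the ratio here, so a threshold would have to be tuned per dataset rather than fixed in advance.
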

| item | value |
|---|---|
| from OP | |
| id | 004292597 |
| char_rep_ratio | 0.720207 |
| caption | harden harden harden harden harden harden harden harden harden harden harden harden harden harden harden harden harden harden harden harden harden |
| image | |
| comments | Texts with too high a character repetition ratio might contain repeated content (captions of LCS-558K are generated by a BLIP model) |

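
One simple way to quantify character-level repetition is to slide a fixed-length window over the text and measure how many windows hold an n-gram that occurs more than once. The sketch below is a simplified stand-in under that assumption: the function name and the window length of 10 are hypothetical, and the real OP may weight n-grams differently, so the values it produces will not match the table above.

```python
from collections import Counter


def char_rep_ratio(text: str, n: int = 10) -> float:
    """Share of character n-gram occurrences whose n-gram appears more than once."""
    ngrams = [text[i:i + n] for i in range(len(text) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(ngrams)


print(char_rep_ratio("harden " * 21))                # every 10-gram recurs
print(char_rep_ratio("abcdefghijklmnopqrstuvwxyz"))  # no 10-gram recurs
```
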

| item | value |
|---|---|
| from OP | |
| id | 001013268 |
| flagged_words_ratio | 0.263158 |
| caption | porn photo video porn cartoon porn porn pictures online for adult porn |
| image | Won’t display this image |
| comments | Texts with a non-zero flagged words ratio might contain NSFW content |

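
A flagged-words ratio can be sketched as the fraction of words that appear in a block list. The version below assumes whitespace tokenization and a tiny placeholder word list; both are illustrative assumptions, and the tokenizer and word lists used by the actual OP are not described here.

```python
from typing import AbstractSet


def flagged_words_ratio(text: str, flagged: AbstractSet[str]) -> float:
    """Fraction of whitespace-separated, lowercased words found in `flagged`."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(word in flagged for word in words) / len(words)


# Placeholder list; a real list would be far larger and language-specific.
FLAGGED = {"badword"}
print(flagged_words_ratio("a badword here and a badword there", FLAGGED))
```
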

| item | value |
|---|---|
| from OP | |
| id | 002606088 |
| perplexity | 19789.9 |
| caption | real white pearl stud earrings sterling 925 925 9250 9210 9240 9210 9280 stud |
| image | |
| comments | Texts with too high a perplexity might contain meaningless content |

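
Perplexity is the exponential of the average negative log-probability that a language model assigns to each token, so text the model finds surprising scores high. The sketch below shows only that formula; the log-probabilities are hypothetical inputs, and which language model the OP uses to produce them is not specified in this document.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity from per-token natural-log probabilities: exp(-mean(log p))."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# Hypothetical per-token probabilities from some language model.
fluent = [math.log(0.2)] * 6       # each token is fairly likely
garbled = [math.log(0.0001)] * 6   # each token is very unlikely
print(perplexity(fluent))
print(perplexity(garbled))
```
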

| item | value |
|---|---|
| from OP | |
| ids | 004559803, 003716167, 005659131 |
| image | |
| comments | There are some duplicate images with different names |

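
Duplicates that differ only in file name can be found by hashing file contents instead of comparing names. The sketch below uses an exact SHA-256 hash of the raw bytes as a deliberately simple stand-in: it only catches byte-identical files, whereas a deduplication OP may rely on perceptual hashing to also catch re-encoded or resized copies. The function name and the in-memory sample data are hypothetical.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List


def find_exact_duplicates(images: Dict[str, bytes]) -> List[List[str]]:
    """Group image ids whose raw bytes share the same SHA-256 digest."""
    groups = defaultdict(list)
    for image_id, data in images.items():
        groups[hashlib.sha256(data).hexdigest()].append(image_id)
    # Only groups with more than one id are duplicates.
    return [ids for ids in groups.values() if len(ids) > 1]


# Hypothetical in-memory images keyed by id; real code would read files from disk.
images = {"a": b"\x89PNG-1", "b": b"\x89PNG-2", "c": b"\x89PNG-1"}
print(find_exact_duplicates(images))
```
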
MMC4

| item | value |
|---|---|
| from OP | |
| aspect_ratios | [1.6, 10.4651162791] |
| corresponding text | “We found that kahweol acetate and cafestol inhibited growth of cancer cells in mice, but the combination seemed to work synergistically, leading to a significantly slower tumour growth than in untreated mice,” said lead author Hiroaki Iwamoto. |
| image | |
| comments | Unaligned image and caption contents. Images with too large an aspect ratio might lose too much information after being processed into model input |


| item | value |
|---|---|
| from OP | |
| image_sizes | [453, 198343] |
| corresponding text | If you’re in InfoSec, you are well aware of how this flies in the face of security team demographics. |
| image | |
| comments | Unaligned image and caption contents. Images that are too small in file size might contain only simple, meaningless content |


| item | value |
|---|---|
| from OP | |
| image_sizes | [481, 517, 532, 482] |
| corresponding text | [“Level Up Coin (LUC) is a cryptocurrency token and operates on the Ethereum platform.”, “Level Up Coin has a current supply of 1,298,120,000 LUC with 996,923,370 LUC in circulation.”, “The last known price of Level Up Coin is 0.000257 USD and is up 23.24% over the last 24 hours.”, “More information can be found at https://play2live.io.”] |
| image | |
| comments | Images that are too small in file size might be QR codes that contain sensitive content |


| item | value |
|---|---|
| from OP | |
| image_text_matching_score | [0.0012427607] |
| corresponding text | Many a times, we face problems connecting to the internet in spite of the Android smartphone being connected to the Wi-Fi. |
| image | |
| comments | Unaligned image and caption contents. Some ad images might be mistakenly regarded as part of the sample. |


| item | value |
|---|---|
| from OP | |
| word_rep_ratio | 0.917219 |
| text | |
| comments | Texts with too high a word repetition ratio might be a list of similar, repetitive, but not identical items |

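
Word-level repetition explains why a list of similar but not identical entries can still score high: the entries share most of their word n-grams even though no two lines match exactly. The sketch below assumes whitespace tokenization and 3-word windows, both hypothetical choices, and it is a simplified measure rather than the formula used by the OP.

```python
from collections import Counter


def word_rep_ratio(text: str, n: int = 3) -> float:
    """Share of word n-gram occurrences whose n-gram appears more than once."""
    words = text.lower().split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    return sum(c for c in counts.values() if c > 1) / len(ngrams)


# Made-up listing: three entries that differ in a single word each.
listing = ("price of coin A is up . "
           "price of coin B is up . "
           "price of coin C is up .")
print(word_rep_ratio(listing))
print(word_rep_ratio("the quick brown fox jumps over a lazy dog"))
```
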
Text-only Datasets
Wikipedia

| item | value |
|---|---|
| from OP | |
| wiki page | |
| alnum_ratio | 0.262965 |
| text | |
| comments | Texts with too low an alnum ratio might contain only structural content, which might be hard to learn from |


| item | value |
|---|---|
| from OP | |
| wiki page | |
| char_rep_ratio | 0.818624 |
| text | |
| comments | Texts with too high a character repetition ratio might repeat the same style code for the cells of a table |


| item | value |
|---|---|
| from OP | |
| wiki page | |
| special_char_ratio | 0.861592 |
| text | |
| comments | Texts with too many special characters might be a list of other pages |


| item | value |
|---|---|
| from OP | |
| wiki page | |
| text_len | 9 |
| text | |
| comments | Texts that are too short might be empty pages |


| item | value |
|---|---|
| from OP | |
| wiki page | |
| word_rep_ratio | 0.965517 |
| text | |
| comments | Texts with too high a word repetition ratio might be a list of related, repetitive, but not identical items |

Books

| item | value |
|---|---|
| from OP | |
| alnum_ratio | 0 |
| text | |
| comments | Texts with too low an alnum ratio might contain only meaningless tokens |


| item | value |
|---|---|
| from OP | |
| char_rep_ratio | 0.86 |
| text | |
| comments | Texts with too high a character repetition ratio might contain a lot of repeated content |


| item | value |
|---|---|
| from OP | |
| perplexity | 380817.4 |
| text | |
| comments | Texts with too high a perplexity might contain hard-to-understand content (e.g. ISBNs) |


| item | value |
|---|---|
| from OP | |
| lang_score | 0.057 |
| lang | en |
| text | |
| comments | Texts with too low a language score might contain unreadable text |


| item | value |
|---|---|
| from OP | |
| special_char_ratio | 0.999 |
| text | |
| comments | Texts with too high a special character ratio might contain meaningless content |

Stack Exchange

| item | value |
|---|---|
| from OP | |
| char_rep_ratio | 0.969099481 |
| text | |
| comments | Texts with too high a character repetition ratio might contain the base64 code of an image |


| item | value |
|---|---|
| from OP | |
| num_words | 2 |
| text | |
| comments | Texts with too few words might be missing content |

ArXiv

| item | value |
|---|---|
| from OP | |
| text_len | 7 |
| text | |
| comments | Texts that are too short might be missing content |


| item | value |
|---|---|
| from OP | |
| perplexity | 244697 |
| text | |
| comments | Texts with too high a perplexity might be the table area in LaTeX code |

GitHub Code

| item | value |
|---|---|
| from OP | |
| text_len | 10 |
| text | |
| comments | Code that is too short might be missing or meaningless content |


| item | value |
|---|---|
| from OP | |
| avg_line_length | 4.8571428571 |
| text | |
| comments | Code with too short an average line length might be “bad” code |

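
The average line length statistic can be approximated as the mean number of characters per line. The sketch below assumes line terminators are excluded and blank lines are counted; those choices and the function name are hypothetical, and the OP may define the statistic slightly differently.

```python
def avg_line_length(code: str) -> float:
    """Mean number of characters per line, ignoring line terminators."""
    lines = code.splitlines()
    if not lines:
        return 0.0
    return sum(len(line) for line in lines) / len(lines)


# A made-up file consisting almost entirely of braces has very short lines.
snippet = "{\n  a\n}\n"
print(avg_line_length(snippet))
```

Very short averages often point to files dominated by brackets or single tokens, while very long averages tend to indicate minified or generated code, so both ends of the range can be worth filtering.
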