data_juicer.analysis.measure module

class data_juicer.analysis.measure.Measure[source]

Bases: object

Base class for distribution measures.

name = 'base'
measure(*args, **kwargs)[source]
class data_juicer.analysis.measure.KLDivMeasure[source]

Bases: Measure

Measure Kullback-Leibler divergence.

name = 'kl_divergence'
measure(p, q)[source]
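The exact input types accepted by `measure` depend on the data-juicer version (typically tensors or lists of probabilities), but the quantity it computes can be sketched in pure Python. `kl_div` below is a hypothetical stand-in for illustration, not the library's implementation:

```python
import math

def kl_div(p, q):
    # Hypothetical sketch of D_KL(p || q) for two discrete probability
    # distributions given as aligned lists; terms with p_i == 0 contribute 0.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
print(kl_div(p, q))  # non-negative; exactly 0.0 when p == q
```

Note that KL divergence is asymmetric: `kl_div(p, q)` generally differs from `kl_div(q, p)`.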
class data_juicer.analysis.measure.JSDivMeasure[source]

Bases: Measure

Measure Jensen-Shannon divergence.

name = 'js_divergence'
measure(p, q)[source]
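As a rough sketch of the quantity this measure computes (the helper names here are hypothetical, not the library's API): the Jensen-Shannon divergence symmetrizes KL divergence via the mixture distribution m = (p + q) / 2, and is bounded above by ln 2 in nats:

```python
import math

def kl_div(p, q):
    # D_KL(p || q); terms with p_i == 0 contribute 0.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_div(p, q):
    # Jensen-Shannon divergence: symmetric in p and q, bounded by ln 2.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_div(p, m) + 0.5 * kl_div(q, m)

print(js_div([1.0, 0.0], [0.0, 1.0]))  # maximal for disjoint supports
```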
class data_juicer.analysis.measure.CrossEntropyMeasure[source]

Bases: Measure

Measure Cross-Entropy.

name = 'cross_entropy'
measure(p, q)[source]
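The underlying quantity can be sketched as follows (a minimal illustration in pure Python, assuming discrete distributions as aligned lists; not the library's implementation):

```python
import math

def cross_entropy(p, q):
    # Cross-entropy H(p, q) = -sum_i p_i * log(q_i), in nats.
    # By Gibbs' inequality, H(p, q) >= H(p), with equality iff p == q.
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

print(cross_entropy([0.5, 0.5], [0.9, 0.1]))
```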
class data_juicer.analysis.measure.EntropyMeasure[source]

Bases: Measure

Measure Entropy.

name = 'entropy'
measure(p)[source]
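Unlike the other measures in this module, entropy takes a single distribution. A minimal sketch of the quantity (assuming a discrete distribution as a list of probabilities; not the library's implementation):

```python
import math

def entropy(p):
    # Shannon entropy H(p) = -sum_i p_i * log(p_i), in nats.
    # Maximal for the uniform distribution; 0 for a point mass.
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

print(entropy([0.25, 0.25, 0.25, 0.25]))  # uniform over 4 outcomes: ln 4
```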
class data_juicer.analysis.measure.RelatedTTestMeasure[source]

Bases: Measure

Measure a t-test for two related distributions based on their histograms over the same bins.

Ref: https://en.wikipedia.org/wiki/Student%27s_t-test

For continuous features or distributions, the input can be a list of dataset stats. For discrete features or distributions, the input can be a list of tags or categories.

name = 't-test'
static stats_to_hist(p, q)[source]
static category_to_hist(p, q)[source]
measure(p, q)[source]
Parameters:
  • p – the first feature or distribution. (stats/tags/categories)

  • q – the second feature or distribution. (stats/tags/categories)

Returns:

the t-test result object (scipy.stats TtestResult, see https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats._result_classes.TtestResult.html)
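The statistic behind this measure can be sketched in pure Python. `paired_t_statistic` below is a hypothetical illustration of the paired (related) two-sample t statistic only; the library itself delegates to scipy and returns a result object that also carries the p-value:

```python
import math

def paired_t_statistic(p, q):
    # Paired two-sample t statistic: t = mean(d) / (s_d / sqrt(n)),
    # where d are the pairwise differences and s_d is the sample
    # standard deviation (ddof=1).
    d = [pi - qi for pi, qi in zip(p, q)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)
    return mean / math.sqrt(var / n)

# Two related samples, e.g. the same feature measured before/after processing.
print(paired_t_statistic([1, 2, 3, 4, 5], [1.1, 2.0, 3.2, 3.9, 5.1]))
```

Swapping the two inputs flips the sign of the statistic but leaves its magnitude (and hence the p-value) unchanged.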