Hi, I'm glad to know tea-tasting is useful for your work. Before deciding whether to implement the feature, let me clarify two points:
-
Thanks for considering the implementation of my suggestions! To clarify your points:
To give more context about my use case, I usually need to evaluate experiment results across a significant number of segments. I want to report the results using the relative lift, the confidence interval, and the p-value for each segment. Sometimes, for some of the underpowered segments in my analysis, I run into the issue of the p-value for the absolute change and the relative confidence interval contradicting each other about the null hypothesis conclusion. Also note that, based on my checks for the same segments, the confidence interval and the p-value for the absolute change never contradict each other, as expected. A sketch of the per-segment check I run is included below.

In my view, adding the p-value and test statistic for the relative lift would make the results provided by tea-tasting more comprehensive and would allow users to report confidence intervals and p-values for the relative lift without the risk of them contradicting each other.

Thanks again for considering this and for creating and sharing the library. Best regards,
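A minimal sketch of the consistency check described above. The segment data here is a hypothetical stand-in (two independent draws of the demo dataset), and I assume the `to_pandas` result method exposes the raw result fields by name:

```python
import tea_tasting as tt

# Hypothetical stand-in for real segment-level data.
segments = {
    "segment_a": tt.make_users_data(seed=1),
    "segment_b": tt.make_users_data(seed=2),
}

experiment = tt.Experiment(orders_per_user=tt.Mean("orders"))
for name, segment_data in segments.items():
    row = experiment.analyze(segment_data).to_pandas().iloc[0]
    # With the default two-sided test at 95% confidence, the absolute-change
    # test rejects iff pvalue < 0.05, and the relative CI rejects iff it
    # excludes zero. For underpowered segments the two can disagree.
    rejects_by_pvalue = row["pvalue"] < 0.05
    rejects_by_rel_ci = (
        row["rel_effect_size_ci_lower"] > 0
        or row["rel_effect_size_ci_upper"] < 0
    )
    if rejects_by_pvalue != rejects_by_rel_ci:
        print(f"{name}: absolute p-value and relative CI disagree")
```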
-
Thank you for the kind words about tea-tasting and for the detailed explanation. Here are my thoughts:
````python
"""Custom metrics for the analysis of means."""
# ruff: noqa: PD901
from __future__ import annotations
from collections.abc import Sequence
import math
from typing import TYPE_CHECKING, NamedTuple, overload
import scipy.optimize
import scipy.stats
import tea_tasting.aggr
import tea_tasting.config
from tea_tasting.metrics.base import (
AggrCols,
MetricBaseAggregated,
MetricPowerResults,
PowerBaseAggregated,
)
import tea_tasting.utils
if TYPE_CHECKING:
from collections.abc import Callable
from typing import Literal, TypeVar
N = TypeVar("N", bound=float | int | None)
MAX_ITER = 100
class CustomMeanResult(NamedTuple):
"""Result of the analysis of means.
Attributes:
control: Control mean.
treatment: Treatment mean.
effect_size: Absolute effect size. Difference between the two means.
effect_size_ci_lower: Lower bound of the absolute effect size
confidence interval.
effect_size_ci_upper: Upper bound of the absolute effect size
confidence interval.
rel_effect_size: Relative effect size. Difference between the two means,
divided by the control mean.
rel_effect_size_ci_lower: Lower bound of the relative effect size
confidence interval.
rel_effect_size_ci_upper: Upper bound of the relative effect size
confidence interval.
pvalue: P-value.
statistic: Statistic (standardized effect size).
pvalue_rel: P-value for the relative difference between two means.
statistic_rel: Statistic for the relative difference between two means.
"""
control: float
treatment: float
effect_size: float
effect_size_ci_lower: float
effect_size_ci_upper: float
rel_effect_size: float
rel_effect_size_ci_lower: float
rel_effect_size_ci_upper: float
pvalue: float
statistic: float
pvalue_rel: float
statistic_rel: float
class MeanPowerResult(NamedTuple):
"""Power analysis results.
Attributes:
power: Statistical power.
effect_size: Absolute effect size. Difference between the two means.
rel_effect_size: Relative effect size. Difference between the two means,
divided by the control mean.
n_obs: Number of observations in the control and in the treatment together.
"""
power: float
effect_size: float
rel_effect_size: float
n_obs: float
MeanPowerResults = MetricPowerResults[MeanPowerResult]
class CustomRatioOfMeans( # noqa: D101
MetricBaseAggregated[CustomMeanResult],
PowerBaseAggregated[MeanPowerResults],
):
def __init__( # noqa: PLR0913
self,
numer: str,
denom: str | None = None,
numer_covariate: str | None = None,
denom_covariate: str | None = None,
*,
alternative: Literal["two-sided", "greater", "less"] | None = None,
confidence_level: float | None = None,
equal_var: bool | None = None,
use_t: bool | None = None,
alpha: float | None = None,
ratio: float | int | None = None,
power: float | None = None,
effect_size: float | int | Sequence[float | int] | None = None,
rel_effect_size: float | Sequence[float] | None = None,
n_obs: int | Sequence[int] | None = None,
) -> None:
"""Metric for the analysis of ratios of means.
Args:
numer: Numerator column name.
denom: Denominator column name.
numer_covariate: Covariate numerator column name.
denom_covariate: Covariate denominator column name.
alternative: Alternative hypothesis:
- `"two-sided"`: the means are unequal,
- `"greater"`: the mean in the treatment variant is greater than the mean
in the control variant,
- `"less"`: the mean in the treatment variant is less than the mean
in the control variant.
confidence_level: Confidence level for the confidence interval.
equal_var: Defines whether equal variance is assumed. If `True`,
pooled variance is used for the calculation of the standard error
of the difference between two means.
use_t: Defines whether to use the Student's t-distribution (`True`) or
the Normal distribution (`False`).
alpha: Significance level. Only for the analysis of power.
ratio: Ratio of the number of observations in the treatment
relative to the control. Only for the analysis of power.
power: Statistical power. Only for the analysis of power.
effect_size: Absolute effect size. Difference between the two means.
Only for the analysis of power.
rel_effect_size: Relative effect size. Difference between the two means,
divided by the control mean. Only for the analysis of power.
n_obs: Number of observations in the control and in the treatment together.
Only for the analysis of power.
Parameter defaults:
Defaults for parameters `alpha`, `alternative`, `confidence_level`,
`equal_var`, `n_obs`, `power`, `ratio`, and `use_t` can be changed
using the `config_context` and `set_context` functions.
See the [Global configuration](https://tea-tasting.e10v.me/api/config/)
reference for details.
References:
- [Deng, A., Knoblich, U., & Lu, J. (2018). Applying the Delta Method in Metric Analytics: A Practical Guide with Novel Ideas](https://alexdeng.github.io/public/files/kdd2018-dm.pdf).
- [Deng, A., Xu, Y., Kohavi, R., & Walker, T. (2013). Improving the Sensitivity of Online Controlled Experiments by Utilizing Pre-Experiment Data](https://exp-platform.com/Documents/2013-02-CUPED-ImprovingSensitivityOfControlledExperiments.pdf).
Examples:
```pycon
>>> import tea_tasting as tt
>>> experiment = tt.Experiment(
... orders_per_session=tt.RatioOfMeans("orders", "sessions"),
... )
>>> data = tt.make_users_data(seed=42)
>>> result = experiment.analyze(data)
>>> print(result)
metric control treatment rel_effect_size rel_effect_size_ci pvalue
orders_per_session 0.266 0.289 8.8% [-0.89%, 19%] 0.0762
```
With CUPED:
```pycon
>>> experiment = tt.Experiment(
... orders_per_session=tt.RatioOfMeans(
... "orders",
... "sessions",
... "orders_covariate",
... "sessions_covariate",
... ),
... )
>>> data = tt.make_users_data(seed=42, covariates=True)
>>> result = experiment.analyze(data)
>>> print(result)
metric control treatment rel_effect_size rel_effect_size_ci pvalue
orders_per_session 0.262 0.293 12% [4.2%, 21%] 0.00229
```
Power analysis:
```pycon
>>> data = tt.make_users_data(
... seed=42,
... sessions_uplift=0,
... orders_uplift=0,
... revenue_uplift=0,
... covariates=True,
... )
>>> orders_per_session = tt.RatioOfMeans(
... "orders",
... "sessions",
... "orders_covariate",
... "sessions_covariate",
... n_obs=(10_000, 20_000),
... )
>>> # Solve for effect size.
>>> print(orders_per_session.solve_power(data))
power effect_size rel_effect_size n_obs
80% 0.0177 6.8% 10000
80% 0.0125 4.8% 20000
>>> orders_per_session = tt.RatioOfMeans(
... "orders",
... "sessions",
... "orders_covariate",
... "sessions_covariate",
... rel_effect_size=0.05,
... )
>>> # Solve for the total number of observations.
>>> print(orders_per_session.solve_power(data, "n_obs"))
power effect_size rel_effect_size n_obs
80% 0.0130 5.0% 18515
>>> orders_per_session = tt.RatioOfMeans(
... "orders",
... "sessions",
... "orders_covariate",
... "sessions_covariate",
... rel_effect_size=0.1,
... )
>>> # Solve for power. Infer number of observations from the sample.
>>> print(orders_per_session.solve_power(data, "power"))
power effect_size rel_effect_size n_obs
74% 0.0261 10% 4000
```
""" # noqa: E501
self.numer = tea_tasting.utils.check_scalar(numer, "numer", typ=str)
self.denom = tea_tasting.utils.check_scalar(denom, "denom", typ=str | None)
self.numer_covariate = tea_tasting.utils.check_scalar(
numer_covariate, "numer_covariate", typ=str | None)
self.denom_covariate = tea_tasting.utils.check_scalar(
denom_covariate, "denom_covariate", typ=str | None)
self.alternative = (
tea_tasting.utils.auto_check(alternative, "alternative")
if alternative is not None
else tea_tasting.config.get_config("alternative")
)
self.confidence_level = (
tea_tasting.utils.auto_check(confidence_level, "confidence_level")
if confidence_level is not None
else tea_tasting.config.get_config("confidence_level")
)
self.equal_var = (
tea_tasting.utils.auto_check(equal_var, "equal_var")
if equal_var is not None
else tea_tasting.config.get_config("equal_var")
)
self.use_t = (
tea_tasting.utils.auto_check(use_t, "use_t")
if use_t is not None
else tea_tasting.config.get_config("use_t")
)
self.alpha = (
tea_tasting.utils.auto_check(alpha, "alpha")
if alpha is not None
else tea_tasting.config.get_config("alpha")
)
self.ratio = (
tea_tasting.utils.auto_check(ratio, "ratio")
if ratio is not None
else tea_tasting.config.get_config("ratio")
)
self.power = (
tea_tasting.utils.auto_check(power, "power")
if power is not None
else tea_tasting.config.get_config("power")
)
if effect_size is not None and rel_effect_size is not None:
raise ValueError(
"Both `effect_size` and `rel_effect_size` are not `None`. "
"Only one of them should be defined.",
)
if isinstance(effect_size, Sequence):
for x in effect_size:
tea_tasting.utils.check_scalar(
x, "effect_size", typ=float | int,
gt=float("-inf"), lt=float("inf"), ne=0,
)
elif effect_size is not None:
tea_tasting.utils.check_scalar(
effect_size, "effect_size", typ=float | int,
gt=float("-inf"), lt=float("inf"), ne=0,
)
self.effect_size = effect_size
if isinstance(rel_effect_size, Sequence):
for x in rel_effect_size:
tea_tasting.utils.check_scalar(
x, "rel_effect_size", typ=float | int,
gt=float("-inf"), lt=float("inf"), ne=0,
)
elif rel_effect_size is not None:
tea_tasting.utils.check_scalar(
rel_effect_size, "rel_effect_size", typ=float | int,
gt=float("-inf"), lt=float("inf"), ne=0,
)
self.rel_effect_size = rel_effect_size
self.n_obs = (
tea_tasting.utils.auto_check(n_obs, "n_obs")
if n_obs is not None
else tea_tasting.config.get_config("n_obs")
)
@property
def aggr_cols(self) -> AggrCols:
"""Columns to be aggregated for a metric analysis."""
cols = tuple(
col for col in (
self.numer,
self.denom,
self.numer_covariate,
self.denom_covariate,
)
if col is not None
)
return AggrCols(
has_count=True,
mean_cols=cols,
var_cols=cols,
cov_cols=tuple(
(col0, col1)
for col0 in cols
for col1 in cols
if col0 < col1
),
)
def analyze_aggregates(
self,
control: tea_tasting.aggr.Aggregates,
treatment: tea_tasting.aggr.Aggregates,
) -> CustomMeanResult:
"""Analyze a metric in an experiment using aggregated statistics.
Args:
control: Control data.
treatment: Treatment data.
Returns:
Analysis result.
"""
control = control.with_zero_div()
treatment = treatment.with_zero_div()
total = control + treatment
covariate_coef = self._covariate_coef(total)
covariate_mean = total.mean(self.numer_covariate) / total.mean(
self.denom_covariate)
return self._analyze_stats(
contr_mean=self._metric_mean(control, covariate_coef, covariate_mean),
contr_var=self._metric_var(control, covariate_coef),
contr_count=control.count(),
treat_mean=self._metric_mean(treatment, covariate_coef, covariate_mean),
treat_var=self._metric_var(treatment, covariate_coef),
treat_count=treatment.count(),
)
def solve_power_from_aggregates(
self,
data: tea_tasting.aggr.Aggregates,
parameter: Literal[
"power", "effect_size", "rel_effect_size", "n_obs"] = "rel_effect_size",
) -> MeanPowerResults:
"""Solve for a parameter of the power of a test.
Args:
data: Sample data.
parameter: Parameter name.
Returns:
Power analysis result.
"""
tea_tasting.utils.check_scalar(
parameter,
"parameter",
in_={"power", "effect_size", "rel_effect_size", "n_obs"},
)
data = data.with_zero_div()
covariate_coef = self._covariate_coef(data)
covariate_mean = data.mean(self.numer_covariate) / data.mean(
self.denom_covariate)
metric_mean = self._metric_mean(data, covariate_coef, covariate_mean)
power, effect_size, rel_effect_size, n_obs = self._validate_power_parameters(
metric_mean=metric_mean,
sample_count=data.count(),
parameter=parameter,
)
result = MeanPowerResults()
for effect_size_i, rel_effect_size_i in zip(
effect_size,
rel_effect_size,
strict=True,
):
for n_obs_i in n_obs:
parameter_value = self._solve_power_from_stats(
sample_var=self._metric_var(data, covariate_coef),
sample_count=n_obs_i,
effect_size=effect_size_i,
power=power,
)
result.append(MeanPowerResult(
power=parameter_value if parameter == "power" else power, # type: ignore
effect_size=(
parameter_value
if parameter in {"effect_size", "rel_effect_size"}
else effect_size_i
), # type: ignore
rel_effect_size=(
parameter_value / metric_mean
if parameter in {"effect_size", "rel_effect_size"}
else rel_effect_size_i
), # type: ignore
n_obs=(
math.ceil(parameter_value)
if parameter == "n_obs"
else n_obs_i
), # type: ignore
))
return result
def _validate_power_parameters(
self,
metric_mean: float,
sample_count: int,
parameter: Literal["power", "effect_size", "rel_effect_size", "n_obs"],
) -> tuple[
float | None, # power
Sequence[float | int | None], # effect_size
Sequence[float | None], # rel_effect_size
Sequence[int | None], # n_obs
]:
n_obs = None
effect_size = None
rel_effect_size = None
power = None
if parameter in {"power", "n_obs"}:
if self.effect_size is None and self.rel_effect_size is None:
raise ValueError(
"Both `effect_size` and `rel_effect_size` are `None`. "
"One of them should be defined.",
)
effect_size = (
self.effect_size if self.rel_effect_size is None
else tuple(
rel_effect_size * metric_mean
for rel_effect_size in _to_seq(self.rel_effect_size)
)
)
rel_effect_size = (
self.rel_effect_size if self.effect_size is None
else tuple(
effect_size / metric_mean
for effect_size in _to_seq(self.effect_size)
)
)
if parameter in {"power", "effect_size", "rel_effect_size"}:
n_obs = (sample_count,) if self.n_obs is None else self.n_obs
if parameter in {"effect_size", "rel_effect_size", "n_obs"}:
power = self.power
return power, _to_seq(effect_size), _to_seq(rel_effect_size), _to_seq(n_obs)
def _covariate_coef(self, aggr: tea_tasting.aggr.Aggregates) -> float:
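        # CUPED / control-variates coefficient:
        # theta = cov(metric, covariate) / var(covariate).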
covariate_var = aggr.ratio_var(self.numer_covariate, self.denom_covariate)
if covariate_var == 0:
return 0
return self._covariate_cov(aggr) / covariate_var
def _covariate_cov(self, aggr: tea_tasting.aggr.Aggregates) -> float:
return aggr.ratio_cov(
self.numer,
self.denom,
self.numer_covariate,
self.denom_covariate,
)
def _metric_mean(
self,
aggr: tea_tasting.aggr.Aggregates,
covariate_coef: float,
covariate_mean: float,
) -> float:
value = aggr.mean(self.numer) / aggr.mean(self.denom)
covariate = aggr.mean(self.numer_covariate) / aggr.mean(self.denom_covariate)
return value - covariate_coef*(covariate - covariate_mean)
def _metric_var(
self,
aggr: tea_tasting.aggr.Aggregates,
covariate_coef: float,
) -> float:
var = aggr.ratio_var(self.numer, self.denom)
covariate_var = aggr.ratio_var(self.numer_covariate, self.denom_covariate)
covariate_cov = self._covariate_cov(aggr)
return (
var
+ covariate_coef * covariate_coef * covariate_var
- 2 * covariate_coef * covariate_cov
)
def _analyze_stats(
self,
contr_mean: float,
contr_var: float,
contr_count: int,
treat_mean: float,
treat_var: float,
treat_count: int,
) -> CustomMeanResult:
scale, distr, _ = self._scale_and_distr(
contr_var=contr_var,
contr_count=contr_count,
treat_var=treat_var,
treat_count=treat_count,
)
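        # Delta method on the log scale: var(log m) ~= var(m) / m**2. The
        # relative CI and statistic are built from this log-scale standard error.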
log_scale, log_distr, _ = self._scale_and_distr(
contr_var=contr_var / contr_mean / contr_mean,
contr_count=contr_count,
treat_var=treat_var / treat_mean / treat_mean,
treat_count=treat_count,
)
means_ratio = treat_mean / contr_mean
effect_size = treat_mean - contr_mean
statistic = effect_size / scale
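        # Relative statistic on the log scale, matching the construction
        # of the relative confidence interval below.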
statistic_rel = math.log(means_ratio) / log_scale
if self.alternative == "greater":
q = self.confidence_level
effect_size_ci_lower = effect_size + scale*distr.isf(q)
means_ratio_ci_lower = means_ratio * math.exp(log_scale * log_distr.isf(q))
effect_size_ci_upper = means_ratio_ci_upper = float("+inf")
pvalue = distr.sf(statistic)
            pvalue_rel = log_distr.sf(statistic_rel)
elif self.alternative == "less":
q = self.confidence_level
effect_size_ci_lower = means_ratio_ci_lower = float("-inf")
effect_size_ci_upper = effect_size + scale*distr.ppf(q)
means_ratio_ci_upper = means_ratio * math.exp(log_scale * log_distr.ppf(q))
pvalue = distr.cdf(statistic)
            pvalue_rel = log_distr.cdf(statistic_rel)
else: # two-sided
q = (1 + self.confidence_level) / 2
half_ci = scale * distr.ppf(q)
effect_size_ci_lower = effect_size - half_ci
effect_size_ci_upper = effect_size + half_ci
rel_half_ci = math.exp(log_scale * log_distr.ppf(q))
means_ratio_ci_lower = means_ratio / rel_half_ci
means_ratio_ci_upper = means_ratio * rel_half_ci
pvalue = 2 * distr.sf(abs(statistic))
            pvalue_rel = 2 * log_distr.sf(abs(statistic_rel))
return CustomMeanResult(
control=contr_mean,
treatment=treat_mean,
effect_size=effect_size,
effect_size_ci_lower=effect_size_ci_lower,
effect_size_ci_upper=effect_size_ci_upper,
rel_effect_size=means_ratio - 1,
rel_effect_size_ci_lower=means_ratio_ci_lower - 1,
rel_effect_size_ci_upper=means_ratio_ci_upper - 1,
pvalue=pvalue,
statistic=statistic,
pvalue_rel=pvalue_rel,
statistic_rel=statistic_rel,
)
def _solve_power_from_stats(
self,
sample_var: float,
sample_count: int | None = None,
effect_size: float | None = None,
power: float | None = None,
) -> float | int:
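        # Exactly one of power, effect size, or sample size is unknown here:
        # compute power directly, or solve for the other two with Brent's method.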
if power is None and effect_size is not None and sample_count is not None:
return self._power_from_stats(
sample_var=sample_var,
sample_count=sample_count,
effect_size=effect_size,
)
if power is not None and effect_size is None and sample_count is not None:
def fn(x: float | int) -> float:
return power - self._power_from_stats(
sample_var=sample_var,
sample_count=sample_count,
effect_size=x,
)
sign = -1 if self.alternative == "less" else 1
other_bound = _find_boundary(
fn,
sign * 10 * math.sqrt(sample_var / sample_count),
)
lower_bound, upper_bound = sorted((0, other_bound))
if power is not None and effect_size is not None and sample_count is None:
def fn(x: float | int) -> float:
return power - self._power_from_stats(
sample_var=sample_var,
sample_count=x,
effect_size=effect_size,
)
lower_bound = 3
upper_bound = _find_boundary(fn, 10)
return scipy.optimize.brentq(fn, lower_bound, upper_bound, maxiter=MAX_ITER) # type: ignore
def _power_from_stats(
self,
sample_var: float,
sample_count: int | float,
effect_size: float,
) -> float:
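        # Split the total sample between control and treatment according to
        # self.ratio, then evaluate the alternative (noncentral) distribution
        # at the critical value of the null distribution.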
_, null_distr, alt_distr = self._scale_and_distr(
contr_var=sample_var,
contr_count=sample_count / (1 + self.ratio),
treat_var=sample_var,
treat_count=sample_count * self.ratio / (1 + self.ratio),
effect_size=effect_size,
)
if self.alternative == "greater":
stat_critical = null_distr.isf(self.alpha)
return alt_distr.sf(stat_critical)
if self.alternative == "less":
stat_critical = null_distr.ppf(self.alpha)
return alt_distr.cdf(stat_critical)
# two-sided
stat_critical = null_distr.isf(self.alpha / 2)
return alt_distr.cdf(-stat_critical) + alt_distr.sf(stat_critical)
@overload
def _scale_and_distr(
self,
contr_var: float,
contr_count: int | float,
treat_var: float,
treat_count: int | float,
effect_size: None = None,
) -> tuple[float, scipy.stats.rv_frozen, None]:
...
@overload
def _scale_and_distr(
self,
contr_var: float,
contr_count: int | float,
treat_var: float,
treat_count: int | float,
effect_size: float,
) -> tuple[float, scipy.stats.rv_frozen, scipy.stats.rv_frozen]:
...
def _scale_and_distr(
self,
contr_var: float,
contr_count: int | float,
treat_var: float,
treat_count: int | float,
effect_size: float | None = None,
) -> tuple[float, scipy.stats.rv_frozen, scipy.stats.rv_frozen | None]:
if self.equal_var:
pooled_var = (
(contr_count - 1)*contr_var + (treat_count - 1)*treat_var
) / (contr_count + treat_count - 2)
scale = math.sqrt(pooled_var/contr_count + pooled_var/treat_count)
else:
scale = math.sqrt(contr_var/contr_count + treat_var/treat_count)
if self.use_t:
if self.equal_var:
df = contr_count + treat_count - 2
else:
contr_mean_var = contr_var / contr_count
treat_mean_var = treat_var / treat_count
df = (contr_mean_var + treat_mean_var)**2 / (
contr_mean_var**2 / (contr_count - 1)
+ treat_mean_var**2 / (treat_count - 1)
)
null_distr = scipy.stats.t(df=df)
alt_distr = None if effect_size is None else scipy.stats.nct(
df=df, nc=effect_size / scale)
else:
null_distr = scipy.stats.norm()
alt_distr = None if effect_size is None else scipy.stats.norm(
loc=effect_size / scale)
return scale, null_distr, alt_distr
def _find_boundary(
fn: Callable[[float | int], float],
init: float | int,
mult: float | int = 10,
) -> float:
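    # Expand the bracket geometrically until fn changes sign,
    # so that brentq has a valid interval.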
b = init
i = 0
while fn(b) > 0:
b *= mult
i += 1
if i == MAX_ITER:
raise RuntimeError(
"Cannot find parameter boundaries. "
"Maximum number of iterations is reached.",
)
return b
def _to_seq(x: N | Sequence[N]) -> Sequence[N]:
if isinstance(x, Sequence):
return x
return (x,)
class CustomMean(CustomRatioOfMeans): # noqa: D101
def __init__( # noqa: PLR0913
self,
value: str,
covariate: str | None = None,
*,
alternative: Literal["two-sided", "greater", "less"] | None = None,
confidence_level: float | None = None,
equal_var: bool | None = None,
use_t: bool | None = None,
alpha: float | None = None,
ratio: float | int | None = None,
power: float | None = None,
effect_size: float | int | Sequence[float | int] | None = None,
rel_effect_size: float | Sequence[float] | None = None,
n_obs: int | Sequence[int] | None = None,
) -> None:
"""Metric for the analysis of means.
Args:
value: Metric value column name.
covariate: Metric covariate column name.
alternative: Alternative hypothesis:
- `"two-sided"`: the means are unequal,
- `"greater"`: the mean in the treatment variant is greater than the mean
in the control variant,
- `"less"`: the mean in the treatment variant is less than the mean
in the control variant.
confidence_level: Confidence level for the confidence interval.
equal_var: Defines whether equal variance is assumed. If `True`,
pooled variance is used for the calculation of the standard error
of the difference between two means.
use_t: Defines whether to use the Student's t-distribution (`True`) or
the Normal distribution (`False`).
alpha: Significance level. Only for the analysis of power.
ratio: Ratio of the number of observations in the treatment
relative to the control. Only for the analysis of power.
power: Statistical power. Only for the analysis of power.
effect_size: Absolute effect size. Difference between the two means.
Only for the analysis of power.
rel_effect_size: Relative effect size. Difference between the two means,
divided by the control mean. Only for the analysis of power.
n_obs: Number of observations in the control and in the treatment together.
Only for the analysis of power.
Parameter defaults:
Defaults for parameters `alpha`, `alternative`, `confidence_level`,
`equal_var`, `n_obs`, `power`, `ratio`, and `use_t` can be changed
using the `config_context` and `set_context` functions.
See the [Global configuration](https://tea-tasting.e10v.me/api/config/)
reference for details.
References:
- [Deng, A., Knoblich, U., & Lu, J. (2018). Applying the Delta Method in Metric Analytics: A Practical Guide with Novel Ideas](https://alexdeng.github.io/public/files/kdd2018-dm.pdf).
- [Deng, A., Xu, Y., Kohavi, R., & Walker, T. (2013). Improving the Sensitivity of Online Controlled Experiments by Utilizing Pre-Experiment Data](https://exp-platform.com/Documents/2013-02-CUPED-ImprovingSensitivityOfControlledExperiments.pdf).
Examples:
```pycon
>>> import tea_tasting as tt
>>> experiment = tt.Experiment(
... orders_per_user=tt.Mean("orders"),
... revenue_per_user=tt.Mean("revenue"),
... )
>>> data = tt.make_users_data(seed=42)
>>> result = experiment.analyze(data)
>>> print(result)
metric control treatment rel_effect_size rel_effect_size_ci pvalue
orders_per_user 0.530 0.573 8.0% [-2.0%, 19%] 0.118
revenue_per_user 5.24 5.73 9.3% [-2.4%, 22%] 0.123
```
With CUPED:
```pycon
>>> experiment = tt.Experiment(
... orders_per_user=tt.Mean("orders", "orders_covariate"),
... revenue_per_user=tt.Mean("revenue", "revenue_covariate"),
... )
>>> data = tt.make_users_data(seed=42, covariates=True)
>>> result = experiment.analyze(data)
>>> print(result)
metric control treatment rel_effect_size rel_effect_size_ci pvalue
orders_per_user 0.523 0.581 11% [2.9%, 20%] 0.00733
revenue_per_user 5.12 5.85 14% [3.8%, 26%] 0.00674
```
Power analysis:
```pycon
>>> data = tt.make_users_data(
... seed=42,
... sessions_uplift=0,
... orders_uplift=0,
... revenue_uplift=0,
... covariates=True,
... )
>>> orders_per_user = tt.Mean(
... "orders",
... "orders_covariate",
... n_obs=(10_000, 20_000),
... )
>>> # Solve for effect size.
>>> print(orders_per_user.solve_power(data))
power effect_size rel_effect_size n_obs
80% 0.0374 7.2% 10000
80% 0.0264 5.1% 20000
>>> orders_per_user = tt.Mean(
... "orders",
... "orders_covariate",
... rel_effect_size=0.05,
... )
>>> # Solve for the total number of observations.
>>> print(orders_per_user.solve_power(data, "n_obs"))
power effect_size rel_effect_size n_obs
80% 0.0260 5.0% 20733
>>> orders_per_user = tt.Mean(
... "orders",
... "orders_covariate",
... rel_effect_size=0.1,
... )
>>> # Solve for power. Infer number of observations from the sample.
>>> print(orders_per_user.solve_power(data, "power"))
power effect_size rel_effect_size n_obs
69% 0.0519 10% 4000
```
""" # noqa: E501
super().__init__(
numer=value,
denom=None,
numer_covariate=covariate,
denom_covariate=None,
alternative=alternative,
confidence_level=confidence_level,
equal_var=equal_var,
use_t=use_t,
alpha=alpha,
ratio=ratio,
power=power,
effect_size=effect_size,
rel_effect_size=rel_effect_size,
n_obs=n_obs,
)
self.value = value
self.covariate = covariate
if __name__ == "__main__":
import tea_tasting as tt
data = tt.make_users_data(seed=42)
experiment = tt.Experiment(
sessions_per_user=CustomMean("sessions"),
orders_per_session=CustomRatioOfMeans("orders", "sessions"),
orders_per_user=CustomMean("orders"),
revenue_per_user=CustomMean("revenue"),
)
result = experiment.analyze(data)
print(result.to_string([
"metric",
"control",
"treatment",
"rel_effect_size",
"rel_effect_size_ci",
"pvalue",
"pvalue_rel",
"statistic",
"statistic_rel",
]))
````

Btw, my example confirms that in most cases, the statistic and the p-value for the relative effect size are very similar to those for the absolute effect size.
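A rough sketch of why that holds for small effects: with $\ln(1 + x) \approx x$ and the log-scale standard error $s_{\ln} \approx s / \bar{y}_c$ when the group means are close,

$$
t_{\text{rel}} = \frac{\ln(\bar{y}_t / \bar{y}_c)}{s_{\ln}}
\approx \frac{(\bar{y}_t - \bar{y}_c) / \bar{y}_c}{s / \bar{y}_c}
= \frac{\bar{y}_t - \bar{y}_c}{s} = t_{\text{abs}},
$$

where $\bar{y}_c$ and $\bar{y}_t$ are the control and treatment means and $s$ is the standard error of their difference.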
-
I just got some time to test this. It solved my problem. Thank you!
-
Hi there
I'm a big fan of the tool and have been using it for some weeks. Thank you for sharing this work!
I have a suggestion to enhance experiment results reporting. Currently, the library provides the confidence interval, the p-value, and the test statistic for the absolute change between the variation and the control, while for the relative change it provides only the confidence interval. I think it would be valuable to also include the p-value and the test statistic for the relative lift.
This would give users a more comprehensive view of the statistical significance of the experiment results. Would you be open to adding these metrics to the output?
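For illustration, here is roughly the report I have in mind, sketched with hypothetical field names (`pvalue_rel` and `statistic_rel`) that the current output does not provide:

```python
import tea_tasting as tt

data = tt.make_users_data(seed=42)
experiment = tt.Experiment(orders_per_user=tt.Mean("orders"))
result = experiment.analyze(data)
# "pvalue_rel" and "statistic_rel" are hypothetical fields for the
# requested additions; they do not exist in the current version.
print(result.to_string([
    "metric",
    "rel_effect_size",
    "rel_effect_size_ci",
    "pvalue",
    "pvalue_rel",
    "statistic_rel",
]))
```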
Thanks for considering this suggestion.