econml.validate.DRTester

class econml.validate.DRTester(*, model_regression, model_propensity, cate, cv: Union[int, List] = 5)[source]

Bases: object

Validation tests for CATE models. Includes the best linear predictor (BLP) test as in Chernozhukov et al. (2022), the calibration test in Dwivedi et al. (2020), and the QINI coefficient as in Radcliffe (2007).

Best Linear Predictor (BLP)

Runs an ordinary least squares (OLS) regression of the doubly robust (DR) outcomes on the DR outcome predictions from the CATE model (and a constant). If the CATE model captures true heterogeneity, the OLS coefficient on the CATE predictions should be positive and significantly different from 0.
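
For intuition, the following minimal sketch runs this kind of regression on simulated stand-in arrays using statsmodels; it illustrates the logic of the test, not DRTester's internal implementation, and the array names (cate_preds, dr_outcomes) are hypothetical.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    cate_preds = rng.normal(size=1000)                      # stand-in CATE predictions s(Z)
    dr_outcomes = 0.5 * cate_preds + rng.normal(size=1000)  # stand-in DR outcomes Y^DR

    # OLS of the DR outcomes on a constant and the CATE predictions
    blp = sm.OLS(dr_outcomes, sm.add_constant(cate_preds)).fit()
    coef, pval = blp.params[1], blp.pvalues[1]

    # Evidence of captured heterogeneity: coefficient positive and significantly different from 0
    print(f"BLP coefficient: {coef:.3f} (p-value: {pval:.3g})")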

Calibration

First, units are binned based on quantiles of the CATE predictions (s(Z)), where the quantile cut points are defined out of sample. Within each bin (k), the absolute difference between the mean CATE prediction and the mean DR outcome is calculated, along with the absolute difference between the mean CATE prediction and the overall ATE. These measures are then summed across bins, weighted by the probability that a unit falls in each bin.

\[
\begin{aligned}
\mathrm{Cal}_G &= \sum_k \pi(k)\, \big| \mathbb{E}[s(Z) \mid k] - \mathbb{E}[Y^{DR} \mid k] \big| \\
\mathrm{Cal}_O &= \sum_k \pi(k)\, \big| \mathbb{E}[s(Z) \mid k] - \mathbb{E}[Y^{DR}] \big|
\end{aligned}
\]

The calibration r-squared is then defined as

\[\mathcal{R}^2_C = 1 - \frac{\mathrm{Cal}_G}{\mathrm{Cal}_O}\]

The calibration r-squared metric is similar to the standard R-squared score in that it can take any value less than or equal to 1, with scores closer to 1 indicating a better-calibrated CATE model.
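
For intuition, the following minimal sketch computes Cal_G, Cal_O, and the calibration r-squared directly from the definitions above, using hypothetical arrays cate_preds and dr_outcomes; it is not DRTester's internal implementation (which, as noted, defines the quantile cut points out of sample).

    import numpy as np

    def calibration_r_squared(cate_preds, dr_outcomes, n_groups=4):
        # Bin units by quantiles of the CATE predictions
        cuts = np.quantile(cate_preds, np.linspace(0, 1, n_groups + 1))[1:-1]
        bins = np.digitize(cate_preds, cuts)         # bin index k for each unit
        ate = dr_outcomes.mean()                     # overall ATE (mean DR outcome)
        cal_g = cal_o = 0.0
        for k in range(n_groups):
            in_k = bins == k
            pi_k = in_k.mean()                       # probability a unit falls in bin k
            s_k = cate_preds[in_k].mean()            # mean CATE prediction in bin k
            y_k = dr_outcomes[in_k].mean()           # mean DR outcome in bin k
            cal_g += pi_k * abs(s_k - y_k)
            cal_o += pi_k * abs(s_k - ate)
        return 1 - cal_g / cal_o

    rng = np.random.default_rng(0)
    cate_preds = rng.normal(size=1000)
    dr_outcomes = cate_preds + rng.normal(size=1000)
    print(calibration_r_squared(cate_preds, dr_outcomes))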

Uplift Modeling

Units are ordered by predicted CATE values and a running measure of the average treatment effect in each cohort is kept as we progress through ranks. The resulting TOC curve can then be plotted and its integral calculated and used as a measure of true heterogeneity captured by the CATE model; this integral is referred to as the AUTOC (area under TOC). The QINI curve is a variant of this curve that also incorporates treatment probability; its integral is referred to as the QINI coefficient.

More formally, the TOC and QINI curves are given by the following functions:

\[
\begin{aligned}
\tau_{TOC}(q) &= \mathrm{Cov}\!\left( Y^{DR}(g,p),\; \frac{ \mathbb{1}\{\hat{\tau}(Z) \geq \hat{\mu}(q)\} }{ \mathrm{Pr}(\hat{\tau}(Z) \geq \hat{\mu}(q)) } \right) \\
\tau_{QINI}(q) &= \mathrm{Cov}\!\left( Y^{DR}(g,p),\; \mathbb{1}\{\hat{\tau}(Z) \geq \hat{\mu}(q)\} \right)
\end{aligned}
\]

Where \(q\) is the desired quantile, \(\hat{\mu}\) is the quantile function, and \(\hat{\tau}\) is the predicted CATE function. \(Y^{DR}(g,p)\) refers to the doubly robust outcome difference (relative to control) for the given observation.

The AUTOC and QINI coefficient are then given by:

\[
\begin{aligned}
\mathrm{AUTOC} &= \int_0^1 \tau_{TOC}(q)\, dq \\
\mathrm{QINI} &= \int_0^1 \tau_{QINI}(q)\, dq
\end{aligned}
\]
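
For intuition, the following minimal sketch evaluates both curves on a grid of quantiles for hypothetical arrays cate_preds and dr_outcomes and approximates the integrals numerically; it mirrors the definitions above rather than DRTester's internal implementation.

    import numpy as np

    def uplift_curve(cate_preds, dr_outcomes, quantiles, metric="qini"):
        values = []
        for q in quantiles:
            cutoff = np.quantile(cate_preds, q)              # quantile function mu_hat(q)
            above = (cate_preds >= cutoff).astype(float)     # indicator tau_hat(Z) >= mu_hat(q)
            if metric == "toc":
                above /= above.mean()                        # divide by Pr(tau_hat(Z) >= mu_hat(q))
            values.append(np.cov(dr_outcomes, above)[0, 1])  # covariance with the DR outcomes
        return np.array(values)

    rng = np.random.default_rng(0)
    cate_preds = rng.normal(size=2000)
    dr_outcomes = cate_preds + rng.normal(size=2000)

    qs = np.linspace(0.05, 0.95, 50)
    autoc = np.trapz(uplift_curve(cate_preds, dr_outcomes, qs, metric="toc"), qs)
    qini = np.trapz(uplift_curve(cate_preds, dr_outcomes, qs, metric="qini"), qs)
    print(f"AUTOC ~ {autoc:.3f}, QINI coefficient ~ {qini:.3f}")
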
Parameters
  • model_regression (estimator) – Nuisance model estimator used to fit the outcome to features. Must implement fit and predict methods.

  • model_propensity (estimator) – Nuisance model estimator used to fit the treatment assignment to features. Must implement a fit method and either a predict method (for binary treatment) or a predict_proba method (for multiple categorical treatments).

  • cate (estimator) – Fitted conditional average treatment effect (CATE) estimator to be validated.

  • cv (int or list, default 5) – Splitter used for cross-validation. Either an integer (the number of folds) or a list of indices indicating fold membership.
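
A hypothetical end-to-end sketch, using only the constructor and method signatures documented on this page; the nuisance model choices are illustrative, and cate_est, X_train, D_train, y_train, X_val, D_val, and y_val are assumed to already exist, with cate_est a CATE estimator fitted on the training sample.

    from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
    from econml.validate import DRTester

    tester = DRTester(
        model_regression=RandomForestRegressor(),   # outcome nuisance model (fit/predict)
        model_propensity=RandomForestClassifier(),  # treatment nuisance model (fit/predict_proba)
        cate=cate_est,                              # CATE estimator already fitted on the training sample
        cv=5,
    )

    # Fit nuisance models on the training sample, apply them to the validation sample,
    # and compute the doubly robust outcomes (providing training data enables evaluate_cal)
    tester.fit_nuisance(X_val, D_val, y_val, X_train, D_train, y_train)

    # Run the BLP, calibration, and uplift tests in one call
    results = tester.evaluate_all(Xval=X_val, Xtrain=X_train, n_groups=4)
    print(results)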

References

[Chernozhukov2022] V. Chernozhukov et al. Generic Machine Learning Inference on Heterogeneous Treatment Effects in Randomized Experiments. arXiv preprint arXiv:1712.04802, 2022. https://arxiv.org/abs/1712.04802

[Dwivedi2020] R. Dwivedi et al. Stable Discovery of Interpretable Subgroups via Calibration in Causal Studies. arXiv preprint arXiv:2008.10109, 2020. https://arxiv.org/abs/2008.10109

[Radcliffe2007] N. Radcliffe. Using control groups to target on predicted lift: Building and assessing uplift models. Direct Marketing Analytics Journal (2007), pages 14–21.

__init__(*, model_regression, model_propensity, cate, cv: Union[int, List] = 5)[source]

Methods

__init__(*, model_regression, ...[, cv])

evaluate_all([Xval, Xtrain, n_groups, ...])

Implements the best linear predictor (evaluate_blp), calibration (evaluate_cal), and uplift curve (evaluate_uplift) methods.

evaluate_blp([Xval, Xtrain])

Implements the best linear predictor (BLP) test as in [Chernozhukov2022].

evaluate_cal([Xval, Xtrain, n_groups])

Implements the calibration test as in [Dwivedi2020].

evaluate_uplift([Xval, Xtrain, percentiles, ...])

Calculates uplift curves and coefficients for the given model, where units are ordered by predicted CATE values and a running measure of the average treatment effect in each cohort is kept as we progress through ranks.

fit_nuisance(Xval, Dval, yval[, Xtrain, ...])

Generates nuisance predictions and calculates doubly robust (DR) outcomes either by (1) cross-fitting in the validation sample, or (2) fitting in the training sample and applying to the validation sample.

fit_nuisance_cv(X, D, y)

Generates nuisance function predictions using k-folds cross validation.

fit_nuisance_train(Xtrain, Dtrain, ytrain, Xval)

Fits nuisance models in training sample and applies to generate predictions in validation sample.

get_cate_preds(Xval[, Xtrain])

Generates predictions from fitted CATE model.

get_cv_splits(vars, T)

Generates splits for cross-validation, given a set of variables and treatment vector.

get_cv_splitter([random_state])

Generates splitter object for cross-validation.

evaluate_all(Xval: Optional[numpy.array] = None, Xtrain: Optional[numpy.array] = None, n_groups: int = 4, n_bootstrap: int = 1000) econml.validate.results.EvaluationResults[source]

Implements the best linear predictor (evaluate_blp), calibration (evaluate_cal), and uplift curve (evaluate_uplift) methods.

Parameters
  • Xval ((n_val x k) matrix, optional) – Validation sample features for the CATE model. If not specified, the fit_cate method must already have been called.

  • Xtrain ((n_train x k) matrix, optional) – Training sample features for the CATE model. If not specified, the fit_cate method must already have been called.

  • n_groups (integer, default 4) – Number of quantile-based groups used to calculate calibration score.

  • n_bootstrap (integer, default 1000) – Number of bootstrap samples to run when calculating uniform confidence bands for uplift curves.

Return type

EvaluationResults object summarizing the results of all tests

evaluate_blp(Xval: Optional[numpy.array] = None, Xtrain: Optional[numpy.array] = None) econml.validate.results.BLPEvaluationResults[source]

Implements the best linear predictor (BLP) test as in [Chernozhukov2022]. The fit_nuisance method must already have been called.

Parameters
  • Xval ((n_val x k) matrix, optional) – Validation sample features for the CATE model. If not specified, the fit_cate method must already have been called.

  • Xtrain ((n_train x k) matrix, optional) – Training sample features for the CATE model. If specified, the CATE model is fitted on the training sample and applied to Xval. If specified, then Xtrain, Dtrain, and ytrain must also have been passed to the fit_nuisance method (and vice versa).

Return type

BLPEvaluationResults object showing the results of the BLP test
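
For example, continuing the hypothetical workflow sketched earlier (with fit_nuisance already called):

    blp_res = tester.evaluate_blp(Xval=X_val, Xtrain=X_train)
    print(blp_res)  # BLPEvaluationResults for the fitted CATE model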

evaluate_cal(Xval: Optional[numpy.array] = None, Xtrain: Optional[numpy.array] = None, n_groups: int = 4) econml.validate.results.CalibrationEvaluationResults[source]

Implements the calibration test as in [Dwivedi2020].

Parameters
  • Xval ((n_val x k) matrix, optional) – Validation sample features for the CATE model. If not specified, the fit_cate method must already have been called.

  • Xtrain ((n_train x k) matrix, optional) – Training sample features for the CATE model. If not specified, the fit_cate method must already have been called (with Xtrain specified).

  • n_groups (integer, default 4) – Number of quantile-based groups used to calculate calibration score.

Return type

CalibrationEvaluationResults object showing the results of the calibration test
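
For example, continuing the hypothetical workflow sketched earlier; note that evaluate_cal requires training data to have been passed to fit_nuisance:

    cal_res = tester.evaluate_cal(Xval=X_val, Xtrain=X_train, n_groups=4)
    print(cal_res)  # CalibrationEvaluationResults for the fitted CATE model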

evaluate_uplift(Xval: Optional[numpy.array] = None, Xtrain: Optional[numpy.array] = None, percentiles: numpy.array = np.linspace(5, 95, 50), metric: str = 'qini', n_bootstrap: int = 1000) econml.validate.results.UpliftEvaluationResults[source]

Calculates uplift curves and coefficients for the given model: units are ordered by predicted CATE values and a running measure of the average treatment effect in each cohort is kept as we progress through the ranking. The uplift coefficient is the area under the resulting curve; a value of 0 corresponds to a model that assigns CATE values at random. All calculations are performed on the validation dataset, using the training set as input.

Parameters
  • Xval ((n_val x k) matrix, optional) – Validation sample features for the CATE model. If not specified, the fit_cate method must already have been called.

  • Xtrain ((n_train x k) matrix, optional) – Training sample features for the CATE model. If specified, the CATE model is fitted on the training sample and applied to Xval. If specified, then Xtrain, Dtrain, and ytrain must also have been passed to the fit_nuisance method (and vice versa).

  • percentiles (one-dimensional array, default np.linspace(5, 95, 50)) – Array of percentiles at which the uplift curve is evaluated. Defaults to 50 evenly spaced points from the 5th to the 95th percentile.

  • metric (string, default ‘qini’) – Which type of uplift curve to evaluate. Must be one of [‘toc’, ‘qini’]

  • n_bootstrap (integer, default 1000) – Number of bootstrap samples to run when calculating uniform confidence bands.

Return type

UpliftEvaluationResults object showing the fitted results
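
For example, a hypothetical sketch using only the parameters documented above, continuing the earlier workflow (with fit_nuisance already called):

    import numpy as np

    # QINI coefficient on the default-style grid (50 points from the 5th to the 95th percentile)
    qini_res = tester.evaluate_uplift(Xval=X_val, Xtrain=X_train, metric='qini')

    # AUTOC on a coarser grid, with more bootstrap draws for the uniform confidence bands
    toc_res = tester.evaluate_uplift(
        Xval=X_val,
        Xtrain=X_train,
        percentiles=np.linspace(10, 90, 17),
        metric='toc',
        n_bootstrap=2000,
    )
    print(qini_res)
    print(toc_res)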

fit_nuisance(Xval: numpy.array, Dval: numpy.array, yval: numpy.array, Xtrain: Optional[numpy.array] = None, Dtrain: Optional[numpy.array] = None, ytrain: Optional[numpy.array] = None)[source]

Generates nuisance predictions and calculates doubly robust (DR) outcomes either by (1) cross-fitting in the validation sample, or (2) fitting in the training sample and applying to the validation sample. If Xtrain, Dtrain, and ytrain are all provided, option (2) is used; otherwise, option (1) is used. To use the evaluate_cal method, Xtrain, Dtrain, and ytrain must all be specified.

Parameters
  • Xval ((n_val x k) matrix or vector of length n_val) – Features used in the nuisance models for the validation sample

  • Dval (vector of length n_val) – Treatment assignment of the validation sample. Control status must be the minimum value. It is recommended that the control status be 0 and all other treatments be integers starting at 1.

  • yval (vector of length n_val) – Outcomes for the validation sample

  • Xtrain ((n_train x k) matrix or vector of length n_train, optional) – Features used in the nuisance models for the training sample

  • Dtrain (vector of length n_train, optional) – Treatment assignment of the training sample. Control status must be the minimum value. It is recommended that the control status be 0 and all other treatments be integers starting at 1.

  • ytrain (vector of length n_train, optional) – Outcomes for the training sample

Returns

self, with added attributes for the validation treatments (Dval), treatment values (tmts), the number of treatments excluding control (n_treat), a boolean flag for whether training data was provided (fit_on_train), the doubly robust outcome values for the validation set (dr_val), and the DR ATE value (ate_val). If training data is provided, also adds attributes for the doubly robust outcomes for the training set (dr_train) and the training treatments (Dtrain).
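
The two calling modes described above, as a hypothetical sketch (the array names are placeholders):

    # Option (1): cross-fit the nuisance models within the validation sample only
    # (evaluate_cal will not be available in this mode)
    tester.fit_nuisance(X_val, D_val, y_val)

    # Option (2): fit the nuisance models on the training sample and apply them to the
    # validation sample (required if evaluate_cal will be used)
    tester.fit_nuisance(X_val, D_val, y_val, Xtrain=X_train, Dtrain=D_train, ytrain=y_train)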

fit_nuisance_cv(X: numpy.array, D: numpy.array, y: numpy.array) Tuple[numpy.array, numpy.array][source]

Generates nuisance function predictions using k-folds cross validation.

Parameters
  • X ((n x k) matrix) – Features used to predict treatment/outcome

  • D (array of length n) – Treatment assignments. Should have integer values with the lowest value corresponding to the control treatment. It is recommended to have the control take value 0 and all other treatments be integers starting at 1.

  • y (array of length n) – Outcomes

Returns

Two (n x (n_treatment + 1)) arrays corresponding to the predicted outcomes under each treatment status and the predicted treatment probabilities, respectively.

fit_nuisance_train(Xtrain: numpy.array, Dtrain: numpy.array, ytrain: numpy.array, Xval: numpy.array) Tuple[numpy.array, numpy.array][source]

Fits nuisance models in training sample and applies to generate predictions in validation sample.

Parameters
  • Xtrain ((n_train x k) matrix) – Training sample features used to predict both treatment status and outcomes

  • Dtrain (array of length n_train) – Training sample treatment assignments. Should have integer values with the lowest value corresponding to the control treatment. It is recommended to have the control take value 0 and all other treatments be integers starting at 1.

  • ytrain (array of length n_train) – Outcomes for training sample

  • Xval ((n_val x k) matrix) – Validation sample features used to predict both treatment status and outcomes

Returns

Two (n_val x (n_treatment + 1)) arrays corresponding to the predicted outcomes under each treatment status and the predicted treatment probabilities, respectively, both evaluated on the validation set.

get_cate_preds(Xval: numpy.array, Xtrain: Optional[numpy.array] = None)[source]

Generates predictions from fitted CATE model. If Xtrain is None, then the predictions are generated using k-folds cross-validation on the validation set. If Xtrain is specified, then the CATE is assumed to have been fitted on the training sample (where the DR outcomes were generated using k-folds CV), and then applied to the validation sample.

Parameters
  • Xval ((n_val x k) matrix) – Validation set features used to predict (and potentially fit) DR outcomes in the CATE model

  • Xtrain ((n_train x k) matrix, optional) – Training set features used to fit the CATE model

Returns

None, but adds attribute cate_preds_val_ for predicted CATE values on the validation set and, if training data is provided, attribute cate_preds_train_ for predicted CATE values on the training set.

get_cv_splits(vars: numpy.array, T: numpy.array)[source]

Generates splits for cross-validation, given a set of variables and treatment vector.

Parameters
  • vars ((n x k) matrix or vector of length n) – Features used in nuisance models

  • T (vector of length n) – Treatment assignment vector. Control status must be the minimum value. It is recommended that the control status be 0 and all other treatments be integers starting at 1.

Return type

list of folds of the data, on which to run cross-validation

get_cv_splitter(random_state: int = 123)[source]

Generates splitter object for cross-validation. Checks if the cv object passed at initialization is a splitting mechanism or a number of folds and returns appropriately modified object for use in downstream cross-validation.

Parameters

random_state (integer, default 123) – Seed for the splitter.

Return type

None, but adds attribute ‘cv_splitter’ containing the constructed splitter object.