# Problem Setup and API Design

## Potential Outcomes Formulation

We begin by formulating the problem in potential outcomes terminology. We then provide an equivalent formulation in structural equations notation for readers more familiar with that framework.

The methods developed in our library tackle the following general problem: let \(Y(\vec{t})\) denote the random variable that corresponds to the value of the outcome of interest if we were to treat a sample with treatment \(\vec{t} \in \T\). Given two vectors of treatments \(\vec{t}_0, \vec{t}_1 \in \T\), a vector of co-variates \(\vec{x}\) and a random vector of potential outcomes \(Y(\vec{t})\), we want to estimate the quantity:
\[
\tau(\vec{t}_0, \vec{t}_1, \vec{x}) = \E\left[Y(\vec{t}_1) - Y(\vec{t}_0) \mid X = \vec{x}\right]
\]

We will refer to the latter quantity as the *heterogeneous treatment effect* of going from treatment
\(\vec{t}_0\) to treatment \(\vec{t}_1\) conditional on observables \(\vec{x}\).
If treatments are continuous, then one might also be interested in a local effect around a treatment point.
The latter translates to estimating a local gradient around a treatment vector conditional on observables:
\[
\partial\tau(\vec{t}, \vec{x}) = \E\left[\nabla_{\vec{t}}\, Y(\vec{t}) \mid X = \vec{x}\right]
\]

We will refer to the latter as the *heterogeneous marginal effect*.[^1]

We assume we have data that are generated from some collection policy. In particular, we assume that we have data of the form:
\(\{Y_i(T_i), T_i, X_i, W_i, Z_i\}\), where \(Y_i(T_i)\) is the observed outcome for the chosen treatment,
\(T_i\) is the treatment, \(X_i\) are the co-variates used for heterogeneity,
\(W_i\) are other observable co-variates that we believe are affecting the potential outcome \(Y_i(T_i)\)
and potentially also the treatment \(T_i\); and \(Z_i\) are variables that affect
the treatment \(T_i\) but do not directly affect the potential outcome.
We will refer to variables \(W_i\) as *controls* and variables \(Z_i\) as *instruments*.
The variables \(X_i\) can also be thought of as *control* variables, but they are special in the sense that
they are a subset of the controls with respect to which we want to measure treatment effect heterogeneity.
We will refer to them as *features*.

Finally, sometimes we might be interested not only in the effect itself but also in the actual *counterfactual prediction*, i.e. in estimating the quantity:
\[
\mu(\vec{t}, \vec{x}) = \E\left[Y(\vec{t}) \mid X = \vec{x}\right]
\]

Our package does not offer direct support for counterfactual prediction. However, for most of our estimators (the ones assuming a linear-in-treatment model), a counterfactual prediction can easily be constructed by combining any baseline predictive model with our causal effect model: train any machine learning model \(b(\vec{t}, \vec{x})\) to solve the regression/classification problem \(\E[Y \mid T=\vec{t}, X=\vec{x}]\), and then set \(\mu(\vec{t}, \vec{x}) = \tau(T, \vec{t}, \vec{x}) + b(T, \vec{x})\), where \(T\) is either the observed treatment for that sample under the observational policy or the treatment that the observational policy would have assigned to that sample. These auxiliary ML models can be trained with any machine learning package outside of EconML.
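To make this recipe concrete, here is a minimal, self-contained sketch in which both the baseline model \(b\) and the effect model \(\tau\) are hypothetical toy functions (following the \(\tau(\vec{t}_0, \vec{t}_1, \vec{x})\) convention of going from the first to the second treatment), not the outputs of any EconML estimator:

```python
# Hypothetical stand-ins: in practice b would be any fitted ML regressor for
# E[Y | T, X] and tau would come from a fitted CATE estimator; here both are
# toy linear functions chosen purely for illustration.
def b(t, x):
    return 2.0 * t + 3.0 * x              # toy baseline predictor b(t, x)

def tau(t0, t1, x):
    return 2.0 * (t1 - t0)                # toy CATE of moving from t0 to t1

def mu(t, t_obs, x):
    # counterfactual prediction: effect of moving from the observed treatment
    # t_obs to the target treatment t, added to the baseline prediction at t_obs
    return tau(t_obs, t, x) + b(t_obs, x)

print(mu(1.0, 0.0, 0.5))                  # -> 3.5
```

The composition only relies on the effect model being accurate for contrasts against the observed treatment; the baseline model absorbs everything else.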

## Structural Equation Formulation

We can equivalently describe the data and the quantities of interest by means of structural equations. In particular, suppose that we observe i.i.d. samples \(\{Y_i, T_i, X_i, W_i, Z_i\}\) from some joint distribution, and that we assume the following structural equation model of the world:
\[
Y = g(T, X, W, \epsilon)
\]
\[
T = f(Z, X, W, \eta)
\]

where \(\epsilon\) and \(\eta\) are *noise* random variables that are independent of \(X, W, Z\) but could be potentially correlated with each other.
The target quantity that we want to estimate can then be expressed as:
\[
\tau(\vec{t}_0, \vec{t}_1, \vec{x}) = \E\left[g(\vec{t}_1, X, W, \epsilon) - g(\vec{t}_0, X, W, \epsilon) \mid X = \vec{x}\right]
\]
\[
\partial\tau(\vec{t}, \vec{x}) = \E\left[\nabla_{\vec{t}}\, g(\vec{t}, X, W, \epsilon) \mid X = \vec{x}\right]
\]

where in these expectations, the random variables \(W, \epsilon\) are taken from the same distribution as the one that generated the data. In other words, there is a one-to-one correspondence between the potential outcomes formulation and the structural equations formulation in that the random variable \(Y(t)\) is equal to the random variable \(g(t, X, W, \epsilon)\), where \(X, W, \epsilon\) is drawn from the distribution that generated each sample in the data set.
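To see this correspondence concretely, the following toy simulation uses an assumed structural function \(g\) (not part of the library, chosen so the effect is easy to compute by hand) and checks that the potential-outcomes contrast is recovered: both arms share the same draw of \((W, \epsilon)\), so the noise cancels in the difference.

```python
import numpy as np

# Assumed toy structural function g(t, x, w, e) = t**2 + 2*t*x + w + e
rng = np.random.default_rng(0)
n = 100_000
x = 0.5                                   # condition on the feature value X = 0.5
W = rng.normal(size=n)
eps = rng.normal(size=n)

def g(t):
    return t**2 + 2.0 * t * x + W + eps

# Potential-outcomes contrast Y(t1) - Y(t0): both arms share the same draw of
# (W, eps), so the noise cancels sample by sample.
t0, t1 = 0.0, 1.0
tau_mc = np.mean(g(t1) - g(t0))           # Monte Carlo estimate of τ(t0, t1, x)
tau_exact = (t1**2 - t0**2) + 2.0 * x * (t1 - t0)
err = abs(tau_mc - tau_exact)             # ≈ 0: noise cancels in the contrast
```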

## API of Conditional Average Treatment Effect Package

The base class of all the methods in our API has the following signature:

```
class BaseCateEstimator:

    def fit(self, Y, T, X=None, W=None, Z=None, inference=None):
        ''' Estimates the counterfactual model from data, i.e. estimates functions
        τ(·, ·, ·), ∂τ(·, ·) and μ(·, ·).

        Parameters:
            Y: (n × d_y) matrix of outcomes for each sample
            T: (n × d_t) matrix of treatments for each sample
            X: optional (n × d_x) matrix of features for each sample
            W: optional (n × d_w) matrix of controls for each sample
            Z: optional (n × d_z) matrix of instruments for each sample
            inference: optional string, `Inference` instance, or None
                Method for performing inference. All estimators support 'bootstrap'
                (or an instance of `BootstrapInference`); some support other
                methods as well.
        '''

    def effect(self, X=None, *, T0, T1):
        ''' Calculates the heterogeneous treatment effect τ(·, ·, ·) between two
        treatment points, conditional on a vector of features, on a set of m test
        samples {T0_i, T1_i, X_i}.

        Parameters:
            T0: (m × d_t) matrix of base treatments for each sample
            T1: (m × d_t) matrix of target treatments for each sample
            X: optional (m × d_x) matrix of features for each sample

        Returns:
            tau: (m × d_y) matrix of heterogeneous treatment effects on each
                outcome for each sample
        '''

    def marginal_effect(self, T, X=None):
        ''' Calculates the heterogeneous marginal effect ∂τ(·, ·) around a base
        treatment point, conditional on a vector of features, on a set of m test
        samples {T_i, X_i}.

        Parameters:
            T: (m × d_t) matrix of base treatments for each sample
            X: optional (m × d_x) matrix of features for each sample

        Returns:
            grad_tau: (m × d_y × d_t) matrix of heterogeneous marginal effects on
                each outcome for each sample
        '''

    def effect_interval(self, X=None, *, T0=0, T1=1, alpha=0.05):
        ''' Confidence intervals for the quantities τ(·, ·, ·) produced by the
        model. Available only when `inference` is not None when calling fit.

        Parameters:
            X: optional (m × d_x) matrix of features for each sample
            T0: optional (m × d_t) matrix of base treatments for each sample
            T1: optional (m × d_t) matrix of target treatments for each sample
            alpha: optional float in [0, 1] for a (1-alpha) confidence level

        Returns:
            lower, upper: tuple of the lower and upper bounds of the confidence
                interval for each quantity
        '''

    def marginal_effect_interval(self, T, X=None, *, alpha=0.05):
        ''' Confidence intervals for the quantities ∂τ(·, ·) produced by the
        model. Available only when `inference` is not None when calling fit.

        Parameters:
            T: (m × d_t) matrix of base treatments for each sample
            X: optional (m × d_x) matrix of features for each sample
            alpha: optional float in [0, 1] for a (1-alpha) confidence level

        Returns:
            lower, upper: tuple of the lower and upper bounds of the confidence
                interval for each quantity
        '''
```

## Linear in Treatment CATE Estimators

In many settings, we might want to make further structural assumptions on the form of the data generating process. One particularly prevalent assumption is that the outcome \(y\) is linear in the treatment vector and therefore that the marginal effect is constant across treatments, i.e.:

\[
Y = H(X, W) \cdot T + g(X, W, \epsilon), \qquad T = f(X, W, Z, \eta)
\]

where \(\epsilon, \eta\) are exogenous noise terms and \(H(X, W)\) is a \(d_y \times d_t\) matrix of effect coefficients. Under such a linear response assumption we observe that the CATE and marginal CATE take a special form of:

\[
\tau(\vec{t}_0, \vec{t}_1, \vec{x}) = \E\left[H(\vec{x}, W) \mid X = \vec{x}\right] \cdot (\vec{t}_1 - \vec{t}_0), \qquad \partial\tau(\vec{t}, \vec{x}) = \E\left[H(\vec{x}, W) \mid X = \vec{x}\right]
\]

Hence, the marginal CATE is independent of \(\vec{t}\). In these settings, we will denote with \(\theta(\vec{x})\) the constant marginal CATE, i.e.

\[
\theta(\vec{x}) = \E\left[H(\vec{x}, W) \mid X = \vec{x}\right]
\]

Given the prevalence of linear treatment effect assumptions, we will create a generic `LinearCateEstimator`, which will support a method that returns the constant marginal CATE and the constant marginal CATE interval at any target feature vector \(\vec{x}\).

```
import numpy as np

class LinearCateEstimator(BaseCateEstimator):

    def const_marginal_effect(self, X=None):
        ''' Calculates the constant marginal CATE θ(·), conditional on a vector of
        features, on a set of m test samples {X_i}.

        Parameters:
            X: optional (m × d_x) matrix of features for each sample

        Returns:
            theta: (m × d_y × d_t) matrix of constant marginal CATE of each
                treatment on each outcome for each sample
        '''

    def const_marginal_effect_interval(self, X=None, *, alpha=0.05):
        ''' Confidence intervals for the quantities θ(·) produced by the model.
        Available only when `inference` is not None when calling fit.

        Parameters:
            X: optional (m × d_x) matrix of features for each sample
            alpha: optional float in [0, 1] for a (1-alpha) confidence level

        Returns:
            lower, upper: tuple of the lower and upper bounds of the confidence
                interval for each quantity
        '''

    def effect(self, X=None, *, T0, T1):
        # contract θ(X), of shape (m × d_y × d_t), with (T1 - T0), of shape
        # (m × d_t), over the treatment dimension
        return np.einsum('myt,mt->my', self.const_marginal_effect(X), T1 - T0)

    def marginal_effect(self, T, X=None):
        return self.const_marginal_effect(X)

    def marginal_effect_interval(self, T, X=None, *, alpha=0.05):
        return self.const_marginal_effect_interval(X, alpha=alpha)
```
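To make the shape bookkeeping in `effect` concrete: \(\theta(X)\) carries one \(d_y \times d_t\) matrix per sample, \(T1 - T0\) one treatment vector per sample, and the effect is a per-sample contraction over the treatment dimension. A standalone numpy sketch with random placeholder values (not produced by any estimator):

```python
import numpy as np

# Placeholder values, not produced by any estimator: θ(X) with one (d_y × d_t)
# effect matrix per sample, and base/target treatment matrices.
m, d_y, d_t = 4, 2, 3
rng = np.random.default_rng(0)
theta = rng.normal(size=(m, d_y, d_t))
T0 = np.zeros((m, d_t))
T1 = np.ones((m, d_t))

# effect_i = θ(X_i) @ (T1_i - T0_i): contract over the treatment dimension
effect = np.einsum('myt,mt->my', theta, T1 - T0)
print(effect.shape)                       # -> (4, 2)
```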

## Example Use of API

Let us walk through a simple example of what one can achieve with this API, irrespective of the actual estimation method being used.

Let us consider a hypothetical data generating process (DGP) governed by the following equations:
\[
T = \langle \alpha, W \rangle + \langle \beta, Z \rangle + \eta
\]
\[
Y = \gamma\, T^2 + \delta\, T\, X + \langle \zeta, W \rangle + \epsilon
\]

Suppose that we have \(n\) samples from this DGP. For instance, we could create these samples with the following code:

```
import numpy as np
# Instance parameters
n_controls = 100
n_instruments = 1
n_features = 1
n_treatments = 1
alpha = np.random.normal(size=(n_controls, 1))
beta = np.random.normal(size=(n_instruments, 1))
gamma = np.random.normal(size=(n_treatments, 1))
delta = np.random.normal(size=(n_treatments, 1))
zeta = np.random.normal(size=(n_controls, 1))
n_samples = 1000
W = np.random.normal(size=(n_samples, n_controls))
Z = np.random.normal(size=(n_samples, n_instruments))
X = np.random.normal(size=(n_samples, n_features))
eta = np.random.normal(size=(n_samples, n_treatments))
epsilon = np.random.normal(size=(n_samples, 1))
T = np.dot(W, alpha) + np.dot(Z, beta) + eta
y = np.dot(T**2, gamma) + np.dot(np.multiply(T, X), delta) + np.dot(W, zeta) + epsilon
```

We can then fit a counterfactual model to the data. To obtain confidence intervals for our CATE, we can pass an additional `inference` argument to `fit`; bootstrap intervals are supported by all estimators. Using `BaseCateEstimator` here as a stand-in for any concrete estimator that implements the interface, we can run the following:

```
# Fit counterfactual model
cfest = BaseCateEstimator()
cfest.fit(y, T, X=X, W=W, Z=Z, inference='bootstrap')
```

Suppose now that we wanted to estimate the conditional average treatment effect of going from treatment 0 to treatment 1, for every point \(X_i\) in the training data. This should be an estimate of the quantities \(\gamma + \delta X_i\). We can also get confidence intervals for the CATE. We can run the following:

```
X_test = X
# Estimate heterogeneous treatment effects from going from treatment 0 to treatment 1
T0_test = np.zeros((X_test.shape[0], n_treatments))
T1_test = np.ones((X_test.shape[0], n_treatments))
hetero_te = cfest.effect(X_test, T0=T0_test, T1=T1_test)
hetero_te_interval = cfest.effect_interval(X_test, T0=T0_test, T1=T1_test, alpha=0.1)
```

Suppose now that we wanted to estimate the conditional marginal effect at treatment 0, for every point \(X_i\). This should be an estimate of the quantities \(\delta X_i\). We can also get confidence intervals for the marginal CATE. We can run the following:

```
# Estimate heterogeneous marginal effects around treatment 0
T_test = np.zeros((X_test.shape[0], n_treatments))
hetero_marginal_te = cfest.marginal_effect(T_test, X_test)
hetero_marginal_te_interval = cfest.marginal_effect_interval(T_test, X_test, alpha=0.1)
```

Suppose we wanted to project these estimated quantities onto sub-populations, e.g. the average treatment effect over the whole population, or the average treatment effect on the sub-population where \(X_i \geq 1/2\). We can achieve this as follows:

```
# Estimate average treatment effects over a population of x's
T0_test = np.zeros((X_test.shape[0], n_treatments))
T1_test = np.ones((X_test.shape[0], n_treatments))
# average treatment effect
ate = np.mean(cfest.effect(X_test, T0=T0_test, T1=T1_test))  # returns estimate of γ + δ 𝔼[x]
# average treatment effect on the sub-population with x > 1/2;
# returns estimate of γ + δ 𝔼[x | x > 1/2]
mask = X_test[:, 0] > 1/2  # boolean row mask, so the subsetted arrays stay 2-D
cate = np.mean(cfest.effect(X_test[mask], T0=T0_test[mask], T1=T1_test[mask]))
```

More importantly, suppose we wanted to understand what would be the overall expected change in response if we were to follow some treatment policy (e.g. treat everyone with \(X_i\geq 0\)). This can also be easily done as follows:

```
# Estimate expected lift of treatment policy π(x) = 𝟙{x > 0} over the existing policy
Pi0_test = T
Pi1_test = (X_test > 0) * 1.
# returns estimate of γ/2 + δ/√(2π) - γ 𝔼[T²]
policy_effect = np.mean(cfest.effect(X_test, T0=Pi0_test, T1=Pi1_test))
# Estimate expected lift of treatment policy π(x) = 𝟙{x > 0} over a baseline of no treatment
Pi0_test = np.zeros((X_test.shape[0], n_treatments))
Pi1_test = (X_test > 0) * 1.
# returns estimate of γ/2 + δ/√(2π)
policy_effect = np.mean(cfest.effect(X_test, T0=Pi0_test, T1=Pi1_test))
```

## Footnotes

[^1]: One can always approximate the latter with the former and vice versa, i.e. \(\partial_i \tau(\vec{t},\vec{x}) \approx \tau(\vec{t}, \vec{t} + \delta \vec{e}_i, \vec{x})/\delta\) for some small enough \(\delta\), and similarly, \(\tau(\vec{t}_0, \vec{t}_1, \vec{x}) = \int_{0}^{1} \partial\tau(\vec{t}_0 + q (\vec{t}_1 - \vec{t}_0), \vec{x}) \cdot (\vec{t}_1 - \vec{t}_0)\, dq\). However, in many settings more direct methods that make use of the structure can simplify these generic transformations.
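The finite-difference direction of this approximation is easy to check numerically. The sketch below uses the toy quadratic response from the example DGP, with made-up coefficient values for \(\gamma\) and \(\delta\):

```python
# Toy quadratic response, with made-up coefficients gamma and delta, whose CATE
# is tau(t0, t1, x) = gamma*(t1**2 - t0**2) + delta*x*(t1 - t0)
gamma, delta = 1.5, 0.5

def tau(t0, t1, x):
    return gamma * (t1**2 - t0**2) + delta * x * (t1 - t0)

def marginal_tau(t, x):
    return 2.0 * gamma * t + delta * x    # analytic derivative of the response

t, x, d = 0.7, 0.3, 1e-6
approx = tau(t, t + d, x) / d             # finite-difference approximation
exact = marginal_tau(t, x)                # difference is O(gamma * d)
```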