Estimating Optimal Treatment Regimes using Pairwise Comparison



ABSTRACT
A treatment regimen is a decision rule that can be used to guide clinicians in connecting patients' information to a feasible treatment strategy. Estimating optimal treatment regimens has gained increasing attention in the statistics field.
New models based on machine learning approaches, such as outcome weighted learning (OWL) and entropy learning (EL), were suggested to overcome the computational challenge of doubly robust estimation due to nonconvexity and non-differentiability. However, substituting the 0-1 loss by a smooth loss leads to a lack of estimation consistency within the class under consideration. To solve this problem, concordance-assisted learning (CAL) was proposed to estimate the optimal treatment regime through pairwise comparison. However, the concordance function is discontinuous, and the optimization is computationally expensive. In this study, we propose a new method that estimates parameters using a weighted logistic regression based on the pairwise comparison between any two individuals. We obtain the standard errors of the estimated coefficients using a bootstrap approach. Simulation results under different scenarios and an application to the ACTG175 data both demonstrate that the proposed pairwise comparison learning (PCL) method can estimate optimal treatment regimens successfully and outperforms existing approaches.

CHAPTER 1 Introduction
Treatments or interventions for a given disease or disorder often entail a sequence of decisions depending on patients' characteristics and responses. A treatment regimen involves one single decision time point and is the simplest case of a dynamic treatment regimen; estimation of a treatment regimen is the foundation for extending to dynamic treatment regimens. A treatment regimen is a decision rule that can be used to guide clinicians in connecting patients' information to a feasible treatment strategy. Estimating optimal treatment regimes using data from clinical trials or observational studies has gained increasing attention in the statistics and public health fields.
The two most widely used learning approaches at present are regression-based learning (Watkins and Dayan, 1992; Murphy and Edu, 2005; Qian and Murphy, 2011; Moodie et al., 2014; Laber et al., 2014) and classification-based learning (Zhang et al., 2012; Zhao et al., 2012; Zhang et al., 2015; Jiang et al., 2020). Watkins and Dayan (1992) proposed Q-learning, which models the mean outcome conditional on covariates and treatment. A-learning models the regret function, which quantifies the benefit lost if we choose a decision other than the optimal regimen (Murphy, 2003). Murphy and Edu (2005) developed a theory for the generalization error of Q-learning. Regression-based methods, such as Q-learning, adopt a pre-specified regression model to estimate the conditional mean of the outcome given covariates and treatment, so if the model is misspecified, the estimated regime may be suboptimal. To alleviate the impact of model misspecification, Qian and Murphy (2011) proposed to use penalized Q-learning with a large number of pre-specified functions of covariates. Classification-based learning, on the other hand, constructs models of the contrast function and estimates the contrast function directly by maximizing the estimating function for the mean outcome, which makes it robust to misspecification of the mean outcome model (Murphy, 2003). Zhang et al. (2012) introduced the inverse probability weighted estimator (IPWE) and the augmented inverse probability weighted estimator (AIPWE) for the mean outcome value function. Following that, the optimal treatment regimen may be derived by maximizing the value function over all given regimens. When the propensity score or outcome model is misspecified, the double robustness of AIPWE can safeguard the estimated regimens (Zhang et al., 2012). Lunceford and Davidian's (2004) comparative simulation experiments also established AIPWE's double robustness, that is, the estimator stays consistent as long as at least one of the propensity score or regression models is accurately specified (Bang and Robins, 2005). In particular, when both the propensity score and the regression models are correctly specified, the AIPWE is semiparametric efficient. Another classification-based method is concordance-assisted learning (CAL). This approach addresses the risk of misspecification of the propensity score by considering a concordance function, which compares the outcome gain of one treatment over another between any two individuals (Fan et al., 2017).
The prescriptive index, which is a set of covariate weights in a linear decision rule, is estimated by maximizing this concordance function. Then the threshold estimator, i.e., the intercept of the linear decision rule, can be obtained based on the estimated prescriptive index and the IPWE/AIPWE. Thus, the optimal linear decision rule can be specified with the estimated coefficients and intercept. In this case, estimates derived by CAL are more robust and resistant to a constant shift in the outcome value.
However, these classification-based approaches have some limitations. First, the direct optimization of the IPWE/AIPWE and the concordance function is nonconvex and nondifferentiable, and thus can be computationally challenging, especially when the dimension of the predictors is high. The second limitation is the lack of easy-to-implement approaches for statistical inference on these estimators. Even the widely used bootstrap method was demonstrated to be inapplicable to the estimation of IPWE/AIPWE due to the nonsmoothness of the computation procedure (Abrevaya and Huang, 2005). Chakraborty et al. (2013) suggested that an adaptive m-out-of-n bootstrap method should be used to obtain reliable inferences for the treatment regimens.
Therefore, several studies attempt to replace the 0-1 loss function in the classification-based methods with convex or continuously differentiable surrogate loss functions so that the optimization of parameters becomes easier (Zhao et al., 2012, 2015; Jiang et al., 2020). Zhao et al. (2012, 2015) presented the estimation of optimal treatment in a weighted classification framework and introduced outcome weighted learning (OWL), where the expected outcome is optimized through the support vector machine (SVM) via the hinge loss. The hinge loss is a specific type of cost function in which the cost calculation involves a margin or distance from the classification boundary; it is convex and easy to optimize using conventional methods. To give a more straightforward inference procedure for the parameters, Jiang et al. (2020) proposed an entropy learning method to substitute the hinge loss with a weighted entropy loss. The entropy loss is also a special case of smooth surrogate loss, enabling us to optimize the estimators through logistic regression. However, entropy learning (EL) is still prone to misspecification of the propensity score. Hence, more recently, Liang et al. (2017) adapted the idea of OWL and proposed the sparse concordance-assisted learning algorithm (SCAL) by substituting the 0-1 loss with the hinge loss in CAL. Under various simulation scenarios, SCAL achieved higher accuracy in estimating optimal treatment regimens than OWL and CAL.
In this study, we combine the attractive properties of EL and CAL and propose a new method for estimating the optimal treatment regime. An optimal treatment decision can be found through pairwise comparison (Fan et al., 2017; Liang et al., 2017). If a patient receives a greater benefit with one treatment than with another, and this difference in benefit is greater than that of the other patients, then this patient is more likely to be assigned to the first treatment option by the regime, which is a necessary component of any effective treatment regimen. The benefits/outcomes of the two treatments for all patients can be estimated using the AIPWE of the value function. Therefore, we propose to conduct the optimization of the estimated parameters using a weighted logistic regression, with the comparison between any pair of two patients' outcome differences as the response and the difference between this pair of patients' covariates as the predictors. Each patient is compared in pairs with each of the remaining patients in a non-ordered manner.
Compared to the methods using surrogate loss functions, our proposed method leverages the double robustness of AIPWE and overcomes the lack of statistical consistency introduced by substituting the 0-1 loss by a smooth loss (Zhang and Laber, 2019). Comparing person-to-person variations allows our proposed method to achieve statistical consistency for the optimal regime within the class under consideration. Specifically, our proposed model is invariant to a constant shift in the outcome value while EL is not. In addition, the weighted logistic regression framework enables us to address the limitations of nonconvexity and nondifferentiability of the 0-1 loss in CAL, allowing for easier parameter optimization. At the same time, because the optimization of treatment regimens is converted into a logistic regression, the bootstrap method, which could not be used directly in CAL, can be used directly to estimate the standard errors and quantify the uncertainty in the new model. This article is organized as follows. The next section introduces the framework and assumptions for estimating treatment regimes, reviews some current approaches, including the doubly robust estimator, outcome-weighted learning, entropy learning, and concordance-assisted learning, and presents the proposed pairwise comparison learning method. In Section 3, we demonstrate the performance of the proposed method by comparing it with several existing popular methods under linear and non-linear simulation scenarios. Section 4 presents an application to an AIDS clinical trial dataset. We discuss our results in Section 5. In the following, Y ∈ R denotes the outcome, and we assume that a larger outcome is more favorable.
For example, in the real data analysis, the response variable is the CD4 cell count (cells/mm³) at 20 ± 5 weeks after baseline; a higher number indicates a stronger immune system and better health in HIV-infected patients. The observed data are {(X_i, A_i, Y_i)}, i = 1, ..., n. Define V(d) = E[Y*(d(X))] to be the value of regime d. An optimal treatment regime d_opt satisfies V(d_opt) ≥ V(d) for all d ∈ D, which means an optimal treatment regime can achieve a better health outcome for the patient than any other regime.
We require the following standard assumptions: (A1) consistency (SUTVA), Y = Y*(1)I(A = 1) + Y*(−1)I(A = −1); (A2) no unmeasured confounders, {Y*(1), Y*(−1)} ⊥ A | X; and (A3) positivity, π(a, x) = P(A = a | X = x) ≥ ξ for some ξ > 0. SUTVA means that there is no interference or hidden variation of treatment (Rubin, 1980). Assumption (A2) implies that all the factors that may influence the treatment assignment or potential outcomes are measured in the observed data; it is satisfied in a randomized trial but is untestable in an observational study (Robins et al., 2000). The positivity assumption (A3) implies P(A = a) ≥ ξ by iterated expectation, which means that any individual has a positive probability of being assigned to each treatment.

Doubly Robust Estimator for Value Function
The doubly robust estimator was first proposed by Bang and Robins (2005).
Under the assumptions mentioned previously, Zhang et al. (2012) proposed to maximize the doubly robust estimator for the value of a treatment regime, V(d), to obtain the optimal regimen, which can be given as (Zhang et al., 2012)

V(d) = E[ C_d Y / π_c(X; d) − {C_d − π_c(X; d)} / π_c(X; d) · μ(d(X), X) ],

where C_d = I{A = d(X)}, π_c(X; d) = π(d(X), X), μ(a, x) = E(Y | A = a, X = x) is the conditional mean outcome, and π(a, x) = P(A = a | X = x) is the propensity score. When treatment assignment is randomized and independent of covariates, the propensity score is defined prior to the trial. However, Rosenbaum (1987) suggested that scores estimated from the collected data are preferable to the true ones, because in practice a variety of factors might cause propensity scores to be inaccurate in the final sample, resulting in biased sample estimates (Rosenbaum, 1987). In addition, according to Lunceford and Davidian (2004), estimating the propensity scores is more efficient even when the true ones are known. Hence, in randomized clinical trials, we estimate the propensity score π(a, x) by n^{-1} Σ_{i=1}^n I(A_i = a). Otherwise, we may posit a logistic regression model π(a, x; γ) = exp(x̃^T γ)/{1 + exp(x̃^T γ)} for estimation, where x̃ is a pre-specified feature vector constructed from x and γ is an unknown parameter vector that can be estimated via the maximum likelihood estimator (MLE) γ̂ based on (A_i, X_i). Letting π̂(a, x) denote the estimated propensity score, π̂(a, x) = exp(x̃^T γ̂)/{1 + exp(x̃^T γ̂)}. For μ(a, x), we posit a linear regression model, μ(a, x; ζ) = z^T ζ, where z is a pre-specified feature vector constructed from x and ζ is an unknown parameter vector. We use ζ̂ to denote the MLE of ζ, so the estimator of μ(a, x) is μ̂(a, x) = z^T ζ̂. Then, the doubly robust estimator of V(d) can be written as

V̂(d) = n^{-1} Σ_{i=1}^n [ Ĉ_{d,i} Y_i / π̂_c(X_i; d) − {Ĉ_{d,i} − π̂_c(X_i; d)} / π̂_c(X_i; d) · μ̂(d(X_i), X_i) ],   (2)

where Ĉ_{d,i} = I{A_i = d(X_i)} and π̂_c(X_i; d) = π̂(d(X_i), X_i). For any given treatment regime d, V̂(d) satisfies the property of double robustness: it is consistent for V(d) even when either π(a, x) or μ(a, x) is misspecified, but not both (Zhang et al., 2012, 2015). Consequently, an estimator of the optimal regime, d̂_opt, is a rule that maximizes V̂(d), i.e., d̂_opt ∈ arg max_{d ∈ D} V̂(d).
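As a concrete illustration, the doubly robust estimator in (2) amounts to a single pass over the data. The following is a minimal Python sketch under our own simplified data layout (a list of (x, a, y) tuples) with user-supplied estimates of the propensity score and the conditional mean outcome; it is an illustration, not the implementation used in our analysis.

```python
def aipwe_value(data, d, pi_hat, mu_hat):
    """Doubly robust (AIPWE) estimate of the value V(d) of regime d.

    data    : list of (x, a, y) with a in {-1, +1}
    d       : regime, d(x) in {-1, +1}
    pi_hat  : estimated propensity, pi_hat(a, x) ~ P(A = a | X = x)
    mu_hat  : estimated outcome model, mu_hat(a, x) ~ E(Y | A = a, X = x)
    """
    total = 0.0
    for x, a, y in data:
        c = 1.0 if a == d(x) else 0.0        # C_i = I(A_i = d(X_i))
        p = pi_hat(d(x), x)                  # pi(d(X_i), X_i)
        m = mu_hat(d(x), x)                  # mu(d(X_i), X_i)
        total += c * y / p - (c - p) / p * m # augmentation term
    return total / len(data)
```

Note that when the outcome model is exactly correct, the augmentation term removes all the inverse-probability noise: with a constant outcome and a correctly specified μ̂, every summand equals that constant.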

Classification-based Learning with Surrogate Loss
Following the doubly robust estimation approach, we can first estimate μ(a, x) and π(a, x) using the observed data, and then estimate d_opt by maximizing equation (2) (Zhang et al., 2012). However, because of the nonconvexity and nondifferentiability of the 0-1 loss, the classification-based approaches, as discussed before, suffer from slow convergence and computational challenges, and may even yield a risk of overfitting (Zhao et al., 2012). Therefore, a new estimation approach, outcome weighted learning (OWL), was introduced by Zhao et al. (2012). The original study of OWL did not consider the doubly robust estimator, so V(d) was written as

V(d) = E[ I{A = d(X)} Y / π(A, X) ].   (3)

It has been shown that finding the d_opt that maximizes (3) is equivalent to the problem of minimizing (Qian and Murphy, 2011)

E[ I{A ≠ d(X)} Y / π(A, X) ].   (4)

The term (4), which is a weighted classification error, can be estimated using the observed data as

n^{-1} Σ_{i=1}^n Y_i I{A_i ≠ d(X_i)} / π(A_i, X_i).   (5)

Formula (5) is a weighted summation of the 0-1 loss, which is discontinuous and nonconvex, so direct minimization is hard. To address this issue, Zhao et al. (2012, 2015) proposed to replace the 0-1 loss with the hinge loss used in the support vector machine:

min_f  n^{-1} Σ_{i=1}^n Y_i {1 − A_i f(X_i)}_+ / π(A_i, X_i) + λ_n ‖f‖²,   (6)

where f(X) is the decision function so that d(X) = sign{f(X)}, {1 − Af(X)}_+ = max{0, 1 − Af(X)} is the hinge loss, λ_n is the tuning parameter controlling the amount of penalty on f, and ‖f‖ is some norm of f, for example the Euclidean norm of the coefficient vector when f is linear. Formula (6) can be computed as a weighted classification using the support vector machine.
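To make the weighted classification view concrete, the empirical weighted misclassification error (5) and the hinge-loss objective (6) for a linear decision function can be sketched as below. This is a minimal Python illustration under our own data layout; in practice (6) is minimized with an SVM solver rather than evaluated pointwise.

```python
def weighted_misclass(data, d, pi_hat):
    """Formula (5): n^{-1} sum_i Y_i / pi(A_i, X_i) * I(A_i != d(X_i))."""
    return sum(y / pi_hat(a, x) * (a != d(x)) for x, a, y in data) / len(data)

def owl_hinge_objective(data, beta, b0, pi_hat, lam):
    """Hinge surrogate (6) for a linear rule f(x) = x . beta + b0,
    with an L2 penalty lam * ||beta||^2."""
    n = len(data)
    risk = 0.0
    for x, a, y in data:
        fx = sum(bj * xj for bj, xj in zip(beta, x)) + b0
        risk += y / pi_hat(a, x) * max(0.0, 1.0 - a * fx)  # (1 - A f(X))_+
    return risk / n + lam * sum(b * b for b in beta)
```

The hinge objective upper-bounds the weighted 0-1 loss: for a subject with f(X_i) = 0, both losses charge the full weight Y_i/π(A_i, X_i).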
However, the hinge loss is still nondifferentiable, making it quite difficult to obtain statistical inferences. Hence, a smoother surrogate loss function h(a, y) was introduced. To prevent nonconvexity, the choice of h(a, y) needs to satisfy: (I) for a ∈ A, h(a, y) is a twice differentiable and convex function of y; and (II) additional conditions guaranteeing Fisher consistency of the surrogate (Jiang et al., 2020). If we consider a loss function of the form h(a, y) = −ay + g(y), a special case is g(y) = −y + 2 log{1 + exp(y)}. Then the loss function is

h(a, y) = −(a + 1)y + 2 log{1 + exp(y)}.   (7)

Equation (7) is exactly the entropy loss for a logistic regression. In this case, (5) becomes

n^{-1} Σ_{i=1}^n Σ_{a ∈ A} K̂_{ia} h{a, f(X_i)},   (8)

where K̂_{ia} is a nonnegative subject-and-treatment-specific weight; for the IPWE, K̂_{ia} = Y_i I(A_i = a)/{aπ̂ + (1 − a)/2}. In Jiang et al. (2020), Y_i is assumed to be positive, so K̂_{ia} ≥ 0 for all i and a. When A_i = 1, K̂_{i,−1} = 0, and when A_i = −1, K̂_{i,1} = 0. Hence minimizing (8) reduces to a weighted logistic regression with I(K̂_{i,1} − K̂_{i,−1} > 0) as the response, X_i as the predictor, and |K̂_{i,1} − K̂_{i,−1}| as the sample weights, which is exactly the estimation of equation (8). Therefore, the weighted logistic regression framework can be used to estimate the entropy learning model.
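The claim that (7) is the entropy loss of a logistic regression can be checked numerically: h(a, y)/2 equals the logistic negative log-likelihood with response r = (a + 1)/2 and unit weight. A small Python sketch (function names are ours):

```python
import math

def entropy_loss(a, y):
    """h(a, y) = -(a + 1) y + 2 log(1 + exp(y)), with a in {-1, +1}."""
    return -(a + 1) * y + 2.0 * math.log(1.0 + math.exp(y))

def weighted_logistic_nll(r, y, w):
    """Weighted logistic negative log-likelihood for linear predictor y,
    binary response r in {0, 1}, and sample weight w."""
    return w * (-r * y + math.log(1.0 + math.exp(y)))
```

Since h(a, y)/2 = −{(a + 1)/2} y + log{1 + exp(y)}, the two functions share minimizers, which is why a standard weighted logistic regression routine fits the entropy learning model.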
Ideally, if we add a constant to all the Y_i's, the optimal rule should remain the same. However, if we add a constant c to each Y_i in the entropy learning model, the weights K̂_{ia} change, and hence the log-likelihood function changes; correspondingly, the maximum likelihood estimate for β will change in general.
Hence, the entropy learning model does not guarantee estimation consistency of the optimal regime within the considered class. Therefore, we adopt the pairwise comparison idea from concordance-assisted learning to make our proposed model invariant to a constant shift in the Y_i's.

Concordance-assisted Learning Approach
In the previous approaches, K_{i,1} − K_{i,−1} is the gain of receiving treatment 1 against treatment −1 for a subject, and we aim to find the optimal treatment regime that maximizes this reward value. The optimal regime therefore assigns a subject to treatment 1 when his or her K_{i,1} − K_{i,−1} > 0 and treatment −1 otherwise. Fan et al. (2017) proposed a novel concordance-assisted learning (CAL) algorithm based on the comparison between any two subjects i and j. If K_{i,1} − K_{i,−1} > K_{j,1} − K_{j,−1}, subject i should be more likely to get treatment 1 than treatment −1 compared with subject j. They demonstrated that the optimal treatment regime under CAL is identical to the optimal treatment regime obtained by maximizing the value function directly. The CAL algorithm includes two steps. 1. They defined a concordance function as

C(β) = E[ {(K_{i,1} − K_{i,−1}) − (K_{j,1} − K_{j,−1})} I(X_i^T β > X_j^T β) ].   (9)

A robust rank regression method was proposed to estimate formula (9).
Then, the optimization of the estimated concordance function Ĉ(β) was carried out using the optim function in R to estimate the prescriptive index β, giving β̂ = arg max_β Ĉ(β). 2. Within the class of treatment regimes obtained from step 1, we estimate the intercept β_0 by maximizing the IPWE of the value function (Zhang et al., 2012), which is given by

V̂(β, β_0) = n^{-1} Σ_{i=1}^n Y_i I{A_i = d(X_i; β, β_0)} / π̂(A_i, X_i),   (10)

where d(x; β, β_0) = 2I(x^T β + β_0 > 0) − 1; then we have β̂_0 = arg max_{β_0} V̂(β̂, β_0). The optimization of V̂(β, β_0) in formula (10) uses the β̂ obtained in step 1 and can be conducted using a grid search to get the optimal threshold β̂_0. However, the optimization of Ĉ(β) is still based on the nonconvex and nondifferentiable 0-1 loss, which may be computationally costly. Hence, we propose a new method to estimate the maximizers more easily based on a weighted logistic regression framework.
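Step 2 of CAL can be sketched directly: evaluate the IPWE value (10) over a grid of candidate intercepts and keep the maximizer. A minimal Python illustration, assuming a plain list-of-tuples data layout of our own:

```python
def ipwe_value(data, beta, b0, pi_hat):
    """IPWE of V(d) for the linear rule d(x) = 2 I(x . beta + b0 > 0) - 1."""
    total = 0.0
    for x, a, y in data:
        dx = 1 if sum(b * xj for b, xj in zip(beta, x)) + b0 > 0 else -1
        total += y * (a == dx) / pi_hat(a, x)  # only concordant subjects count
    return total / len(data)

def grid_search_intercept(data, beta, pi_hat, grid):
    """CAL step 2: pick the threshold b0 on the grid maximizing the IPWE value."""
    return max(grid, key=lambda b0: ipwe_value(data, beta, b0, pi_hat))
```

Because (10) is a step function of β_0 with at most n jump points (the values −X_i^T β̂), a grid containing those points is sufficient for an exact maximizer.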

Proposed Method
The proposed method consists of two steps. First, we solve for the prescriptive index β using a weighted logistic regression. Second, we find the intercept β_0 by maximizing the AIPWE defined in formula (2).
To solve for the prescriptive index β, we define a pairwise comparison function given by

F(β) = Σ_{i ≠ j} {(K̂_{i,1} − K̂_{i,−1}) − (K̂_{j,1} − K̂_{j,−1})} I{(X_i − X_j)^T β > 0}.   (11)

If we use the entropy loss in equation (7) to replace the 0-1 loss, equation (11) becomes

F̃(β) = Σ_{i ≠ j} w_{ij} [ r_{ij} (X_i − X_j)^T β − log{1 + exp((X_i − X_j)^T β)} ],   (12)

where r_{ij} = I(K̂_{i,1} − K̂_{i,−1} > K̂_{j,1} − K̂_{j,−1}) and w_{ij} = |(K̂_{i,1} − K̂_{i,−1}) − (K̂_{j,1} − K̂_{j,−1})|. Equation (12) is the negative of the entropy loss for a weighted logistic regression. So we view the maximization of F̃(β) as a weighted logistic regression, with r_{ij}, i ≠ j, as the response, (X_i − X_j), i ≠ j, as the predictor, and w_{ij}, i ≠ j, as the sample weights. Then the maximum likelihood estimate β̂ of β can be solved by minimizing the corresponding loss function.
The objective of the weighted logistic regression is to maximize the likelihood of the data, which is assumed to be Bernoulli-distributed:

L(β) = Π_{i ≠ j} [ p_{ij}^{r_{ij}} (1 − p_{ij})^{1 − r_{ij}} ]^{w_{ij}},

where r_{ij} and w_{ij} are defined above and p_{ij} is the probability given by the logistic function

p_{ij} = exp{(X_i − X_j)^T β} / [1 + exp{(X_i − X_j)^T β}].

The negative weighted log-likelihood is

J(β) = −Σ_{i ≠ j} w_{ij} [ r_{ij} log p_{ij} + (1 − r_{ij}) log(1 − p_{ij}) ],

and the corresponding score function is

∂J(β)/∂β = −Σ_{i ≠ j} w_{ij} (r_{ij} − p_{ij})(X_i − X_j).

Thus, the maximization of F̃(β) is equivalent to minimizing J(β), and we can derive the optimal coefficients β̂ = β̂_MLE = arg min_β J(β).
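The first step of the proposed method, building the pairwise dataset (response r_ij, predictor X_i − X_j, weight w_ij) and minimizing J(β), can be sketched as below. This is an illustrative pure-Python gradient descent on J(β); in practice any weighted logistic regression routine (without intercept) applies, and this is not the fitting code used in our analysis.

```python
import math

def pcl_pairs(X, dK):
    """Build the pairwise dataset over all ordered pairs i != j:
    predictor X_i - X_j, response r_ij = I(dK_i > dK_j), weight |dK_i - dK_j|,
    where dK_i = K_hat_{i,1} - K_hat_{i,-1}."""
    pairs = []
    n = len(X)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            diff = [xi - xj for xi, xj in zip(X[i], X[j])]
            pairs.append((diff, 1.0 if dK[i] > dK[j] else 0.0, abs(dK[i] - dK[j])))
    return pairs

def fit_weighted_logistic(pairs, p, lr=0.5, steps=200):
    """Minimize J(beta) by ascending the score
    sum_ij w_ij (r_ij - p_ij)(X_i - X_j); no intercept term."""
    beta = [0.0] * p
    for _ in range(steps):
        grad = [0.0] * p
        for x, r, w in pairs:
            eta = sum(b * xj for b, xj in zip(beta, x))
            pij = 1.0 / (1.0 + math.exp(-eta))
            for k in range(p):
                grad[k] += w * (r - pij) * x[k]
        beta = [b + lr * g / len(pairs) for b, g in zip(beta, grad)]
    return beta
```

Note that each unordered pair appears twice with complementary responses and the same weight, which leaves the minimizer unchanged, and that adding a constant to all outcomes leaves the pairwise weights and responses (and hence β̂) based on the contrasts unchanged at the population level.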
Then, to find the intercept β_0, we utilize a weighted logistic regression strategy similar to that in entropy learning (EL), but instead of the IPWE we use the doubly robust estimator of the value function for K_{ia}, given by

K̂_{ia} = I(A_i = a) Y_i / π̂(a, X_i) − {I(A_i = a) − π̂(a, X_i)} / π̂(a, X_i) · μ̂(a, X_i),   (13)

where π̂(a, X_i) and μ̂(a, X_i) are the estimated propensity score and conditional mean outcome. So we can use a weighted logistic regression with I(K̂_{i,1} − K̂_{i,−1} > 0) as the response, |K̂_{i,1} − K̂_{i,−1}| as the weights, and X_i^T β̂ as the offset, estimating only the intercept β_0. Let β̂_0 denote the estimate of β_0. The optimal treatment regime under the proposed approach is d̂_opt(X) = 2I(β̂^T X + β̂_0 > 0) − 1. Then, we draw bootstrap samples with replacement from the observed data and obtain μ̂, π̂, and K̂ based on the bootstrap samples. Next, we estimate β and β_0 by the two-step procedure based on equations (11) and (13). We repeat the bootstrap process B times to obtain bootstrap replicates of the estimates, and the standard error of β̂ is computed as the standard deviation of the B replications. The estimates and standard error of β̂_0 can also be obtained following this process.
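The bootstrap procedure for the standard errors can be sketched generically: resample subjects with replacement, re-estimate on each resample, and take the standard deviation of the B replicates. A minimal Python sketch, where the estimator argument stands in for the full two-step fitting procedure (a hypothetical placeholder, not our actual pipeline):

```python
import random
import statistics

def bootstrap_se(data, estimator, B=200, seed=0):
    """Bootstrap standard error of a scalar estimator.

    Subjects (not pairs) are resampled with replacement; the pairwise
    comparisons are regenerated inside `estimator` on each resample, so the
    correlation induced by pairing is reflected in the replicates.
    """
    rng = random.Random(seed)
    reps = []
    for _ in range(B):
        sample = [data[rng.randrange(len(data))] for _ in data]
        reps.append(estimator(sample))
    # sample standard deviation over the B replications, divisor B - 1
    return statistics.stdev(reps)
```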
Given that the bootstrap samples are drawn at random with replacement from the whole sample, the correlation introduced via the pairwise comparisons is also captured by the bootstrap process. Thus, the bootstrap approach works for our proposed method (Arcones and Gine, 1992).

Simulation Settings
To evaluate and compare the performance of the existing models and our proposed model, we consider four single-stage scenarios used in the previous study of entropy learning (Jiang et al., 2020), because these scenarios were also used there to compare the performance of entropy learning, outcome-weighted learning, and Q-learning, which are likewise included as comparison models in the current study. We also considered two additional scenarios with increased intercepts to demonstrate the influence of a constant shift in outcome values. We simulated treatment A randomly from {−1, 1}, independent of the covariates, and generated the outcome as Y = Q(X, A) + ε, where Q(X, A) may include interactions between the treatment and the covariates. The distribution of X and the expression of Q(X, A) will be specified later. The error term was generated as ε = |R|/10, where R ∼ N(0, 1). We use the absolute value of R to ensure that ε is always positive, because EL and OWL require the outcome to be positive. We considered the following four linear scenarios and two nonlinear scenarios.
SCENARIO 5. We simulated x_1, x_2, x_3 independently and uniformly, with the optimal treatment determined by the sign of the interaction effect. SCENARIO 6. We generated Y using the same function defined in Scenario 3. In contrast, we simulated x_1, x_2, x_3 independently and uniformly from {−1, 0, 1}; namely, x_1, x_2, and x_3 are discrete.
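The common data-generating scheme described above can be sketched as follows. This is a minimal Python illustration: the covariate distribution shown is a placeholder (it varies by scenario), and q stands in for the scenario-specific Q(X, A).

```python
import random

def simulate(n, q, seed=1):
    """One simulated dataset: A uniform on {-1, +1}, independent of X;
    Y = Q(X, A) + eps with eps = |R| / 10, R ~ N(0, 1), so the noise is
    always nonnegative (EL and OWL require positive outcomes)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = (rng.uniform(-1, 1), rng.uniform(-1, 1), rng.uniform(-1, 1))
        a = rng.choice([-1, 1])
        eps = abs(rng.gauss(0.0, 1.0)) / 10.0
        data.append((x, a, q(x, a) + eps))
    return data
```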

Estimation and classification performance
For each scenario, we carried out 500 simulation runs with two different sample sizes, n ∈ {100, 400} subjects. We first compared the regimes obtained through Q-learning (QL), outcome-weighted learning (OWL), entropy learning (EL), concordance-assisted learning (CAL), and the proposed pairwise comparison method.
For each simulated dataset, we also estimated the standard errors for the estimated intercepts and coefficients. The other three models have much larger standard errors for β̂, while our proposed method gives much lower standard errors for the intercept and coefficients. Especially for the linear scenarios, the asymptotic relative efficiencies (ARE) of our proposed method and QL relative to the other three models are extremely high, around 50,000.
Then, we computed the estimated value of the estimated regime and the corresponding standard errors obtained from QL, OWL, EL, CAL, and our proposed PCL method by averaging over 100 bootstrap replications. The true values of the estimated regimes, V(d̂_opt), were approximated using a validation set of 100,000 patients following the estimated regime. First, the standard errors decrease for all the models under all six scenarios as the sample size increases. From Tables 7 and 8, we note that the estimated and true values of the estimated regime in our proposed method are close to those in QL under linear Scenarios 1-4. However, under nonlinear Scenarios 5 and 6, the true value of the estimated regime in our proposed method becomes similar to that in CAL, with much smaller standard errors than those in QL, OWL, and EL. In particular, when the covariates are discrete in Scenario 6, CAL and our proposed method performed significantly better than QL, OWL, and EL. The same trends are also observed in the box plots in Figures 1 and 2.
When the true treatment regimen is linear, as in Scenarios 1-4, QL and our proposed method have much smaller variations in the estimated and true values of the estimated regime than the other three methods, and the estimated values in OWL, EL, and CAL tend to have an upward bias in Scenarios 1-4. However, in nonlinear Scenarios 5-6, the variation in the true value of the estimated regime increases for QL. Our proposed method has consistently smaller variation across both linear and nonlinear scenarios, especially when the sample size is large enough. In addition, when comparing Scenario 2 with Scenarios 3 and 4, the failure of OWL, EL, and CAL to maintain invariance to constant shifts in the outcome is also reflected in the estimated and true values of the estimated regime. When a constant is added to Y, the standard errors in our proposed method and QL change very slightly while those in OWL, EL, and CAL increase sharply; the larger the constant, the larger the standard error.
We also computed the misclassification rate given different sample sizes and treatment regimens. Figure 3 gives the box plots of the misclassification rates in the validation set of 100,000 patients for all five models. In linear Scenarios 1-4, the misclassification rates for QL and our proposed method are much smaller and even decrease toward zero as the sample size increases. OWL, EL, and CAL not only have larger misclassification rates but also larger variations. In nonlinear Scenarios 5 and 6, the misclassification rate for QL increases and becomes larger than those of the other models. Especially when the covariates are discrete in Scenario 6, OWL performs very poorly and fails to classify accurately, which was also observed in Jiang et al. (2020)'s study. However, our proposed method still has the smallest misclassification rates with the least variation. The coverage rates are reported in Table 9.
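The misclassification rate reported here is simply the fraction of validation subjects whose assigned treatment under the estimated regime differs from that under the true optimal regime; a minimal sketch (function names are ours):

```python
def misclassification_rate(X_val, d_hat, d_true):
    """Fraction of validation subjects for whom the estimated regime d_hat
    disagrees with the true optimal regime d_true; both map x to {-1, +1}."""
    n = len(X_val)
    return sum(d_hat(x) != d_true(x) for x in X_val) / n
```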
From Table 9, we find that CAL may suffer from low coverage rates, especially when the sample size is small and the true regimen is linear. In addition, QL has relatively high coverage when the model is specified correctly in Scenario 1, but when the model is misspecified in Scenarios 5 and 6, the coverage rates are significantly reduced. However, under both linear and nonlinear scenarios, the coverage rates of our proposed method are close to the nominal level (95%) and improve as the sample size increases, indicating stable performance in predicting the value function for the treatment regimen.

Results
We used Q-learning, outcome-weighted learning, entropy learning, concordance-assisted learning, and our proposed method to estimate the optimal treatment regime. The estimates of the intercept and coefficients, with their standard errors obtained through bootstrap samples, are reported in Tables 10 and 11, where β̂_0 is the intercept and β̂_1 through β̂_11 are the coefficients for age, Karnofsky score, weight, CD4 count at baseline, CD8 count at baseline, sex, race, symptomatic indicator, hemophilia, homosexual activity, and history of intravenous drug use; an asterisk (*) indicates that the predictor is significant at the 0.05 level. Based on the results in Tables 10 and 11, we observe that age, race, and homosexual activity are identified as significant predictors by QL, CAL, and our proposed method. Age is the most important covariate since it has the largest absolute estimated coefficient; Lu et al. (2013) and Fan et al. (2017) noted similar findings in their studies. At baseline, young patients (age < 40) have higher CD4 cell counts (355 cells/mm³) than older patients (341 cells/mm³). At 20 ± 5 weeks after baseline, young patients still have higher CD4 cell counts (391 cells/mm³) than older patients (379 cells/mm³) but show less improvement (36.3 cells/mm³) than older patients (38.3 cells/mm³). This result reveals that all the patients with AIDS received greater benefits with treatment AZT + ddI than with AZT + ddC.
However, AZT + ddC is more favorable for young patients, and AZT + ddI for older patients.
We then randomly split the dataset into 70% training and 30% testing sets 100 times to predict the value function and treatment regimen. The predicted value functions in the 100 testing sets are plotted in the box plot in Figure 4. By averaging those values, we find that the optimal treatment regimens obtained by our proposed method and QL give larger average values, with 404.677 cells/mm³ for our proposed method and 405.104 cells/mm³ for QL, while the average value functions obtained by the other three methods are smaller: 371.686 cells/mm³ for EL, 383.583 cells/mm³ for OWL, and 399.933 cells/mm³ for CAL. The estimated values from our proposed method also have smaller standard errors than those from the other methods, as shown in Figure 4, indicating better performance in predicting the outcome of the estimated treatment regimen. Comparing the predicted treatment regimens across the five models, we note that our method's predicted treatment regimens in the 100 testing sets are closest to those of QL, with an overlap rate of 98%.

Discussion
In this study, we propose a new method to estimate the optimal treatment regimen based on pairwise comparison. Simulation results under different scenarios and an application to the ACTG175 data both demonstrate that our proposed method can estimate optimal treatment regimens successfully and outperforms existing approaches. Our simulation results show that, unlike Q-learning, which is prone to model misspecification, our proposed model can estimate coefficients with much smaller standard errors than other methods regardless of whether the model is correctly specified. When predicting the true value of the estimated regime and the optimal regime, our proposed model has smaller variations and the lowest misclassification rate, and the coverage rates under different scenarios are consistently close to the nominal 95% level. In addition, when the covariates are discrete, our proposed method can still estimate optimal treatment regimens successfully while OWL cannot. Finally, our model is invariant to a constant shift in the outcome value while EL is not.
There are still some open questions that need to be addressed in the future.
First of all, for the sake of simplicity, we only investigated linear decision rules in this study. In order to generalize our proposed approach to a wider variety of decision rules, we can consider a nonparametric function of X, g(•), so that d(X) = 2I(g(X) + c > 0) − 1. We can also consider tree-based and list-based regimes (Laber and Zhao, 2015; Zhang et al., 2018). More research is encouraged in this direction.
In addition, this study only considered one single decision point.EL and CAL mentioned in this study can handle multiple decisions.We may employ a similar strategy to extend our proposed method to multiple decision time points to estimate optimal dynamic treatment regimens.
Our proposed method in this study only estimated an optimal treatment regimen between two treatments, but it can be generalized to multiple-treatment settings (Qi et al., 2020). In the simplest approach, we can still apply pairwise comparison between any two treatments among the multiple treatments to find the optimal regime through multiple steps. However, the order of the comparisons, and how to find the pairs more effectively when there are more than three treatments, need further investigation in the future.
In terms of the real data application, we did not consider any interactions or splines for the covariates, but our proposed method is flexible enough to include them. To test the potential causal effects of the covariates, sensitivity analysis is needed in the future.
Moreover, this study only considered the case of a single continuous outcome. When the outcome is a binary variable, we can posit a logistic regression model for the estimation of μ(a, x); the maximum likelihood estimator of μ(a, x) is then μ̂(a, x; ζ̂) = exp(z^T ζ̂)/{1 + exp(z^T ζ̂)}, where z is a feature vector constructed from the covariates x and ζ̂ is the estimated parameter vector. Additionally, we may extend our proposed method to problems with multiple outcomes, such as treatment outcome and adverse events in a clinical trial. Laber et al. (2018) combined nonparametric Q-learning with policy search to estimate the most efficacious treatment regime subject to constraints on the risk of adverse events. Following this idea, we may estimate the optimal treatment regime by maximizing the proposed pairwise comparison function in equation (11) while controlling the risk of adverse events using a constrained logistic regression framework.
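For the binary-outcome extension, the fitted mean model μ̂(a, x; ζ̂) can be evaluated directly once ζ̂ is estimated. The sketch below assumes one particular, hypothetical construction of the feature vector z from a and x (intercept, treatment, covariates, and treatment-covariate interactions); the paper does not fix this choice.

```python
import numpy as np

def mu_hat(a, x, zeta):
    """Logistic outcome model mu(a, x) = exp(z'zeta) / (1 + exp(z'zeta)).
    Here z stacks an intercept, the treatment, the covariates, and
    treatment-covariate interactions -- one possible choice of features."""
    z = np.concatenate(([1.0, a], x, a * x))
    return 1.0 / (1.0 + np.exp(-(z @ zeta)))

# with zeta = 0 the model returns probability 0.5 for any patient
p = mu_hat(1, np.array([0.3, -1.2]), np.zeros(6))
```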
Setup and Assumptions

We consider a single-stage randomized trial or observational study with two treatment arms A ∈ A = {−1, 1}. Let X ∈ R^p denote a patient's baseline covariates and Y*(a) the potential outcome under treatment a ∈ A. Let D denote a class of treatment regimes of interest, where a given treatment regime d ∈ D maps R^p to A. For example, we may consider a class of linear decision rules d(x) = 2I(x^T β + β_0 > 0) − 1, where β is a p-dimensional vector of prescriptive indexes and β_0 is the intercept. Based on the covariates x, the estimated treatment regimen d(x) is either −1 or 1, and the potential outcome under regime d is Y*(d(X)).

With this setup, we can view the estimation of β as a weighted logistic regression with I(K̂_{i,1} − K̂_{i,−1} > 0) as the response, X_i as the predictor, and |K̂_{i,1} − K̂_{i,−1}| as the sample weights. To implement this framework, we use the IPWE, K̂_{ia} = Y_i I(A_i = a)/{aπ + (1 − a)/2}. The B bootstrap estimates of β are β̂*(1), β̂*(2), ..., β̂*(B), and the standard error of β̂ can be computed as the standard deviation of the B replications,

ŝe(β̂) = {Σ_{b=1}^{B} [β̂*(b) − β̄*(·)]² / (B − 1)}^{1/2},

where β̄*(·) = Σ_{b=1}^{B} β̂*(b)/B.
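The estimation and bootstrap steps above can be sketched as follows. This is a minimal numpy-only illustration under assumed conditions (a randomized trial with π = 0.5, an IRLS solver standing in for any weighted logistic fitter, and toy simulated data); it is not the authors' implementation.

```python
import numpy as np

def fit_weighted_logistic(X, y, w, iters=25):
    """Weighted logistic regression (intercept included) via IRLS."""
    Z = np.column_stack([np.ones(len(y)), X])
    beta = np.zeros(Z.shape[1])
    for _ in range(iters):
        eta = np.clip(Z @ beta, -30, 30)
        p = 1.0 / (1.0 + np.exp(-eta))
        W = w * p * (1.0 - p)
        H = (Z * W[:, None]).T @ Z + 1e-8 * np.eye(Z.shape[1])
        beta += np.linalg.solve(H, Z.T @ (w * (y - p)))
    return beta

def pcl_fit(X, A, Y, prop=0.5):
    """PCL-style fit: IPWE pseudo-outcomes K_{i,a} = Y_i I(A_i = a) / pi_a,
    then weighted logistic regression with response I(K1 - Km1 > 0)
    and sample weights |K1 - Km1|."""
    K1 = Y * (A == 1) / prop
    Km1 = Y * (A == -1) / (1.0 - prop)
    diff = K1 - Km1
    return fit_weighted_logistic(X, (diff > 0).astype(float), np.abs(diff))

def bootstrap_se(X, A, Y, B=100, seed=0):
    """SE of beta-hat: standard deviation over B bootstrap refits."""
    rng = np.random.default_rng(seed)
    reps = []
    for _ in range(B):
        idx = rng.integers(0, len(Y), len(Y))
        reps.append(pcl_fit(X[idx], A[idx], Y[idx]))
    return np.std(reps, axis=0, ddof=1)

# toy data: treatment 1 helps when x > 0, so the slope should be positive
rng = np.random.default_rng(1)
n = 400
X = rng.normal(size=(n, 1))
A = rng.choice([-1, 1], size=n)
Y = A * X[:, 0] + 0.5 * rng.normal(size=n)
beta_hat = pcl_fit(X, A, Y)          # (intercept, slope)
se_hat = bootstrap_se(X, A, Y, B=50)
```

The response/weight construction mirrors the weighted logistic regression described above, and `bootstrap_se` implements the standard-deviation formula for ŝe(β̂).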

Figure 1 :
Figure 1: Boxplot of the estimated value of the estimated regime V ( d), and the true value of the estimated regime V ( d) when n = 100

Figure 2 :
Figure 2: Boxplot of the estimated value of the estimated regime V ( d), and the true value of the estimated regime V ( d) when n = 400

Figure 3 :
Figure 3: Boxplot of misclassification rate in validation set

Figure 4 :
Figure 4: Estimated values on test sets across 100 random splits

From Tables 1 and 3, we observe that when the model is correctly specified under Scenario 1, the estimated coefficients of our proposed pairwise comparison learning (PCL) method are comparable with those of Q-learning, and are better than those of OWL, EL and CAL, with improved efficiency as measured by standard errors.

Table 1 :
Estimated optimal treatment regimes under Scenarios 1 and 2 (n = 100)
a. The true values for β0, β1, β2, and β3 are 0.272, −0.680, −0.680, and 0
b. SE is the standard error obtained from 100 bootstrap samples
c. MCSE is the Monte Carlo empirical standard error

Table 3 :
Estimated optimal treatment regimes under Scenarios 1 and 2 (n = 400)
a. The true values for β0, β1, β2, and β3 are 0.272, −0.680, −0.680, and 0
b. SE is the standard error obtained from 100 bootstrap samples
c. MCSE is the Monte Carlo empirical standard error

Table 4 :
Estimated optimal treatment regimes under Scenarios 3 and 4 (n = 400)
a. The true values for β0, β1, β2, and β3 are 0.272, −0.680, −0.680, and 0
b. SE is the standard error obtained from 100 bootstrap samples
c. MCSE is the Monte Carlo empirical standard error

Table 8 :
Comparison of estimated value and true value of the estimated regime with standard errors using Q-learning, outcome weighted learning (OWL), entropy learning (EL), concordance-assisted learning (CAL) and proposed pairwise comparison learning (PCL)

Table 9 :
Coverage rates CR_Vt of the expected value function using Q-learning, outcome weighted learning (OWL), entropy learning (EL), concordance-assisted learning (CAL) and proposed pairwise comparison learning (PCL)

Table 11 :
Estimated optimal treatment regimes for ACTG175 data from CAL and PCL