CLINICAL PREDICTION MODELS FOR DIAGNOSIS OF APPENDICITIS IN CHILDREN WITH ABDOMINAL PAIN

Appendicitis is common in children, but remains difficult to diagnose accurately. The clinician must integrate information from the history, physical examination and screening laboratory tests to decide whether to reassure, order diagnostic imaging, or proceed to the operating room. This process is best framed as a decision problem with two thresholds: a lower threshold, below which further testing may be unnecessary, and an upper threshold, above which further testing need not delay appendectomy. The goal of this analysis was to model the probability of appendicitis. This project analyzes observations by 23 physicians on 143 children with abdominal pain evaluated in a Pediatric Emergency Department. Clinicians recorded the presence or absence of various signs and symptoms, and provided their gestalt estimate of the probability of appendicitis (priorprob) prior to obtaining screening laboratory tests such as white blood cell count (wbc). A final diagnosis of appendicitis was confirmed pathologically in 45 (31.5%) patients. Exploratory plots use nonparametric kernel density estimation and locally weighted scatterplot smoothing. Missing data are imputed using both single and multiple imputation. Receiver Operating Characteristic curves illustrate the superior discrimination of a logistic clinical factors model vs. the Pediatric Appendicitis Score, which dichotomizes wbc. The Akaike Information Criterion provides support for a model that substitutes gestalt clinical probability (priorprob) for individual clinical factors. The bootstrap is used to produce bias-corrected calibration plots for each model and to estimate confidence intervals for coefficients. To account for the correlation within physicians, Generalized Linear Mixed Models with clinician-specific random effect(s) were fit using maximum likelihood and Bayesian methods. The apparent importance of gender in exploratory plots is confirmed using parametric models.
Contrary to prior studies, the presence of fever reduces the probability of appendicitis. Conditional predictions from the preferred (random intercept) Bayesian model suggest that one can most confidently omit imaging in girls with low clinical suspicion (priorprob) and low white blood cell counts (wbc). Conversely, the best case for proceeding directly to the operating room can be made for boys with both high priorprob and high wbc. When levels of priorprob and wbc are discordant, imaging, or further observation, will be necessary.

Appendicitis is common in childhood, but remains difficult to diagnose accurately. After obtaining a history and physical examination, the treating clinician(s) may decide to send the child home without further testing, perform diagnostic imaging using ultrasound or computed axial tomography (CT scan), admit for observation, or proceed to the operating room. The costs of unnecessary surgery are balanced by an increased risk of perforation if the diagnosis is missed on initial evaluation. Perforated appendicitis results in greater morbidity. Radiologic imaging increases cost, and may introduce further delays. Concerns have been raised about possible long term effects of radiation from CT scan.
Most prior studies have treated the diagnosis of appendicitis as a classification/discrimination problem. The result has been the creation of a number of "clinical scores" and associated proposed decision rules. Most recently, the focus of these decision rules has been to define a patient at "low risk" for appendicitis [1,2].

Justification and Significance of the Problem
Harrell and others make a compelling argument that suggests many of these attempts are flawed [3,4].
Harrell argues that many decision rules ignore subject heterogeneity and categorize predictions as either diseased or normal, letting fear of probabilities and costs/utilities lead the analyst, not the treating physician, to be the provider of the utility function. Middle probabilities allow for gray zones and deferred decisions pending further testing. One such decision analysis has been published [5].
Harrell further argues that many physician investigators exhibit "dichotomania", attempting to find cutpoints in continuous predictor variables using improper scoring rules such as sensitivity and specificity. Mathematically, such cutpoints waste information, and cannot exist unless the relationship with outcome is discontinuous.
With this perspective in mind, the goals of this study were to develop and compare clinical prediction models [4] as follows:
1. Using multiple logistic regression, develop a prediction model from history, physical examination and laboratory tests. I will explore methods to avoid resorting to stepwise variable selection techniques, use model validation techniques based on the bootstrap, and use graphical methods to aid in understanding model predictions [4,6].
2. Develop a prediction model using the clinician's estimate of the prior probability of appendicitis, calibrated for gender and adjusted for the subsequently obtained white blood cell (wbc) count.
3. Compare the potential predictive utility of a model based on a clinical score vs. use of the clinician's subjective assessment of prior probability.
4. Extend the second model to account for dependency of clinical prior probability within clinicians using Generalized Linear Mixed Models (GLMM).
5. Replicate the logistic models and the GLMM from a Bayesian perspective and explore incorporation of an informative prior using data from a study performed at the Children's Hospital of Philadelphia (CHOP).

Methodology

Study Design
Existing data from two completed, IRB approved studies of appendicitis in children presenting to the Emergency Department (ED) are available.
In the primary study, performed at Hasbro Children's Hospital (HCH) in Providence, RI, we enrolled a prospective cohort of children presenting to a pediatric emergency department with abdominal pain in whom the treating physician considered a diagnosis of appendicitis. Faculty and fellows recorded potentially predictive information obtained during a structured history and physical examination. Clinicians were also asked to mark a vertical hash on a 10 cm line to express their clinical estimate of the probability of appendicitis. These clinical variables were recorded prior to availability of laboratory tests (e.g., wbc count) or results of abdominal imaging (ultrasound or computed tomography).
In those children who had surgery, the final pathology report was used as the criterion for a diagnosis of appendicitis. In non-operative cases, telephone follow-up was done after seven days to ensure that symptoms had resolved.

Descriptive Statistics
At Hasbro Children's Hospital, 143 children were evaluated by 23 physicians.
Individual physicians saw between 1 and 17 patients each. Overall, 45 (31.5%) children had a final diagnosis of appendicitis, leaving 98 (68.5%) without appendicitis.
A colleague kindly provided de-identified data from a similar study done at the Children's Hospital of Philadelphia (CHOP). This study enrolled 217 patients greater than five years old, of whom 86 (39.6%) had appendicitis. This supplemental dataset will be used to develop an informative prior in the Bayesian analysis.

Reproducible Research
The goal of reproducible research is to tie specific instructions to data analysis and experimental data so that scholarship can be recreated, better understood and verified [7].
The source documents for this thesis were written using the GNU Emacs text editor with the add-on package Emacs Speaks Statistics (ESS) [8]. R: A Language and Environment for Statistical Computing was used for all statistical analyses [9]. The Sweave function [10] executes R code chunks embedded in the source file, producing a LaTeX document which incorporates statistical analyses and graphics with prose for typesetting using the URI thesis format [11].

Descriptive Statistics
The reporttools package creates LaTeX tables with descriptive statistics for the variables collected [12].
Factors presumed to be associated with appendicitis were recorded. The variable fever was coded as present if body temperature, at home or in the ED, was ever ≥ 38 °C. The variable migrate was coded yes if pain had migrated to the right lower quadrant (RLQ). Each child was asked: "What is your favorite food?" "If we had some here, would you want to eat it now?" If not, anorexia was coded as yes.
Finally, emesis was defined as any history of vomiting. The next table summarizes the physical examination variables. If the patient was tender to palpation in the RLQ of the abdomen, rlqpain was coded as present.
The variables hoppain, coughpain, shakepain and percpain were coded as present if the patient reported pain with attempts to hop or cough, or in response to a gentle pelvic shake or manual percussion of the abdomen. Rebound tenderness (rebound) was considered present if the patient complained of pain after sudden release of abdominal compression. Urinary ketones were measured using a point-of-care dipstick test which provides semiquantitative levels. Thus, ketones were at least ordinal, and were treated as a continuous variable in regression models.
The four-category scale for urinary ketones (ketones) was reduced to three levels (none, small, medium to large) because of low frequencies of observations with values of 2 (moderate) and 3 (large). After taking a history and performing a physical examination, but prior to knowledge of the white blood cell count, clinicians expressed their gestalt clinical estimate of the percent probability of appendicitis (priorprob) by making a vertical hash mark on a 10 cm line. Total white blood cell count (wbc) is often measured as an indicator of inflammation. The percentage of polymorphonuclear cells multiplied by the total count is the absolute neutrophil count (anc), and may represent a more specific indicator of inflammation. Information is available regarding the duration of pain in hours (durpain), but it was not considered in the current models.
If a larger dataset were available, it might be useful to model interactions between durpain and other clinical variables, since patients may develop increasingly severe signs of inflammation over time.

Exploratory Data Analysis
This section is philosophically somewhat at odds with the recommendation not to perform variable selection based on the relationship of predictors to the outcome.
However, exploratory graphics such as those that follow offer a nonparametric approach to visualizing relationships between predictors and outcomes.
The distributions of the continuous variables wbc and priorprob are graphically displayed as empirical kernel density plots conditional on the final diagnosis and gender using the R function densityplot in package lattice [13].

[Figure: kernel density plots of wbc, conditional on final diagnosis (No Appendicitis, Appendicitis); panels: Female, Male]

Note that the variance of wbc count is larger for patients with appendicitis, and there is a suggestion of bimodality in the distribution of the wbc in girls with appendicitis.
The distribution (shown below) of the clinical probability of appendicitis (priorprob) for patients who ultimately were found not to have appendicitis appears to be (at least) bimodal, whereas the distribution of priorprob is unimodal for those with appendicitis.

[Figure: kernel density plots of priorprob, conditional on final diagnosis (No Appendicitis, Appendicitis); panels: Female, Male]

The plot below uses Loess (locally weighted scatterplot smoothing) to estimate the smoothed proportion of children with appendicitis, conditional on gender and white blood cell count. The slope of the relationship between wbc and smoothed predicted probability appears to be similar for boys and girls, suggesting a gender by wbc interaction term will not be needed. The next plot depicts the smoothed proportion of children with appendicitis conditional on gender and estimated prior probability. Note that clinicians appear to greatly overestimate the probability of appendicitis in girls. Only one of the previously published clinical prediction rules has included gender [14]. Thus, I will consider including gender as a covariate in my candidate prediction models.

These data have relatively few missing values for each variable. The software default is to discard any row (patient) with any missing data (listwise deletion).

If all available variables were included in the analysis, listwise deletion would reduce the sample size from 143 to 122, reducing the precision of predictions. Listwise deletion assumes that the data are missing completely at random (MCAR), i.e., that complete observations are a random subsample of the full dataset. To the degree that the MCAR assumption is violated, and the missing data mechanism is instead missing at random (MAR), the resulting regression parameters will be biased. Thus, in many common scenarios, imputation of missing values may be superior to complete case analysis [15].

Simple Imputation
As a first approach, I have set each missing value of a continuous variable to the sample (unconditional) median for that variable. Missing values of categorical variables are assigned the most common category for that variable.
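The rule above can be sketched in a few lines. This is a minimal illustration in Python, not the R code used for the analysis; the function name `impute_simple`, the use of `None` to mark a missing value, and the toy values are all assumptions made for the example.

```python
from statistics import median, mode

def impute_simple(values, kind):
    """Fill None entries with the sample median (continuous)
    or the most common category (categorical)."""
    observed = [v for v in values if v is not None]
    fill = median(observed) if kind == "continuous" else mode(observed)
    return [fill if v is None else v for v in values]

# Toy data (hypothetical values, not taken from the study)
wbc = [8.2, None, 15.1, 11.0, 6.4]
fever = ["no", "yes", None, "no", "no"]

wbc_imputed = impute_simple(wbc, "continuous")      # None -> median of observed values
fever_imputed = impute_simple(fever, "categorical")  # None -> most common category
```

Because every missing value receives the same fill, this approach understates the variability of the imputed variable, which is one motivation for the multiple imputation described next.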

Multiple Imputation
In Chapter 2, the R package mice is used to perform multiple imputation [16].

Previous Studies
The Pediatric Appendicitis Score (PAS) was developed using a prospective sample of 1,170 children with abdominal pain, of whom 63% had appendicitis [3]. The author of this paper employed stepwise logistic regression to create a final model from which he developed a weighted additive score composed of 8 variables. Rather than using a model to directly predict the probability of appendicitis, he reported the sensitivity and specificity of various cutpoints of the PAS (maximum 10 points). The author of the original paper does not explain why or how he chose to dichotomize the continuous variables (white blood cell count and absolute neutrophil count).
A group at Boston Children's Hospital has defined a different decision rule to identify patients at "low-risk" for appendicitis [8]. They analyzed 24 potential predictors recorded in a derivation set of 425 patients with abdominal pain, of whom 157 (37%) had appendicitis. They selected variables with less than 10% missing data and a p-value of ≤ 0.001 for a bivariate association with appendicitis.
These 12 variables were then entered into a backward stepwise logistic regression analysis. Six variables were retained, and used to define a weighted score. The variables, with weights in parentheses, were: nausea (2), history of focal RLQ pain (1), migration of pain (1), difficulty walking (1), rebound tenderness or pain with percussion (2) and absolute neutrophil count > 6.75 (6).
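To make the structure of such a weighted score concrete, the sketch below sums the weights listed above for the findings that are present. It is written in Python for illustration only; the dictionary keys and the function name `low_risk_score` are hypothetical, and no low-risk cut-off is applied because none is given in the text.

```python
# Weights as listed in the text; the key names are hypothetical labels.
WEIGHTS = {
    "nausea": 2,
    "focal_rlq_pain": 1,
    "migration": 1,
    "difficulty_walking": 1,
    "rebound_or_percussion": 2,
    "anc_gt_6_75": 6,
}

def low_risk_score(findings):
    """Sum the weights of the findings recorded as present (True)."""
    return sum(w for name, w in WEIGHTS.items() if findings.get(name, False))

# Hypothetical patient: nausea and migration present, ANC below the cut-point
example = {"nausea": True, "migration": True, "anc_gt_6_75": False}
score = low_risk_score(example)        # 2 + 1
max_score = sum(WEIGHTS.values())      # all six findings present
```

Note that the dichotomized neutrophil count alone contributes 6 of the 13 possible points, so a small change in anc across the cut-point moves the score by almost half its range.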

Models Using Factors from the Pediatric Appendicitis Score
Steyerberg argues that overfitting is a major problem in regression modeling.
If potential predictors are included in the model based on univariate associations with the outcome, the effect of such predictors is overestimated, a phenomenon known as testimation bias [2, p. 88]. Given how they were developed, it is very likely that the prediction rules suffer from overfitting.
Nonetheless, I will first consider the factors included in the Pediatric Appendicitis Score: white blood cell count, absolute neutrophil count, pain with cough/percussion/hop, anorexia, vomiting/nausea, fever, tenderness in RLQ and history of migration of pain to the RLQ.
Harrell [1,11] and Senn [12] make multiple criticisms of the common practice of dichotomizing continuous variables. Dichotomizing wbc at some arbitrary cut-point makes the strong assumption that its relationship with the outcome is piecewise constant (a step function). I have therefore kept white blood cell count (wbc) and absolute neutrophil count (anc) as continuous variables. Given the small dataset, I will assume wbc enters the model linearly. In a larger dataset, Harrell recommends use of restricted cubic spline functions to avoid the linearity assumption [1].

Redundancy Analysis
As a first step, a redundancy analysis is done using Harrell's R redun function in package Hmisc to determine if any variables can be predicted from a combination of the remaining variables using flexible parametric additive models [13].
The absolute neutrophil count can be predicted from the other variables with an R² of 0.96. This is not surprising since absolute neutrophil count is defined as total white blood cell count (wbc) times the proportion of neutrophils. These two variables are strongly collinear, as shown in Figure 5 below. White blood cell count is familiar to clinicians and readily available. Although others have found anc to be a slightly better univariate predictor of appendicitis [14], redundancy analysis suggests that anc can be dropped from the model with little loss of predictive information.

Variable Clustering
As a prelude to data reduction, a hierarchical cluster analysis on the variables is done, using squared Spearman correlations as a similarity measure, and plotted in Figure 6. The first cluster (anorexia, vomiting and ketosis) makes clinical sense since ketones are a consequence of fasting resulting from nausea. Kharbanda fit a recursive partitioning model of appendicitis, and found that vomiting and anorexia were surrogate variables for nausea [8]. It should be noted that young children have difficulty describing the sensation of nausea. Determination of anorexia is also difficult. In this study, we asked each child "What is your favorite food?" "If we had some here, would you want to eat it now?" Thus, a parsimonious choice might be to substitute urinary ketones for the variables emesis and anorexia.
A second cluster (pain with shaking, percussion, hopping or cough) likely reflects the fact that these techniques are alternative ways to elicit signs of peritoneal irritation. I have created a new dichotomous variable, periton, which evaluates to 'yes' if the patient had pain with cough, percussion, hopping, or a gentle shake of the pelvis.
The inter-rater reliability of all of these measures may be limited. Cohen's κ measures the chance-corrected agreement between two raters who each classify patients into mutually exclusive categories [15]. Rebound tenderness (rebound) was found to have moderate reliability (Cohen's κ = 0.54), compared to less than moderate agreement for tenderness to percussion and palpation [16]. However, as noted by Samuel, a pediatric surgeon who developed the Pediatric Appendicitis Score, "Rebound tenderness is a particularly painful clinical feature to elicit and results in undue pain, loss of confidence and trust, and ultimately leads to loss of cooperation. Hence this sign should not be elicited in children" [3]. For me, this is a compelling argument for not including this sign in a prediction model.
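For readers unfamiliar with the statistic, Cohen's κ is the observed agreement corrected for the agreement expected by chance, κ = (p_o − p_e) / (1 − p_e). The Python sketch below is illustrative only: the function name `cohens_kappa` and the two rating vectors are hypothetical and are not data from the cited reliability study.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters over the same patients."""
    n = len(rater_a)
    # Observed proportion of patients on whom the raters agree
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's marginal frequencies
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    categories = freq_a.keys() | freq_b.keys()
    p_expected = sum(freq_a[c] * freq_b[c] for c in categories) / n**2
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical ratings of rebound tenderness by two examiners
a = ["yes", "yes", "no", "no", "yes", "no", "no", "no", "yes", "no"]
b = ["yes", "no", "no", "no", "yes", "no", "yes", "no", "yes", "no"]
kappa = cohens_kappa(a, b)
```

In this toy example the raters agree on 8 of 10 patients, but because 52% agreement would be expected by chance alone, κ is considerably lower than the raw agreement.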
Rather than perform "testimation" and stepwise selection, I chose to create a preliminary model using the variables found in the Pediatric Appendicitis Score: wbc, cough/percussion/hop, anorexia, vomiting/nausea, fever, tenderness in RLQ and history of migration of pain to the RLQ.
Since redundancy analysis suggests that anc can be predicted from wbc and other variables, I removed anc from the model.
Finally, given the apparent influence of gender seen in the exploratory plots, gender is included in the model, with female gender as the reference category.
Since the outcome variable is binary, a generalized linear model with logit link (logistic model) is appropriate.

Generalized Linear Models
In a generalized linear model, the linear predictor $\eta$ and the mean response $\mu$ are related through a link function $g$. That is,

$$\eta = X\beta$$

and

$$\mu = g^{-1}(\eta).$$

When the distribution of $y_i$ given $\mu_i$ is from the exponential family there is a natural link function for the family. For a binomial response the natural link is the logit link defined as

$$\eta_i = \log\left(\frac{\mu_i}{1 - \mu_i}\right)$$

with inverse link

$$\mu_i = \frac{1}{1 + \exp(-\eta_i)}.$$

Because $\mu_i$ is the probability of the ith observation being a "success", $\eta_i$ is the log of the odds. With the logit link, this is the multiple logistic regression model. Models will be fitted using the lrm function (Logistic Regression Model) provided by the rms: Regression Modeling Strategies package [17].
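The logit link and its inverse are simple enough to write out directly. The following Python sketch is for illustration only (the analysis itself uses R); the function names are my own, and the only study quantity used is the observed prevalence of 31.5%.

```python
import math

def logit(p):
    """Log odds of a probability p, with 0 < p < 1."""
    return math.log(p / (1.0 - p))

def inv_logit(eta):
    """Inverse link: map a linear predictor back to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

# The overall prevalence of appendicitis in the sample, on the log-odds scale
eta_prevalence = logit(0.315)              # negative, since 0.315 < 0.5
p_back = inv_logit(eta_prevalence)         # recovers 0.315
```

A probability of 0.5 corresponds to a linear predictor of zero; probabilities below 0.5 map to negative values and those above 0.5 to positive values.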

Multiple Logistic Regression Models
To avoid the loss of precision and possible bias of a complete case analysis, and to ensure that nested models are fitted on the same data, I have used the full dataset with simple (median) imputation for the preliminary models.
A likelihood ratio test is used to determine if the model that includes gender adds predictive information. The LR χ² is 5.562 with a p-value of 0.018. Therefore, we retain gender.
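The reported p-value can be checked by hand. The sketch below assumes the test has 1 degree of freedom, which is my reading of the text (gender enters as a single binary term); for 1 degree of freedom the upper tail of the chi-square distribution can be written with the complementary error function, so no statistical library is needed.

```python
import math

def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variate with 1 degree of freedom.
    Uses P(Z^2 > x) = erfc(sqrt(x / 2)) for a standard normal Z."""
    return math.erfc(math.sqrt(x / 2.0))

lr_chi2 = 5.562                      # likelihood ratio statistic from the text
p_value = chi2_sf_1df(lr_chi2)       # assumes 1 df (an assumption, see above)
```

Under that assumption the result agrees with the reported p-value of 0.018 to three decimal places.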
Akaike's information criterion (AIC) provides a method which can be used to compare two competing, non-nested models of different complexity (lower is better). A model which uses the ketones as a proxy for gastrointestinal distress is supported as it has an AIC of 121.63, compared to a higher AIC of 124.03 when anorexia and emesis are substituted. Thus, the model with ketones is preferred.
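As a reminder of what is being compared, AIC is twice the number of estimated parameters minus twice the maximized log-likelihood. The Python sketch below is illustrative: the helper `aic` is my own, the two AIC values are those quoted above, and the "relative likelihood" exp(−Δ/2) is a common way of expressing an AIC difference that does not appear in the original analysis.

```python
import math

def aic(log_likelihood, n_params):
    """Akaike's information criterion: 2k - 2 log L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood

aic_ketones = 121.63            # model using ketones (from the text)
aic_anorexia_emesis = 124.03    # model using anorexia and emesis (from the text)

delta = aic_anorexia_emesis - aic_ketones
# Relative likelihood of the higher-AIC model compared with the preferred one
relative_likelihood = math.exp(-delta / 2)
```

A difference of 2.40 is modest, so the preference for the ketones model rests as much on parsimony and ease of measurement as on the AIC values themselves.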
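Thus, our preliminary model using clinical factors is:

$$\operatorname{logit}\{P(\text{appendicitis})\} = \beta_0 + \beta_1\,\text{ketones} + \beta_2\,\text{fever} + \beta_3\,\text{wbc} + \beta_4\,\text{gender} + \beta_5\,\text{migrate} + \beta_6\,\text{periton} + \beta_7\,\text{rlqpain},$$

where gender is an indicator for male, with female as the reference category. We can conclude that ketones, fever, wbc and gender are significant predictors of appendicitis after adjusting for the presence or absence of a history of migration of pain, signs of peritonitis and right lower quadrant tenderness. Surprisingly, the coefficient for fever is negative (-2.02). Thus, children with fever in this sample are less likely to have appendicitis. This finding contradicts the implicit assumption in the PAS that fever is a positive predictor.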
The direction and importance of the relationship between fever and appendicitis may depend on the duration of illness. Fever early in the clinical course may point to other bacterial or viral illnesses. However, fever in patients with longer duration of symptoms may be associated with intra-abdominal inflammation. Future studies should investigate a possible interaction between symptom duration and fever.
A Receiver Operating Characteristic curve (ROC curve) is a plot of the true positive rate against the false positive rate over the range of a predictor. The area under the ROC curve, or c-statistic, is a measure of predictive discrimination. A useless predictor has an ROC area of 0.5. The package ROCR is used for the ROC plots [18].
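The area under the ROC curve equals the probability that a randomly chosen patient with appendicitis receives a higher predicted value than a randomly chosen patient without it. The Python sketch below computes the c-statistic directly from that definition; it is not the ROCR implementation, and the function name and toy predictions are hypothetical.

```python
def c_statistic(scores, labels):
    """Proportion of (case, non-case) pairs in which the case has the
    higher score; tied pairs count one half."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    concordant = 0.0
    for s_case in cases:
        for s_control in controls:
            if s_case > s_control:
                concordant += 1.0
            elif s_case == s_control:
                concordant += 0.5
    return concordant / (len(cases) * len(controls))

# Hypothetical predicted probabilities and outcomes (1 = appendicitis)
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
auc = c_statistic(scores, labels)
```

In this toy example eight of the nine case/non-case pairs are correctly ordered. A predictor that assigns every patient the same value yields 0.5, matching the "useless predictor" described above.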
The ROC method provides a way to compare the discrimination ability of the PAS to that of the predictions from the clinical factors model.

Model Using Gestalt Estimate of Probability
It is apparent from the analysis of deviance for our preliminary model (Table 4) that most of the predictive power comes from wbc. Yen et al. found relatively poor inter-rater reliability for dichotomized physical exam variables in pediatric patients with abdominal pain [16].
An alternative approach would be to substitute each clinician's gestalt estimate of the probability of appendicitis for individual history and physical examination factors. Recall that clinicians provided this estimate using a visual analog scale (VAS), by making a mark on a 10 cm line (recorded in the variable priorprob on a scale of zero to 100).
Because the linear predictor η is intended to approximate the logit of the probability of appendicitis, the component of X for the clinician's gestalt estimate is expressed in the analogous form as logit(priorprob). In the ideal case in which clinicians are perfect diagnosticians, the coefficient of logit(priorprob) would be one [20].
The exploratory plots strongly suggest that clinicians overestimate the probability of appendicitis (priorprob) in girls. Thus, gender is again included in the preliminary model.

Bootstrap Estimation and Validation
The nonparametric bootstrap is used to estimate the regression coefficients for our model by refitting the model repeatedly on samples, with replacement, from these data. This approach does not rely on asymptotic sampling distributions to estimate the standard errors of the regression coefficients, and provides more accurate estimates for small datasets such as this one.
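The resampling idea can be illustrated on a much simpler statistic than a regression coefficient. The Python sketch below resamples the 143 outcomes with replacement and forms a percentile interval for the proportion with appendicitis; it does not refit the logistic model as the analysis with the boot package does, and the function name, seed and number of resamples are arbitrary choices made for the example.

```python
import random

def bootstrap_percentile_ci(data, statistic, n_boot=1000, alpha=0.05, seed=1):
    """Percentile bootstrap interval: resample with replacement,
    recompute the statistic, and read off the empirical quantiles."""
    rng = random.Random(seed)
    n = len(data)
    estimates = sorted(
        statistic([data[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    lower = estimates[round((alpha / 2) * n_boot)]
    upper = estimates[round((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

def proportion(xs):
    return sum(xs) / len(xs)

# 45 children with appendicitis and 98 without, as reported in the text
outcomes = [1] * 45 + [0] * 98
lower, upper = bootstrap_percentile_ci(outcomes, proportion)
```

For regression coefficients the same loop applies, with the statistic replaced by a function that refits the model on each resample and returns the coefficient of interest.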
The bootstrap distributions of the regression coefficients found using the boot package are shown below [21]. Note that several of these distributions appear to be noticeably skewed.

We can also use the bootstrap to assess internal validity by calculating indices of discrimination and plotting a calibration curve for a set of 1000 bootstrap samples. The rms package function calibrate is used to produce calibration plots for the two models [17]. The function uses the bootstrap to get overfitting-corrected estimates of predicted vs. observed values using nonparametric smoothers. Note that the first model exhibits greater problems with calibration, particularly for small predicted probabilities. Secondly, there appear to be greater discrepancies between the 'apparent' and bootstrap 'bias-corrected' lines in the clinical predictors model, suggesting some overfitting.

Another suggested way of displaying the ability of the model to discriminate is to plot side-by-side box plots of the predicted probabilities for the two possible outcomes, as shown in Figure 11. This plot highlights four patients without appendicitis who were predicted to have a high probability of appendicitis by the gestalt probability model.

The p-value for the le Cessie-van Houwelingen test is 0.57; therefore, we cannot reject the null hypothesis of global goodness of fit [22].

Clinician Effects
Thus far, we have assumed that patients are independent. However, all patients were evaluated by a group of 23 clinicians who saw between 1 and 17 patients each. It is likely that each clinician used the estimated probability scale in a systematically different way. Thus, we would expect some degree of dependency between the prior probability estimates in patients who were evaluated by the same physician.
One simple approach to evaluate this would be to add 23-1 dummy variables, treating the evaluating clinician (doc) as a fixed effect. However, the coefficients for individual clinician effects (not shown) have very large standard errors, reflecting the small number of patients evaluated by each doctor. Gelman and Hill refer to this as the varying intercept model and the model omitting clinicians as the "complete pooling" or "constant intercept" model [23]. Given that we are not primarily interested in comparing the diagnoses of specific doctors, a better approach may be to treat doc as a random effect.

List of References
[1] F. E. Harrell, Jr., Regression Modeling Strategies: With Applications to Linear Models, Logistic Regression, and Survival Analysis.

In a generalized linear mixed model (GLMM) the n-dimensional vector of linear predictors, $\eta$, incorporates both fixed effects, $\beta$, and random effects, $b$, as

$$\eta = X\beta + Zb$$

where X is an n × p model matrix and Z is an n × q model matrix.
The distribution of the random effects is modeled as a multivariate normal (Gaussian) distribution with mean 0 and q × q variance-covariance matrix Σ. That is,

$$b \sim N(0, \Sigma).$$

Generalized linear mixed models add random effect(s) to the model, allowing for clinician-specific variation in intercepts, and possibly slopes. Models with a random intercept only will be compared to models with correlated random intercept and slopes using Likelihood Ratio (LR) tests.
The methods used by lme4 integrate over random effects to compute the likelihood using either a Laplace approximation, or in some situations adaptive Gauss-Hermite quadrature (GHQ), which is more accurate, but more computationally intensive [1, 2].

Random Slope and Intercept Model
The first model assumes a random slope of priorprob within doc and a correlated random intercept. The Laplace approximation is used for these fits. The standard deviation of the random intercept term is 0.982. The predicted probability for a doctor not in the study is based on the fixed effects only: in the absence of any information on that doctor, the expected value of the random effect is zero.
There is a simple random effect for each doc. To get a prediction for a specific physician one adds the random effect to the value of η before transforming it.
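The difference between a prediction for a new clinician and one for a specific clinician can be sketched as follows. This Python fragment is illustrative only: the value of the fixed-effects linear predictor is hypothetical, and the clinician-specific effect is set, purely for illustration, to one standard deviation (0.982, as quoted above) below the average.

```python
import math

def inv_logit(eta):
    return 1.0 / (1.0 + math.exp(-eta))

def predict(eta_fixed, random_intercept=0.0):
    """Add the clinician's random effect to the fixed-effects linear
    predictor, then transform to the probability scale."""
    return inv_logit(eta_fixed + random_intercept)

eta_fixed = 0.4     # hypothetical fixed-effects linear predictor for one patient
b_doc = -0.982      # hypothetical clinician, one SD below the average intercept

p_new_doctor = predict(eta_fixed)            # random effect at its expected value, 0
p_this_doctor = predict(eta_fixed, b_doc)    # clinician-specific prediction
```

Because the random effect is added on the logit scale before the inverse link is applied, the same shift produces different changes in probability depending on where the patient sits on the curve.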
In the next chapter, the preceding generalized linear mixed models are simulated using Bayesian methods. The model is defined in a text file using a dialect of the BUGS language. Two types of relations are defined. A stochastic relation (∼) defines a stochastic node which represents a random variable in the model. A deterministic relation (<-) defines a deterministic node, the value of which is determined exactly by the values of its parents.

Bayesian Logistic Regression
One of the advantages of the Bayesian approach is that the posterior from one model can provide prior information for subsequent models. The data from a similar study at the Children's Hospital of Philadelphia (CHOP) allow us to model the final diagnosis of appendicitis as a function of wbc and gender.

Logistic Regression Model: Children's Hospital of Philadelphia Data
In this section, I fit a logistic model to the CHOP dataset, with covariates wbc and gender. Diffuse, non-informative independent normal priors with mean zero and large variance (small precision) were assumed for the coefficient vector β.
The BUGS code for the model is shown below:

Table 13. CHOP data: Posterior Summary

Logistic Regression Models - HCH Data, Diffuse Priors
We now fit a model which includes gender, wbc and priorprob as covariates, using diffuse priors.
One of the advantages of Bayesian modeling is the ability to directly sample from functions of the parameters such as odds-ratios (OR).
The BUGS code for the model is shown below:

We can now compare the estimates and predictions from the model which utilizes the prior information from the CHOP data. The credible intervals for coefficients are notably smaller when prior information is utilized.
The mean of the posterior coefficient for wbc, conditional on gender from the CHOP data model, is used to provide prior information. To account for uncertainty due to data from different populations, and the fact that the coefficient for wbc is now conditional on logit(priorprob), I have doubled the standard deviation (precision/4). As expected, the posterior standard deviation for 'b.wbc' decreases (its precision improves), with little change in the posterior mean coefficient.
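The "precision/4" shorthand follows from the way BUGS parameterizes the normal distribution, by its mean and precision (the reciprocal of the variance). The short Python sketch below shows the arithmetic; the standard deviation used is a hypothetical placeholder, not the value estimated from the CHOP data.

```python
def precision_from_sd(sd):
    """Precision is the reciprocal of the variance."""
    return 1.0 / sd**2

sd_chop = 0.05                                   # hypothetical posterior SD
tau = precision_from_sd(sd_chop)                 # precision of the original prior
tau_doubled_sd = precision_from_sd(2 * sd_chop)  # precision after doubling the SD

ratio = tau_doubled_sd / tau                     # doubling the SD quarters the precision
```

The ratio is one quarter regardless of the starting value, which is why the informative prior can be weakened simply by dividing the CHOP precision by four.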

Bayesian Generalized Linear Mixed Models
The models that follow can be defined as multilevel logistic models. These can be thought of as a generalization of generalized linear models, where intercepts, and possibly slopes, are allowed to vary by group [3].

Varying-intercept
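The BUGS code for the varying intercept model is shown below:

The clinician-specific random coefficient model is:

$$\operatorname{logit}\{P(y_i = 1)\} = \alpha_{j[i]} + \beta_{\text{gender}}\,\text{gender}_i + \beta_{\text{wbc}}\,\text{wbc}_i + \beta_{\text{prior}}\,\operatorname{logit}(\text{priorprob}_i), \qquad \alpha_j \sim N(\mu_\alpha, \sigma_{doc}^2),$$

where $j[i]$ denotes the clinician who evaluated patient $i$. The coefficient $\mu_\alpha$ is the expected intercept for a randomly chosen clinician.
The coefficient for an individual clinician ($\alpha_j$) follows a normal distribution with mean $\mu_\alpha$ and a standard deviation of $\sigma_{doc}$. The random intercept $\alpha_j$ for each clinician is plotted below. The dotted horizontal line is $\mu_\alpha$.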

Varying-intercept, varying-slope, no correlation between intercepts and slopes
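In this model, we assume a random intercept and slope for each physician. The BUGS model statement for this model is as follows:

The clinician-specific random coefficient and intercept model is:

$$\operatorname{logit}\{P(y_i = 1)\} = \alpha_{j[i]} + \beta_{\text{gender}}\,\text{gender}_i + \beta_{\text{wbc}}\,\text{wbc}_i + \beta_{j[i]}\,\operatorname{logit}(\text{priorprob}_i),$$

$$\alpha_j \sim N(\mu_\alpha, \sigma_{\alpha[doc]}^2), \qquad \beta_j \sim N(\mu_\beta, \sigma_{\beta[doc]}^2),$$

where $j[i]$ denotes the clinician who evaluated patient $i$. The coefficient $\mu_\alpha$ is the expected intercept and $\mu_\beta$ is the expected slope coefficient for logit(priorprob) for a randomly chosen doctor.
The intercept and slope coefficients for individual clinicians ($\alpha_j$ and $\beta_j$) follow independent normal distributions with means $\mu_\alpha$ and $\mu_\beta$ and standard deviations $\sigma_{\alpha[doc]}$ and $\sigma_{\beta[doc]}$.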
The Bayesian posterior summary estimates for the random coefficients model are shown below. The standard deviation of the clinician specific slopes is relatively small, but still greater than zero.

Predictions from Bayesian Random Intercepts Model
The primary use of this model will be to predict the probability of appendicitis when evaluated by a 'new' clinician with similar diagnostic ability to those who evaluated patients in this study. The Bayesian random intercept model is my preferred choice to provide predictions.
It is informative to plot predictions, conditional on gender and wbc, for three groups of patients: those with high gestalt clinical probability (priorprob = 90%, Figure 16), those with lower gestalt clinical probability (priorprob = 10%, Figure 17), and those with priorprob = 50% (Figure 18). For each sample from the posterior distribution of coefficients, a linear predictor η can be calculated for a specified design matrix.
An adaptation of the function bprobit.probs in the package LearnBayes is used to calculate predicted probabilities [4]. In each plot, the median and 95% credible interval of the posterior predictive distribution are plotted over a range of wbc.
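The calculation described above can be sketched as follows, using only the Python standard library. This is an illustrative analogue, not the adapted bprobit.probs function itself: it uses a logit rather than probit link, the posterior draws are simulated placeholders rather than MCMC output from the study, and all names (`predict_new_clinician`, `mu_alpha`, `sigma_doc`, `b_prior`, `b_wbc`, `b_male`) are hypothetical. For a 'new' clinician, an intercept is drawn from N(µ_α, σ_doc) within each posterior sample.

```python
import math
import random
import statistics


def logit(p: float) -> float:
    return math.log(p / (1.0 - p))


def inv_logit(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def predict_new_clinician(draws, priorprob, wbc, male, rng):
    """Return (2.5%, median, 97.5%) of the predicted probability for a new clinician."""
    probs = []
    for d in draws:
        # Intercept for a clinician not in the sample, drawn from the population.
        alpha_new = rng.gauss(d["mu_alpha"], d["sigma_doc"])
        eta = (alpha_new + d["b_prior"] * logit(priorprob)
               + d["b_wbc"] * wbc + d["b_male"] * male)
        probs.append(inv_logit(eta))
    # n=40 yields cut points at 2.5%, 5%, ..., 97.5%.
    cuts = statistics.quantiles(probs, n=40, method="inclusive")
    return cuts[0], statistics.median(probs), cuts[-1]


# Simulated stand-in for posterior draws (placeholder values only).
sim = random.Random(1)
draws = [{"mu_alpha": sim.gauss(-3.0, 0.5),
          "sigma_doc": abs(sim.gauss(0.5, 0.1)),
          "b_prior": sim.gauss(0.8, 0.15),
          "b_wbc": sim.gauss(0.15, 0.04),
          "b_male": sim.gauss(1.0, 0.4)} for _ in range(2000)]

# Boy with priorprob = 90%, evaluated over a range of wbc (thousands per microlitre).
for wbc in (5.0, 10.0, 15.0, 20.0):
    lo, med, hi = predict_new_clinician(draws, 0.90, wbc, 1, random.Random(2))
    print(f"wbc={wbc:4.1f}  median={med:.2f}  95% CrI=({lo:.2f}, {hi:.2f})")
```

Reusing the same seed for the new-clinician draws at each wbc value keeps the curve smooth across the range; a fresh seed per value would add Monte Carlo noise to the plotted interval.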
In the first plot, the predicted probability of appendicitis, with 95% credible interval, is plotted for boys (in blue) and girls (in pink) for a patient with a gestalt clinical predicted probability of 90% (priorprob = 90%). Note that precise, high probability predictions are possible only for boys. In the next plot, the predicted probability of appendicitis, with 95% credible interval, is plotted for boys (in blue) and girls (in pink) for a patient with a gestalt clinical predicted probability of 50% (priorprob = 50%). No precise high or low predicted probabilities are found for boys or girls over the range of wbc.

It is therefore surprising that most of the recent efforts to assist the clinician with the diagnosis of appendicitis rely only indirectly on probability models, and instead propose clinical decision rules.
Feinstein was among the first to point out the inadequacy of binary models for the clinical reality of three-zone diagnostic decisions [3]. The evaluation and management of children at risk for appendicitis is best framed as a decision problem with at least two thresholds: a lower threshold, below which further testing may be unnecessary, and an upper threshold, above which it may be most appropriate to remove the appendix [4]. Thus, decision makers need a probability model.
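The two-threshold framing can be made concrete with a short sketch that maps a predicted probability to one of three zones. The default threshold values (0.05 and 0.85) and the function name `decision_zone` are purely illustrative assumptions; the text does not specify thresholds, and appropriate values would depend on the costs and benefits for a given patient.

```python
def decision_zone(prob: float, lower: float = 0.05, upper: float = 0.85) -> str:
    """Map a predicted probability of appendicitis to a three-zone decision."""
    if not 0.0 <= prob <= 1.0:
        raise ValueError("prob must lie in [0, 1]")
    if not 0.0 <= lower < upper <= 1.0:
        raise ValueError("require 0 <= lower < upper <= 1")
    if prob < lower:
        return "no further testing"      # below the lower threshold
    if prob > upper:
        return "consider appendectomy"   # above the upper threshold
    return "further testing"             # intermediate zone, e.g. imaging


for p in (0.02, 0.40, 0.95):
    print(f"predicted probability {p:.2f}: {decision_zone(p)}")
```

A binary decision rule collapses these three zones into two, which is why a calibrated probability, rather than a positive/negative classification, is needed as the input.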
How best to create a probability model? One approach is to build on prior research, utilizing multiple dichotomous factors from the history, physical examination and screening laboratory tests. The practices of 'testimation' and stepwise selection have been routinely applied to small datasets. The result has been overfitted models which are unlikely to predict well in future patients. The deeply entrenched practice of dichotomizing continuous variables, and the recent enthusiasm for recursive partitioning, encourage 'dichotomania' and throw away much of the predictive information contained in continuous variables. In Chapter 2, the area under the ROC curve, an index of discrimination, is much greater for a logistic model with wbc as a continuous variable than for the PAS. In the process of developing a clinical factors model, it became apparent that it was important to include gender in the model. Surprisingly, I found that the presence of fever predicted a significantly LOWER risk of appendicitis for the children in this sample.
An alternative, and arguably preferable, approach is to take advantage of each clinician's ability to provide a 'gestalt' clinical probability estimate between zero and 100%. I attempted to specify a clinical factors model without regard to outcomes (except in preliminary exploratory plots). Nonetheless, internal validation plots with optimism estimated from bootstrap samples suggest that the smaller clinical probability model, adjusted for gender, may perform better in a new sample. Each patient arrives with a nuanced and idiosyncratic story, surely providing more information to an experienced clinician than a sum of binary variables.
Experience suggests that individual physicians may vary in their diagnostic ability or in how they use the probability scale.
Patients in clinical studies such as this are evaluated by a finite group of clinicians, and it is likely that probability estimates by a particular clinician will be more similar to one another than to those by a different clinician, introducing a within-physician correlation. Given the small number of patients evaluated by each doctor, fixed-effects estimates are very poorly determined. The generalized linear mixed models in Chapters 3 and 4 allow a variance component to be estimated. The models fit by maximum likelihood and Bayesian MCMC approaches suggest that a clinician-specific random intercepts model is adequate to account for clinician variation. The Bayesian models are particularly well suited for estimating the expected variability of the model predictions.
The conditional prediction plots in Chapter 4 suggest several clinical heuristics. One can feel most confident omitting imaging (CT or ultrasound) in girls with low clinical suspicion AND low white counts (Figure 17). Conversely, the best case for proceeding directly to the operating room can be made for boys with both high clinical suspicion and high white blood cell counts (Figure 16). When there is equipoise after the history and physical exam, imaging will be necessary, irrespective of wbc (Figure 18). When clinical probability and white blood count are discordant (one high, the other low), further evaluation should always be considered.
It is likely that there will be more false positive imaging studies in patients with low prior probability of appendicitis, and more false negative studies in patients with a high prior probability of appendicitis. Novel inflammatory markers are currently under active development. It is naive to expect that a single threshold value of a new marker will be equally useful in all patients. Rather, the appropriate threshold for a continuous marker will depend on the covariate pattern in a given patient.
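The dependence of a marker threshold on the covariate pattern can be illustrated with a short worked example. Under a logistic model with a linear marker effect, the marker value m* at which the predicted probability reaches a target p* satisfies logit(p*) = η_other + β_marker · m*, so m* = (logit(p*) − η_other) / β_marker. The sketch below assumes this model form; the coefficient, the target probability and the two values of η_other are hypothetical and chosen only to show the direction of the effect.

```python
import math


def logit(p: float) -> float:
    return math.log(p / (1.0 - p))


def inv_logit(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def marker_threshold(target_prob: float, eta_other: float, beta_marker: float) -> float:
    """Marker value at which the predicted probability equals target_prob."""
    if beta_marker == 0:
        raise ValueError("beta_marker must be non-zero")
    return (logit(target_prob) - eta_other) / beta_marker


# Hypothetical values: marker coefficient and an upper decision threshold.
beta_marker = 0.15
target = 0.85

# eta_other is the linear predictor contributed by all other covariates.
eta_low_suspicion = -3.0   # e.g. low gestalt probability
eta_high_suspicion = 0.5   # e.g. high gestalt probability

needed_low = marker_threshold(target, eta_low_suspicion, beta_marker)
needed_high = marker_threshold(target, eta_high_suspicion, beta_marker)
print(f"marker value needed, low suspicion:  {needed_low:.1f}")
print(f"marker value needed, high suspicion: {needed_high:.1f}")
```

With a positive marker coefficient, the patient with the lower linear predictor requires a much higher marker value to reach the same probability, which is the sense in which no single cut-off serves all patients equally.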
It is not easy to define clear probability thresholds. They should reflect costs (risks and benefits) and may differ between patients. Although I have treated appendicitis as binary, in reality the pathology ranges from pain due to obstruction of a hollow tube, to advanced peritonitis in ruptured appendicitis. Decision making is complicated by the fact that the emergency physician often is most focused on the lower threshold, and greatly regrets (and may be sued for) missing the early diagnosis of appendicitis. The surgeon must make the decision to remove the appendix, and has an interest in minimizing the number of patients without appendicitis who are taken to the operating room unnecessarily.
It is clear that future studies will require much larger sample sizes, and therefore multicenter investigations. Hierarchical models become particularly attractive in this setting, as one can add covariates at multiple levels. For example, clinician-level predictors might include level of training, years of experience, and type of training (surgical/pediatrics/emergency medicine). Hospital-level predictors might include patient volume and resource availability. Regional and national level predictors may also be relevant.
The Bayesian paradigm seems particularly relevant as it allows a specification of priors which reflect information from previous studies. Journal policies which encourage reproducible research and availability of data will facilitate incorporation of prior information.