AMS Copyright Notice © Copyright 1997 American Meteorological Society (AMS). Permission to use figures, tables, and brief excerpts from this work in scientific and educational works is hereby granted provided that the source is acknowledged. Any use of material in this work that is determined to be "fair use" under Section 107 or that satisfies the conditions specified in Section 108 of the U.S. Copyright Law (17 USC, as revised by P.L. 94-553) does not require the Society's permission. Republication, systematic reproduction, posting in electronic form on servers, or other uses of this material, except as exempted by the above statements, requires written permission or license from the AMS. Additional details are provided in the AMS Copyright Policies, available from the AMS at 617-227-2425 or email@example.com. Permission to place a copy of this work on this server has been provided by the AMS. The AMS does not guarantee that the copy provided here is an accurate copy of the published work.
An estimator of shrinkage based on information contained in a single sample is presented and the results of a simulation study are reported. The effects on the estimator of sample size, the amount and severity of nonrepresentative data in the population, the inclusion of noninformative predictors, and the choice of least (sum of) absolute deviations or least (sum of) squared deviations regression models are examined. A single-sample estimator of shrinkage based on drop-one cross-validation is shown to be highly accurate under a wide variety of research conditions.
Meteorologists have long recognized the importance of accurately quantifying statistical forecast skill. One of the primary tools of meteorological forecasting is multiple regression analysis (Murphy and Winkler 1984) where, given data on a response variable yi and associated predictor variables xij, where j = 1, ..., p; i = 1, ..., n; p denotes the number of predictors; and n represents the number of events, the goal is to find some function of the xij values that is an accurate and precise predictor of yi. It is generally recognized that any estimate of forecast skill grounded in a multiple regression model that is based on a sample of observations is characteristically higher than the forecast skill that would be obtained from a multiple regression model that is based on the entire population of observations (Mosteller and Tukey 1977; Picard and Cook 1984; Michaelsen 1987; Barnston and Van den Dool 1993). It is also widely accepted that the fit of the multiple regression model to new sample data is nearly always less precise than the fit of the same multiple regression model to the original sample data on which the model was based. This is reflected in lower forecast skill levels obtained when sample-based multiple regression models are used to predict future events.
It is useful to have elementary terms to distinguish between the fit of a multiple regression model to the sample data on which the model has been determined and the fit of the same multiple regression model to an independent sample of data. The former is termed “retrospective” fit and the latter is termed “validation” fit (Copas 1983 ). The term “shrinkage” denotes the drop in skill from retrospective fit to validation fit (Copas 1983 ) and indicates how useful the sample-based regression coefficients will be for prediction on other datasets. For purposes of clarification, shrinkage involves the following four-step procedure. First, a multiple regression model is fit to a sample dataset by optimizing the regression coefficients relative to a fitting criterion, for example, least squares. Second, the goodness of fit of the multiple regression model is measured by an index, such as a squared multiple correlation coefficient. Third, the obtained multiple regression model is applied to an independent sample dataset and a second goodness-of-fit index is obtained for the independent dataset. Fourth, a ratio of the two indices is constructed where the goodness-of-fit index from the original dataset is the denominator. This ratio is termed shrinkage since it is usually less than unity.
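The four-step procedure can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the computation used in the study: it uses a single predictor, a least-squares fit, and the index 1 - SSE/SST, which for a least-squares fit with an intercept equals the squared multiple correlation on the calibration sample. The function names `fit_line`, `skill`, and `shrinkage` and the toy data are hypothetical.

```python
def fit_line(x, y):
    """Step 1: least-squares intercept and slope for one predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def skill(y, y_hat):
    """Goodness-of-fit index 1 - SSE/SST (an illustrative choice)."""
    mean_y = sum(y) / len(y)
    sse = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    sst = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - sse / sst


def shrinkage(x_cal, y_cal, x_val, y_val):
    """Return (retrospective fit, validation fit, shrinkage ratio)."""
    b0, b1 = fit_line(x_cal, y_cal)                      # step 1: fit
    retro = skill(y_cal, [b0 + b1 * a for a in x_cal])   # step 2: retrospective fit
    valid = skill(y_val, [b0 + b1 * a for a in x_val])   # step 3: validation fit
    return retro, valid, valid / retro                   # step 4: ratio


# Hypothetical toy data, not taken from the study.
x_cal, y_cal = [0, 1, 2, 3], [1, 3, 2, 4]
x_val, y_val = [0, 1, 2, 3], [0, 3, 2, 3]
retro, valid, ratio = shrinkage(x_cal, y_cal, x_val, y_val)
print(retro, valid, ratio)
```

With these toy values the validation fit is lower than the retrospective fit, so the ratio falls below unity, which is the usual situation described above.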
Mielke et al. (1996) investigated the effects of sample size, type of regression model, and noise-to-signal ratio on the degree of shrinkage in five populations that differed in the amount and degree of contaminated data. Shrinkage was defined as the ratio of the validation fit of a sample regression equation to the retrospective fit of the same sample regression equation where the validation fit was assessed on five independent samples, averaged over 10000 simulations. While this index of shrinkage is both rigorous and comprehensive, its reliance on six independent samples (one calibration sample and five validation samples) precludes its use in routine research situations. In this paper, an estimate of shrinkage is developed that is based on a single sample and can easily be employed by research meteorologists. Comparisons with the index of shrinkage given by Mielke et al. (1996) indicate that the single-sample estimate of shrinkage is very accurate under a wide variety of conditions. The single-sample estimate of shrinkage is related to cross-validation methods that have become standard for assessing the predictive validity of forecast skill.
Historically, users of multiple regression procedures have developed methods to assess how accurately sample regression coefficients estimate the corresponding population regression coefficients. The usual procedure is to test the sample regression coefficients on an independent set of sample data. This practice has come to be known as “cross-validation.” A comprehensive historical background on cross-validation is provided by Stone (1974, 1978), Geisser (1975), Mosteller and Tukey (1977), and Snee (1977). Camstra and Boomsma (1992) present an extensive overview of the use of cross-validation in regression, where the emphasis is on the prediction of individual observations, and in covariance structure analysis, where the emphasis is on future values of variances and covariances. Michaelsen (1987) and Elsner and Schmertmann (1994) describe and discuss cross-validation methods as they pertain to meteorological forecasting.
It is widely recognized that to be useful, any sample regression equation must hold for data other than those on which the regression equation was developed. When sample data are used to determine the regression coefficients that best predict the response variable from the set of predictor variables, assuming that the variables to be used in the regression equation have already been selected, prediction performance is usually overestimated (Picard and Cook 1984). Because the sample regression coefficients are determined by an optimizing process that is conditioned on the sample data, the regression equation generally provides better predictions for the sample data on which it is based than for any other dataset. This is sometimes referred to as “testing on the training data” (Glick 1978). It should be noted that the use of cross-validation precludes any manipulation of the dataset prior to the development of the regression model and subsequent cross-validation.
In general, cross-validation consists of determining the regression coefficients in one sample and applying the obtained coefficients to the predictor scores of another sample. The initial sample is termed the “calibration” or “training” sample and the second sample is called the “validation” or “test” sample (Browne 1975a,b; Huberty et al. 1987; Camstra and Boomsma 1992; MacCallum et al. 1994). The calibration sample is used to calculate the regression coefficients, and the predictive validity of the fitted equation is verified on the validation sample.
As defined, cross-validation requires two samples. Because a second sample is often not readily available, an alternative approach is commonly used in which a large sample is randomly split into two subsamples. One subsample is specified as the calibration sample and the second subsample is designated the validation sample. The many problems associated with this approach to cross-validation are summarized in Lachenbruch and Mickey (1968), Picard and Cook (1984), and Picard and Berk (1990). Setting aside the obvious loss of information in splitting samples (Browne and Cudeck 1992), a significant problem is the difficulty in procuring large samples, which are not available in many research situations. In addition, when calibration sample sizes are small, the regression coefficients are less precise than those that would be obtained if the entire sample had been used (Horst 1966). Mosier (1951) suggested a double cross-validation procedure where the regression coefficients are calculated for both the calibration and validation samples and the two regression equations are cross-validated on the sample that was not used to establish the regression coefficients. Questions have been raised as to exactly what should be done when the results of the two cross-validations differ (Snee 1977). It has been suggested that if the two sets of regression coefficients are not too different, then a new set of coefficients may be obtained from the combined calibration and validation samples (Mosier 1951). While no estimate of predictive validity is available for the combined sample, Mosier (1951) posited that it may be approximated by the average of the predictive validities obtained for the original calibration and validation samples.
Cross-validation is certainly not limited to just two samples. The data can be divided into more than two samples and multiple cross-validations can be obtained. Multiple cross-validation involves partitioning an available sample of size n into a calibration sample of size n − k and a validation sample of size k. The cross-validation procedure is realized by withholding each validation sample of size k, calculating a regression model from the remaining calibration sample of size n − k, and validating each of the n!/[k!(n − k)!] possible regression models on the sample of size k held in reserve. Since k = 1 requires validating only n regression models, each on the single event held in reserve, this special case is both easily implemented and commonly used. In the literature, the case where k = 1 is variously termed drop-one cross-validation, leave-one-out cross-validation, hold-one-out cross-validation, or the U method. Stone (1978) provides a thorough review of drop-one cross-validation. Drop-one cross-validation is an exhaustive method involving substantial redundancy in the participation of each data point (far more redundancy when k > 1). However, the exhaustive features of drop-one cross-validation may provide a comprehensive evaluation of predictive accuracy and a solid estimate of predictive skill (Barnston and Van den Dool 1993).
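As a minimal sketch of the k = 1 case, the following Python code withholds each event in turn, refits a least-squares line to the remaining events, and predicts the withheld event. The single predictor, the index 1 - SSE/SST, the helper names, and the toy data are assumptions made for illustration; they are not taken from the studies cited above. The last line uses `math.comb` to count the models required for k = 1 and k = 2.

```python
from math import comb


def fit_line(x, y):
    """Least-squares intercept and slope for one predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def skill(y, y_hat):
    """Goodness-of-fit index 1 - SSE/SST (an illustrative choice)."""
    mean_y = sum(y) / len(y)
    sse = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    sst = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - sse / sst


def drop_one_predictions(x, y):
    """Predict each y[i] from a model fitted without event i."""
    preds = []
    for i in range(len(x)):
        b0, b1 = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        preds.append(b0 + b1 * x[i])
    return preds


# Hypothetical toy data.
x, y = [0, 1, 2, 3], [1, 3, 2, 4]
b0, b1 = fit_line(x, y)
retro = skill(y, [b0 + b1 * a for a in x])       # fit to the full sample
drop_one = skill(y, drop_one_predictions(x, y))  # drop-one fit
print(retro, drop_one)
print(comb(len(x), 1), comb(len(x), 2))  # models needed for k = 1 and k = 2
```

Only n models are fitted when k = 1, whereas the count grows combinatorially for larger k, which is one reason the drop-one case is so widely used.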
Drop-one cross-validation is usually credited to Lachenbruch (1967) or Lachenbruch and Mickey (1968). However, Toussaint (1974) has traced the drop-one method to earlier sources under different names (Glick 1978). Currently, the drop-one method is the cross-validation procedure of choice and it is not unusual to see the term cross-validation virtually equated with the drop-one method (e.g., Nicholls 1985; Livezey et al. 1990).
For many researchers, the method of choice for cross-validation is to create a model on one sample and test the model on a second sample drawn from the same population; alternatively, a model is created on a substantial portion of a sample and tested on the remaining portion of the sample. In either case, the selection of predictors can be based on information in the population or some other out-of-sample source, or the selection of predictors can involve subset selection based on in-sample information. In addition, the regression coefficients are nearly always based on information in the calibration sample. Much of the early work in cross-validation specifically limited analyses to fixed models where the number and variety of predictors is determined a priori and not based on subset selection (e.g., Browne 1975a,b; Camstra and Boomsma 1992; MacCallum et al. 1994). Thus, cross-validation in this context implies validation of the sample regression coefficients only. In those cases where subset selection is based on the sample information, cross-validation implies validation of the subset selection process and the sample regression coefficients.
The advent of double cross-validation brought additional complications. Given fixed predictors, the regression coefficients from each sample are tested on the other sample and any differences can be consolidated by some form of weighted averaging of the regression coefficients (Subrahmanyam 1972). However, given sample-based subset selection, there is the added complication that each sample will select a different number and/or a different set of predictors. It is much more difficult to resolve discrepancies between the two sample validation results. Browne (1970) provides results of random sampling experiments demonstrating the effects of not fixing the predictors beforehand. With drop-one cross-validation it is possible to conceive of up to n different but overlapping sets of predictors and up to n different values for the regression coefficients for each predictor. The satisfactory and optimal combining of these differences appears very difficult; see, for example, Browne and Cudeck (1989) and MacCallum et al. (1994).
Cross-validation is not without its critics and there is evidence that suggests some possible drawbacks to drop-one cross-validation. Glick (1978) and Hora and Wilcox (1982) provide simulation studies of drop-one cross-validation in discriminant analysis. Both studies indicate that the estimates have relatively high variability over repeated sampling, possibly due to the repeated use of the original data. The results of both Glick (1978) and Hora and Wilcox (1982) were based on discriminant analysis, which has a binary error function. Efron (1983) notes that cross-validation performs somewhat better given a smooth residual sum of squares error function. Finally, some investigators note that a model that fits the validation sample as well as the calibration sample is not necessarily a validated model. Maltz (1994) , for example, argues that cross-validation may only show that the procedure used to split the sample did, in fact, divide the sample into two similar subgroups.
If a specific sample dataset exhibits a high first-order autoregressive pattern, drop-one cross-validation may overestimate the validation fit. For example, if a single sample consists of cases selected from a time series, then the cases in a given cycle (e.g., a month, a year, or a decade) may be highly correlated. In such cases a drop-k cross-validation may be required to mitigate the cyclic pattern, where k exceeds the length of the cycle. Michaelsen (1987) has examined the influence of autoregressive effects on cross-validation in statistical climate forecast models.
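One way to organize a drop-k scheme for serially correlated cases is to withhold contiguous blocks of k cases at a time. The sketch below only generates the index sets; the contiguous-block layout and the function name `block_folds` are assumptions made for illustration and are not prescribed by the text above.

```python
def block_folds(n, k):
    """Yield (calibration, validation) index lists, withholding contiguous blocks of k."""
    for start in range(0, n, k):
        held_out = list(range(start, min(start + k, n)))
        kept = [i for i in range(n) if i < start or i >= start + k]
        yield kept, held_out


# Ten cases in time order, withheld four at a time (the last block is shorter).
folds = list(block_folds(10, 4))
for kept, held_out in folds:
    print(held_out, kept)
```

Choosing k larger than the cycle length keeps the cases most strongly correlated with a withheld case out of the calibration set.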
3. Statistical measures
Let the population and sample sizes be denoted by N and n, respectively, let yi denote the response variable, and let xi1, ..., xip denote the p predictor variables associated with the ith of n events. Consider the linear regression model given by
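A generic multiple linear regression model consistent with this notation can be written as follows. The symbols β0, ..., βp for the coefficients and ei for the errors are assumptions introduced here for illustration. The two fitting criteria shown are those named by the abbreviations LAD (least sum of absolute deviations) and LSD (least sum of squared deviations).

```latex
% Assumed generic form; \beta_j and e_i are illustrative symbols.
\[
  y_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} + e_i ,
  \qquad i = 1, \ldots, n .
\]
% LAD minimizes the sum of absolute errors; LSD minimizes the sum of squared errors.
\[
  \text{LAD: } \min_{\beta_0, \ldots, \beta_p} \sum_{i=1}^{n} \lvert e_i \rvert ,
  \qquad
  \text{LSD: } \min_{\beta_0, \ldots, \beta_p} \sum_{i=1}^{n} e_i^{2} .
\]
% Predicted value obtained from the estimated coefficients.
\[
  \hat{y}_i = \hat{\beta}_0 + \sum_{j=1}^{p} \hat{\beta}_j x_{ij} .
\]
```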
A measure of agreement is employed to determine the correspondence between the yi and ŷi values, for i = 1, ..., n. Many researchers have utilized measures of agreement in assessing prediction accuracy, for example, Willmott (1982), Willmott et al. (1985), Kelly et al. (1989), Tucker et al. (1989), Gray et al. (1992), McCabe and Legates (1992), Badescu (1993), Elsner and Schmertmann (1993), Hess and Elsner (1994), Cotton et al. (1994), and Lee et al. (1995). Watterson (1996) provides a comprehensive comparison of various measures of agreement.
In this simulation study, the measure of agreement for both the LAD and LSD prediction equations is given by
4. Data and simulation procedures
The present study investigates the accuracy and utility of a single-sample estimator of shrinkage. Also considered are the effects of sample size, type of regression model (LAD and LSD), and noise-to-signal ratio in five populations that differ in amount and degree of contaminated data. Sample sizes (n) of 15, 25, 40, 65, 100, 160, 250, and 500 events are obtained from a fixed population of N = 3958 events, which, for the purpose of this study, is not contaminated with extreme cases; a fixed population of N = 3998 events consisting of the initial population and 40 moderately extreme events (1% moderate contamination); a fixed population of N = 3998 events consisting of the initial population and 40 very extreme events (1% severe contamination); a fixed population of N = 4158 events consisting of the initial population and 200 moderately extreme events (5% moderate contamination); and a fixed population of N = 4158 events consisting of the initial population and 200 very extreme events (5% severe contamination). The 3958 available primary events used to construct each of the five populations used in this study consist of a response variable and p = 10 predictor variables. Specifics of the meteorological data used to construct these five populations are given in Mielke et al. (1996) .
Two prediction models are considered for each of the five populations. The first prediction model (case 10) consists of p = 10 independent variables, and the second prediction model (case 6) consists of p = 6 independent variables. In case 10, 4 of the 10 independent variables in the initial population of N = 3958 events were found to contribute no information to the predictions. Case 6 is merely the prediction model with the four noninformative independent variables of case 10 deleted. Both the case 10 and case 6 prediction models were constructed from the initial fixed population of N = 3958 events. The reason for the two prediction models is to examine the effect of including noninformative independent variables (i.e., noise) in a prediction
5. Findings and discussion
The results of the study are summarized in Tables 1–5. In each table, each row is specified by 1) a sample size (n), 2) p = 10 (case 10) or p = 6 (case 6) independent variables, and 3) LAD or LSD regression analysis. In each of the five tables the first column (C1) contains the true values for the designated population and the second column (C2) contains the average of 10000 randomly obtained sample estimates of the measure of agreement, where the values are based on the sample regression coefficients for each of the 10000 independent samples, that is, a measure of retrospective fit. The third column (C3) measures the effectiveness of validating sample regression coefficients. In this column the sample regression coefficients from 10000 random samples were first obtained from column C2, then for each of these 10000 sets of sample regression coefficients an additional five independent random samples of the same size (n = 15, ..., 500) were drawn from the population. The sample regression coefficients from C2 were then applied to each of the five new samples, and values were computed for each of these five samples for a total of 50000 values. The average of the 50000 values is reported in column C3, yielding a measure of validation fit. The fourth column (C4) contains the average of 10000 randomly obtained drop-one sample values where each of the values is based on the same sample data that yields one of the 10000 sample values composing the averages in column C2. Thus, each value in column C4 represents the average of n times 10000 values. The fifth column (C3/C2) contains the ratio of the average value of C3 to the corresponding value of C2, that is, the index of shrinkage. The sixth column (C4/C2) contains the ratio of the average value of C4 to the average value of C2, that is, the drop-one single-sample estimator of the shrinkage measured by C3/C2. The seventh column (C4/C3) contains the ratio of the average value of C4/C2 to the average value of C3/C2, that is, the ratio of the drop-one single-sample estimator of shrinkage to the index of shrinkage. The eighth column (C3/C1) contains the ratio of the validation fit of C3 to the corresponding true fit, measured by the population value given in C1. The values of columns C1, C2, C3, C3/C2, and C3/C1 are contained in Mielke et al. (1996).
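The arithmetic relating the four ratio columns to C1 through C4 can be made explicit with a short Python sketch. The function name `ratio_columns` and the input values are hypothetical; the numbers are not taken from the tables.

```python
def ratio_columns(c1, c2, c3, c4):
    """Derive the four ratio columns from the population value and three averages."""
    index = c3 / c2       # C3/C2: index of shrinkage
    estimator = c4 / c2   # C4/C2: drop-one single-sample estimator of shrinkage
    return {
        "C3/C2": index,
        "C4/C2": estimator,
        "C4/C3": estimator / index,  # estimator relative to the index
        "C3/C1": c3 / c1,            # validation fit relative to the true fit
    }


# Hypothetical values for one table row.
row = ratio_columns(c1=0.50, c2=0.60, c3=0.45, c4=0.42)
for name, value in row.items():
    print(f"{name}: {value:.3f}")
```

Because C2 cancels, dividing C4/C2 by C3/C2 gives the same value as dividing C4 by C3 directly.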
It should be noted in this context that both C3 and C4 are free from any selection bias. Selection bias occurs when a subset of predictor variables is selected from the full set of predictor variables in the population based on information contained in the sample. In this study, selection bias has been controlled by selecting the two sets of predictor variables (i.e., cases 10 and 6) from information contained in the population and not from information contained in any sample. Specifically, in the case of C3, the predictor variables were selected from information in the population, the regression coefficients were based on information contained in the sample for these (10 or 6) predetermined predictor variables, then the regression coefficients were applied to five new independent samples of the same size and drawn from the same population. This process was repeated for 10000 samples, producing 50000 values. Each C3 value is an average of these 50000 values. Thus, while there is an optimizing bias due to retrospective fit, there is no selection bias. In the case of C4, the predictor variables were again selected from information contained in the population and the regression coefficients were based on information contained in the sample, after dropping one observation. A value was calculated on the set of n − 1 y and ŷ values, and the procedure was repeated n times, dropping a different observation each time. The entire process was repeated for 10000 samples, producing n times 10000 values. Each C4 value is an average of these n times 10000 values. Thus, there is no selection bias. The advantage to this approach is that the optimizing bias can be isolated and examined while the selection bias is controlled. In addition, this approach is more conservative as validation fit is almost always better when subset selection is included (MacCallum et al. 1994). The drawback to this approach is that the results cannot be generalized to studies that selected both predictor variables and regression coefficients based on sample information and, in addition, shrinkage may be increased.
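The overall design (retrospective fit, validation fit on five fresh samples, and drop-one fit from the same sample) can be imitated on a small scale. The following Python sketch rests on several assumptions that differ from the study: a synthetic population with one predictor, least-squares fitting only, the index 1 - SSE/SST in place of the measure of agreement, sampling without replacement, 200 replications instead of 10000, and a drop-one value computed once per sample from the n withheld predictions. All names are hypothetical.

```python
import random


def fit_line(x, y):
    """Least-squares intercept and slope for one predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def skill(y, y_hat):
    """Goodness-of-fit index 1 - SSE/SST (an illustrative choice)."""
    mean_y = sum(y) / len(y)
    sse = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    sst = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - sse / sst


def draw(rng, population, n):
    """Draw n events and return separate predictor and response lists."""
    sample = rng.sample(population, n)
    return [a for a, b in sample], [b for a, b in sample]


def simulate(population, n, reps, seed=0):
    """Average retrospective (C2), validation (C3), and drop-one (C4) fit."""
    rng = random.Random(seed)
    c2 = c3 = c4 = 0.0
    for rep in range(reps):
        x, y = draw(rng, population, n)
        b0, b1 = fit_line(x, y)
        c2 += skill(y, [b0 + b1 * a for a in x])
        # Validation fit: the same coefficients applied to five fresh samples.
        for extra in range(5):
            xv, yv = draw(rng, population, n)
            c3 += skill(yv, [b0 + b1 * a for a in xv]) / 5
        # Drop-one fit: predict each event from the other n - 1 events.
        loo = []
        for i in range(n):
            d0, d1 = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
            loo.append(d0 + d1 * x[i])
        c4 += skill(y, loo)
    return c2 / reps, c3 / reps, c4 / reps


# Synthetic population (an assumption; the study used meteorological data).
gen = random.Random(1)
population = []
for event in range(2000):
    value = gen.gauss(0, 1)
    population.append((value, value + gen.gauss(0, 1)))

c2, c3, c4 = simulate(population, n=15, reps=200)
print(f"C2={c2:.3f}  C3={c3:.3f}  C4={c4:.3f}")
print(f"C3/C2={c3 / c2:.3f}  C4/C2={c4 / c2:.3f}")
```

The predictor set is fixed in advance in this sketch, so, as in the study, any difference between C2 and the other two averages reflects optimizing bias rather than selection bias.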
The ratio values in column C3/C2 in Tables 1–5 provide a comprehensive index of shrinkage that serves as a benchmark against which the accuracy of the drop-one single-sample estimator of shrinkage given in column C4/C2 can be measured. The ratio values in column C4/C3 were obtained by dividing the ratio values in column C4/C2 by the corresponding ratio values in column C3/C2. They provide the comparison ratio values by which the drop-one single-sample estimator of shrinkage is evaluated.
For each of the five populations summarized in Tables 1–5, the ratio values in column C4/C3 are close to unity for samples with n > 25. The few C4/C3 values that exceed 1.0 are probably due to sampling error. It should be noted that the C4/C3 ratios tend to be less than unity for the smaller sample sizes. When n ≤ 25, reductions from unity of the C4/C3 values are 4.5%–11% for the LAD regression model and 4.5%–15% for the LSD regression model in population 1. For populations 2–5, the corresponding reductions are 4%–10% (LAD) and 4.5%–14% (LSD), 3.5%–9.5% (LAD) and 3.5%–14% (LSD), 3%–6% (LAD) and 4.5%–11% (LSD), and 0%–5.5% (LAD) and 0%–11% (LSD), respectively. Thus, the drop-one single-sample estimator (i.e., C4/C2) is an excellent estimator of shrinkage (i.e., C3/C2), although it is conservative for very small samples. This conclusion holds for all sample sizes greater than n = 25, both cases (6 and 10), both regression models (LAD and LSD), and all five populations with differing degrees and amounts of data contamination.
Column C3/C1 summarizes, in ratio format, the validation fit (C3) to the true population value (C1). This is sometimes referred to as “expected skill” (Mielke et al. 1996). In general, the C3/C1 values indicate the amount of skill that is expected relative to the true skill possible when an entire population is available. More specifically, the C3/C1 values indicate the expected reduction in fit of the y and ŷ values for future events (Mielke et al. 1996). A C3/C1 value that is greater than 1.0 is cause for concern since this indicates that the sample regression coefficients provide a better validation fit, on the average, than would have been possible had the actual population been available.
Inspection of column C3/C1 in Table 1 reveals that the LSD regression model consistently performs better than the LAD regression model, case 10 has lower values than case 6, and the C3/C1 values increase with increasing sample size. Table 2, with 1% moderate contamination, yields a few C3/C1 values greater than 1.0 and they all appear with the LSD regression model. Table 3, with 1% severe contamination, shows the same pattern, but the C3/C1 ratio values are somewhat higher. Table 4, with 5% moderate contamination, continues the same motif and Table 5, with 5% severe contamination, contains C3/C1 values considerably greater than 1.0 for nearly every case. It is abundantly clear that with only a small amount of moderate or severe contamination, the LSD regression model produces inflated estimates of expected skill. The LAD regression model, based on absolute deviations about the median, is relatively unaffected by even 1% severe contamination, but the LSD regression model, based on squared deviations about the mean, systematically overestimates the validation fit and yields inflated values of expected skill (i.e., C3/C1).
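The contrast between deviations about the median and deviations about the mean can be seen in the simplest possible model, one with an intercept only. In that case the LSD estimate is the sample mean and the LAD estimate is a sample median. The Python sketch below uses hypothetical data and a hypothetical function name, `location_fits`, to show how a single extreme event moves each estimate.

```python
from statistics import mean, median


def location_fits(values):
    """Intercept-only fits: LSD gives the mean, LAD gives the median."""
    return mean(values), median(values)


clean = [1, 2, 3, 4, 5]
contaminated = clean + [100]  # one hypothetical extreme event

lsd_clean, lad_clean = location_fits(clean)
lsd_cont, lad_cont = location_fits(contaminated)
print(f"clean:        LSD={lsd_clean:.2f}  LAD={lad_clean:.2f}")
print(f"contaminated: LSD={lsd_cont:.2f}  LAD={lad_cont:.2f}")
```

The extreme event shifts the LSD estimate far more than the LAD estimate, which is the sensitivity that underlies the pattern in the C3/C1 values.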
Since C3 (validation fit) values and C4 (drop-one single-sample validation fit) values are essentially the same for all five populations, both cases, both regression models, and all sample sizes, it is readily apparent that C4/C1 ratios would be nearly identical to the C3/C1 ratios in Tables 1–5. Consequently, caution should be exercised in using drop-one estimators with the LSD regression model as they will likely provide inflated estimates of validation fit when contaminated data are present. Because the drop-one estimate of shrinkage is equivalent to drop-one cross-validation, the same caution applies to drop-one cross-validation with an LSD regression model.
While it is abundantly evident that LSD regression systematically overestimates validation fit, the reason for the optimistic C3/C1 values is not as manifest. It is obvious that the inflated estimates of expected skill for LSD regression in Tables 1–5 are systematically related to sample size, with larger sample sizes associated with C3/C1 values in excess of 1.0. This is probably due to a moderately or severely contaminated population event occurring in a single sample. Very small samples (e.g., n = 15) are not likely to include a contaminated event, whereas very large samples (e.g., n = 500) are much more likely to include one or more contaminated events. Table 6 provides the probability that no contaminated population event belongs to a single sample for both 1% and 5% contamination. With 1% moderate or severe contamination in the population this probability is given by (3958/3998)^n, and with 5% moderate or severe contamination in the population it is given by (3958/4158)^n. For 1% moderate or severe contamination, the probability of selecting no contaminated events from the population is greater than 0.50 for samples of size n ≤ 65. For 5% moderate or severe contamination, the probability of selecting no contaminated events from the population never exceeds 0.50. Given the well-known sensitivity of LSD regression to extreme events, it is not surprising that LSD regression yields optimistic levels of expected skill for larger samples that are more likely to contain one or more moderately or severely contaminated events. It should be noted in Table 5 that neither LSD nor LAD regression is able to accommodate 5% severe contamination.
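The two probability expressions are easy to evaluate for the eight sample sizes used in the study. The Python sketch below applies the formulas exactly as stated, which treat the n draws as independent; the function name `prob_no_contamination` is hypothetical.

```python
def prob_no_contamination(n, clean=3958, total=3998):
    """Probability that none of n independent draws is a contaminated event."""
    return (clean / total) ** n


sizes = [15, 25, 40, 65, 100, 160, 250, 500]
for n in sizes:
    p1 = prob_no_contamination(n, total=3998)  # 1% contamination
    p5 = prob_no_contamination(n, total=4158)  # 5% contamination
    print(f"n={n:3d}  1%: {p1:.3f}  5%: {p5:.3f}")
```

For 1% contamination the probability stays above 0.50 through n = 65 and drops below it at n = 100, while for 5% contamination it is already below 0.50 at n = 15.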
The single sample estimate of shrinkage, C4, is higher for 6 predictors than for 10 predictors in Table 1 with LAD regression and n 160 and with LSD regression and n 250, in Table 2 with LAD regression and n 250, and with LSD regression and n 40. The same relationship holds for both LAD and LSD regression in Table 3 with n 40 and in Tables 4 and 5 with n 25. These results are consistent with the influence of contamination since when n is small, the influence of additional noninformative predictors is mitigated because the probability of selecting a contaminated event in each sample is reduced. Clearly, regression models containing noninformative predictors should be avoided (Browne and Cudeck 1992 ).
The standard deviations of the 10000 values composing C2, SD(|C2), and the standard deviations of the 10000 drop-one values composing C4, SD(|C4), are given for each sample size (n = 15, ..., 500), case (10 and 6 predictors), and regression model (LAD and LSD) combination in Tables 7, 8, 9, 10, and 11, which correspond to the five contamination levels of Tables 1, 2, 3, 4, 5 , respectively. In particular,
Mielke et al. (1996) investigated the effects of sample size, type of regression model, and noise-to-signal ratio on the degree of shrinkage in five populations containing varying amounts and degrees of data contamination. Shrinkage was measured as the ratio of the validation fit of a sample-based regression model to the retrospective fit of the same regression model where the validation fit was assessed on five independent samples from the same population. While the Mielke et al. (1996) index of shrinkage is both rigorous and comprehensive, it involves an additional five independent samples and thus is not useful in routine applications. In this paper a drop-one single-sample estimator of shrinkage is developed and evaluated on the same dataset used by Mielke et al. (1996) . The drop-one single-sample estimator provides an accurate estimate of shrinkage for the five populations, both regression models, both cases, and all sample sizes, although the estimator is slightly conservative for very small sample sizes.
Finally, a caution is raised because the drop-one single-sample estimate of shrinkage is, in fact, an estimate of shrinkage. There is evidence that the drop-one method provides inflated estimates of validation fit for the LSD regression model when the population data are contaminated by extreme values, e.g., populations 1–4 in Tables 1, 2, 3, 4. In population 5 (Table 5) with 5% severe contamination, both the LSD and LAD regression models provide estimates of validation fit that are too high.
Acknowledgments. This study was supported by National Science
Badescu, V., 1993: Use of Willmott’s index of agreement to the validation of meteorological models. Meteor. Mag., 122, 282–286.
Barnston, A. G., and H. M. Van den Dool, 1993: A degeneracy in cross-validated skill in regression-based forecasts. J. Climate, 6, 963–977.
Browne, M. W., 1970: A critical evaluation of some reduced-rank regression procedures. Research Bulletin 70-21, Educational Testing Service, Princeton, NJ.
——, 1975a: Predictive validity of a linear regression equation. Br. J. Math. Statist. Psychol., 28, 79–87.
——, 1975b: A comparison of single sample and cross-validation methods for estimating the mean squared error of prediction in multiple linear regression. Br. J. Math. Statist. Psychol., 28, 112–120.
——, and R. Cudeck, 1989: Single sample cross-validation indices for covariance structures. Mult. Behav. Res., 24, 445–455.
——, and ——, 1992: Alternative ways of assessing model fit. Sociol. Meth. Res., 21, 230–258.
Camstra, A., and A. Boomsma, 1992: Cross-validation in regression and covariance structure analysis. Sociol. Meth. Res., 21, 89–115.
Copas, J. B., 1983: Regression, prediction, and shrinkage. J. Roy. Statist. Soc., 45B, 311–354.
Cotton, W. R., G. Thompson, and P. W. Mielke, 1994: Real-time mesoscale prediction on workstations. Bull. Amer. Meteor. Soc., 75, 349–362.
Efron, B., 1983: Estimating the error rate of a prediction rule: Improvement on cross-validation. J. Amer. Statist. Assoc., 78, 316–331.
Elsner, J. B., and C. P. Schmertmann, 1993: Improving extended-range seasonal predictions of intense Atlantic hurricane activity. Wea. Forecasting, 8, 345–351.
——, and ——, 1994: Assessing forecast skill through cross-validation. Wea. Forecasting, 9, 619–624.
Geisser, S., 1975: The predictive sample reuse method with applications. J. Amer. Statist. Assoc., 70, 320–328.
Glick, N., 1978: Additive estimators for probabilities of correct classification. Pattern Recog., 10, 211–222.
Gray, W. M., C. W. Landsea, P. W. Mielke, and K. J. Berry, 1992: Predicting Atlantic seasonal hurricane activity 6–11 months in advance. Wea. Forecasting, 7, 440–455.
Hess, J. C., and J. B. Elsner, 1994: Extended-range hindcasts of tropical-origin Atlantic hurricane activity. Geophys. Res. Lett., 21, 365–368.
Hora, S. C., and J. B. Wilcox, 1982: Estimation of error rates in several-population discriminant analysis. J. Marketing Res., 19, 57–61.
Horst, P., 1966: Psychological Measurement and Prediction. Wadsworth, 455 pp.
Huberty, C. J., J. M. Wisenbaker, and J. C. Smith, 1987: Assessing predictive accuracy in discriminant analysis. Mult. Behav. Res., 22, 307–329.
Kelly, F. P., T. H. Vonder Haar, and P. W. Mielke, 1989: Imagery randomized block analysis (IRBA) applied to the verification of cloud edge detectors. J. Atmos. Oceanic Technol., 6, 671–679.
Lachenbruch, P. A., 1967: An almost unbiased method of obtaining confidence intervals for the probability of misclassification in discriminant analysis. Biometrics, 23, 639–645.
——, and M. R. Mickey, 1968: Estimation of error rates in discriminant analysis. Technometrics, 10, 1–11.
Lee, T. J., R. A. Pielke, and P. W. Mielke, 1995: Modeling the clear-sky surface energy budget during FIFE 1987. J. Geophys. Res., 100, 25585–25593.
Livezey, R. E., A. G. Barnston, and B. K. Neumeister, 1990: Mixed analog/persistence prediction of seasonal mean temperatures for the USA. Int. J. Climatol., 10, 329–340.
MacCallum, R. C., M. Roznowski, C. M. Mar, and J. V. Reith, 1994: Alternative strategies for cross-validation of covariance structure models. Mult. Behav. Res., 29, 1–32.
Maltz, M. D., 1994: Deviating from the mean: The declining significance of significance. J. Res. Crime Delinq., 31, 434–463.
McCabe, G. J., and D. R. Legates, 1992: General-circulation model simulations of winter and summer sea-level pressures over North America. Int. J. Climatol., 12, 815–827.
Michaelsen, J., 1987: Cross-validation in statistical climate forecast models. J. Climate Appl. Meteor., 26, 1589–1600.
Mielke, P. W., K. J. Berry, C. W. Landsea, and W. M. Gray, 1996: Artificial skill and validation in meteorological forecasting. Wea. Forecasting, 11, 153–169.
Mosier, C. I., 1951: Symposium: The need and means of cross-validation, I. Problems and designs of cross-validation. Educ. Psych. Meas., 11, 5–11.
Mosteller, F., and J. W. Tukey, 1977: Data Analysis and Regression. Addison-Wesley, 586 pp.
Murphy, A. H., and R. L. Winkler, 1984: Probability forecasting in meteorology. J. Amer. Statist. Assoc., 79, 489–500.
Nicholls, N., 1985: Predictability of interannual variations of Australian seasonal tropical cyclone activity. Mon. Wea. Rev., 113, 1144–1149.
Picard, R. R., and R. D. Cook, 1984: Cross-validation of regression models. J. Amer. Statist. Assoc., 79, 575–583.
——, and K. N. Berk, 1990: Data splitting. Amer. Statist., 44, 140–147.
Snee, R. D., 1977: Validation of regression models: Methods and examples. Technometrics, 19, 415–428.
Stone, M., 1974: Cross-validatory choice and assessment of statistical predictions. J. Roy. Statist. Soc., 36B, 111–147.
——, 1978: Cross-validation: A review. Math. Operationsforsch. Statist., Ser. Statistics, 9, 127–139.
Subrahmanyam, M., 1972: A property of simple least squares estimates. Sankhya, 34B, 355–356.
Toussaint, G. T., 1974: Bibliography on estimation of misclassification. IEEE Trans. Inf. Theory, 20, 472–479.
Tucker, D. F., P. W. Mielke, and E. R. Reiter, 1989: The verification of numerical models with multivariate randomized block permutation procedures. Meteor. Atmos. Phys., 40, 181–188.
Watterson, I. G., 1996: Nondimensional measures of climate model performance. Int. J. Climatol., 16, 379–391.
Willmott, C. J., 1982: Some comments on the evaluation of model performance. Bull. Amer. Meteor. Soc., 63, 1309–1313.
——, S. G. Ackleson, R. E. Davis, J. J. Feddema, K. M. Klink, D. R. Legates, J. O’Donnell, and C. M. Rowe, 1985: Statistics for the evaluation and comparison of models. J. Geophys. Res., 90, 8995–9005.
affiliation: NOAA/AOML/Hurricane Research Division, Miami, Florida.
Corresponding author address: Dr. Paul W. Mielke Jr., Department of Statistics, Colorado State University, Fort Collins, CO 80523-1877.