Descriptive analyses were carried out initially to identify general trends and associations in both the health outcome and exposure data. Temporal trends in the health outcome data were investigated and a profile of each data source was created to determine the proportion of cases within each age group and for each water service area. The distributions of primary ICD-9 codes captured in the case definition were also determined. Preliminary statistics were determined for the water quality and environmental parameters. Comparisons were made between the two water treatment plants to determine if general differences in water quality existed.
The primary objective of this portion of the analysis was to compare the risk of gastroenteritis for residents within each of the water service areas in Edmonton. Results from the descriptive analyses demonstrated that some differences in water quality existed between the two water treatment plants. All individuals with a valid Edmonton postal code who could be linked to the Rossdale or E.L. Smith water service areas, as well as individuals residing in the mixed zone, were included in this portion of the analysis. Infants (< 2 years) were excluded from this investigation since it was hypothesised that their intake of tap water in the home was likely minimised due to neonatal feeding practices, and thus not compatible with the remaining study population with respect to exposure patterns.
A case-control study design was used for this analysis. The modeled binomial response was case status: gastroenteritis case or respiratory control. Respiratory controls were used to help control for the potential influences of environmental factors on health outcome. The effect of modeled variables on health outcome was given by odds ratios. Each health outcome data set for the multivariate logistic regression portion of the analysis contained binary health outcomes, and each data set was modeled independently. The components of these analyses are summarised in Table 4.
A generalised linear model (GLM) with a binomial distribution and logit link, was the underlying statistical model used to analyse these data (McCullagh and Nelder, 1983). The resulting multivariate logistic modeling process was carried out using the GENMOD procedure in SAS, version 8.0. Separate models were evaluated for each of the four health outcome data sources.
Due to the enormity of the data originating from the physician-office billing data source, a random subset of the data was taken to facilitate analysis. A 1:3 ratio of cases to controls was formed by randomly selecting 20% of original cases. This randomised selection process was carried out using a random number function (RANUNI) in SAS, version 8.0.
Spatial regression models were subsequently carried out to provide a visual representation of the geographic differences in risk of gastroenteritis within the city of Edmonton. A binomial generalised additive model (GAM) was utilised to analyse the same health outcome data (Hastie and Tibshirani, 1990 ). GAM models are an extension of GLM models, in that a flexible additive non-linear relationship may be modeled between the independent predictor (risk factor) and the response. The response is modeled as a sum of smooth functions in the predictors, where these functions are estimated using smoothers (a curve is fit to the data points locally, so that at any point, the curve depends only on the observations at that point, and some specified neighbouring points - this estimate of the response is referred to as a smooth, and procedures for producing such fits are called smoothers). Model fit is improved through the flexible parameterisation of variables in GAM models, and parameter estimates may be derived with greater accuracy.
GAM models were used for this spatial component of the analysis to accommodate for the inclusion of geographical co-ordinates in the modeling process. The location of cases and controls was determined by using their corresponding 6-digit residential postal code. Longitude and latitude coordinates assigned to the centroid point within each postal code area were incorporated into the loess smoothing function within the GAM model (Cleveland and Devlin, 1988). This method of fitting the location effect directly into the statistical model as a non-linear parameter has also been applied in other studies where the objective has been to detect spatial clusters (Preisler et al, 1997; Brillinger, 1994; Chambers and Hastie, 1992; Cook and Pocock, 1983). The gam function in S-PLUS 2000, release 2 (MathSoft, Inc.) was used to carry out this spatial component of the multivariate logistic regression analysis, as smoothing functions are easily accommodated in this analytical software package.
Similar to the GLM analysis, a random subset consisting of 20% of the original cases captured in the physician-office data source wa s created. Due to the computational intensity associated with fitting a loess smoother to the geographical co-ordinates, controls were selected to satisfy a 1:1 ratio of cases to control s using the same procedure described in the previous section.
A list of variables investigated in both of the multivariate logistic regression processes is provided in Table 5.
To estimate the risk of gastroenteritis for residents within each of the water service areas, a categorical variable was created to reflect the primary water source in the generalised linear modeling. Levels of this variable, SOURCE were:
The risks associated with each of the water service areas were distinguished between when the Rossdale intake pipe was moved, which coincided with the introduction of particle counters in this plant (December 10, 1997). This distinction was made since raw water quality from the Rossdale plant changed following these events. Statistical comparisons in risk among SOURCE levels were carried out using contrasts. Since residential location is a reflection of the quality of water received at home, none of the water quality nor environmental parameters associated with the time series analysis were included in these analyses.
Since SAS does not easily accommodate loess smoothing functions in regression models, a separate spline was created for each 2-month period, so that different slopes (risk estimates) could be determined for each time interval, and thus control for seasonal trends in viral gastroenteritis. This technique is used when it can be expected that the effects vary across different levels of a variable. Comparisons between models using splines in SAS and the loess smoothing function in S-PLUS to adjust for seasonal trends, produced similar results (data not provided).
Household income was used as an indicator of socio-economic status, which was hypothesised to influence the risk of illness. The average household income reported in 1995 for each enumeration area (Statistics Canada, 1996) was linked to each postal code, and subsequently to each case and control. This variable was included in all models.
Age was fit as a categorical variable in these models to adjust for potential differences in risk between age groups. Thus, an averaged estimate of risk for residents of different water service areas was obtained. Categories for age were identical to the levels of stratification used in the time series analysis: 2 to 18 years, greater than 18 to 65 years, and greater than 65 years.
A loess smoother was fit to the longitude and latitude co-ordinates of the centroid points for each postal code. This independent (risk factor) variable was used to derive estimates of the risk of gastroenteritis for various locations within Edmonton. A categorical term, DEC1097 was also modeled to represent the risk prior to and after the relocation of the Rossdale intake pipe. The inclusion of these variables in the spatial model replaced the categorical term SOURCE used in the GLM modeling process.
Since S-PLUS easily accommodates smoothing functions, seasonal fluctuations in gastroenteritis were also modeled using the loess smoothing function. Average household income and age were also included in these spatial models.
For the GLM models, step-wise model selection based on the variables listed in Table 5 was conducted for each health outcome data set. The effect of each variable was evaluated by using the likelihood ratio test (Fahrmeir and Tutz, 1994). Criteria for inclusion in the final model was set at the 5% level of statistical significance. Model-fit was examined using Akaike's Information Criterion (Sakamoto et al, 1986). Variables in the spatial model (GAM) were chosen to reflect the variables identified in the final multivariate logistic regression model.
Similar to the Vancouver study, generalised additive models (GAM) were used to investigate temporal relationships between water quality and gastroenteritis. This approach has also been applied in other time series studies (Morris et al, 1998; Schwartz et al, 1997). The application of this methodology for time series investigations has been described in detail in the Vancouver study (Aramini et al, 2000). S-PLUS 2000, release 2 (MathSoft, Inc.) was used to carry out this portion of the analysis.
A loess smoothing function (Cleveland and Devlin, 1988) was used to describe the potentially non-linear relationship between the risk of gastroenteritis and turbidity, along with other water quality and environmental parameters described in sections 2.1 and 2.2. In addition, the smoothing function was also applied to fit a seasonal (long-term) parameter in an attempt to control for seasonal trends in gastroenteritis. A 220-day cycle, which was employed and described in the Vancouver study, was used to represent this seasonal trend.
To determ ine the influence of water quality from each plant on the corresponding serviced population, individuals from each of the four data sources were analysed separately, based on the primary water source. Individuals whose primary water source could not be uniquely identified (mixed zone - shaded grey area in Figure 2) were excluded from this analysis. By matching the date of health event to the date of the recorded water quality parameter, daily water quality values observed for each plant were then linked to the appropriate individuals.
As explained in the Vancouver study, both binomial (case-control) and poisson GAM models (Hastie and Tibshirani, 1990) were fit to the data. In the binomial models, the modeled outcome was case status: gastroenteritis case or respiratory control. The outcome for the poisson models was daily count of gastroenteritis cases. The effect of modeled variables on gastroenteritis was given by odds ratio and relative risks for the binomial and poisson models, respectively.
Since susceptibility to illness varies across different age groups, separate analyses were conducted for each of the following age groups: 2 to 18 years, greater than 18 to 65 years, and greater than 65 years. These age groups are identical to those used in the Vancouver study, with the exception that infants (< 2 years) were excluded from this investigation. Therefore, each health outcome data set used in this analysis was specific to each age group, water service area, and data source (refer to Table 6).
In the binomial analysis, a randomised subset of cases and controls were selected from the physician-office data, using a 1:3 ratio of cases to controls. This was the same subset described in the GLM modeling approach. Physician -office data were not reduced for the poisson analysis since gastrointestinal outcomes were collapsed into daily counts, and the resulting smaller data set was therefore easily processed by the software application.
A list of the independent (risk factor) variables assessed in the time series analysis is provided in Table 7. Time series variables that were lagged are denoted with subscript i. A description of some of the key variables is provided below.
Separate analyses were conducted for each time series health outcome data set. For the water quality and environmental parameters, values from 0 to 40 days prior to the day of the health outcome event were modeled. This range of lagged values was selected to reflect multiples of the incubation periods commonly reported for prevalent waterborne pathogens. Therefore, for each time series health outcome data set and combination of variables analysed, 41 models were evaluated (one for each lag day).
Finished water turbidity (TB i) was the primary variable of interest for this analysis. However, the effects of other water quality parameters were also evaluated. Five-minute readings that were provided for some of the water quality parameters were summarised as daily observed values, including the mean, median, and maximum reading for that day. Statistical comparisons among models with the various parameterisations of these data, indicated that the daily mean provided the best fit to the data. Therefore, all 5-minute readings were summarised as daily means. Rare occurrences of missing values, which were occasionally the result of off-line filters, were replaced with the mean of adjacent observed values.
Environmental parameters were also examined. In addition to the effect of weather on raw water turbidity, precipitation was also hypothesised to influence the level of exposure at home. It was hypothesised that people are more likely to stay indoors as a result of inclement weather, facilitating person-to-person spread of infectious gastroenteritis. Therefore, precipitation and temperature lagged up to 40 days prior to the day of the health outcome event were similarly investigated.
Temporal confounders were included in the models to adjust for temporal variations, including the seasonal parameter. Day-of-week effects were significant in the Vancouver study and were similarly included in this study. Statutory holidays were also considered to determine if changes in the accessibility of medical services, as well as holiday behaviours, had an impact on gastroenteritis. To that end, separate categories were created for the following holidays and holiday weekends:
In addition, the week following each holiday event was also assessed (with the exception of Remembrance Day).
Finally, an autoregressive term was included in all of the models to adjust for the correlation arising from daily observations made of each health outcome event. As discussed in the Vancouver study, even in the absence of a relationship with water quality, the number of hospital admissions, physician visits, emergency room visits, and treatments provided in a long-term care facility on any given day can be expected to be related to the number the day before. Several possible reasons for this include persons sharing a common food or water source, and variable pathogen incubation times. The autoregressive term was expressed as a ratio of the number of cases to the number of controls, and was fit as a linear term in the model.
Based on the descriptive analyses and on the multivariate logistic regression results, quantitative time series analysis was conducted for the water service area that represented the highest potential risk of endemic waterborne gastroenteritis (Rossdale, prior to December 10, 1997). Similar to the multivariate logistic regression, the effect of each variable listed in Table 7 was determined by conducting deviance comparisons using the likelihood-ratio test, and comparing AIC values and parameter estimates. Significant lags were identified by comparing the change in deviance in models with and without the aggedl variable, thereby evaluating the overall effect of that lagged variable. However, multiple statistical comparisons increase the probability of erroneously detecting a statistically significant association, commonly referred to as Type I error (Steel et al, 1997). Therefore, when evaluating the effect of parameters up to 40 days prior to the health outcome event, only lags that were significant over two to three consecutive days were further assessed for consistency among the different age groups and health outcome data sources.