BIOSTATISTICS
SPRING 2001 ASSIGNMENTS


Assignment 1
Assignment 2
Assignment 3
Assignment 4
Assignment 5
Assignment 6
Assignment 7
Assignment 8
Assignment 9
Assignment 10
Assignment 11
Assignment 12

Assignment 1 - Getting organized
Due date: February 2, 2001

Introduction
Since this course depends on your datasets, your first assignment is to get your data together in a usable form. Even if your dataset isn't complete (if, for example, your senior project is still in progress), you should have at least some data by the end of January. For this first class, therefore, you are expected to bring organized data and appropriate metadata, and to be prepared to describe them in a presentation of about 10 minutes. Note that such a description requires a hypothesis.

Some definitions
Hypothesis - what's your question? You can't organize your data for exploration or analysis if you don't have a question that you want your data to answer, or a hypothesis that you're interested in testing with your data. This is not a course in data dredging.
Organized data - your data should be organized in a spreadsheet. Since we'll be using IBM computers throughout the semester, please enter your data into an IBM-compatible spreadsheet (such as Excel, Lotus, or QuattroPro). The lab computers all have Excel 97 loaded on them. As a rule, each case (individual observation) is entered as a single row, and each variable that you measured for the cases is entered in its own column. Be sure to label your columns. If you need additional descriptive materials to help you organize your data, please read the extra on-line notes on Data Management or read the first few chapters (pp. 8-37) of Sokal and Rohlf (1995).
Metadata - a description of your data structure. What's the filename? Who entered the data? How were the data checked for accuracy during data entry? What do the rows and columns signify? What do variable names (usually 8 or fewer characters) actually mean? In what format are your data entered (e.g., integers, real numbers, alphanumeric characters)? When were your data collected? How? What are their precision and accuracy (be sure that you know the difference)? Metadata include any additional information that another user might find helpful if she were to access your data from a databank and then try to re-analyze them.

What you turn in (bring it to class)
1. A written statement of your hypothesis/question, with sufficient background information that someone else, who knows little about your system, could understand your question. This statement should not exceed one word-processed page (1" margins, 12 point type).

2. A print-out of your dataset, with all rows and columns labeled. If the data exceed one printed page (normal type size), make sure that the print-out carries over row and column labels onto subsequent pages. Excel can do this, as can most other spread-sheet packages.

3. A written description of your dataset (i.e., the metadata). The metadata should be as long as necessary (1" margins, 12 point type).

4. A copy of your data and metadata (as MS Excel and MS Word files, respectively) on a high-density (1.44 MB), PC-formatted, virus-free 3½" diskette.

References
Sokal, R. R., and F. J. Rohlf. 1995. Biometry. W. H. Freeman and Company, New York, New York, USA (on reserve in the library)



Assignment 2 - Exploratory Data Analysis
Due date: February 9, 2001

Introduction
Data exploration (aka exploratory data analysis, or EDA) and data display are fundamental parts of data analysis. EDA is used to summarize your data, to visualize patterns in your data, and to refine your hypotheses, while data display presents those patterns to others. While EDA is often done 'on the fly', with low-resolution graphics or print-outs, data display calls for 'presentation-quality' graphics, analogous to what you'd see in a standard scientific presentation, and is what should end up in your thesis or independent project. For this week, you will do some background reading in EDA and data display, and you will begin to summarize and illustrate your data.

Perhaps the most common and convenient way to summarize data is to report measures of location, spread (error), and confidence. Examples of measures of location are the mean, trimmed mean, median, and mode; examples of measures of spread are the standard deviation, standard error, variance, percentiles, range, and coefficient of variation; and confidence is usually expressed as a k% confidence interval or k% prediction interval. If your data fall into obvious groups (treatments), then summaries are usually reported for each group.
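As a rough illustration of the measures listed above, here is a minimal sketch in Python (not S-Plus, which is what you should use for the assignment). The function name `summarize`, the group names, and all of the numbers are made up for the example. The confidence interval uses a normal approximation, which is an assumption of this sketch; with small samples an interval based on the t distribution would be somewhat wider.

```python
import math
import statistics
from statistics import NormalDist

def summarize(values, confidence=0.95):
    """Return common measures of location, spread, and confidence for one group."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)      # sample standard deviation (n - 1 denominator)
    se = sd / math.sqrt(n)             # standard error of the mean
    # Normal-approximation critical value (an assumption of this sketch).
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return {
        "n": n,
        "mean": mean,
        "median": statistics.median(values),
        "sd": sd,
        "se": se,
        "variance": statistics.variance(values),
        "range": (min(values), max(values)),
        "cv_percent": 100 * sd / mean,
        "ci": (mean - z * se, mean + z * se),
    }

# Hypothetical measurements for two treatment groups (made-up numbers).
groups = {
    "control": [4.1, 5.0, 4.7, 5.3, 4.9, 5.6],
    "treated": [6.2, 5.8, 6.9, 7.1, 6.4, 6.6],
}

# Summaries are reported separately for each group.
summaries = {name: summarize(vals) for name, vals in groups.items()}
for name, s in summaries.items():
    print(name, round(s["mean"], 2), round(s["sd"], 2),
          [round(x, 2) for x in s["ci"]])
```

Note that the standard deviation describes the spread of the observations, while the standard error and the confidence interval describe the uncertainty in the estimated mean.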

Summary statistics can be reported in tabular form ('Tables') or graphic form ('Figures'). While tables are more precise, figures are usually more compelling.

Required reading (read before coming to class)
Selvin, S. 1998. Modern applied biostatistical methods using S-Plus, chapter 2. Oxford University Press, New York, USA.
Ellison, A. M. 1993. Exploratory data analysis and graphic display. Pages 14-45 in S. M. Scheiner & J. Gurevitch, editors. Design and analysis of ecological experiments. Chapman and Hall, New York, USA.

Recommended reading
Cleveland, W. S. 1993. Visualizing data. Hobart Press, Summit, New Jersey, USA.
du Toit, S. H. C., A. G. W. Steyn, and R. H. Stumpf. 1986. Graphical exploratory data analysis. Springer-Verlag, New York, New York, USA.
Sokal, R. R., and F. J. Rohlf. 1995. Pages 39-60 in Biometry. W. H. Freeman and Company, New York, New York, USA.
Tufte, E. R. 1983. The visual display of quantitative information. Graphics Press, Cheshire, Connecticut, USA.
Tukey, J. W. 1977. Exploratory data analysis. Addison-Wesley, Reading, Massachusetts, USA.

Additional material on EDA is available within the on-line class notes.

Assignment (bring it to class)
1. Begin to explore and illustrate your data. You should prepare as many graphs as you think appropriate to illustrate your hypotheses, in rough (EDA) form. All of the graphic types and elements that you need are available in S-Plus (DO NOT USE EXCEL FOR GRAPHICS!). The graphics palette in S-Plus was designed by William Cleveland (who's written several books on EDA), and contains all the possibilities shown in Tufte's (1983) book on EDA and my chapter on EDA.

2. Compute summary statistics for your variables of interest. If appropriate, compute them 'by' categories of interest. Use S-Plus (Statistics --> Data Summaries --> Summary Statistics) for computation. Write a one-page summary of your summary statistics, that shows that you understand the meaning and differences between different measures of location, spread, and confidence.

3. Plot your summary statistics in (a) way(s) that enables rapid comparison between or among groups of interest. Produce at least three different types of plots illustrating your summary statistics (example: box plots, bar charts, and category plots). Remember: pie charts are not allowed! Write a one or two paragraph description of your plots, in standard scientific style, drawing the reader's attention to the results that you think are most relevant.

4. Using the confidence intervals that you computed in part 2, discuss, in 1-2 paragraphs, the apparent similarities or differences among your different treatment groups.

During class time, you will each present your EDA graphics and data summaries for critique. Be prepared to explain why you chose the graphic types that you did, and why they illustrate your hypotheses. Be prepared to help each other improve the clarity of your graphics. About a third of the class time will be devoted to presentation and critique, while the remainder will be allotted to improving your graphics.



Assignment 3 - Probability, data distributions, hypothesis testing, and statistical power
Due date: February 16, 2001

Part I. Probability distributions and hypothesis testing

Introduction
One of the first steps in analyzing data is assessing the underlying distribution(s) of the variable(s) of interest. For example, a coin flip can yield two possible outcomes, 'heads' or 'tails', and the number of heads in a long run of flips of a fair coin follows a binomial distribution. There are many other distributions that underlie common phenomena: the Poisson distribution and the Gaussian (or 'normal') distribution are two of the more common. Many basic statistical calculations, as well as most statistical tests used by biologists, are based on the assumption that the sampled population (not the sample itself) has a known probability distribution (usually normal) that can be parameterized (hence the use of the term 'parametric' statistics). This week, you will explore the distribution(s) of your variable(s).

We will also use your data distribution to introduce you to (or reacquaint you with) hypothesis testing. Most of you are (or should be) familiar with standard hypothesis testing and P-values; these give rise to the oft-asserted (and routinely misused) 'significance' of your data. You will use the distributions, and the measures of location and spread that you developed in Assignment 2, to test formal hypotheses about the shape of your data distributions. The type of statistical test that we will use for this assignment is a goodness-of-fit test.

The familiar Chi-square test is an example of a goodness-of-fit test. Selvin covers a few goodness-of-fit tests in this week's reading (see especially pp. 145-155, but each example has an associated goodness-of-fit test). Sokal and Rohlf (1995) also cover the standard Chi-square test for enumerative (count) data and its extensions to multinomial or continuous data (such as testing whether or not your data fit a Gaussian distribution).

S-Plus provides two tests for determining goodness-of-fit: the Chi-square and the Kolmogorov-Smirnov tests (Statistics --> Compare Samples --> One Sample --> Kolmogorov-Smirnov GOF or ...-> Chi-square GOF).
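To make the idea behind the Kolmogorov-Smirnov test concrete, the sketch below computes its test statistic, D, by hand in Python: the largest vertical gap between the empirical cumulative distribution of the sample and the cumulative distribution of a normal curve fitted to it. The function name and both datasets are invented for illustration. Fitting the mean and standard deviation from the same sample is a simplifying assumption of this sketch; it means that the usual tabulated critical values for D are only approximate, so use S-Plus for the actual P-value.

```python
import statistics
from statistics import NormalDist

def ks_statistic_normal(values):
    """Largest gap between the empirical CDF and a fitted normal CDF."""
    xs = sorted(values)
    n = len(xs)
    # Normal curve with the sample's own mean and standard deviation.
    fitted = NormalDist(statistics.mean(xs), statistics.stdev(xs))
    d = 0.0
    for i, x in enumerate(xs):
        cdf = fitted.cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, abs((i + 1) / n - cdf), abs(cdf - i / n))
    return d

# Hypothetical data: one roughly symmetric sample, one strongly right-skewed.
symmetric = [9.1, 9.6, 9.8, 10.0, 10.1, 10.3, 10.4, 10.7, 11.0, 11.4]
skewed = [1, 1, 1, 2, 2, 3, 4, 8, 20, 60]

print("D, symmetric sample:", round(ks_statistic_normal(symmetric), 3))
print("D, skewed sample:   ", round(ks_statistic_normal(skewed), 3))
```

A small D means the sample tracks the proposed distribution closely; a large D is evidence against the null hypothesis that the data come from that distribution.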

Required reading
Selvin, S. 1998. Modern applied biostatistical methods using S-Plus, chapter 3. Oxford University Press, New York, USA.

Recommended reading
Sokal, R. R., and F. J. Rohlf. 1995. Pages 61-175 in Biometry. W. H. Freeman and Company, New York, New York, USA.

Additional material on probability distributions and goodness-of-fit tests is available within the on-line class notes.

Assignment (bring it to class)
1. Plot your variables in a way that you can visualize their distribution. Useful graphs for visualizing data distributions include: histograms, box-plots, dot-plots, and stem-and-leaf plots. S-Plus will do all of these (Graph --> 2D Plots), although stem-and-leaf plots can only be done from the command prompt (use the command: stem(variable)).

2. Describe (in 1-2 paragraphs) what distribution(s) ought to be the best fit(s) for your variable(s). Explain why you expect these distributions to be the appropriate ones.

3. Use a one-sample goodness-of-fit test to determine if your data actually fit the distribution you predicted in (2.).

4. Generate a simulated dataset with the same number of observations as your raw data, whose values, for each simulated variable, come from a simulated distribution that you predicted in part 2. Use the S-Plus menu commands to do this (Data --> Random Numbers)

5. Plot your simulated variables, and visually compare the two plots. Do you have a good match? If not, why not? Use a two-sample goodness of fit test to determine if the simulated data and your real data are statistically indistinguishable (use Statistics --> Compare Samples --> Two Samples --> Kolmogorov-Smirnov GOF)

6. If your data are not 'normally' distributed, can you transform them so that they fit a normal distribution? An example is the logarithmic transformation, new variable = ln(old variable), for data that are right-skewed (that is, with a long tail of large values). See the S-Plus help file under Data --> Transform to learn how to do this quickly.

7. Write a one-page exposition describing this first attempt at hypothesis testing. Use accurate language in describing your null and alternative hypotheses, and state your conclusions in appropriate statistical terminology. Refer to your figures when appropriate.
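The effect of the logarithmic transformation mentioned in step 6 can be sketched as follows, in Python rather than S-Plus. The `skewness` function is a simple moment-based measure chosen for this example, and the data are invented; a skewness near zero indicates a roughly symmetric distribution, while a large positive value indicates a long tail of large values.

```python
import math
import statistics

def skewness(values):
    """Moment-based skewness: positive for a long right tail, negative for a long left tail."""
    n = len(values)
    mean = statistics.mean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n   # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n   # third central moment
    return m3 / m2 ** 1.5

# Hypothetical right-skewed measurements (made-up numbers).
raw = [1.2, 1.5, 1.9, 2.3, 2.8, 3.6, 4.9, 7.4, 12.8, 31.0]

# new variable = ln(old variable); this requires every value to be > 0.
logged = [math.log(v) for v in raw]

print("skewness before:", round(skewness(raw), 2))
print("skewness after: ", round(skewness(logged), 2))
```

A reduced skewness does not by itself show that the transformed data are normal; you would still check the transformed variable with a plot and a goodness-of-fit test.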

Part II - Statistical Power

Introduction
After doing Part I, you should have a good sense of hypothesis testing, the meaning of a P-value, and how to determine if your results are statistically significant. P-values are judged against the acceptable probability of committing a Type I error (α; rejecting the null hypothesis when it is in fact true), which is fixed prior to the start of an experiment (traditionally, α = 0.05). This acceptable probability determines the 'significance' of your results: what you report in your scientific paper is that the probability of your data, given the null hypothesis, is less than the α-level: P(data|H0) < α. The converse, Type II error (β; failing to reject the null hypothesis when it is in fact false), is rarely discussed. More importantly, what you may be most interested in is the probability of rejecting your null hypothesis when it is, in fact, false and should be rejected. This quantity, referred to as the power of your statistical test, equals 1-β. Interestingly (but I hope not surprisingly), statistical power does not simply equal 1-α, nor one minus the obtained P-value. Rather, it depends in a rather complex way on sample size, effect size, and your pre-determined α-level. In this light, power analysis can be used to address several important questions:

1. What sample size is needed in order to detect a difference (i.e. to see an effect) of a particular size, given predetermined values for α and β? This question should be asked before you set up your experiment!

2. What sample size would have been needed in order to detect a difference (i.e. to see an effect) equal to that observed in your study, given predetermined values for α and β? This question is usually asked after you have finished your experiment.

3. What is the smallest difference (effect size) that you could have detected, given your sample size and defined values for α and β? Also a post-hoc question.

4. Given your actual sample size, α-level, and effect size, what was the power of your experiment? Asked post-hoc, often with discouraging answers.
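Questions 1 and 4 above can be sketched numerically for one simple case. The Python code below assumes a two-sided comparison of two group means with equal sample sizes, a known common standard deviation, and a normal approximation; these are assumptions of the sketch, not necessarily what the S-Plus power routines do, and the function names are hypothetical.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()

def power_two_sample(delta, sd, n_per_group, alpha=0.05):
    """Approximate power (1 - beta) to detect a true difference `delta` between two means."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    # Difference expressed in units of its standard error.
    shift = abs(delta) / (sd * math.sqrt(2 / n_per_group))
    # Probability that the test statistic lands in either rejection region.
    return STD_NORMAL.cdf(shift - z_crit) + STD_NORMAL.cdf(-shift - z_crit)

def n_per_group_needed(delta, sd, alpha=0.05, power=0.80):
    """Approximate sample size per group needed to reach the requested power."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_beta = STD_NORMAL.inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_beta) * sd / delta) ** 2)

# Hypothetical effect size of one standard deviation.
print("n per group for 80% power:", n_per_group_needed(delta=1.0, sd=1.0))
print("power with n = 5 per group:", round(power_two_sample(1.0, 1.0, 5), 2))
```

Setting `delta` to zero returns the α-level itself, which is one way to see that α and power answer different questions: α is the rejection rate when the null hypothesis is true, and power is the rejection rate when it is false by a stated amount.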

Required reading
Ottenbacher, K. J. 1996. The power of replications and the replications of power. The American Statistician 50: 271-275.
Peterman, R. M. 1990. The importance of reporting statistical power: the forest decline and acidic deposition example. Ecology 71: 2024-2027.

Assignment (bring it to class)

1. Determine the statistical power of one of the goodness-of-fit tests you conducted in Part I. You can either do this by hand, or use S-Plus (Statistics --> Power and Sample Size).

2. Write 1-2 paragraphs that illustrate that you understand the difference between statistical significance and statistical power, and their relationships to your data distributions. Given these differences, and in light of your own data, would you like to change the α-level that you've come to associate with statistical significance, at least for a while?



Assignment 4 - Correlation and Regression
Due date: February 23, 2001

Introduction
Probably the most common statistical procedures used by biologists are correlation and linear regression (for data measured on a continuous scale), and analysis of variance (ANOVA) for comparing mean responses among more than two treatment groups (for two treatment groups, use the familiar t-test or its non-parametric equivalent). Regression and ANOVA are usually discussed together, because Fisher demonstrated that all degrees of freedom and sums of squares (i.e., deviations from overall or within-group means) in an ANOVA problem are reducible to single-degree-of-freedom contrasts analyzable by regression. This week and next week, we will focus on correlation and regression; comparing among categorical groups and ANOVA will be covered in the subsequent two classes.

For both regression and ANOVA, you must specify the independent variable (continuous in regression, discrete [categorical] in ANOVA). The independent variables are assumed to be measured without error. For correlation, the assumption is that both variables were measured with error, and that there is no obvious 'independent' variable. Before setting out to do one of these statistical procedures, make sure that the procedure is appropriate for the question being asked, the data are structured appropriately, and the data conform to the necessary assumptions.
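The distinction drawn above can be seen in the arithmetic. In the Python sketch below (function names and data are invented for illustration), the correlation coefficient treats x and y symmetrically, whereas the least-squares regression line treats x as the independent variable and minimizes deviations in y only.

```python
import math
import statistics

def sums_of_squares(x, y):
    """Return (Sxx, Syy, Sxy): sums of squares and cross-products about the means."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxx, syy, sxy

def pearson_r(x, y):
    """Pearson correlation coefficient; unchanged if x and y are swapped."""
    sxx, syy, sxy = sums_of_squares(x, y)
    return sxy / math.sqrt(sxx * syy)

def least_squares(x, y):
    """Slope and intercept of the regression of y on x."""
    sxx, _, sxy = sums_of_squares(x, y)
    slope = sxy / sxx
    intercept = statistics.mean(y) - slope * statistics.mean(x)
    return slope, intercept

# Hypothetical paired measurements (made-up numbers).
length = [10.2, 11.5, 12.1, 13.8, 14.4, 15.9]
width = [4.1, 4.8, 4.9, 5.7, 5.6, 6.5]

print("r =", round(pearson_r(length, width), 3))
print("slope, intercept =", [round(v, 3) for v in least_squares(length, width)])
```

Swapping the two arguments leaves r unchanged but gives a different regression line, which is why you must decide which variable is independent before running a regression.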

Required reading
Selvin, S. 1998. Modern applied biostatistical methods using S-Plus, pp. 195-226. Oxford University Press, New York, USA.

Assignment (bring it to class)
1. Calculate pair-wise correlations (Statistics --> Data Summaries --> Correlations) and simple linear regression statistics (Statistics --> Regression --> Linear) for any pair of variables. If your data are not amenable to correlation or regression analysis, use another class member's data (especially useful datasets include: Tina's dataset on butterfly morphology; and Rebecca's dataset on pitcher-plant morphology). Be sure to consult with the data owner, and share your results!

a) choose a pair of variables and compute the correlation or regression statistics
b) check that the data meet the assumptions of correlation/regression. Illustrate that you have, in fact, checked the data.
c) if the data do not meet the assumptions, transform them appropriately, and do (a) again.
d) if you can't find an appropriate transformation, compute the Spearman's correlation coefficient based on ranks (in S-Plus, first transform the data using the rank function, then compute the correlation coefficient as in part (a)).

2. Determine if your regression or correlation statistics are 'significant'.

3. Write up what you've done. Be sure that you not only present your results and show that you've checked assumptions, but also that you interpret the meaning of all output parameters. One or two pages ought to be sufficient.
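The rank-based approach in step 1(d) amounts to replacing each value by its rank and then computing an ordinary correlation coefficient on the ranks. Here is a minimal Python sketch of that idea; the function names are hypothetical, tied values are given the average of their ranks (a common convention, assumed here), and the data are made up.

```python
import math
import statistics

def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1        # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson_r(x, y):
    """Ordinary correlation coefficient."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman_r(x, y):
    """Spearman's coefficient: the correlation between the ranks of x and y."""
    return pearson_r(average_ranks(x), average_ranks(y))

# Hypothetical curved but perfectly monotonic relationship.
x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]
print("Pearson: ", round(pearson_r(x, y), 3))
print("Spearman:", round(spearman_r(x, y), 3))
```

Because ranks only record order, the rank-based coefficient measures how consistently one variable increases with the other, not whether the relationship is a straight line.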



Assignment 5 - General linear models
Due date: March 2, 2001

Introduction
Linear regression and correlation are used when both the independent and dependent variables are continuous, and the distribution of error terms ("residuals") is Gaussian (or "normal"). Linear regression is a special case of a family of regression techniques, which together are called general linear models (or GLMs). "General", because different types of distributions of error terms can be specified, or because different types of dependent variables can be analyzed; "linear", because the parameters of the model are linear; and "models" because these are estimates of reality.

The text describes two additional types of GLMs: the linear-logistic model, and the Poisson model. In the first (linear-logistic model), the independent variable is continuous, but the dependent variable is binary. Example class datasets for the linear-logistic model include Heather's scallop data, in which the response variable is "attached" (1) or "unattached" (0), and the independent variable to use would be scallop size; and Bart's EEG data, in which the response variable is EEG waveform "present" (1) or "absent" (0), and the independent variable is the amount of visual noise masking the face. In the second (Poisson model), the independent variable is similarly continuous, but the dependent variable is a count. The example class dataset for the Poisson model is Calley's spider data, in which the response variable is "number of families present at a site" (ranging from 0 to k) and the independent variable is "latitude".

In terms of implementation in S-Plus, general linear models work just like ordinary linear models (Assignment 4), except that you must specify the "link function" or distribution for the data ("binomial" for the linear-logistic; "poisson" for the Poisson). The interpretation of the results is similar, and is described in detail in the text.
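To see what the link function does, the Python sketch below converts a linear predictor, b0 + b1*x, into a predicted probability (logit link, for the linear-logistic model) or a predicted count (log link, for the Poisson model). The coefficients and predictor values are invented for illustration; in practice they would come from fitting the model to data in S-Plus.

```python
import math

def inverse_logit(eta):
    """Map any real number onto the interval (0, 1)."""
    return 1 / (1 + math.exp(-eta))

def logistic_mean(b0, b1, x):
    """Expected probability of a '1' response under a linear-logistic model."""
    return inverse_logit(b0 + b1 * x)

def poisson_mean(b0, b1, x):
    """Expected count under a Poisson model with a log link."""
    return math.exp(b0 + b1 * x)

# Hypothetical coefficients (not fitted to any class dataset).
b0, b1 = -3.0, 0.15

for size in (10, 20, 30):
    print("size", size, "-> P(attached) =", round(logistic_mean(b0, b1, size), 3))

for latitude in (30, 40, 50):
    print("latitude", latitude, "-> expected count =",
          round(poisson_mean(2.5, -0.03, latitude), 2))
```

The model is linear on the scale of the link (log-odds or log-count), which is why the coefficients are interpreted as changes in log-odds or log-counts per unit change in the independent variable.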

Required reading
Selvin, S. 1998. Modern applied biostatistical methods using S-Plus, pp. 227-265. Oxford University Press, New York, USA.

Assignment (bring it to class)
1. Carry out either a linear-logistic regression or a Poisson regression. For this assignment, please use one of the three appropriate class datasets (Heather's scallops, Bart's EEGs, or Calley's spiders).

2. Determine if the regression is "significant". If you can, use multiple independent variables, and compare residual deviances, as described in the text.

3. Write up what you've done. Be sure that you not only present your results and show that you've checked assumptions, but also that you interpret the meaning of all output parameters. One or two pages ought to be sufficient.



Assignment 6 - Analysis of tabular data: beyond the chi-square
Due date: March 9, 2001

Introduction
Tabular, or categorical data are very common in the life sciences. Examples include gene frequencies (and testing for Hardy-Weinberg equilibrium), information on the presence or absence of species, responses to drug treatments, etc. The basic data structure for tabular data is a table. Here is an example of a 2 x 2 table:

                      Treatment
                      Saline    Adjuvant
Aborted                  5         15
Successful birth         2         18

Each entry indicates the number of rats that were treated with either saline solution or adjuvant, and that either aborted their pregnancies (5 in saline, 15 in adjuvant) or successfully gave birth to a litter (2 in saline, 18 in adjuvant). The null hypothesis to be tested is that the treatment had no effect on the outcome of the pregnancy. A chi-square test can be used to test this hypothesis. Here, the test statistic (computed with the continuity correction) is X2 = 0.69, with 1 degree of freedom and a P-value of 0.41. S-Plus returns a warning that the results may not be appropriate because of small sample size. Fisher's exact test is more appropriate for small samples (see Selvin, pp. 324-325), but only applies to 2 x 2 tables (the chi-square test applies to r x c tables, for all r and c). Here, Fisher's exact test gives a similar result (P = 0.41). Be sure you understand how the two tests work, and how they differ.
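The two tests can be worked through by hand for the rat table. The Python sketch below (function names are my own, not S-Plus functions) computes the chi-square statistic for a 2 x 2 table with the continuity correction and the two-sided Fisher's exact P-value. The two-sided rule used here, summing the probabilities of all tables no more likely than the observed one, is a common convention and an assumption of this sketch.

```python
import math

def chi_square_2x2(a, b, c, d, yates=True):
    """Chi-square statistic and P-value (1 df) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if yates:
        diff = max(0.0, diff - n / 2)      # continuity correction
    stat = n * diff ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact P-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = math.comb(row1 + row2, col1)

    def prob(k):
        # Hypergeometric probability of k in the top-left cell, margins fixed.
        return math.comb(row1, k) * math.comb(row2, col1 - k) / total

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum over every table that is no more probable than the observed one.
    return sum(prob(k) for k in range(lowest, highest + 1)
               if prob(k) <= p_observed * (1 + 1e-9))

# Rows: aborted, successful birth; columns: saline, adjuvant.
stat, p_chi = chi_square_2x2(5, 15, 2, 18)
p_fisher = fisher_exact_2x2(5, 15, 2, 18)
print("X2 =", round(stat, 2), " P =", round(p_chi, 2))
print("Fisher's exact P =", round(p_fisher, 2))
```

The chi-square test relies on a large-sample approximation to the distribution of the test statistic, while Fisher's test enumerates every possible table with the same row and column totals, which is why it remains valid when counts are small.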

S-Plus implements a number of statistical tests for tabular data. For 2 x 2 tables, Fisher's exact test is the best choice. Other appropriate tests include the chi-square test or the test for equality of proportions (prop.test()). You should choose the proportions test over the chi-square test if the expected value of any given cell is < 5. A series of 2 x 2 tables, such as would result from applying the saline and adjuvant treatment to a number of different rat strains, can be compared using the Mantel-Haenszel test, mantelhaen.test(). This method tests the hypothesis that the observed variability in a series of 2 x 2 tables arises from random variation. If the number of observations per cell is large, the results of the Mantel-Haenszel test should be very close to those obtained with logistic regression (convince yourself that this should be the case).

For 2 x c tables, a chi-square test can be used, as can Kendall's test for association. For the latter, you'd need to use the "home-made" function described by Selvin (1998) on p. 340. To implement alternatives to a chi-square for larger (r x c) tables you need to write your own small S-Plus functions, as described on pp. 334-340 in Selvin (1998). If the table can be "ordered" (as with, for example, Calley's spider data), you can use log-linear (Poisson) regression, as in Assignment 5.

Required reading
Selvin, S. 1998. Modern applied biostatistical methods using S-Plus, chapter 6. Oxford University Press, New York, USA.

Assignment (bring it to class)

1. Conduct a thorough analysis of a tabular dataset. For this week's assignment, you could use Heather's scallop data, which has two types of tables:

2 x 2 table: Attachment data

Substrate type    Number attached    Number unattached
Simple            a11                a12
Complex           a21                a22

2 x c table: Number of fibers produced by attached scallops

Substrate type    1      2      3      4      5      6
Simple            a11    a12    a13    a14    a15    a16
Complex           a21    a22    a23    a24    a25    a26

Or, you could use Jennifer's rat data, which would be arranged as a 2 x c table:

Number of corpora lutea in the ovaries

Treatment         1      2      3      ...
Saline            a11    a12    a13    ...
Adjuvant          a21    a22    a23    ...

Another possibility is to use Gaytri's NEE data. She has values for respiration and photosynthesis measured in six types (categories) of sites (bog hummocks, bog hollows, beaver ponds 1-4, beaver ponds 6-8, poor fens, and control plots). The response variables are continuous (and so are more appropriately analyzed with ANOVA, see Assignment 8), but if the response variables can be grouped into categories (talk to Gaytri about that), then these data could be analyzed using tabular methods. Similarly, you could analyze Laurel's data on water chemistry of three different streams, with and without beaver.

2. State the hypothesis to be tested.

3. Be sure to use at least one method other than a chi-square test. Extra credit will be given if you write your own S-Plus function, and include the S-Plus code in your assignment.

4. Write it up in two pages or less!



Assignment 7 - Mid-term recap
Due date: March 16, 2001

I will critique, but not grade, this assignment. You can use the critique to refine and improve your final project.

Assignments turned in late will not be critiqued!

Introduction
This week's assignment is an opportunity to catch your breath, catch up, and get a head-start on your final paper.

Assignment
First, re-read the description of the final project. By this point in the semester, you should be able to write up the introduction, and possibly the methods and results for at least one type of statistical analysis.

Your introduction must include:

Your methods must include:

Your results must include:

Please limit this first part of your final paper to 10 pages.

Papers should be e-mailed to me as attachments by 11:59pm on March 16.




Assignment 8 - ANOVA
Due date: March 30, 2001

Introduction
See Assignment 4.

Required reading
Selvin, S. 1998. Modern applied biostatistical methods using S-Plus, pp. 361-390. Oxford University Press, New York, USA.

Assignment (bring it to class)
1. Compute one-way and two-way ANOVAs for sets of variables measured over more than two levels of the given factor(s). Appropriate data for this assignment include Laurel's water chemistry data, Gaytri's photosynthesis/respiration data, Samantha's N-addition data, Jennifer's rat data (also useful for t-tests), and Rebecca's morphology data.

a) Choose a set of variables and independent factors and compute the ANOVA statistics. For one-way ANOVA, use Statistics --> Compare samples --> k Samples --> One-way ANOVA, while for two-way ANOVA, use Statistics --> Analysis of Variance (it's up to you to determine whether to use the Fixed Effects or Random Effects option, but be sure to explain your choice!).
b) Compute appropriate post-hoc multiple comparisons to determine which groups differ from each other. Use Statistics --> Multiple Comparisons to accomplish this, once you've done (a).
c) Check that the data meet the assumptions of ANOVA. Illustrate that you have, in fact, checked the data. S-Plus has many tools to help you do this.
d) If the data do not meet the assumptions, transform them appropriately, and do (a) again.

2. Determine if your ANOVA statistics are 'significant'.

3. Write up what you've done. Be sure that you not only present your results and show that you've checked assumptions, but also that you interpret the meaning of all output parameters. One or two pages ought to be sufficient.
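The arithmetic behind the one-way ANOVA in step 1 can be sketched as follows. This Python code (the function name and data are invented) partitions the total variation into between-group and within-group sums of squares and forms the F-ratio; it does not compute the P-value, for which you would use S-Plus or an F table.

```python
import statistics

def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of groups of observations."""
    all_values = [v for g in groups for v in g]
    grand_mean = statistics.mean(all_values)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((v - statistics.mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within

# Hypothetical measurements from three treatment groups (made-up numbers).
treatments = [
    [4.2, 4.8, 5.1, 4.5],
    [5.9, 6.3, 5.7, 6.1],
    [4.9, 5.2, 5.6, 5.0],
]
f_ratio, df1, df2 = one_way_anova(treatments)
print("F =", round(f_ratio, 2), "with", df1, "and", df2, "degrees of freedom")
```

A large F-ratio means the group means differ by more than would be expected from the scatter within groups; it does not say which groups differ, which is what the multiple comparisons in step 1(b) are for.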

 



Assignment 9 - Multivariate explorations
Due date: April 13, 2001

Introduction
Up until now, we have been dealing primarily with two-dimensional data (one or more independent variables, one dependent variable). Yet many datasets have multiple dependent variables. Multivariate techniques can be used to see if there are relationships between the many dependent variables and the independent variable(s) of interest.

While there are many different multivariate techniques, here we will explore two: principal components analysis (PCA), and cluster analysis. PCA creates 'composite' variables that are linear combinations of variables. In other words, it allows you to determine how several variables 'hang together'. The goal of a PCA is to summarize a multivariate dataset using a few components (usually 2-3), and to determine how much variation in the variables can be 'explained' by those components.
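For just two variables, the calculation behind a PCA can be done by hand, which may help in interpreting the S-Plus output. The Python sketch below (function name and data are invented) finds the two eigenvalues of the covariance matrix and the proportion of the total variance 'explained' by the first component. Working from the covariance matrix rather than the correlation matrix is an assumption of this sketch; the choice matters when variables are measured in different units.

```python
import math
import statistics

def pca_two_variables(x, y):
    """Return (eigenvalue 1, eigenvalue 2, proportion of variance on component 1)."""
    n = len(x)
    mx, my = statistics.mean(x), statistics.mean(y)
    # Entries of the 2 x 2 sample covariance matrix.
    sxx = sum((a - mx) ** 2 for a in x) / (n - 1)
    syy = sum((b - my) ** 2 for b in y) / (n - 1)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    # Eigenvalues of a symmetric 2 x 2 matrix from its trace and determinant.
    trace = sxx + syy
    det = sxx * syy - sxy ** 2
    root = math.sqrt(max(0.0, trace ** 2 - 4 * det))
    lam1 = (trace + root) / 2
    lam2 = (trace - root) / 2
    return lam1, lam2, lam1 / trace

# Hypothetical morphological measurements (made-up numbers).
wing_length = [21.0, 22.4, 23.1, 24.8, 25.5, 26.9]
wing_width = [10.2, 11.1, 11.0, 12.3, 12.4, 13.6]

lam1, lam2, proportion = pca_two_variables(wing_length, wing_width)
print("eigenvalues:", round(lam1, 3), round(lam2, 3))
print("proportion of variance on component 1:", round(proportion, 3))
```

The eigenvalues sum to the total variance of the original variables, so when two variables are strongly correlated almost all of the variance falls on the first component.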

Cluster analysis, on the other hand, forms groupings of observations based on their 'distance' from each other in multivariate space. In other words, imagine plotting a 10-dimensional graph, where 9 of the dimensions were values of your dependent variables and the 10th was the value for your independent variable. You could compute the distance between each pair of points, and then hierarchically group observations by their distance from each other in 10-dimensional space. Observations that are closer together would cluster together, while those that are farther apart would not.
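The distance calculation described above, and the first step of a hierarchical clustering, can be sketched in a few lines of Python. The labels, the measurements, and the use of straight-line (Euclidean) distance are all assumptions made for this illustration; S-Plus offers other distance measures and handles the full sequence of merges.

```python
import math
from itertools import combinations

def euclidean(p, q):
    """Straight-line distance between two observations in multivariate space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def closest_pair(points):
    """Return the labels of the two observations that would be grouped first."""
    return min(combinations(points, 2),
               key=lambda pair: euclidean(points[pair[0]], points[pair[1]]))

# Hypothetical observations, each measured on three variables (made-up numbers).
plants = {
    "plant_A": (12.1, 3.4, 0.8),
    "plant_B": (12.4, 3.6, 0.9),
    "plant_C": (18.9, 5.2, 1.7),
    "plant_D": (19.5, 5.0, 1.6),
    "plant_E": (15.0, 4.1, 1.2),
}

first_merge = closest_pair(plants)
print("first pair to cluster:", first_merge)
```

Because the distance sums squared differences across variables, a variable measured on a large numerical scale will dominate the result unless the variables are standardized first.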

Assumptions of PCA and cluster analysis are similar to those of regression: independent observations, approximately normal distributions of variables (technically, multivariate normal, but we won't worry about that here), uncorrelated residuals, etc.

S-Plus has good routines, both analytical and graphical, for PCA and cluster analysis (under Statistics --> Multivariate and Statistics --> Cluster Analysis).

Required reading

Selvin, S. 1998. Modern applied biostatistical methods using S-Plus, pp. 391-406. Oxford University Press, New York, USA.

Recommended reading
Manly, B. F. J. 1986. Multivariate statistical methods: a primer (especially pages 1-71; 100-113). Chapman & Hall, London.

Assignment (bring it to class)

1. Conduct a PCA and a cluster analysis on some set of multivariate data. The appropriate class data sets are Rebecca's morphology of pitcher-plants and Tina's morphology of butterflies. Please use one of these two datasets for your analysis, and share the results with Rebecca or Tina.

2. Write it up in a page or so; please include graphical output!



Assignment 10 - Time-series analysis and survivorship curves
Due date: April 20, 2001

Introduction
Many biological data violate the independence assumption of regression, ANOVA, and related statistical tests. Not surprisingly, therefore, there are a number of different options for analyzing datasets in which observations are not independent. The situations encountered most commonly are those where the observations show temporal or spatial autocorrelation. Data collected over a period of time on the same individuals, or at the same location, often show correlations between observations based simply on temporal proximity of observations. Examples of such data include patterns of survivorship of individuals within a defined population, data recording annual variability confounded by seasonal periodicities, etc. The same applies for observations collected in a small area: for example, individual plants growing close to each other are more likely to be similar in size than individuals growing far away. This week we will focus on data that are correlated in time, but the same techniques can be used for spatially autocorrelated data.
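One simple way to see temporal autocorrelation is to correlate a series with a copy of itself shifted by one or more time steps. The Python sketch below computes this lag-k autocorrelation using a standard estimator; the function name and the example series are invented for illustration.

```python
import statistics

def autocorrelation(series, lag=1):
    """Correlation between observations that are `lag` time steps apart."""
    n = len(series)
    mean = statistics.mean(series)
    denominator = sum((v - mean) ** 2 for v in series)
    numerator = sum((series[t] - mean) * (series[t + lag] - mean)
                    for t in range(n - lag))
    return numerator / denominator

# Hypothetical monthly measurements with a steady upward trend (made-up numbers).
trending = [2.0, 2.3, 2.9, 3.1, 3.8, 4.0, 4.6, 5.1, 5.3, 6.0]
# Hypothetical series that flips between high and low values.
alternating = [5, 1, 5, 1, 5, 1, 5, 1, 5, 1]

print("lag-1, trending:   ", round(autocorrelation(trending), 2))
print("lag-1, alternating:", round(autocorrelation(alternating), 2))
```

A value near zero at every lag is what independent observations would give; values well away from zero indicate that neighboring observations carry overlapping information, so the effective sample size is smaller than the number of observations.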

S-Plus has excellent analytical routines for time-series analysis and survivorship analysis (Statistics --> Time Series and Statistics --> Survival, respectively). There is also a time-series plot option under Graph --> 2D Plots.

Required reading
Selvin doesn't cover time-series analysis, but his chapter 8 does cover the related survival analysis. For a detailed explanation of time-series analysis, please read (on reserve in the library):
1. von Ende, C. N. 1993. Repeated-measures analysis: growth and other time-dependent measures. Pages 113-137 in S. M. Scheiner and J. Gurevitch, editors. Design and analysis of ecological experiments. Chapman & Hall, New York.
2. Rasmussen, P. W., D. M. Heisey, E. V. Nordheim, and T. M. Frost. 1993. Time-series intervention analysis: unreplicated large-scale experiments. Pages 138-158 in S. M. Scheiner and J. Gurevitch, editors. Design and analysis of ecological experiments. Chapman & Hall, New York.

Assignment (bring it to class)

1. Explore temporal autocorrelation in a dataset, and attempt to determine if there are significant trends in the data. For datasets with multiple treatments, you should try to de-trend the time-series data before testing for differences among treatments. The appropriate datasets here are Samantha's data on stream chemistry and any of the photosynthesis/respiration datasets (of Emily, Gaytri, and Liz)

2. Examine a dataset appropriate for using survival/failure-time analysis. Jen's dataset on rats could be used for survivorship analysis. See if you can do this. Make sure that the dataset has at least two different treatment groups so that you can learn how to compare between them. S-Plus will illustrate these differences graphically, and provides a statistical test for differences among treatment groups.

3. Write it up; one page each for the time-series analysis and the survivorship analysis.
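For the survivorship analysis in step 2, one widely used estimate of a survivorship curve is the Kaplan-Meier (product-limit) estimator; whether this is the exact routine S-Plus applies by default is not stated here, so treat the Python sketch below as an illustration of the idea. The function name and the data are invented. An observation is 'censored' if the individual was still alive when it was last seen.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each time with at least one death.

    events[i] is True if a death was observed at times[i] and False if the
    record was censored (the individual was still alive when last seen).
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Gather every record that shares this time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths:
            # Multiply by the fraction of those at risk that survived time t.
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve

# Hypothetical survival times in days for one treatment group (made-up numbers).
days = [5, 8, 8, 12, 15, 20, 20, 25]
died = [True, True, False, True, True, False, True, False]

for day, s in kaplan_meier(days, died):
    print("day", day, "-> estimated survival", round(s, 3))
```

Censored individuals still count as 'at risk' up to the time they were last seen, which is what lets the method use incomplete records instead of discarding them. To compare treatments, you would compute one curve per group.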



Assignment 11 - Statistical Estimation
Due date: May 4, 2001

Happy spring.

There is no assignment due May 4, but please try to do the reading (Selvin, chapter 5).

Good luck with finishing your theses.



Assignment 12 - Final Paper
Due date: May 10, 2001

Your final paper is due by 11:59pm on May 10, 2001.




 Aaron Ellison
