BIOSTATISTICS
SPRING 2000 ASSIGNMENTS


Assignment 1
Assignment 2
Assignment 3
Assignment 4
Assignment 5
Assignment 6
Assignment 7
Assignment 8
Assignment 9
Assignment 10
Assignment 11
Assignment 12

Assignment 1 - Getting organized
Due date: January 28, 2000

Introduction
Since this course depends on your datasets, your first assignment is to get your data together in a usable form. Even if your dataset isn't complete (if, for example, your senior project is still in progress), you should at least have some data by the end of January. For this first class, therefore, you are expected to bring organized data, with appropriate metadata, to class, and to be prepared to describe them (in a 10-minute or so presentation). Note that such a description requires a hypothesis.

Some definitions
Hypothesis - what's your question? You can't organize your data for exploration or analysis if you don't have a question that you want your data to answer, or a hypothesis that you're interested in testing with your data. This is not a course in data dredging.
Organized data - your data should be organized in a spreadsheet. Since we'll be using IBM computers throughout the semester, please enter your data into an IBM-compatible spreadsheet (such as Excel, Lotus, or QuattroPro). The lab computers all have Excel 97 loaded on them. As a rule, cases (individual observations) are each entered as a single row, while variables that you measured for each case are entered into each column. Be sure to label your columns. If you need additional descriptive materials to help you organize your data, please read the extra on-line notes on Data Management or read the first few chapters (pp. 8-37) of Sokal and Rohlf (1995).
Metadata - a description of your data structure. What's the filename? Who entered the data? How were the data checked for accuracy during data entry? What do the rows and columns signify? What do variable names (usually 8 or fewer characters) actually mean? In what format are your data entered (e.g., integers, real numbers, alphanumeric characters)? When were your data collected? How? What are their precision and accuracy (be sure that you know the difference)? Metadata include any additional information that another user might find helpful if she were to access your data from a databank and then try to re-analyze them.

What you turn in (bring it to class)
1. A written statement of your hypothesis/question, with sufficient background information that someone else, who knows little about your system, could understand your question. This statement should not exceed one word-processed page (1" margins, 12 point type).

2. A print-out of your dataset, with all rows and columns labeled. If the data exceed one printed page (normal type size), make sure that the print-out carries over row and column labels onto subsequent pages. Excel can do this, as can most other spreadsheet packages.

3. A written description of your dataset (i.e., the metadata). The metadata should be as long as necessary (1" margins, 12 point type).

4. A copy of your data and metadata on a high-density (1.4 Mb) PC-formatted, virus-free, 3½" diskette (MS Excel and MS Word, respectively).

References
Sokal, R. R., and F. J. Rohlf. 1995. Biometry. W. H. Freeman and Company, New York, New York, USA (on reserve in the library).



Assignment 2 - Exploratory Data Analysis
Due date: February 4, 2000

Introduction
Data exploration (a.k.a. exploratory data analysis, or EDA) and data display are fundamental parts of data analysis. EDA is used to summarize your data, to visualize patterns in your data, and to refine your hypotheses, while data display presents those patterns to others. While EDA is often done 'on the fly', with low-resolution graphics or print-outs, data display means 'presentation-quality' graphics, analogous to what you'd see in a standard scientific publication, and is what should end up in your thesis or independent project. For this week, you will do some background reading in EDA and data display, and you will begin to summarize and illustrate your data.

Perhaps the most common and convenient way to summarize data is to report measures of location, spread (error), and confidence. Examples of measures of location are the mean, trimmed mean, median, and mode; examples of measures of spread are the standard deviation, standard error, variance, percentiles, range, and coefficient of variation; and confidence is usually expressed as a k% confidence interval or k% prediction interval. If your data fall into obvious groups (treatments), then summaries are usually reported for each group.
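For example, here is a minimal S-Plus command-line sketch of these summary measures (the variable name 'height' is hypothetical; substitute one of your own columns, and check the help files for exact argument names):

mean(height)                        # arithmetic mean
mean(height, trim = 0.1)            # 10% trimmed mean
median(height)                      # median
var(height)                         # variance
sqrt(var(height))                   # standard deviation
sqrt(var(height)/length(height))    # standard error of the mean
quantile(height, c(0.25, 0.75))     # 25th and 75th percentiles
# approximate large-sample 95% confidence interval for the mean:
mean(height) + c(-1.96, 1.96) * sqrt(var(height)/length(height))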

Summary statistics can be reported in tabular form ('Tables') or graphic form ('Figures'). While tables are more precise, figures are usually more compelling.

Required reading (read before coming to class)
Ellison, A. M. 1993. Exploratory data analysis and graphic display. Pages 14-45 in S. M. Scheiner & J. Gurevitch, editors. Design and analysis of ecological experiments. Chapman and Hall, New York, USA.

Recommended reading
Cleveland, W. S. 1993. Visualizing data. Hobart Press, Summit, New Jersey, USA.
du Toit, S. H. C., A. G. W. Steyn, and R. H. Stumpf. 1986. Graphical exploratory data analysis. Springer-Verlag, New York, New York, USA.
Sokal, R. R., and F. J. Rohlf. 1995. Pages 39-60 in Biometry. W. H. Freeman and Company, New York, New York, USA.
Tufte, E. R. 1983. The visual display of quantitative information. Graphics Press, Cheshire, Connecticut, USA.
Tukey, J. W. 1977. Exploratory data analysis. Addison-Wesley, Reading, Pennsylvania, USA.

Additional material on EDA is available within the on-line class notes.

Assignment (bring it to class)
1. Begin to explore and illustrate your data. You should prepare as many graphs as you think appropriate to illustrate your hypotheses, in rough (EDA) form. All of the graphic types and elements that you need are available in S-Plus (DO NOT USE EXCEL FOR GRAPHICS!). The graphics palette in S-Plus was designed by William Cleveland (who's written several books on EDA), and contains all the possibilities shown in Tufte's (1983) book and in my chapter on EDA.

2. Compute summary statistics for your variables of interest. If appropriate, compute them 'by' categories of interest. Use S-Plus (Statistics --> Data Summaries --> Summary Statistics) for computation. Write a one-page summary of your summary statistics that shows you understand the meanings of, and differences among, the various measures of location, spread, and confidence.

3. Plot your summary statistics in (a) way(s) that enables rapid comparison between or among groups of interest. Produce at least three different types of plots illustrating your summary statistics (examples: box plots, bar charts, and category plots; a minimal sketch follows this list). Remember: pie charts are not allowed! Write a one- or two-paragraph description of your plots, in standard scientific style, drawing the reader's attention to the results that you think are most relevant.

4. Using the confidence intervals that you computed in step 2, discuss, in 1-2 paragraphs, the apparent similarities or differences among your different treatment groups.
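A minimal command-line sketch of the kinds of plots step 3 asks for, assuming a hypothetical continuous response 'height' and a hypothetical grouping factor 'treatment' (the menu system offers the same plot types with more polish):

boxplot(split(height, treatment))         # box plots, one per group
grp.means <- tapply(height, treatment, mean)
barplot(grp.means)                        # bar chart of group means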

During class time, you will each present your EDA graphics and data summaries for critique. Be prepared to explain why you chose the graphic types that you did, and why they illustrate your hypotheses. Be prepared to help each other improve the clarity of your graphics. About a third of the class time will be devoted to presentation and critique, while the remainder will be allotted to improving your graphics.



Assignment 3 - Probability, data distributions, and hypothesis testing
Due date: February 11, 2000

Introduction
One of the first steps in analyzing data is assessing the underlying distribution(s) of the variable(s) of interest. For example, coin flips can yield two possible outcomes: 'heads' or 'tails'; a long run of coin flips (of fair coins) gives rise to a binomial distribution of data. There are many other distributions that underlie common phenomena: the Poisson distribution and the Gaussian (or 'normal') distribution are two of the more common. Many basic statistical calculations, as well as most statistical tests used by biologists, are based on the assumption that the sampled population (not the sample itself) has a known probability distribution (usually normal) that can be parameterized (hence the use of the term 'parametric' statistics). This week, you will explore the distribution(s) of your variable(s).

We will also use your data distribution to introduce you to (or re-acquaint you with) hypothesis testing. Most of you are (or should be) familiar with standard hypothesis testing and P-values; these give rise to the oft-asserted (and routinely mis-used) 'significance' of your data. You will use the distributions, and the measures of location and spread, that you developed in Assignment 2 to test formal hypotheses about the shape of your data distributions. The type of statistical test that we will use for this assignment is a goodness-of-fit test.

The familiar Chi-square test is an example of a goodness-of-fit test. Lange et al. cover a couple of other goodness-of-fit tests in chapter 16, and we will return to them at the end of the semester. Sokal and Rohlf (1995) cover the standard Chi-square test for enumerative (count) data and its extensions to multinomial or continuous data (such as testing whether or not your data fit a Gaussian distribution).

S-Plus provides two tests for determining goodness-of-fit: the Chi-square and the Kolmogorov-Smirnov tests (Statistics --> Compare Samples --> One Sample --> Kolmogorov-Smirnov GOF or ... --> Chi-square GOF).
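At the command prompt, the corresponding functions are ks.gof and chisq.gof; a minimal sketch, again with the hypothetical variable 'height' (check the help files for the exact argument names):

ks.gof(height, distribution = "normal")      # Kolmogorov-Smirnov GOF test
chisq.gof(height, distribution = "normal")   # Chi-square GOF test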

Recommended reading
Sokal, R. R., and F. J. Rohlf. 1995. Pages 61-175 in Biometry. W. H. Freeman and Company, New York, New York, USA.

Additional material on probability distributions and goodness-of-fit tests is available within the on-line class notes.

Assignment (bring it to class)
1. Plot your variables in a way that lets you visualize their distribution. Useful graphs for visualizing data distributions include histograms, box-plots, dot-plots, and stem-and-leaf plots. S-Plus will do all of these (Graph --> 2D Plots), although stem-and-leaf plots can only be done from the command prompt (use the command: stem(variable)).

2. Describe (in 1-2 paragraphs) what distribution(s) ought to be the best fit(s) for your variable(s). Explain why you expect these distributions to be the appropriate ones.

3. Use a one-sample goodness-of-fit test to determine if your data actually fit the distribution you predicted in step 2.

4. Generate a simulated dataset with the same number of observations as your raw data, whose values, for each simulated variable, come from the distribution that you predicted in step 2. Use the S-Plus menu commands to do this (Data --> Random Numbers); a command-line sketch follows this list.

5. Plot your simulated variables, and visually compare the two plots. Do you have a good match? If not, why not? Use a two-sample goodness-of-fit test to determine whether the simulated data and your real data are statistically indistinguishable (use Statistics --> Compare Samples --> Two Samples --> Kolmogorov-Smirnov GOF).

6. If your data are not 'normally' distributed, can you transform them so that they fit a normal distribution? An example is the logarithmic transformation: new variable = ln(old variable), for data that are right-skewed (i.e., that have a long right tail). See the S-Plus help file under Data --> Transform to learn how to do this quickly.

7. Write a one-page exposition describing this first attempt at hypothesis testing. Use accurate language in describing your null and alternative hypotheses, and state your conclusions in appropriate statistical terminology. Refer to your figures when appropriate.
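A minimal command-line sketch for steps 4 and 5, assuming your hypothetical variable 'height' looked approximately normal:

# simulate a dataset the same size as the real one, with parameters
# estimated from the real data:
sim.height <- rnorm(length(height), mean = mean(height),
                    sd = sqrt(var(height)))
hist(height)                 # visually compare the two distributions
hist(sim.height)
ks.gof(height, sim.height)   # two-sample Kolmogorov-Smirnov comparison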



Assignment 4 - Correlation and Regression
Due date: February 18, 2000

Introduction
Probably the most common statistical procedures used by biologists are correlation and linear regression (for data measured on a continuous scale), and analysis of variance (ANOVA) for comparing mean responses among more than two treatment groups (for two treatment groups, use the familiar t-test or its non-parametric equivalent). Regression and ANOVA are usually discussed together, because Fisher demonstrated that all degrees of freedom and sums of squares (i.e., deviations from overall or within-group means) in an ANOVA problem are reducible to single-degree-of-freedom contrasts analyzable by regression. This week, we will focus on correlation and regression, and next week we will focus on ANOVA.

For both regression and ANOVA, you must specify the independent variable (continuous in regression, discrete [categorical] in ANOVA). The independent variables are assumed to be measured without error. For correlation, the assumption is that both variables were measured with error, and that there is no obvious 'independent' variable. Before setting out to do one of these statistical procedures, make sure that the procedure is appropriate for the question being asked, the data are structured appropriately, and the data conform to the necessary assumptions.

Required reading
Lange et al., chapters 2-4.

Recommended reading
If you need a review of t-tests and other comparisons for two groups, see Sokal and Rohlf, pp. xxx-xxx.

Assignment (bring it to class)
1. Calculate pair-wise correlations (Statistics --> Data Summaries --> Correlations) and simple linear regression statistics (Statistics --> Regression --> Linear) for any pair of variables. If your data are not amenable to correlation or regression analysis, use another class member's data. Be sure to consult with the data owner, and share your results! A minimal command-line sketch follows this assignment.

a) choose a pair of variables and compute the correlation or regression statistics
b) check that the data meet the assumptions of correlation/regression. Illustrate that you have, in fact, checked the data.
c) if the data do not meet the assumptions, transform them appropriately, and do (a) again.
d) if you can't find an appropriate transformation, compute Spearman's correlation coefficient based on ranks (in S-Plus, first transform the data using the rank function, then compute the correlation coefficient as in part (a)).

2. Determine if your regression or correlation statistics are 'significant'.

3. Write up what you've done. Be sure that you not only present your results and show that you've checked assumptions, but also that you interpret the meaning of all output parameters. One or two pages ought to be sufficient.
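A minimal command-line sketch of the whole sequence, assuming two hypothetical continuous variables 'leafarea' and 'biomass':

cor(leafarea, biomass)               # Pearson correlation coefficient
fit <- lm(biomass ~ leafarea)        # simple linear regression
summary(fit)                         # slope, intercept, R-squared, F-test
plot(fitted(fit), resid(fit))        # check homoscedasticity
qqnorm(resid(fit))                   # check normality of residuals
cor(rank(leafarea), rank(biomass))   # Spearman's rank correlation (step d)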



Assignment 5 - Analysis of Variance and Statistical Power
Due date: February 25, 2000

Part I - ANOVA

Introduction
See last week's assignment.

Required reading
Lange et al., chapter 5.

Assignment (bring it to class)
1. Compute one-way and two-way ANOVAs for sets of variables measured over more than two levels of the given factor(s). If your data are not amenable to ANOVA, use another class member's data. Be sure to consult with the data owner, and share your results! A minimal command-line sketch follows this assignment.

a) Choose a set of variables and independent factors and compute the ANOVA statistics. For one-way ANOVA, use Statistics --> Compare Samples --> k Samples --> One-way ANOVA, while for two-way ANOVA, use Statistics --> Analysis of Variance (it's up to you to determine whether to use the Fixed Effects or Random Effects option, but be sure to explain your choice!).
b) Compute appropriate post-hoc multiple comparisons to determine which groups differ from each other. Use Statistics --> Multiple Comparisons to accomplish this, once you've done (a).
c) Check that the data meet the assumptions of ANOVA. Illustrate that you have, in fact, checked the data. S-Plus has many tools to help you do this.
d) If the data do not meet the assumptions, transform them appropriately, and do (a) again.

2. Determine if your ANOVA statistics are 'significant'.

3. Write up what you've done. Be sure that you not only present your results and show that you've checked assumptions, but also that you interpret the meaning of all output parameters. One or two pages ought to be sufficient.
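A minimal command-line sketch of Part I, assuming a hypothetical response 'growth' and hypothetical factors 'light' and 'water' (the menus run the same underlying functions):

fit1 <- aov(growth ~ light)           # one-way ANOVA
summary(fit1)
fit2 <- aov(growth ~ light * water)   # two-way ANOVA with interaction
summary(fit2)
multicomp(fit1)                       # post-hoc multiple comparisons
plot(fitted(fit2), resid(fit2))       # check homoscedasticity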

Part II - Statistical Power

Introduction
By now, you should have a good sense of hypothesis testing, the meaning of a P-value, and how to determine if your results are statistically significant. P-values are based on the acceptable probability of committing a Type I error (α), that is, rejecting the null hypothesis when in fact it is true, which is fixed prior to the start of an experiment (traditionally, α = 0.05). This acceptable probability determines the 'significance' of your results, and what you report in your scientific paper is that the probability of your data given the null hypothesis is less than the α-level: P(data|H0) < α. The converse, Type II error (β), failing to reject the null hypothesis when it is in fact false, is rarely discussed. More importantly, what you may be most interested in is the probability of rejecting your null hypothesis when it is, in fact, false and should be rejected. This quantity, referred to as the power of your statistical test, equals 1 - β. Interestingly (but I hope not surprisingly), statistical power does not simply equal 1 - α, nor 1 minus the obtained P-value. Rather, it depends in a rather complex way on sample size, effect size, and your pre-determined α-level. In this light, power analysis can be used to address several important questions:

1. What sample size is needed in order to detect a difference (i.e., to see an effect) of a particular size, given predetermined values for α and β? This question should be asked before you set up your experiment!

2. What sample size would have been needed in order to detect a difference (i.e., to see an effect) equal to that observed in your study, given predetermined values for α and β? This question is usually asked after you have finished your experiment.

3. What is the smallest difference (effect size) that you could have detected, given your sample size and defined values for α and β? This is also a post-hoc question.

4. Given your actual sample size, α-level, and effect size, what was the power of your experiment? This is asked post-hoc, often with discouraging answers.

Required reading
Ottenbacher, K. J. 1996. The power of replications and the replications of power. The American Statistician 50: 271-275.
Peterman, R. M. 1990. The importance of reporting statistical power: the forest decline and acidic deposition example. Ecology 71: 2024-2027.

Assignment (bring it to class)

1. Compute the power of one of the statistical tests you have done this semester (from Assignments 3, 4, or 5). You can either do this by hand, or use S-Plus (Statistics --> Power and Sample Size). A by-hand sketch follows this assignment.

2. Write 1-2 paragraphs that illustrate that you understand the difference between statistical significance and statistical power, and their relationships to your data distributions. Given these differences, and in light of your own data, would you like to change the α-level that you've come to associate with statistical significance, at least for a while?
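If you choose to compute power by hand, here is a minimal sketch for a two-sample comparison using the large-sample normal approximation (all numbers are hypothetical placeholders; substitute values from your own analysis):

d <- 2; s <- 4; n <- 20; alpha <- 0.05    # effect size, SD, per-group n, α
se <- s * sqrt(2/n)                       # standard error of the difference
z.crit <- qnorm(1 - alpha/2)              # two-sided critical value
1 - pnorm(z.crit - d/se) + pnorm(-z.crit - d/se)   # approximate power, 1 - β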



Assignment 6 - Multivariate analysis
Due date: March 3, 2000

Introduction
Up until now, we have been dealing primarily with two-dimensional data (one independent variable, one dependent variable). Yet many datasets have multiple independent variables. Multivariate techniques can be used to see if there are relationships between the many independent variables and the dependent variable of interest.

While there are many different multivariate techniques, here we will explore two: principal components analysis (PCA), and cluster analysis. PCA creates 'composite' variables that are linear combinations of independent variables. In other words, it allows you to determine how several variables 'hang together' as predictors. The goal of a PCA is to summarize a multivariate dataset using a few components (usually 2-3), and to determine how much variation in the dependent variable can be 'explained' by those components.

Cluster analysis, on the other hand, forms groupings of independent variables based on their 'distance' from each other in multivariate space. In other words, imagine plotting a 10-dimensional graph, where 9 of the dimensions are values of your independent variables and the 10th is the value of your dependent variable. You could compute the distance between each pair of points, and then hierarchically group observations by their distance from each other in 10-dimensional space. Observations that are closer together would cluster together, while those that are farther apart would not.

Assumptions of PCA and cluster analysis are similar to those of regression: independent observations, approximately normal distributions of variables (technically, multivariate normal, but we won't worry about that here), uncorrelated residuals, etc.

S-Plus has good routines, both analytical and graphical, for PCA and cluster analysis (under Statistics --> Multivariate and Statistics --> Cluster Analysis).
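A minimal command-line sketch, assuming a hypothetical data frame 'envdata' of continuous variables:

pc <- princomp(envdata, cor = T)   # PCA on the correlation matrix
summary(pc)                        # variance explained by each component
loadings(pc)                       # how the variables 'hang together'
hc <- hclust(dist(envdata))        # hierarchical clustering on Euclidean distances
plclust(hc)                        # plot the dendrogram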

Required reading

Lange et al., chapter 17

Recommended reading
Manly, B. F. J. 1986. Multivariate statistical methods: a primer (especially pages 1-71; 100-113). Chapman & Hall, London.

Assignment (bring it to class)

1. Conduct a PCA and a cluster analysis on some set of multivariate data.

2. Write it up in a page or so; please include graphical output!



Assignment 7 - Time series analysis and survivorship curves
Due date: March 10, 2000

Introduction
Many biological data violate the independence assumption of regression, ANOVA, and related statistical tests. Not surprisingly, therefore, there are a number of different options for analyzing datasets in which observations are not independent. The situations encountered most commonly are those where the observations show temporal or spatial autocorrelation. Data collected over a period of time on the same individuals, or at the same location, often show correlations between observations based simply on their temporal proximity. Examples of such data include patterns of survivorship of individuals within a defined population, data recording annual variability confounded by seasonal periodicities, etc. The same applies to observations collected in a small area: for example, individual plants growing close to each other are more likely to be similar in size than individuals growing farther away. This week we will focus on data that are correlated in time; spatial autocorrelation will be dealt with at the end of the month (Assignment 9).

S-Plus has excellent analytical routines for time-series analysis and survivorship analysis (Statistics --> Time Series and Statistics --> Survival, respectively). There is also a time-series plot option under Graph --> 2D Plots.
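A minimal command-line sketch of both analyses, with hypothetical variable names throughout (a time series 'counts'; survival times 'days', a 0/1 censoring indicator 'status', and a treatment factor 'group'):

acf(counts)                            # autocorrelation function plot
fit <- survfit(Surv(days, status) ~ group)
plot(fit)                              # survivorship curves by group
survdiff(Surv(days, status) ~ group)   # test for differences among groups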

Required reading
Lange et al., chapters 7, 9, 11, 12, and 14.

Assignment (bring it to class)

1. Explore temporal autocorrelation in a dataset, and attempt to determine if there are significant trends in the data. For datasets with multiple treatments, you should try to de-trend the time-series data before testing for differences among treatments.

2. Examine a dataset appropriate for using survival/failure-time analysis. This can either be a student's dataset, or one from the textbook. Make sure that the dataset has at least two different treatment groups so that you can learn how to compare between them. S-Plus will illustrate these differences graphically, and provides a statistical test for differences among treatment groups.

3. Write it up; one page each for the time-series analysis and the survivorship analysis.



Assignment 8 - Bayesian inference
Due date: April 14, 2000

Introduction
Bayesian inference offers an alternative to 'frequentist' hypothesis testing (P-values, α, β, power; what we've been working on up until now). There are three essential differences between Bayesian and frequentist statistical inference. First, frequentist inference treats population parameters (e.g., μ, σ) as fixed, while Bayesian inference treats them as random variables. Second, while frequentist inference addresses the probability of your data (x) given your null hypothesis, P(x|H0), Bayesian inference addresses the probability of your null (or alternative) hypothesis given your data, P(H0|x). Finally, and most controversially, Bayesian inference takes into account existing information that might be available about the probability of your hypothesis. Unfortunately, you can't get at the second piece without some information on the prior probability of your hypothesis (the third piece), and so until recently Bayesian inference has been widely disparaged in basic scientific research (although it is used extensively in business decision-making, medical diagnosis, and expert systems). Lange et al. (chapter 18) give illustrative examples. I provide others in a recent article (Ellison 1996) that also addresses the potential utility of Bayesian inference in both basic and applied research.

Required reading
Lange et al., chapter 18
Ellison, A. M. 1996. An introduction to Bayesian inference for ecological research and environmental decision making. Ecological Applications 6: 1035-1046.

Assignment (bring it to class)

1. Decide on a prior probability distribution for one of your own variables of interest. Note that this assignment demands that you extract information from the biological literature about possible parameter values.

2. What is that prior probability distribution? In other words, write the probability density function (pdf) for your prior. Note the prior estimate of your mean and variance.

3. Write a 1-2 paragraph justification for your prior probability distribution. Be sure to cite the appropriate biological literature to support your prior.

4. Based on your prior probability distribution and your data, compute the posterior probability distribution for your data, along with a posterior estimate of your mean and variance, and a credibility interval. Write out all computational steps (a minimal sketch for the normal case follows this list).

5. (Extra credit) Conduct a Bayesian regression or ANOVA to complement your results from Assignments 4 or 5.
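For step 4, a minimal by-hand sketch of the conjugate normal-normal update for a mean, treating the sampling variance as known (the variable 'height' and all prior values are hypothetical; cf. Lange et al., chapter 18):

prior.mean <- 10; prior.var <- 4            # prior from the literature (steps 1-3)
n <- length(height)                         # your data
xbar <- mean(height); s2 <- var(height)     # plug-in for the 'known' variance
post.var <- 1/(1/prior.var + n/s2)          # posterior variance of the mean
post.mean <- post.var * (prior.mean/prior.var + n*xbar/s2)
post.mean + c(-1.96, 1.96) * sqrt(post.var) # 95% credibility interval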

 



Assignment 9 - Spatial statistics
Due date: April 21, 2000

Introduction
Like time-series analysis, spatial statistics deals with biological data that violate the independence assumption of regression, ANOVA, and related statistical tests. In this case the dependencies have to do with spatial autocorrelation. Data collected in a defined spatial context often show correlations between observations based simply on their spatial proximity. For example, individual plants growing close to each other are more likely to be similar in size than individuals growing farther away. Hedda's dataset on pitcher-plant size in a dense population is our class example for spatially-autocorrelated data.

In this assignment, you will examine three types of spatial statistics: spatial interpolation (trend surfaces), kriging, and point process analysis.

Spatial interpolation is a graphical way to describe spatial patterns. It generates graphs that look like contour plots or topographic maps, where the values of the contour lines correspond to the z-value (or observed value) at each point (given by x, y coordinates). For example, each of Hedda's plants has an x,y coordinate in space, and the z-value is the number of leaves at each date (there are actually 3 z-values for each plant, since each was measured 3 times).

Kriging examines spatial autocorrelation just as time-series analysis examines temporal autocorrelation. The question here is: how big is the spatial neighborhood that has an effect on plant size? For this, use the last sample date of Hedda's observations.

Point process analysis answers the question: are the objects randomly distributed in space, or is there clumping or hyperdispersion? Here, the question is: are the plants randomly distributed in space (in other words, did we do a good job setting up this experiment)?

How to do this assignment
Functions for doing trend surfaces, kriging, and point-process analysis are available as add-ins to S-Plus 2000. All functions work only at the command prompt.

First things first:
It will be a lot easier to do this assignment, and to use these functions, if you first read chapter 16 of W. N. Venables and B. D. Ripley's book Modern applied statistics with S-Plus, second edition, which is on reserve in the library. In the third edition, this is chapter 14, but there is almost no difference in the text, and there is no difference in the functions. This chapter will introduce you to spatial statistics, and then lead you through some worked examples.

Second:
Once you're ready to try analyzing Hedda's data, issue the following command at the command prompt:

> library(spatial, first=T)

This will give you access to the commands and example datasets described by Venables and Ripley.

Third:
Apply the functions surf.ls, correlogram, variogram, surf.gls, and ppinit to Hedda's data, and see what you get. Try to interpret the output, following the examples given in Venables and Ripley.
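A hedged sketch of what that sequence might look like (the coordinate vectors x and y, the response z, the polynomial degree, the number of bins, and the covariance-range parameter are all placeholders; check the Venables and Ripley examples for sensible values):

fit <- surf.ls(2, x, y, z)                      # least-squares quadratic trend surface
correlogram(fit, 25)                            # spatial autocorrelation of residuals
variogram(fit, 25)                              # semi-variogram of residuals
fit2 <- surf.gls(2, expcov, x, y, z, d = 0.7)   # kriging surface via GLS
ppinit("hedda.dat")                             # read a point pattern (hypothetical filename)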

Write it up and bring it to class.



Assignment 10 - Non-linear regression
Due date: April 28, 2000

Introduction
While correlation and linear regression are the most common 'curve-fitting' procedures used by biologists, often continuous data are related in non-linear ways (such as data on growth, which is often better fit by an exponential model; uptake kinetics, which may be sigmoid or asymptotic; or light attenuation data, which may be negative exponential or hyperbolic). Because data transformations may not bring the data in line with the assumptions of linear regression, or because they may obscure true relationships, biologists often use non-linear regression to relate two continuous variables to each other.

Non-linear regression requires two things. First, the data must meet all the assumptions of linear regression (a predictor variable and a dependent variable; independence; homoscedasticity; normality of residuals; etc.). Second, you need a predictive non-linear equation (such as y = a^x). It's up to you, the data, and the metadata to generate the appropriate non-linear equation.
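For example, a minimal nls() sketch for an exponential growth model (the data frame 'growthdata' and the variables 'mass' and 'day' are hypothetical; nls requires starting guesses for the parameters):

attach(growthdata)                        # hypothetical data frame
fit <- nls(mass ~ a * exp(b * day),
           start = list(a = 1, b = 0.1))  # starting values for a and b
summary(fit)                              # parameter estimates and standard errors
plot(day, resid(fit))                     # check the residuals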

Required reading
Lange et al., chapters 10, 13, and 19-21.

Assignment (bring it to class)

1. Conduct a non-linear regression analysis for any pair of variables. S-Plus has many different routines for non-linear regression, including logistic, log-linear, local (aka loess), and generalizable (completely user-defined) non-linear models. If your data are not amenable to non-linear modelling, use another class member's data. Be sure to consult with the data owner, and share your results!

a) choose a pair of variables, determine an appropriate non-linear model, and fit it.
b) check that the data meet the assumptions of regression. Illustrate that you have, in fact, checked the data.

2. Write it up in a page or so.

 

