











Introduction
Since this course depends
on your datasets, your first assignment is to get your data together
in a usable fashion. Even if your dataset isn't complete (if,
for example, your senior project is still in progress), you should
at least have some data by the end of January. For this first
class, therefore, you are expected to bring organized data, with
appropriate metadata with you to class, and be prepared to describe
it (in a 10-minute or so presentation). Note that such a description requires a hypothesis.
Some definitions
Hypothesis -
what's your question? You can't organize your data for exploration
or analysis if you don't have a question that you want your data
to answer, or a hypothesis that you're interested in testing with
your data. This is not a course in data dredging.
Organized data - your data should be organized in a spreadsheet. Since
we'll be using IBM computers throughout the semester, please enter
your data into an IBM-compatible spreadsheet (such as Excel, Lotus,
or Quattro Pro). The lab computers all have Excel 97 loaded on
them. As a rule, cases (individual observations) are each entered
as a single row, while variables that you measured for each case
are entered into each column. Be sure to label your columns. If
you need additional descriptive materials to help you organize
your data, please read the extra online notes on Data
Management or read the first few chapters (pp. 8-37) of Sokal and Rohlf (1995).
Metadata - a description of your data structure. What's the filename? Who
entered the data? How were the data checked for accuracy during
data entry? What do the rows and columns signify? What do variable
names (usually 8 or fewer characters) actually mean? In what format
are your data entered (e.g., integers, real numbers, alphanumeric
characters)? When were your data collected? How? What is their
precision and accuracy (be sure that you know the difference).
Metadata includes any additional information that another user
might find helpful if she were to access your data from a databank
and then try to reanalyze them.
What you turn in (bring it to class)
1.
A written statement of your hypothesis/question, with sufficient
background information that someone else, who knows little about
your system, could understand your question. This statement should
not exceed one word-processed page (1" margins, 12-point type).
2. A printout of your dataset, with all rows and columns labeled. If the data exceed one printed page (normal type size), make sure that the printout carries over row and column labels onto subsequent pages. Excel can do this, as can most other spreadsheet packages.
3. A written description of your dataset (i.e., the metadata). The metadata should be as long as necessary (1" margins, 12-point type).
4. A copy of your data and metadata on a high-density (1.4 MB) PC-formatted, virus-free, 3½" diskette (MS Excel and MS Word files, respectively).
References
Sokal, R. R., and F. J. Rohlf. 1995. Biometry. W. H. Freeman and Company, New York, New York, USA (on reserve in the library).
Introduction
Data exploration (aka exploratory data analysis, or EDA) and data display are fundamental parts of data analysis. EDA is used to summarize your data,
to visualize patterns in your data, and to refine your hypotheses,
while data display presents those patterns to others. While EDA is often done 'on the fly', with low-resolution graphics or printouts, data display means 'presentation-quality' graphics, analogous to what you'd see in a standard scientific publication, and is what should end up in your thesis or independent project. For
this week, you will do some background reading in EDA and data
display, and you will begin to summarize and illustrate your data.
Perhaps the most common and convenient way to summarize data is to report measures of location, spread (error), and confidence. Examples of measures of location are the mean, trimmed mean, median, and mode; examples of measures of spread are the standard deviation, standard error, variance, percentiles, range, and coefficient of variation; and confidence is usually expressed as a k% confidence interval or k% prediction interval. If your data fall into obvious groups (treatments), then summaries are usually reported for each group.
Summary statistics can be reported in tabular form ('Tables') or graphic form ('Figures'). While tables are more precise, figures are usually more compelling.
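For readers who want to see exactly where these numbers come from (the course itself uses S-Plus menus for this), here is a minimal sketch in Python with hypothetical measurements; the t multiplier is hard-coded for this sample size:

```python
import math
import statistics as st

data = [4.1, 4.8, 5.0, 5.2, 5.5, 5.9, 6.3, 6.4, 7.0, 8.2]  # hypothetical measurements

n = len(data)
mean = st.mean(data)
median = st.median(data)
sd = st.stdev(data)            # sample standard deviation (n - 1 denominator)
se = sd / math.sqrt(n)         # standard error of the mean
cv = sd / mean                 # coefficient of variation

# 95% confidence interval for the mean.
# With n = 10, the multiplier is t(0.975, df = 9) = 2.262.
t_crit = 2.262
ci = (mean - t_crit * se, mean + t_crit * se)

print(f"mean = {mean:.2f}, median = {median:.2f}")
print(f"sd = {sd:.2f}, se = {se:.2f}, cv = {cv:.2f}")
print(f"95% CI: ({ci[0]:.2f}, {ci[1]:.2f})")
```

Note the distinction the assignment asks you to understand: the mean and median are measures of location, the sd, se, and cv are measures of spread, and the interval is the statement of confidence.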
Required reading (read before coming to class)
Ellison, A. M. 1993. Exploratory data analysis and graphic display. Pages 14-45 in S. M. Scheiner & J. Gurevitch, editors. Design and analysis of ecological experiments. Chapman and Hall, New York, USA.
Recommended reading
Cleveland, W. S.
1993. Visualizing data. Hobart Press, Summit, New Jersey,
USA.
du Toit, S. H. C., A. G. W. Steyn, and R. H. Stumpf. 1986. Graphical exploratory data analysis. Springer-Verlag, New York, New York, USA.
Sokal, R. R., and F. J. Rohlf. 1995. Pages 39-60 in Biometry. W. H. Freeman and Company, New York, New York, USA.
Tufte, E. R. 1983. The visual display of quantitative information. Graphics Press, Cheshire, Connecticut, USA.
Tukey, J. W. 1977. Exploratory data analysis. Addison-Wesley, Reading, Massachusetts, USA.
Additional material on EDA is available within the online class notes.
Assignment (bring it to class)
1.
Begin to explore and illustrate your data. You should prepare
as many graphs as you think appropriate to illustrate your
hypotheses, in rough (EDA) form. All of the graphic types and
elements that you need are available in S-Plus (DO NOT USE EXCEL FOR GRAPHICS!). The graphics palette in S-Plus was designed by William Cleveland (who has written several books on EDA), and contains all the possibilities shown in Tufte's (1983) book and in my chapter on EDA.
2. Compute summary statistics for your variables of interest. If appropriate, compute them 'by' categories of interest. Use S-Plus (Statistics > Data Summaries > Summary Statistics) for computation. Write a one-page summary of your summary statistics that shows that you understand the meaning of, and differences among, the different measures of location, spread, and confidence.
3. Plot your summary statistics in (a) way(s) that enable(s) rapid comparison between or among groups of interest. Produce at least three different types of plots illustrating your summary statistics (for example: box plots, bar charts, and category plots). Remember: pie charts are not allowed! Write a one- or two-paragraph description of your plots, in standard scientific style, drawing the reader's attention to the results that you think are most relevant.
4. Using the confidence intervals that you computed in part 2, discuss, in 1-2 paragraphs, the apparent similarities or differences among your different treatment groups.
During class time, you will each present your EDA graphics and data summaries for critique. Be prepared to explain why you chose the graphic types that you did, and why they illustrate your hypotheses. Be prepared to help each other improve the clarity of your graphics. About a third of the class time will be devoted to presentation and critique, while the remainder will be allotted to improving your graphics.
Introduction
One of the first steps in
analyzing data is assessing the underlying distribution(s) of
the variable(s) of interest. For example, coin flips can yield
two possible outcomes: 'heads' or 'tails'; a long run of coin
flips (of fair coins) gives rise to a binomial distribution of
data. There are many other distributions that underlie common phenomena: the Poisson distribution and the Gaussian (or 'normal') distribution are two of the more common. Many basic statistical calculations, as well as most statistical tests used by biologists, are based on the assumption that the sampled population (not the sample itself) has a known probability distribution (usually normal) that can be parameterized (hence the use of the term 'parametric' statistics). This week, you will explore the distribution(s) of
your variable(s).
We will also use your data distribution to introduce you to (or reacquaint you with) hypothesis testing. Most of you are (or should be) familiar with standard hypothesis testing and P-values; these give rise to the oft-asserted (and routinely misused) 'significance' of your data. You will use the distributions, and the measures of location and spread, that you developed in Assignment 2 to test formal hypotheses about the shape of your data distributions. The type of statistical test that we will use for this assignment is a goodness-of-fit test.
The familiar Chi-square test is an example of a goodness-of-fit test. Lange et al. cover a couple of other goodness-of-fit tests in chapter 16, and we will return to them at the end of the semester. Sokal and Rohlf (1995) cover the standard Chi-square test for enumerative (count) data and its extensions to multinomial or continuous data (such as testing whether or not your data fit a Gaussian distribution).
S-Plus provides two tests for determining goodness-of-fit: the Chi-square and Kolmogorov-Smirnov tests (Statistics > Compare Samples > One Sample > Kolmogorov-Smirnov GOF or ... > Chi-square GOF).
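To see how a Chi-square goodness-of-fit statistic is assembled before letting the software do it, here is a short Python sketch using hypothetical coin-flip counts and the tabled critical value for α = 0.05, df = 1:

```python
# Hypothetical counts: 60 coin flips, observed heads/tails vs. a fair-coin expectation.
observed = [36, 24]
expected = [30.0, 30.0]

# Chi-square statistic: sum of (observed - expected)^2 / expected over all classes.
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Tabled critical value of the chi-square distribution, alpha = 0.05, df = 1.
critical = 3.841

print(f"chi-square = {chi_sq:.3f}")
if chi_sq > critical:
    print("Reject H0: the counts do not fit the expected distribution.")
else:
    print("Fail to reject H0: the counts are consistent with a fair coin.")
```

S-Plus reports the exact P-value for you; the hand computation is shown only so the 'observed vs. expected' logic of the test is explicit.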
Recommended reading
Sokal, R. R. & F. J. Rohlf. 1995. Pages 61-175 in Biometry. W. H. Freeman and Company, New York, New York, USA.
Additional material on probability distributions and goodness-of-fit tests is available within the online class notes.
Assignment (bring it to class)
1.
Plot your variables in a way that lets you visualize their distribution. Useful graphs for visualizing data distributions include histograms, boxplots, dotplots, and stem-and-leaf plots. S-Plus will do all of these (Graph > 2D Plots), although stem-and-leaf plots can only be done from the command prompt (use the command stem(variable)).
2. Describe (in 1-2 paragraphs) what distribution(s) ought to be the best fit(s) for your variable(s). Explain why you expect these distributions to be the appropriate ones.
3. Use a one-sample goodness-of-fit test to determine whether your data actually fit the distribution you predicted in (2).
4. Generate a simulated dataset with the same number of observations as your raw data, whose values, for each simulated variable, come from the distribution that you predicted in part 2. Use the S-Plus menu commands to do this (Data > Random Numbers).
5. Plot your simulated variables, and visually compare the simulated and observed plots. Do you have a good match? If not, why not? Use a two-sample goodness-of-fit test to determine whether the simulated data and your real data are statistically indistinguishable (use Statistics > Compare Samples > Two Samples > Kolmogorov-Smirnov GOF).
6. If your data are not 'normally' distributed, can you transform them so that they fit a normal distribution? An example is the logarithmic transformation, new variable = ln(old variable), for data that are right-skewed. See the S-Plus help file under Data > Transform to learn how to do this quickly.
7. Write a one-page exposition describing this first attempt at hypothesis testing. Use accurate language in describing your null and alternative hypotheses, and state your conclusions in appropriate statistical terminology. Refer to your figures when appropriate.
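The two-sample Kolmogorov-Smirnov statistic used in step 5 is nothing more than the largest vertical gap between two empirical cumulative distribution functions. A Python sketch with hypothetical samples makes that concrete (S-Plus adds the accompanying P-value):

```python
def ks_statistic(x, y):
    """Two-sample Kolmogorov-Smirnov statistic D: the maximum vertical
    distance between the empirical CDFs of the two samples."""
    def ecdf(sample, v):
        # Fraction of the sample at or below v.
        return sum(1 for s in sample if s <= v) / len(sample)
    points = sorted(set(x) | set(y))
    return max(abs(ecdf(x, v) - ecdf(y, v)) for v in points)

# Hypothetical 'observed' and 'simulated' samples:
a = [1.2, 1.9, 2.4, 2.8, 3.1, 3.5]
b = [2.9, 3.4, 3.8, 4.1, 4.6, 5.0]
D = ks_statistic(a, b)
print(f"D = {D:.3f}")
```

A large D (relative to its critical value for the sample sizes involved) means the two distributions are distinguishable; identical samples give D = 0.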
Introduction
Probably the most common
statistical procedures used by biologists are correlation and
linear regression (for data measured on a continuous scale), and
analysis of variance (ANOVA) for comparing mean responses among
more than two treatment groups (for two treatment groups, use
the familiar t-test or its nonparametric equivalent). Regression and ANOVA are usually discussed together, because Fisher demonstrated that all degrees of freedom and sums of squares (i.e., deviations from overall or within-group means) in an ANOVA problem are reducible to single-degree-of-freedom contrasts analyzable by regression.
This week, we will focus on correlation and regression, and next
week we will focus on ANOVA.
For both regression and ANOVA, you must specify the independent variable (continuous in regression, discrete [categorical] in ANOVA). The independent variables are assumed to be measured without error. For correlation, the assumption is that both variables were measured with error, and that there is no obvious 'independent' variable. Before setting out to do one of these statistical procedures, make sure that the procedure is appropriate for the question being asked, the data are structured appropriately, and the data conform to the necessary assumptions.
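To make the machinery concrete: the least-squares regression line and the Pearson correlation coefficient are both built from the same sums of squares and cross-products. A Python sketch with hypothetical paired data (S-Plus reports the same quantities plus their significance tests):

```python
import math

# Hypothetical paired measurements (e.g., plant height vs. leaf count).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxx = sum((xi - mx) ** 2 for xi in x)            # sum of squares of x
syy = sum((yi - my) ** 2 for yi in y)            # sum of squares of y
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))  # cross-products

slope = sxy / sxx                 # least-squares slope
intercept = my - slope * mx       # least-squares intercept
r = sxy / math.sqrt(sxx * syy)    # Pearson correlation coefficient

print(f"y = {intercept:.3f} + {slope:.3f} x, r = {r:.4f}")
```

Note how the two procedures differ only in what they do with sxy: regression scales it by the variation in the designated independent variable, while correlation scales it symmetrically by both variables.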
Required reading
Lange et al.,
chapters 24.
Recommended reading
If you need a review
of ttests and other comparisons for two groups, see Sokal and
Rohlf, pp. xxxxxx.
Assignment (bring it to class)
1.
Calculate pairwise correlations (Statistics
> Data Summaries > Correlations) and simple linear
regression statistics (Statistics >
Regression > Linear) for any pair of variables. If
your data are not amenable to correlation or regression analysis,
use another class member's data. Be sure to consult with the data
owner, and share your results!
a) choose a pair of variables and compute the correlation or regression statistics
b) check that the data meet the assumptions of correlation/regression. Illustrate that you have, in fact, checked the data.
c) if the data do not meet the assumptions, transform them appropriately, and do (a) again.
d) if you can't find an appropriate transformation, compute Spearman's correlation coefficient, which is based on ranks (in S-Plus, first transform the data using the rank function, then compute the correlation coefficient as in part (a)).
2. Determine if your regression or correlation statistics are 'significant'.
3. Write up what you've done. Be sure that you not only present your results and show that you've checked assumptions, but also that you interpret the meaning of all output parameters. One or two pages ought to be sufficient.
Part I  ANOVA
Introduction
See
last week's assignment.
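As a reminder of what the ANOVA table is doing under the hood, here is a minimal one-way ANOVA F computation in Python with hypothetical data for three treatment groups; S-Plus reports the same partitioning along with the P-value:

```python
# Hypothetical one-way layout: one response measured in three treatment groups.
groups = {
    "low":    [4.0, 5.0, 6.0],
    "medium": [6.0, 7.0, 8.0],
    "high":   [9.0, 10.0, 11.0],
}

all_obs = [v for g in groups.values() for v in g]
grand_mean = sum(all_obs) / len(all_obs)

# Among-group (treatment) and within-group (error) sums of squares.
ss_among = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
               for g in groups.values())
ss_within = sum((v - sum(g) / len(g)) ** 2
                for g in groups.values() for v in g)

df_among = len(groups) - 1
df_within = len(all_obs) - len(groups)

# F is the ratio of the two mean squares.
f_stat = (ss_among / df_among) / (ss_within / df_within)
print(f"F({df_among},{df_within}) = {f_stat:.2f}")
```

The partitioning is exactly the 'deviations from overall or within-group means' described in last week's introduction: total variation splits into an among-group piece and a within-group piece, and F compares the two.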
Required reading
Lange et al.,
chapter 5.
Assignment (bring it to class)
1.
Compute one-way and two-way ANOVAs for sets of variables measured over more than two levels of the given factor(s). If your data are not amenable to ANOVA, use another class member's data. Be sure to consult with the data owner, and share your results!
a) Choose a set of variables and independent factors and compute the ANOVA statistics. For one-way ANOVA, use Statistics > Compare Samples > k Samples > One-way ANOVA, while for two-way ANOVA, use Statistics > Analysis of Variance (it's up to you to determine whether to use the Fixed Effects or Random Effects option, but be sure to explain your choice!).
b) Compute appropriate post-hoc multiple comparisons to determine which groups differ from each other. Use Statistics > Multiple Comparisons to accomplish this, once you've done (a).
c) check that the data meet the assumptions of ANOVA. Illustrate that you have, in fact, checked the data. S-Plus has many tools to help you do this.
d) if the data do not meet the assumptions, transform them appropriately, and do (a) again.
2. Determine if your ANOVA statistics are 'significant'.
3. Write up what you've done. Be sure that you not only present your results and show that you've checked assumptions, but also that you interpret the meaning of all output parameters. One or two pages ought to be sufficient.
Part II  Statistical Power
Introduction
By now, you should have
a good sense of hypothesis testing, the meaning of a Pvalue,
and how to determine if your results are statistically significant.
P-values are based on the acceptable probability of committing a Type I error (α), that is, rejecting the null hypothesis when in fact it is true, which is fixed prior to the start of an experiment (traditionally, α = 0.05). This acceptable probability determines the 'significance' of your results, and what you report in your scientific paper is whether the probability of your data given the null hypothesis is less than the α-level: P(data|H0) < α. The converse, Type II error (β), is rarely discussed. More importantly, what you may be most interested in is the probability of rejecting your null hypothesis when it is, in fact, false and should be rejected. This quantity, referred to as the power of your statistical test, equals 1 - β. Interestingly (but I hope not surprisingly), statistical power does not simply equal 1 - α or one minus the obtained P-value. Rather, it depends in a rather complex way on sample size, effect size, and your predetermined α-level. In this light, power analysis can be used to address several important questions:
1. What sample size is needed in order to detect a difference (i.e., to see an effect) of a particular size, given predetermined values for α and β? This question should be asked before you set up your experiment!
2. What sample size would have been needed in order to detect a difference (i.e., to see an effect) equal to that observed in your study, given predetermined values for α and β? This question is usually asked after you have finished your experiment.
3. What is the smallest difference (effect size) that you could have detected, given your sample size and defined values for α and β? Also a post-hoc question.
4. Given your actual sample size, α-level, and effect size, what was the power of your experiment? Asked post-hoc, often with discouraging answers.
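For questions 1 and 4, a rough answer can be obtained with the standard normal approximation. The Python sketch below (hypothetical effect size and sample sizes) shows how power grows with n for a two-sample comparison of means; S-Plus's Power and Sample Size menu does this more exactly:

```python
import math

def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def power_two_sample(d, n):
    """Approximate power of a two-tailed, two-sample z test of means
    at alpha = 0.05 (d = standardized effect size, n = observations
    per group); the negligible far-tail term is ignored."""
    z_crit = 1.959964          # z for alpha = 0.05, two-tailed
    return norm_cdf(d * math.sqrt(n / 2.0) - z_crit)

# Hypothetical scenario: an effect of one standard deviation.
print(f"power (n = 10 per group): {power_two_sample(1.0, 10):.3f}")
print(f"power (n = 30 per group): {power_two_sample(1.0, 30):.3f}")
```

The point to notice is the one made above: power is not 1 - α or one minus a P-value; it is a function of effect size, sample size, and the chosen α together.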
Required reading
Ottenbacher, K. J. 1996. The power of replications and the replications of power. The American Statistician 50: 271-275.
Peterman, R. M. 1990. The importance of reporting statistical power: the forest decline and acidic deposition example. Ecology 71: 2024-2027.
Assignment (bring it to class)
1. Compute the power of one of the statistical tests you have done this semester (from Assignments 3, 4, or 5). You can either do this by hand, or use S-Plus (Statistics > Power and Sample Size).
2. Write 1-2 paragraphs that illustrate that you understand the difference between statistical significance and statistical power, and their relationships to your data distributions. Given these differences, and in light of your own data, would you like to change the α-level that you've come to associate with statistical significance, at least for a while?
Introduction
Up until now, we have been
dealing primarily with two-dimensional data (one independent variable,
one dependent variable). Yet many datasets have multiple independent
variables. Multivariate techniques can be used to see if there
are relationships between the many independent variables and the
dependent variable of interest.
While there are many different multivariate techniques, here we will explore two: principal components analysis (PCA) and cluster analysis. PCA creates 'composite' variables that are linear combinations of independent variables. In other words, it allows you to determine how several variables 'hang together' as predictors. The goal of a PCA is to summarize a multivariate dataset using a few components (usually 2-3), and to determine how much variation in the dependent variable can be 'explained' by those components.
Cluster analysis, on the other hand, forms groupings of independent variables based on their 'distance' from each other in multivariate space. In other words, imagine plotting a 10-dimensional graph, where 9 of the dimensions were values of your independent variables and the 10th was the value of your dependent variable. You could compute the distance between each pair of points, and then hierarchically group observations by their distance from each other in 10-dimensional space. Observations that are closer together would cluster together, while those that are farther apart would not.
Assumptions of PCA and cluster analysis are similar to those of regression: independent observations, approximately normal distributions of variables (technically, multivariate normal, but we won't worry about that here), uncorrelated residuals, etc.
S-Plus has good routines, both analytical and graphical, for PCA and cluster analysis (under Statistics > Multivariate and Statistics > Cluster Analysis).
Required reading
Lange et al., chapter 17
Recommended reading
Manly, B. F. J. 1986. Multivariate statistical methods: a primer (especially pages 1-71 and 100-113). Chapman & Hall, London.
Assignment (bring it to class)
1. Conduct a PCA and a cluster analysis on some set of multivariate data.
2. Write it up in a page or so; please include graphical output!
Introduction
Many biological data violate
the independence assumption of regression, ANOVA, and related
statistical tests. Not surprisingly, therefore, there are a number
of different options for analyzing datasets in which observations
are not independent. The situations encountered most commonly
are those where the observations show temporal or spatial autocorrelation.
Data collected over a period of time on the same individuals,
or at the same location, often show correlations between observations
based simply on temporal proximity of observations. Examples of
such data include patterns of survivorship of individuals within
a defined population, data recording annual variability confounded
by seasonal periodicities, etc. The same applies for observations
collected in a small area: for example, individual plants growing
close to each other are more likely to be similar in size than
individuals growing far away. This week we will focus on data
that are correlated in time; spatial autocorrelation will be dealt
with at the end of the month (Assignment
9)
S-Plus has excellent analytical routines for time-series analysis and survivorship analysis (Statistics > Time Series and Statistics > Survival, respectively). There is also a time-series plot option under Graph > 2D Plots.
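The correlogram at the heart of time-series analysis is built from lag-k sample autocorrelations, which are easy to compute directly. A Python sketch with a hypothetical trended series:

```python
def autocorrelation(series, lag):
    """Sample autocorrelation at a given lag (one point on a correlogram)."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((v - mean) ** 2 for v in series)
    num = sum((series[t] - mean) * (series[t + lag] - mean)
              for t in range(n - lag))
    return num / denom

# A short hypothetical series with an obvious upward trend:
series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
for k in (1, 2, 3):
    print(f"lag {k}: r = {autocorrelation(series, k):+.3f}")
```

A trend shows up as strong positive autocorrelation at short lags, which is why detrending (assignment item 1) comes before testing for treatment differences.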
Required reading
Lange et al.,
chapters 7, 9, 11, 12, and 14.
Assignment (bring it to class)
1. Explore temporal autocorrelation in a dataset, and attempt to determine whether there are significant trends in the data. For datasets with multiple treatments, you should try to detrend the time-series data before testing for differences among treatments.
2. Examine a dataset appropriate for survival/failure-time analysis. This can be either a student's dataset or one from the textbook. Make sure that the dataset has at least two different treatment groups so that you can learn how to compare them. S-Plus will illustrate these differences graphically, and provides a statistical test for differences among treatment groups.
3. Write it up; one page each for the time-series analysis and the survivorship analysis.
Introduction
Bayesian inference offers
an alternative to 'frequentist' hypothesis testing (P-values, α, β, power; what we've been working on up until now). There are three essential differences between Bayesian and frequentist statistical inference. First, frequentist inference treats population parameters (e.g., μ, σ) as fixed, while Bayesian inference treats them as random. Second, while frequentist inference addresses the probability of your data (x) given your null hypothesis, P(x|H0), Bayesian inference addresses the probability of your null (or alternative) hypothesis given your data, P(H0|x). Finally, and most controversially, Bayesian inference takes into account existing information that might be available about the probability of your hypothesis. Unfortunately, you can't get at the second piece without some information on the prior probability of your hypothesis (the third piece), and so until recently, Bayesian inference has been widely disparaged in basic scientific research (although it is used extensively in business decision-making, medical diagnosis, and expert systems). Lange et al. (chapter 18) give an illustrative example. I provide others in a recent article (Ellison 1996) that also addresses the potential utility of Bayesian inference in both basic and applied research.
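A minimal worked example of prior-to-posterior updating, using the conjugate beta-binomial case with hypothetical prior parameters and data, shows the mechanics you will reproduce for your own variable:

```python
# Conjugate beta-binomial updating: a Beta(a, b) prior for a proportion,
# combined with k successes in n trials, gives a Beta(a + k, b + n - k)
# posterior. All numbers here are hypothetical.
a_prior, b_prior = 3.0, 7.0   # prior belief centered on 0.3
k, n = 12, 20                 # observed data: 12 successes in 20 trials

a_post, b_post = a_prior + k, b_prior + (n - k)

prior_mean = a_prior / (a_prior + b_prior)
post_mean = a_post / (a_post + b_post)
post_var = (a_post * b_post) / ((a_post + b_post) ** 2 * (a_post + b_post + 1.0))

print(f"prior mean = {prior_mean:.3f}")
print(f"posterior mean = {post_mean:.3f}, posterior variance = {post_var:.5f}")
```

Note how the posterior mean lands between the prior mean (0.3) and the sample proportion (0.6), weighted by their relative information content; this is the sense in which Bayesian inference 'takes into account existing information'.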
Required reading
Lange et al.,
chapter 18
Ellison, A. M. 1996. An introduction to Bayesian inference for ecological research and environmental decision-making. Ecological Applications 6: 1035-1046.
Assignment (bring it to class)
1. Decide on a prior probability distribution for one of your own variables of interest. Note that this assignment demands that you extract information from the biological literature about possible parameter values.
2. What is that prior probability distribution? In other words, write the probability density function (pdf) for your prior. Note the prior estimate of your mean and variance.
3. Write a 1-2 paragraph justification for your prior probability distribution. Be sure to cite the appropriate biological literature to support your prior.
4. Based on your prior probability distribution and your data, compute the posterior probability distribution for your data, along with a posterior estimate of your mean and variance, and a credibility interval. Write out all computational steps.
5. (Extra credit) Conduct a Bayesian regression or ANOVA to complement your results from Assignments 4 or 5.
Introduction
Like time-series analysis, spatial statistics deals with biological data that violate the independence assumption of regression, ANOVA, and related statistical tests. In this case the dependencies have to do with spatial autocorrelation. Data collected in a defined spatial context often show correlations between observations based simply on spatial proximity. For example, individual plants growing close to each other are more likely to be similar in size than individuals growing far away. Hedda's dataset on pitcher-plant size in a dense population is our class example for spatially autocorrelated data.
In this assignment, you will examine three types of spatial statistics: spatial interpolation (trend surfaces), kriging, and point process analysis.
Spatial interpolation is a graphical way to describe spatial patterns. It generates graphs that look like contour plots or topographic maps, where the values of the contour lines correspond to the z-value (or observed value) at each point (given by x, y coordinates). For example, each of Hedda's plants has an x, y coordinate in space, and the z-value is the number of leaves at each date (there are actually 3 z-values for each plant, since each was measured 3 times).
Kriging examines spatial autocorrelation just as time-series analysis examines temporal autocorrelation. The question here is: how big is the spatial neighborhood that has an effect on plant size? For this, use the last sample date of Hedda's observations.
Point-process analysis answers the question: are the objects randomly distributed in space, or is there clumping or hyperdispersion? Here, the question is whether the plants are randomly distributed in space (in other words, did we do a good job setting up this experiment?).
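One classical index for the point-process question is the Clark-Evans nearest-neighbour ratio, which compares the observed mean nearest-neighbour distance to that expected under complete spatial randomness. A Python sketch with hypothetical coordinates (edge corrections, which the S-Plus routines handle, are ignored here):

```python
import math

def clark_evans(points, area):
    """Clark-Evans nearest-neighbour ratio R for a point pattern.
    R near 1 suggests spatial randomness, R < 1 clumping, R > 1
    regularity (hyperdispersion). Edge effects are ignored."""
    n = len(points)
    nn_dists = [min(math.hypot(xi - xj, yi - yj)
                    for j, (xj, yj) in enumerate(points) if j != i)
                for i, (xi, yi) in enumerate(points)]
    observed = sum(nn_dists) / n
    expected = 0.5 / math.sqrt(n / area)   # mean NN distance under randomness
    return observed / expected

# Hypothetical coordinates on a 10 x 10 plot: a perfectly regular grid,
# which should score as strongly hyperdispersed (R well above 1).
grid = [(x, y) for x in (1, 3, 5, 7, 9) for y in (1, 3, 5, 7, 9)]
print(f"R = {clark_evans(grid, 100.0):.2f}")
```

Applied to Hedda's plant coordinates, R near 1 would suggest the plants were laid out at random, answering the 'did we set this up well' question.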
How to do this assignment
Functions for doing trend surfaces, kriging, and point-process analysis are available as add-ins to S-Plus 2000. All functions work only at the command prompt.
First things first:
It will be a lot
easier to do this assignment, and to use these functions, if you
first read chapter 16 of W. N. Venables and B. D. Ripley's book
Modern applied statistics with S-Plus, second edition,
which is on reserve in the library. In the third edition, this
is chapter 14, but there is almost no difference in the text,
and there is no difference in the functions. This chapter will
introduce you to spatial statistics, and then lead you through
some worked examples.
Second:
Once you're ready to try analyzing
Hedda's data, issue the following command at the command prompt
>:
> library(spatial, first=T)
This will give you access to the commands and example datasets described by Venables and Ripley.
Third:
Apply the functions surf.ls,
correlogram, variogram, surf.gls, and
ppinit to Hedda's data, and see what you get. Try to interpret the output, following the examples given in Venables and Ripley.
Write it up and bring it to class.
Introduction
While correlation and linear
regression are the most common 'curve-fitting' procedures used
by biologists, often continuous data are related in nonlinear
ways (such as data on growth, which is often better fit by an
exponential model; uptake kinetics, which may be sigmoid or asymptotic;
or light attenuation data, which may be negative exponential or
hyperbolic). Because data transformations may not bring the data
in line with the assumptions of linear regression, or because
they may obscure true relationships, biologists often use nonlinear
regression to relate two continuous variables to each other.
Nonlinear regression requires two things. First, the data must meet all the assumptions of linear regression (a predictor variable and a dependent variable; independence; homoscedasticity; normality of residuals; etc.). Second, you need a predictive nonlinear equation (such as y = a^x). It's up to you, the data, and the metadata to generate the appropriate nonlinear equation.
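When the candidate model is exponential, y = a * exp(b * x), one quick approach is the log-linearization mentioned in the introduction: regress ln y on x, then back-transform the intercept. A Python sketch with hypothetical growth data:

```python
import math

# Hypothetical growth data roughly following y = a * exp(b * x).
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [2.0, 5.4, 14.8, 40.2, 109.1]

# Log-linearization: ln y = ln a + b * x, then ordinary least squares.
ly = [math.log(v) for v in y]
n = len(x)
mx, mly = sum(x) / n, sum(ly) / n
b = (sum((xi - mx) * (li - mly) for xi, li in zip(x, ly))
     / sum((xi - mx) ** 2 for xi in x))      # slope = b
a = math.exp(mly - b * mx)                   # back-transformed intercept = a

print(f"fitted model: y = {a:.2f} * exp({b:.2f} * x)")
```

Note the caveat from the paragraph above: fitting least squares in log space weights the observations differently from a true nonlinear least-squares fit, and can obscure relationships; S-Plus's nonlinear routines fit the untransformed model directly.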
Required reading
Lange et al.,
chapters 10, 13, and 1921.
Assignment (bring it to class)
1. Conduct a nonlinear regression analysis for any pair of variables. S-Plus has many different routines for nonlinear regression, including logistic, log-linear, local (aka loess), and generalizable (completely user-defined) nonlinear models. If your data are not amenable to nonlinear modeling, use another class member's data. Be sure to consult with the data owner, and share your results!
a) choose a pair of variables, determine an appropriate nonlinear model, and fit it.
b) check that the data meet the assumptions of regression. Illustrate that you have, in fact, checked the data.
2. Write it up in a page or so.





