EXPLORATORY DATA ANALYSIS AND GRAPHICAL DISPLAY



Graphical Display:

Data exploration (a.k.a. exploratory data analysis, or EDA) and display are fundamental parts of data analysis. EDA allows you to visualize patterns in your data and refine your hypotheses, while data display presents those patterns to others. EDA is often done 'on the fly' with low-resolution graphics or print-outs, whereas data display uses 'presentation-quality' graphics, analogous to what you would see in a standard scientific presentation, and is what should end up in your thesis or independent project.

Pie Graphs: a circular graph that is useful in showing how a total quantity is distributed among a group of categories. The "pieces of the pie" represent the proportions of the total that fall into each category.

Bar Graphs: a graph that shows how amounts or frequencies are distributed among categories, with the height of each bar representing the quantity or frequency for that category. Example below:

Relative frequency histogram: a bar graph for a quantitative data set in which the height of each bar represents the proportion, or relative frequency, of observations that fall in a particular class or subinterval of the variable being measured. The classes or subintervals are plotted along the horizontal axis. To make a relative frequency histogram, divide the interval between the smallest and largest values into a number of subintervals of equal length. The number of subintervals is arbitrary but usually ranges from 5 to 20. Once the subintervals are formed, the measurements are categorized according to the class into which they fall. Example below:
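The construction described above can be sketched in Python. This is a minimal illustration rather than a prescribed method: the function name relative_frequency_histogram, the choice of 5 subintervals, and the measurements themselves are all invented for the example.

```python
def relative_frequency_histogram(values, n_bins):
    """Return (bin_edges, relative_frequencies) for equal-width bins."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; cannot form subintervals")
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # Map each value to a bin index; the maximum goes in the last bin.
        i = min(int((v - lo) / width), n_bins - 1)
        counts[i] += 1
    edges = [lo + k * width for k in range(n_bins + 1)]
    rel_freq = [c / len(values) for c in counts]
    return edges, rel_freq


# Hypothetical measurements, invented for illustration only.
heights = [5.1, 5.8, 6.0, 6.5, 7.0, 7.6, 8.0, 8.9, 9.0, 9.6, 10.2, 13.4]
edges, rel_freq = relative_frequency_histogram(heights, n_bins=5)

# Print a rough text histogram: one row per subinterval.
for left, right, f in zip(edges, edges[1:], rel_freq):
    print(f"{left:5.2f} to {right:5.2f}: {'#' * round(f * 40)} {f:.3f}")
```

The relative frequencies always sum to 1, which is a quick check that every measurement was assigned to exactly one class.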

When interpreting a histogram, first examine its location: where on the horizontal axis is the center of the histogram? Second, what is the shape of the histogram? Is there one relative frequency bar that is higher than any other, thereby identifying the most frequent value in the set? Are the relative frequency bars in the left and right halves of the graph equal? Is the histogram symmetric? A distribution is said to be skewed to the right if a greater proportion of the observations lies to the right of the highest relative frequency bar. Finally, determine whether any of the measurements seem unusual; that is, are they much bigger or smaller than all of the other measurements? Observations that lie far from the center are called outliers.

Stem and leaf diagram: presents a histogram-like picture of the data, while allowing the experimenter to retain the actual observed value of each data point. To make one of these diagrams, divide each observation into two parts: the stem and the leaf. For example, you could divide each observation at the decimal point. The portion to the left of the point becomes the stem and the portion to the right becomes the leaf. Alternatively, you could choose the point of division between the tenths and hundredths decimal places. List the stem values, in order, in a vertical column. Draw a vertical line to the right of the stem values. For each observation, record the leaf portion of that observation in the row corresponding to the appropriate stem. Reorder the leaves from lowest to highest within each stem row. Example below:

 
Tree Diameter (cm)
Leaf unit = 1.0; N = 79
2 | 55 represents 25.5

0 | 51 52 52 53 54 55 58 58 60 60 60 62 64 65 65 70 70 76 80 80 81 82 83 89 90 90 91 94 95 96
1 | 00 02 05 12 15 34
1 | 51 63 70 73 78 81 89 90 91
2 | 07 10 11 24 33 39 45 46
2 | 55 56 58 60 61 82 84 96
3 | 00 05 06 17 19 22 25 34 38
3 | 60
4 | 21 22 24
4 | 94
5 | 05 25
5 | 70
6 | 33
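
The steps for building a stem and leaf diagram can be sketched in Python. This is a simplified illustration, assuming non-negative measurements recorded to one decimal place and divided at the decimal point; it does not split stems into two rows or use two-digit leaves as the tree diameter display above does. The function name stem_and_leaf and the data are hypothetical.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Split each value at the decimal point: integer part = stem, tenths digit = leaf.

    Assumes non-negative values recorded to one decimal place.
    """
    rows = defaultdict(list)
    for v in values:
        tenths = round(v * 10)           # work in whole tenths to avoid float surprises
        stem, leaf = divmod(tenths, 10)
        rows[stem].append(leaf)
    lines = []
    # List every stem in order, including stems that have no leaves.
    for stem in range(min(rows), max(rows) + 1):
        leaves = "".join(str(leaf) for leaf in sorted(rows.get(stem, [])))
        lines.append(f"{stem} | {leaves}".rstrip())
    return lines


# Hypothetical measurements, invented for illustration only.
sample = [1.2, 1.5, 2.0, 2.3, 2.3, 2.8, 4.1, 4.6]
for line in stem_and_leaf(sample):
    print(line)
```

Keeping the empty stem row (here, stem 3) preserves the histogram-like shape of the display, since gaps in the data remain visible.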

Box plot: A box plot describes the data not only in the middle of the distribution but also in the tails. Values that lie very far from the middle of the distribution are called outliers. A box plot is constructed using the median and two other measurements called the upper and lower hinges (similar to quartiles). A box is drawn around the center of the data so that its ends are the upper and lower hinges. A line through the box marks the value of the median. A line (a 'whisker') is then drawn from each end of the box to the adjacent values, the most extreme observations that are not classified as outliers.
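
The quantities needed to draw a box plot can be computed with a short Python sketch. Two assumptions are made here that the text above does not state: quartiles from the standard library are used in place of hinges (the two can differ slightly), and outliers are flagged with the common convention of 1.5 times the distance between the hinges. The function name box_plot_summary and the data are hypothetical.

```python
import statistics


def box_plot_summary(values, fence_multiplier=1.5):
    """Return the median, hinges, adjacent values, and outliers for a box plot."""
    data = sorted(values)
    # Assumption: quartiles stand in for the hinges.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    spread = q3 - q1
    # Assumption: values beyond these fences are treated as outliers.
    lower_fence = q1 - fence_multiplier * spread
    upper_fence = q3 + fence_multiplier * spread
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    return {
        "median": median,
        "lower_hinge": q1,
        "upper_hinge": q3,
        "adjacent": (inside[0], inside[-1]),   # where the whiskers end
        "outliers": [x for x in data if x < lower_fence or x > upper_fence],
    }


# Hypothetical measurements, invented for illustration only.
summary = box_plot_summary([2, 4, 5, 6, 7, 8, 9, 11, 30])
for name, value in summary.items():
    print(f"{name}: {value}")
```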

Suggestions for further reading: You might want to read the following articles on EDA and data graphics.

Ellison, A. M. 1993. Exploratory data analysis and graphic display. Pages 14-45 in S. M. Scheiner and J. Gurevitch (editors). Design and analysis of ecological experiments. Chapman & Hall, New York, New York, USA.

Lee, J. J. and Z. N. Tu. 1997. A versatile one-dimensional distribution plot: the BLiP plot. The American Statistician 51: 353-358.

The software described by Lee & Tu can be run using S-Plus at Mount Holyoke.

Assignment 2


Measures of Location, Spread, and Confidence Intervals:

After you've explored your data and determined their underlying distribution, it's time to summarize them. Perhaps the most common and convenient way to summarize data is to report measures of location, spread (error), and confidence. Examples of measures of location are the mean, trimmed mean, median, and mode; examples of measures of spread are the standard deviation, standard error, variance, percentiles, range, and coefficient of variation; and confidence is usually expressed as a k% confidence interval or k% prediction interval. If your data fall into obvious groups (treatments), then summaries are usually reported for each group.

The mean of a set of n measurements is equal to the sum of the measurements divided by n.

The trimmed mean is the mean of the middle 90% of the measurements after excluding the smallest 5% and the largest 5%. The trimmed mean is less sensitive than the ordinary mean to extremely large or extremely small values in the data set.

The median of a set of measurements is the value that falls in the middle position when the measurements are ordered from smallest to largest; when the number of measurements is even, it is the average of the two middle values. The median divides a set of measurements into two equal parts. The median is less sensitive to extreme values than the mean.

The mode is the category that occurs most frequently. It is possible for a distribution of measurements to have more than one mode. For example, if we were to measure the length of fish taken from a lake we might get a bimodal distribution possibly reflecting a mixture of young and old fish from the population in the lake.
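
The four measures of location described above can be compared on one data set with a short Python sketch. The function name trimmed_mean and the fish lengths are invented for illustration; the single very large value is included deliberately to show how each measure responds to an extreme observation.

```python
import statistics


def trimmed_mean(values, proportion=0.05):
    """Mean after dropping the smallest and largest `proportion` of the values."""
    data = sorted(values)
    k = int(len(data) * proportion)    # number of values to drop from each end
    kept = data[k:len(data) - k]
    return sum(kept) / len(kept)


# Hypothetical fish lengths (cm), invented for illustration; 95 is an extreme value.
lengths = [12, 14, 14, 15, 15, 15, 16, 16, 17, 18,
           18, 19, 20, 21, 21, 22, 23, 24, 25, 95]

mean = statistics.mean(lengths)
trimmed = trimmed_mean(lengths)          # drops one value from each end (5% of 20)
median = statistics.median(lengths)
modes = statistics.multimode(lengths)    # a list, since there can be more than one mode

print("mean:", mean)
print("trimmed mean:", trimmed)
print("median:", median)
print("mode(s):", modes)
```

In this example the ordinary mean is pulled upward by the one extreme value, while the trimmed mean and the median stay close to the bulk of the measurements.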

The range of a set of measurements is defined as the difference between the largest and smallest measurements.

Variability can be viewed in terms of the distance between each measurement and the mean. If the distances are large, the data are more variable than if the distances are small. The deviation of a measurement xi from the mean x̄ is the quantity (xi - x̄). Using the sum of the squared deviations, we calculate a single measure called the variance of a set of measurements; for a sample of n measurements, the sum is divided by n - 1. The variance is measured in terms of the square of the original units of measurement. Taking the square root of the variance, we obtain the standard deviation, which returns the measure of variability to the original units of measurement. Standard deviations allow us to compare several sets of data with respect to their variability.
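
The calculation of the range, variance, and standard deviation can be followed step by step in Python. This sketch assumes the sample variance (sum of squared deviations divided by n - 1), which is the usual choice when the data are a sample from a larger population; the function name sample_variance and the data are hypothetical.

```python
import math
import statistics


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = [(x - mean) ** 2 for x in values]
    return sum(squared_deviations) / (n - 1)


# Hypothetical measurements, invented for illustration only.
data = [2, 4, 4, 4, 5, 5, 7, 9]

data_range = max(data) - min(data)
variance = sample_variance(data)
std_dev = math.sqrt(variance)    # back in the original units of measurement

print("range:", data_range)
print("variance:", round(variance, 4))
print("standard deviation:", round(std_dev, 4))

# The standard library gives the same sample variance.
print("matches statistics.variance:", math.isclose(variance, statistics.variance(data)))
```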

A percentile is a measure of relative standing. Let x1, x2, ..., xn be a set of n measurements arranged in order of magnitude. The pth percentile is the value of x that exceeds p% of the measurements and is less than the remaining (100-p)%. The 25th and 75th percentiles, called the lower and upper quartiles, along with the median (the 50th percentile), locate points that divide the data into four sets of equal numbers. For example, 25% of the measurements will be less than the lower (first) quartile, 50% will be less than the median (second quartile), and 75% will be less than the upper (third) quartile.
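
Percentiles can be computed in several slightly different ways, and statistical packages do not all agree on the details. The Python sketch below uses one common convention, linear interpolation between the ordered measurements; that choice, the function name percentile, and the data are assumptions made for illustration.

```python
def percentile(values, p):
    """pth percentile (0 <= p <= 100) by linear interpolation between ordered values."""
    data = sorted(values)
    position = (len(data) - 1) * p / 100    # fractional index into the ordered data
    lower = int(position)
    upper = min(lower + 1, len(data) - 1)
    fraction = position - lower
    return data[lower] + fraction * (data[upper] - data[lower])


# Hypothetical measurements, invented for illustration only.
measurements = [3, 5, 7, 8, 9, 11, 13, 15, 18]

lower_quartile = percentile(measurements, 25)
median = percentile(measurements, 50)
upper_quartile = percentile(measurements, 75)

print("lower quartile:", lower_quartile)
print("median:", median)
print("upper quartile:", upper_quartile)
```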

Assignment 2


Elementary Probability Theory: Distributions:

One of the first steps in analyzing data is assessing the underlying distribution(s) of the variable(s) of interest. For example, a coin flip can yield two possible outcomes, 'heads' or 'tails'; the number of heads in a long run of flips of a fair coin follows a binomial distribution. There are many other distributions that underlie common phenomena: the Poisson distribution and the Gaussian (or 'normal') distribution are two of the more common. Many basic statistical calculations, as well as most statistical tests used by biologists, are based on the assumption that the sampled population (not the sample itself) has a known probability distribution (usually normal) that can be parameterized (hence the use of the term 'parametric' statistics).
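
The coin-flip example can be made concrete with a short Python sketch that computes the binomial probabilities for the number of heads in 10 flips of a fair coin. The function name binomial_pmf and the choice of 10 flips are made up for the illustration.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials, each with probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n_flips = 10
pmf = [binomial_pmf(k, n_flips, 0.5) for k in range(n_flips + 1)]

# Print a rough text picture of the distribution of the number of heads.
for k, prob in enumerate(pmf):
    print(f"{k:2d} heads: {'#' * round(prob * 100)} {prob:.4f}")
```

The probabilities sum to 1 and are symmetric around 5 heads, as expected for a fair coin.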

You can use these distributions, together with measures of location and spread, to test formal hypotheses about the shape of your data distributions. The type of statistical test that is used for this purpose is a goodness-of-fit test.

The familiar χ² test (chi-square test) is an example of a goodness-of-fit test. A generalized χ² test can be carried out provided that you know the original data distribution, the number of categories (bins) in your histogram, the expected probability distribution (or pdf: probability density function), and, to compute the p-value, the cumulative distribution function (cdf) of a random variable distributed as χ².
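
A χ² goodness-of-fit test can be sketched in Python using only the standard library. This is an illustration under stated assumptions, not a definitive implementation: the die-roll counts are invented, the degrees of freedom are taken as the number of categories minus 1 (appropriate when no parameters are estimated from the data), and the upper-tail probability is computed from a series expansion of the incomplete gamma function. The function names chi_square_statistic and chi_square_sf are hypothetical.

```python
import math


def chi_square_statistic(observed, expected):
    """Sum over categories of (observed - expected)^2 / expected."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def chi_square_sf(x, df):
    """Upper-tail probability P(X >= x) for a chi-square variable with df degrees of freedom.

    Computed as 1 minus the cdf, using the series for the regularized
    lower incomplete gamma function P(df/2, x/2).
    """
    if x <= 0:
        return 1.0
    a, half_x = df / 2, x / 2
    term = 1.0 / a
    total = term
    n = 1
    while term > total * 1e-15:
        term *= half_x / (a + n)
        total += term
        n += 1
    cdf = total * math.exp(-half_x + a * math.log(half_x) - math.lgamma(a))
    return max(0.0, 1.0 - cdf)


# Hypothetical counts from 60 rolls of a die, invented for illustration only.
observed = [8, 12, 9, 11, 6, 14]
expected = [10] * 6                 # a fair die: 60 rolls / 6 faces

statistic = chi_square_statistic(observed, expected)
df = len(observed) - 1              # assumption: no parameters estimated from the data
p_value = chi_square_sf(statistic, df)

print("chi-square statistic:", round(statistic, 3))
print("degrees of freedom:", df)
print("p-value:", round(p_value, 3))
```

A large p-value, as in this example, means the observed counts are consistent with the expected distribution; a small one suggests the data do not fit it.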

Assignment 3

 Aaron Ellison
