FPL Statistics Group
Many "Glitchy" Regressions
New as of 8/4/97: David Geary's rubberband classes have been added to
make the data exclusion process smoother. Also, many of the classes have been
reorganized into packages.
New as of 2/21/01: Here is a link to a version that uses a Java
translation of a Minpack nonlinear least squares routine to fit
multiple data sets. (This
version does not produce residual plots. It does, however, allow
mouse-based "graphical" changes in the starting values used to
estimate the parameters of the nonlinear least squares model.)
New as of 3/2/01: Here is a link to a version that uses Java
translations of some of the LINPACK numerical linear algebra routines
to handle a model that is more complex than a simple linear
regression. (In fact the model at this link is just
a + b (x - c)^2 but the LINPACK
routines that are used can be used to handle any general linear model.)
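To see why a + b (x - c)^2 can be treated as a linear model, expand the square: a + b (x - c)^2 = (a + b c^2) - 2 b c x + b x^2, which is linear in the coefficients of 1, x, and x^2. The Java sketch below is an illustration only. It assumes that c is estimated rather than fixed (this page does not say which), and it solves the 3 by 3 normal equations by Gaussian elimination instead of calling the LINPACK translations. The class and method names are hypothetical.

```java
/** Hypothetical illustration; this is not the code used at the link. */
class QuadraticAsLinearSketch {

    /** Least squares fit of y = b0 + b1*x + b2*x^2; returns {b0, b1, b2}. */
    static double[] fitQuadratic(double[] x, double[] y) {
        double[][] m = new double[3][4];   // augmented matrix [X'X | X'y]
        for (int k = 0; k < x.length; k++) {
            double[] row = { 1.0, x[k], x[k] * x[k] };
            for (int i = 0; i < 3; i++) {
                for (int j = 0; j < 3; j++) {
                    m[i][j] += row[i] * row[j];
                }
                m[i][3] += row[i] * y[k];
            }
        }
        // Forward elimination with partial pivoting.
        for (int col = 0; col < 3; col++) {
            int piv = col;
            for (int r = col + 1; r < 3; r++) {
                if (Math.abs(m[r][col]) > Math.abs(m[piv][col])) piv = r;
            }
            double[] tmp = m[col]; m[col] = m[piv]; m[piv] = tmp;
            for (int r = col + 1; r < 3; r++) {
                double f = m[r][col] / m[col][col];
                for (int j = col; j < 4; j++) m[r][j] -= f * m[col][j];
            }
        }
        // Back substitution.
        double[] beta = new double[3];
        for (int i = 2; i >= 0; i--) {
            double s = m[i][3];
            for (int j = i + 1; j < 3; j++) s -= m[i][j] * beta[j];
            beta[i] = s / m[i][i];
        }
        return beta;
    }

    /** Converts {b0, b1, b2} to {a, b, c} of a + b*(x - c)^2; needs b2 != 0. */
    static double[] toVertexForm(double[] beta) {
        double b = beta[2];
        double c = -beta[1] / (2.0 * b);
        double a = beta[0] - b * c * c;
        return new double[] { a, b, c };
    }
}
```

The normal equations were chosen here only to keep the sketch short. A QR or singular value decomposition, such as the LINPACK routines provide, is generally the more numerically stable way to solve the same problem.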
Researchers sometimes encounter data sets that consist of many curves.
The researchers may need to "reduce" these curves further before
performing any statistical tests. Some examples:
- A researcher is interested in the effect of various preservative
treatments on the modulus of elasticity of several grades of lumber.
The design is a two-way analysis of variance (ANOVA). The data points
in each cell are modulus of elasticity values obtained from
stress-strain curves.
- A researcher is interested in the effect of several chemical
inhibitors on the growth of several fungus species. Again the
design is a two-way ANOVA. The data points in each cell are 50%
inhibition doses obtained from colony size versus dose curves.
- A researcher is interested in the effects of linerboard source,
corrugating medium, and adhesive type on the creep behavior of corrugated
fiberboard under cyclic humidity conditions. Here the design is a
three-way ANOVA. The data points in each cell are steady-state creep
rates obtained from strain versus time curves.
In all of these examples, before the ANOVAs can be performed, the data
curves must be reduced to single numbers. Sometimes this process is
straightforward and it can be automated. In other cases, especially
when the curves contain "glitches," human intervention is required ---
an investigator needs to look at a plot of each data curve, select a
region to fit, look at the results of the fit, decide whether an
improved fit is needed, and so on. When the data set consists of
hundreds of curves, this can be a painful process. We hope that our
program will ease the pain.
The program is available as both a Java applet and a Java
application.
Alpha source code is available in both compressed tar and zip form. Here is javadoc documentation for some of this
material.
The version here fits simple linear regressions. Here is a
demonstration applet.
[Here are Java translations of the LINPACK Cholesky, QR, singular value,
and LU decomposition routines. Here are Java translations of
several high-quality public domain optimization packages, including
Minpack nonlinear least squares routines. These were not used in this
application, but they might be useful to someone who wants to produce
their own regression applications.]
The program is designed to work as follows:
- A user provides it with a data file. The file consists of a
line that contains a curve id, followed by lines containing x,y pairs,
followed by the id of the next curve, followed by its x,y pairs, and
so on. (If you foresee that you might find a use for this program, but
would prefer a different form for the input, please contact me at the
e-mail address given below.) Currently, the demonstration applet reads a data set
consisting of a collection of stress versus strain curves.
As of 10/26/96, you can provide your own data set. Simply
transfer it by anonymous ftp to www1.fpl.fs.fed.us and put
it in the pub/data directory. Then, after you start the demonstration
applet, replace "demo_input" in the first applet window with the
name of your data file.
- When the program begins, the user is presented with a window
that contains text fields for the name of the input data file,
the prefix of the output data files, and labels for some of the plots.
After these fields have been filled, the user clicks on the Go
button to proceed.
- After the Go button is clicked, the program loads the first curve
in the data set. A user can advance through the collection of data
curves by pressing the Next button. Alternatively, a user can
specify a particular curve by typing its ID into the text field next
to the Load button, and then pressing the Load button.
- A user excludes rectangular regions of data by pushing a mouse
button down at one corner of the rectangle, holding the button down
while moving the pointer to the opposite corner of the rectangle, and
then releasing the button. While a user is performing this movement,
an animated red rectangle appears that outlines the current exclusion
region. After the button is released the outline of the rectangle
turns black, and it can no longer be moved. However, the data in the
rectangle can be reactivated if the rectangle is "cleared." A user
clears a rectangle by clicking a mouse button in it (all rectangles
that contain this point will be cleared) and then clicking
on the Clear button at the bottom of the page. All rectangles
that are shown on the current page (in the current "zoom state" ---
see below) will be cleared if the user clicks
the Clear All button at the bottom of the page.
- If a user wants to look at only the data that has not been
excluded, the user should press the Zoom button. A sequence of
zooms may be performed.
- Currently, the program does not permit a user to undo a single
zoom. However, a user can reactivate all of the data by pressing the
Reset button at the bottom of the page. Alternatively, a user can see all of
the inactivated rectangles by performing a fit (see below) and then
clicking the appropriate recall button at the top of the page (see below).
- A user fits the active data by clicking the Fit
button at the bottom of the page. In the demonstration program, the
fit is a simple linear regression fit. In the future the program will
be modified to perform certain nonlinear regression fits. If you
would like to use this program and have a particular nonlinear
regression model in mind, please contact me at the e-mail address
given below.
- After a fit has been performed, an appropriate summarizing value
appears in one of the boxes at the top of the page. For the
demonstration program, the summarizing value is the fitted slope of
the curve (the modulus of elasticity estimate). Up to 5 fits can be
performed on any one data set (of course, additional fits can be
performed by reloading the data set or by rerunning the program). If
i < 5 fits have been performed and the Fit button is
pushed, the summarizing value will appear in text field i + 1
at the top of the page. At any time prior to a push of the Load,
Next, or Stop button, a fit can be recalled by pushing the button
(one of buttons 1 through 5) associated with its text field at the
top of the page.
- After a fit has been performed (actually, even before the Fit
button has been pushed), a user can view various associated
residual plots by pushing the Residuals button at the bottom of
the page. This opens a second window that contains a different set of
buttons at its bottom. The Data button displays the currently
active data. The Fit button adds the fitted line. The Res.
vs. x button plots the residuals versus the corresponding x
values. The Res. vs. pred. y button plots the residuals versus
the predicted y values. The Histo button produces a histogram
of the residuals. The NPP button produces a normal probability
plot of the residuals. The Bye button deletes the residuals
window.
- The Stop button ends the execution of the program.
- If a fit has been performed, then the next time the
Load, Next, or Stop button is pushed, information
is written to two results files. The ID and summarizing value of the
most recently displayed fit are written to a file titled xxx.summ
(where xxx is provided by the user in the startup box). The ID, the
number of inactivated rectangles associated with the most recently
displayed fit, and the pairs of x,y values that define them are
written to a file titled xxx.inact. As of 10/22/96, these
files can be retrieved via anonymous ftp from the pub/data directory.
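The core of the workflow described above (read the curves, drop the points that fall in excluded rectangles, fit a line to what remains) can be sketched in Java as follows. This is a minimal illustration and not the applet's source. The class and method names are hypothetical. It assumes that any line holding exactly two numbers (separated by white space or a comma) is an x,y pair and that any other non-blank line is a curve id, and it assumes that points on a rectangle's border count as excluded. The page does not specify either detail.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the parse / exclude / fit steps. */
class CurveFitSketch {

    /** Splits the input text into curves keyed by id, in file order. */
    static Map<String, List<double[]>> parse(String text) {
        Map<String, List<double[]>> curves = new LinkedHashMap<>();
        List<double[]> current = null;
        for (String line : text.split("\\r?\\n")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) continue;
            double[] point = toPoint(trimmed.split("[\\s,]+"));
            if (point != null && current != null) {
                current.add(point);              // x,y pair of the current curve
            } else {
                current = new ArrayList<>();     // anything else starts a new curve
                curves.put(trimmed, current);
            }
        }
        return curves;
    }

    /** Returns {x, y} if the tokens are exactly two numbers, else null. */
    static double[] toPoint(String[] tok) {
        if (tok.length != 2) return null;
        try {
            return new double[] { Double.parseDouble(tok[0]), Double.parseDouble(tok[1]) };
        } catch (NumberFormatException e) {
            return null;
        }
    }

    /** A rectangle is {x1, y1, x2, y2}: any two opposite corners. */
    static boolean inside(double[] r, double[] p) {
        return p[0] >= Math.min(r[0], r[2]) && p[0] <= Math.max(r[0], r[2])
            && p[1] >= Math.min(r[1], r[3]) && p[1] <= Math.max(r[1], r[3]);
    }

    /** Keeps only the points that lie in none of the exclusion rectangles. */
    static List<double[]> active(List<double[]> pts, List<double[]> rects) {
        List<double[]> keep = new ArrayList<>();
        for (double[] p : pts) {
            boolean excluded = false;
            for (double[] r : rects) {
                if (inside(r, p)) { excluded = true; break; }
            }
            if (!excluded) keep.add(p);
        }
        return keep;
    }

    /** Simple linear regression of y on x; returns {intercept, slope}. */
    static double[] fitLine(List<double[]> pts) {
        double mx = 0.0, my = 0.0;
        for (double[] p : pts) { mx += p[0]; my += p[1]; }
        mx /= pts.size();
        my /= pts.size();
        double sxx = 0.0, sxy = 0.0;
        for (double[] p : pts) {
            sxx += (p[0] - mx) * (p[0] - mx);
            sxy += (p[0] - mx) * (p[1] - my);
        }
        double slope = sxy / sxx;
        return new double[] { my - slope * mx, slope };
    }

    public static void main(String[] args) {
        // Made-up curve with one "glitch" at x = 3.
        String data = "specimen_1\n0.0 0.0\n1.0 2.0\n2.0 4.0\n3.0 9.0\n4.0 8.0\n";
        List<double[]> rects = new ArrayList<>();
        rects.add(new double[] { 2.5, 8.5, 3.5, 9.5 });   // box drawn around the glitch
        for (Map.Entry<String, List<double[]>> e : parse(data).entrySet()) {
            double[] fit = fitLine(active(e.getValue(), rects));
            System.out.println(e.getKey() + " " + fit[1]);  // id and fitted slope
        }
    }
}
```

The printed line pairs each curve id with its fitted slope, which is the kind of information the text above says is written to xxx.summ. The exact layout of that file is not documented on this page, so the output format here is only a guess.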
I welcome suggestions for improving this program. Please send
suggestions and bug reports to the e-mail address given below (or call me). Here is
the current demonstration applet.
For further information, please contact Steve Verrill at
sverrill@fs.fed.us
or 608-231-9375.
Last modified on 3/2/01.