# Many "Glitchy" Regressions

New as of 8/4/97: David Geary's rubberband classes have been added to make the data exclusion process smoother. Also, many of the classes have been reorganized into packages.

New as of 2/21/01: Here is a link to a version that uses a Java translation of a Minpack nonlinear least squares routine to fit multiple data sets. (This version does not produce residual plots. It does, however, allow mouse-based "graphical" changes in the starting values used to estimate the parameters of the nonlinear least squares model.)

New as of 3/2/01: Here is a link to a version that uses Java translations of some of the LINPACK numerical linear algebra routines to handle a model that is more complex than a simple linear regression. (In fact the model at this link is just a + b (x - c)^2 but the LINPACK routines that are used can be used to handle any general linear model.)
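
As a side note on why a shifted quadratic fits the linear-model framework: if c is treated as known, the model is already linear in a and b. If c is also estimated, expanding the square gives an ordinary quadratic polynomial that is linear in its coefficients. This is only one way to set the problem up, and the linked version may do it differently; the symbols $\beta_0, \beta_1, \beta_2$ are introduced here purely for illustration.

$$
a + b(x - c)^2 = \beta_0 + \beta_1 x + \beta_2 x^2,
\qquad \beta_0 = a + bc^2, \quad \beta_1 = -2bc, \quad \beta_2 = b .
$$

When $\beta_2 \neq 0$ the original parameters can be recovered as $b = \beta_2$, $c = -\beta_1 / (2\beta_2)$, and $a = \beta_0 - \beta_1^2 / (4\beta_2)$.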

Researchers sometimes encounter data sets that consist of many curves. The researchers may need to "reduce" these curves further before performing any statistical tests. Some examples:

• A researcher is interested in the effect of various preservative treatments on the modulus of elasticity of several grades of lumber. The design is a two way analysis of variance (ANOVA). The data points in each cell are modulus of elasticity values obtained from stress-strain curves.
• A researcher is interested in the effect of several chemical inhibitors on the growth of several fungus species. Again the design is a two way ANOVA. The data points in each cell are 50% inhibition doses obtained from colony size versus dose curves.
• A researcher is interested in the effects of linerboard source, corrugating medium, and adhesive type on the creep behavior of corrugated fiberboard under cyclic humidity conditions. Here the design is a three way ANOVA. The data points in each cell are steady-state creep rates obtained from strain versus time curves.

In all of these examples, before the ANOVAs can be performed, the data curves must be reduced to single numbers. Sometimes this process is straightforward and it can be automated. In other cases, especially when the curves contain "glitches," human intervention is required --- an investigator needs to look at a plot of each data curve, select a region to fit, look at the results of the fit, decide whether an improved fit is needed, and so on. When the data set consists of hundreds of curves, this can be a painful process. We hope that our program will ease the pain.

The program is available as both a Java applet and a Java application. Alpha source code is available in both compressed tar and zip form. Here is javadoc documentation for some of this material.

The version here fits simple linear regressions. Here is a demonstration applet.
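
For readers who want to see what a simple linear regression fit involves, here is a minimal sketch of the closed-form least squares computation. It is not taken from the program's source; the class name `SimpleFit` and the method `fit` are hypothetical names chosen for this illustration.

```java
class SimpleFit {
    /** Returns {intercept, slope} of the least squares line y = a + b x. */
    static double[] fit(double[] x, double[] y) {
        int n = x.length;
        if (n < 2 || y.length != n) {
            throw new IllegalArgumentException("need at least two (x, y) pairs");
        }
        // Sample means of x and y.
        double xBar = 0.0, yBar = 0.0;
        for (int i = 0; i < n; i++) {
            xBar += x[i];
            yBar += y[i];
        }
        xBar /= n;
        yBar /= n;
        // Centered sums of squares and cross products.
        double sxx = 0.0, sxy = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - xBar;
            sxx += dx * dx;
            sxy += dx * (y[i] - yBar);
        }
        if (sxx == 0.0) {
            throw new IllegalArgumentException("all x values are identical");
        }
        double slope = sxy / sxx;
        return new double[] { yBar - slope * xBar, slope };
    }

    public static void main(String[] args) {
        // Made-up stress versus strain values, for illustration only.
        double[] strain = { 0.0, 1.0, 2.0, 3.0 };
        double[] stress = { 1.0, 3.0, 5.0, 7.0 };
        double[] ab = fit(strain, stress);
        System.out.println("intercept = " + ab[0] + ", slope = " + ab[1]);
    }
}
```

For a stress versus strain curve, the returned slope plays the role of the modulus of elasticity estimate.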

[Here are Java translations of LINPACK Cholesky, QR, singular value, and LU decomposition routines. Here are Java translations of several high quality public domain optimization packages, including Minpack nonlinear least squares routines. These were not used in this application, but they might be useful to someone who wants to produce their own regression applications.]

The program is designed to work as follows:

1. A user provides it with a data file. The file consists of a line that contains a curve id, followed by lines containing x,y pairs, followed by the id of the next curve, followed by its x,y pairs, and so on. (If you foresee that you might find a use for this program, but would prefer a different form for the input, please contact me at the e-mail address given below.) Currently, the demonstration applet reads a data set consisting of a collection of stress versus strain curves.
As of 10/26/96, you can provide your own data set. Simply transfer it by anonymous ftp to www1.fpl.fs.fed.us and put it in the pub/data directory. Then, after you start the demonstration applet, replace "demo_input" in the first applet window with the name of your data file.
2. When the program begins, the user is presented with a window that contains text fields for the name of the input data file, the prefix of the output data files, and labels for some of the plots. After these fields have been filled, the user clicks on the Go button to proceed.
3. After the Go button is clicked, the program loads the first curve in the data set. A user can advance through the collection of data curves by pressing the Next button. Alternatively, a user can specify a particular curve by typing its ID into the text field next to the Load button, and then pressing the Load button.
4. A user excludes rectangular regions of data by pushing a mouse button down at one corner of the rectangle, holding the button down while moving the pointer to the opposite corner of the rectangle, and then releasing the button. While a user is performing this movement, an animated red rectangle outlines the current exclusion region. After the button is released, the outline of the rectangle turns black, and it can no longer be moved. However, the data in the rectangle can be reactivated if the rectangle is "cleared." A user clears a rectangle by clicking a mouse button in it (all rectangles that contain this point will be cleared) and then clicking on the Clear button at the bottom of the page. All rectangles that are shown on the current page (in the current "zoom state"; see below) will be cleared if the user clicks the Clear All button at the bottom of the page.
5. If a user wants to look at only the data that has not been excluded, the user should press the Zoom button. A sequence of zooms may be performed.
6. Currently, the program does not permit a user to undo a single zoom. However, a user can reactivate all of the data by pressing the Reset button at the bottom of the page. Alternatively, a user can see all of the inactivated rectangles by performing a fit (see below) and then clicking the appropriate recall button at the top of the page (see below).
7. A user fits the active data by clicking the Fit button at the bottom of the page. In the demonstration program, the fit is a simple linear regression fit. In the future the program will be modified to perform certain nonlinear regression fits. If you would like to use this program, and have a particular nonlinear regression model in mind, please contact me at the e-mail address given below.
8. After a fit has been performed, an appropriate summarizing value appears in one of the boxes at the top of the page. For the demonstration program, the summarizing value is the fitted slope of the curve (the modulus of elasticity estimate). Up to 5 fits can be performed on any one data set (of course, additional fits can be performed by reloading the data set or by rerunning the program). If i < 5 fits have been performed, and the Fit button is pushed, the summarizing value will appear in text field i + 1 at the top of the page. At any time prior to a push of the Load, Next, or Stop buttons, this fit can be recalled by pushing the button (one of buttons 1 to 5) associated with its text field at the top of the page.
9. After a fit has been performed (actually, even before the Fit button has been pushed), a user can view various associated residual plots by pushing the Residuals button at the bottom of the page. This opens a second window that contains a different set of buttons at its bottom. The Data button displays the currently active data. The Fit button adds the fitted line. The Res. vs. x button plots the residuals versus the corresponding x values. The Res. vs. pred. y button plots the residuals versus the predicted y values. The Histo button produces a histogram of the residuals. The NPP button produces a normal probability plot of the residuals. The Bye button deletes the residuals window.
10. The Stop button ends the execution of the program.
11. If a fit has been performed, then the next time the Load, Next, or Stop button is pushed, information is written to two results files. To a file titled xxx.summ (where xxx is provided by the user in the startup box), the ID and summarizing value of the most recently displayed fit are written. To a file titled xxx.inact (where xxx is provided by the user in the startup box), the ID, the number of inactivated rectangles associated with the most recently displayed fit, and the pairs of x,y values that define them are written. As of 10/22/96, these files can be retrieved via anonymous ftp from the pub/data directory.
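
The input format in step 1 and the rectangle exclusion in step 4 can be sketched roughly as follows. This is an illustration under stated assumptions, not the program's actual code: the class `CurveSketch`, the nested `Rect` class, and the methods `parsePair`, `parse`, and `active` are all hypothetical names. In particular, the rule used to tell an ID line from a data line (a line that reads as exactly two numbers is an x,y pair, anything else is a curve ID) is an assumption, since the description above does not say how the program distinguishes them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class CurveSketch {
    /** An axis-aligned exclusion rectangle defined by two opposite corners. */
    static final class Rect {
        final double xMin, xMax, yMin, yMax;
        Rect(double x1, double y1, double x2, double y2) {
            xMin = Math.min(x1, x2); xMax = Math.max(x1, x2);
            yMin = Math.min(y1, y2); yMax = Math.max(y1, y2);
        }
        boolean contains(double x, double y) {
            return x >= xMin && x <= xMax && y >= yMin && y <= yMax;
        }
    }

    /** ASSUMPTION: a data line is exactly two numbers split by a comma and/or spaces. */
    static double[] parsePair(String line) {
        String[] parts = line.trim().split("[,\\s]+");
        if (parts.length != 2) return null;
        try {
            return new double[] { Double.parseDouble(parts[0]), Double.parseDouble(parts[1]) };
        } catch (NumberFormatException e) {
            return null;
        }
    }

    /** Groups x,y pairs under the most recent curve ID, keeping file order. */
    static Map<String, List<double[]>> parse(List<String> lines) {
        Map<String, List<double[]>> curves = new LinkedHashMap<>();
        List<double[]> current = null;
        for (String line : lines) {
            if (line.trim().isEmpty()) continue;
            double[] pair = parsePair(line);
            if (pair == null) {              // not a pair, so treat it as a new curve ID
                current = new ArrayList<>();
                curves.put(line.trim(), current);
            } else if (current != null) {    // pairs before the first ID are ignored
                current.add(pair);
            }
        }
        return curves;
    }

    /** Returns the points that lie outside every exclusion rectangle. */
    static List<double[]> active(List<double[]> points, List<Rect> excluded) {
        List<double[]> kept = new ArrayList<>();
        for (double[] p : points) {
            boolean inside = false;
            for (Rect r : excluded) {
                if (r.contains(p[0], p[1])) { inside = true; break; }
            }
            if (!inside) kept.add(p);
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up input lines, for illustration only.
        List<String> lines = Arrays.asList(
            "specimen-1", "0.0, 0.1", "1.0, 2.0", "2.0, 9.5", "3.0, 6.1",
            "specimen-2", "0.0, 1.0", "1.0, 1.5");
        Map<String, List<double[]>> curves = parse(lines);
        // Exclude the "glitch" near (2.0, 9.5) on the first curve.
        List<Rect> excluded = new ArrayList<>();
        excluded.add(new Rect(1.5, 9.0, 2.5, 10.0));
        List<double[]> kept = active(curves.get("specimen-1"), excluded);
        System.out.println(curves.keySet() + ": " + kept.size() + " active points on specimen-1");
    }
}
```

The points returned by `active` are what a subsequent fit would use. Storing each rectangle as a pair of opposite corners mirrors the pairs of x,y values that step 11 says are written to the xxx.inact file.
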

I welcome suggestions for improving this program. Please send suggestions and bug reports to the e-mail address given below (or call me). Here is the current demonstration applet.