Best Fit Experimenters
(Least Squares Approximations)


This demo has been revised and now includes Excel activities (9/1/2006).

Objective: Provide a visual foundation and geometric intuition for best fit (least squares) models of data sets of ordered pairs using lines or parabolas.

Level: Precalculus, calculus, or linear algebra.

Prerequisites: Equations of lines and parabolas. Introductory material on building a model for a data set of ordered pairs. The depth of such material depends on the level at which the routines are used.

Platform: Routines in both Excel and MATLAB are provided.

Instructor's Notes:
A common problem in a variety of applications is the development of mathematical models for a data set of ordered pairs

S = {(x1, y1), (x2, y2), ..., (xn, yn)}.

One of the first such models that students encounter involves finding the equation of a straight line that in some way "matches" or approximates the data. If it happens that all the data points lie on the same line, then we can find the equation of the line using any two distinct points from the data set; that is, any pair of points from S, call them (x1, y1) and (x2, y2), with x1 not equal to x2. We merely compute the slope m of the line segment between the pair of points as

m = (y2 - y1)/(x2 - x1)

and use the point-slope form of the line, either

y - y1 = m(x - x1)   or   y - y2 = m(x - x2).

These expressions are algebraically equivalent as can be shown by rewriting them in the form y = mx + b, where b is the y-intercept of the line. (If all the points lie on a vertical line, then its equation has the form x = c, where c is the x-coordinate of each of the data points.)
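The two-point computation above can be sketched in a few lines of Python. This is only an illustration and is not part of the Excel or MATLAB routines; the function name line_through and the sample points are made up for the example.

```python
def line_through(p1, p2):
    """Return (m, b) for the non-vertical line y = m*x + b through p1 and p2."""
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2:
        # Vertical line x = c: there is no slope-intercept form.
        raise ValueError("points lie on the vertical line x = %r" % (x1,))
    m = (y2 - y1) / (x2 - x1)   # slope of the segment between the points
    b = y1 - m * x1             # point-slope form y - y1 = m(x - x1) evaluated at x = 0
    return m, b


# Illustrative points only.
m, b = line_through((1, 3), (4, 9))
print("y = %gx + %g" % (m, b))
```

Using the second point instead, b = y2 - m*x2, gives the same intercept, which is the algebraic equivalence noted above.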

More often it is the case that no single line goes through all of the points. In this case we can develop a mathematical model for the data set by determining the equation of a line that comes closest to all the data points, but need not go through any of them. In order to make this precise we must define what we mean by "closest". In situations where it seems reasonable for the mathematical model to be a straight line, one of the most common definitions of "closest" requires us to minimize the sum of the squares of the vertical deviations between the data points and the line we seek. (Click here to see a picture of vertical deviations for a sample data set.) A line determined in this way is called a line of best fit in the least squares sense, or a line of best fit, for short. The idea is to determine the slope m and y-intercept b of the line y = mx + b so that

[y1 - (m x1 + b)]^2 + [y2 - (m x2 + b)]^2 + ... + [yn - (m xn + b)]^2
 
is as small as possible. It can be shown that, under very mild restrictions, there are unique values of m and b that make the preceding expression as small as possible. The development of formulas for m and b can be achieved using calculus or linear algebra. In some cases the formulas are just stated and a student uses them on faith.
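For reference, one standard form of those formulas can be written as a short Python sketch. It assumes the data contain at least two distinct x-values (the usual "mild restriction"); the names sum_sq_dev and best_fit_line and the sample set S are illustrative and do not come from the demo's files.

```python
def sum_sq_dev(points, m, b):
    """Sum of the squares of the vertical deviations from the line y = m*x + b."""
    return sum((y - (m * x + b)) ** 2 for x, y in points)


def best_fit_line(points):
    """Slope and intercept of the least squares line (closed-form solution)."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    denom = n * sxx - sx * sx
    if denom == 0:
        raise ValueError("need at least two distinct x-values")
    m = (n * sxy - sx * sy) / denom
    b = (sy - m * sx) / n
    return m, b


# Made-up sample data, for illustration only.
S = [(0, 1), (1, 3), (2, 4), (3, 4)]
m, b = best_fit_line(S)
print(m, b, sum_sq_dev(S, m, b))
```

Changing m or b slightly and recomputing sum_sq_dev shows that the sum only gets larger, which is the behavior the experiments below let students discover visually.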

Rather than concentrate on the development of the formulas for m and b, we can have the student experiment to develop a geometric intuition for the line of best fit. The idea is to have a data set displayed as a set of ordered pairs in the plane and have the student develop a conjecture for the line of best fit. There are various ways to provide assistance for students to experiment in determining approximations to the line of best fit. Here we illustrate two such techniques, one using Excel and another using MATLAB.

In Excel: Here we give students control over the slope m with one slider and over the y-intercept b with a separate slider. In Figure 1 we show the data set for the Olympic Women's Discus event. The top slider controls the slope and the second slider controls the y-intercept. As the sliders are moved the line is repositioned. To go with this activity we need a device that indicates the "accuracy" of the line in coming as close as possible to all the data points; that is, how close it is to being the line of best fit.

Figure 1.

The device we use is geometrical. We construct a square in a separate figure whose area indicates the value of the sum of the squares of the vertical deviations, which is just the value of the expression

[y1 - (m x1 + b)]^2 + [y2 - (m x2 + b)]^2 + ... + [yn - (m xn + b)]^2

As the sliders are moved the area of the square changes since the values of m or b change. We illustrate this idea with an animation, which is a QuickTime file. Click here to view the animation.

We have a collection of Excel routines that are designed as described above. We use a variety of data sets from Olympic events and U.S. society. To execute or download one of these routines click on its title.

In the activity The Shrinking Value of the Dollar, the graph of the data set is shown in Figure 2.

Figure 2.

It appears that this data may not be well approximated by a line, but perhaps a parabola (a quadratic polynomial) may give better results. To illustrate such an approximation we have included a quadratic best fit for this data. To execute or download this Excel file click on Figure 3.

Figure 3.

Warning: These activities only provide an approximation to the equation of the line of best fit since the sliders are calibrated to select a discrete set of values. However, careful selection of values for m and b yields good approximations.
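The effect of discrete slider values can be imitated with a small Python sketch. The slider ranges and step counts below are assumptions chosen for the example, not the calibration used in the Excel files, and the helper names are hypothetical.

```python
import math


def sum_sq_dev(points, m, b):
    """Sum of the squares of the vertical deviations from the line y = m*x + b."""
    return sum((y - (m * x + b)) ** 2 for x, y in points)


def slider_values(lo, hi, steps):
    """Evenly spaced values a slider with the given number of steps could select."""
    return [lo + (hi - lo) * i / steps for i in range(steps + 1)]


def best_on_grid(points, m_values, b_values):
    """Try every slider combination; return (smallest sum, m, b)."""
    return min((sum_sq_dev(points, m, b), m, b)
               for m in m_values for b in b_values)


def square_side(area):
    """Side length of the square whose area equals the sum of squares."""
    return math.sqrt(area)


# Made-up sample data; its exact least squares line is y = x + 1.5 (sum of squares 1).
S = [(0, 1), (1, 3), (2, 4), (3, 4)]
grid_sse, grid_m, grid_b = best_on_grid(S, slider_values(0, 2, 5), slider_values(0, 3, 7))
print(grid_m, grid_b, grid_sse, square_side(grid_sse))
```

Because neither slider in this example can land exactly on m = 1 or b = 1.5, the best slider setting leaves a square slightly larger than the true minimum, which is the approximation the warning describes.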

In MATLAB: Rather than having the student choose values of m and b, an alternative is to have the student pick two points in the plane (not necessarily data points from S) and connect them. (The student should select the two points so that the line connecting them comes close to the points of the data set.) Then the vertical deviations from the data points in S to this line can be drawn. Hence the value of

[y1 - (m x1 + b)]^2 + [y2 - (m x2 + b)]^2 + ... + [yn - (m xn + b)]^2

can be computed for the conjectured line of best fit. In a group situation the conjectured least squares lines can be viewed by different groups and the sum of the squares of the vertical deviations can be compared. (Teams of 2 or 3 students have worked well in this regard.) Once the student selections have been completed, the true line of best fit can be given to the groups along with the minimum value of the sum of the squares of the vertical deviations. It is instructive to have students draw the line of best fit on the same graph as their estimated line. A discussion of the conjectures and features of the process can help students do a better job on a second example. In order to demonstrate the process and provide a visual model the MATLAB routine lsqgame has been used successfully in a variety of classes. A brief description of this routine follows.

LSQGAME Least Squares Line Game 

An interactive 'game' to select the least squares line to a set of data. Two guesses for the least squares line can be made using the mouse to select two points that are then connected. The sum of the squares of the vertical deviations from the corresponding line is computed and displayed. The 'true' least squares line can be displayed. The data set for the 'game' can be entered using the mouse, typed in as an n by 2 matrix, loaded from a previously stored data set, or loaded by executing an m-file. In the latter two cases the data set must be an n by 2 matrix specifically named dmat. Upon quitting the game the data set dmat can be saved for future use.

Use in the form ===> lsqgame <===
Requires MATLAB and routine myginput.
A sample screen appears below.

Routine lsqgame can be used as a demonstration involving two players or it can be used with small groups in a lab setting. It is versatile and easy enough that there is no need to have experience with MATLAB.
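The scoring idea behind the game can be sketched outside MATLAB as follows. This Python fragment is not the lsqgame source; it only mimics the comparison described above, and the function names, the data set, and the "clicked" points are all invented for the illustration.

```python
def line_through(p1, p2):
    """Slope and intercept of the line through two picked points (x1 != x2)."""
    (x1, y1), (x2, y2) = p1, p2
    m = (y2 - y1) / (x2 - x1)
    return m, y1 - m * x1


def sum_sq_dev(points, m, b):
    """Sum of the squares of the vertical deviations from y = m*x + b."""
    return sum((y - (m * x + b)) ** 2 for x, y in points)


def best_fit_line(points):
    """Closed-form least squares slope and intercept."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    m = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    return m, (sy - m * sx) / n


# Made-up data set and a made-up guess (two points a student might click).
S = [(0, 1), (1, 3), (2, 4), (3, 4)]
guess_m, guess_b = line_through((0, 1), (3, 5))
guess_sse = sum_sq_dev(S, guess_m, guess_b)

best_m, best_b = best_fit_line(S)
best_sse = sum_sq_dev(S, best_m, best_b)
print("guess:", guess_sse, "best:", best_sse)
```

A group's guess can never score below the value for the true least squares line, so the gap between the two numbers gives a natural way to compare conjectures.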

It is fun to give data sets containing outliers and observe how the model changes. The graphical impact from using routine lsqgame is much more dramatic than the algebraic impact from using formulas for m and b.
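The numerical side of the outlier experiment can be seen in a minimal Python sketch. The data are invented: five points lying exactly on y = 2x + 1, and then the same set with the last point moved far off the line. The function name best_fit_line is illustrative.

```python
def best_fit_line(points):
    """Closed-form least squares slope and intercept."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    m = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    return m, (sy - m * sx) / n


# Invented data: exactly on y = 2x + 1.
clean = [(0, 1), (1, 3), (2, 5), (3, 7), (4, 9)]
# Same data with the last point replaced by an outlier.
with_outlier = clean[:-1] + [(4, 20)]

print("clean fit:      ", best_fit_line(clean))
print("fit with outlier:", best_fit_line(with_outlier))
```

A single displaced point changes both the slope and the intercept noticeably, because its deviation is squared in the sum being minimized.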

An extension of lsqgame is quadgame, which determines a parabolic model. It is played like lsqgame, but now three points are chosen to conjecture the parabola that is closest in the least squares sense. Routine quadgame is useful for modeling simple data sets that arise from 'ballistic' situations. A sample screen appears below.
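The parabolic version of the comparison can be sketched in Python as well. This is not the quadgame source: it assumes the conjectured parabola is the one passing through the three chosen points, and it finds the least squares parabola from the 3 by 3 normal equations. All names and data values are made up for the example.

```python
def solve3(A, rhs):
    """Solve a 3x3 linear system by Gauss-Jordan elimination with partial pivoting."""
    M = [[float(v) for v in row] + [float(r)] for row, r in zip(A, rhs)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("singular system: need three distinct x-values")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(3):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [v - factor * p for v, p in zip(M[r], M[col])]
    return [M[i][3] / M[i][i] for i in range(3)]


def parabola_through(p1, p2, p3):
    """Coefficients (a, b, c) of y = a*x^2 + b*x + c through three points."""
    pts = [p1, p2, p3]
    return tuple(solve3([[x * x, x, 1] for x, _ in pts], [y for _, y in pts]))


def best_fit_parabola(points):
    """Least squares (a, b, c) from the normal equations."""
    def s(p, q=0):
        return sum((x ** p) * (y ** q) for x, y in points)
    A = [[s(4), s(3), s(2)],
         [s(3), s(2), s(1)],
         [s(2), s(1), s(0)]]
    return tuple(solve3(A, [s(2, 1), s(1, 1), s(0, 1)]))


def sum_sq_dev_quad(points, a, b, c):
    """Sum of squared vertical deviations from y = a*x^2 + b*x + c."""
    return sum((y - (a * x * x + b * x + c)) ** 2 for x, y in points)


# Invented 'ballistic' data and an invented three-point guess.
data = [(0, 0.2), (1, 2.9), (2, 4.1), (3, 3.0), (4, -0.1)]
guess = parabola_through((0, 0), (2, 4), (4, 0))
best = best_fit_parabola(data)
print(sum_sq_dev_quad(data, *guess), sum_sq_dev_quad(data, *best))
```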

The routines lsqgame and quadgame, along with the utility myginput, are available to be downloaded by clicking on lsqdemos.

An interesting observation related to sports technology: The winning heights for the men's Olympic pole vault event from 1896 through 2004 are shown in Figure 3. The least squares line to this data is not a particularly good fit.

 

Figure 3.

However, let's divide the data into three eras based on the technology used in the event, as follows: 1896 - 1924, 1928 - 1960, and 1964 - 2004. Now determine the least squares line for the data from each era. We obtain the fits shown in Figure 4, which are quite good. A student activity to determine the technology used in each era is a nice tie to properties of technology that influence physical events.
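The era-by-era fitting can be sketched in Python under clearly stated assumptions: the data below are synthetic placeholders with two jumps, NOT Olympic results, and the x-values are simple indices rather than years. The function names are illustrative.

```python
def best_fit_line(points):
    """Closed-form least squares slope and intercept."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    denom = n * sxx - sx * sx
    if denom == 0:
        raise ValueError("need at least two distinct x-values")
    m = (n * sxy - sx * sy) / denom
    return m, (sy - m * sx) / n


def sum_sq_dev(points, m, b):
    """Sum of the squares of the vertical deviations from y = m*x + b."""
    return sum((y - (m * x + b)) ** 2 for x, y in points)


def fit_by_era(points, eras):
    """Fit a separate least squares line to the points in each (start, end) era."""
    fits = []
    for start, end in eras:
        subset = [(x, y) for x, y in points if start <= x <= end]
        m, b = best_fit_line(subset)
        fits.append(((start, end), m, b, sum_sq_dev(subset, m, b)))
    return fits


# Synthetic placeholder data (not real results): three linear stretches with jumps.
data = [(0, 1.0), (1, 1.1), (2, 1.2),
        (3, 2.0), (4, 2.2), (5, 2.4),
        (6, 4.0), (7, 4.5), (8, 5.0)]
eras = [(0, 2), (3, 5), (6, 8)]

single_sse = sum_sq_dev(data, *best_fit_line(data))
piecewise_total = sum(sse for _, _, _, sse in fit_by_era(data, eras))
print("one line:", single_sse, "three eras:", piecewise_total)
```

With real event data, the era boundaries would be the year ranges listed above, and the comparison of the two sums makes the improvement from splitting the data concrete.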

Figure 4.

Click here to see Auxiliary Resources for this demo.

Credits:  This demo, the Excel files, and the MATLAB m-files were submitted by 

Dr. David R. Hill
Department of Mathematics 
Temple University

and are included in Demos with Positive Impact with his permission.


DRH 2/14/00   Last updated 9/2/2006 (Excel resources added 9/5/2006.)
