Objective:
Provide visual foundation and geometric intuition for best
fit (least squares) models of data sets of ordered pairs using lines or parabolas.
Level: Precalculus,
calculus, or linear algebra.
Prerequisites:
Equations
of lines and parabolas. Introductory material on building a model to a
data set of ordered pairs. The depth of such material is dependent on the
level at which the routines are used.
Platform:
Routines in both Excel and MATLAB are provided.
Instructor's Notes:
A common problem in a variety of applications
is the development of mathematical models for a set of data of ordered
pairs
$$S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}.$$
One of the first such models that students encounter involves finding the
equation of a straight line that in some way "matches" or approximates
the data. If it happens that all the data points lie on the same line,
then we can find the equation of the line using any two distinct points
from the data set; that is, any pair of points from S, call them (x1,y1)
and (x2, y2), with x1 not equal to
x2.
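We merely compute the slope m of the line segment between the pair of points
as
$$m = \frac{y_2 - y_1}{x_2 - x_1}$$
and use the point-slope form of the line,
either
$$y - y_1 = m(x - x_1) \quad \text{or} \quad y - y_2 = m(x - x_2).$$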
These expressions are algebraically equivalent as can be shown by rewriting
them in the form y = mx + b, where b is the y-intercept of the line. (If
all the points lie on a vertical line, then its equation has the form x
= c, where c is the x-coordinate of each of the data points.)
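As a small illustration of the two-point computation, here is a Python sketch. The function name `line_through` and the sample points are assumptions made for this example; they are not part of the demo's Excel or MATLAB materials.

```python
def line_through(p1, p2):
    """Return (m, b) so that y = m*x + b passes through p1 and p2."""
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2:
        # Vertical line x = c: no slope-intercept form exists.
        raise ValueError("points lie on the vertical line x = %r" % (x1,))
    m = (y2 - y1) / (x2 - x1)   # slope of the segment joining the points
    b = y1 - m * x1             # y - y1 = m(x - x1) rewritten as y = mx + b
    return m, b


# Hypothetical points chosen only for illustration.
m, b = line_through((1.0, 3.0), (3.0, 7.0))
print("y = %gx + %g" % (m, b))
```

Using the second point instead, `b = y2 - m * x2`, gives the same intercept, which is the algebraic equivalence noted above.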
More often it is the case that no single
line goes through all of the points. In this case we can develop a mathematical
model for the data set by determining the equation of a line that comes
closest to all the data points, but need not go through any of them. In
order to make this precise we must define what we mean by "closest". In
situations where it seems reasonable for the mathematical model to be a
straight line, one of the most common definitions of "closest" requires
us to minimize the sum of the squares of the vertical deviations between the data points
and the line we seek. (Click here to see
a picture of vertical deviations for a sample data set.) A line determined in this way is called a line of
best fit in the least squares sense, or a line of best fit, for short.
The idea is to determine the slope m and y-intercept b of the line y = mx + b so that
the sum, over all n data points (x_i, y_i) in S, of the squared vertical deviations
$$\sum_{i=1}^{n} \bigl(y_i - (m x_i + b)\bigr)^2$$
is as small as possible. It can be shown that
under very mild restrictions there will be unique values of m and
b that guarantee that the preceding expression is as small as possible.
The development of formulas for m and b can be achieved using calculus
or linear algebra. In some cases the formulas are just stated and a student
uses them on faith.
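For instructors who want to check a student's conjecture numerically, the following Python sketch computes m and b from the standard closed-form solution (written in centered form) and evaluates the sum of squared vertical deviations. The function names and the four data points are hypothetical. The "mild restriction" appears in the code as the requirement that the data contain at least two distinct x-values.

```python
def least_squares_line(points):
    """Return (m, b) minimizing the sum of squared vertical deviations."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    sxx = sum((x - x_bar) ** 2 for x, _ in points)
    if sxx == 0:
        # All x-values equal: the data lie on a vertical line.
        raise ValueError("need at least two distinct x-values")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    m = sxy / sxx
    b = y_bar - m * x_bar
    return m, b


def sum_sq_dev(points, m, b):
    """Sum of squared vertical deviations from the line y = m*x + b."""
    return sum((y - (m * x + b)) ** 2 for x, y in points)


# Hypothetical data set, not taken from the demo.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 4.0), (3.0, 4.0)]
m, b = least_squares_line(data)
print(m, b, sum_sq_dev(data, m, b))
```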
Rather than concentrate on the development
of the formulas for m and b, we can have the student experiment to develop
a geometric intuition for the line of best fit. The idea is to have a data
set displayed as a set of ordered pairs in the plane and have the student
develop a conjecture for the line of best fit. There are various ways to provide
assistance for students to experiment in determining approximations to the
line of best fit. Here we illustrate two such techniques, one using Excel
and another using MATLAB.
In Excel: Here we give students
control over the selection of the slope m with a slider and a
separate slider for the y-intercept b. In Figure 1 we show the data
set for the Olympic Women's Discus event. The top slider controls the slope
and the second slider controls the y-intercept. As the sliders are moved the
line is repositioned.
Figure 1.
To go with this activity we need a device that
indicates the "accuracy" of the line in coming as close as possible to all
the data points; that is, how close it is to the line of best fit. The
device we use is geometrical. We construct a square in a
separate figure whose area indicates the value of the sum of the squares of
the vertical deviations, which is just the value of the expression
$$\sum_{i=1}^{n} \bigl(y_i - (m x_i + b)\bigr)^2.$$
As the sliders are moved the area of the
square changes since the values of m or b change. We
illustrate this idea with an animation, which is a QuickTime file. Click
here to view the animation.
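The square device can be mimicked in a few lines of Python. This is only a sketch of the idea, not the Excel workbook's implementation: since the square's area equals the sum of squared deviations, its side is the square root of that sum. The data and the three slider settings are hypothetical.

```python
import math


def sum_sq_dev(points, m, b):
    """Sum of squared vertical deviations from the line y = m*x + b."""
    return sum((y - (m * x + b)) ** 2 for x, y in points)


def square_side(points, m, b):
    """Side length of the square whose area is the sum of squared deviations."""
    return math.sqrt(sum_sq_dev(points, m, b))


# Hypothetical data set and slider positions (m, b).
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 4.0), (3.0, 4.0)]
for m, b in [(0.0, 3.0), (0.5, 2.0), (1.0, 1.5)]:
    area = sum_sq_dev(data, m, b)
    print("m=%.2f b=%.2f area=%.3f side=%.3f" % (m, b, area, square_side(data, m, b)))
```

For this data the square shrinks as the settings approach m = 1, b = 1.5, which is the least squares line for these four points.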
We have a collection of Excel routines that
are designed as described above. We use a variety of data sets from Olympic
events and U.S. society. To execute or download one of these routines
click on its title.
In the activity The Shrinking Value of the Dollar, the
graph of the data set is shown in Figure 2.
Figure 2.
It appears that this data may not be well approximated by a
line, but perhaps a parabola (a quadratic polynomial) may give better results.
To illustrate such an approximation we have included a quadratic best fit for
this data. To execute or download this Excel file click on Figure 3.
Figure 3.
Warning: These activities only provide an
approximation to the equation of the line of best fit since the sliders are
calibrated to select a discrete set of values. However, careful selection of
values for m and b yields good approximations.
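The effect of discrete slider values can be seen in a short Python sketch. The step size of 0.4 and the data are hypothetical and are not the calibration used in the Excel files. The code searches every (m, b) pair the sliders could select and compares the best of them with the exact minimum, which for this data occurs at m = 1, b = 1.5.

```python
def sum_sq_dev(points, m, b):
    """Sum of squared vertical deviations from the line y = m*x + b."""
    return sum((y - (m * x + b)) ** 2 for x, y in points)


def best_on_grid(points, m_values, b_values):
    """Return (sum, m, b) for the best line among the discrete slider values."""
    return min((sum_sq_dev(points, m, b), m, b)
               for m in m_values for b in b_values)


# Hypothetical data set; its exact least squares line is y = x + 1.5.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 4.0), (3.0, 4.0)]

step = 0.4                                  # assumed slider increment
m_values = [i * step for i in range(6)]     # 0.0, 0.4, ..., 2.0
b_values = [i * step for i in range(9)]     # 0.0, 0.4, ..., 3.2

grid_sse, grid_m, grid_b = best_on_grid(data, m_values, b_values)
exact_sse = sum_sq_dev(data, 1.0, 1.5)
print("best slider setting: m=%.1f b=%.1f sum=%.2f" % (grid_m, grid_b, grid_sse))
print("exact minimum sum:   %.2f" % exact_sse)
```

Because neither slider can land on the exact values, the best setting available gives a slightly larger sum than the true minimum; a finer step narrows the gap.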
In MATLAB: Rather than
having a student choose values of m and b, an alternative is to have the student pick two points in the plane (not necessarily
data points from S) and connect them. (The student should select the two
points so that the line connecting them comes close to the points of the
data set.) Then the vertical deviations from the data points in S to this
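line can be drawn. Hence, with m and b the slope and y-intercept of this line, the value of
$$\sum_{i=1}^{n} \bigl(y_i - (m x_i + b)\bigr)^2$$
can be computed for the conjectured line of best fit. In a group situation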
the conjectured least squares lines can be viewed by different groups and
the sum of the squares of the vertical deviations can be compared. (Teams
of 2 or 3 students have worked well in this regard.) Once the student selections
have been completed, the true line of best fit can be given to the groups
along with the minimum value of the sum of the squares of the vertical
deviations. It is instructive to have students draw the line of best fit on the
same graph as their estimated line. A discussion of the conjectures and features of the process
can help students do a better job on a second example. In order to demonstrate
the process and provide a visual model the MATLAB routine lsqgame has been
used successfully in a variety of classes. A brief description of this
routine follows.
LSQGAME Least Squares Line Game
An interactive 'game' to select
the least squares line to a set of data. Two guesses for the least squares line can
be made using the mouse to select two points that are then connected. The
sum of the squares of the vertical deviations from the corresponding line
is computed and displayed. The 'true' least squares line can be displayed.
The data set for the 'game' can be entered using the mouse, typed in as
an n by 2 matrix, loaded from a previously stored data set, or loaded by
executing an m-file. In the latter two cases the data set must be an n by
2 matrix specifically named dmat. Upon quitting the game the data set dmat
can be saved for future use.
Use in the form ===> lsqgame <===
Requires MATLAB and routine myginput.
A sample screen appears below. (Click on the
picture
to enlarge it. Click on Back to return to the document.)
Routine lsqgame can be used as a demonstration
involving two players or it can be used with small groups in a lab setting.
It is versatile and easy enough that there is no need to have experience
with MATLAB.
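For readers without MATLAB, the scoring step of such a game can be sketched in Python. This is not the lsqgame source; the function names, the data, and the two players' clicked points are all hypothetical. Each guess is the line through two clicked points, scored by its sum of squared vertical deviations and compared with the true least squares line.

```python
def line_through(p1, p2):
    """Slope and intercept of the line through two points with distinct x."""
    (x1, y1), (x2, y2) = p1, p2
    m = (y2 - y1) / (x2 - x1)
    return m, y1 - m * x1


def sum_sq_dev(points, m, b):
    """Sum of squared vertical deviations from the line y = m*x + b."""
    return sum((y - (m * x + b)) ** 2 for x, y in points)


def least_squares_line(points):
    """True least squares line (centered closed-form formulas)."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    sxx = sum((x - x_bar) ** 2 for x, _ in points)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    m = sxy / sxx
    return m, y_bar - m * x_bar


# Hypothetical data and mouse clicks (two points per player).
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 4.0), (3.0, 4.0)]
clicks = {
    "Player 1": ((0.0, 1.0), (3.0, 5.0)),
    "Player 2": ((0.0, 1.0), (2.0, 3.5)),
}

scores = {}
for player, (p1, p2) in clicks.items():
    m, b = line_through(p1, p2)
    scores[player] = sum_sq_dev(data, m, b)
    print("%s: y = %.3fx + %.3f, sum = %.4f" % (player, m, b, scores[player]))

true_m, true_b = least_squares_line(data)
true_sse = sum_sq_dev(data, true_m, true_b)
print("True line: y = %.3fx + %.3f, sum = %.4f" % (true_m, true_b, true_sse))
```

No guess can score below the true line's sum, so the gap between a player's score and that minimum measures how good the conjecture was.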
It is fun to give data sets containing
outliers and observe how the model changes. The graphical impact from using
routine lsqgame is much more dramatic than the algebraic impact from using
formulas for m and b.
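The sensitivity to outliers is easy to reproduce numerically. In the Python sketch below, which uses made-up data, five points lie exactly on y = x; replacing the last y-value with an outlier changes the least squares line substantially.

```python
def least_squares_line(points):
    """Return (m, b) minimizing the sum of squared vertical deviations."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    sxx = sum((x - x_bar) ** 2 for x, _ in points)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    m = sxy / sxx
    return m, y_bar - m * x_bar


# Hypothetical data: five points exactly on the line y = x.
clean = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (4.0, 4.0)]
# Same data with the last point moved far off the line (an outlier).
with_outlier = clean[:-1] + [(4.0, 14.0)]

clean_m, clean_b = least_squares_line(clean)
out_m, out_b = least_squares_line(with_outlier)
print("without outlier: y = %gx + %g" % (clean_m, clean_b))
print("with outlier:    y = %gx + %g" % (out_m, out_b))
```

A single changed point triples the slope here, because squaring the deviations gives large deviations a disproportionate influence.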
An extension of lsqgame is quadgame
which determines a parabolic model. It is played like lsqgame, but
now three points are chosen to conjecture the parabola that is closest
in the least squares sense. Quadgame is useful for modeling simple
data sets that arise from 'ballistic' situations. A sample screen appears
below. (Click on the picture to enlarge it. Click on Back to return
to the document.)
The routines lsqgame and quadgame
along with the utility myginput are available to be downloaded by
clicking on lsqdemos.
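The parabolic version of the game can be sketched in Python as well. How quadgame works internally is not described here, so the approach below is an assumption: the guessed parabola is the unique quadratic through three clicked points, and the least squares parabola is found by solving the 3 by 3 normal equations. All names and data values are hypothetical.

```python
def solve3(A, rhs):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    M = [list(row) + [r] for row, r in zip(A, rhs)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("system is singular (need three distinct x-values)")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, 3):
            factor = M[r][col] / M[col][col]
            for c in range(col, 4):
                M[r][c] -= factor * M[col][c]
    sol = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        tail = sum(M[i][j] * sol[j] for j in range(i + 1, 3))
        sol[i] = (M[i][3] - tail) / M[i][i]
    return sol


def parabola_through(p1, p2, p3):
    """Coefficients [a, b, c] of y = a*x^2 + b*x + c through three points."""
    pts = (p1, p2, p3)
    return solve3([[x * x, x, 1.0] for x, _ in pts], [y for _, y in pts])


def least_squares_parabola(points):
    """Coefficients [a, b, c] from the normal equations for columns x^2, x, 1."""
    def S(k):
        return sum(x ** k for x, _ in points)

    def T(k):
        return sum((x ** k) * y for x, y in points)

    A = [[S(4), S(3), S(2)],
         [S(3), S(2), S(1)],
         [S(2), S(1), float(len(points))]]
    return solve3(A, [T(2), T(1), T(0)])


def sum_sq_dev_quad(points, coeffs):
    """Sum of squared vertical deviations from the parabola."""
    a, b, c = coeffs
    return sum((y - (a * x * x + b * x + c)) ** 2 for x, y in points)


# Hypothetical 'ballistic' data: roughly y = -x^2 + 4x with small errors.
data = [(0.0, 0.1), (1.0, 2.9), (2.0, 4.2), (3.0, 2.8), (4.0, 0.0)]

guess = parabola_through((0.0, 0.0), (2.0, 4.0), (4.0, 0.0))  # three clicks
best = least_squares_parabola(data)
guess_sse = sum_sq_dev_quad(data, guess)
best_sse = sum_sq_dev_quad(data, best)
print("guess:", guess, "sum =", guess_sse)
print("best: ", best, "sum =", best_sse)
```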
An interesting observation related to sports
technology: The winning heights for the
men's Olympic pole vault event from 1896 through 2004 are shown in Figure 3.
The least squares line to this data is not a particularly good fit.
Figure 3.
However, let's divide the data into three
eras based on the technology used in the event as follows: 1896 - 1924, 1928
- 1960, and 1964 - 2004. Now determine the least squares line for the data
from each era. We obtain the fits shown in Figure 4, which are quite good. A
student activity to determine the technology used in each era is a nice tie
to properties of technology that influence physical events.
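The era-by-era fitting can be sketched in Python. The era boundaries below are the ones given above, but the heights are synthetic placeholder values generated by a formula, NOT actual Olympic results; they only mimic a trend whose slope and level change between eras. Substitute the real winning heights to reproduce Figure 4.

```python
def least_squares_line(points):
    """Return (m, b) minimizing the sum of squared vertical deviations."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    sxx = sum((x - x_bar) ** 2 for x, _ in points)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    m = sxy / sxx
    return m, y_bar - m * x_bar


def sum_sq_dev(points, m, b):
    """Sum of squared vertical deviations from the line y = m*x + b."""
    return sum((y - (m * x + b)) ** 2 for x, y in points)


def fit_by_era(points, eras):
    """Fit a separate least squares line to the points inside each era."""
    fits = []
    for first, last in eras:
        subset = [(x, y) for x, y in points if first <= x <= last]
        m, b = least_squares_line(subset)
        fits.append(((first, last), m, b, sum_sq_dev(subset, m, b)))
    return fits


ERAS = [(1896, 1924), (1928, 1960), (1964, 2004)]  # from the text


def synthetic_height(year):
    """Placeholder heights (made up): a different linear trend in each era."""
    if year <= 1924:
        return 3.0 + 0.03 * (year - 1896)
    if year <= 1960:
        return 4.0 + 0.01 * (year - 1928)
    return 5.0 + 0.02 * (year - 1964)


data = [(float(year), synthetic_height(year)) for year in range(1896, 2005, 4)]

single_m, single_b = least_squares_line(data)
single_sse = sum_sq_dev(data, single_m, single_b)
era_fits = fit_by_era(data, ERAS)
era_sse = sum(fit[3] for fit in era_fits)

print("one line for all years: sum = %.4f" % single_sse)
for (first, last), m, b, s in era_fits:
    print("%d-%d: slope = %.4f, sum = %.6f" % (first, last, m, s))
```

With the placeholder data each era is exactly linear, so the three era sums are essentially zero while the single line leaves a visibly larger sum; real data would show the same contrast less sharply.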
Figure 4.
Click here to
see Auxiliary Resources for this demo.
Credits:
This demo, the Excel files, and the MATLAB m-files were submitted by
Dr.
David R. Hill
Department of Mathematics
Temple University
and are included in Demos
with Positive Impact with his permission.