# Constrained data model - can you solve it?

By Murray Bourne, 21 Jul 2012

## Constrained data model - can you solve it?

Statisticians are often asked to predict the future. Their job is to look at historical trends and to make their best guess what will happen tomorrow, or maybe next year. This involves a process called **extrapolation**, where we project into the future based on past data.

For example, here is a set of data representing company sales (variable *s*) for a period of 10 days (*t*):

t |
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
---|---|---|---|---|---|---|---|---|---|---|

s |
2 | 4 | 3 | 5 | 9 | 6 | 8 | 9 | 8 | 10 |

Let's graph the data to get a better idea what it looks like:

To extrapolate intelligently, we need to consider the **shape** of the data curve, and then fit a curve as best we can through the points.

At the beginning, there is a steep rise in sales, but then it flattens out (perhaps the salespeople were getting tired). It appears to be a logarithmic curve, which in its standard form, starts fairly steeply, and then flattens out.

There are various formulas for fitting a curve to data (see the bottom of the page), but we'll use software.

We'll get Excel to draw a curve through the data points and then extrapolate for us. After entering the data in Excel, we highlight it, then choose the "**XY(Scatter)**" option. Once we see the chart, we select the graph and right click, then choose "**Add trendline**". We choose "**logarithmic**" because that's the shape of our data curve. Under "**Options**", we can choose to "**Forecast**" by as many steps as we like.

So in this next graph, you'll see the logarithmic growth model passing through the given data. I've forecast the next 20 days, so we can get an idea how our sales team will perform for the rest of the month, assuming their historical performance (for the first 10 days) continues.

(The above set of data and curve is just to illustrate the problem. We need a formula that will work with **any** set of data, within the constraint mentioned next.)

**Constraint**

Now, (and here's where the challenge comes) say we need to produce a curve such that the sum of the values of the model curve is **exactly** the same as the sum of the values (for the first 10 days) in the original data set?

For example, here is the data again, but with a new row (*m*) containing values that fit the above constraint. That is, the sum of the first 10 sales values (variable *s*) is the same as the sum of the first 10 model values (using variable *m*).

t |
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | Σ |
---|---|---|---|---|---|---|---|---|---|---|---|

s |
2 | 4 | 3 | 5 | 9 | 6 | 8 | 9 | 8 | 10 | 64 |

m |
2.17 | 3.53 | 4.69 | 5.68 | 6.52 | 7.23 | 7.84 | 8.36 | 8.80 | 9.18 | 64 |

The **sum** (**Σ**) is given in the final column. Both rows *s* and *m* add to 64.

Here is the graph of the *m*-values given in the above table. You'll notice it is similar to the Excel logarithmic graph, but not exactly the same. I have once again extrapolated another 20 steps in the chart below. We observe using this constrained model that the sales per day at the end of the month is less (around 11.5) than Excel's logarithm curve (which gave almost 13).

So, your challenge is to develop a formula to produce a curve that goes through any set of data, given the above constraint. Good luck! (This is quite challenging, but give it a go.)

See the 10 Comments below.