Tuesday January 31, 2006
Serendipity in Numbers
In any largeish set of numeric data, you might guess that numbers beginning with the digit '1' would occur approximately 11% of the time, and that numbers beginning with each of the other digits, 2 - 9, would also occur 11% of the time. After all, shouldn't each digit have an equal probability of occurrence?
In truly random data, data which could just as easily have been labeled with letters or colors as numbers, you'd be right. But in truly numeric data, data that increases or decreases additively (or subtractively), you'd be surprised. In actual fact, values that begin with the digit '1' are much more likely to occur (especially if the numbers have at least 4 digits, e.g., from 1000 on up).
The mathematical rule that describes this phenomenon is referred to as Benford's Law, after the physicist who discovered it.
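For reference, the usual statement of Benford's Law is that a leading digit d turns up with probability log10(1 + 1/d). The short Python sketch below just evaluates that formula for d = 1 through 9; the function name benford_percent is made up for this illustration.

```python
import math

def benford_percent(digit):
    """Predicted share, in percent, of values whose first digit is `digit`."""
    if not 1 <= digit <= 9:
        raise ValueError("leading digit must be 1 through 9")
    # Benford's Law: P(d) = log10(1 + 1/d)
    return 100 * math.log10(1 + 1 / digit)

# Print the predicted percentage for each possible leading digit.
for d in range(1, 10):
    print(f"{d}: {benford_percent(d):4.1f}")
```

The nine percentages sum to 100, since the terms telescope to log10(10) = 1.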
In 1938, Dr. Frank Benford, a physicist at General Electric, "noticed that pages of logarithms corresponding to numbers starting with the numeral 1 were much dirtier and more worn than other pages." He asked himself why this might be the case.
Dr. Benford concluded that it was unlikely that physicists and engineers had some special preference for logarithms starting with 1. He therefore embarked on a mathematical analysis of 20,229 sets of numbers, including such wildly disparate categories as the areas of rivers, baseball statistics, numbers in magazine articles and the street addresses of the first 342 people listed in the book "American Men of Science." All these seemingly unrelated sets of numbers followed the same first-digit probability pattern as the worn pages of logarithm tables suggested. In all cases, the number 1 turned up as the first digit about 30 percent of the time, more often than any other. [Following Benford's Law, or Looking Out for No. 1, by Malcolm W. Browne (from The New York Times, Tuesday, August 4, 1998)]
Rich thought it would be fun to test Benford's law on data we have readily available — file sizes on the nearest computer. He wrote up a little program to do so; not surprisingly, the results fit the prediction. We then had the following discussion:
Consider the fact that Rich and I have been using Unix systems, running the ls (file listing) command, for over twenty years now. Why is it that, in all that time, we never noticed that more files have a size beginning with the digit 1 than with any other digit (and that approximately half of all files have a size that begins with either 1 or 2)?
Now, consider how Dr. Benford tripped over this idea — not by looking at the numbers (at least not at first) but by noticing that some of the pages of logarithms were dirtier and more used than others. He then formulated a hypothesis and analyzed 20,229 sets of numbers to test it, in 1938, without the aid of a computer.
Think about it.
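Rich's program isn't reproduced here, but the experiment is easy to repeat. What follows is only a minimal Python sketch of the idea, not his code: it walks a directory tree and tallies the first digit of each file's size. The names leading_digit and tally_file_sizes are invented for this example, and skipping symbolic links and empty files is an assumption on my part.

```python
import os
from collections import Counter

def leading_digit(n):
    """Return the first decimal digit of a positive integer."""
    return int(str(n)[0])

def tally_file_sizes(root):
    """Count the leading digits of the sizes of files found under `root`."""
    counts = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # assumption: symbolic links are not counted
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if size > 0:  # a zero-byte file has no leading digit from 1 to 9
                counts[leading_digit(size)] += 1
    return counts
```

Calling tally_file_sizes("/usr"), or any other directory you can read, returns a Counter keyed by digit; dividing each count by the total gives percentages to set beside the predicted ones.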
Charting Benford's Law
Rich's results
cnt: 766281
d  pred.  data          1         2         3         4
               123456789 123456789 123456789 123456789
1 (30.1): 32.1 ********************************
2 (17.6): 19.7 ********************
3 (12.5): 13.4 *************
4 ( 9.7):  9.4 *********
5 ( 7.9):  7.3 *******
6 ( 6.7):  6.0 ******
7 ( 5.8):  4.8 *****
8 ( 5.1):  4.1 ****
9 ( 4.6):  3.1 ***
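A chart in this style takes only a few lines to produce from a table of counts. The Python sketch below is an illustration rather than Rich's program: render_chart is a made-up name, the input is assumed to be a mapping from leading digit to count, and drawing one asterisk per percentage point (rounded to the nearest whole point) is a guess based on the bars above.

```python
import math

def render_chart(counts):
    """Return text lines comparing Benford's prediction with observed shares.

    `counts` maps each leading digit (1-9) to the number of values seen.
    """
    total = sum(counts.get(d, 0) for d in range(1, 10))
    lines = [f"cnt: {total}"]
    for d in range(1, 10):
        pred = 100 * math.log10(1 + 1 / d)  # Benford's predicted percentage
        data = 100 * counts.get(d, 0) / total if total else 0.0
        bar = "*" * int(data + 0.5)  # assumption: one star per percent, rounded
        lines.append(f"{d} ({pred:4.1f}): {data:4.1f} {bar}")
    return lines
```

Printing the result with print("\n".join(render_chart(counts))) gives one row per digit, with the predicted percentage in parentheses followed by the observed percentage and its bar.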
(in category SciTech) - posted at Tue, 31 Jan, 08:04 Pacific

vlb@cfcl.com