Exercise 3-1
Suppose we wish to find the median of a collection of values. Assume that the we have read some values so far, and that we have no idea how many values remain to be read. Prove that we cannot afford to discard any of the values that we have read. Hint: One proof strategy is to assume that we can discard a value, and then find values for the unread–and therefore unknown–part of our collection that would cause the median to be the value that we discarded.
Solution
I must admit that I find this question very difficult to understand. When I first read it I wasn’t sure whether it is meant to be a very simple and straight forward question, or some kind of trick question which has a definite answer.
I decided to google this and found this very interesting forum on this topic (including some comments from the original author A.Koenig!). After spending an hour going through the thread I decided the question and answer might get a bit too mathematical / theatrical, which I am not prepared to spend my next hours or days researching it. For this reason, I will stick to my motto KISS – Keep It Simple and Stupid. i.e. I will just give a couple of examples showing that, the true median value of a full set “without discarding” may be different to the (distorted) median of the same full set “with discarding“. Hence demonstrating why we cannot afford to discard any numbers.
Example 1
Assume we have read the numbers 1, 2, 3, 4 so far, and imagine that the next number is 5.
Scenario | The eventual “full set” | The Median | Observation |
No Discarding | 1, 2, 3, 4, 5 | 3 | True median |
We discard 1 | 2, 3, 4, 5 | 3.5 | Larger than true median |
We discard 2 | 1, 3, 4, 5 | 3.5 | Larger than true median |
We discard 3 | 1, 2, 4, 5 | 3 | Identical to true median |
We discard 4 | 1, 2, 3, 5 | 2.5 | Smaller than true median |
Example 2
Assume we have read the numbers 1, 2, 3, 4 so far, and imagine that the next number is 5, 6.
Scenario | The eventual “full set” | The Median | Observation |
No Discarding | 1, 2, 3, 4, 5, 6 | 3.5 | True median |
We discard 1 | 2, 3, 4, 5, 6 | 4 | Larger than true median |
We discard 2 | 1, 3, 4, 5, 6 | 4 | Larger than true median |
We discard 3 | 1, 2, 4, 5, 6 | 4 | Larger than true median |
We discard 4 | 1, 2, 3, 5, 6 | 3 | Smaller than true median |
Conclusion
The two examples above have demonstrated that we cannot afford to discard any numbers from the number set, should we wish to identify the true median. By discarding any numbers from the (read) set, we essentially distort the median.
Thanks for this…i get stuck on parts of this book that arent even programming stuff and i think at the moment that im just not getting what im reading.
Proof by contradiction in accordance with the hint in the question:
Assume the removal of any value in a vector prior to the end of the sequence does not have an effect on the median. Then the median without removal of an element should equal the median with the removal of the element.
Example:
Original Vector: [10, 15, 20, 25]
Value to remove: 10
New Sequence: [5, 30, 35]
New Vector: [5, 15, 20, 25, 30, 35]
Median with Removal: (20+25)/2 = 22.5
Median without Removal: 20
Since this example violates our assumption above, it must follow that removed values affect the median and that we must retain all values prior to returning the median for the final vector.
I’d say it’s quite obvious anyway so I didn’t spend much time on this question. The only exception I can think of is when you have a large amount of data about, let’s say the household income in a country. Taking out one value when reinserting a new one will not have a significant influence on the median. It may have a large influence on the average though. Imagine a country with population 1 million where the median income is 50k and most people are between 25k and 75k. If there’s one person who makes 1 billion per year, it can increase the average with even 10k (not all inhabitants are of working age or even employed). The median tells you much more about the average household than the average income in such cases.