Exercise 3-2
Write a program to compute and print the quartiles (that is, the quarter of the numbers with the largest values, the next highest quarter, and so on) of a set of integers.
Solution
According to this Wikipedia page, for discrete distributions, there is no universal agreement on selecting the quartile values; there are 3 major computing methods.
I will apply Method 1 in my solution, which is defined as follows:
- Use the median to divide the ordered data set into two halves. Do not include the median in either half.
- The lower quartile value is the median of the lower half of the data. The upper quartile value is the median of the upper half of the data.
This rule is employed by the TI-83 calculator boxplot and “1-Var Stats” functions (according to Wikipedia).
It is also very important to note that there are two types of medians:
- Datum median – the middle data point of an ordered vector.
- Non-datum median – the mean of the middle two data points of an ordered dataset.
The lower quartile, median, and upper quartile can each be either a datum or a non-datum. Which one applies depends on the dataset pattern, which I shall expand on shortly.
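Method 1 and the two kinds of median can be sketched together in a few lines. This is only an illustrative sketch, not the program developed later in this post: the names `medianOfRange`, `Quartiles`, and `quartilesMethod1` are made up for this example, and the vector is assumed to be already sorted and to hold at least 2 numbers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Median of the elements v[first] .. v[last-1] of a sorted vector.
// Odd count  -> datum median (the single middle element).
// Even count -> non-datum median (mean of the two middle elements).
double medianOfRange(const std::vector<double>& v,
                     std::size_t first, std::size_t last)
{
    assert(first < last && last <= v.size());
    std::size_t n = last - first;
    std::size_t mid = first + n / 2;
    return n % 2 == 0 ? (v[mid - 1] + v[mid]) / 2 : v[mid];
}

struct Quartiles {
    double q1, q2, q3;
};

// Method 1: split the sorted data at the median and take the median of
// each half. When N is odd, the middle element belongs to neither half.
// Assumption: v is sorted non-descending and has at least 2 elements.
Quartiles quartilesMethod1(const std::vector<double>& v)
{
    std::size_t n = v.size();
    assert(n >= 2);
    std::size_t half = n / 2;           // number of elements in each half
    std::size_t upperStart = n - half;  // skips the middle element when n is odd

    Quartiles q;
    q.q1 = medianOfRange(v, 0, half);
    q.q2 = medianOfRange(v, 0, n);
    q.q3 = medianOfRange(v, upperStart, n);
    return q;
}
```

For the sorted set 7 15 36 39 40 41 this gives Q1 = 15, Q2 = 37.5, Q3 = 40, and for 6 7 15 36 39 40 41 42 43 47 49 it gives Q1 = 15, Q2 = 40, Q3 = 43.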
The 6 distinctive dataset patterns
After 3 hours of sketching and iterations, I am pleased to say that I have found “the dataset pattern” which I believe is consistent for all types of input datasets, and robust to implement. I summarise this pattern in the following table:
ID | Pattern | Condition | Quartiles (Q1, Q2, Q3) |
1 | NULL dataset | N = 0 | (Cannot compute) |
2 | 1 Number only | N = 1 | Quartiles all resolve to that 1 number provided. |
3 | Datum Profile [0 0 0] | N % 4 == 0 | |
4 | Datum Profile [0 1 0] | N % 4 == 1 | |
5 | Datum Profile [1 0 1] | N % 4 == 2 | |
6 | Datum Profile [1 1 1] | N % 4 == 3 |
Note
- I use Q1, Q2, and Q3 to denote Quartile 1, 2, and 3. (i.e. end of quarter 1, 2, and 3)
- If the dataset is empty (Null), it is not possible to compute the quartiles for that dataset.
- If the dataset contains only 1 number, it is quite natural to assume that there is no variation (or no range). i.e. all quartiles correspond to that number.
- If there are at least 2 numbers provided, the pattern is determined by the Datum Profile [Q1 Q2 Q3]. There are four possible types of Datum Profiles: [0 0 0], [0 1 0], [1 0 1], and [1 1 1].
- For example, a datum profile of [1 0 1] means: lower quartile is a datum, median is a non-datum, upper quartile is a datum. i.e. 1 represents datum. 0 represents non-datum.
- N is the total number of elements within the dataset.
- The % (percent) sign is used to compute remainder. e.g. N % 4 gives the remainder of N divided by 4.
Nomenclature
I use the following short-form letters to make the formula easier to understand, and program easier to code.
Symbol | Meaning |
v | Represents the vector. e.g. v[M] resolves to the value of the element located at index M of the vector v. |
N | Number of elements of the vector (i.e. the input dataset) |
ML | An index value for computing the lower quartile. (Think Median of the Lower half). |
M | An index value for computing the median. |
MU | An index value for computing the upper quartile. (Think Median of the Upper half). |
ml | The computed value of lower quartile. |
m | The computed value of median |
mu | The computed value of upper quartile. |
Solution Strategy
At a high level, this is what the program should do:
- Read the list of numbers from the user and store in a vector.
- Pattern 1: If vector contains no elements (i.e. NULL), output an error message (mentioning the program requires at least 1 number to compute quartiles), then exit the program peacefully.
- Pattern 2: If vector contains just 1 number, output a message saying that all quartiles values are the same as that number provided, then exit the program peacefully.
- Sort the vector so it becomes non-descending – this is required for the median computation.
- Determine whether the dataset belongs to pattern 3, 4, 5, or 6, execute the corresponding algorithms, display the quartile summary at the end.
Before walking through this solution strategy, I think it is worth expanding a bit more on how I derived the algorithms for patterns 3, 4, 5, and 6 in the first place.
Derivation of Algorithms
Pattern 1 (null dataset) and pattern 2 (1 number only) are very straightforward and have already been explained above, so I’m not going to expand further on these first two patterns.
Patterns 3, 4, 5 and 6, however, are much more interesting in comparison. These patterns require a bit more intelligence to solve, so I will focus on them one by one and summarise at the end.
The Repeating Cycles
The first thing I did to spot a pattern was to sketch out a number of vectors with sizes ranging from N=2 to N=11 (I could have done more, but the pattern had become quite obvious by that point!), circle the datum profile, and look for regularities. Through this exercise I observed that the datum profile changes every time I increase N by 1. I also observed that the profile change follows a cycle of 4 increments, e.g. the profile is the same at N=2, N=6, N=10, and so on. The following table is the “sketch” I produced, which led me to the algorithms.
N | N % 4 | Profile | Sketch |
2 | 2 | [101] | |
3 | 3 | [111] | |
4 | 0 | [000] | |
5 | 1 | [010] | |
6 | 2 | [101] | |
7 | 3 | [111] | |
8 | 0 | [000] | |
9 | 1 | [010] | |
10 | 2 | [101] | |
11 | 3 | [111] |
The pattern can therefore be summarised as follows:
Pattern ID | Datum Profile | N % 4 |
3 | [000] | 0 |
4 | [010] | 1 |
5 | [101] | 2 |
6 | [111] | 3 |
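The 4-step cycle can also be explained by parity rather than by sketching. Under Method 1 the median is a datum exactly when N is odd, and Q1 and Q3 are datums exactly when each half (of size N / 2, using integer division) has an odd number of elements. Both parities are fixed by N % 4. The sketch below is only a cross-check of the table; `datumProfile` is a hypothetical helper name and N is assumed to be at least 2.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: builds the datum profile "[Q1 Q2 Q3]" from parity
// alone (1 = datum, 0 = non-datum). Assumes n >= 2.
std::string datumProfile(std::size_t n)
{
    bool medianIsDatum = (n % 2 == 1);       // odd N: one middle element
    std::size_t half = n / 2;                // size of each half under Method 1
    bool quartileIsDatum = (half % 2 == 1);  // odd half: Q1 and Q3 are single elements

    std::string profile = "[";
    profile += quartileIsDatum ? "1 " : "0 ";
    profile += medianIsDatum ? "1 " : "0 ";
    profile += quartileIsDatum ? "1]" : "0]";
    return profile;
}
```

Writing N = 4k + r makes the link explicit: for r = 0 and r = 1 the half size is 2k (even), and for r = 2 and r = 3 it is 2k + 1 (odd), while N itself is odd only for r = 1 and r = 3.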
With this core table in place, let’s derive the equations.
Equations and Examples
These diagrams summarise the equations, examples, and code implementations for each of the four datum patterns.
Profile [000] – Equations and Examples
Profile [010] – Equations and Examples
Profile [101] – Equations and Examples
Profile [111] – Equations and Examples
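As a compact summary, the index equations for the four profiles can be written in one block. This is a reconstruction from the program in the next section, not a copy of the diagrams: it assumes zero-based indexing and integer division, and M_L, M_U, m_l, m_u stand for ML, MU, ml, mu from the nomenclature table. K stands for whichever index (ML, M, or MU) is being evaluated, and q for the matching value (ml, m, or mu).

```latex
\begin{aligned}
M &= \lfloor N/2 \rfloor, \qquad
M_L = \lfloor M/2 \rfloor, \qquad
M_U = \begin{cases}
        M + M_L,     & N \text{ even (profiles [000], [101])} \\
        M + M_L + 1, & N \text{ odd (profiles [010], [111])}
      \end{cases} \\
q &= v[K] && \text{if the profile bit is 1 (datum)} \\
q &= \frac{v[K-1] + v[K]}{2} && \text{if the profile bit is 0 (non-datum)}
\end{aligned}
```

For example, with the sorted set 7 15 36 39 40 41 (N = 6, profile [101]): M = 3, ML = 1, MU = 4, so ml = v[1] = 15, m = (v[2] + v[3]) / 2 = 37.5, and mu = v[4] = 40. With 6 7 15 36 39 40 41 42 43 47 49 (N = 11, profile [111]): M = 5, ML = 2, MU = 8, so ml = v[2] = 15, m = v[5] = 40, and mu = v[8] = 43.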
The Program
Now that we have our algorithms, let’s put everything together into a final program. I am purely using the skeleton program from chapter 3 of the book (which has an in-depth explanation of the various components). I merely implement the algorithms into this skeleton program, with some extra enhancements for user-friendly output.
#include <algorithm>
#include <iomanip>
#include <ios>
#include <iostream>
#include <string>
#include <vector>

using std::cin;             // <iostream>
using std::cout;            // <iostream>
using std::endl;            // <iostream>
using std::setprecision;    // <iomanip>
using std::sort;            // <algorithm>
using std::streamsize;      // <ios>
using std::string;          // <string>
using std::vector;          // <vector>

int main()
{
    // display header message
    cout << "***************************************************************\n"
            "*** This program computes quartiles given a list of numbers ***\n"
            "***************************************************************\n";
    cout << endl;

    // ask for a list of numbers and store the list as a vector
    cout << "Enter all a list of numbers: ";
    vector<double> v;
    double x;
    while (cin >> x)
        v.push_back(x);

    // check vector size and act accordingly
    cout << endl;
    typedef vector<double>::size_type vecSize;
    vecSize N = v.size();
    if (N == 0) {
        cout << "You must enter some numbers! " << endl;
        return 1;
    } else if (N == 1) {
        cout << " Only 1 number supplied. "
                "Q1, Q2, and Q3 all equate to " << v[0] << endl;
        return 0;
    } else {
        // sort the numbers into non-descending order
        sort(v.begin(), v.end());
    }

    // declare new variables
    vecSize NMod4 = (N % 4);    // identifies 1 of the 4 known datum distribution profiles
    string datumDistr = "";     // datum distribution profile
    vecSize M, ML, MU;          // core vector indices for quartile computation
    double m, ml, mu;           // quartile values are stored here

    // compute quartiles for the 4 known patterns
    if (NMod4 == 0) {
        // Q1-Q3 datum distribution: [0 0 0]
        datumDistr = "[0 0 0]";
        M = N / 2;
        ML = M / 2;
        MU = M + ML;

        // grab quartile values
        ml = (v[ML] + v[ML - 1]) / 2;   // datum: 0
        m  = (v[M] + v[M - 1]) / 2;     // datum: 0
        mu = (v[MU] + v[MU - 1]) / 2;   // datum: 0
    } else if (NMod4 == 1) {
        // Q1-Q3 datum distribution: [0 1 0]
        datumDistr = "[0 1 0]";
        M = N / 2;
        ML = M / 2;
        MU = M + ML + 1;

        // grab quartile values
        ml = (v[ML] + v[ML - 1]) / 2;   // datum: 0
        m  = v[M];                      // datum: 1
        mu = (v[MU] + v[MU - 1]) / 2;   // datum: 0
    } else if (NMod4 == 2) {
        // Q1-Q3 datum distribution: [1 0 1]
        datumDistr = "[1 0 1]";
        M = N / 2;
        ML = M / 2;
        MU = M + ML;

        // grab quartile values
        ml = v[ML];                     // datum: 1
        m  = (v[M] + v[M - 1]) / 2;     // datum: 0
        mu = v[MU];                     // datum: 1
    } else if (NMod4 == 3) {
        // Q1-Q3 datum distribution: [1 1 1]
        datumDistr = "[1 1 1]";
        M = N / 2;
        ML = M / 2;
        MU = M + ML + 1;

        // grab quartile values
        ml = v[ML];                     // datum: 1
        m  = v[M];                      // datum: 1
        mu = v[MU];                     // datum: 1
    } else {
        // not reachable, since N % 4 is always 0, 1, 2, or 3
        cout << "Unknown pattern discovered - new algorithm may be required." << endl;
        return 1;
    }

    // Display results
    streamsize prec = cout.precision();
    cout << "Display the sorted (non-descending) vector below." << endl;
    cout << "Index: Number" << endl;
    for (vecSize i = 0; i != N; ++i) {
        cout << i << ": " << v[i] << endl;
    }
    cout << endl;
    cout << "Vector size: " << N << endl;
    cout << "Datum Distribution: " << datumDistr << endl;
    cout << setprecision(3) << endl
         << " Q1: " << ml << endl
         << " Q2: " << m << endl
         << " Q3: " << mu << endl
         << setprecision(prec);
}
Result
I’m going to run the program a number of times, each with a different input dataset (e.g. incrementing N by 1, trying out the Wikipedia examples and comparing, etc.).
N = 0
***************************************************************
*** This program computes quartiles given a list of numbers ***
***************************************************************
Enter all a list of numbers: ^Z
You must enter some numbers!
Process returned 1 (0x1) execution time : 9.804 s
N = 1
***************************************************************
*** This program computes quartiles given a list of numbers ***
***************************************************************
Enter all a list of numbers: 5 ^Z
Only 1 number supplied. Q1, Q2, and Q3 all equate to 5
Process returned 0 (0x0) execution time : 8.430 s
N = 2
***************************************************************
*** This program computes quartiles given a list of numbers ***
***************************************************************
Enter all a list of numbers: 10 20 ^Z
Display the sorted (non-descending) vector below.
Index: Number
0: 10
1: 20
Vector size: 2
Datum Distribution: [1 0 1]
Q1: 10
Q2: 15
Q3: 20
N = 3
***************************************************************
*** This program computes quartiles given a list of numbers ***
***************************************************************
Enter all a list of numbers: 10 20 30 ^Z
Display the sorted (non-descending) vector below.
Index: Number
0: 10
1: 20
2: 30
Vector size: 3
Datum Distribution: [1 1 1]
Q1: 10
Q2: 20
Q3: 30
N = 4
***************************************************************
*** This program computes quartiles given a list of numbers ***
***************************************************************
Enter all a list of numbers: 6 2 4 9 ^Z
Display the sorted (non-descending) vector below.
Index: Number
0: 2
1: 4
2: 6
3: 9
Vector size: 4
Datum Distribution: [0 0 0]
Q1: 3
Q2: 5
Q3: 7.5
N = 5
Enter all a list of numbers: 9 4 20 39 44 ^Z
Display the sorted (non-descending) vector below.
Index: Number
0: 4
1: 9
2: 20
3: 39
4: 44
Vector size: 5
Datum Distribution: [0 1 0]
Q1: 6.5
Q2: 20
Q3: 41.5
Wikipedia example (even size vector)
The results match!
**************************************************************
*** This program computes quartiles given a list of numbers **
**************************************************************
Enter all a list of numbers: 41 39 15 7 36 40 ^Z
Display the sorted (non-descending) vector below.
Index: Number
0: 7
1: 15
2: 36
3: 39
4: 40
5: 41
Vector size: 6
Datum Distribution: [1 0 1]
Q1: 15
Q2: 37.5
Q3: 40
Wikipedia example (odd size vector)
The results match!
***************************************************************
*** This program computes quartiles given a list of numbers ***
***************************************************************
Enter all a list of numbers: 49 43 41 39 15 6 7 36 40 42 47 ^Z
Display the sorted (non-descending) vector below.
Index: Number
0: 6
1: 7
2: 15
3: 36
4: 39
5: 40
6: 41
7: 42
8: 43
9: 47
10: 49
Vector size: 11
Datum Distribution: [1 1 1]
Q1: 15
Q2: 40
Q3: 43
Hi!
On the N=4 result, you enter 3 numbers. Is this a mistake, or am I getting confused?
Relevant part:
N = 4
Thank you very much Takahiro for spotting this error. I clearly made a copy-and-paste error. I have re-run the code for the N=4 case and updated the result in the blog post. Thanks very much and best of luck with the remaining C++ exercises!
You’re welcome!
I should also be thanking you for the excellent solutions. You definitely put a lot of effort into learning.
When I come here to compare solutions, I’m always blown away by your well thought out strategy. You’re pushing me to try to do better than you!
So, thank you!
Very good stuff. Detailed and easy to understand. I am studying Java in college but I took it upon myself to do CPP in parallel on my own. Your work is helping me greatly!
Thank you Siim, this means a lot to me. I’m sure there are much better solutions out there – these were just my own (ok-ish) solutions. Do go out and search for more elegant versions though!
Hi Johnny,
I would like to thank you profusely for taking the time to come up with these very detailed explanations and solutions for the great tome that Accelerated CPP is. I also benefited personally from your blog post titled “OpenCV with Anaconda”, as I use the Anaconda distribution for all my Python and it was so convenient to be able to integrate the OpenCV library into my existing Anaconda installation.
Kudos to you!
Ananth
Thanks SO MUCH Ananth this really means a lot to me. Though I haven’t touched C++ for over a year now it’s good to know that the documentation has helped others out too – have fun coding / hacking / building things! :-)
Hi Johnny, Thanks for this wonderful material and taking time to publish.
Actually, when I press Ctrl-Z the program/executable terminates on Linux; I had to use Ctrl-D to end the command-line input. I was just curious how you could use Ctrl-Z to end the command-line input. Can you share it?
Hello, very useful website and solutions, thank you for this!
I was wondering why not approach this problem in a simpler fashion, using the inter-quartile range as suggested here: http://www.bbc.co.uk/schools/gcsebitesize/maths/statistics/representingdata3hirev4.shtml, or am I missing something?
E.g:
double mid = size / 2;
double lowerqindex = (size + 1) / 4;
double higherindex = 3*(size + 1) / 4;
double median;
median = size % 2 == 0 ? (data[mid] + (data[mid - 1])) / 2
: data[mid];
double lowerq;
lowerq = size % 2 == 0 ? (data[lowerqindex] + data[lowerqindex - 1]) / 2
: data[lowerqindex];
double higherq;
higherq = size % 2 == 0 ? (data[higherindex] + data[higherindex- 1]) / 2
: data[higherindex];
cout << "MEDIAN: " << median << endl;
cout << "SIZE: " << size << endl;
cout << "Q1: " << lowerq << endl;
cout << "Q3: " << higherq << endl;
I’m sure I’m wrong at some point, it worked though, anyways thanks!