Category Archives: Exploratory Data Analysis (EDA)

Create log10 Histogram with R

An experiment I did during the Udacity course Data Analysis with R related to creating log10 histogram with R.

A Udacity supplied Pseudo-Facebook CSV Dataset was used for the purpose of the exercise. It contains interesting variables such as the user’s age, friends count, likes count, etc.) This article is about creating log10 histogram with the aim of generating a normal distributed profile against the variable friend_count.  

In this article I will show you how to create a basic log10 historigram in R in 4 different ways:

  1. show x-axis in friend_count, using qplot
  2. show x-axis in friend_count, using ggplot
  3. show x-axis in log10(friend_count), using qplot
  4. show x-axis in log10(friend_count), using ggplot

This is what the output will look like:

4-ways-to-plot-histograms

Before running any codes in R, you need to install the ggplot2 and gridExtra packages (as a one-off). Simply run these lines in RStudio.

install.packages('ggplot2', dependencies = T)
install.packages('gridExtra', dependencies = T)

Now, download the Pseudo-Facebook CSV Dataset to your local drive. Make sure your work directory is set to where the file is stored (using the getwd() and setwd() functions.)

Run the following R script will create the 4 histograms in a 2 by 2 grid-like manner. Note that they show essentially the same thing – just displayed and generated differently.

# Include packages  
library(ggplot2)  
library(gridExtra)  

# log10 Histogram with Scaling Layer scale_x_log10() - so we display x-axis in friend_count  

## (1) qplot method  
countScale.qplot <- qplot(x = friend_count, data = pf) +  
  ggtitle("(1) countScale with qplot") + scale_x_log10()  

## (2) ggplot method  
countScale.ggplot <- ggplot(aes(x = friend_count), data = pf) +  
  geom_histogram() +  
  ggtitle("(2) countScale with ggplot") +  
  scale_x_log10()  

# log10 Histogram without Scaling Layer scale_x_log10() - so we display x-axis in log10(friend_count)  

## (3) qplot method  
logScale.qplot <- qplot(x = log10(friend_count), data = pf) +  
  ggtitle("(3) logScale with qplot")  

## (4) ggplot method  
logScale.ggplot <- ggplot(aes(x = log10(friend_count)), data = pf) +  
  geom_histogram() +  
  ggtitle("(4) logScale with ggplot")  

# output results in one consolidated plot  
grid.arrange(countScale.qplot, countScale.ggplot,  
             logScale.qplot, logScale.ggplot,  
             ncol = 2)  

Now, you ask, why we have 4 ways to plot the same thing? i.e. 2 methods (qplot vs ggplot) by 2 x-axis scales (friend_count vs log10(friend_count)).

  • Regarding qplot vs ggplot: from the many articles and forums that I have come across it seems that qplot is caterred for basic charting (hence simplier), whilst ggplot is catered for more complicated stuff (hence more lengthy syntax). I guess time will tell to find out which one is more suitable in which scenarios, etc.

  • Regarding the friend_count vs log10(friend_count): advantage of using friend_count in x-axis is that it is easily understood, at the tradeoff of the numeric scale being evenly distributed (1, 10, 100, 1000, …). Using log10(friend_count) does the vice versa: advantage being that the numeric axis scale is evenly distributed (1, 2, 3, 4, …), at the tradeoff that the meaning of friend_count) is slightly diluted (e.g. If I say Joe’s log(friend_count) is 2, what is his friend_count? Well it is simply inverse-log of 2 which is 100. Great. But how about when log(friend_count) is 2.5? You may need a calculator for that: giving you 316.22…). So net net, it depends on who your audience is. Personally, I prefer the log(friend_count) style because of the even numeric scale – I don’t mind doing that extra step of computing friend_count as the inverse-log of log(friend_count).


Which type of histograms (1, 2, 3, or 4) do you usually generate in R (and why?). Have you come across any scenarios when you find that one type is better than the other?

Use Google Trend to compare Programming Language Interest

From a recent exercise in which my objective was to use Google Trend Explore  to compare interests between Python and Ruby (programming languages) over the past few years. A question arised – python could be the snake, monty python, the programming language, etc. Likewise, ruby could be a type of gems, a programming language, etc. If I did a trend comparison between python and ruby, I might run into problem such as comparing trend interest between a python the snake, and ruby the programming language (instead of comparing a programming language to another). How do I gain confidence that when I do a trend of Python vs Ruby, I am indeed comparing apple-to-apple? (programming language to programming language?). Here is my approach: Compare the trends between “python language”, “python program”, and “python code”. I want to gain some confidence that all three mean more or less the same thing. python-trendDo the same for Ruby. Compare the trends between “ruby language”, “ruby program”, and “ruby code”. ruby-trend From the comparisons above, I believe I can make more or less “apple-to-apple” comparison between python and ruby (as programming languages). python-language-vs-ruby-language-trend python-program-vs-ruby-program-trend python-code-vs-ruby-code-trend Do some “eye-ball” comparisons between the above “python language” and “ruby language” trends, and make some fairly reasonable conclusion. I was actually hoping to see the above 3 trend comparisons to be more or less the same. Reality turns out they differ quite a bit. I guess more work is required to digest the data and make sense of it. Can you suggest a better way to compare interest of two programming languages? (e.g. Python and Ruby)? And what would be your conclusion? Would love to hear your thoughts.