Triangle Open Data Day 2014

A rare live blog post today. I'm writing this from Triangle Open Data Day 2014. This will basically be a page of links that I'll try to get around to later.

GIS resources:

Open data resources:

Cloud development resources:

MongoDB presentation is about to start. Will likely update this post.

Another skewed normal distribution

At the CLRS last year, Glenn Meyers talked about something very near to my heart: a skewed normal distribution. In loss reserving (and, I'm sure, many other contexts) standard linear regression is less than ideal, as it presumes that deviations above and below the mean are equally likely. We rarely expect this assumption to hold (though we should always test it!). Applying a log transform is one way to address this, but that option isn't available for negative observations. Negative incremental reported losses are very common, and even negative payments, which arise from salvage, subrogation or other factors, happen often enough that (in my view) the log transform isn't an attractive option.

Meyers gave a talk where he described (among other things) the lognormal-normal mixture. That presentation, Stochastic Loss Reserving with Bayesian MCMC Models, is worth any actuary's time. The idea is simplicity itself: Z is lognormally distributed with parameters mu and sigma, and X is normally distributed with mean Z and standard deviation delta.

Let's have a look at this distribution. Well, actually that's easier said than done. Here are the equations:

Z \sim Lognormal(\mu,\sigma)

X \sim Normal(Z,\delta)

So Z is easy, it's just a lognormal. In fact, here it is:

sigma = 0.6
mu = 2
x = seq(-10, 60, length.out = 500)
Z = dlnorm(x, mu, sigma)
plot(x, Z, type = "l")

[Plot: density of Z, a lognormal with mu = 2 and sigma = 0.6]

X with Z fixed at its expected value is also easy. Here it is:

expZ = exp(mu + sigma^2/2)
delta = 3
pdfX = dnorm(x, expZ, delta)
plot(x, pdfX, type = "l")

[Plot: normal density for X with Z fixed at its expected value]

Here, we've produced a normal centered around the expected value of the original lognormal distribution. Not skewed and not all that interesting. What we want is a distribution wherein the mean of the normal is itself a random variable. To get that, we have three options: one lazy, one easy and one hard. I'll show the lazy one first.

The lazy one is to randomly sample from Z and then feed that to X. We end up with a histogram which approximates a density function.

samples = 10000
Z = rlnorm(samples, mu, sigma)
X = rnorm(samples, Z, delta)
hist(X)

[Plot: histogram of 10,000 simulated values of X]

That's undoubtedly skew and might even correspond to Glenn's graph on slide 36. But that was lazy. The easy way is to repeat a procedure similar to what I did a week ago when demonstrating a Bayesian model which combined a lognormal and an exponential. Here, we just calculate the joint density over a subspace of the probability domain, normalize it and then compute the marginal.

plotLength = 250
Z = seq(0.001, 40, length.out = plotLength)
X = seq(-10, 40, length.out = plotLength)

dfJoint = expand.grid(Z = Z, X = X)

dfJoint$Zprob = dlnorm(dfJoint$Z, mu, sigma)
dfJoint$Xprob = dnorm(dfJoint$X, dfJoint$Z, delta)
dfJoint$JointProb = with(dfJoint, Zprob * Xprob)
dfJoint$JointProb = dfJoint$JointProb/sum(dfJoint$JointProb)

jointProb = matrix(dfJoint$JointProb, plotLength, plotLength, byrow = TRUE)

filled.contour(x = X, y = Z, z = jointProb, color.palette = heat.colors, xlab = "X", 
    ylab = "Z")

[Plot: filled contour of the joint density of X and Z]

Groovy. X can be anything, but higher values of Z will pull it to the right. Here's the marginal distribution of X.

library(plyr)
marginalX = ddply(dfJoint[, c("X", "JointProb")], .variables = "X", summarize, 
    marginalProb = sum(JointProb))
plot(marginalX$X, marginalX$marginalProb, type = "l", xlab = "X", ylab = "Marginal probability")

[Plot: marginal density of X]

The hard way is to sit down with pen and paper and work this out algebraically. I tried. I worked through a few ugly integrations, did some research on the interweb and concluded that if there is a closed form solution, it's not something that people spend a great deal of time talking about. I will point to this paper and this website as material that I'd like to get better acquainted with. It would appear that this comes up in financial and time series analysis. No surprise there; I imagine the reasons for needing this sort of distribution are similar.

For the record, here's what that integrand looks like.

f_{X}(X|Z)\,f_{Z}(Z) = \frac{1}{2\pi Z \delta \sigma}\exp\left(-\frac{(X-Z)^2}{2\delta^2}-\frac{(\ln Z-\mu)^2}{2\sigma^2}\right)

If you know how to integrate that over Z, please let me know.
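In the meantime, even without a closed form the marginal density of X is only a one-dimensional integral, so it can be evaluated numerically. Here's a minimal sketch using R's integrate() and the same mu, sigma and delta as above; marginalDensityX is just a name I've made up for illustration.

# Numerically integrate the integrand over Z for each fixed value of x.
marginalDensityX = function(x, mu, sigma, delta) {
    integrand = function(z) dnorm(x, mean = z, sd = delta) * dlnorm(z, mu, sigma)
    integrate(integrand, lower = 0, upper = Inf)$value
}

xGrid = seq(-10, 60, length.out = 200)
densX = sapply(xGrid, marginalDensityX, mu = 2, sigma = 0.6, delta = 3)
plot(xGrid, densX, type = "l", xlab = "X", ylab = "Marginal density")

The shape should agree with the grid-based marginal computed above.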

If you've made it this far, odds are good that you're an actuary, a stats nerd or both. Whatever you are, take a moment to thank heaven for Glenn Meyers, who's both. He's made tremendous contributions to actuarial literature and we're all the better for it.

Session info:

## R version 3.0.2 (2013-09-25)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] knitr_1.5        RWordPress_0.2-3 plyr_1.8        
## 
## loaded via a namespace (and not attached):
## [1] evaluate_0.5.1 formatR_0.10   RCurl_1.95-4.1 stringr_0.6.2 
## [5] tools_3.0.2    XML_3.98-1.1   XMLRPC_0.3-0

An idiot learns Bayesian analysis: Part 2

A week ago, I wrote a bit about my personal journey to come to grips with Bayesian inference. I referred to the epiphany that when we're talking about Bayesian analysis, what we're talking about- in a tangible way- is using and modifying multivariate distributions. This reminds me of the moment, about twenty years ago now, when I had a nascent interest in object oriented programming. I had read about it, worked through some tutorials and struggled to write any code of my own. I spent a bit of time with a fantastic programmer that I knew, who showed me some code that he and a team of developers were putting together. This was commercial grade software, meant for delivery to clients who would be paying real money for it. And it was straightforward. I said, “John, this really doesn't seem all that strange. It looks like the same code that I'm currently writing.” John responded with a very patient nod of his head. Sure there were some subtle, yet meaningful differences, but I got over my mental hurdle and have been an OOP fan ever since. (No one would mistake me for an expert. I'm a fan.) I'm hoping that Bayes will work out the same and I think that dealing with the multivariate mechanics is the way forward. The math is just too simple.

OK, that last comment may have been a bit much. The math isn't always simple and heaven knows, I'm lost with anything more than two variables. My last post dealt with the simplest multivariate distribution possible- a two-dimensional Bernoulli trial. Now I'm going to do something a bit more complicated. I'm not going to use more than two variables- I doubt my head could take it- but I will at least move to a continuous distribution. My favorite probability distributions are the binomial (only two things to think about!) and the Poisson (discrete! nothing negative!). The only continuous distribution for which I have any fondness is the exponential (one parameter!). It's simple and I like the "lack of memory" property. As an aging drinker, I can identify with that. Moreover, if you combine more than one, you can create a semiparametric model which can support all kinds of curves. So, an exponential distribution it is.
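As a quick aside, the lack-of-memory property is easy to verify numerically: for an exponential, P(X > s + t | X > s) equals P(X > t). A small check, assuming a mean of 5 (so a rate of 1/5):

# Memorylessness check for an exponential with mean 5.
s = 3
t = 4
condProb = pexp(s + t, rate = 1/5, lower.tail = FALSE)/pexp(s, rate = 1/5, lower.tail = FALSE)
uncondProb = pexp(t, rate = 1/5, lower.tail = FALSE)
c(conditional = condProb, unconditional = uncondProb)

Both numbers come out the same, no matter what s and t you pick.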

This won't resemble any real-world phenomenon that I can think of. If you need to assume something, you can think of the time until a customer service representative answers your call. And let's say you're calling to try to make changes to an airline reservation. US residents may assume American Airlines and feel free to expect that the wait may be eternal, with an unhappy resolution. Residents of other countries may substitute Lufthansa, RyanAir, Aeroflot or something similar. (Actually, Lufthansa was always fantastic if travel plans didn't need to be altered. Great planes, great staff. But changing a booking was often a Kafka-esque experience.)

So, because this is Bayesian, we need another parameter. I'll use the lognormal as the prior for the exponential's mean because it's simple and has support across the set of positive numbers. To simplify, I'm going to presume that I know the variance. (Aside: virtually every beginning stats book that I've read contains exercises where the variance is known, but the mean is not. This has always seemed crazy to me. I can't imagine a circumstance where I emphatically know how much observations vary around the mean, but I don't know the mean itself. The only thing crazier is an obsession with urns. A survey of the work of John Keats would have fewer uses of the word "urn.")

Right. Back to the math. I'm going to use a lognormal with a mean of 5 and a CV of 20%. Let's draw a quick picture.

mean = 5
cv = 0.2
sigma = sqrt(log(1 + cv^2))
mu = log(mean) - sigma^2/2

plotLength = 100
theta = seq(0.001, 10, length = plotLength)
y = dlnorm(theta, mu, sigma)
plot(theta, y, type = "l")

[Plot: lognormal density for theta, with mean 5 and CV 20%]

And now an exponential. Note that I'll be treating the parameter as the expected value (the mean parameterization). Often the exponential is instead written with a rate parameter, so that the expected value is the reciprocal of the parameter- and that's the convention R's dexp uses, which is why the calls below pass 1 over the mean. Actuaries learned it differently- or at least this one did. I don't have my copy of Loss Models with me (it's at work, where it might do me some good), but that's how I'm used to thinking about it.
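To make that concrete, here's a quick check that a mean-5 exponential corresponds to rate = 1/5 in R:

# R parameterizes the exponential by its rate, so a mean of 5 means rate = 1/5.
mean(rexp(1e+05, rate = 1/5))  # should land close to 5
1/5 * exp(-1/5 * 3)            # density at 3, computed by hand...
dexp(3, rate = 1/5)            # ...matches dexp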

So, knowing that the exponential's mean could be any positive real number- but expecting it to be something close to 5- let's draw some exponential curves. We'll use tau for the waiting time itself.

tau = seq(0.001, 20, length = plotLength)
t1 = dexp(tau, 1)
t2 = dexp(tau, 1/2)
t3 = dexp(tau, 1/5)
t4 = dexp(tau, 1/10)
plot(tau, t1, type = "l")
lines(tau, t2)
lines(tau, t3)
lines(tau, t4)

[Plot: exponential densities with means 1, 2, 5 and 10]

Not terribly pretty, but that's the exponential. No interior mode, most of the probability shoved to the left-hand side, and the higher-mean curves wholly reliant on remote but high-valued occurrences.

Now I'm going to render this in two-dimensional space.

dfJoint = expand.grid(Theta = theta, Tau = seq(0.001, 10, length = plotLength))

dfJoint$LogProb = dlnorm(dfJoint$Theta, mu, sigma)
dfJoint$ExpProb = dexp(dfJoint$Tau, 1/dfJoint$Theta)
dfJoint$JointProb = with(dfJoint, LogProb * ExpProb)
dfJoint$JointProb = dfJoint$JointProb/sum(dfJoint$JointProb)

jointProb = matrix(dfJoint$JointProb, plotLength, plotLength)

filled.contour(x = unique(dfJoint$Theta), y = unique(dfJoint$Tau), z = jointProb, 
    color.palette = heat.colors, xlab = "Theta", ylab = "Tau")

[Plot: filled contour of the joint density of Theta and Tau]

Most of the probability is for tau less than 4 and theta around 5. Each change in color looks a bit like a lognormal distribution, as it should. The shift from one contour to the next follows an exponential curve: rapid near the bottom, then slower towards the top.

As with the cancer example, all of the probability is here. What is the chance that I'll be on hold for longer than 5 minutes? That can be obtained from the marginal exponential distribution. I'm going to cheat and calculate the marginal based on the subset of the probability space that I used to create the plot.

library(plyr)
marginalExp = ddply(dfJoint[, c("Tau", "JointProb")], .variables = "Tau", summarize, 
    marginalProb = sum(JointProb))
sum(marginalExp$marginalProb[marginalExp$Tau > 5])
## [1] 0.2588
plot(marginalExp$Tau, marginalExp$marginalProb, type = "l", xlab = "Tau", ylab = "Marginal probability")

[Plot: marginal distribution of Tau]

About a 1 in 4 chance. American Airlines should be so lucky. Still, though, this is promising. What I'm most interested in is making predictions and the marginal allows me to do that. I reached this marginal by assuming a prior density- both form and parameters- for another distribution which controls my understanding of a random event. This is an additional step. Ordinarily, I just work with the one function. Does the addition of another clarify or obscure things? My immediate reaction is that it clarifies, because- ceteris paribus- more information is preferred to less. I hope to explore this in my next post, where I'll try to use a real-world example. Suggestions welcome.
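Before wrapping up, here's a minimal sketch of the update step itself on the same theta grid, assuming a single, entirely made-up observed wait of 7 minutes. The posterior for theta is proportional to the lognormal prior times the exponential likelihood at the observed value.

# Bayesian update on the existing theta grid, given one hypothetical 7-minute wait.
observedWait = 7
priorTheta = dlnorm(theta, mu, sigma)
likelihood = dexp(observedWait, rate = 1/theta)
posterior = priorTheta * likelihood
posterior = posterior/sum(posterior)

# Posterior mean of theta after seeing the single (hypothetical) observation
sum(theta * posterior)
plot(theta, posterior, type = "l", xlab = "Theta", ylab = "Posterior probability")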

Session info:

## R version 3.0.2 (2013-09-25)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] knitr_1.5        RWordPress_0.2-3 plyr_1.8        
## 
## loaded via a namespace (and not attached):
## [1] evaluate_0.5.1 formatR_0.10   RCurl_1.95-4.1 stringr_0.6.2 
## [5] tools_3.0.2    XML_3.98-1.1   XMLRPC_0.3-0

An idiot learns Bayesian analysis: Part 1

I've done a dreadful job of reading The Theory That Would Not Die, but several weeks ago I somehow managed to read the appendix. Here the author gives a short explanation of Bayes' theorem using statistics related to breast cancer and mammogram results. This is the same real world example (one of several) used by Nate Silver. It's profound in its simplicity and- for an idiot like me- a powerful gateway drug. Possibly related to this is my recent epiphany that when we're talking about Bayesian analysis, we're really talking about multivariate probability. The breast cancer/mammogram example is the simplest form of multivariate analysis available. What does it all mean, how can we extend it and what does it have to do with an underlying philosophy of Bayesian analysis (if such a thing exists)?

The Theory That Would Not Die is sitting at my desk at work, so I'm going to refer to the figures quoted by Nate Silver on page 246. Cancer status is read across the columns; mammogram results are read down the rows.

Before I go any further, I have to point out that the positioning of the tables is dreadful. WordPress experts are invited to help me sort this out.

probs = matrix(c(11, 99, 3, 887), 2, 2, byrow = TRUE, dimnames = list(c("M-True", 
    "M-False"), c("C-True", "C-False")))

        C-True  C-False
M-True      11       99
M-False      3      887

From this table, the joint probabilities are easy to read. What is the chance that a person has breast cancer and received a negative mammogram? 3 in 1000. What is the chance that a person does not have cancer, but received a positive mammogram? 99 in 1000, or roughly 10%. It's a trivial thing to determine the marginal probabilities.

probs = cbind(probs, rowSums(probs))
probs = rbind(probs, colSums(probs))
colnames(probs)[3] = "M"
rownames(probs)[3] = "C"

        C-True  C-False     M
M-True      11       99   110
M-False      3      887   890
C           14      986  1000

The context of this information is what matters to the authors. Each presents the result that the likelihood that a patient has cancer- even with a positive mammogram- is still rather low (10% in this case). This is consistent with advice from some areas of the medical establishment that women not get routine mammograms before a particular age. This (slightly) surprising result is driven by the fact that the positive predictive value (number of true positives divided by the number of predicted positives) is very low as is the likelihood of a positive. Put differently, a mammogram does not appear to have a good success rate at predicting cancer (for this data) and the overall rate of cancer is quite low. How would things look if the numbers changed?
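Before changing anything, both of the figures just quoted can be read straight off the augmented matrix: the positive predictive value is the top-left cell divided by the M-True total, and the overall cancer rate is the C-True total divided by 1000.

# P(cancer | positive mammogram), the positive predictive value
probs[1, 1]/probs[1, 3]  # 11/110 = 0.1
# Overall (prior) probability of cancer
probs[3, 1]/probs[3, 3]  # 14/1000 = 0.014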

How do we change the numbers? In order to hold the cancer probability fixed, we can't change the column totals. So, we can move numbers within a column from one row to another. Or, if we move a number from one column to another, we must offset that move in the other row. As an extreme, we could assume that the test is perfectly predictive. This would move the 3 false negatives into the true positive cell and the 99 false positives to the true negative cell. In that case, there is no probability in the upper right or lower left corner of the matrix and the two marginal distributions become identical.
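Here's that perfectly predictive version of the table as a quick check; probsPerfect is just an illustrative name.

# A perfectly predictive test: all 14 cancers caught, all 986 non-cancers cleared.
# Column (cancer) totals are unchanged; row totals now match them exactly.
probsPerfect = matrix(c(14, 0, 0, 986), 2, 2, byrow = TRUE,
    dimnames = list(c("M-True", "M-False"), c("C-True", "C-False")))
rowSums(probsPerfect)  # 14, 986
colSums(probsPerfect)  # 14, 986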

But that's a bit boring, so let's create something more interesting. We'll leave the number of false negatives alone, but cut the false positives down to roughly the number of true positives (the code below divides the true positives by 0.8 to get the new false positive count), which pushes the positive predictive value up to about 44%.

falsePositive = round(probs[1, 1]/0.8)
probs2 = probs
probs2[1, 2] = falsePositive
probs2[2, 2] = probs2[3, 2] - falsePositive
probs2[1, 3] = sum(probs2[1, 1:2])
probs2[2, 3] = sum(probs2[2, 1:2])

        C-True  C-False     M
M-True      11       14    25
M-False      3      972   975
C           14      986  1000

The chance that a person has cancer, conditional on a positive mammogram is now 44.0%. Before I look at another scenario, I'm going to scrap the tables in favor of something graphical. Here's what the first matrix looks like:

library(ggplot2)
library(reshape2)
mdf = melt(probs[1:2, 1:2], varnames = c("Mammogram", "Cancer"))
mdf$Cancer = ifelse(mdf$Cancer == "C-True", "Yes", "zNo")  # the "z" prefix forces "Yes" to sort first on the x-axis
p = ggplot(mdf, aes(x = Cancer, y = Mammogram)) + geom_tile(aes(fill = value)) + 
    scale_fill_gradient(low = "blue", high = "yellow")
p

[Plot: heat map of the original mammogram/cancer matrix]

And the second matrix:
[Plot: heat map of the second matrix]

In the second plot, we continue to have a large concentration of the probability in the bottom right corner, but the top half is now more balanced. This balance comes from a shift away from the top right corner. All of this means that the information from a mammogram becomes more predictive.

What happens when we increase the likelihood of cancer? In graphical terms, this would mean giving the left side a more yellow color. We'll hold the original positive predictive value (roughly 10%) fixed, but raise the likelihood of cancer to 25%.

PPV = probs[1, 1]/probs[1, 3]
probs3 = probs
probs3[2, 1] = 250 - probs3[1, 1]
probs3[2, 2] = probs3[2, 3] - probs3[2, 1]
probs3[3, 1] = 250
probs3[3, 2] = 750

        C-True  C-False     M
M-True      11       99   110
M-False    239      651   890
C          250      750  1000

[Plot: heat map of the third matrix, with cancer prevalence raised to 25%]
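Before interpreting the picture, the two conditional probabilities this table implies can be read off directly:

# P(cancer | positive mammogram) and P(cancer | negative mammogram) under probs3
probs3[1, 1]/probs3[1, 3]  # 11/110, the same 10% as before
probs3[2, 1]/probs3[2, 3]  # 239/890, roughly 27%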

This is interesting. The highest probability remains at the lower right hand corner (no cancer, clean mammogram) but there is now a greater concentration at the upper right and lower left corners. So, if one has a positive mammogram result, what is the posterior probability that they have cancer? The same 10% as before. And if the test showed negative? It's now 27%, which is higher than the probability given a positive result. Of course, this is because we've held the positive predictive value fixed while raising the prevalence of the disease; the test now points in the wrong direction, with a negative result more suggestive of cancer than a positive one. Not the sort of thing one wants in a diagnostic tool. How would things look if the PPV were 50%?

probs4 = probs3
probs4[1, 1] = 55
probs4[2, 1] = 250 - 55
probs4[1, 2] = 55
probs4[2, 2] = 750 - 55

        C-True  C-False     M
M-True      55       55   110
M-False    195      695   890
C          250      750  1000

[Plot: heat map of the fourth matrix, with the PPV at 50%]

So what makes this Bayesian? The simple answer is that I don't know. I have trouble reconciling Silver and McGrayne's simple (though very accessible) examples of Bayesian inference with what I read in Gelman and Albert. Untangling the math takes me away from the philosophy, so I'll list three quick notions about what Bayesian analysis means to me:

  • In the presence of new information, our prior understanding may be modified. This is the one that feels like a one-off exercise as it is presented in the mammography examples. If I don't know anything at all about a person, I assume that the chance they have cancer is about 1.4%. If I know they've had a mammogram, I adjust my estimate up or down. This is a slightly static view of the world.
  • Similar to the above, but subtly different: the process of gathering information means that our understanding continually evolves. This is the view which Silver seems to push. It allows both for continual improvement of knowledge and for the opportunity to respond as underlying probabilities change. One critical element that's not addressed in the cancer/mammogram example is that there is presumed- and unearned- certainty in the underlying probabilities. Silver and McGrayne use two different sets of figures. Either the parameters are uncertain or they're drawing from samples which vary in some other way (which is another way of saying that the parameters possess some stochasticity).
  • The third interpretation is what I think of as the "actuarial" view. I can't point to a specific paper (though Bailey comes close), but it's more a feeling I get from those rare references to Bayes (explicit and otherwise) in the actuarial literature. The world is divided into sets, though you can't know to which set a particular item belongs. You may only refine the likelihood that an item belongs to a specific set in the presence of information. For example, there are three sets of drivers: very good, average and bad. If a driver has had one accident in the past 12 months, to which set do they belong? The chance that they belong to the set of very good drivers is low, but neither are they incontrovertible members of the bad drivers set. A rough sketch of this idea follows the list.
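To make that third view concrete, here's a minimal sketch. The class proportions and per-class accident probabilities below are entirely made up for illustration.

# Hypothetical: 30% of drivers are very good, 50% average, 20% bad, with assumed
# probabilities of having an accident in a 12-month period.
prior = c(VeryGood = 0.3, Average = 0.5, Bad = 0.2)
probAccident = c(VeryGood = 0.02, Average = 0.05, Bad = 0.15)

# Posterior probability of belonging to each set, given one observed accident
posterior = prior * probAccident
posterior = posterior/sum(posterior)
round(posterior, 3)

With these made-up numbers, the driver is probably not very good, but the bad set is far from certain either.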

In the mammogram example, I look at altering the joint probability distribution. I'm free to do that if evidence warrants it. If mammography improves- or there is a provable difference in physicians' interpretations of the results- then I may alter the probabilities. If environment and lifestyle changes yield an alteration in disease prevalence, that also affects the joint distribution. It's a great toy example with which to begin to explore more varied problems. That's what I'll do next as I expand the example from a very simple 2×2 matrix to something more complicated.

Before I forget, my understanding of the definition of positive predictive value is taken from An Introduction to Statistical Learning, which is a great book. That value is one component of the fascinating subject of binary classification. I first heard about this in a great talk given by Dan Kelly at a meeting of the Research Triangle Analysts.

Session info:

## R version 3.0.2 (2013-09-25)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] RWordPress_0.2-3 knitr_1.5        reshape2_1.2.2   ggplot2_0.9.3.1 
## [5] xtable_1.7-1    
## 
## loaded via a namespace (and not attached):
##  [1] colorspace_1.2-4   dichromat_2.0-0    digest_0.6.4      
##  [4] evaluate_0.5.1     formatR_0.10       grid_3.0.2        
##  [7] gtable_0.1.2       labeling_0.2       MASS_7.3-29       
## [10] munsell_0.4.2      plyr_1.8           proto_0.3-10      
## [13] RColorBrewer_1.0-5 RCurl_1.95-4.1     scales_0.2.3      
## [16] stringr_0.6.2      tools_3.0.2        XML_3.98-1.1      
## [19] XMLRPC_0.3-0

24 Days of R: Day 24

OK, so I phoned it in last night. Final post, and maybe this one will be a bit better. I can't recall what got me thinking about it, but I was running over the issue of school performance and the erroneous notion that small class sizes will produce better students. This is occasionally debunked, but I thought it'd be fun to demonstrate the appeal of this notion through a random simulation. If memory serves, this is somewhat similar to something that Markus Gesmann wrote about on his blog earlier this year.

I'll construct a set of 100 classes, each of which has a random number of students. Each student will have the same probability distribution that describes their performance on an exam. In each case, I'll use a normal distribution, but will round the results so that I get only integral values. I'll also ensure that I only have positive values.

IntegralNormal = function(n, mean, sd, min, max) {
    val = rnorm(n, mean = mean, sd = sd)
    val = round(val)
    val = pmin(val, max)
    val = pmax(val, min)
    val
}

set.seed(1234)
numClasses = 100
numStudents = IntegralNormal(numClasses, 20, 3, 10, 30)
dfClass = data.frame(ClassID = 1:numClasses, NumStudents = numStudents)

CreateStudentResults = function(x, mean, sd, min, max) {
    numStudents = x$NumStudents
    ClassID = x$ClassID
    Score = IntegralNormal(numStudents, mean, sd, min, max)
    df = data.frame(ClassID = rep(ClassID, numStudents), Student = 1:numStudents, 
        Score = Score)
    df
}

library(plyr)
dfTestResults = ddply(dfClass, .(ClassID), CreateStudentResults, mean = 70, 
    sd = 10, min = 0, max = 100)

So how did the students do? Looks like they average about a C, with some extraordinary performers at either end.

hist(dfTestResults$Score, xlab = "Test score", main = "Histogram of student test scores")

[Plot: histogram of student test scores]

I'm sure that any administrator would like to know which classrooms managed to produce the best overall results.

dfClassResults = ddply(dfTestResults, .(ClassID), summarize, AverageScore = mean(Score))
dfClassResults = merge(dfClassResults, dfClass)
dfClassResults = dfClassResults[order(dfClassResults$AverageScore, decreasing = TRUE), 
    ]
sum(dfClassResults$NumStudents[1:25] < 20)
## [1] 15

Amazing! 60% of classes in the top quartile had a class size which was below average. Class size is clearly a strong indicator of student performance. But perhaps this was just a fluke. What happens in the following year?

dfTestResultsTwo = ddply(dfClass, .(ClassID), CreateStudentResults, mean = 70, 
    sd = 10, min = 0, max = 100)
dfClassResultsTwo = ddply(dfTestResultsTwo, .(ClassID), summarize, AverageScore2 = mean(Score))
dfClassResults = merge(dfClassResults, dfClassResultsTwo)
dfClassResults = dfClassResults[order(dfClassResults$AverageScore2, decreasing = TRUE), 
    ]
sum(dfClassResults$NumStudents[1:25] < 20)
## [1] 14
sum(dfClassResults$NumStudents[1:10] < 20)
## [1] 8

Similar results in the second year! In fact, 8 of the top 10 classes have class sizes below average. Well, that proves it.

Anyone who's reading this blog knows that it doesn't prove anything. I've cherry-picked the data, I've not compared results for the same class from the first year to the second, I've not looked at any other variables which may drive these results and I've not constructed and vetted any model to explain them. Let's at least do that last bit now.

fit = lm(AverageScore ~ 1 + NumStudents, data = dfClassResults)
summary(fit)
## 
## Call:
## lm(formula = AverageScore ~ 1 + NumStudents, data = dfClassResults)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.276 -1.379 -0.163  1.078  4.240 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  69.1965     1.3097    52.8   <2e-16 ***
## NumStudents   0.0401     0.0663     0.6     0.55    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.99 on 98 degrees of freedom
## Multiple R-squared:  0.00372,    Adjusted R-squared:  -0.00645 
## F-statistic: 0.366 on 1 and 98 DF,  p-value: 0.547

The number of students is- as we would expect- not a meaningful predictor of test scores. I'll emphasize the obvious point that this is an artificial example and does not demonstrate that smaller class sizes have an impact on student performance. It merely demonstrates that even if there were no impact, we could still appear to find one. Smaller samples have greater volatility and are more likely to show up as outliers at either end of the spectrum.
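The mechanism is easy to see directly: with individual scores having a standard deviation of 10, the standard deviation of a class average is roughly 10/sqrt(n) (ignoring the rounding and clamping in IntegralNormal), so the small classes bounce around quite a bit more. A quick sketch:

# Standard deviation of a class average for a small and a large class, estimated
# by simulation and compared with the 10/sqrt(n) rule of thumb.
set.seed(5678)
sdOfClassMean = function(n, reps = 10000) sd(replicate(reps, mean(rnorm(n, 70, 10))))
c(Simulated12 = sdOfClassMean(12), Theory12 = 10/sqrt(12),
  Simulated28 = sdOfClassMean(28), Theory28 = 10/sqrt(28))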

And that does it. 24 days. 24 posts about R. Later, I'll ruminate on what the exercise meant. For now, thanks to folks who have read and responded. I had a lot of fun doing this and hope that it was- at a minimum- a pleasant diversion.


24 Days of R: Day 23

Penultimate post. I'm going to take a quick look at the Gini index, a measure of income inequality. Data comes from the World Bank.

I've downloaded the zipped file, decompressed it and given it a different name. I'm going to read the data and melt it.

dfGini = read.csv("./Data/Gini.csv", stringsAsFactors = FALSE, skip = 2)
colnames(dfGini) = gsub("X", "", colnames(dfGini))
library(reshape2)
mdf = melt(dfGini, id.vars = colnames(dfGini)[1:4])
mdf = mdf[!is.na(mdf$value), ]
colnames(mdf)[1:2] = c("Name", "Code")
library(plyr)
dfCount = ddply(mdf, "Code", .fun = nrow)
atLeast5 = dfCount$Code[dfCount$V1 >= 5]
mdf = mdf[mdf$Code %in% atLeast5, ]

This will give us a decent set of data. How does this look when we plot it?

library(ggplot2)
ggplot(mdf, aes(x = variable, y = value, group = Code)) + geom_line()

[Plot: Gini index over time for every country with at least five observations]

Ugh. That looks like nothing. It's a bit late and I'm not all that keen for insight. Which countries have seen the biggest swings in income inequality?

dfMove = ddply(mdf, "Code", summarize, Diff = max(value) - min(value))
dfMove = dfMove[order(dfMove$Diff, decreasing = TRUE), ]
bigMoves = dfMove$Code[1:5]

ggplot(mdf[mdf$Code %in% bigMoves, ], aes(x = variable, y = value, group = Code)) + 
    geom_line()

[Plot: Gini index over time for the five biggest movers]

That's better, but still not very informative. I'm sleepy and going to bed.


24 Days of R: Day 22

I like to use Goodreads to keep track of which books I'm reading (and not reading). They very helpfully sent me an e-mail to inform me how many books I've read so far in 2013. The number is 19. Hardly an impressive number, but between job, family and trying to develop my R skills, I'm not embarrassed by that either. However, it's not as though I'm reading James Joyce. This year reveals a great interest in the works of Robert B. Parker. There are a few omissions as well- all my fault. I never bothered to enter Naked Statistics and Medium Raw was never listed as having been finished. That understood, let's see if I can learn anything about my reading habits.

I've not yet come to grips with the Goodreads API. It was the work of only a few minutes to copy and paste my collection of books into Excel and strip out the nonsense to produce a quick CSV file. I'll load that in and have a look.

dfBooks = read.csv("./Data/Goodreads.csv", stringsAsFactors = FALSE)
colnames(dfBooks)[6] = "DateRead"
colnames(dfBooks)[7] = "DateAdded"
dfBooks$DateAdded = gsub("[edit]", "", dfBooks$DateAdded, fixed = TRUE)  # strip the literal "[edit]" suffix
dfBooks$DateAdded = gsub(" ", "", dfBooks$DateAdded)
dfBooks$DateAdded = as.Date(dfBooks$DateAdded, "%m/%d/%Y")

When did I add books to my list? Most of the time this will be one book on a particular day, but there are exceptions. I'll look at this timeline in a few different ways.

dfAgg = aggregate(x = dfBooks$DateAdded, by = list(dfBooks$DateAdded), FUN = length)
colnames(dfAgg) = c("DateAdded", "NumBooks")
plot(x = dfAgg$DateAdded, y = dfAgg$NumBooks, pch = 19, xlab = "Date", ylab = "Number of books")

[Plot: number of books added by date]

That giant bump is when Goodreads switched over from their Facebook app. I'll zero in on something more relevant.

dfAgg = subset(dfAgg, DateAdded >= as.Date("2012-01-01"))
plot(x = dfAgg$DateAdded, y = dfAgg$NumBooks, pch = 19, xlab = "Date", ylab = "Number of books")

[Plot: number of books added by date, 2012 onward]

There's another crazy blip near the start of 2012. I've no idea why everything defaults to sometime in the middle of last year. I've been reading books for quite a long time and there's not a lot of variation.

No matter. This isn't terribly informative. I'm going to try looking at this differently. The date I finished the book isn't often available and is formatted in such a way that it resists easy munging. I'm going to alter it manually for the 44 books that I started in the last three months of 2012.

dfBooks2 = read.csv("./Data/Goodreads2.csv", stringsAsFactors = FALSE)
dfBooks2$DateRead = as.Date(dfBooks2$DateRead, "%m/%d/%Y")
dfBooks2$DateAdded = as.Date(dfBooks2$DateAdded, "%m/%d/%Y")

library(timeline)
# I need to sample a random integer so that we have a unique identifier for
# each book; timeline's default is to use the book title as the label.
set.seed(1234)
dfBooks2$GroupCol = sample(nrow(dfBooks2))
timeline(dfBooks2[!is.na(dfBooks2$DateRead), ], group.col = "GroupCol", start.col = "DateAdded", 
    end.col = "DateRead", label.col = "title")

[Plot: timeline of each book from date added to date read]

Cool. For the record, it didn't take me that long to read American Splendor. I think that's one where I had put the book aside for quite some time (it's a lot of short vignettes) and then finally marked it as read. The Rise and Fall of Prussia seems about right. I recall having read that in a couple of concentrated bursts that were spaced far apart. I'm also struck by the fact that it took only a couple of days more to read a scholarly account of the life of Jesus than it takes me to read a Spenser novel.
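Out of curiosity, the gap between adding and finishing each book is easy to pull out of the same data frame (treating DateAdded as a rough proxy for when I actually started reading, and using the same title column that labels the timeline):

# Days between adding a book and marking it read, longest first.
dfDuration = dfBooks2[!is.na(dfBooks2$DateRead), c("title", "DateAdded", "DateRead")]
dfDuration$Days = as.numeric(dfDuration$DateRead - dfDuration$DateAdded)
head(dfDuration[order(dfDuration$Days, decreasing = TRUE), c("title", "Days")])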

Tomorrow: Probably heteroskedasticity.
