Stuff I’ve gotten horribly wrong

I'm the first (I hope) to admit when I've gotten something wrong. I like to think I'm humble enough to realize that there are limits to my knowledge. Actually, humility doesn't enter into it. Every day I'm confronted with things that I don't know or understand. Those same limits can often blind me to being sage enough to recognize when I've gone off the rails. With time, however, knowledge begins to seep in. So, here it is, stuff I've gotten wrong:

  1. Using a list to store complicated data types in S4 objects is absurd and unnecessary.
    There's a lenghty explanation here, but suffice it to say that it's absolutely possible to vectorize individual elements of your S4 object. I've done it and it's a gas. Don't get me wrong, it's not a walk in the park, but it allows you to build up very complicated objects. So long as accessor functions are coded cleanly, things will work out. Using a list to store complicated elements is a bad idea on a number of levels.

  2. It's totally possible to extract the contents of a data frame without fear of R returning a vector.
    This is really embarassing. All you need to do is set the parameter drop=FALSE.

  3. Computed columns might be a good idea. My thoughts on how to implement them and my response to alternate suggestions was moronic. I use reshape2 and plyr all the time. I'm still not happy that I can't simply define a computed column like I can in SQL, but I've not developed a better alternative.

I'm sure there are others. My initial epiphany about mapply and its relation to nested loops has faded. This is mostly the result of my having gained deeper experience with the vectorization of the language. I still use mapply in this way, so I'm not yet ready to concede that this is approach is “wrong”, per se.

A few weeks ago, I was in Africa as part of a team of instructors demonstrating how to use R. I sat with one of the students for two hours going over some basic coding. At one point, I could tell that he was reluctant to execute a command after he'd typed it. I told him, “Learning R means making many, many mistakes. Go ahead and get started and don't worry.” His code ran fine.

Recursive assignment

Here’s yet another example where I just need to read the help files. Before I go on, I should add my own notion as to why that’s not always easy to do. On loads of message boards, you’ll see people say- correctly- that the documentation is very clear on XYZ. True. But that’s only relevant if you read the bit of the documentation that actually matters to you and you have all of the context you need to understand the terse (though accurate!) descriptions there. It’s a bit like a bus schedule in Samarkand. Absolutely clear and useful if you’re in central Asia and know where you are and where you need to go and when you need to get there. If you’ve been walking the Silk Road for weeks and can’t tell Samarkand from Tashqent, that bus schedule may not do you as much good. So it is with R documentation. Sometimes you’ll have to dust off your shoes, get patient and ask a stranger for help.

So what I had wanted to do was to understand something fairly basic. How is the following statement processed:

myObject$MyColumn[2] = "New value"

This is a typical method to manipulate individual cells in a data frame and a very natural way to structure custom R objects. So, when creating my own objects, how do I implement it? If there is customization, where does it take place? Do I access the element in the $ or the [] first? What assignment operator is being used?

To investigate, I created a very simple object with easy properties that I could assign.

setClass("Person", representation(FirstName = "character", LastName = "character", 
    Birthday = "Date"))

I then created two easy access and set methods. For reasons that will become clear in a moment, I also added a statement to indicate when the methods had been called.

setMethod("$", signature(x = "Person"), function(x, name) {
    print("Just called $ accessor")
    arguments <- as.list(
    slot(x, name)
setMethod("$<-", signature(x = "Person"), function(x, name, value) {
    print("Just called $ assignment")
    arguments <- as.list(
    slot(x, name) = value

And I created a new object.

objPeople = new("Person", FirstName = c("Ambrose", "Victor", "Jules"), LastName = c("Bierce", 
    "Hugo", "Verne"), Birthday = seq(as.Date("2001/01/01"), as.Date("2003/12/31"), 
    by = "1 year"))

So, I can access the properties and my methods will tell me when they've been accessed. I can also assign to the member and I'll be told when that happens as well.

## [1] "Just called $ accessor"
## [1] "Ambrose" "Victor"  "Jules"
objPeople$FirstName = "Joe"
## [1] "Just called $ assignment"

Now here's the interesting bit. (Interesting if you've just gotten to the train station in Samarkand and are trying to find your hotel. Not so interesting if you've been in Uzbekistan for a few weeks.)

objPeople$FirstName[2] = "Joe"
## [1] "Just called $ accessor"
## [1] "Just called $ assignment"

The assignment produced a call to the accessor function? Why? The answer may be found in one of two places. One is the very clear, concise and speedy answer that I got to a question I posed on StackOverflow, which may be read here. Two is the R documentation, which may be found here.

This will tell us that the following two sets of statements are equivalent. (For the rest of the post, I'm suppressing output, so the messages about when the '$' operators are called will not appear.)

objPeople$FirstName[2] = "Joe"

`*tmp*` <- objPeople
objPeople <- `$<-`(`*tmp*`, name = "FirstName", value = `[<-`(`*tmp*`$FirstName, 
    2, value = "Joe"))

So what's happening? When I want to assign to a subset, three things take place. First, I use my accessor to sort out precisely which value I'm extracting from. Next, I use bracket assignment to alter the elements of a subset of that vector. Finally, I assign the whole vector back to the component of my object. This is a bit easier to see, if we take the steps one at a time.

gonzo = objPeople$FirstName
mojo = `[<-`(gonzo, 2, value = "Joe")
objPeople = `$<-`(objPeople, "FirstName", mojo)

This is why the accessor is not called if there is no subset in the assignment. In that case, the equivalent expression is simply the following:

objPeople = `$<-`(objPeople, "FirstName", "Joe")

Welcome to Uzbekistan. Please enjoy our fine network of buses.

Watching Africa from a plane

I wrote this 8 or 9 days ago, while on a plane and am just now getting around to posting it.

It'€™s either 7:12 PM Friday or 2:12 AM Saturday. I'€™m somewhere over the Mediterranean, having just passed over Tunisia. Sunrise will happen too late for me to see the Sahara. It's a mass of beige on the tiny map; a green dotted line treks doggedly forward over a baked wasteland about the size of the entirety of the US east of the Mississippi. It's wrong to talk about Africa without talking about the enormity of it. Once that's cleared, we'll be in Addis Ababa, Ethiopia. I doubt there will be any on offer- and it won't be at all an appropriate time to drink it- but I'd love to try some Tej. This is likely moot as I have no Ethiopian currency. I'll be glad for a coffee and the chance to buy my kids some postcards.

I'm not sure who else is on this plane. Quite a few Africans, of course, but more white Americans than I was expecting. At least one is wearing a purple t-shirt with letters arranged in the shape of a cross. Are they all missionaries? Well, not all, obviously. I'm not and neither is the woman sitting next to me. She works with the university in Addis. And me? I've got this laptop and my brain and I'm going to try to share the contents of both with some students in Rwanda. I'll have help, of course. I remain intellectually embarrassed to be involved at all. It was only my eagerness to travel, to experience and to learn that got me here. That and, I expect, a surfeit of volunteers.

Still, the question remains: just what are we all doing in Africa? I can only answer the question for myself and there are two bits of it. The first is the easy bit. I love to travel. This trip has enough altruism that I don't feel too guilty leaving my family for 10 days on another of my crazy whims. The second is different. If the trip had been to Peru, Slovakia, or Sri Lanka it would not have caught me the same way. Africa. The continent which is too big to fail, but for which everyone has such bleak hopes. Africa. Origin of humanity. Eden. Africa, source of cheap natural resources, from oil to uranium to diamonds to its most devalued commodity: free human labor. Africa the home of failed states, dictatorships, foreign-drawn borders, heart of darkness, punishing sun, steamy jungles and parched sand. Africa. The place I'd chosen to ignore for the first 40 years of my life. The place that draws me in the same way other places have, with the whispered voice telling me, “There must be more than this. Everyone else surely has it wrong. The only way you'll find out is to go there.”

This won't be an exhaustive experience, mind. It's really just 9 days. Nowhere enough for insight, answers or truth. Yet more than I had when I woke up this morning. Before I dragged myself from my home, bleary-eyed, drove through the darkness to fly against the sun and compressed one day and half a world while sitting on a plane. Tomorrow, I’ll rise again and dust my eyes to greet the African dawn.

Triangle Open Data Day 2014

A rare live blog post today. I'm writing this from Triangle Open Data Day 2014. This will basically be a page of links that I'll try to get around to later.

GIS resources:

Open data resources:

Cloud development resources:

MongoDB presentation is about to start. Will likely update this post.

Another skewed normal distribution

At the CLRS last year, Glenn Meyers talked about something very near to my heart: a skewed normal distribution. In loss reserving (and I'm sure, many other contexts) standard linear regression is less than ideal as it presumes that deviations from the mean are equally distributed. We rarely expect this assumption to hold (though we should always test it!). Application of a log transform is one way to address this, but that option isn't available for negative observations. Negative incremental reported losses are very common and even negative payments which arise from salvage, subrogation or other factors happen often enough that (in my view) the log transform isn't an attractive option.

Meyers gave a talk where he described (among other things) the lognormal-normal mixture. That presentation, Stochastic Loss Reserving with Bayesian MCMC Models, is worth any actuary's time. The idea is simplicity itself. Z is lognormally distributed, with parameters mu and theta. X is normally distributed with parameters Z and delta.

Let's have a look at this distribution. Well, actually that's easier said than done. Here are the equations:

Z \sim Lognormal(\mu,\sigma)

X \sim Normal(Z,\alpha)

So Z is easy, it's just a lognormal. In fact, here it is:

sigma = 0.6
mu = 2
x = seq(-10, 60, length.out = 500)
Z = dlnorm(x, mu, sigma)
plot(x, Z, type = "l")

plot of chunk unnamed-chunk-1

X for the expected value of Z is also easy. Here it is:

expZ = exp(mu + sigma^2/2)
delta = 3
pdfX = dnorm(x, expZ, delta)
plot(x, pdfX, type = "l")

plot of chunk unnamed-chunk-2

Here, we've produced a normal centered around the expected value of the original lognormal distribution. Not skewed and not all that interesting. What we want is a distribution wherein the mean of the normal is itself a random variable. To get that, we have three options: one lazy, one easy and one. I'll show the lazy one first.

The lazy one is to randomly sample from Z and then feed that to X. We end up with a histogram which approximates a density function.

samples = 10000
Z = rlnorm(samples, mu, sigma)
X = rnorm(samples, Z, delta)

plot of chunk unnamed-chunk-3

That's undoubtedly skew and might even correspond to Glenn's graph on slide 36. But that was lazy. The easy way is to repeat a procedure similar to what I did a week ago when demonstrating a Bayesian model which combined a lognormal and an exponential. Here, we just calculate the joint density over a subspace of the probability domain, normalize it and then compute the marginal.

plotLength = 250
Z = seq(0.001, 40, length.out = plotLength)
X = seq(-10, 40, length.out = plotLength)

dfJoint = expand.grid(Z = Z, X = X)

dfJoint$Zprob = dlnorm(dfJoint$Z, mu, sigma)
dfJoint$Xprob = dnorm(dfJoint$X, dfJoint$Z, delta)
dfJoint$JointProb = with(dfJoint, Zprob * Xprob)
dfJoint$JointProb = dfJoint$JointProb/sum(dfJoint$JointProb)

jointProb = matrix(dfJoint$JointProb, plotLength, plotLength, byrow = TRUE)

filled.contour(x = X, y = Z, z = jointProb, color.palette = heat.colors, xlab = "X", 
    ylab = "Z")

plot of chunk unnamed-chunk-4

Groovy. X can be anything, but higher values of Z will pull it to the right. Here's the marginal distribution of X.

marginalX = ddply(dfJoint[, c("X", "JointProb")], .variables = "X", summarize, 
    marginalProb = sum(JointProb))
plot(marginalX$X, marginalX$marginalProb, type = "l", xlab = "X", ylab = "Marginal probability")

plot of chunk unnamed-chunk-5

The hard way is to sit down with pen and paper and work this out algebraically. I tried. I worked through a few ugly integrations did some research on the interweb and have concluded that if there is a closed form solution, it's not something that people spend a great deal of time talking about. I will point to this paper and this website as material that I'd like to get better acquainted with. It would appear that this comes up in financial and time series analysis. No surprises there, I think there are similar reasons to need this sort of distribution.

For the record, here's what that integrand looks like.

f_{X}(X|Z)f_{Z}(Z) = \frac{1}{Z \delta \sigma 2\pi}e^{\frac{-(X-Z)^2}{2\delta^2}+\frac{-(ln(Z)-\mu)^2}{2\sigma^2}}

If you know how to integrate that over Z, please let me know.

If you've made it this far, odds are good that you're an actuary, a stats nerd or both. Whatever you are, take a moment to thank heaven for Glenn Meyers, who's both. He's made tremendous contributions to actuarial literature and we're all the better for it.

Session info:

## R version 3.0.2 (2013-09-25)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## other attached packages:
## [1] knitr_1.5        RWordPress_0.2-3 plyr_1.8        
## loaded via a namespace (and not attached):
## [1] evaluate_0.5.1 formatR_0.10   RCurl_1.95-4.1 stringr_0.6.2 
## [5] tools_3.0.2    XML_3.98-1.1   XMLRPC_0.3-0

An idiot learns Bayesian analysis: Part 2

A week ago, I wrote a bit about my personal journey to come to grips with Bayesian inference. I referred to the epiphany that when we're talking about Bayesian analysis, what we're talking about- in a tangible way- is using and modifying multivariate distributions. This reminds me of the moment, about twenty years ago now, when I had a nascent interest in object oriented programming. I had read about it, worked through some tutorials and struggled to write any code of my own. I spent a bit of time with a fantastic programmer that I knew, who showed me some code that he and a team of developers were putting together. This was commercial grade software, meant for delivery to clients who would be paying real money for it. And it was straightforward. I said, “John, this really doesn't seem all that strange. It looks like the same code that I'm currently writing.” John responded with a very patient nod of his head. Sure there were some subtle, yet meaningful differences, but I got over my mental hurdle and have been an OOP fan ever since. (No one would mistake me for an expert. I'm a fan.) I'm hoping that Bayes will work out the same and I think that dealing with the multivariate mechanics is the way forward. The math is just too simple.

OK, that last comment may have been a bit much. The math isn't always simple and heaven knows, I'm lost with anything more than two variables. My last post dealt with the simplest multivariate distribution possible- a 2 dimensional Bernouli trial. Now I'm going to do something a bit more complicated. I'm not going to use more than two variables- I doubt my head could take it- but I will at least move to a continuous distribution. My favorite probability distributions are the binomial (only two things to think about!) and the Pareto (discrete! nothing negative!). The only continuous distribution for which I have any fondness is the exponential (one parameter!). It's simple and I like the “lack of memory” property. As an aging drinker, I can identify with that. Moreover, if you combine more than one, you can create a semiparametric model which can support all kinds of curves. So, an exponential distribution it is.

This won't resemble any real-world phenomeon that I can think of. If you need to assume something, you can think of time until a customer service representative answers your call. And, let's say you're calling to trying to make changes to an airlines reservation. US residents may assume American Airlines and feel free to expect that the time may be eternal and with an unhappy resolution. Residents in other countries may substitute Lufthansa, RyanAir, Aeroflot or something similar. (Actually, Lufthansa was always fantastic if travel plans didn't need to be altered. Great planes, great staff. But changing a booking was often a Kafka-esque experience)

So, because this is Bayesian we need another parameter. I'll use the lognormal because it's simple and has support across the set of positive numbers. To simplify, I'm going to presume that I know the variance. (Aside: virtually every beginning stats book that I've read contains exercises where the variance is know, but the mean is not. This has always seemed crazy to me. I can't imagine a circumstance where I emphatically know how much observations vary around the mean, but I don't know the mean itself. The only thing crazier is an obsession with urns. A survey of the work of John Keats would have fewer uses of the word “urn.”)

Right. Back to the math. I'm going to use a lognormal with a mean of 5 and a CV of 20%. Let's draw a quick picture.

mean = 5
cv = 0.2
sigma = sqrt(log(1 + cv^2))
mu = log(mean) - sigma^2/2

plotLength = 100
theta = seq(0.001, 10, length = plotLength)
y = dlnorm(theta, mu, sigma)
plot(theta, y, type = "l")

plot of chunk unnamed-chunk-1

And now an exponential. Note that for the parameter, I'll be discussing it in such a way that the expected value is equal to the parameter. I think that often, the parameter is set so that the expected value is equal to the reciprocal of the parameter. Actuaries learned it differently- or at least this one did. I don't have my copy of Loss Models with me (it's at work, where it might do me some good), but that's how I'm used to thinking about it.

So, knowing that the exponential parameter could be any positive real number- but expecting it to be something close to 5- let's draw some exponential curves. We'll use tau as the parameter for waiting time.

tau = seq(0.001, 20, length = plotLength)
t1 = dexp(tau, 1)
t2 = dexp(tau, 1/2)
t3 = dexp(tau, 1/5)
t4 = dexp(tau, 1/10)
plot(tau, t1, type = "l")
lines(tau, t2)
lines(tau, t3)
lines(tau, t4)

plot of chunk unnamed-chunk-2

Not terribly pretty, but that's the exponential. No mode, most of the probability shoved to the left hand side, the higher mean distributions wholly reliant on remote, but high valued occurrences.

Now I'm going to render this in two-dimensional space.

dfJoint = expand.grid(Theta = theta, Tau = seq(0.001, 10, length = plotLength))

dfJoint$LogProb = dlnorm(dfJoint$Theta, mu, sigma)
dfJoint$ExpProb = dexp(dfJoint$Tau, 1/dfJoint$Theta)
dfJoint$JointProb = with(dfJoint, LogProb * ExpProb)
dfJoint$JointProb = dfJoint$JointProb/sum(dfJoint$JointProb)

jointProb = matrix(dfJoint$JointProb, plotLength, plotLength)

filled.contour(x = unique(dfJoint$Theta), y = unique(dfJoint$Tau), z = jointProb, 
    color.palette = heat.colors, xlab = "Theta", ylab = "Tau")

plot of chunk unnamed-chunk-3

Most of the probability is for tau less than 4 and theta around 5. Each change in color looks a bit like a lognormal distribution as it should. The shift from one lognormal to another follows an exponential curve. Rapid near the bottom, then slower away going towards the top.

As with the cancer example, all of the probability is here. What is the chance that I'll be on hold for longer than 5 minutes? That can be obtained using the marginal exponential distribution. I'm going to cheat and calcuate the marginal based on the subset of the probability space that I used to create the plot.

marginalExp = ddply(dfJoint[, c("Tau", "JointProb")], .variables = "Tau", summarize, 
    marginalProb = sum(JointProb))
sum(marginalExp$marginalProb[marginalExp$Tau > 5])
## [1] 0.2588
plot(marginalExp$Tau, marginalExp$marginalProb, type = "l", xlab = "Tau", ylab = "Marginal probability")

plot of chunk unnamed-chunk-4

About a 1 in 4 chance. American Airlines should be so lucky. Still, though, this is promising. What I'm most interested in is making predictions and the marginal allows me to do that. I reached this marginal by assuming a prior density- both form and parameters- for another distribution which controls my undestanding of a random event. This is an additional step. Ordinarily, I just work with the one function. Does the addition of another clarify or obscure things? My immediate reaction is to say yes, because- ceteris paribus- more information is preferred over less information. I hope to explore this in my next post, where I'll try to use some real-world example. Suggestions welcome.

Session info:

## R version 3.0.2 (2013-09-25)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## other attached packages:
## [1] knitr_1.5        RWordPress_0.2-3 plyr_1.8        
## loaded via a namespace (and not attached):
## [1] evaluate_0.5.1 formatR_0.10   RCurl_1.95-4.1 stringr_0.6.2 
## [5] tools_3.0.2    XML_3.98-1.1   XMLRPC_0.3-0

An idiot learns Bayesian analysis: Part 1

I've done a dreadful job of reading The Theory That Would Not Die, but several weeks ago I somehow managed to read the appendix. Here the author gives a short explanation of Bayes' theorem using statistics related to breast cancer and mammogram results. This is the same real world example (one of several) used by Nate Silver. It's profound in its simplicity and- for an idiot like me- a powerful gateway drug. Possibly related to this is my recent epiphany that when we're talking about Bayesian analysis, we're really talking about multivariate probability. The breast cancer/mammogram example is the simplest form of multivariate analysis available. What does it all mean, how can we extend it and what does it have to do with an underlying philosophy of Bayesian analysis (if such a thing exists)?

The Theory That Would Not Die is sitting at my desk at work, so I'm going to refer to the figures quoted by Nate Silver on page 246. Odds for cancer are read across the columns, odds for a positive mammogram are read down the rows.

Before I go any further, I have to point out that the positioning of the tables is dreadful. WordPress experts are invited to help me sort this out.

probs = matrix(c(11, 99, 3, 887), 2, 2, byrow = TRUE, dimnames = list(c("M-True", 
    "M-False"), c("C-True", "C-False")))

C-True C-False
M-True 11 99
M-False 3 887

From this table, the joint probabilities are easy to read. What is the chance that a person has breast cancer and received a negative mammogram? 3 in 1000. What is the chance that a person does not have cancer, but received a positive mammogram? 99 in 1000, or roughly 10%. It's a trivial thing to determine the marginal probabilities.

probs = cbind(probs, rowSums(probs))
probs = rbind(probs, colSums(probs))
colnames(probs)[3] = "M"
rownames(probs)[3] = "C"

C-True C-False M
M-True 11 99 110
M-False 3 887 890
C 14 986 1000

The context of this information is what matters to the authors. Each presents the result that the likelihood that a patient has cancer- even with a positive mammogram- is still rather low (10% in this case). This is consistent with advice from some areas of the medical establishment that women not get routine mammograms before a particular age. This (slightly) surprising result is driven by the fact that the positive predictive value (number of true positives divided by the number of predicted positives) is very low as is the likelihood of a positive. Put differently, a mammogram does not appear to have a good success rate at predicting cancer (for this data) and the overall rate of cancer is quite low. How would things look if the numbers changed?

How do we do that? In order to hold the cancer probability fixed, we can't change the marginal totals. So, we can move numbers in the same column from one row to another. Or, if we move from one column to another, we must offset that in the other row. As an extreme, we could assume that the test is perfectly predictive. This would move the 3 false negatives into the true positive cell and the 99 false positives to the true negative cell. In this case, there is no probability in the upper right or lower left corner of the matrix. From another perspective, it is impossible to distinguish the two marginal distributions.

But that's a bit boring, so let's create something more interesting. We'll not alter the number of false negatives, but reduce the false positives so that the positive predictive value is close to 80%.

falsePositive = round(probs[1, 1]/0.8)
probs2 = probs
probs2[1, 2] = falsePositive
probs2[2, 2] = probs2[3, 2] - falsePositive
probs2[1, 3] = sum(probs2[1, 1:2])
probs2[2, 3] = sum(probs2[2, 1:2])

C-True C-False M
M-True 11 14 25
M-False 3 972 975
C 14 986 1000

The chance that a person has cancer, conditional on a positive mammogram is now 44.0%. Before I look at another scenario, I'm going to scrap the tables in favor of something graphical. Here's what the first matrix looks like:

mdf = melt(probs[1:2, 1:2], varnames = c("Mammogram", "Cancer"))
mdf$Cancer = ifelse(mdf$Cancer == "C-True", "Yes", "zNo")
p = ggplot(mdf, aes(x = Cancer, y = Mammogram)) + geom_tile(aes(fill = value)) + 
    scale_fill_gradient(low = "blue", high = "yellow")

plot of chunk unnamed-chunk-7

And the second matrix:
plot of chunk unnamed-chunk-8

In the second plot, we continue to have a large concentration of the probability in the bottom right corner, but the the top half is now more balanced. This balance comes from a shift away from top right corner. All of this means that the information about a mammogram becomes more predictive.

What happens when we increase the likelihood of cancer? In graphical terms, this would mean giving the left side a more yellow color. We'll hold the original positive predictive value (roughly 10%) fixed, but raise the likelihood of cancer to 25%.

PPV = probs[1, 1]/probs[1, 3]
probs3 = probs
probs3[2, 1] = 250 - probs3[1, 1]
probs3[2, 2] = probs3[2, 3] - probs3[2, 1]
probs3[3, 1] = 250
probs3[3, 2] = 750

C-True C-False M
M-True 11 99 110
M-False 239 651 890
C 250 750 1000

plot of chunk unnamed-chunk-10

This is interesting. The highest probability remains at the lower right hand corner (no cancer, clean mammogram) but there is now a greater concentration at the upper right and lower left corner. So, if one has a positive mammogram result, what is the posterior probability that they have cancer? The same 10% as before. And if the test showed negative? It's now 27%. This is higher than the probability if one got a positive result. Of course, this is because we've held the positive predictive value fixed, while raising the probability of the event. The efficacy of the test and the prevalence of the disease are now anti-correlated. Not the sort of thing one wants in a diagnostic tool. How would things look if the PPV were 50%?

probs4 = probs3
probs4[1, 1] = 55
probs4[2, 1] = 250 - 55
probs4[1, 2] = 55
probs4[2, 2] = 750 - 55

C-True C-False M
M-True 55 55 110
M-False 195 695 890
C 250 750 1000

plot of chunk unnamed-chunk-12

So what makes this Bayesian? The simple answer is that I don't know. I have trouble reconciling Silver and McGrayne's simple (though very accessible) examples of Bayesian inference with what I read in Gelman and Albert. Untangling the math takes me away from the philosophy, so I'll list three quick notions about what Bayesian analysis means to me:

  • In the presence of new information, our prior understanding may be modified. This is the one that feels like a one-off exercise as it is presented in the mammography
    examples. If I don't know anything at all about a person, I assume that the chance they have cancer is about 1.4%. If I know they've had a mammogram, I adjust my result up or down. This is a slightly static view of the world
  • Similar to the above, but subtly different: the process of gathering information means that our understanding continually evolves. This is the view which Silver seems to push. This allows both for continual improvement of knowledge, but also the opportunity to respond as underlying probabilities change. One critical element that's not addressed in the cancer/mammogram example is that there is presumed- and unearned- certainty in the underlying probabilities. Silver and McGrayne use two different sets
    of figures. Either the parameters are uncertain or they're drawing from samples which vary in some other way (which is another way of saying that the parameters possess some stochasticity).
  • The third interpretation is what I think of as the “actuarial” view. I can't point to a specific paper (though Bailey comes close) but it's more a feeling I get from those rare references to Bayes (explicit and otherwise) in the actuarial literature. The world is divided into sets, though you can't know to which set a particular item belongs. You may only refine the likelihood that an item belongs to a specific set in the presence of information. For example, there are three sets of drivers: very good, average and bad. If a driver has had one accident in the past 12 months, to which set do they belong? The chance that they belong to the set of very good drivers is low, but neither are they incontrovertible members of the bad drivers set.

In this example, I look at altering the joint probability distribution. I'm free to do that, if evidence warrants it. If mammography improves- or there is a provable difference in physicians' interpretations of the results- then I may alter the probabilities. If environment and lifestyle changes yield an alteration in disease prevalence, that also affects the joint distribution. It's a great toy example to begin to explore more varied problems. That's what I'll do next as I expand the example from a very simple 2x2 matrix to something more complicated.

Before I forget, my understanding of the definition of positive predictive value is taken from An Introduction to Statistical Learning, which is a great book. That value is one component of the fascinating subject of binary classification. I first heard about this in a great talk given by Dan Kelly at a meeting of the Research Triangle Analysts

Session info:

## R version 3.0.2 (2013-09-25)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## other attached packages:
## [1] RWordPress_0.2-3 knitr_1.5        reshape2_1.2.2   ggplot2_0.9.3.1 
## [5] xtable_1.7-1    
## loaded via a namespace (and not attached):
##  [1] colorspace_1.2-4   dichromat_2.0-0    digest_0.6.4      
##  [4] evaluate_0.5.1     formatR_0.10       grid_3.0.2        
##  [7] gtable_0.1.2       labeling_0.2       MASS_7.3-29       
## [10] munsell_0.4.2      plyr_1.8           proto_0.3-10      
## [13] RColorBrewer_1.0-5 RCurl_1.95-4.1     scales_0.2.3      
## [16] stringr_0.6.2      tools_3.0.2        XML_3.98-1.1      
## [19] XMLRPC_0.3-0

Get every new post delivered to your Inbox.

Join 273 other followers