Science does a terrible job predicting individual behavior. It’s not that we don’t try. We’re just not very good at it.
Science is really good at predicting a group. We collect data on a group, and use it to predict what the group, in general, will do. If you want to know if a drug works, we gather a group of people together, randomly assign them to wonder-drug and sugar-pill treatment conditions, and we see who lives and dies.
Then, we tell you that the new drug is great. But what we mean to say is that it is generally great. Most of the people in the study did better on the drug. A few individuals had fabulous results. And a few people died. But, overall, it’s a great drug.
Unfortunately, you’re not a group. As an individual, it is extremely hard to predict what will happen to you on the new drug. You might be like most people. If so, take the drug.
You might be unusual. If you’re unusual, the drug will do wonders or kill you. But we can’t say which it will be.
Can we do any better predicting what movies you will watch? Yes and no.
On the yes side, there are several helpful factors. First, you do this behavior more than once. Science is always better at predicting events that repeat themselves. If you regularly rent movies, there might well be a pattern in your behavior. So if you’ve rented thousands of movies, predicting your behavior is feasible. In fact, if you watch lots of movies, you’re probably up for watching anything. But if you have only rented one movie, guessing your next movie is extremely difficult.
Science would prefer to predict a group, not an individual. And it would prefer to predict regularly repeating behavior, not occasional, sporadic, or spurious behavior. If you blow up buildings on a regular basis, predicting that you’ll be violent in the future isn’t so hard. If you only blow up a building here or there, it’s difficult to model that behavior.
In 2006, Netflix offered a million dollars to anyone who could improve the predicting accuracy of Cinematch, their how-about-renting-this-movie software. To reach that goal, programmers tried to model consumer behavior.
The contestants were given a large file of movie ratings, along with movie titles and dates. No information about the customers was included. So predicting individual behavior wasn’t possible. Like testing a new drug, modeling movie ratings predicts what will happen to a group of scores. The software will only predict you to the extent that you are a lot like the people in the data set.
Predicting movie ratings is even harder than it might seem. Remember, the ratings are on a 5-point scale. And that scale uses ordinal numbers.
Ordinal numbers give us 1st, 2nd and 3rd place, but no information about how close the race was. First and second places could be really close, or quite far apart. So you might score Jaws high but is it a lot higher or only slightly higher than any other Spielberg film?
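To see what a ranking throws away, here is a small sketch in Python. The film names and scores are invented for illustration, and the `ranks` helper is hypothetical. It shows that a photo finish and a blowout produce exactly the same 1st, 2nd and 3rd places.

```python
def ranks(scores):
    """Return 1-based ranks, highest score first (ties are not handled)."""
    order = sorted(scores, key=scores.get, reverse=True)
    return {name: place + 1 for place, name in enumerate(order)}

# Invented scores: one close race, one lopsided race
close_race = {"Film A": 9.1, "Film B": 9.0, "Film C": 8.9}
blowout = {"Film A": 9.1, "Film B": 5.0, "Film C": 1.2}

# The ordinal information is identical even though the gaps differ wildly
same_ranking = ranks(close_race) == ranks(blowout)
print(ranks(close_race), same_ranking)
```

Once the scores are reduced to ranks, there is no way to recover how far apart the films were.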
A related problem is that people aren’t consistent in their ratings. When you ask people to re-rate a movie, they don’t always give the same answer. This isn’t surprising, given that moods change. Although inconsistent ratings probably happen more often with movies in the middle, we have to be in the right mood even for movies we love.
Predicting starts with a simple linear regression (see Day 5). You gather data and see what the general pattern is. If there is one simple straight line, your task will be quite easy. But more complicated data sets often require more work.
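As a rough sketch of that first step, here is a least-squares straight-line fit written out by hand in Python. The data points are made up, and `fit_line` is a hypothetical helper, not anything the Netflix contestants necessarily used.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sum of squared deviations in x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up data that roughly follows a straight line
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]

slope, intercept = fit_line(xs, ys)
prediction = slope * 6 + intercept  # predict y for a new x of 6
print(round(slope, 2), round(intercept, 2), round(prediction, 2))
```

When the points really do fall near one line, the slope and intercept are the whole model. When they don’t, a single line like this will predict poorly.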
The general term for this is modeling. Essentially, you calculate the correlation (see Day 4) between all of the variables, and see if you can find patterns or clusters of correlations. If you rated one funny movie high, you might do the same with another funny movie.
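Here is a minimal sketch of that idea, assuming five imaginary viewers who each rated three imaginary movies on the 5-point scale. The titles, the ratings, and the `pearson` helper are all invented for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Invented ratings: each list holds the same five viewers, in the same order
ratings = {
    "Comedy A": [5, 4, 1, 2, 5],
    "Comedy B": [4, 5, 2, 1, 5],
    "Horror C": [1, 2, 5, 4, 2],
}

r_comedies = pearson(ratings["Comedy A"], ratings["Comedy B"])
r_mixed = pearson(ratings["Comedy A"], ratings["Horror C"])
print(round(r_comedies, 2), round(r_mixed, 2))
```

In this toy data the two comedies correlate strongly and positively, while the comedy and the horror film correlate negatively. Clusters like that are the patterns a model looks for.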
Here’s how modeling works. Start by imagining a room where movie titles are floating in air. When you look closer, you can see that all of the funny movies are floating near the ceiling; and the dark, scary films are near the floor.
You also notice that they are arranged left to right by their target age group: kid movies to the left, senior citizens to the right. The third dimension, the depth of the room, indicates popularity (most-liked to least-liked). This three dimensional space is a model of what these movies have in common.
Now think of these titles as flowing through the room, changing every few seconds. It’s a stream of information that has spurts, lulls, waves and transitions. In this rapidly changing sea of data, try to hit one title with a dart. Think of it as “pin the tail on the movie.” It is not an easy task.
In addition, you need to add more dimensions; three is not enough to describe them all. Movies vary on theme, quality of photography, cleverness of titles, the fame of actors, the quality of directors, the skill of editors, and on their cultural, spiritual, religious and political context. They could also be rated on happy endings, exotic settings, and intricate costumes.
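One way to picture the room in code is to give each movie a coordinate on every dimension and measure how far apart the titles float. This is only a sketch: the titles and coordinates are invented, the three axes stand for funniness, target age, and popularity, and the `nearest` helper is hypothetical. The same distance calculation works if you add more dimensions.

```python
from math import dist  # Euclidean distance, available in Python 3.8+

# Invented coordinates: (funniness, target age, popularity), each scaled 0 to 1
movies = {
    "Slapstick Summer": (0.9, 0.2, 0.7),
    "Goofy Sequel": (0.8, 0.3, 0.6),
    "Midnight Cellar": (0.1, 0.5, 0.4),
}

def nearest(title):
    """Return the other movie floating closest to the given title."""
    others = (t for t in movies if t != title)
    return min(others, key=lambda t: dist(movies[title], movies[t]))

print(nearest("Slapstick Summer"))
```

If you liked one title, the titles floating nearest to it are the natural guesses for what you might rent next.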
Descriptions of these inter-correlations are called maps. Not only do they help clarify the data, they can also be quite pretty (see http://www.the-ensemble.com/).
Statistics often looks for consistent patterns. We predict what is best for groups of people based on other group data. We are great at predicting what large groups of people repeatedly do. We’re pretty good at predicting what large groups of people sometimes do. And we’re lousy at predicting what you will do.
In my life, I categorize my errors into three groups. I suppose I could use small, medium and large. And I could classify them as those I talk about, those I could talk about, and those I never talk about. But what I really do is sort them by how they make me feel. So I use emotional labels: stupid little errors, there-I-go-again errors and I-can’t-believe-I-did-that errors.
In science, there are only 2 kinds of errors: Type I and Type II. These are decision errors. That is, they are errors we make when we make a decision about the results we find. They are not errors in collecting the data (though that can happen too). Decision errors are what we do after we’ve analyzed the data.
A Type I error is deciding that your results are significantly different from chance, when in fact they are not. This is the equivalent of seeing things that aren’t there. A Type I error is the UFO of science. It is the Chicken Little of analysis: “I’ve collected the data, and I conclude that the sky is falling.”
Obviously, this is the worst kind of decision error you can make. It’s not good to find a false cure for cancer, psychosis, depression, or the common cold. Unfortunately, this is the kind of error we tend to make when we judge the universe based on our personal experience. We can find all kinds of significant findings in tea leaves, sand piles, and constellations of stars. Finding the pattern is fine. What isn’t good is to decide that the pattern we see is due to a false causation: leaves change in fall because I sneeze; the sand at the beach is the result of Martians’ garbage dumping; or a new star appears when a person dies.
Type I error is jumping to conclusions of causation. Type II error is being blind to the truth. Although not as serious as Type I error, Type II error causes considerable distress. It delays progress and misleads people. It is not admitting that the world is round, not acknowledging that gravity impacts us, or not recognizing that skin color does not predict intelligence.
Type II error is scientific pretending. To counter it, we use replication. Replicating our findings allows us to hone our measurements. Over time, we get better at showing principles at work.
Type I error is scientific hallucination. The cure for it is to make our hypothesis falsifiable. That is, we design our experiment to prove something isn’t true. Science doesn’t prove things to be true, as much as it proves things aren’t true. To prove something true would require our testing every possible combination. Disproving only requires a single instance.
Proving only adds another brick to the theoretical structure. Disproving can topple the entire towering theory. “Monsters rule the universe” is hard to prove. But it is easy to test “There’s a monster under my bed.”
What science does is to state a hypothesis of no change. We assume that what we see is due to chance. And we only abandon that hypothesis when there is enough evidence.
We assume there is no difference between our wonder drug and getting an inert placebo. If there are small differences between the groups, we maintain our original hypothesis: no difference. If there are medium differences, we still hang on. We keep our hypothesis until we find “significant” statistical differences.
Type I error is rejecting our no-difference hypothesis (null hypothesis) when we shouldn’t. Type II error is keeping the null (failing to reject it) when we shouldn’t.
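A small simulation can make the two errors concrete. This is a sketch under stated assumptions, not a recipe: scores are drawn from a normal distribution with a known spread of 1, each group has 20 people, the test is a simple two-sided z-test with a cutoff of 1.96 (roughly the 5% level), and the "real" drug effect of 0.5 is invented. The function names are hypothetical.

```python
import random
from math import sqrt

def rejects_null(true_diff, n, rng, sigma=1.0, z_crit=1.96):
    """Run one drug-vs-placebo study and report whether we reject 'no difference'."""
    placebo = [rng.gauss(0.0, sigma) for _ in range(n)]
    drug = [rng.gauss(true_diff, sigma) for _ in range(n)]
    diff = sum(drug) / n - sum(placebo) / n
    z = diff / (sigma * sqrt(2.0 / n))  # z-test with known sigma
    return abs(z) > z_crit

def error_rates(trials=2000, n=20, effect=0.5, seed=0):
    rng = random.Random(seed)
    # Type I: the drug does nothing, but we declare a difference anyway
    type1 = sum(rejects_null(0.0, n, rng) for _ in range(trials)) / trials
    # Type II: the drug really works, but we fail to reject the null
    type2 = sum(not rejects_null(effect, n, rng) for _ in range(trials)) / trials
    return type1, type2

type1_rate, type2_rate = error_rates()
print(type1_rate, type2_rate)
```

Under these assumptions the Type I rate should land near the 5% cutoff we chose, while the Type II rate depends on how big the real effect is and how many people are in each group.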
Some predictions are easier to make than others. Accurately predicting the movement of the stars and planets is possible because they maintain regular patterns. The planets have relatively set paths around the sun. There is some variation but the pattern is well established and occurs repeatedly. Similarly, stars follow consistent trajectories. They don’t jump erratically; they maintain reliable courses.
Star and planet data fit the assumptions of prediction well. They have set courses, reliable patterns, and replicable observations. Statistics work well on data that fit these parameters. Any behavioral pattern that is consistent is relatively easy to predict.
In humans, scores on intelligence tests are quite consistent. Year after year, you tend to get about the same score. Moods, however, change quickly, and don’t follow a consistent pattern. Consequently, predicting moods is very hard to do.
Financial markets would be predictable if they were consistent. But stock prices jump, fall, slowly rise, and fade away. There are too many twists and turns for good prediction. So don’t blame statistics for not predicting the next major financial collapse. The data simply doesn’t meet the requirement of consistency.
If it’s any comfort, statistics is equally bad at predicting financial turnarounds. Prosperity could suddenly appear. A new discovery could be about to happen. Great news could be at hand. Statistics can sometimes explain patterns of the past but it’s not very good at seeing into the future.
In the 1950s, it wasn’t unusual for children to be quite conversant about baseball statistics. Part of that ability was tied to the popularity of the game, but part of it must be attributed to bubble gum.
It was a pretty good deal, compared to the chewable cigarettes, for example. You got a big piece of (not terribly flavorful) gum. And with the gum, you got a baseball card. It was a single card for a penny, or a five pack (sometimes 6) for a nickel.
The cards were comparable to playing cards in size and shape. The front of the card had a picture of the player, his name, team affiliation, and the position he played. The back of the card listed the player’s stats: height, weight, bats (left or right), throws (left or right), some major achievements, where he was born, and his birthday.
Then the good stuff. There were numbers for games, AB, runs, hits, 2B, 3B, HR, RBI and B ave. There also were some numbers about fielding (PO, A, and E). For pitchers, there were stats for wins, earned run average (ERA) and strikeouts.
Here’s what the abbreviations mean:
AB = at bats. It’s the number of times up to bat, not counting getting hit by a pitch, walks (bases on balls), and other unusual events.
2B = a two-base hit, also called a double.
3B = a three-base hit, also called a triple.
HR = home runs (a four-base hit).
RBI = runs batted in (the number of runners who cross the plate because of the player’s batting).
B Ave = batting average (hits divided by at bats).
PO = put outs (tagging out opposing runners)
A = assists (helping other fielders)
E = errors (mistakes)
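The batting average is the one stat on the list that you calculate rather than count. Here is a small sketch in Python, using a made-up stat line of 150 hits in 500 at bats. The function names are hypothetical.

```python
def batting_average(hits, at_bats):
    """Hits divided by at bats; a player with no at bats gets 0.0."""
    if at_bats == 0:
        return 0.0
    return hits / at_bats

def card_format(average):
    """Show the average to three decimals without the leading zero."""
    return f"{average:.3f}".lstrip("0")

# Made-up season: 150 hits in 500 at bats
average = batting_average(150, 500)
print(card_format(average))  # prints .300
```

Walks and hit-by-pitch appearances are left out of the at-bat count, so they never enter this division.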
In the case of baseball cards, statistics can be profitable too. Check your attic. If you have a 1951 Mickey Mantle rookie card (made by Bowman) or the 1952 Mickey Mantle card (made by Topps), I’ll give you a dollar for it. That’s just the kind of guy I am.
The card’s worth a lot of money but I’ll give you a dollar. As I said, that’s just the kind of guy I am.
As you can see, this is not your average American city. For one thing, this is Hong Kong. Minimally, to be the average American city, it has to be in America. 🙂
But Hong Kong can serve as an example of how to best approach the problem of description.
Every city can be described by numbers. There are certain items that could be counted: number of lights shining, height of buildings, number of people living in it, etc. And each of these variables (as opposed to constants) could be used to describe where you live. So, clearly, the first step is to decide what to measure. Are we looking for average population, average rainfall, or average income?
Let’s stick with population for now, and see where it leads. After all, how hard could this be? Let’s just take the population of the US and divide it by the number of cities in the country. That will give us the average city size.
But what exactly is a city? If we count only people who live within the actual boundaries of the city limits, aren’t we underestimating its size? For example, Hong Kong Island has a population around 1.3 million people, but the greater Hong Kong area has a population of nearly 7 million people. In the US, the greater Chicago area would be comparable in population, depending on how big you make your “greater area.”
A related issue is that defining a city turns out to be a bit tricky. For example, Maza, North Dakota claims to be a city, and yet boasts a population of five. Another contender, Marineland in Florida, has a population of 7. Just think, if a family of five moved to Maza, they would double its population.
And yet, Framingham, MA, which has a population of about 67,000, claims it is not a city. It says it is the largest town in the US. Apparently, being a city may depend on your type of local government, on whether it has formed itself into a corporation, or just on how you feel about it.
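To see how much the definition matters, here is a toy calculation in Python using only the three places and populations mentioned above. Three places are obviously not the whole country, and the variable and function names are invented, so treat this as a sketch of the problem rather than a real estimate.

```python
from statistics import mean

# Populations as quoted in the text above
populations = {"Maza, ND": 5, "Marineland, FL": 7, "Framingham, MA": 67_000}

# Definition 1: a city is whatever calls itself a city
self_declared_cities = {"Maza, ND", "Marineland, FL"}

def average_size(places):
    """Total population divided by the number of places counted."""
    return mean(populations[place] for place in places)

strict_average = average_size(self_declared_cities)  # Framingham left out
broad_average = average_size(populations)            # Framingham counted
print(strict_average, round(broad_average))
```

Same data, same arithmetic, and the "average city" goes from 6 people to more than 22,000 depending on whether Framingham counts.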
Calculating something as simple as an average can have its complications. Numbers look so clear cut and stable. But even descriptive statistics depend a lot on our definitions. So when we see a number, we have to remember that assumptions went into it. And those assumptions are critically important. We can’t separate our numbers from our assumptions.