Yesterday (19th May 2016), assembly election results for 4 states in India were announced. In Tamil Nadu, Jayalalithaa’s party AIADMK proved the exit polls wrong by coming to power for a second term in a row. This is not a first: various elections in the past have revealed the inaccuracies of exit polls. Who can forget Bihar’s assembly elections last year, when most exit polls predicted a BJP victory? That prediction turned out to be wrong, but not before BJP party members had bought sweets and crackers to celebrate! 😛

So, why are exit polls so inaccurate? The answer to that lies in sampling.

## What is Sampling?

Wikipedia Definition: In statistics, quality assurance, and survey methodology, sampling is concerned with the selection of a subset of individuals from within a statistical population to estimate characteristics of the whole population.

Suppose I want to know which car is the most popular in the city I live in, which has a population of 1 million. Assuming an average of 4 people per household, we have 250,000 households to interview, which is almost impossible to do. So, I select a neighbourhood and ask the people living there which car they bought. Let’s say that 600 out of the 1,000 households I interviewed said that they bought a Renault Duster, which is 60% of the sample. Applying the same proportion to the entire population, I can say that 150,000 households in the city own a Renault Duster.
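The arithmetic above can be written out as a tiny sketch. The function name `extrapolate_count` is my own, made up for illustration; the numbers are the ones from the example.

```python
def extrapolate_count(sample_hits, sample_size, population_size):
    """Scale the proportion seen in a sample up to the whole population."""
    return round(sample_hits * population_size / sample_size)

# 1 million people, assuming an average of 4 people per household
households = 1_000_000 // 4

# 600 of the 1,000 households interviewed own the car
estimate = extrapolate_count(600, 1000, households)
print(estimate)  # 150000
```

Note that this estimate is only as good as the sample it is built on, which is exactly where things go wrong next.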

But there is a small problem.

I chose a very posh neighbourhood to interview. Since the majority of families living there are rich, they have a high probability of owning an expensive car. That might not be true for the entire city. So, the answer we obtained above is not accurate because our sample was biased towards the people who are rich.
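To see how much damage this does, here is a rough simulation. Everything in it is an assumption I made up for illustration: a city where 10% of households live in posh neighbourhoods, and where 60% of posh households but only 10% of the rest own the expensive car.

```python
import random

rng = random.Random(42)  # fixed seed, only so the run is repeatable

# Hypothetical city of 250,000 households (all rates below are assumed).
city = []
for _ in range(250_000):
    posh = rng.random() < 0.10                      # 10% live in posh areas
    owns = rng.random() < (0.60 if posh else 0.10)  # assumed ownership rates
    city.append((posh, owns))

def share_of_owners(households):
    """Fraction of the given households that own the expensive car."""
    return sum(owns for _, owns in households) / len(households)

true_share = share_of_owners(city)

# Biased sample: 1,000 households, all from posh neighbourhoods.
posh_only = [h for h in city if h[0]]
biased_sample = rng.sample(posh_only, 1000)

# For comparison: 1,000 households drawn from the whole city.
citywide_sample = rng.sample(city, 1000)

print("whole city     :", round(true_share, 3))
print("posh-only poll :", round(share_of_owners(biased_sample), 3))
print("city-wide poll :", round(share_of_owners(citywide_sample), 3))
```

Under these assumptions, the posh-only poll lands far above the real city-wide share, while the city-wide poll stays close to it.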

This is the first problem of sampling, and such a sample is called a biased sample. In the case of exit polls that are not conducted properly, it’s highly probable that the people interviewed leaned towards one party. Then there is the trustworthiness of the interviewees, which can again lead to inaccurate results. For instance, a recent study conducted at IIT-Mumbai claimed that 95% of the freshers are virgins. How many of the students interviewed told the truth, and how many lied to look either cool or innocent in their friends’ eyes, we will never know.

Going back to our problem statement of finding the most bought car, let’s now modify our strategy. This time, instead of interviewing a neighbourhood, I decide to stand at a busy traffic signal during office hours (9 AM to 11 AM) and count the different models of cars that pass by.

Why did I choose a traffic signal this time? Because people from all economic sections of society are much more likely to pass through a traffic signal, and thus my sample will not be biased. Or so I think! However, I am still missing out on certain sections of people. What about housewives who have a car? What about people who work night shifts and hence won’t be passing through the signal during the time slot that I selected? What about the people who own a car but choose to carpool to work? You get the idea.

This is the second problem of sampling, and to avoid this problem we need to select our sample carefully so that it is representative of the entire population. What that means is that the sample must represent all the different subsets of the population and in the same proportion in which they occur.

### So, how do we select a perfect sample?

The perfect sample is a **random sample** selected by pure chance from the entire population. If every object in a population has an equal chance to be in the sample, then our sample can be called a truly random sample.
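Here is a small sketch of what “equal chance” means in practice. It uses Python’s standard `random.Random.sample`; the ten numbered households and the helper name `simple_random_sample` are hypothetical, just for illustration. If we draw a sample of 3 out of 10 over and over, every household should show up in roughly 3 out of every 10 samples.

```python
import random
from collections import Counter

def simple_random_sample(population, k, rng):
    """Draw k distinct items; every item is equally likely to be picked."""
    return rng.sample(population, k)

rng = random.Random(2016)     # fixed seed, only so the run is repeatable
population = list(range(10))  # ten hypothetical households, labelled 0..9
trials = 10_000

# Count how often each household lands in a sample of 3.
inclusion = Counter()
for _ in range(trials):
    inclusion.update(simple_random_sample(population, 3, rng))

for household in population:
    # Each frequency should come out close to 3/10.
    print(household, inclusion[household] / trials)
```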

However, it is extremely expensive and difficult to obtain a purely random sample. And so, statisticians use a more practical variant of the random sample, called a stratified random sample.

### How do you obtain a stratified random sample?

To obtain a stratified random sample, you divide the population into subsets (strata) and then pick randomly from each subset, in the proportion in which that subset occurs in the entire population. So, for example, if a city is 60% male and 40% female, our sample should also have males and females in the same proportion.
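A minimal sketch of this idea, using the 60/40 split from the example. The function `stratified_sample` and the made-up population of 10,000 people are my own assumptions for illustration, not a standard library API.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(population, stratum_of, k, seed=None):
    """Pick about k items, each stratum in proportion to its size."""
    rng = random.Random(seed)

    # Group the population into strata.
    strata = defaultdict(list)
    for item in population:
        strata[stratum_of(item)].append(item)

    sample = []
    for members in strata.values():
        # Proportional allocation; rounding can make the total differ
        # slightly from k when the proportions do not divide evenly.
        n = round(k * len(members) / len(population))
        sample.extend(rng.sample(members, n))
    return sample

# Hypothetical city: 60% male, 40% female.
population = [("male", i) for i in range(6000)] + \
             [("female", i) for i in range(4000)]

sample = stratified_sample(population, lambda person: person[0], 100, seed=1)
counts = Counter(gender for gender, _ in sample)
print(counts["male"], counts["female"])  # 60 40
```

This only stratifies on one variable. Stratifying on several at once means one stratum per combination, which is where it gets hard.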

Unfortunately, it is not easy to obtain a completely unbiased stratified random sample. Taking the above example of 60% male and 40% female, it’s not enough to select our sample in just that proportion. There are so many other variables we need to take into account: age group, economic status, educational background, employment status, and so on. Depending on your specific use case, you need to take into account all the various combinations that will make your sample a pure stratified random sample, and trust me, that’s almost impossible to achieve. Take, for example, the traffic signal approach we discussed above. We chose to count cars at a traffic signal because that seemed as random as you can get, yet it was not a perfect sample because it did not represent certain subsets of society.

Sadly, media houses and a lot of publications these days do a shoddy job of creating samples. Unless the sample is large enough and selected properly, the conclusion derived from it has only a spurious air of scientific precision and may be far less accurate than an intelligent guess. So, the next time you see a news reporter talking about exit polls, or read survey results in newspapers and magazines, take them with a pinch of salt.

*This post first appeared on my blog Let’s Talk Data.*