Let’s say that you find that there is a correlation of 0.95 between the population of a city and the number of people who eat out frequently. What does this mean, if anything, about your chances of success if you locate your restaurant in a big city rather than in a smaller city?
300-500 words (leaning more towards 500). Cite any resources or references if used (not necessary).
Transcript: Introducing Correlation
Were the ads effective?
Let’s say that a US restaurant chain, Burrito Bell, has opened seven new restaurants in seven different towns, all of about the same size and economic status. To promote the openings, the chain invested in local television commercials in each town. The chain bought more ads in some towns than others and then compared the sales volumes of the restaurants. Here are the results:
Now the chain’s management wants to know if sales went up as the number of commercials increased. In other words, are sales and the number of commercials correlated—do the two variables have a linear association? Or, put another way, does an increase in one of the variables correspond with a consistent increase or decrease in the other variable?
To find out, we have to calculate their correlation.
Correlation measures the relationship between two variables. It measures both the strength (how closely they are related) and the direction (positive or negative) of the relationship. If there is a positive correlation, higher values of one variable tend to go with higher values of the other, and lower values with lower values. If there is a negative correlation, higher values of one variable tend to go with lower values of the other.
Correlation is only a meaningful summary under the following conditions:
- Both variables are quantitative.
- The relationship must be linear (rather than, for example, exponential).
- There are no outliers (or they have been left out), as they will distort the calculation.
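To see why the linearity and outlier conditions matter, here is a minimal Python sketch. All of the numbers are made up for illustration, and pearson_r is a helper name chosen here, not something from the course. It shows a perfectly curved relationship and a data set where one extreme point is added.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Made-up data: y depends perfectly on x, but along a curve (y = x squared).
curve_x = [-3, -2, -1, 0, 1, 2, 3]
curve_y = [x ** 2 for x in curve_x]
r_curve = pearson_r(curve_x, curve_y)

# Made-up data: six weakly related points, then one extreme point added.
base_x = [1, 2, 3, 4, 5, 6]
base_y = [2, 1, 3, 2, 3, 2]
r_without_outlier = pearson_r(base_x, base_y)
r_with_outlier = pearson_r(base_x + [20], base_y + [20])

print(f"curved relationship:   r = {r_curve:.3f}")
print(f"without the outlier:   r = {r_without_outlier:.3f}")
print(f"with the outlier:      r = {r_with_outlier:.3f}")
```

In the curved case r comes out at essentially zero even though y is completely determined by x, and the single extreme point pushes a weak correlation close to +1. A scatter plot would reveal both problems at a glance.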
You can check these conditions by drawing a scatter plot. Let’s try it for the advertising and sales data from Burrito Bell.
Burrito Bell Sales vs. Advertising
Let’s draw a scatter plot for Burrito Bell’s advertising and sales data. This will tell us if the data meets the conditions for calculating the correlation for these two variables.
The number of ads is the explanatory variable, so it goes along the X-axis. Sales, as the response variable, go on the Y-axis.
Now, we plot the data points for each restaurant location.
So, does it look like their advertising had any impact on sales? From the scatter plot, it appears that there might be a positive relationship between these two variables. To measure how strong it is, we have to calculate the correlation coefficient for the data. With quantitative variables, a roughly linear relationship (if there is a relationship at all) and no outliers, this data set qualifies for a correlation calculation.
The Correlation Coefficient
The correlation coefficient (or r) measures the strength and direction of the linear relationship between two variables on a scale of -1 to +1. Plus one indicates a perfect positive relationship. Minus one indicates a perfect negative relationship. Zero indicates no linear relationship. In real life, nearly all correlation coefficients are less than perfect. That is, they fall between 0 and 1, or between 0 and minus 1.
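As a quick check on the two ends of the scale, this short Python sketch uses invented points that lie exactly on a rising line and exactly on a falling line. The helper name pearson_r and the numbers are assumptions made for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

xs = [1, 2, 3, 4, 5]
rising = [2 * x + 1 for x in xs]     # every point on the line y = 2x + 1
falling = [10 - 3 * x for x in xs]   # every point on the line y = 10 - 3x

r_rising = pearson_r(xs, rising)
r_falling = pearson_r(xs, falling)

print(f"points on a rising line:  r = {r_rising:+.3f}")
print(f"points on a falling line: r = {r_falling:+.3f}")
```

Note that the slope of the line does not affect r. Only the direction and how tightly the points hug the line matter.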
Correlation at Burrito Bell
Let’s try this correlation calculation for the sample Burrito Bell data on commercials vs. per-restaurant sales, with number of commercials as our x variable and sales as our y variable.
- First, find the mean and standard deviation for each variable. The mean number of commercials is 3 with an SD of 1.73. The mean sales are $25,000 with an SD of about $1,900.
- Second, find the deviation of each value from its mean as shown. Notice how some of the numbers are positive and some are negative. If the x and y deviations have the same sign for most of the pairs, it suggests that you have a positive correlation.
- Third, find the product of each pair of differences as shown. Notice how when the signs are the same, the product is positive.
- Fourth, add up the products, as shown.
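- Finally, divide the sum by the number of (x, y) pairs and by the SD of x and the SD of y. (If the SDs were calculated as sample standard deviations, divide by the number of pairs minus one instead.) This gives you r, the correlation coefficient.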
The correlation coefficient of ads run and per-restaurant sales is 0.735 on a scale of -1 to +1. This means that there is a strong positive correlation between the number of ads and sales. The more ads the restaurants ran, the higher the sales tended to be.
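The five steps above can be sketched in Python as follows. The figures in this sketch are hypothetical and are NOT the Burrito Bell results, so the r it prints will not be 0.735. It also assumes population standard deviations (dividing by n), which matches dividing the sum by the number of pairs in the last step.

```python
from math import sqrt

# Hypothetical figures for illustration only; NOT the Burrito Bell data.
ads = [1, 2, 3, 4, 5]            # x: number of commercials
sales = [22, 24, 23, 27, 29]     # y: sales, in thousands of dollars
n = len(ads)

# Step 1: mean and (population) standard deviation of each variable.
mean_x = sum(ads) / n
mean_y = sum(sales) / n
sd_x = sqrt(sum((x - mean_x) ** 2 for x in ads) / n)
sd_y = sqrt(sum((y - mean_y) ** 2 for y in sales) / n)

# Step 2: deviation of each value from its mean.
dev_x = [x - mean_x for x in ads]
dev_y = [y - mean_y for y in sales]

# Step 3: product of each pair of deviations.
products = [dx * dy for dx, dy in zip(dev_x, dev_y)]

# Step 4: add up the products.
total = sum(products)

# Step 5: divide by the number of pairs and by both SDs.
r = total / (n * sd_x * sd_y)

print("deviation products:", products)
print(f"r = {r:.3f}")
```

If you compute the SDs with n - 1 instead, divide the sum by n - 1 as well. The two versions give the same r, because the divisor cancels.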
Even if your data set allows you to calculate its correlation coefficient, there are a number of caveats you need to watch out for.
- Correlation does not mean causation. For example, just because there is a strong correlation between temperature and hotel occupancy does not necessarily mean that higher temperatures CAUSE higher hotel occupancy—it’s just that they happen to be related. Likewise, we can’t be sure that the ads run by Burrito Bell restaurants caused the higher sales.
- A correlation may be confounded. That is, changes observed in your correlated variables may be influenced, not so much by one another, as by some third, hidden variable. For example, there is a strong positive correlation between shoe size and reading ability in children. Does this mean that big feet cause children to read better? Or perhaps reading early in life promotes foot growth!
It may just be that older children have larger feet, and they read better. Age is the confounding factor behind this correlation: it’s the main influence behind both of the correlated variables—shoe size and reading ability.
- When there is a random or confounded relationship between two correlated variables, the correlation is said to be spurious. Imagine, for example, an investor who finds a strong correlation between the number of goals scored by his favorite hockey team during a season and the performance of a major stock index. Or, a person may notice that the rainfall totals observed during a particular spring are highest on those days when she accidentally leaves her umbrella at home. There is clearly no reasonable relationship between the variables in these examples, but when analyzing data, spurious relationships can often be difficult to detect.
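Both the confounding and the spurious caveats can be illustrated with a small simulation. This is a sketch under invented assumptions: the formulas linking age to shoe size and reading score, the noise levels, and the helper name pearson_r are all made up for illustration.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

rng = random.Random(0)  # fixed seed so the run is repeatable

# Confounding: age drives BOTH shoe size and reading score.
# Shoe size and reading score never influence each other here.
children = []
for _ in range(2000):
    age = rng.randint(5, 12)
    shoe = age + rng.gauss(0, 1)             # invented relationship
    reading = 10 * age + rng.gauss(0, 10)    # invented relationship
    children.append((age, shoe, reading))

r_all = pearson_r([c[1] for c in children], [c[2] for c in children])

# Hold the confounder fixed by looking only at eight-year-olds.
eights = [c for c in children if c[0] == 8]
r_age8 = pearson_r([c[1] for c in eights], [c[2] for c in eights])

print(f"shoe size vs. reading, all ages:   r = {r_all:.2f}")
print(f"shoe size vs. reading, age 8 only: r = {r_age8:.2f}")

# Spurious correlation: compare one random series of 10 values with
# 1000 other, unrelated random series and keep the strongest match.
target = [rng.gauss(0, 1) for _ in range(10)]
candidates = [[rng.gauss(0, 1) for _ in range(10)] for _ in range(1000)]
best = max(abs(pearson_r(target, c)) for c in candidates)

print(f"strongest |r| among 1000 unrelated series: {best:.2f}")
```

Across all ages, shoe size and reading score are strongly correlated, but among children of the same age the correlation is close to zero. In the second part, searching through enough unrelated series turns up a strong-looking correlation purely by chance.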
Correlation can be a powerful tool for summarizing the strength of a relationship between two quantitative variables. And while you can always calculate a correlation coefficient between variables, the question is how much meaning you derive from it. It is important to consider all of your variables carefully when drawing conclusions from data relationships.