How Does Data Dredging Increase Productivity?
Data dredging, also known as significance chasing, significance questing, selective inference, or p-hacking, is the misuse of data analysis to find patterns in data that can be presented as statistically significant, thus dramatically increasing and understating the risk of false positives.
The core problem with many statistical tests is that the population characteristic being tested (or supposedly being tested) is generally not fully captured by the sample used in the analysis. For example, suppose you are conducting a study to determine whether certain diseases affect children differently from adults. To do so, you would have to sample, for each disease individually, the families of children who may have been diagnosed with it, as well as the families of adults who may have been diagnosed with it. The sample size needed would depend on the number of disease cases each family represents. The hope is that disease-specific profiles will then emerge from the statistical analysis.
Once the hypothesis to be tested is determined, however, how do you draw a conclusion about it? Researchers go about testing a hypothesis in two broad ways. One is to use statistics to analyze the results of previous studies and draw conclusions from those analyses. The other is to look at the real world: replicate something that has already been done, and draw conclusions from the observed facts. In both cases, it is important to remember that your conclusions may not reflect what the public wants to hear, but rather what the data actually show.
One problem with the first method is that it is very easy to produce a false positive, simply because the expected effects are small. If the results are weak, there may appear to be nothing to show for the hypothesis, and little reason to continue the research. The other problem with hypothesis testing is that it can easily become hit and miss as to which hypothesis is correct. In these circumstances, data dredging methods can help increase the chance of reaching a conclusion at all.
Data dredging involves the researcher randomly selecting a set of data points, or 'data sets,' and then examining them to see whether any patterns emerge. For example, they might randomly select a number of businesses from a telephone directory, telephone survey, or customer list. By examining the differences between these businesses and their customers, they can identify which businesses have the highest sales, which have the highest returns on investment, and which have the lowest prices. From this, they can draw statistical comparisons between the sample's characteristics and the population's to see whether there is a significant difference in the characteristics being studied.
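To see why this kind of search is risky, consider a minimal, hypothetical simulation (every name and number below is illustrative, not taken from any real study): it compares two groups of "businesses" on twenty metrics that are pure noise, so any metric flagged as significant is, by construction, a false positive. With twenty tests at the 5% level, at least one spurious hit is quite likely.

```python
# Hypothetical sketch of the multiple-comparisons problem behind data
# dredging. All names (N_METRICS, SAMPLE, etc.) are illustrative.
import math
import random

random.seed(42)

def two_sided_p(z):
    """Two-sided p-value for a standard-normal z statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

def z_test(a, b):
    """Two-sample z-test; unit variance is true here by construction."""
    n, m = len(a), len(b)
    z = (sum(a) / n - sum(b) / m) / math.sqrt(1 / n + 1 / m)
    return two_sided_p(z)

N_METRICS = 20   # number of characteristics "dredged" through
SAMPLE = 50      # businesses sampled per group

significant = []
for metric in range(N_METRICS):
    group_a = [random.gauss(0, 1) for _ in range(SAMPLE)]  # pure noise:
    group_b = [random.gauss(0, 1) for _ in range(SAMPLE)]  # no real difference
    if z_test(group_a, group_b) < 0.05:
        significant.append(metric)

print(f"{len(significant)} of {N_METRICS} metrics 'significant' at p<0.05 "
      "(all false positives by construction)")
```

Each individual test behaves correctly; it is running many of them and reporting only the hits that creates the illusion of a discovery.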
As mentioned, dredging provides an opportunity for multiple modeling, which can increase the chances of arriving at correct conclusions. In simple terms, multiple modeling is when a set of results, or models, from your dredging study are compared against each other. For instance, an investor might want to know the effect that a given asset's price will have on its profits over time, and then compare these results against the prices recorded in historical markets. The same process could be used when trying to evaluate the profitability of different operations. By using multiple models together, rather than each model separately, a more accurate answer can be achieved.
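One concrete way to read "multiple modeling" is as a simple ensemble: fit several models on the same data and combine their predictions. A minimal sketch, under assumed conditions (the price series and both forecasters below are fabricated toy examples, not from the article):

```python
# Illustrative ensemble: average two simple forecasters of an asset price.
# The data series and both models are hypothetical toy examples.

prices = [10.0, 10.4, 10.9, 11.1, 11.8, 12.0, 12.6, 13.1]  # fabricated series
xs = list(range(len(prices)))

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b

a, b = fit_line(xs[:-1], prices[:-1])          # train on all but the last point
trend_pred = a + b * xs[-1]                    # model 1: linear trend
naive_pred = prices[-2]                        # model 2: "tomorrow = today"
ensemble_pred = (trend_pred + naive_pred) / 2  # combined model

actual = prices[-1]
for name, pred in [("trend", trend_pred), ("naive", naive_pred),
                   ("ensemble", ensemble_pred)]:
    print(f"{name:8s} prediction {pred:6.2f}  abs error {abs(pred - actual):.2f}")
```

Averaging predictions is only one combination rule; the point is that the ensemble's error is bounded by its members' disagreement, which is why combined models often behave more stably than any single one.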
When multiple modeling techniques are combined, there is a greater chance that the final model will be predictive. In other words, the results from all of the techniques are derived from the same underlying assumptions about the market and are then combined into a single model that gives the best overall result. While this is not possible with data dredging alone, a model generated from the combined results is more likely to give an accurate answer: to the extent that the shared assumptions hold, the likelihood that the end result is accurate increases.
It is not difficult to see how much more accurate a model built with data dredging can be when multiple techniques are combined as well. Consider for a moment that you could use simple regression to determine the relationship between price and volume. If you used the regression to look at the price component alone, it would be little more accurate than fitting a constant. A more accurate result can only be obtained if that model is coupled with one that uses the relationship between price and volume together.
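The point about a single-variable regression being barely better than a constant can be illustrated with a hypothetical sketch (the variable names, coefficients, and data below are all invented for the example): when an outcome genuinely depends on both price and volume, a regression on price alone explains only a small slice of the variance.

```python
# Hypothetical comparison: one-variable vs. two-variable least squares.
# Ground-truth coefficients and all data are fabricated for illustration.
import random

random.seed(0)

n = 200
price = [random.uniform(1, 10) for _ in range(n)]
volume = [random.uniform(1, 10) for _ in range(n)]
# Invented ground truth: revenue depends on BOTH price and volume.
revenue = [2.0 * p + 5.0 * v + random.gauss(0, 1)
           for p, v in zip(price, volume)]

def r_squared(y, yhat):
    """Fraction of variance in y explained by the predictions yhat."""
    my = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, yhat))
    ss_tot = sum((a - my) ** 2 for a in y)
    return 1 - ss_res / ss_tot

def fit1(x, y):
    """OLS for y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    return my - b * mx, b

def fit2(x1, x2, y):
    """OLS for y = a + b1*x1 + b2*x2 via centered normal equations."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    c1 = [v - m1 for v in x1]
    c2 = [v - m2 for v in x2]
    cy = [v - my for v in y]
    s11 = sum(a * a for a in c1)
    s22 = sum(a * a for a in c2)
    s12 = sum(a * b for a, b in zip(c1, c2))
    s1y = sum(a * b for a, b in zip(c1, cy))
    s2y = sum(a * b for a, b in zip(c2, cy))
    det = s11 * s22 - s12 * s12
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return my - b1 * m1 - b2 * m2, b1, b2

a, b = fit1(price, revenue)
r2_one = r_squared(revenue, [a + b * p for p in price])
a2, b1, b2 = fit2(price, volume, revenue)
r2_two = r_squared(revenue, [a2 + b1 * p + b2 * v
                             for p, v in zip(price, volume)])
print(f"price-only R^2 = {r2_one:.2f}, price+volume R^2 = {r2_two:.2f}")
```

The single-predictor fit leaves most of the variance unexplained, while the two-predictor fit recovers nearly all of it, because the model now matches how the (invented) data were actually generated.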
Recall that data dredging, sometimes called significance hunting or p-hacking, abuses statistical testing, often with the help of computing power, to seek out suggestive patterns in large consolidated databases, dramatically raising the apparent likelihood of a relationship between a variable and its dependent variables.
The same is true when combining data dredging with models of other types. Consider the common method of correlation between variables. This is normally used in economics but can also be applied in many other industries. Two models could be constructed from the same data and then used to predict the effect on one variable's price while ignoring the other. However, this method assumes that all other factors have been controlled, and when they have not, its predictive power is greatly reduced. In that sense, it could be said that data dredging across multiple models significantly changes their apparent predictive power.
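The "all other factors controlled" caveat can be made concrete with a small hypothetical simulation (asset names and parameters are invented): two series with no direct relationship can show a strong correlation purely because both are driven by the same uncontrolled common factor.

```python
# Hypothetical sketch: a strong correlation produced entirely by an
# uncontrolled common driver, not by any direct relationship.
import random

random.seed(1)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

n = 500
market = [random.gauss(0, 1) for _ in range(n)]       # hidden common driver
asset_a = [m + random.gauss(0, 0.3) for m in market]  # neither asset
asset_b = [m + random.gauss(0, 0.3) for m in market]  # influences the other

print(f"corr(asset_a, asset_b) = {pearson(asset_a, asset_b):.2f}")
```

The correlation comes out high even though neither asset affects the other, which is exactly why a dredged-up correlation cannot be read as a causal or predictive relationship without controlling for shared factors.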