Data scientists don’t always grasp the core statistical concept of sampling and its superpowers in the workplace
Some years ago, I had the chance to code-review a data pipeline in an effort to reduce its 23-hour running time. The pipeline was complex, creating a sophisticated new set of treated and control users for comparison on every run, but it wasn’t hard to find some room for optimisation.
“What is this function doing?” I asked two of the data scientists who developed part of the code during a call. “I’m not entirely sure why we need this extensive loop”.
“Well,” one of them replied, “since the share of users treated is way smaller than the control pool, sampling might take only control users, so this function validates that we have both treated and control users and, in case we don’t, it repeats the sampling process”.
I digested that sentence for a moment and then I asked “Why?” Why on earth would you implement a sampling process that might return a useless sample? Why then write dozens of lines of code to validate it? Why not simply use the correct sampling scheme for the task – one that always works as intended? There was an awkward silence, and only then did I realise that this was all news to them. All their sampling experience was encapsulated in the use of the sample function in Spark and scikit-learn’s implementation of train_test_split. Apparently, simple ideas like stratified sampling were foreign to the team. And no, this wasn’t a small-scale project by an amateur team – the two data experts were working for a really big company (actually, one that belongs to the so-called ‘magnificent 7’ group of tech giants). This made me realise that the concept of sampling, its implications and its superpowers remain largely obscured in the data science literature.
Sampling is a useful concept well beyond modern data science. Market research and opinion polling would not be possible without George Gallup’s use of statistical sampling techniques to match electorate demographics and predict election results back in the 1930s [1]. But even beyond polls, sampling takes a prominent position in governmental duties and rigorous studies. Most of the “hard” metrics that markets and investors follow are the results of survey estimates based on complex sampling procedures. The unemployment rate, for example, is estimated periodically in the US with the Current Population Survey, which samples about 60,000 eligible households to collect information on the labour force, employment and unemployment status of their members. The same process is replicated by official statistical agencies worldwide. Beyond economics, sampling has been used to understand critical phenomena such as the Covid-19 pandemic. In the UK, the Coronavirus (Covid-19) Infection Survey (CIS) was established in April 2020 to provide reliable estimates of the state of the pandemic, including the number of Covid-19 infections with or without symptoms.
Being such a powerful technique, why has sampling not been more formally adopted and used in day-to-day data tasks? Possibly because no one has taken the time to relate such sampling procedures to everyday data science work. But fear not, readers: I’m here to fill that gap, with four use cases in which a bit of knowledge of sampling can become a life-saver.
The actual problem behind data drift
Just understanding what sampling is, and how it can affect you, can be very beneficial for data analytics workflows. Sampling, in its broadest sense, is the selection of a subset or sample of individuals from within a statistical population in order to estimate characteristics of the whole. The subset is meant to reflect the whole population, and as statisticians we try to collect samples that are as representative as possible of the population of interest: that way, inferences are trustworthy. One common misconception among data practitioners is the idea that sampling is not even necessary: “We use no sampling. We have all the data”. Well, I’m sorry to tell you this, but you never have all the population data, and you never will. No matter how big your table, you always have a sample. You are rarely interested only in yesterday’s behaviour: you are usually trying to make inferences about present and future patterns, so trusting your current sample too much will likely affect the outcome of your analysis.
Of course, you are not going to read in O’Reilly books about how your non-probabilistic sample might derail your results, oh god no. Because in the 21st century we like to reinvent the wheel and give new names to century-old ideas. So, you’ve probably heard about this problem by its new nickname: “data drift”. The issue is a consequence of the non-random sampling process that generates your data, which is probably not representative of the whole population of future user behaviour, or future transactions, or future whatever. “Data drift” only means that your current sample is no longer representative of the overall population, so it is simply a sampling problem. Seen in this light, it is easier to suggest solutions, because non-representative samples happen all the time and require some action on the analyst’s part. You can take a different sampling approach, perhaps considering only certain time frames, or you can weight your current data, perhaps weighting recent behaviour more heavily or giving more prominence in the analysis to users who behave like more recent ones. So, if data drift is a problem, it simply means you must revisit what data you use and how you use it.
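To make the weighting idea a little more concrete, here is a minimal sketch, assuming a pandas DataFrame with hypothetical 'event_date' and 'spend' columns: older observations are exponentially down-weighted, so the estimate leans towards recent behaviour. The 30-day half-life is an illustrative choice, not a recommendation.

```python
# Sketch, not a prescription: exponentially down-weight older rows when
# estimating a metric, so the estimate reflects recent behaviour more than
# stale data. Column names and the half-life are illustrative.
import numpy as np
import pandas as pd

def recency_weighted_mean(df: pd.DataFrame, value_col: str = "spend",
                          date_col: str = "event_date",
                          half_life_days: float = 30.0) -> float:
    age_days = (df[date_col].max() - df[date_col]).dt.days
    weights = 0.5 ** (age_days / half_life_days)  # weight halves every half-life
    return float(np.average(df[value_col], weights=weights))
```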
How useful is my A/B test?
Consider A/B testing as another example. Is the two-week sample you are using representative of full user behaviour? Do you get to see every pattern and relevant case during that period? In practice, you might not even get two weeks, so before making decisions it is important to be sure that the users you got data from behave the way you would expect your full population to behave. This only gets worse when you consider other sources of bias that can affect your sampling. It is not uncommon to see errors in the sampling process used by A/B testing software, either because of flaws in the system or because of mistakes made by users when setting up experiments. Again, weighting or further analysis might be necessary to ensure you are representing the population of all (current and future!) individuals.
I’m not the first person to notice this. The whole idea of A/A testing came from the need to audit not only the sampling but the overall process of A/B tests. It is not uncommon to see issues in the treatment/control assignment originating from unfair splits, which is a fancy way of saying that the sample produced by the experiment was skewed in an unplanned way.
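One cheap audit in that spirit is a sample ratio mismatch check: compare the observed group counts with the split you planned using a chi-square goodness-of-fit test. The sketch below uses made-up counts and a nominal 50/50 split purely for illustration.

```python
# Sketch: a sample ratio mismatch (SRM) check for an A/B split that was
# planned as 50/50. A very small p-value suggests the assignment is skewed.
from scipy.stats import chisquare

observed = [50_412, 48_915]        # users who landed in A and B (illustrative numbers)
planned_share = [0.5, 0.5]         # the split you configured
total = sum(observed)
expected = [share * total for share in planned_share]

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"chi-square = {stat:.1f}, p = {p_value:.4f}")
if p_value < 0.001:
    print("Warning: the observed split is unlikely under the planned ratio.")
```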
Strata are your friends!
Understanding sampling schemes will help you improve your analysis and, if done right, reduce variance. Most sample functions rely on simple random sampling, a scheme in which every possible sample of size n has the same probability of occurring. Though this has several advantages, it can also produce undesirable results from time to time, as my two colleagues at the beginning of the article experienced. It is therefore worth considering other approaches, such as stratified sampling. Basically, stratified samples are made by separating the population into k homogeneous, non-overlapping strata, or segments of individuals. You then take k independent simple random samples, one within each stratum, to create your final sample.
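In practice this takes very little code. Here is a minimal sketch of a proportional stratified sample in pandas, where the hypothetical 'region' column is just a stand-in for whatever strata make sense for your data:

```python
# Sketch: proportional stratified sampling with pandas.
# Each stratum (here, a hypothetical 'region' column) contributes an
# independent simple random sample of the same fraction of its members.
import pandas as pd

def stratified_sample(df: pd.DataFrame, stratum_col: str = "region",
                      frac: float = 0.1, seed: int = 7) -> pd.DataFrame:
    return (
        df.groupby(stratum_col, group_keys=False)
          .sample(frac=frac, random_state=seed)
    )
```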
Some caveats to this. Your usual estimator is no longer unbiased, so you can’t just take the overall mean of your whole sample and call it a day. Now you need to estimate your statistic of interest within each stratum and pool the results appropriately. This is easier than it sounds, as sampling weights speed up the process: they are the basis of complex survey design-based estimation even today [2]. But the added complication also brings great rewards: appropriately handled strata usually produce lower-variance estimates. And if you rely on survey data for your work, it is almost certain that you will face stratified and complex survey schemes, where accounting for this structure in your models will have a sizeable impact on your results.
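As a small illustration of the pooling step (with made-up numbers, and assuming you know each stratum’s share of the population), the stratified estimate of the mean is just a weighted average of the per-stratum means:

```python
# Sketch: pooling per-stratum means into a single stratified estimate.
# The shares are each stratum's share of the *population*, not of the sample;
# all numbers below are purely illustrative.
import numpy as np

stratum_means  = np.array([12.4, 30.1, 8.7])   # estimated mean within each stratum
stratum_shares = np.array([0.60, 0.25, 0.15])  # population shares, summing to 1

stratified_mean = np.sum(stratum_shares * stratum_means)
print(f"Stratified estimate of the overall mean: {stratified_mean:.2f}")
```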
Hack your way to speedy runs!
And finally, this is one of my favourite uses of sampling in your day-to-day work: sampling is a useful way to speed up analysis. Nowadays, boosting algorithms such as XGBoost, LightGBM or CatBoost are common for many prediction applications, but these models can take some time to train, particularly with big and complex datasets or feature sets. And when you are testing the waters and playing around with transformations and feature experimentation, waiting 2 minutes per run is impractical. So, even if you have “all the data in the world”, feel free to take a sample. Make sure you use a stratified sampling scheme that preserves the distribution of key data characteristics, and you can then generate a bite-sized sample to test model strategies at a faster pace, with lower running times for each trial. Once you are comfortable with a final version of the model, fit it to the whole dataset.
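A rough sketch of that workflow, assuming a pandas DataFrame with a hypothetical binary 'churned' target and XGBoost standing in for whichever booster you prefer: prototype on a stratified 10% subsample, then refit the chosen configuration on the full data.

```python
# Sketch: prototype on a small stratified subsample, then refit on everything.
# The 'churned' target (assumed 0/1), the 10% fraction and the hyperparameters
# are illustrative choices, not recommendations.
import pandas as pd
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

def quick_prototype(df: pd.DataFrame, target: str = "churned", frac: float = 0.10):
    # Stratify on the target so the subsample keeps the class balance.
    small, _ = train_test_split(df, train_size=frac,
                                stratify=df[target], random_state=0)
    X_small, y_small = small.drop(columns=[target]), small[target]
    model = XGBClassifier(n_estimators=200, max_depth=6)
    model.fit(X_small, y_small)  # fast iteration happens here
    return model

def final_fit(df: pd.DataFrame, target: str = "churned"):
    # Once features and hyperparameters are settled, use all the data.
    X, y = df.drop(columns=[target]), df[target]
    model = XGBClassifier(n_estimators=200, max_depth=6)
    model.fit(X, y)
    return model
```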
You can extend this approach to take advantage of another perk of sampling: variability estimation. This is the intuition behind bootstrapping and other repeated-sample estimation procedures, and you can apply it to your big dataset to get a better understanding of your results. If some odd insights pop up, test them across different samples to see how robust they are. Is pricing not such a powerful predictor? Is that because we had bad luck with our dataset, or is there a more fundamental issue? Take a dozen different samples and fit the same model or analysis to each. Results will vary, but how much, and in what way, can tell you a lot about your research. Is feature importance the same in each sample? Results that hold up through different iterations are more robust and interesting than those that do not, so this strategy can give you more insight into the actual relationships you are studying.
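As a sketch of what that might look like (again with a hypothetical 'churned' target and illustrative settings), you can refit the same model on a dozen different samples and inspect how stable the feature importances are:

```python
# Sketch: refit the same model on several different samples and summarise
# how stable the feature importances are. Everything here is illustrative.
import pandas as pd
from xgboost import XGBClassifier

def importance_across_samples(df: pd.DataFrame, target: str = "churned",
                              n_samples: int = 12, frac: float = 0.10) -> pd.DataFrame:
    rows = []
    for seed in range(n_samples):
        sample = df.sample(frac=frac, random_state=seed)
        X, y = sample.drop(columns=[target]), sample[target]
        model = XGBClassifier(n_estimators=200, max_depth=6)
        model.fit(X, y)
        rows.append(pd.Series(model.feature_importances_, index=X.columns))
    importances = pd.DataFrame(rows)
    # Features with a high mean and low spread across samples are the ones to trust.
    return importances.agg(["mean", "std"]).T.sort_values("mean", ascending=False)
```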
Honestly, I have saved myself many headaches and work hours with some of these tips. The best thing is that they are not even that hard to implement: plenty of functions in Python and R cover most of this, and once you start applying them to your daily workflows you will find they are not hard to explain either. I’ve always thought of sampling as one of the core statistical ideas that data scientists should learn to up their game. Hopefully, you now agree.
References
[1] Harkness, T. (2021) The History of the Data Economy. Significance, 18(2), 12–15. doi.org/10.1111/1740-9713.01504
[2] Kish, L. (1965) Survey Sampling. New York: John Wiley & Sons. doi.org/10.1002/bimj.19680100122
Carlos Grajales is a statistician and business analytics consultant based in Mexico. He is also a member of the Significance editorial board.
