I recently saw the following post on LinkedIn:
“95% of banks in the study have created innovation labs.”
This figure seemed extremely high to me. I did a quick-and-dirty survey of three banks in my circle. Not one of them has an innovation lab.
Nevertheless, I didn’t conclude that the author of the post was lying.
That was because I couldn’t find any evidence in his post of the three techniques for lying with Big Data, or any other sleight-of-hand in his analysis.
Even without lying with Big Data, you can prove something and its opposite by changing the start date of your dataset. e.g. https://t.co/FeRbqmln8o
— GTM360 (@GTM360) May 17, 2018
I was about to reconcile myself to the 95% figure when I read the following comment by fintech thought leader Alex Jimenez:
“95% of the banks with innovation labs have innovation labs (one closed when they ran out of beer).”
Aha. Alex had a point.
I then saw the following snarky interpretation by another fintech thought leader, Ron Shevlin:
“No issue. It said 95% of the bank IN THE STUDY. Clearly, the study was a handpicked sample of banks that have innovation labs. And one bank that didn’t.”
Ron’s emphasis on “IN THE STUDY” was an epiphany. I figured that this could be a brand-new way to lie with Big Data, in addition to the three ways I’d covered in How To Lie With Big Data.
Let me call it:
#4. Pixie Dust Sampling
In this method, you compile a sample that largely comprises people who already support your hypothesis. To convey the impression that it’s truly representative of the population, you sprinkle a few truly random members in the sample. Obviously, when you run your survey on this sample, your results will confirm your hypothesis by a wide margin.
In a properly conducted survey to assess the percentage of banks with innovation labs, you’d draw a random sample from the population of all banks and ask each bank whether it has an innovation lab. I’d expect such a survey to reveal that 15-20% of banks have innovation labs (for reference, the corresponding figure in my personal survey was 0%).
However, when you apply the pixie dust sampling method to the same survey, you’d Google for “banks with innovation labs” and scrape all the results into your sample. You’d then ask each bank whether it has an innovation lab. Obviously, nearly all of them would say yes. In actual practice, a couple of banks might have shuttered their innovation labs because they – ahem – ran out of beer. Besides, Google is not infallible, so its search results might contain entries for a few banks that don’t have an innovation lab. Ergo, 95% – not 100% – of the banks in your survey would say yes.
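To make the mechanics concrete, here’s a minimal Python sketch of the two sampling approaches. Every number in it – a 1,000-bank population, an 18% true lab rate, a 95/5 sample split – is an assumption I’ve made up for illustration, not data from the LinkedIn study:

```python
import random

random.seed(42)  # fixed seed so the illustration is reproducible

# Hypothetical population: 1,000 banks, roughly 18% of which have an
# innovation lab (per the 15-20% estimate above; this is an assumption).
population = [random.random() < 0.18 for _ in range(1000)]

# Properly conducted survey: a simple random sample of 100 banks.
random_sample = random.sample(population, 100)
print(f"Random sample:     {sum(random_sample) / len(random_sample):.0%} have labs")

# Pixie dust sampling: start from banks already known to have labs (say,
# scraped from search results for "banks with innovation labs"), then
# sprinkle in a handful of random banks for the appearance of rigour.
banks_with_labs = [bank for bank in population if bank]
pixie_sample = random.sample(banks_with_labs, 95) + random.sample(population, 5)
print(f"Pixie dust sample: {sum(pixie_sample) / len(pixie_sample):.0%} have labs")
```

The random sample hovers around the assumed 18% base rate; the pixie dust sample reports something in the mid-to-high nineties, purely because of how the sample was compiled.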
Pixie dust sampling differs from confirmation bias in that, in pixie dust sampling, you consciously select survey participants who support your hypothesis, whereas, in the case of confirmation bias, such a selection happens unwittingly.
This LinkedIn post brought back memories of a fintech whose Buy Now Pay Later (BNPL) product I’d trialed a while ago.
A few months later, I got a call from a market research agency asking me whether I’d heard of this deferred payment product.
I said yes.
The caller said thanks and hung up without asking any further questions.
I haven’t heard about this fintech since.
Let me connect the dots to speculate what happened behind the scenes at the fintech:
- The fintech wanted to gauge its brand awareness and appointed a market research (MR) agency to carry out a survey
- The agency asked for big bucks to compile a statistically significant sample
- The fintech balked at the cost
- In the true spirit of partnership, the MR agency said: okay, you give us a list of people to poll and we won’t charge you anything for compiling the sample
- The fintech agreed to reciprocate the agency’s overture. After scrounging through all its hard disks, cloud storage, and USB sticks, the fintech could come up with only one list in good enough shape: its existing customer list. It handed the list over to the agency
- Priding itself on its bias for action – rather than talk – the fintech neglected to mention to the agency that the list comprised existing customers
- The agency conducted the survey on this pixie dust sample
- In short, the survey asked existing customers of the BNPL product if they’d heard of the BNPL product.
Most people – like me – said yes. A few people might have forgotten that they’d used the product sometime in the past and said no. Ergo the survey found that “95% of people have heard about the fintech’s Buy Now Pay Later product”, not 100%.
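The same arithmetic can be sketched in Python. Again, every number here is a hypothetical assumption – a 5% true awareness rate in the general population and a 5% “forgot I ever used it” rate among customers – not anything I know about the actual fintech:

```python
import random

random.seed(7)  # fixed seed so the illustration is reproducible

TRUE_AWARENESS = 0.05  # assumed awareness in the general population
FORGET_RATE = 0.05     # assumed share of customers who forgot the product

def pct_aware(answers):
    """Share of respondents who answered 'yes, I've heard of it'."""
    return sum(answers) / len(answers)

# Proper sample: 500 consumers drawn at random from the general population.
general_public = [random.random() < TRUE_AWARENESS for _ in range(500)]

# Pixie dust sample: 500 existing customers; all have heard of the product
# except the forgetful few.
existing_customers = [random.random() > FORGET_RATE for _ in range(500)]

print(f"General public:     {pct_aware(general_public):.0%} aware")
print(f"Existing customers: {pct_aware(existing_customers):.0%} aware")
```

The second number is the one the fintech took to its investors; the first is the one that actually mattered for deciding whether to keep spending on brand-building.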
The fintech was delighted to hear that its BNPL product enjoyed near-100% brand awareness. Its founders reported this figure to their VC investors. Since VCs pride themselves on having short attention spans, they couldn’t be caught quizzing the fintech’s founders about the survey methodology or sample composition. In the true spirit of “move fast and break everything”, they concluded that the fintech’s brand awareness was very high and decided to cut off funds for any more brand-building activities. The founders stopped all marketing campaigns.
Unsurprisingly, the fintech disappeared from the market.
Practitioners of the three techniques described in my previous blog post How To Lie With Big Data can achieve some gains in the short term.
However, anybody who uses the fourth tactic described in this post will commit harakiri (aka seppuku aka suicide) by lying with Big Data. As the aforementioned BNPL fintech did – unwittingly or otherwise.
I don’t know the current status of the said BNPL fintech, and my conjecture about the behind-the-scenes events there may not be wholly accurate.
But, notwithstanding any of that, the purpose of this post is to drive home the possibility that lying with Big Data can sometimes kill the liar.