Habsburg AI: The Perils of Synthetic Data (FD #66)

Magic Lanterns, "Garbage In, Garbage Out," and how AI is going to destroy itself

AI crawlers are consuming AI-generated content. These digital bottom-feeders swallowed nearly all of the human-written content on the internet, and are now devouring new content as it’s published. But much of that new content is being written by robots. This is the swill-to-AI feedback loop that Sadowski has described in VentureBeat, 404 Media, and The Guardian. To explain for members of our studio audience not familiar with the Habsburgs…Behold! a man:

[Image: Charles II and Charles I]

The Habsburgs were a royal line so badly inbred that their chins became notably deformed. Depending on who you talked to, that was either conventional wisdom or unproven nonsense until 2019, when researchers confirmed the Habsburg Jaw (or Chin) was both a recessive-trait problem and a dominant-gene problem at the same time. So, they fucked up their genes and got screwed by their genetics simultaneously. The best thing that came out of inbred royals was this exceptionally relevant bit from Eddie Izzard:

(Editor’s Note: This shouldn’t be confused with a “Potemkin AI,” also coined by Jathan Sadowski in 2018, referring to systems sold as magical but with exploited humans at the core. Great examples of this are the Mechanical Turk of 1770 or Amazon’s “Just Walk Out” system, which debuted in 2016 and was unmasked…this month.)

As Brian Merchant explains, LLMs are a sort of “magic lantern,” a fascinating gadget popular from 1660-1700 that allowed people to use magnified candlelight to project images onto flat surfaces for the first time. These were used not to educate or instruct, but to bamboozle and terrorize. Conjurings of demons, séances, and magic shows were all the rage, as grifters tricked people by the hundreds.

[Video clip: “All Singing, All Dancing”]

In response to the Habsburg AI problem, tech startups, data scientists, and engineers-at-large have started pushing “synthetic data” as a solution to the problem of “garbage in, garbage out.” Which is to say: if your analysis uses garbage data, your findings will be worthless. And if you make up data to fill gaps in a pre-determined curve, your results will line up neatly with your desired conclusions. So, to quote Eddie Izzard, IQs down the toilet!
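Here’s a minimal sketch of that circularity (all numbers invented): take three noisy “real” observations, pad them out with “synthetic” points sampled straight from the trend line you already believe in, and watch the fit obediently snap toward the assumed conclusion.

```python
# Toy illustration of synthetic-data circularity. The trend, the
# observations, and the noise are all made up for this sketch.
import random

random.seed(0)

def assumed_trend(x):
    """The conclusion we've already decided on: y = 2x + 1."""
    return 2.0 * x + 1.0

# Three real (noisy) observations...
real = [(x, assumed_trend(x) + random.uniform(-3, 3)) for x in (1, 5, 9)]
# ...and seven "synthetic" points sampled straight from the assumed trend.
synthetic = [(x, assumed_trend(x)) for x in (2, 3, 4, 6, 7, 8, 10)]

def fit_slope(points):
    """Ordinary least-squares slope through the points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    num = sum((x - mx) * (y - my) for x, y in points)
    den = sum((x - mx) ** 2 for x, _ in points)
    return num / den

print(fit_slope(real))               # noisy: lands somewhere near 2
print(fit_slope(real + synthetic))   # dragged toward exactly 2.0
```

The synthetic points carry zero new information; they just outvote the real ones and pull the answer toward the curve they were generated from.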

There are a LOT of companies, LinkedInfluencers, video tutorials, and entrepreneurs doing PR disguised as Medium posts talking about synthetic data, how to generate it, and why it’s totally as good as the real thing. You know, human-written content, data taken from real life, interactions between organic beings. Listing all possible interaction choices, conversational pathways, emotional or personality-based variables, and individual histories is impossible. So they do what all engineers do in reality: pick a point on the curve of diminishing returns and stop working.

Natural Language Processing typically represents language and human behavior as vectors or arrays of numbers, then transforms them. Transforms them in really complex ways? Sure, but the result is still just 3 datas in a trench coat pretending to be an annoyed customer.
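To make that concrete, here’s a toy sketch (nothing like a production NLP pipeline; the vocabulary and weights are invented): a complaint becomes a count vector, and a “layer” is just a matrix-vector multiply over it.

```python
# Toy sketch: an "annoyed customer" utterance becomes a bag-of-words
# count vector, then gets one linear transform -- the same shape of
# operation an LLM layer performs, just vastly smaller. The vocabulary
# and weights are made up for illustration.

VOCAB = ["where", "is", "my", "refund", "now"]

def embed(text):
    """Turn a sentence into a count vector over our tiny vocabulary."""
    words = text.lower().split()
    return [words.count(w) for w in VOCAB]

def transform(vec, weights):
    """One linear layer: a matrix-vector product. Complex at scale,
    but still just arithmetic on arrays of numbers."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]

vec = embed("Where is my refund now now NOW")
weights = [[1, 0, 0, 2, 3],   # made-up weights
           [0, 1, 1, 0, 1]]
out = transform(vec, weights)
print(vec)   # counts per vocabulary word
print(out)   # the datas, transformed -- still wearing the trench coat
```

Stack a few hundred of these layers with billions of weights and you get something that sounds annoyed; it’s still numbers all the way down.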

[Image: “Well, he's still three kids in a trench coat, so, no.”]

Why is this bad? Because communicating through an AI chatbot produces a “halo of trust” around the human on the other end. This fascinating study published in Computers in Human Behavior describes AI as a “moral crumple zone”: when AI-mediated comms go well, the human gets the gold star of trust, and when they don’t, the human absorbs less of the blame. Why wouldn’t a company give this a shot? Which just creates more AI-influenced content to be consumed, and a gap of lies the human behind it can rest safely in.


A synthetic-data-driven LLM will never iterate all possible human interactions, personalities, sums of life experience, or even behaviors. It’s just going to do what AI systems have done since their creation: be most familiar with rich, white, American/Western European, English-speaking, liberally educated personality constructs. Will it create a passable “middle-income, elderly Chinese immigrant” consumer to churn out synthetic interaction data? Probably. Would the data it builds from 1,000 of those synthetic interviews with the elderly Chinese lady be as good as the 1,000 you “ran” with Kyle, the McKinsey consultant who loves bacon and plays pickleball? I wouldn’t bet my company on it.

LLMs are generating data to mimic real human interactions, purchasing behavior, even medical images and financial records, from crawled content that is increasingly generated by other LLMs. Which is MUCH worse than garbage in, garbage out. When you know your data is garbage, you don’t bet the farm on it. AI companies are tricking the C-suite into canning focus groups, UX research, and entire interface departments, because the execs were sold a few magic boxes that can run for a few weeks and get better info at 1/1000th of the cost.
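You can simulate the ouroboros in a few lines. In this toy sketch (my illustration, not from any cited study), a 1-D Gaussian stands in for “the distribution of human content,” and each model generation keeps only its most probable outputs before the next model trains on them: rare, tail-of-the-distribution content disappears first, and the variety collapses generation by generation.

```python
# Toy model-collapse sketch: each "generation" fits the previous
# generation's output, resamples, and keeps only the most likely 90%
# (the way samplers favor probable tokens). All parameters invented.
import random
import statistics

random.seed(1)
data = [random.gauss(0.0, 1.0) for _ in range(1000)]  # "human" data

def next_generation(samples):
    mu = statistics.fmean(samples)
    sigma = statistics.stdev(samples)
    drafts = [random.gauss(mu, sigma) for _ in range(1000)]
    # Drop the 10% of outputs farthest from the mean, i.e. the tails.
    drafts.sort(key=lambda x: abs(x - mu))
    return drafts[:900]

for gen in range(20):
    data = next_generation(data)
    print(gen, round(statistics.stdev(data), 3))
# The spread shrinks every generation: the "internet" the next model
# crawls contains less and less of the original variety.
```

Twenty generations in, the distribution is a needle where a bell curve used to be, and no amount of synthetic padding puts the tails back.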

[Image: “A Little Ouroboros” greeting card]

Y Combinator published a post listing 25 synthetic-data start-ups “to watch,” written by a founder who sold her startup to Niantic and became their Head of AR. Which is a perfect encapsulation of the problem. It scans fine if you have no familiarity with Niantic or its use of AR. Ask any Pokémon Go fan how laughable the AR in that game is. They know it’s just as bad as the output of ChatGPT when you ask it how many e’s are in the word seventeen, or what the Material Safety Data Sheet for a harmful chemical says.

Unfortunately, we don’t need to imagine how bad the logical endpoint of this could be; it’s happening today. Google Photos is being used at scale to develop target packages for bombs in Gaza. The Pentagon is evaluating AI to develop something similar to the Lavender system leaked/detailed in +972 Mag, because, to quote the story, “You don’t want to waste expensive bombs on unimportant people.” And this is just the worst possible example. LLMs are getting markedly worse on most measures, with no second internet to trawl to make the next jump in sophistication.

To stress, AI has uses! It even has uses that aren’t exploitative, criminal, or disastrous for the planet. But auto-generating some marketing copy, conjuring fish-dick images, or shaving half an hour off your professional day might not be worth the obliteration of millions of jobs and the petawatt-hours of energy it will consume.

Thankfully, I’m not alone in sounding the clarion call. Among dozens, Rob Horning examined the paradox of “trained on” and “product of” in his great piece “So-called well-meaning”:

Instead, it must be assumed that machines will be used to generate data that simulates and obscures human practices, making it harder to determine what different populations are doing or have done. But this isn’t so much a matter of human-made culture being replaced by machine-generated content. Rather machine-made content (which reify existing statistical biases as “truths”) will be used to reproduce and retrench the usual distortions in data already introduced by power imbalances and social inequities and forms of categorical discrimination.

Do something authentic with humans, whether it’s business, travel, conversation, sports, or even just enjoying each other’s company. LLMs can’t replace that. Never forget:

[Image]

Keep your head pointed at authentic humans,

tn