How a ragtag band of internet friends became the best at forecasting world events



What the Samotsvety group can teach us about predicting the future.

The question before a group made up of some of the best forecasters of world events: What are the odds that China will control at least half of Taiwan’s territory by 2030?

Everyone on the chat gives their answer, and in each case it’s a number. Chinmay Ingalagavi, an economics fellow at Yale, says 8 percent. Nuño Sempere, the 25-year-old Spanish independent researcher and consultant leading our session, agrees. Greg Justice, an MBA student at the University of Chicago, pegs it at 17 percent. Lisa Murillo, who holds a PhD in neuroscience, says 15-20 percent. One member of the group, who asked not to be named in this context because they have family in China who could be targeted by the government there, posits the highest figure: 24 percent.

Sempere asks me for my number. Based on a quick analysis of past military clashes between the two countries, I'd come up with 5 percent. That might not seem too far from the others, but in this context it feels embarrassingly low. Why am I so out of step?

This is a meeting of Samotsvety. The name comes from a 50-year-old Soviet rock band — more on that later — but the modern Samotsvety specializes in predicting the future. And they are very, very good at it. At Infer, a major forecasting platform at the University of Maryland, the four most accurate forecasters in the site’s history are all members of Samotsvety, and the gap between them and fifth place is bigger than the gap between fifth and 10th places. They’re waaaaay out ahead.

While Samotsvety members converse on Slack regularly, the Saturday meetings are the heart of the group, and I was sitting in to get a sense of why, exactly, the group was so good. What were these folks doing differently that made them able to see the future when the rest of us can’t?

I knew a bit about forecasting going into the meeting. I’ve written about it; I’ve read Superforecasting, the bestseller by Philip Tetlock and Dan Gardner describing the research behind forecasting. The whole Future Perfect team here at Vox puts together predictions at the start of each year, hoping not just to lay down markers on how we think the next year will go, but to get better at forecasting in the process.

Part of the appeal of forecasting is not just that it seems to work, but that you don’t seem to need specialized expertise to succeed at it. The aggregated opinions of non-experts doing forecasting have proven to be a better guide to the future than the aggregated opinions of experts. One frequently cited study found that accurate forecasters’ predictions of geopolitical events, when aggregated using standard scientific methods, were more accurate than the forecasts of members of the US intelligence community who answered the same questions in a confidential prediction market. This was true even though the latter had access to classified intelligence.

But I felt a bit stuck. After years of doing my annual predictions, I didn’t sense they were improving much at all, but I wasn’t predicting enough things to tell for sure. Events kept happening that I didn’t see coming, like the Gaza war in recent months or the Wagner mutiny a few months before that. I wanted to hang out with Samotsvety for a bit because they were the best of the best, and thus a good crew to learn from.

They count among their fans Jason Matheny, now CEO of the RAND Corporation, a think tank that’s long worked on developing better predictive methods. Before he was at RAND, Matheny funded foundational work on forecasting as an official at the Intelligence Advanced Research Projects Activity (IARPA), a government organization that invests in technologies that might help the US intelligence community. “I’ve admired their work,” Matheny said of Samotsvety. “Not only their impressive accuracy, but also their commitment to scoring their own accuracy” — meaning they grade themselves so they can know when they fail and need to do better. That, he said, “is really rare institutionally.”

What I discovered was that Samotsvety’s record of success wasn’t because its members knew things others didn’t. The factors its members brought up that Saturday to explain their probabilities sounded like the points you’d hear at a think tank event or an academic lecture on China-Taiwan relations. The anonymous member emphasized how ideologically important capturing the island was to Xi Jinping, and how few political constraints he faces. Greg Justice countered that the CCP has depended on economic growth that a war would jeopardize. Murillo put a higher probability on an attack because she projects that the US will be less likely to defend Taiwan once the island’s near-monopoly on chip production wanes as other nations invest in fabrication plants.

But if the factors being listed reminded me of a normal think tank discussion, the numbers being raised didn’t. Near the end of the session, I asked: If some of you think there are such strong reasons for China to capture Taiwan, why is the highest figure anyone has proposed 24 percent, meaning even the most bullish member thinks there’s roughly a 75 percent chance it won’t happen? Why does no one here think Chinese control by 2030 is more likely than not?

The team had an answer, and it’s an answer that goes some way toward explaining why this group has managed to get so good at predicting the future.

The story of Samotsvety

The name Samotsvety, co-founder Misha Yagudin says, is a multifaceted pun. “It’s Russian for semi-precious stones, or more directly ‘self-lighting/coloring’ stones,” he writes in an email. “It’s a few puns on what forecasting might be: finding nuggets of good info; even if we are not diamonds, together in aggregate we are great; self-lighting is kinda about shedding light on the future.”

It began because he and Nuño Sempere needed a name for a Slack they started around 2020 on which they and friends could shoot the shit about forecasting. The two met at a summer fellowship at Oxford’s Future of Humanity Institute, a hotbed of the rationalist subculture where forecasting is a favored activity. Before long, they were competing together in forecasting tournaments on platforms like Infer and Good Judgment Open.

The latter site is part of the Good Judgment Project, led by Penn psychologists Philip Tetlock and Barbara Mellers. Those researchers have studied the process of forecasting intensely in recent decades. One of their main findings is that forecasting ability is not evenly distributed. Some people are consistently much better at it than others, and strong past performance indicates better predictions going forward. These high performers are known as “superforecasters,” a term Tetlock and Gardner would later borrow for their book.

Superforecaster® is now a registered trademark of Good Judgment, and not every member of Samotsvety has been through that exact process, although more than half of them (8 of 15) have. I won’t call the group as a whole “superforecasters” here for fear of stealing superforecaster valor. But their team’s track record is strong.

A common measure of forecasting ability is the relative Brier score, a number that aggregates the results of every prediction for which an outcome is now known and then compares each forecaster to the median forecaster. A score of 0 means you’re average; a positive score means worse than average, while a negative score means better than average. In 2021, the last full year Samotsvety participated, their score in the Infer tournament was -2.43, compared to -1.039 for the next-best team. They were more than twice as good as the nearest competition.
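
(For the numerically inclined, here is a rough sketch in Python of what such a scoring scheme can look like. It assumes, as a simplification, that “relative” just means subtracting the median forecaster’s Brier score on each resolved question; the platform’s actual formula may differ in its details.)

```python
# A rough sketch of relative Brier scoring for binary questions.
# Simplifying assumptions: each forecast is a probability that the event
# happens, the Brier score is the squared error against the 0/1 outcome,
# and "relative" means your score minus the median forecaster's score on
# that question, averaged over all resolved questions.
from statistics import median

def brier(prob: float, outcome: int) -> float:
    """Squared error of a probability forecast against a 0/1 outcome."""
    return (prob - outcome) ** 2

def relative_brier(my_probs, all_probs, outcomes) -> float:
    """Average of (my Brier - median Brier) across resolved questions.
    Negative means better than the median forecaster."""
    diffs = []
    for mine, crowd, outcome in zip(my_probs, all_probs, outcomes):
        med = median(brier(p, outcome) for p in crowd)
        diffs.append(brier(mine, outcome) - med)
    return sum(diffs) / len(diffs)

# Toy example: two resolved questions, three other forecasters per question.
print(relative_brier(
    my_probs=[0.9, 0.2],
    all_probs=[[0.5, 0.6, 0.7], [0.5, 0.4, 0.6]],
    outcomes=[1, 0],
))  # negative output: better than the median on these toy numbers
```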

“If the point of forecasting tournaments is to figure out who you can trust,” the writer Scott Alexander once quipped, “the science has spoken, and the answer is ‘these guys.’”

So, why these guys? Part of the answer is selection. Members’ stories of how they joined Samotsvety were usually some variation of: I started forecasting, I turned out to be pretty good at it, and the group noticed me. It’s a bit like how a youth soccer prodigy might eventually find themselves on Manchester City.

Molly Hickman came to forecasting by way of the government. Taking a contracting job out of college, she was assigned to IARPA, the intelligence research agency where Jason Matheny and others were running forecasting tournaments. The idea intrigued her, and when she went back to grad school for computer science, she signed up at Infer to try forecasting herself. She put together a team with her dad and some friends, and while the team as a whole didn’t do great, she did amazing. The Samotsvety group saw her scores and invited her to join.

Eli Lifland, a 2020 economics and computer science grad at UVA now attempting to forecast AI progress, got his start predicting Covid-19. 2020 was in some ways a banner year for forecasting: Superforecasters were predicting that Covid would reach hundreds of thousands of cases in February of that year, a time when government officials were still calling the risk “minuscule.” Users of the forecasting platform Metaculus outperformed a panel of epidemiologists when predicting case numbers. Even in that company, Lifland did unusually well. The fast-moving nature of the pandemic made it possible to learn fast: you could predict cases on a near-weekly basis and quickly see what you got right or wrong. Before long, Misha and Nuño from Samotsvety came calling.

But “select people already good at forecasting” doesn’t explain why Samotsvety is so good. What made these forecasters good enough to win Samotsvety’s attention? What are these people, specifically, doing differently that makes their predictions better than almost everyone else’s?

The habits of highly effective forecasters

The literature on superforecasting, from Tetlock, Mellers, and others, finds some commonalities between good predictors. One is a tendency to think in numbers. Quantitative reasoning sharpens thinking in this context. “Somewhat likely,” “pretty unlikely,” “I’d be surprised.” These kinds of phrases, on their own, convey some useful information about someone’s confidence in a prediction, but they’re impossible to compare to each other — is “pretty unlikely” more or less doubtful than “I’d be surprised”? Numbers, by contrast, are easy to compare, and they provide a means of accountability. Unsurprisingly, many great forecasters, in Samotsvety and elsewhere, have backgrounds in computer science, economics, math, and other quantitative disciplines.

Hickman recalls telling her coworkers in intelligence that she was working on forecasting and being frustrated by their skeptical responses: that it’s impossible to put numbers on such things, that the true probabilities are inherently unknowable. Of course, the true probabilities aren’t known, but that isn’t the point. Even if they weren’t using numbers, her peers were “actually doing these calculations implicitly all the time,” she recalls.

You might not tell yourself “the odds of China invading Taiwan this year are 10 percent,” but how much time a deputy assistant secretary of defense spends studying, say, Taiwan’s naval strategy is probably a reflection of their concept of the underlying probability. They wouldn’t spend any time if their probability was 0.1 percent; they would be losing their mind if their probability was 90 percent. In reality, it’s somewhere in between. They’re just not making that assessment explicit or putting it in a form that makes it possible to judge their accuracy and from which they can learn in the future. Numeric predictions can be graded; they let you know when you’re wrong and how wrong you are. That’s exactly why they’re so scary to make.

That leads to another commonality: practice. Forecasting is a lot like any other skill — you get better with practice — so good forecasters forecast a lot, and that in turn makes them better at it. They also update their forecasts a lot. The Taiwan numbers I heard from the team at the start of our meeting? They weren’t the same by the end. Part of practicing is adjusting and tweaking constantly.

But not everyone who practices, and uses numbers to do so, succeeds. In Superforecasting, Tetlock and Gardner come up with an array of “commandments” to help us mere mortals do better, but I often find myself struggling to implement them. One is “strike the right balance between under- and overreacting to evidence”; another is “strike the right balance between under- and overconfidence.” Great, I will simply strike correct balances in all things. I will become Ty Cobb by always striking the right balance between swinging too early and swinging too late.

However, another commandment — to pay attention to “base rates” — came up a lot when talking to the Samotsvety team. In forecasting lingo, a “base rate” is the rate at which some event tends to happen. If I want to project the odds that the New York Yankees win the World Series, I might note that out of 119 World Series to date, the Yankees have won 27, for a base rate of 22.7 percent. If I knew nothing else about baseball, that would incline me to give the Yankees better odds than any other team to win the next World Series.

Of course, you’d be a fool to depend on that alone — in baseball, you have a lot more information than base rates to go on, like stats on every player, years of modeling telling you which stats are most predictive of team performance, etc. But when projecting other kinds of events where far less data exists, you often don’t have any more to go on than the base rate.
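
(To make the mechanics concrete, here is the Yankees arithmetic from above as a few lines of Python. It’s only the naive starting point, not a serious forecast.)

```python
# Base rate as a naive starting prior: the Yankees have won 27 of the
# 119 World Series played to date.
yankees_wins, total_series = 27, 119
base_rate = yankees_wins / total_series
print(f"{base_rate:.1%}")  # ~22.7%, a starting point before adjusting
# for anything you actually know about this season's teams.
```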

This was the whole explanation, it turns out, for why everyone in the group put a relatively low probability on the odds of a successful Chinese attempt to retake Taiwan by 2030. Members argued over just how strong the reasons for China to attempt such an effort were, but there was broad agreement that the base rate of war — between China and Taiwan or just between countries in general — is not very high. “I think that’s why we were all so far below 50 percent, because we were all starting really low,” Justice explained when I asked.

That kind of attention to base rates can be surprisingly powerful. Among other things, it gives you a starting point for questions that might seem otherwise intractable. Say you wanted to predict whether India will go into a recession next year. Counting up the number of years since independence in which India has had a recession, and turning that count into a probability, is a simple way to begin a guess without requiring huge amounts of research. One of my first successful predictions was that neither India nor China would go into a recession in 2019. I got it right not because I’m an expert on either, but because I paid attention to the base rates.

But there’s more to successful forecasting than just base rates. For one thing, knowing what base rate to use is itself a bit of an art. Going into the China/Taiwan discussion, I counted that there have been four lethal exchanges between China and Taiwan since the end of the Chinese Civil War in 1949. That’s four incidents over 75 years, implying roughly a 5 percent chance of a lethal exchange in any given year. There are six years between now and 2030, so compounding that annual chance over six years gave me a 26.5 percent chance that there’d be a lethal exchange in at least one of them. After adjusting down for the odds that the exchange is just a skirmish versus a full invasion, and compensating for the chances that Taiwan beats China, I got my 5 percent number.
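
(The compounding step, for the curious, is the standard “at least once in n years” calculation; here’s a quick Python sketch of how 5 percent a year becomes 26.5 percent over six years. The further adjustments down to my final 5 percent aren’t spelled out here.)

```python
# Compounding a ~5% annual chance of a lethal China-Taiwan exchange over
# the six years through 2030: P(at least one) = 1 - (1 - p)^n.
annual_p = 0.05          # ~4 incidents / 75 years, rounded
years = 6
at_least_one = 1 - (1 - annual_p) ** years
print(f"{at_least_one:.1%}")  # ~26.5%
# The adjustments that brought my number down to 5% (skirmish vs. full
# invasion, the chance Taiwan repels an attack) are omitted here.
```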

But in our discussion, the participants brought up all kinds of other base rates I hadn’t thought of. Sempere alone brought up three. One was the rate at which provinces claimed by China (like Hong Kong, Macau, and Tibet) have eventually been absorbed, peacefully or by force; another was how often control of Taiwan has changed over the last few hundred years (twice; once when Japan took over from the Qing Empire in 1895 and once when the Chinese Nationalists did in 1945); the third base rate used Laplace’s rule. Laplace’s rule states that the probability of something that hasn’t happened before happening is 1 divided by N+2, where N is the number of times it hasn’t happened in the past. So the odds of the People’s Republic of China invading Taiwan this year is 1 divided by 75 (the number of years since 1949 when this has not happened) plus 2, or 1/77, or 1.3 percent.
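
(Laplace’s rule is simple enough to fit in a few lines of Python; here is a minimal sketch using the numbers above.)

```python
# Laplace's rule of succession with zero prior occurrences:
# P(happens next period) = 1 / (N + 2), where N is the number of
# periods in which it hasn't happened.
def laplace_rule(non_occurrences: int) -> float:
    return 1 / (non_occurrences + 2)

print(f"{laplace_rule(75):.1%}")  # 75 years since 1949 -> 1/77, ~1.3%
```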

Sempere averaged his three base rates to get his initial prediction: 8 percent. Is that the best method? Should he have added even more? How should he have adjusted his guess after our discussion? (He nudged up to 12 percent.) There’s no firm rule about these questions. It’s ultimately something that can only be judged by your track record.

What if knowing the future is knowing the world?

Justice, the MBA student, tells me that quantitative skill is one reason why the Samotsvety crew is so good at prediction. Another reason is more abstract, maybe even grandiose: that as you forecast, you develop “a better model of the world … you start to see patterns in how the world works, and then that makes you better at forecasting.”

“It’s helpful to think of learning forecasting as having two steps,” he wrote in a follow-up email to me. “The first (and most important) step is the recognition that the future and past will look mostly the same. The second step is isolating that small bundle of cases where the two are different.” And it’s in that second step that developing a clear model of how the world works, and being willing to update that model frequently, is most helpful.

A lot of Justice’s “updates” to his world model have been toward assuming more continuity. In recent years, he says, he learned a lot from facts like, “Putin didn’t die of cancer, use nukes, or get removed from office; bird flu didn’t jump to and spread among humans (so far); Viktor Orban (very recently) dropped his objection to Ukraine aid.” What these have in common is “they’re predominantly about major events that didn’t happen, implying the future will look a lot like the past.”

The hardest part of the job is predicting those rare exceptions where everything changes. Samotsvety’s big coming-out party happened in early 2022 when they published an estimate of the odds that London would be hit by nuclear weapons as a result of the Ukraine conflict. Their estimated odds of a reasonably prepared Londoner dying from a nuclear warhead in the next month were 0.00241 percent: very, very low, all things considered. The prediction got some press attention and earned rejoinders from nuclear experts like Peter Scoblic, who argued it significantly understated the risk of a nuclear exchange. It was a big moment for the group — but also an example of a prediction that’s very, very difficult to get right. The further you’re straying from the ordinary course of history (and a nuclear bomb going off in London would be straying very far), the harder this is.

The tight connection between forecasting and building a model of the world helps explain why so much of the early interest in the idea came from the intelligence community. Matheny and colleagues wanted to develop a tool that could give policymakers real-time numerical probabilities, something that intelligence reports have historically not done, much to policymakers’ consternation. As early as 1973, Secretary of State Henry Kissinger was telling colleagues he wished “intelligence would supply him with estimates of the relevant betting odds.”

Matheny’s experiment ran through 2020. It included both the Aggregative Contingent Estimation (ACE) program, which used members of the public and grew into the Good Judgment Project, and the IC Prediction Market (ICPM), which was available to intelligence analysts with access to classified information. The two sources of information were about equally accurate, despite the outsiders’ lack of classified access. The experiment was exciting enough to spawn a UK offshoot. But funding on the US side of the Atlantic ran out, and the culture of forecasting in intelligence died off.

To Matheny, it’s a crying shame, and he wishes that government institutions and think tanks like his would get back into the habit and act a bit more like Samotsvety. “People might assume that the methods that we use in most institutions that are responsible for analysis have been well-evaluated. And in fact, they haven’t. Even when there are organizations whose decisions cost billions of dollars or even trillions, billions of dollars in the case of key national security decisions,” he told me. Forecasting, by contrast, works. So what are we waiting for?

——————————————-
By: Dylan Matthews
Title: How a ragtag band of internet friends became the best at forecasting world events
Sourced From: www.vox.com/future-perfect/2024/2/13/24070864/samotsvety-forecasting-superforecasters-tetlock
Published Date: Tue, 13 Feb 2024 14:40:00 +0000
