A criticism of the recent ChatGPT meta-study in a Nature-affiliated journal currently making the rounds
On “The effect of ChatGPT on students’ learning performance, learning perception, and higher-order thinking: insights from a meta-analysis” and its discontents
The study “The effect of ChatGPT on students’ learning performance, learning perception, and higher-order thinking: insights from a meta-analysis” was published about a month ago in Humanities and Social Sciences Communications, a Nature-affiliated journal, and is currently receiving quite a lot of attention.
It has been viewed hundreds of thousands of times, which is remarkable as far as journal articles go, especially for a journal that is not among the very top names, and it has already racked up a large number of published citations in just a couple of weeks. This is also noteworthy considering how long it takes, on average, to go from submitting an article to actually getting it published.
And to put it mildly, this study is really a hot pile of smoking garbage. It should not only be retracted; it shouldn’t even have been attempted in the first place.
To start with a few remarks at the conceptual level, the meta-study as such is a very difficult tool to use. It can be powerful and it can bring out uniquely valuable data, but it’s also particularly sensitive to methodological issues, because it magnifies them at the aggregate level. Minor errors in the basic method can create significant distortions, since you can easily make it seem like hundreds or even thousands of individual studies reinforce variables to which they have almost no connection whatsoever.
This is why meta-studies are most reasonably performed in relation to very clear and unambiguous variables, such as insulin resistance, cancer incidence, levels of air pollution, the number of species of moss on a type of rock, or other outcomes that can be reliably measured with a high degree of accuracy.
So let’s say I’ve done a Skittles study. I’ve had a hundred volunteers get half of their daily calories from nothing but Skittles for a year, and I want to find out how this affects their metabolic health; specifically, I’m looking at insulin resistance before and after the intervention. All right.
And then you have also performed a similar experiment. You’ve had the 65 inmates of a retirement home replace all of their drinking water with Coca-Cola. They make their oatmeal with Coca-Cola. They brush their teeth in it. They have to brew their coffee with Coca-Cola. You get the picture. And you’re also looking at the effects on insulin resistance at the end of the experiment.
So then you and I can compare these outcomes and actually pull some nice weighted averages from our respective results without too much distortion, and this is because we’re looking at outcomes that can be reliably measured with a high degree of accuracy. Of course, there are ambiguities and sources of error here. Perhaps your study wasn’t controlled in the same way as mine, and the average ages of the study groups may have been different. Skittles and Coke have different “nutritional profiles”, and there might also have been variations in the diets of our control groups, and so forth.
But the signal we get from the experimental interventions should be strong and clear enough that we can control reasonably well for these distorting variables. The factors we’re studying are strongly and dominantly present, and we have a pretty good idea of what kind of causal mechanism we’re looking for in terms of dietary fructose intake and its effects on insulin resistance. We’re also measuring exactly the same thing at the end, not different variables whose connection to one another is more or less unclear.
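To make the pooling step concrete, here is a minimal sketch (my own illustration, with purely made-up numbers for the two imaginary studies above, not anything from the paper) of the standard fixed-effect, inverse-variance way of combining two effect estimates that measure the same thing:

```python
import numpy as np

# Purely illustrative effect estimates on insulin resistance from the two
# imaginary studies above: standardized mean differences and their variances.
effects = np.array([0.45, 0.60])
variances = np.array([0.04, 0.09])

# Fixed-effect (inverse-variance) pooling: more precise studies get more weight.
weights = 1.0 / variances
pooled = np.sum(weights * effects) / np.sum(weights)
pooled_se = np.sqrt(1.0 / np.sum(weights))

print(f"Pooled effect: {pooled:.3f} (SE {pooled_se:.3f})")
```

The averaging is only meaningful here because both studies are estimating the same, well-defined quantity on the same scale.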
Pooling variables whose connection is unclear, however, is exactly what this garbage ChatGPT study does.
It’s the equivalent of studying five guys who always start their morning with half a bottle of whisky, pooling them with five others whose breakfast consists of a couple of cigarettes and three cups of coffee (and maybe with five of my Skittles guys on top), and then concluding that there are really strong indications that people who “skip breakfast” tend to write bad poetry later in the day. Because science.
In detail, the meta-study in Humanities and Social Sciences Communications pools around 54 or so studies (it’s not really 54; their reported number doesn’t add up with their actual data) on the effects of ChatGPT in an educational setting, and the authors purport to draw from these studies data on three distinct overarching variables relevant to educational outcomes.
These variables of theirs are “higher-order thinking”, “learning performance” and “learning perception” (the last one basically captures how fun and engaging you thought the educational activities were).
And as you can see already at this stage, these are really fuzzy variables that are quite difficult to measure reliably even in a limited and controlled setting. “Learning performance” is painfully vague, and “learning perception” is barely an established theoretical concept. And at that, what is really the average of me reporting that I enjoyed fiddling with ChatGPT as a 6 on a Likert scale and of you spending a few extra minutes reading the assigned literature for the course? What’s even the meaning of an average of those two variables, and how would it validly compute into a pooled average of “learning perception” when you put together a large number of divergent studies?
Hint: it won’t, of course.
So the authors use a measure for calculating effect size when two different groups are involved. The measure is known as Hedges’ g: for each study, you take the difference between the experimental group’s mean and the control group’s mean and divide it by their pooled standard deviation, with a small correction for sample size. The meta-analysis then weights each study’s g by factors such as the relative sizes of the study groups, or the relative precision of the studies’ estimates in relation to the specific variables you’re measuring. Needless to say, this weighting is pretty iffy stuff, since the same weighting scheme is applied across the entire pool of studies, which can introduce significant distortions. This means that such weighting measures must be very carefully tailored to the unique character of the specific pool of studies; you basically need to do a deep qualitative analysis of the entire set and figure out exactly what you mean by accuracy in relation to the overarching variables of the meta-study, precisely because the potential distortions can be so significant.
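To make concrete what g and the study-level weights actually are, here is a minimal sketch in Python (my own illustration with invented numbers, not the authors’ pipeline) of the standard Hedges’ g formula and the inverse-variance weight a study would typically get in a meta-analysis:

```python
import numpy as np

def hedges_g(mean_t, sd_t, n_t, mean_c, sd_c, n_c):
    """Hedges' g: standardized mean difference with a small-sample correction."""
    # Pooled standard deviation of the experimental and control groups
    sd_pooled = np.sqrt(((n_t - 1) * sd_t**2 + (n_c - 1) * sd_c**2) / (n_t + n_c - 2))
    d = (mean_t - mean_c) / sd_pooled            # Cohen's d
    j = 1 - 3 / (4 * (n_t + n_c) - 9)            # small-sample bias correction
    g = j * d
    # Approximate sampling variance of g; 1/var_g is the usual meta-analytic weight
    var_g = j**2 * ((n_t + n_c) / (n_t * n_c) + d**2 / (2 * (n_t + n_c)))
    return g, var_g

# Made-up example: a "GPT group" of 30 students scoring 78 on some test (SD 10)
# versus 30 controls scoring 72 (SD 10).
g, var_g = hedges_g(78, 10, 30, 72, 10, 30)
print(f"g = {g:.3f}, inverse-variance weight = {1 / var_g:.1f}")
```

The point is that the weight each study receives is driven entirely by these mechanical quantities, not by whether the underlying outcome has anything to do with the meta-variable it gets pooled under.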
So if you’re measuring “metabolic health outcomes”, say, you’re going to have to figure out the specific relevance of each variable of each and every single one of the underlying studies in relation to this meta-variable.
These jokers did no such thing, of course. That would have taken actual work. Instead, they simply used “comprehensive meta-analysis software (version 3.3.0)” [sic!], plugged in the numbers automatically extracted from the sub-studies, and ran with the result.
Yeah, I’m not kidding. This is what passes for “science” nowadays.
So what we have here is a totally irresponsible misuse of statistics that consistently mixes apples and oranges in relation to totally ambiguous meta-variables that have a tenuous and inconsistent connection to the underlying studies, using a methodological approach for weighting averages that LITERALLY GUARANTEES that the outcome will have no validity whatsoever.
It’s a piece of shoddily designed marketing masked as a scientific study that should never have been published — let alone in one of the sub-journals of the prestigious Nature.
But that’s just the beginning.
So apart from this overarching methodological nightmare, I actually sat down and went through the whole set of 50-something studies that this meta-analysis is (allegedly) based on, and the situation is much worse than one would imagine.
Again, to begin with, totally incommensurable variables are clearly mashed together to produce weighted averages. This gives us interesting examples such as test results, average time spent reading course books, and whether or not patients perceived doctors in training as friendly, all being pooled together as a measure of “learning performance”. How is this weighted, and what are they actually measuring? Nobody knows.
On top of that, they include retracted studies (on three separate occasions), and they include a study (again, on three separate occasions) that they themselves invalidate in their own methods section, citing it as a specific example of a paper that could not be used in their analysis due to issues of validity. And then they go ahead and use it anyway. Three times.
They also, on many occasions, directly misrepresent the data from the underlying studies in ways that obviously contradict the findings of those studies’ own authors. One good example is the Bašić et al. (2023) study.
Here, the outcome is evidently negative on the part of the GPT-assisted experimental group, and the study design is also concerned with actual student output during a specific course assignment (rather than testable learning outcomes), which renders its validity null. Still, the study is cited with a weighted Hedges’ g of 0.993 in favor of the experimental group. That would imply that the GPT group outperformed the controls by almost one standard deviation, which would be quite large. And again, the outcome was actually negative for this group in the cited study. What’s going on here?
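To put that number in perspective, here is a small self-contained illustration (with invented scores, not the actual data from Bašić et al.) of what a g near 0.99 requires, and of how getting the direction of the difference wrong during extraction turns a negative result into a strongly positive one:

```python
# Invented numbers, not the data from Bašić et al. (2023): two groups of 40
# students, test scores with a pooled SD of 10, and a 10-point gap between them.
n_t, n_c = 40, 40
mean_t, mean_c, sd_pooled = 80.0, 70.0, 10.0

d = (mean_t - mean_c) / sd_pooled           # Cohen's d = 1.0
j = 1 - 3 / (4 * (n_t + n_c) - 9)           # Hedges' small-sample correction
print(round(j * d, 3))                      # ~0.990, roughly the reported 0.993

# Swap which group did better and g simply flips sign: a genuinely negative
# result can only show up as +0.99 if the direction got reversed somewhere.
print(round(j * (mean_c - mean_t) / sd_pooled, 3))   # ~ -0.990
```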
So when you read the actual studies themselves, there’s actually just a small number of papers that provide a clear and unambiguous positive result for the GPT group over and above the controls, which means that the data is massively falsified, whether intentionally or not, perhaps as an artifact of their automated scraping or their “meta study software”.
Another significant problem is that almost no underlying study specifies what happens to the control group. We learn basically nothing about whatever educational interventions they’re being subjected to, other than something like “oh, they were given traditional instruction”, so you have no idea what the interventions in the experimental group are being compared to. This, all on its own, renders the meta-study almost useless.
And of course, none of the studies are blind. None are sufficiently controlled for us to be sure, for instance, that the controls are not also using ChatGPT, or that the controls are unaware of being potentially disadvantaged. And the majority of the few studies with actually significant positive results for the GPT groups clearly invite other explanations for the outcome than the AI aspect as such (such as ChatGPT being advantageous for second-language English students in a context lacking native English-speaking teachers and rich language exposure).
So, anyway, this garbage should naturally be retracted as fast as possible. But since it already has a massive number of views and a growing number of published citations, since it’s being used in policy development for the use of AI tools at universities and research institutions, and since its “findings” clearly align with the preferences of massively influential economic and political interest groups, I guess there’s a snowball’s chance in hell of that happening.
I currently have an article under review in this very journal, so I guess it’s a bad idea to post this piece. But this cannot stand. This kind of rot has to be exposed, and it’s increasingly clear to me that we have to literally reinvent and rebuild science and academia from the ground up in the years ahead.
This needs to be retracted, so please share this around, and if anyone’s interested in joining up and collaborating on a formal response, please let me know.