They used the “Torrance Tests of Creative Thinking”, a pseudo-scientific test that doesn’t measure anything of objective value.
Hah, yeah, that was my kneejerk reaction too: I read that as “the metric we use to determine creativity was found to be wildly inaccurate, with ML regularly placing in the 99th percentile”.
Embarrassing, considering how uncreative and unoriginal GPT-4 is. It’s an actual struggle to get ChatGPT to think outside the box. Claude 2, on the other hand, is much better at it.
But this goes to show how unimaginative the general population is if this truly is the case.
I have been playing with ChatGPT for tabletop character creation. It’s not bad at coming up with new ideas, but it is terrible at sticking to the rules of the game.
The context window is also still too short for any long story. The models just drop old messages and only remember the newest context.
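(For the curious, here's a minimal sketch of why that happens, assuming a naive client that trims history to a fixed token budget before each request; the function and parameter names are hypothetical, not any real API.)

```python
# Illustrative sliding-window history trimming; names are made up for this sketch.
def trim_history(messages, max_tokens, count_tokens):
    """Keep the newest messages that fit the token budget; older ones are dropped."""
    kept, used = [], 0
    for msg in reversed(messages):            # walk from newest to oldest
        cost = count_tokens(msg["content"])
        if used + cost > max_tokens:
            break                             # everything older than this is "forgotten"
        kept.append(msg)
        used += cost
    return list(reversed(kept))               # restore chronological order

# Rough usage, approximating tokens as ~4 characters each:
history = [{"role": "user", "content": "My dwarf cleric swears an oath..."},
           {"role": "assistant", "content": "Noted! Her deity grants a boon..."}]
print(trim_history(history, max_tokens=4000, count_tokens=lambda s: len(s) // 4))
```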
That makes sense. The further back the information was, the harder it was to recall, and its answer wasn’t to think harder but to fill in the gaps.
>evaluating LLM
>ask the researcher if they are testing form or meaning
>they don’t understand
>pull out illustrated diagram explaining what is form and what is meaning
>they laugh and say “the model is demonstrating creativity sir”
>looks at the test
>it’s form