Nature: Large Language Models Are Proficient in Solving and Creating Emotional Intelligence Tests

About this content

Summary of https://www.nature.com/articles/s44271-025-00258-x

Explores the emotional intelligence capabilities of Large Language Models (LLMs), specifically their ability to solve and create emotional intelligence tests. It highlights that several LLMs, including ChatGPT-4, consistently outperformed the human average on established emotional intelligence assessments.

The research also investigated LLMs' capacity to generate new, psychometrically sound test items, finding that the AI-created items showed comparable difficulty and correlated strongly with the original human-designed tests. While minor differences were observed in clarity, realism, and content diversity, the study ultimately suggests that LLMs can reason accurately about human emotions and their regulation, indicating their potential for socio-emotional applications and psychometric test development.

  • LLMs demonstrate superior performance in solving emotional intelligence tests compared to humans. Six widely used Large Language Models (ChatGPT-4, ChatGPT-o1, Gemini 1.5 Flash, Copilot 365, Claude 3.5 Haiku, and DeepSeek V3) collectively achieved an average accuracy of 81% on five standard ability emotional intelligence (EI) tests, significantly outperforming the human average of 56% reported in the original validation studies. All tested LLMs scored more than one standard deviation above the human mean, with ChatGPT-o1 and DeepSeek V3 exceeding two standard deviations above it (see the z-score sketch after this list).
  • LLMs are proficient at generating new, high-quality emotional intelligence test items. ChatGPT-4 successfully generated new test items (scenarios and response options) for five different ability EI tests, and when administered to human participants, these new versions showed difficulty statistically equivalent to the original tests. Importantly, ChatGPT-4 did not simply paraphrase existing items: 88% of the newly created scenarios were perceived by participants as having low similarity to any original test scenario.
  • LLM-generated tests exhibit psychometric properties largely comparable to original human-designed tests, though with some minor differences. Not all psychometric properties (perceived item clarity, realism, item content diversity, internal consistency, and correlations with vocabulary or with other EI tests) were statistically equivalent between the original and ChatGPT-generated versions, but the differences observed were small (Cohen's d within ±0.25), and none of the 95% confidence interval boundaries exceeded a medium effect size (d = ±0.50; see the equivalence sketch after this list). Furthermore, original and ChatGPT-generated tests were strongly correlated (r = 0.46), suggesting they measure similar constructs.
  • LLMs show potential for "cognitive empathy" and consistent application of emotional knowledge. The findings support the idea that LLMs can generate responses consistent with accurate knowledge of emotional concepts, emotional situations, and their implications, thereby meeting the knowledge component of cognitive empathy. LLMs also offer practical advantages: they process emotional scenarios against extensive training data, which may lead to fewer errors, and they apply emotional knowledge consistently, unaffected by human variability such as mood, fatigue, or personal preferences.
  • LLMs can significantly aid psychometric test development but cannot fully replace human validation. The research highlights that LLMs like ChatGPT can be powerful tools for the psychometric development of standardized assessments, particularly in the domain of emotion, generating complete tests with generally acceptable psychometric properties from only a few prompts. However, while valuable for creating an initial item pool, LLMs cannot replace the pilot and validation studies needed to refine or eliminate poorly performing items.
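
As a worked illustration of the first bullet's comparison, the sketch below expresses a model's test accuracy as a z-score relative to the human mean. Only the 56% human mean and the 81% collective LLM average are reported above; the human standard deviation and the per-model accuracies in the code are hypothetical placeholders.

```python
# Worked illustration: how many standard deviations above the human
# mean a model's EI-test accuracy lies (a z-score). Only the 56%
# human mean and 81% LLM average are reported in the summary; the SD
# and per-model accuracies below are hypothetical placeholders.

HUMAN_MEAN = 0.56   # human average accuracy (from the summary)
HUMAN_SD = 0.12     # hypothetical placeholder, not reported in the summary

llm_accuracy = {    # hypothetical per-model values around the 81% average
    "ChatGPT-4": 0.81,
    "ChatGPT-o1": 0.85,
    "DeepSeek V3": 0.84,
}

for model, acc in llm_accuracy.items():
    z = (acc - HUMAN_MEAN) / HUMAN_SD
    print(f"{model}: accuracy {acc:.0%}, {z:.1f} SDs above the human mean")
```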
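
The equivalence claim in the third bullet can also be made concrete: compute Cohen's d with an approximate 95% confidence interval for scores on the original versus the ChatGPT-generated version of a test, then check that both CI bounds stay inside the ±0.50 medium-effect margin cited above. The simulated score arrays, sample sizes, and the normal-approximation CI are all assumptions for illustration, not the paper's actual data or analysis pipeline.

```python
# Minimal sketch of the equivalence check described above: Cohen's d
# with an approximate 95% CI, compared against the +/-0.50
# medium-effect margin. All data below are simulated placeholders.
import numpy as np

rng = np.random.default_rng(0)
original = rng.normal(0.56, 0.15, 200)    # hypothetical scores, original test
generated = rng.normal(0.54, 0.15, 200)   # hypothetical scores, LLM-generated test

n1, n2 = len(original), len(generated)
pooled_sd = np.sqrt(((n1 - 1) * original.var(ddof=1) +
                     (n2 - 1) * generated.var(ddof=1)) / (n1 + n2 - 2))
d = (original.mean() - generated.mean()) / pooled_sd

# Normal-approximation standard error of d (Hedges & Olkin style)
se = np.sqrt((n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2)))
lo, hi = d - 1.96 * se, d + 1.96 * se

print(f"Cohen's d = {d:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
print("Within the medium-effect margin:", -0.50 < lo and hi < 0.50)
```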
