AI Latest Research & Developments - With Digitalent & Mike Nedelko

Author: Dillan Leslie-Rowe

About this content

Join us monthly as we explore the cutting-edge world of artificial intelligence. Mike distills the most significant trends, groundbreaking research, and pivotal developments in AI, offering you a concise yet comprehensive update on this rapidly evolving field.

Whether you're an industry professional or simply AI-curious, this series is designed to be your essential guide. If you could only choose one source to stay informed about AI, make it Mike Nedelko's monthly briefing. Stay ahead of the curve and gain insights that matter in just one session per month.

© 2025 AI Latest Research & Developments - With Digitalent & Mike Nedelko
Episodes
  • Latest Artificial Intelligence R&D Session - With Digitalent & Mike Nedelko - Episode (009)
    2025/06/23

    In this conversation, Mike discusses the latest developments in AI and machine learning, focusing on recent research papers that explore the reasoning capabilities of large language models (LLMs) and the implications of self-improving AI systems.

    The discussion includes a critical analysis of Apple's paper on LLM reasoning, comparisons between human and AI conceptual strategies, and insights into the Darwin Gödel Machine, a self-referential AI system that can modify its own code. Mike emphasizes the importance of understanding the limitations and capabilities of AI in various domains, particularly in high-stakes environments.

    Highlights:

    - Apple's paper claims that large language models (LLMs) struggle with reasoning.

    - The importance of understanding LLMs' reasoning capabilities.

    - Using controlled puzzles to evaluate LLM reasoning in isolation; findings suggest that LLMs face fundamental scaling limitations in reasoning tasks.

    - Comparing human and LLM conceptual strategies using information theory; LLMs are statistically efficient but may lack the functional richness of human cognition.

    - Exploring the distinction between factual knowledge and logical reasoning in AI.

    - Self-improving AI systems, like the Darwin Gödel Machine, represent a significant advancement in AI technology.

    1 hour 5 minutes
  • Latest Artificial Intelligence R&D Session - With Digitalent & Mike Nedelko - Episode 008
    2025/06/03

    Session Topics:

    The Llama 4 Controversy and Evaluation Mechanism Failure
    Llama 4’s initial high Elo score on LM Arena was driven by optimizations for human preferences, such as the use of emojis and an overly positive tone. When these were removed, performance dropped significantly. This exposed weaknesses in existing evaluation mechanisms and raised concerns about benchmark reliability.
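
    To make the failure mode concrete, here is a minimal Python sketch (not from the session) of the Elo-style update that arena leaderboards apply to pairwise human preference votes; the K factor and 400-point scale are the conventional defaults. Because every vote a model wins feeds directly into its rating, a model tuned to win votes on style alone inflates its score without any gain in substance.

    ```python
    def elo_update(r_a: float, r_b: float, a_wins: bool, k: float = 32.0) -> tuple[float, float]:
        """One pairwise-preference update: model A vs model B, a voter picks a winner."""
        expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))  # win probability implied by current ratings
        score_a = 1.0 if a_wins else 0.0
        new_a = r_a + k * (score_a - expected_a)
        new_b = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
        return new_a, new_b

    # Two equally rated models: a single style-driven win still moves 16 points each way.
    print(elo_update(1200.0, 1200.0, a_wins=True))  # -> (1216.0, 1184.0)
    ```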

    Two Levels of AI Evaluation
    There are two main types of AI evaluation: model-level benchmarking for foundational models (e.g., Gemini, Claude), and use-case-specific evaluations for deployed AI systems—especially Retrieval Augmented Generation (RAG) systems.

    Benchmarking Foundational Models
    Benchmarks such as MMLU (world knowledge), MMMU (multimodal understanding), GPQA (expert-level reasoning), ARC-AGI (reasoning tasks), and newer ones like CodeElo and SWE-bench (software engineering tasks) are commonly used to assess foundational model performance.

    Evaluating Conversational and Agentic LLMs
    The MultiChallenge benchmark by Scale AI evaluates multi-turn conversational capabilities, while the τ-bench (tau) benchmark assesses how well agentic LLMs perform tasks like interacting with and modifying databases.

    Use Case Specific Evaluation and RAG Systems
    Use-case-specific evaluation is critical for RAG systems that rely on organizational data to generate context. One example illustrated a car-booking agent returning a cheesecake recipe—underscoring the risks of unexpected model behaviour.

    Ragas Framework for Evaluating RAG Systems
    Ragas and DeepEval offer evaluation metrics such as context precision, response relevance, and faithfulness. These frameworks can compare model outputs against ground truth to assess both retrieval and generation components.
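
    As a rough illustration, the sketch below follows a Ragas 0.1-style workflow (Ragas names the response-relevance metric answer_relevancy); the example data is invented, import paths and dataset column names vary between library versions, and the metrics rely on a separately configured judge LLM (typically via an OpenAI API key).

    ```python
    from datasets import Dataset
    from ragas import evaluate
    from ragas.metrics import context_precision, answer_relevancy, faithfulness

    # Hypothetical single-row evaluation set: question, retrieved contexts,
    # generated answer, and a ground-truth reference from organisational data.
    rows = {
        "question": ["Which engine does the 2023 model use?"],
        "contexts": [["Spec sheet: the 2023 model ships with a 2.0L hybrid engine."]],
        "answer": ["The 2023 model uses a 2.0L hybrid engine."],
        "ground_truth": ["A 2.0L hybrid engine."],
    }

    # Scores the retrieval side (context_precision) and the generation side
    # (answer_relevancy, faithfulness), each roughly on a 0-1 scale.
    result = evaluate(
        Dataset.from_dict(rows),
        metrics=[context_precision, answer_relevancy, faithfulness],
    )
    print(result)
    ```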

    The Leaderboard Illusion in Model Evaluation
    Leaderboards like LM Arena may present a distorted picture, as large organisations submit multiple hidden models to optimise final rankings—misleading users about true model performance.

    Using LLMs to Evaluate Other LLMs: Advantages and Risks
    LLMs can be used to evaluate other LLMs for scalability, but this introduces risks such as bias and false positives. Fourteen common design flaws have been identified in LLM-on-LLM evaluation systems.
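
    A minimal sketch of the LLM-as-judge pattern discussed here, with a hypothetical ask_llm callable standing in for whichever judge model is used (no specific vendor API is implied):

    ```python
    from typing import Callable

    def judge_answer(question: str, answer: str, ask_llm: Callable[[str], str]) -> bool:
        """Ask a judge LLM to grade another model's answer with a PASS/FAIL verdict."""
        prompt = (
            "You are grading another model's answer.\n"
            f"Question: {question}\n"
            f"Answer: {answer}\n"
            "Reply with exactly PASS or FAIL."
        )
        verdict = ask_llm(prompt).strip().upper()
        # The design flaws flagged in the session live here: a biased or lenient
        # judge yields false positives, and a judge may favour answers phrased
        # like its own outputs.
        return verdict.startswith("PASS")
    ```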

    Circularity and LLM Narcissism in Evaluation
    Circularity arises when evaluator feedback influences the model being tested. LLM narcissism describes a model favouring outputs similar to its own, distorting evaluation outcomes.

    Label Correlation and Test Set Leaks
    Label correlation occurs when human and model evaluators agree on flawed outputs. Test set leaks happen when models have seen benchmark data during training, compromising result accuracy.

    The Need for Use Case Specific Model Evaluation
    General benchmarks alone are increasingly inadequate. Tailored, context-driven evaluations are essential to determine real-world suitability and performance of AI models.

    1 hour
  • Latest Artificial Intelligence R&D Session - With Digitalent & Mike Nedelko - Episode (007)
    2025/04/29

    Some of the main topics discussed:

    Google Gemini 2.5 Release
    Gemini 2.5 is now leading AI benchmarks with exceptional reasoning capabilities baked into its base training. Features include a 1M token context window, multimodality (handling text, images, video together), and independence from Nvidia chips, giving Google a strategic advantage.

    Alibaba’s Omnimodal Model (Qwen)
    Alibaba released an open-source model that can hear, talk, and write simultaneously with low latency. It uses a "thinker and talker" architecture and blockwise encoding, making it promising for edge devices and real-time conversations.

    OpenAI’s o3 and o4-mini Models
    OpenAI’s new models demonstrate strong tool usage (automatically using tools like Python or Web search during inference) and outperform previous models in multiple benchmarks. However, concerns were raised about differences between preview and production versions, including potential benchmark cheating.

    Model Context Protocol (MCP) and AI "App Store"
    MCP is becoming the dominant open standard to connect AI models to external applications and databases. It allows natural language-driven interactions between LLMs and business software. OpenAI and Google have endorsed MCP, making it a potential ecosystem-defining change.
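
    For a sense of what connecting business software looks like, here is a minimal tool-server sketch using the FastMCP helper from the official MCP Python SDK; the lookup_order tool and its data source are invented for illustration, and import paths may differ across SDK versions.

    ```python
    from mcp.server.fastmcp import FastMCP

    # A tiny MCP server exposing one business-system tool to any MCP-capable LLM client.
    mcp = FastMCP("orders")

    @mcp.tool()
    def lookup_order(order_id: str) -> str:
        """Return the status of an order (hypothetical stand-in for a real database call)."""
        fake_db = {"A-1001": "shipped", "A-1002": "processing"}
        return fake_db.get(order_id, "not found")

    if __name__ == "__main__":
        mcp.run()  # serves the tool over MCP (stdio transport by default)
    ```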

    Security Concerns with MCP
    While MCP is powerful, early versions suffer from security vulnerabilities (e.g., privilege persistence, credential theft). New safety tools like MCP audits are being developed to address these concerns before it becomes enterprise-ready.

    Rise of Agentic AI and Industry 6.0
    The shift towards agentic AI (LLMs that chain tools and create novel ideas) could significantly reshape industries. A concept of "Industry 6.0" was discussed — fully autonomous manufacturing without human intervention, with early proof-of-concept already demonstrated.

    Impacts on Jobs and the Need for Upskilling
    With AI models becoming so capable, human roles will shift from doing the work to verifying and trusting AI outputs. Staying informed, experimenting with tools like MCP, and gaining AI literacy will be crucial for job security.

    Real-World AI Marketing and Legal Challenges
    Participants discussed real examples where AI (e.g., ChatGPT) generated inaccurate brand information. Legal implications around intellectual property and misinformation were also highlighted, including an anecdote about account banning due to copyright complaints.

    Vibe Coding and the Future of Development
    New AI-assisted coding platforms (like Google's Firebase Studio) allow "vibe coding," where developers can build applications with conversational prompts instead of traditional programming. This approach is making technical development much faster but still requires technical oversight.

    1 hour 4 minutes
