AI Latest Research & Developments - With Digitalent & Mike Nedelko

Author: Dillan Leslie-Rowe

About this content

Join us monthly as we explore the cutting-edge world of artificial intelligence. Mike distills the most significant trends, groundbreaking research, and pivotal developments in AI, offering you a concise yet comprehensive update on this rapidly evolving field.

Whether you're an industry professional or simply AI-curious, this series is designed to be your essential guide. If you could only choose one source to stay informed about AI, make it Mike Nedelko's monthly briefing. Stay ahead of the curve and gain insights that matter in just one session per month.

© 2025 AI Latest Research & Developments - With Digitalent & Mike Nedelko
Episodes
  • Latest Artificial Intelligence R&D Session - With Digitalent & Mike Nedelko - Episode (009)
    2025/06/23

    In this conversation, Mike discusses the latest developments in AI and machine learning, focusing on recent research papers that explore the reasoning capabilities of large language models (LLMs) and the implications of self-improving AI systems.

    The discussion includes a critical analysis of Apple's paper on LLM reasoning, comparisons between human and AI conceptual strategies, and insights into the Darwin Gödel Machine, a self-referential AI system that can modify its own code. Mike emphasizes the importance of understanding the limitations and capabilities of AI in various domains, particularly in high-stakes environments.

    Highlights:

    - Apple's paper claims that large language models (LLMs) struggle with reasoning.

    - The importance of understanding LLMs' reasoning capabilities.

    - Using controlled puzzles to evaluate LLM reasoning in isolation; findings suggest that LLMs face fundamental scaling limitations in reasoning tasks.

    - Comparing human and LLM conceptual strategies using information theory; LLMs are statistically efficient but may lack the functional richness of human cognition.

    - Exploring the distinction between factual knowledge and logical reasoning in AI.

    - Self-improving AI systems, like the Darwin Gödel Machine, represent a significant advancement in AI technology.

    1 hour 5 minutes
  • Latest Artificial Intelligence R&D Session - With Digitalent & Mike Nedelko - Episode 008
    2025/06/03

    Session Topics:

    The Llama 4 Controversy and Evaluation Mechanism Failure
    Llama 4’s initial high Elo score on LM Arena was driven by optimizations for human preferences, such as the use of emojis and an overly positive tone. When these were removed, performance dropped significantly. This exposed weaknesses in existing evaluation mechanisms and raised concerns about benchmark reliability.
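
    To make the failure mode concrete, here is a minimal Python sketch (not from the session) of the Elo-style update that arena leaderboards apply to pairwise human preference votes; the K factor and 400-point scale are the conventional defaults. Because every vote a model wins feeds directly into its rating, a model tuned to win votes on style alone inflates its score without any gain in substance.

    ```python
    def elo_update(r_a: float, r_b: float, a_wins: bool, k: float = 32.0) -> tuple[float, float]:
        """One pairwise-preference update: model A vs model B, a voter picks a winner."""
        expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))  # win probability implied by current ratings
        score_a = 1.0 if a_wins else 0.0
        new_a = r_a + k * (score_a - expected_a)
        new_b = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
        return new_a, new_b

    # Two equally rated models: a single style-driven win still moves 16 points each way.
    print(elo_update(1200.0, 1200.0, a_wins=True))  # -> (1216.0, 1184.0)
    ```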

    Two Levels of AI Evaluation
    There are two main types of AI evaluation: model-level benchmarking for foundational models (e.g., Gemini, Claude), and use-case-specific evaluations for deployed AI systems—especially Retrieval Augmented Generation (RAG) systems.

    Benchmarking Foundational Models
    Benchmarks such as MMLU (world knowledge), MMMU (multimodal understanding), GPQA (expert-level reasoning), ARC-AGI (reasoning tasks), and newer ones like CodeElo and SWE-bench (software engineering tasks) are commonly used to assess foundational model performance.

    Evaluating Conversational and Agentic LLMs
    The MultiChallenge benchmark by Scale AI evaluates multi-turn conversational capabilities, while the τ-bench (tau) benchmark assesses how well agentic LLMs perform tasks like interacting with and modifying databases.

    Use Case Specific Evaluation and RAG Systems
    Use-case-specific evaluation is critical for RAG systems that rely on organizational data to generate context. One example illustrated a car-booking agent returning a cheesecake recipe—underscoring the risks of unexpected model behaviour.

    Ragas Framework for Evaluating RAG Systems
    Ragas and DeepEval offer evaluation metrics such as context precision, response relevance, and faithfulness. These frameworks can compare model outputs against ground truth to assess both retrieval and generation components.
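
    As a rough illustration, the sketch below follows a Ragas 0.1-style workflow (Ragas names the response-relevance metric answer_relevancy); the example data is invented, import paths and dataset column names vary between library versions, and the metrics rely on a separately configured judge LLM (typically via an OpenAI API key).

    ```python
    from datasets import Dataset
    from ragas import evaluate
    from ragas.metrics import context_precision, answer_relevancy, faithfulness

    # Hypothetical single-row evaluation set: question, retrieved contexts,
    # generated answer, and a ground-truth reference from organisational data.
    rows = {
        "question": ["Which engine does the 2023 model use?"],
        "contexts": [["Spec sheet: the 2023 model ships with a 2.0L hybrid engine."]],
        "answer": ["The 2023 model uses a 2.0L hybrid engine."],
        "ground_truth": ["A 2.0L hybrid engine."],
    }

    # Scores the retrieval side (context_precision) and the generation side
    # (answer_relevancy, faithfulness), each roughly on a 0-1 scale.
    result = evaluate(
        Dataset.from_dict(rows),
        metrics=[context_precision, answer_relevancy, faithfulness],
    )
    print(result)
    ```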

    The Leaderboard Illusion in Model Evaluation
    Leaderboards like LM Arena may present a distorted picture, as large organisations submit multiple hidden models to optimise final rankings—misleading users about true model performance.

    Using LLMs to Evaluate Other LLMs: Advantages and Risks
    LLMs can be used to evaluate other LLMs for scalability, but this introduces risks such as bias and false positives. Fourteen common design flaws have been identified in LLM-on-LLM evaluation systems.
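
    A minimal sketch of the LLM-as-judge pattern discussed here, with a hypothetical ask_llm callable standing in for whichever judge model is used (no specific vendor API is implied):

    ```python
    from typing import Callable

    def judge_answer(question: str, answer: str, ask_llm: Callable[[str], str]) -> bool:
        """Ask a judge LLM to grade another model's answer with a PASS/FAIL verdict."""
        prompt = (
            "You are grading another model's answer.\n"
            f"Question: {question}\n"
            f"Answer: {answer}\n"
            "Reply with exactly PASS or FAIL."
        )
        verdict = ask_llm(prompt).strip().upper()
        # The design flaws flagged in the session live here: a biased or lenient
        # judge yields false positives, and a judge may favour answers phrased
        # like its own outputs.
        return verdict.startswith("PASS")
    ```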

    Circularity and LLM Narcissism in Evaluation
    Circularity arises when evaluator feedback influences the model being tested. LLM narcissism describes a model favouring outputs similar to its own, distorting evaluation outcomes.

    Label Correlation and Test Set Leaks
    Label correlation occurs when human and model evaluators agree on flawed outputs. Test set leaks happen when models have seen benchmark data during training, compromising result accuracy.

    The Need for Use Case Specific Model Evaluation
    General benchmarks alone are increasingly inadequate. Tailored, context-driven evaluations are essential to determine real-world suitability and performance of AI models.

    1 hour
  • Latest Artificial Intelligence R&D Session - With Digitalent & Mike Nedelko - Episode (007)
    2025/04/29

    Some of the main topics discussed:

    Google Gemini 2.5 Release
    Gemini 2.5 is now leading AI benchmarks with exceptional reasoning capabilities baked into its base training. Features include a 1M token context window, multimodality (handling text, images, video together), and independence from Nvidia chips, giving Google a strategic advantage.

    Alibaba’s Omnimodal Model (Qwen)
    Alibaba released an open-source model that can hear, talk, and write simultaneously with low latency. It uses a "thinker and talker" architecture and blockwise encoding, making it promising for edge devices and real-time conversations.

    OpenAI’s o3 and o4-mini Models
    OpenAI’s new models demonstrate strong tool usage (automatically using tools like Python or Web search during inference) and outperform previous models in multiple benchmarks. However, concerns were raised about differences between preview and production versions, including potential benchmark cheating.

    Model Context Protocol (MCP) and AI "App Store"
    MCP is becoming the dominant open standard to connect AI models to external applications and databases. It allows natural language-driven interactions between LLMs and business software. OpenAI and Google have endorsed MCP, making it a potential ecosystem-defining change.
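
    For a sense of what connecting business software looks like, here is a minimal tool-server sketch using the FastMCP helper from the official MCP Python SDK; the lookup_order tool and its data source are invented for illustration, and import paths may differ across SDK versions.

    ```python
    from mcp.server.fastmcp import FastMCP

    # A tiny MCP server exposing one business-system tool to any MCP-capable LLM client.
    mcp = FastMCP("orders")

    @mcp.tool()
    def lookup_order(order_id: str) -> str:
        """Return the status of an order (hypothetical stand-in for a real database call)."""
        fake_db = {"A-1001": "shipped", "A-1002": "processing"}
        return fake_db.get(order_id, "not found")

    if __name__ == "__main__":
        mcp.run()  # serves the tool over MCP (stdio transport by default)
    ```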

    Security Concerns with MCP
    While MCP is powerful, early versions suffer from security vulnerabilities (e.g., privilege persistence, credential theft). New safety tools like MCP audits are being developed to address these concerns before it becomes enterprise-ready.

    Rise of Agentic AI and Industry 6.0
    The shift towards agentic AI (LLMs that chain tools and create novel ideas) could significantly reshape industries. A concept of "Industry 6.0" was discussed — fully autonomous manufacturing without human intervention, with early proof-of-concept already demonstrated.

    Impacts on Jobs and the Need for Upskilling
    With AI models becoming so capable, human roles will shift from doing the work to verifying and trusting AI outputs. Staying informed, experimenting with tools like MCP, and gaining AI literacy will be crucial for job security.

    Real-World AI Marketing and Legal Challenges
    Participants discussed real examples where AI (e.g., ChatGPT) generated inaccurate brand information. Legal implications around intellectual property and misinformation were also highlighted, including an anecdote about account banning due to copyright complaints.

    Vibe Coding and the Future of Development
    New AI-assisted coding platforms (like Google's Firebase Studio) allow "vibe coding," where developers can build applications with conversational prompts instead of traditional programming. This approach is making technical development much faster but still requires technical oversight.

    1 hour 4 minutes
