Episodes

  • Latest Artificial Intelligence R&D Session - With Digitalent & Mike Nedelko - Episode (009)
    2025/06/23

    In this conversation, Mike discusses the latest developments in AI and machine learning, focusing on recent research papers that explore the reasoning capabilities of large language models (LLMs) and the implications of self-improving AI systems.

    The discussion includes a critical analysis of Apple's paper on LLM reasoning, comparisons between human and AI conceptual strategies, and insights into the Darwin-Gödel Machine, a self-referential AI system that can modify its own code. Mike emphasizes the importance of understanding the limitations and capabilities of AI in various domains, particularly in high-stakes environments.

    Highlights:

    - Apple's paper claims that large language models (LLMs) struggle with reasoning.

    - The importance of understanding LLMs' reasoning capabilities.

    - Using controlled puzzles to evaluate LLM reasoning in isolation (a minimal sketch follows this list).
    Findings suggest that LLMs face fundamental scaling limitations in reasoning tasks.

    - Comparing human and LLM conceptual strategies using information theory.
    LLMs are statistically efficient but may lack functional richness compared to human cognition.

    - Exploring the distinction between factual knowledge and logical reasoning in AI.

    - Self-improving AI systems, like the Darwin-Gödel Machine, represent a significant advancement in AI technology.
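
    As referenced in the highlights above, "controlled puzzles" let reasoning be scored at a chosen complexity level. The sketch below is an illustrative assumption, not the paper's actual harness: it validates a proposed Tower of Hanoi move sequence so a model's answers can be checked as the disc count grows.

```python
# Minimal sketch (not Apple's evaluation harness): score an LLM's proposed
# Tower of Hanoi solution at a controlled complexity level.

def is_valid_solution(n_discs: int, moves: list[tuple[int, int]]) -> bool:
    """Check whether `moves` (src_peg, dst_peg) transfers all discs from peg 0 to peg 2."""
    pegs = [list(range(n_discs, 0, -1)), [], []]  # peg 0 holds discs n..1, largest at bottom
    for src, dst in moves:
        if not pegs[src]:
            return False                      # moving from an empty peg
        disc = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disc:
            return False                      # larger disc placed on a smaller one
        pegs[dst].append(pegs[src].pop())
    return pegs[2] == list(range(n_discs, 0, -1))

# Sweep complexity: ask the model (call omitted) for a solution at each size and
# record accuracy as the puzzle grows; the claim discussed is that accuracy
# collapses beyond a certain complexity.
for n in range(3, 10):
    proposed = []  # placeholder for moves parsed from the model's answer
    print(n, is_valid_solution(n, proposed))
```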

    1 hr 5 min
  • Latest Artificial Intelligence R&D Session - With Digitalent & Mike Nedelko - Episode 008
    2025/06/03

    Session Topics:

    The Llama 4 Controversy and Evaluation Mechanism Failure
    Llama 4’s initial high Elo score on LM Arena was driven by optimizations for human preferences, such as the use of emojis and an overly positive tone. When these were removed, performance dropped significantly. This exposed weaknesses in existing evaluation mechanisms and raised concerns about benchmark reliability.

    Two Levels of AI Evaluation
    There are two main types of AI evaluation: model-level benchmarking for foundational models (e.g., Gemini, Claude), and use-case-specific evaluations for deployed AI systems—especially Retrieval Augmented Generation (RAG) systems.

    Benchmarking Foundational Models
    Benchmarks such as MMLU (world knowledge), MMMU (multimodal understanding), GPQA (expert-level reasoning), ARC-AGI (reasoning tasks), and newer ones like CodeElo and SWE-bench (software engineering tasks) are commonly used to assess foundational model performance.

    Evaluating Conversational and Agentic LLMs
    The MultiChallenge benchmark by Scale AI evaluates multi-turn conversational capabilities, while τ-bench (Tau-Bench) assesses how well agentic LLMs perform tasks like interacting with and modifying databases.

    Use Case Specific Evaluation and RAG Systems
    Use-case-specific evaluation is critical for RAG systems that rely on organizational data to generate context. One example illustrated a car-booking agent returning a cheesecake recipe—underscoring the risks of unexpected model behaviour.

    Ragas Framework for Evaluating RAG Systems
    Ragas and DeepEval offer evaluation metrics such as context precision, response relevance, and faithfulness. These frameworks can compare model outputs against ground truth to assess both retrieval and generation components.
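
    As an illustration only, the snippet below sketches how such an evaluation might be wired up with Ragas (assuming roughly its v0.1-style evaluate API; column names and metric imports may differ across versions, and the example data is made up):

```python
# Hedged sketch of a Ragas-style RAG evaluation (API details assumed; check your version).
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import context_precision, faithfulness, answer_relevancy

# One evaluation row: the user question, the retrieved context chunks,
# the generated answer, and a ground-truth reference answer.
data = {
    "question": ["Which model leads the LM Arena leaderboard?"],
    "contexts": [["LM Arena ranks chat models by pairwise human preference votes."]],
    "answer": ["LM Arena ranks models using pairwise human preference votes."],
    "ground_truth": ["LM Arena ranks models by crowd-sourced pairwise preference votes."],
}

result = evaluate(
    Dataset.from_dict(data),
    metrics=[context_precision, faithfulness, answer_relevancy],
)
print(result)  # per-metric scores covering both retrieval (context) and generation (answer)
```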

    The Leaderboard Illusion in Model Evaluation
    Leaderboards like LM Arena may present a distorted picture, as large organisations submit multiple hidden models to optimise final rankings—misleading users about true model performance.

    Using LLMs to Evaluate Other LLMs: Advantages and Risks
    LLMs can be used to evaluate other LLMs for scalability, but this introduces risks such as bias and false positives. Fourteen common design flaws have been identified in LLM-on-LLM evaluation systems.
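
    To make the idea concrete, here is a minimal, hedged sketch of LLM-on-LLM evaluation; `judge_completion` is a hypothetical placeholder for whatever chat-completion client you use, and the rubric is illustrative, not one from the session:

```python
# Minimal LLM-as-judge sketch. `judge_completion` is a hypothetical stand-in for
# your chat-completion client; the rubric and 1-5 scale are illustrative only.
JUDGE_PROMPT = """You are grading another model's answer.
Question: {question}
Candidate answer: {answer}
Score the answer from 1 (wrong) to 5 (fully correct) and explain briefly.
Reply as: SCORE: <n>"""

def judge_completion(prompt: str) -> str:
    raise NotImplementedError("plug in your own LLM client here")

def judge(question: str, answer: str) -> int:
    reply = judge_completion(JUDGE_PROMPT.format(question=question, answer=answer))
    # Parse "SCORE: n"; a real harness would validate the format and retry on failure.
    return int(reply.split("SCORE:")[-1].strip().split()[0])

# Risks noted in the session: if the judge shares style or training data with the
# candidate model (LLM narcissism), or its feedback loops back into the model
# being tested (circularity), scores drift away from true quality.
```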

    Circularity and LLM Narcissism in Evaluation
    Circularity arises when evaluator feedback influences the model being tested. LLM narcissism describes a model favouring outputs similar to its own, distorting evaluation outcomes.

    Label Correlation and Test Set Leaks
    Label correlation occurs when human and model evaluators agree on flawed outputs. Test set leaks happen when models have seen benchmark data during training, compromising result accuracy.

    The Need for Use Case Specific Model Evaluation
    General benchmarks alone are increasingly inadequate. Tailored, context-driven evaluations are essential to determine real-world suitability and performance of AI models.

    1 hr
  • Latest Artificial Intelligence R&D Session - With Digitalent & Mike Nedelko - Episode (007)
    2025/04/29

    Some of the main topics discussed:

    Google Gemini 2.5 Release
    Gemini 2.5 is now leading AI benchmarks with exceptional reasoning capabilities baked into its base training. Features include a 1M token context window, multimodality (handling text, images, video together), and independence from Nvidia chips, giving Google a strategic advantage.

    Alibaba’s Omnimodal Model (Qwen)
    Alibaba released an open-source model that can hear, talk, and write simultaneously with low latency. It uses a "thinker and talker" architecture and blockwise encoding, making it promising for edge devices and real-time conversations.

    OpenAI’s o3 and o4-mini Models
    OpenAI’s new models demonstrate strong tool usage (automatically using tools like Python or Web search during inference) and outperform previous models in multiple benchmarks. However, concerns were raised about differences between preview and production versions, including potential benchmark cheating.

    Model Context Protocol (MCP) and AI "App Store"
    MCP is becoming the dominant open standard to connect AI models to external applications and databases. It allows natural language-driven interactions between LLMs and business software. OpenAI and Google have endorsed MCP, making it a potential ecosystem-defining change.
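
    As a rough sketch of what an MCP integration looks like, the example below uses the official Python SDK's FastMCP helper to expose a business-data lookup as a tool an LLM can call; the server name, tool, and data are made up for illustration, so check the SDK documentation for the current API:

```python
# Hypothetical MCP server exposing one tool; names and data are illustrative.
# Assumes the official `mcp` Python SDK and its FastMCP helper.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("crm-demo")  # server name shown to connecting LLM clients

_CUSTOMERS = {"acme": {"plan": "enterprise", "open_tickets": 2}}

@mcp.tool()
def lookup_customer(name: str) -> dict:
    """Return CRM details for a customer, so an LLM can answer natural-language queries."""
    return _CUSTOMERS.get(name.lower(), {"error": "customer not found"})

if __name__ == "__main__":
    mcp.run()  # serves over stdio by default; an MCP-aware client can now call the tool
```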

    Security Concerns with MCP
    While MCP is powerful, early versions suffer from security vulnerabilities (e.g., privilege persistence, credential theft). New safety tools like MCP audits are being developed to address these concerns before it becomes enterprise-ready.

    Rise of Agentic AI and Industry 6.0
    The shift towards agentic AI (LLMs that chain tools and create novel ideas) could significantly reshape industries. A concept of "Industry 6.0" was discussed — fully autonomous manufacturing without human intervention, with early proof-of-concept already demonstrated.

    Impacts on Jobs and the Need for Upskilling
    With AI models becoming so capable, human roles will shift from doing the work to verifying and trusting AI outputs. Staying informed, experimenting with tools like MCP, and gaining AI literacy will be crucial for job security.

    Real-World AI Marketing and Legal Challenges
    Participants discussed real examples where AI (e.g., ChatGPT) generated inaccurate brand information. Legal implications around intellectual property and misinformation were also highlighted, including an anecdote about account banning due to copyright complaints.

    Vibe Coding and the Future of Development
    New AI-assisted coding platforms (like Google's Firebase Studio) allow "vibe coding," where developers can build applications with conversational prompts instead of traditional programming. This approach is making technical development much faster but still requires technical oversight.

    1 hr 4 min
  • Latest Artificial Intelligence R&D Session - with Digitalent & Mike Nedelko - Episode (006)
    2025/02/28

    The session's topics include:

    Reasoning Models: Mike highlights the rise of reasoning models dominating leaderboards, enabled by "inference time compute scaling." This allows models to allocate more computational power dynamically, leading to better accuracy and efficiency. These models use "chain of thought prompting," enhancing reasoning by generating intermediate steps, inspired by Daniel Kahneman's "System 2 thinking." He also discussed "Humanity's Last Exam," a challenging new benchmark designed to test advanced reasoning models.
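
    For illustration, a chain-of-thought prompt simply asks the model to produce intermediate steps before its final answer; the wording below is a generic example, not one used in the session:

```python
# Illustrative chain-of-thought prompt: the model is asked for intermediate
# reasoning steps before its final answer (generic example, not from the session).
question = "A train leaves at 14:10 and arrives at 16:45. How long is the journey?"

direct_prompt = f"{question}\nAnswer:"

cot_prompt = (
    f"{question}\n"
    "Think step by step: break the problem into intermediate steps, "
    "show each step, then give the final answer on a line starting with 'Answer:'."
)

# Sending `cot_prompt` instead of `direct_prompt` trades extra output tokens
# (inference-time compute) for typically better accuracy on reasoning tasks.
print(cot_prompt)
```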

    DeepSeek R1: Mike explored DeepSeek R1's innovations, including stable 8-bit floating point operations and multi-head latent attention, which reduced memory usage and improved efficiency. The real breakthrough was its use of reinforcement learning with self-verifiable tasks, allowing the model to learn without traditional supervised data. This approach improved reasoning and generalisation.
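
    A toy sketch of the "self-verifiable task" idea (my own illustration, not DeepSeek's actual pipeline): the reward comes from programmatically checking the model's answer, so no human-labelled supervision is required.

```python
# Toy sketch of a self-verifiable reward for RL on math problems
# (illustrative only, not DeepSeek R1's training code).
import re

def verifiable_reward(model_output: str, expected_answer: int) -> float:
    """Reward 1.0 if the final numeric answer matches the known result, else 0.0."""
    numbers = re.findall(r"-?\d+", model_output)
    if not numbers:
        return 0.0
    return 1.0 if int(numbers[-1]) == expected_answer else 0.0

# Because correctness is checked by code rather than by labelled data, the model
# can be trained with reinforcement learning on large batches of its own attempts,
# which is the property the session attributes to R1-style training.
print(verifiable_reward("Step 1: 3*4 = 12. Answer: 12", 12))  # -> 1.0
```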

    Reinforcement Learning and Generalisation: Mike emphasised a shift from supervised fine-tuning to reinforcement learning, enabling models to generalise intelligence rather than just memorise. This approach lowers training costs while enhancing reasoning abilities. He also discussed the growing trend of using reinforcement learning and self-play to make AI training more efficient and affordable.

    1 hr 3 min