Episodes

  • (FM Personalize-AMZN) MCM: A multi-task pre-trained customer model for personalization
    2025/09/05

    Welcome to our podcast, where we delve into cutting-edge advancements in personalization! Today, we're highlighting MCM: A Multi-task Pre-trained Customer Model for Personalization, developed by Amazon.

    This innovative BERT-based model, with 10 million parameters, revolutionises how e-commerce platforms deeply understand customer preferences and shopping intents. Its novelty stems from significantly improving the state-of-the-art BERT4Rec framework by handling heterogeneous customer signals and implementing multi-task training. Key innovations include a random prefix augmentation method that avoids leaking future information and a task-aware attentional readout module that generates highly specific representations for different items and tasks.
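    The random prefix idea can be sketched in a few lines. This is an illustrative reconstruction, not Amazon's implementation, and the action strings are invented: each sampled prefix becomes a training input whose label is the action that immediately follows it, so the model never sees events from its own future.

```python
import random

def random_prefix_augment(sequence, num_samples=3, min_len=2, seed=0):
    """Sample random prefixes of a customer's action sequence.

    Each training example predicts the action that follows its prefix,
    so no future events leak into the model input.
    """
    rng = random.Random(seed)
    examples = []
    for _ in range(num_samples):
        cut = rng.randint(min_len, len(sequence) - 1)  # keep >=1 future action as label
        examples.append((sequence[:cut], sequence[cut]))  # (input prefix, next-action label)
    return examples

history = ["view:shoes", "click:shoes", "view:socks", "buy:shoes", "view:hat"]
for prefix, label in random_prefix_augment(history):
    print(prefix, "->", label)
```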

    MCM’s applications are extensive, empowering diverse personalization projects by providing accurate preference scores for recommendations, customer embeddings for transfer learning, and a pre-trained model for fine-tuning. It excels in next action prediction tasks, outperforming the original BERT4Rec by 17%. For highly specific behaviours, such as those driven by incentives, fine-tuning MCM with task-specific data yields even greater improvements: over 60% uplift in conversion rates for incentive-based recommendations compared to baselines.

    Discover how MCM is shaping the future of personalised e-commerce experiences!

    Find the full paper here: https://assets.amazon.science/d7/a5/d17698634b70925612c07f07a0fa/mcm-a-multi-task-pre-trained-customer-model-for-personalization.pdf

    12 min
  • (LLM RAG-Google) On the Theoretical Limitations of Embedding-Based Retrieval
    2025/09/02

    Welcome to our podcast! Today, we delve into groundbreaking research from Google DeepMind and Johns Hopkins University titled "On the Theoretical Limitations of Embedding-Based Retrieval". This paper uncovers a fundamental flaw in the widely used single-vector embedding paradigm: the number of unique top-k document combinations an embedding model can represent is inherently limited by its dimension.

    Despite the common belief that better training or larger models can overcome these issues, the researchers demonstrate these theoretical limits in surprisingly simple, realistic settings. They introduce LIMIT, a novel dataset that exposes how even state-of-the-art embedding models severely struggle with straightforward tasks, scoring below 20% recall@100 in some cases, because of these theoretical underpinnings. This suggests that existing academic benchmarks may be inadvertently hiding these limitations by testing only a minute fraction of possible query-relevance combinations.
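    The capacity argument can be illustrated with a toy brute-force experiment; this is not the paper's construction, just a sketch of the intuition. With few embedding dimensions, no single-vector query can ever realise more than a fraction of the possible top-k document sets, no matter how many queries you try:

```python
import math
import random

def reachable_topk_sets(doc_vecs, k, num_queries=5000, seed=0):
    """Brute-force how many distinct top-k document sets any single-vector
    query can realise against a fixed set of document embeddings."""
    rng = random.Random(seed)
    d = len(doc_vecs[0])
    seen = set()
    for _ in range(num_queries):
        q = [rng.gauss(0.0, 1.0) for _ in range(d)]
        scores = [sum(qi * vi for qi, vi in zip(q, v)) for v in doc_vecs]
        order = sorted(range(len(doc_vecs)), key=lambda i: -scores[i])
        seen.add(tuple(sorted(order[:k])))  # which k docs made the cut
    return len(seen)

rng = random.Random(1)
n, k = 8, 2
for d in (2, 4, 8):
    docs = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    print(f"d={d}: {reachable_topk_sets(docs, k)} of {math.comb(n, k)} top-{k} sets reachable")
```

    As the embedding dimension shrinks, the count of reachable top-k sets falls well below the combinatorial total, which is the flaw LIMIT is built to expose.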

    This work calls for a re-evaluation of how we approach information retrieval. While single-vector embeddings are powerful, their capacity for handling diverse, instruction-following queries with complex relevance definitions is fundamentally capped. The paper suggests exploring alternative architectures like cross-encoders, multi-vector models, or sparse models to address these limitations. Tune in to understand why pushing the boundaries of current embedding models requires a shift beyond the single-vector paradigm.
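    For a concrete sense of one alternative, here is a minimal sketch of multi-vector "late interaction" scoring (the style used by ColBERT-like models); the vectors are made up for illustration:

```python
def maxsim_score(query_vecs, doc_vecs):
    """Multi-vector 'late interaction' scoring: each query vector matches its
    best document vector and the matches are summed -- one of the alternative
    architectures the paper points to beyond single-vector embeddings."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

# Two query vectors against a three-vector document representation.
print(maxsim_score([[1.0, 0.0], [0.0, 1.0]],
                   [[1.0, 0.0], [0.0, 2.0], [0.5, 0.5]]))  # → 3.0
```

    Because each query vector can pick a different best match, the representable top-k combinations are no longer capped by a single embedding's dimension.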

    Find the full paper at: https://arxiv.org/pdf/2508.21038

    13 min
  • (FM-Pinterest) ItemSage: Learning Product Embeddings for Shopping Recommendations at Pinterest
    2025/09/02

    Welcome to our podcast, where we delve into cutting-edge AI in e-commerce! Today, we're exploring ItemSage, Pinterest's innovative product embedding system for shopping recommendations. Developed by engineers at Pinterest, ItemSage revolutionises how users discover products across Home, Closeup, and Search surfaces.

    A key novelty is its transformer-based architecture, combining both text and image modalities to create rich product representations, significantly outperforming single-modality approaches. ItemSage also leverages multi-task learning to optimise for diverse engagement objectives, including purchases and add-to-cart actions, making the recommendation funnel more efficient, particularly for sparse labels. This unified embedding system, compatible with existing PinSage and SearchSage embeddings, cuts infrastructure and maintenance costs roughly threefold across different recommendation verticals.
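    The shape of the system can be sketched as follows. The attention-style pooling below is a simplified stand-in for ItemSage's transformer encoder, and the feature vectors and head names are hypothetical; the point is that one shared product embedding feeds several per-objective heads:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def pool_modalities(text_feats, image_feats, learned_query):
    """Attention-style pooling over text and image features into one product
    embedding (a simplified stand-in for the transformer encoder, which
    attends over all modality features jointly)."""
    feats = text_feats + image_feats
    weights = softmax([sum(q * f for q, f in zip(learned_query, v)) for v in feats])
    dim = len(feats[0])
    return [sum(w * v[i] for w, v in zip(weights, feats)) for i in range(dim)]

def task_scores(product_emb, heads):
    """One head per engagement objective, all sharing the same product
    embedding -- the multi-task setup."""
    return {task: sum(e * w for e, w in zip(product_emb, head))
            for task, head in heads.items()}

emb = pool_modalities([[0.2, 0.8]], [[0.6, 0.4]], learned_query=[0.5, 0.5])
print(task_scores(emb, {"click": [1.0, 0.0], "add_to_cart": [0.0, 1.0]}))
```

    Sharing the embedding is what lets sparse objectives like purchases borrow signal from denser ones like clicks.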

    While ItemSage has delivered substantial gains—up to +7% Gross Merchandise Value per user and +11% click volume in online A/B experiments—future work aims to enhance text feature modeling with pre-trained Transformers. Join us to understand this powerful system transforming shopping at Pinterest!

    Paper link: https://arxiv.org/pdf/2205.11728

    17 min
  • (LLM Multiagent UCB) Why Multi-Agent LLM Systems Fail: A Taxonomy
    2025/08/18

    Ever wondered why Multi-Agent LLM Systems (MAS) often fall short despite their promise? Researchers at UC Berkeley introduce MAST (Multi-Agent System Failure Taxonomy), the first empirically grounded taxonomy to systematically analyse MAS failures.

    Uncover 14 unique failure modes, organised into three crucial categories: specification issues (system design), inter-agent misalignment (agent coordination), and task verification (quality control). Developed through rigorous human annotation and validated with a scalable LLM-as-a-Judge pipeline, MAST offers a structured framework for diagnosing and understanding these challenges.
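    The three-category structure lends itself to a simple tallying sketch. The category names follow the paper, but the failure-mode strings below are illustrative stand-ins, not the paper's exact fourteen modes:

```python
from collections import Counter

# Three MAST categories from the paper; the example modes are hypothetical.
MAST_CATEGORIES = {
    "specification": {"ambiguous task spec", "ill-defined agent role"},
    "inter_agent_misalignment": {"ignored peer message", "conflicting agent actions"},
    "task_verification": {"no final check", "superficial verification"},
}

def categorize(failure_mode):
    """Map an annotated failure mode to its MAST category."""
    for category, modes in MAST_CATEGORIES.items():
        if failure_mode in modes:
            return category
    raise ValueError(f"unknown failure mode: {failure_mode!r}")

annotated_log = ["ignored peer message", "no final check",
                 "ambiguous task spec", "ignored peer message"]
print(Counter(categorize(m) for m in annotated_log))
```

    Tallying annotated traces this way is roughly what the LLM-as-a-Judge pipeline automates at scale.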

    The findings reveal that most failures stem from fundamental system design challenges and agent coordination issues, rather than just individual LLM limitations, requiring more complex solutions than superficial fixes. MAST provides actionable insights for debugging and development, enabling systematic diagnosis and guiding interventions towards building more robust systems. While currently focused on task correctness, future work will explore critical aspects like efficiency, cost, and security.

    Learn how MAST can help build more reliable and effective multi-agent systems.

    Find the paper here: https://arxiv.org/pdf/2503.13657

    12 min
  • (LLM Application-GOOGLE) Toward Sensor-In-the-Loop LLM Agent: Benchmarks and Implications
    2025/08/05

    Tune into our podcast to explore groundbreaking advancements in AI personal agents! In this episode, we delve into WellMax, a novel sensor-in-the-loop Large Language Model (LLM) agent developed by researchers from the University of Pittsburgh, University of Illinois Urbana-Champaign, and Google.

    WellMax uniquely enhances AI responses by integrating real-time physiological and physical data from wearables, allowing personal agents to understand your context implicitly and automatically. This results in more empathetic and contextually relevant advice compared to non-sensor-informed agents. Imagine an AI tailoring your exercise routine based on your actual activity levels or suggesting stress-reducing activities after a demanding day.
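    The basic mechanics of folding wearable signals into an agent's context can be sketched like this. WellMax's actual prompt format is not described here, so the layout and sensor names are hypothetical:

```python
def build_sensor_context(readings, user_question):
    """Fold recent wearable readings into the LLM prompt so the agent can
    condition its advice on physiological context implicitly.
    (Illustrative sketch; not WellMax's actual prompt format.)"""
    lines = [f"- {name}: {value} {unit}" for name, value, unit in readings]
    return (
        "Recent sensor readings from the user's wearable:\n"
        + "\n".join(lines)
        + "\n\nUser question: " + user_question
        + "\nAnswer with advice tailored to the readings above."
    )

prompt = build_sensor_context(
    [("resting heart rate", 78, "bpm"), ("steps today", 2100, "steps"),
     ("sleep last night", 5.2, "hours")],
    "How should I plan my evening workout?")
print(prompt)
```

    The key property is that the user never has to state their context; the agent reads it from the sensors.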

    However, the journey isn't without its challenges. We discuss the difficulties LLMs face in interpreting raw sensor data, the balance between detailed advice and user choice, and the privacy implications of cloud-based LLMs versus the performance trade-offs with smaller, on-device models like Gemma-2. WellMax paves the way for future AI agents that adapt dynamically to your shifting needs, offering holistic support beyond mere question-answering.

    Learn more about this research in "Toward Sensor-In-the-Loop LLM Agent: Benchmarks and Implications": https://doi.org/10.1145/3715014.3722082

    15 min
  • (Counterfactual-AirBnB) Harnessing the Power of Interleaving and Counterfactual Evaluation for Airbnb Search Ranking
    2025/08/05

    Tune into our podcast as we explore Airbnb's groundbreaking advancements in search ranking evaluation. Traditional A/B testing for significant purchases like accommodation bookings faces challenges: it's time-consuming, with low traffic and delayed feedback. Offline evaluations, while quick, often lack accuracy due to issues like selection bias and disconnect from online metrics.

    To overcome this, Airbnb developed and implemented two novel online evaluation methods: interleaving and counterfactual evaluation. Airbnb's competitive pair-based interleaving method offers an impressive 50X speedup in experimentation velocity compared to traditional A/B tests. For even greater generalizability and sensitivity, its online counterfactual evaluation achieves an astonishing 100X speedup. These methods allow for rapid identification of promising candidates for full A/B tests, significantly streamlining the experimental process.
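    To make the interleaving idea concrete, here is team-draft interleaving, a standard textbook variant; Airbnb's competitive pair-based method differs in its details, and the listing IDs are invented:

```python
import random

def team_draft_interleave(rank_a, rank_b, seed=0):
    """Team-draft interleaving: rankers A and B alternately 'draft' their best
    remaining listing into one merged list, and each slot remembers which
    ranker drafted it."""
    rng = random.Random(seed)
    merged, team = [], {}
    picks = {"A": 0, "B": 0}
    items = set(rank_a) | set(rank_b)
    while len(merged) < len(items):
        # The team with fewer picks goes next; ties are broken randomly.
        a_turn = picks["A"] < picks["B"] or (picks["A"] == picks["B"] and rng.random() < 0.5)
        order = (("A", rank_a), ("B", rank_b)) if a_turn else (("B", rank_b), ("A", rank_a))
        for label, source in order:
            item = next((x for x in source if x not in team), None)
            if item is not None:
                break
        team[item] = label
        merged.append(item)
        picks[label] += 1
    return merged, team

def credit_clicks(team, clicks):
    """Clicks credit the ranker that drafted the clicked listing."""
    wins = {"A": 0, "B": 0}
    for item in clicks:
        wins[team[item]] += 1
    return wins

merged, team = team_draft_interleave(["h1", "h2", "h3"], ["h3", "h1", "h4"])
print(merged, credit_clicks(team, ["h3"]))
```

    Because every searcher sees one blended list, a single session yields a head-to-head comparison, which is where the speedup over split-traffic A/B testing comes from.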

    While interleaving may face limitations with rankers using set-level optimization that can disrupt user experience, counterfactual evaluation provides greater robustness in such scenarios. These innovative techniques are not only proven effective at Airbnb, leading to increased capacity to test new ideas and higher success rates in A/B testing, but are also easily generalizable to other online platforms, especially those with sparse conversion events.
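    The counterfactual side can be sketched as log replay. This is illustrative only; Airbnb's production method adds careful debiasing, and the logs and ranker below are toy placeholders:

```python
def avg_booked_rank(logged_sessions, ranker):
    """Counterfactual-style sketch: replay logged search sessions, re-rank the
    shown candidates with a new ranker, and record where the actually booked
    listing would have landed. A lower average rank suggests a stronger ranker."""
    ranks = []
    for candidates, booked in logged_sessions:
        order = ranker(candidates)
        ranks.append(order.index(booked) + 1)  # 1-based position of the booking
    return sum(ranks) / len(ranks)

# Hypothetical logs: (candidates shown, listing that was booked).
logs = [(["a", "b", "c"], "c"), (["d", "e"], "d")]
by_name = lambda cands: sorted(cands)  # toy candidate ranker
print(avg_booked_rank(logs, by_name))  # → 2.0
```

    Because it reuses logs from live traffic, a new ranker can be scored without serving it, which is why this works well even when conversion events are sparse.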

    Paper Link: https://doi.org/10.1145/3711896.3737232

    21 min
  • (LLM Optimization-MSFT) COLLABLLM: From Passive Responders to Active Collaborators
    2025/07/23

    Tune into our podcast to explore COLLABLLM, a groundbreaking framework redefining human-LLM interactions! Traditional Large Language Models often fall short in complex, open-ended tasks by passively responding and failing to grasp long-term user intent.

    Developed by researchers from Stanford University, Microsoft, and Georgia Tech, COLLABLLM addresses this by incorporating Multiturn-aware Rewards (MR). This innovative approach uses collaborative simulation to estimate the long-term impact of responses, moving beyond immediate rewards to foster active collaboration.
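    The Multiturn-aware Reward can be sketched as a rollout average. The simulator and per-turn reward below are hypothetical placeholders standing in for COLLABLLM's collaborative user simulation and learned reward:

```python
import random

def multiturn_aware_reward(response, simulate_turns, turn_reward,
                           num_rollouts=4, horizon=3, gamma=0.9, seed=0):
    """Sketch of a Multiturn-aware Reward: score a candidate response by its
    immediate reward plus the discounted rewards of simulated future turns,
    averaged over several rollouts."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_rollouts):
        ret = turn_reward(response)  # immediate reward
        for t, turn in enumerate(simulate_turns(response, horizon, rng), start=1):
            ret += (gamma ** t) * turn_reward(turn)  # discounted future turns
        total += ret
    return total / num_rollouts
```

    Training against this score, instead of a single-turn reward, is what pushes the model toward responses that pay off over the whole conversation.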

    COLLABLLM excels in various applications, including:

    • Document creation
    • Code generation
    • Multiturn mathematics problem-solving

    It significantly improves task performance, conversational efficiency, and interactivity, leading to higher user satisfaction and reduced time spent on tasks. While effective overall, some users noted that COLLABLLM can occasionally feel bland, lack up-to-date information, and require more effort for personalisation.

    Discover how COLLABLLM transforms LLMs from passive responders into active collaborators, paving the way for more human-centred AI.

    Read the full paper here: http://arxiv.org/pdf/2502.00640

    16 min
  • (RAG-GOOGLE) MUVERA: Multi-Vector Retrieval via Fixed Dimensional Encodings
    2025/07/20

    Welcome to our podcast! Today, we're diving into MUVERA (Multi-Vector Retrieval Algorithm), a groundbreaking development from researchers at Google Research, UMD, and Google DeepMind. While neural embedding models are fundamental to modern information retrieval (IR), multi-vector models, though superior, are computationally expensive. MUVERA addresses this by ingeniously reducing complex multi-vector similarity search to efficient single-vector search, allowing the use of highly-optimised MIPS (Maximum Inner Product Search) solvers.

    The core innovation is Fixed Dimensional Encodings (FDEs), single-vector proxies for multi-vector similarity that offer the first theoretical guarantees (ε-approximations). Empirically, MUVERA significantly outperforms prior state-of-the-art implementations like PLAID, achieving an average of 10% higher recall with 90% lower latency across diverse BEIR retrieval datasets. It also incorporates product quantization for 32x memory compression of FDEs with minimal quality loss.
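    A stripped-down FDE can be sketched as follows. This is a simplified illustration: the real construction adds repetitions, fill rules, and query/document asymmetry, and the vectors here are invented. The idea is to partition space with random hyperplanes, sum each multi-vector set bucket-by-bucket, and concatenate, so one inner product between two FDEs approximates multi-vector (Chamfer) similarity:

```python
import random

def make_bucketer(dim, num_planes, seed=0):
    """SimHash-style space partition: the sign pattern of a vector against a
    few random hyperplanes picks one of 2**num_planes buckets."""
    rng = random.Random(seed)
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_planes)]
    def bucket(vec):
        bits = 0
        for p in planes:
            bits = (bits << 1) | int(sum(a * b for a, b in zip(p, vec)) > 0)
        return bits
    return bucket, 2 ** num_planes

def fde(vectors, bucket, num_buckets, dim):
    """Fixed Dimensional Encoding sketch: sum a multi-vector set
    bucket-by-bucket into one flat vector of length num_buckets * dim."""
    out = [0.0] * (num_buckets * dim)
    for v in vectors:
        b = bucket(v)
        for i, x in enumerate(v):
            out[b * dim + i] += x
    return out

bucket, nb = make_bucketer(dim=4, num_planes=3)
doc = [[0.2, 0.1, -0.3, 0.5], [0.4, -0.2, 0.1, 0.0]]
query = [[0.1, 0.1, -0.2, 0.4]]
score = sum(a * b for a, b in zip(fde(query, bucket, nb, 4), fde(doc, bucket, nb, 4)))
print(score)
```

    Because the FDE is a single fixed-length vector, the final search step can be handed to any highly-optimised MIPS solver.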

    A current limitation is that MUVERA did not outperform PLAID on the MS MARCO dataset, possibly due to PLAID's extensive tuning for that specific benchmark. Additionally, the effect of the average number of embeddings per document on FDE retrieval quality remains an area for future study. MUVERA's applications primarily lie in enhancing modern IR pipelines, potentially improving the efficiency of components within LLMs.

    Learn more: https://arxiv.org/pdf/2405.19504

    14 min