Episodes

  • (FM-Capital One) TIMeSynC: Temporal Intent Modelling with Synchronized Context Encodings for Financial Service Applications
    2026/02/03

    Welcome to our latest episode where we dive into TIMeSynC, a groundbreaking framework developed by researchers at Capital One to revolutionise intent prediction in financial services. Managing customer journeys across mobile apps, call centres, and web platforms has historically been difficult because data is recorded at vastly different temporal resolutions.

    The novelty of TIMeSynC lies in its encoder-decoder transformer architecture, which employs ALiBi-based time representations and synchronised context encodings to align these heterogeneous data streams. By flattening multi-channel activity into a single tokenised sequence, it eliminates the need for hours of manual feature engineering, allowing the model to learn complex temporal patterns directly.

    In terms of applications, this technology enables highly personalised digital experiences, such as contextual chatbot Q&A, targeted marketing, and predicting a user’s "next best action"—whether that is redeeming rewards or reporting fraud. However, a notable limitation is that flattening data across domains can lead to an "explosion" of the encoder context window, and the results may not yet generalise to datasets with different characteristics. Join us as we explore how TIMeSynC significantly outperforms traditional tabular methods to set a new standard in sequential recommendation.
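
    To make the idea of an ALiBi-based time representation concrete, here is a minimal sketch: instead of biasing attention by token-position distance, the bias grows with the elapsed time between events, so events flattened from different channels remain comparable. The function name and slope value are illustrative, not taken from the paper.

```python
import numpy as np

def alibi_time_bias(timestamps, slope=0.1):
    """Build an ALiBi-style additive attention bias from event timestamps.

    Instead of penalizing token-position distance, penalize elapsed time,
    so events from different channels stay comparable after flattening.
    """
    t = np.asarray(timestamps, dtype=float)
    # |t_i - t_j|: time gap between every pair of events in the sequence
    gap = np.abs(t[:, None] - t[None, :])
    return -slope * gap  # added to attention logits before softmax

# Events from app, call centre, and web, flattened and sorted by time
times = [0.0, 5.0, 5.5, 60.0]
bias = alibi_time_bias(times)
```

    Nearby events (5.0 vs 5.5) receive a mild penalty while distant ones (0.0 vs 60.0) are strongly down-weighted, which is how the model can learn temporal patterns without hand-built recency features.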

    Paper link: https://arxiv.org/pdf/2410.12825

    14 min
  • (FM-Tencent) HunyuanImage 3.0
    2026/02/02

    Welcome to our exploration of HunyuanImage 3.0, a landmark release from the Tencent Hunyuan Foundation Model Team. This episode dives into the novelty of its architecture: a native multimodal model that unifies image understanding and generation within a single autoregressive framework. As the largest open-source image generative model currently available, it utilizes a Mixture-of-Experts (MoE) design with over 80 billion total parameters to balance high capacity with computational efficiency.
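
    The MoE trade-off mentioned above can be sketched in a few lines: only the top-k experts run per token, so the active parameter count stays far below the total. This is a generic top-k routing illustration with toy dimensions, not HunyuanImage 3.0's actual architecture.

```python
import numpy as np

def moe_forward(x, experts, gate_w, top_k=2):
    """Route one token through its top-k experts, softmax-weighted.

    Only top_k of the experts execute, which is how an MoE keeps the
    active compute far below the total parameter count.
    """
    logits = x @ gate_w                        # one gating logit per expert
    idx = np.argsort(logits)[-top_k:]          # indices of the top-k experts
    w = np.exp(logits[idx] - logits[idx].max())
    w /= w.sum()                               # renormalize over chosen experts
    return sum(wi * experts[i](x) for wi, i in zip(w, idx))

rng = np.random.default_rng(0)
d, n_experts = 8, 4
experts = [(lambda W: (lambda x: x @ W))(rng.standard_normal((d, d)))
           for _ in range(n_experts)]
gate_w = rng.standard_normal((d, n_experts))
y = moe_forward(rng.standard_normal(d), experts, gate_w)
```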

    A standout feature is its native Chain-of-Thought (CoT) reasoning, which enables the model to refine abstract concepts and "think" through instructions before synthesizing high-fidelity visual outputs. This process is supported by a rigorous data curation pipeline that filtered over 10 billion images to prioritize aesthetic quality and semantic diversity. Applications for this technology are broad, including sophisticated text-to-image generation, complex prompt-following, and specialized tasks like artistic rendering or text-heavy graphic design.

    Despite its power, there are limitations; the current public release is focused on its text-to-image capabilities, while image-to-image training is still ongoing. Tune in to learn how this foundation model aims to foster a more transparent and vibrant multimodal ecosystem.

    Paper Link: https://arxiv.org/pdf/2509.23951

    20 min
  • (FM Personalize-AMZN) MCM: A multi-task pre-trained customer model for personalization
    2025/09/05

    Welcome to our podcast, where we delve into cutting-edge advancements in personalization! Today, we're highlighting MCM: A Multi-task Pre-trained Customer Model for Personalization, developed at Amazon.

    This innovative BERT-based model, with 10 million parameters, revolutionises how e-commerce platforms deeply understand customer preferences and shopping intents. Its novelty stems from significantly improving the state-of-the-art BERT4Rec framework by handling heterogeneous customer signals and implementing multi-task training. Key innovations include a random prefix augmentation method that avoids leaking future information and a task-aware attentional readout module that generates highly specific representations for different items and tasks.

    MCM’s applications are extensive, empowering diverse personalization projects by providing accurate preference scores for recommendations, customer embeddings for transfer learning, and a pre-trained model for fine-tuning. It excels at next-action prediction tasks, outperforming the original BERT4Rec by 17%. While generally powerful, for highly specific behaviours such as incentive-driven actions, fine-tuning MCM with task-specific data yields even greater improvements: over 60% uplift in conversion rates for incentive-based recommendations compared to baselines.
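
    The random prefix augmentation mentioned above can be sketched simply: each training example sees only events up to a random cut point and predicts the event at the cut, so no future information leaks into the input. Function and event names here are illustrative, not from the paper.

```python
import random

def random_prefix_samples(events, n_samples, seed=0):
    """Draw training examples as random prefixes of a customer's history.

    Each sample's input stops at a random cut point and its target is the
    next event, so the model never trains on the customer's future.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        cut = rng.randint(1, len(events) - 1)        # at least 1 event of context
        samples.append((events[:cut], events[cut]))  # (history, next-event target)
    return samples

history = ["view_A", "add_cart_A", "view_B", "buy_A", "view_C"]
pairs = random_prefix_samples(history, n_samples=3)
```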

    Discover how MCM is shaping the future of personalised e-commerce experiences!

    Find the full paper here: https://assets.amazon.science/d7/a5/d17698634b70925612c07f07a0fa/mcm-a-multi-task-pre-trained-customer-model-for-personalization.pdf

    12 min
  • (LLM RAG-Google) On the Theoretical Limitations of Embedding-Based Retrieval
    2025/09/02

    Welcome to our podcast! Today, we delve into groundbreaking research from Google DeepMind and Johns Hopkins University titled "On the Theoretical Limitations of Embedding-Based Retrieval". This paper uncovers a fundamental flaw in the widely used single-vector embedding paradigm: the number of unique top-k document combinations an embedding model can represent is inherently limited by its dimension.

    Despite the common belief that better training or larger models can overcome these issues, the researchers demonstrate these theoretical limits in surprisingly simple, realistic settings. They introduce LIMIT, a novel dataset that exposes how even state-of-the-art embedding models severely struggle with straightforward tasks, scoring below 20% recall@100 in some cases, due to these theoretical underpinnings. This suggests that existing academic benchmarks might be inadvertently hiding these limitations by testing only a minute fraction of possible query-relevance combinations.
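
    A toy experiment makes the dimension limit tangible: with 2-dimensional embeddings and 12 documents, no matter how many queries you try, only a small fraction of the 66 possible top-2 document pairs can ever be retrieved. This is an illustration of the paper's point with made-up sizes, not the LIMIT dataset itself.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
n_docs, dim, k = 12, 2, 2
docs = rng.standard_normal((n_docs, dim))   # fixed 2-d document embeddings

# Probe with many random query vectors and record each distinct top-2 set
seen = set()
for _ in range(20000):
    q = rng.standard_normal(dim)
    scores = docs @ q
    seen.add(tuple(sorted(np.argsort(scores)[-k:])))

total = len(list(combinations(range(n_docs), k)))  # C(12, 2) = 66 possible sets
# len(seen) stays well below 66: the geometry of a 2-d space caps which
# top-k combinations any query can ever produce.
```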

    This work calls for a re-evaluation of how we approach information retrieval. While single-vector embeddings are powerful, their capacity for handling diverse, instruction-following queries with complex relevance definitions is fundamentally capped. The paper suggests exploring alternative architectures like cross-encoders, multi-vector models, or sparse models to address these limitations. Tune in to understand why pushing the boundaries of current embedding models requires a shift beyond the single-vector paradigm.

    Find the full paper at: https://arxiv.org/pdf/2508.21038

    13 min
  • (FM-Pinterest) ItemSage: Learning Product Embeddings for Shopping Recommendations at Pinterest
    2025/09/02

    Welcome to our podcast, where we delve into cutting-edge AI in e-commerce! Today, we're exploring ItemSage, Pinterest's innovative product embedding system for shopping recommendations. Developed by engineers at Pinterest, ItemSage revolutionises how users discover products across Home, Closeup, and Search surfaces.

    A key novelty is its transformer-based architecture, combining both text and image modalities to create rich product representations, significantly outperforming single-modality approaches. ItemSage also leverages multi-task learning to optimise for diverse engagement objectives, including purchases and add-to-cart actions, making the recommendation funnel more efficient, particularly for sparse labels. This unified embedding system, compatible with existing PinSage and SearchSage embeddings, cuts infrastructure and maintenance costs roughly threefold across different recommendation verticals.
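
    The two ideas above, multimodal fusion into one embedding and multiple task-specific scores on top of it, can be sketched as follows. A single linear fusion stands in for ItemSage's transformer encoder, and the task heads are illustrative; none of the shapes or names come from the paper.

```python
import numpy as np

def product_embedding(text_feats, image_feats, W):
    """Fuse text and image features into one product embedding.

    A single linear map stands in for the transformer encoder here;
    the unit-norm output supports dot-product retrieval.
    """
    x = np.concatenate([text_feats, image_feats])
    v = W @ x
    return v / np.linalg.norm(v)

def multi_task_scores(item_emb, query_emb, task_heads):
    """One shared embedding, one score per engagement objective."""
    return {task: float(item_emb @ H @ query_emb)
            for task, H in task_heads.items()}

rng = np.random.default_rng(0)
d_txt, d_img, d = 16, 16, 8
W = rng.standard_normal((d, d_txt + d_img))
item = product_embedding(rng.standard_normal(d_txt), rng.standard_normal(d_img), W)
heads = {"purchase": np.eye(d), "add_to_cart": rng.standard_normal((d, d))}
scores = multi_task_scores(item, rng.standard_normal(d), heads)
```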

    While ItemSage has delivered substantial gains—up to +7% Gross Merchandise Value per user and +11% click volume in online A/B experiments—future work aims to enhance text feature modeling with pre-trained Transformers. Join us to understand this powerful system transforming shopping at Pinterest!

    Paper link: https://arxiv.org/pdf/2205.11728

    17 min
  • (LLM Multiagent UCB) Why Multi-Agent LLM Systems Fail: A Taxonomy
    2025/08/18

    Ever wondered why Multi-Agent LLM Systems (MAS) often fall short despite their promise? Researchers at UC Berkeley introduce MAST (Multi-Agent System Failure Taxonomy), the first empirically grounded taxonomy to systematically analyse MAS failures.

    Uncover 14 unique failure modes, organised into three crucial categories: specification issues (system design), inter-agent misalignment (agent coordination), and task verification (quality control). Developed through rigorous human annotation and validated with a scalable LLM-as-a-Judge pipeline, MAST offers a structured framework for diagnosing and understanding these challenges.
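
    A taxonomy like MAST is easy to operationalise once annotations exist: map each annotated failure mode to its top-level category and tally. The mode names below are paraphrases for illustration, not the paper's exact 14 labels.

```python
# Hypothetical subset of the taxonomy; the real MAST defines 14 modes.
MAST = {
    "specification": ["disobey task spec", "step repetition"],
    "inter_agent_misalignment": ["ignored other agent's input",
                                 "information withholding"],
    "task_verification": ["premature termination", "incorrect verification"],
}

def category_of(failure_mode):
    """Map an annotated failure mode to its MAST top-level category."""
    for cat, modes in MAST.items():
        if failure_mode in modes:
            return cat
    return "unknown"

# Tally a batch of human (or LLM-as-a-Judge) annotations by category
annotations = ["step repetition", "premature termination", "step repetition"]
counts = {}
for a in annotations:
    cat = category_of(a)
    counts[cat] = counts.get(cat, 0) + 1
```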

    Our findings reveal that most failures stem from fundamental system design challenges and agent coordination issues, rather than just individual LLM limitations, requiring more complex solutions than superficial fixes. MAST provides actionable insights for debugging and development, enabling systematic diagnosis and guiding interventions towards building more robust systems. While currently focused on task correctness, future work will explore critical aspects like efficiency, cost, and security.

    Learn how MAST can help build more reliable and effective multi-agent systems.

    Find the paper here: https://arxiv.org/pdf/2503.13657

    12 min
  • (LLM Application-GOOGLE) Toward Sensor-In-the-Loop LLM Agent: Benchmarks and Implications
    2025/08/05

    Tune into our podcast to explore groundbreaking advancements in AI personal agents! In this episode, we delve into WellMax, a novel sensor-in-the-loop Large Language Model (LLM) agent developed by researchers from the University of Pittsburgh, University of Illinois Urbana-Champaign, and Google.

    WellMax uniquely enhances AI responses by integrating real-time physiological and physical data from wearables, allowing personal agents to understand your context implicitly and automatically. This results in more empathetic and contextually relevant advice compared to non-sensor-informed agents. Imagine an AI tailoring your exercise routine based on your actual activity levels or suggesting stress-reducing activities after a demanding day.
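
    The core mechanism, injecting wearable readings into the prompt so context arrives implicitly, can be sketched in a few lines. The reading names and prompt wording are assumptions for illustration, not WellMax's actual format.

```python
def sensor_context_prompt(question, readings):
    """Prepend recent wearable readings to the user's question.

    A sensor-in-the-loop agent supplies physiological context implicitly,
    so the LLM can tailor advice without the user describing their state.
    """
    lines = [f"- {name}: {value}" for name, value in readings.items()]
    return ("Recent wearable readings:\n" + "\n".join(lines)
            + f"\n\nUser question: {question}")

prompt = sensor_context_prompt(
    "What workout should I do tonight?",
    {"steps_today": 2100, "resting_hr_bpm": 74, "sleep_hours": 5.2},
)
```

    Given low step count and short sleep, a sensor-informed model can suggest a lighter session, which a sensor-blind agent could not do without being told.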

    However, the journey isn't without its challenges. We discuss the difficulties LLMs face in interpreting raw sensor data, the balance between detailed advice and user choice, and the privacy implications of cloud-based LLMs versus the performance trade-offs with smaller, on-device models like Gemma-2. WellMax paves the way for future AI agents that adapt dynamically to your shifting needs, offering holistic support beyond mere question-answering.

    Learn more about this research in "Toward Sensor-In-the-Loop LLM Agent: Benchmarks and Implications": https://doi.org/10.1145/3715014.3722082

    15 min
  • (Counterfactual-AirBnB) Harnessing the Power of Interleaving and Counterfactual Evaluation for Airbnb Search Ranking
    2025/08/05

    Tune into our podcast as we explore Airbnb's groundbreaking advancements in search ranking evaluation. Traditional A/B testing for significant purchases like accommodation bookings faces challenges: it's time-consuming, with low traffic and delayed feedback. Offline evaluations, while quick, often lack accuracy due to issues like selection bias and disconnect from online metrics.

    To overcome this, Airbnb developed and implemented two novel online evaluation methods: interleaving and counterfactual evaluation. Its competitive pair-based interleaving method offers a 50X speedup in experimentation velocity compared to traditional A/B tests. For even greater generalizability and sensitivity, its online counterfactual evaluation achieves a 100X speedup. These methods allow rapid identification of promising candidates for full A/B tests, significantly streamlining the experimental process.
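
    To give a feel for interleaving, here is a sketch of team-draft interleaving, a standard variant in which two rankers alternately "draft" results into one list and clicks on each ranker's picks count as votes for it. This is the textbook algorithm, not necessarily Airbnb's competitive pair-based method.

```python
import random

def team_draft_interleave(ranking_a, ranking_b, seed=0):
    """Merge two rankings so each ranker alternately drafts results.

    Clicks on a ranker's drafted items count as votes for that ranker,
    turning a single results page into a paired comparison.
    """
    rng = random.Random(seed)
    merged, team = [], {}
    pools = {"A": list(ranking_a), "B": list(ranking_b)}
    while pools["A"] or pools["B"]:
        # Randomize which ranker drafts first in each round
        order = ["A", "B"] if rng.random() < 0.5 else ["B", "A"]
        for ranker in order:
            pool = pools[ranker]
            while pool and pool[0] in team:   # skip items already drafted
                pool.pop(0)
            if pool:
                item = pool.pop(0)
                merged.append(item)
                team[item] = ranker
    return merged, team

merged, team = team_draft_interleave(["h1", "h2", "h3"], ["h2", "h4", "h1"])
```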

    While interleaving may face limitations with rankers using set-level optimization that can disrupt user experience, counterfactual evaluation provides greater robustness in such scenarios. These innovative techniques are not only proven effective at Airbnb, leading to increased capacity to test new ideas and higher success rates in A/B testing, but are also easily generalizable to other online platforms, especially those with sparse conversion events.

    Paper Link: https://doi.org/10.1145/3711896.3737232

    21 min