Notes on "A deep dive into the world's smartest email AI"

February 18, 2025

  • https://www.shortwave.com/blog/deep-dive-into-worlds-smartest-email-ai/
  • overview
    • avoid long LLM call chains
      • this leads to data loss and errors at each stage
    • prefer a single LLM call that includes all context needed to answer the question in one prompt
      • the right bet, as reasoning ability and context limits will increase over time
    • common language
      • introduced an abstraction called a "Tool"
      • each type of data is sourced from a different Tool (each tool decides what data needs to be added to the final prompt)
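      • a minimal Python sketch of what such a Tool interface might look like (the names and shapes here are my guesses, not Shortwave's):

            from abc import ABC, abstractmethod
            from dataclasses import dataclass

            @dataclass
            class ToolOutput:
                prompt_context: str   # text the tool wants in the final prompt
                relevance: float      # used later when allocating prompt tokens

            class Tool(ABC):
                """One heterogeneous data source (search, calendar, contacts, ...)."""
                name: str
                description: str      # shown to the LLM during tool selection

                @abstractmethod
                async def retrieve(self, query: str) -> ToolOutput:
                    """Fetch the data this tool contributes to the final prompt."""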
  • generic response workflow (tool selection, tool data retrieval, question answering, post processing)
    • tool selection
      • given a query, determine what data is needed to answer the question
      • requires a deep understanding of the context of the question
      • return 0, 1, or many tools
      • allows integration with multiple, heterogeneous data sources in a modular and scalable way
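      • a sketch of selection as a single LLM call, continuing the Tool sketch above (the prompt wording and the llm.complete client are stand-ins, not a real API):

            import json

            SELECTION_PROMPT = """\
            Decide which of these data sources are needed to answer the
            user's question (zero, one, or many):

            {tool_descriptions}

            Question: {question}

            Reply with a JSON list of tool names, e.g. ["ai_search", "calendar"].
            """

            async def select_tools(question, tools, llm):
                descriptions = "\n".join(f"- {t.name}: {t.description}" for t in tools)
                reply = await llm.complete(SELECTION_PROMPT.format(
                    tool_descriptions=descriptions, question=question))
                chosen = set(json.loads(reply))   # may legitimately be empty
                return [t for t in tools if t.name in chosen]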
    • tool data retrieval
      • retrieve data in parallel
      • tools can make LLM calls of their own, run vector DB queries, run models on our cloud GPU cluster, access our full text search infrastructure, and so on
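      • a sketch of the parallel fan-out, again building on the Tool sketch above:

            import asyncio

            async def retrieve_all(question, tools):
                # Each tool may call its own LLM, query a vector DB, hit full
                # text search, etc.; running them concurrently keeps latency
                # close to the slowest tool rather than the sum of all tools.
                return await asyncio.gather(*(t.retrieve(question) for t in tools))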
    • question answering
      • we have all the data we need
      • create a prompt containing the original user question and all the context information fetched using various tools
      • make tradeoffs about token allocation given various heuristics
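      • one plausible shape for the token allocation (the greedy strategy and budget below are my illustration, not the post's actual heuristics):

            def build_prompt(question, outputs, token_budget=8000):
                # Greedy: highest-relevance context first, until the budget
                # is spent; outputs are ToolOutput values from the tools.
                sections, used = [], 0
                for out in sorted(outputs, key=lambda o: o.relevance, reverse=True):
                    cost = len(out.prompt_context) // 4   # rough chars-per-token
                    if used + cost > token_budget:
                        continue
                    sections.append(out.prompt_context)
                    used += cost
                context = "\n\n".join(sections)
                return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"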
    • post processing
      • convert to the desired output format, add citations, and suggest actions to the user
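      • e.g. a citation pass might look like this (the [msg_...] marker scheme is invented for illustration):

            import re

            def add_citations(answer, sources):
                # Turn inline markers like [msg_123] emitted by the model into
                # links back to the source email; sources maps ids to URLs.
                return re.sub(
                    r"\[(msg_\w+)\]",
                    lambda m: f"[{m.group(1)}]({sources.get(m.group(1), '#')})",
                    answer,
                )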
  • example of a tool (AI search)
    • uses LLMs, embeddings, vector DBs, full text search, metadata-based search, cross encoding models, and rule-based heuristics
    • workflow
      • query reformulation
        • takes a query that lacks needed context, and rewrites it using an LLM so that it makes sense on its own
        • considers any relevant information
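        • a minimal sketch of the rewrite call (prompt wording and the llm.complete client are stand-ins):

              REFORMULATE_PROMPT = """\
              Rewrite the latest question so it makes sense on its own,
              resolving pronouns and references using the conversation so far.

              Conversation:
              {history}

              Latest question: {question}

              Standalone question:"""

              async def reformulate(question, history, llm):
                  return await llm.complete(REFORMULATE_PROMPT.format(
                      history="\n".join(history), question=question))

        • e.g. "did she reply?" might become "Did Alice reply to the email about the Q3 budget?" given the chat history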
      • feature extraction and traditional search
        • given a query, extract any relevant features that may exist and assign a confidence score (this can be done with a large number of parallel calls to a fast LLM; see the sketch after this list)
          • recency bias extractor
          • keyword extractor
          • named entity extractor
          • date range extractor
          • query embeddings
        • query the full text search engine
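        • a sketch of the parallel extractor fan-out (the extractor prompts and the fast_llm client are invented for illustration):

              import asyncio

              EXTRACTOR_PROMPTS = {
                  "recency_bias":   "Does this query favor recent emails?",
                  "keywords":       "Extract search keywords from this query.",
                  "named_entities": "List any people or companies mentioned.",
                  "date_range":     "Extract any explicit date range.",
              }

              async def extract_features(query, fast_llm):
                  # One cheap LLM call per extractor, all in parallel; each
                  # would also return a confidence score in practice.
                  async def run(name, prompt):
                      return name, await fast_llm.complete(f"{prompt}\n\nQuery: {query}")
                  pairs = await asyncio.gather(
                      *(run(n, p) for n, p in EXTRACTOR_PROMPTS.items()))
                  return dict(pairs)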
      • embedding-based vector search
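        • roughly: embed the reformulated query and pull nearest-neighbor email fragments (embedder and vector_db here are placeholder clients):

              async def vector_search(query, embedder, vector_db, k=50):
                  query_vector = await embedder.embed(query)
                  return await vector_db.query(vector=query_vector, top_k=k)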
      • fast heuristic re-ranking
        • we end up with a bunch of content from both traditional search and vector search, with all relevant metadata
        • we need to cut through a lot of low quality potential results
        • boosting is done using a two-phase approach
        • phase 1 uses the similarity score of the content (however, this is solely representative of the content, not of any metadata, nor does it account for shortcomings in the embedding model)
        • phase 2 applies a series of local heuristics based on the features extracted
          • the date range extractor applies a Gaussian filter
          • boosting done for named entities, labels, recency bias
          • penalties are applied for certain keywords
        • this determines the approximate rankings of all content returned
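        • a sketch of the two phases together (all weights and data shapes below are invented for illustration):

              import math
              from datetime import datetime

              def heuristic_score(fragment, features):
                  # Phase 1: start from the raw embedding similarity score.
                  score = fragment["similarity"]

                  # Phase 2: local heuristics over the extracted features.
                  if "date_range" in features:
                      center, sigma_days = features["date_range"]   # (datetime, width in days)
                      days_off = abs((fragment["date"] - center).days)
                      # Gaussian falloff: fragments near the requested dates
                      # keep their score, far-away ones are damped smoothly.
                      score *= math.exp(-(days_off ** 2) / (2 * sigma_days ** 2))

                  for entity in features.get("named_entities", []):
                      if entity.lower() in fragment["text"].lower():
                          score *= 1.3   # boost named-entity matches

                  if features.get("recency_bias"):
                      age_days = (datetime.now() - fragment["date"]).days
                      score *= 1.0 / (1.0 + age_days / 365)   # mild recency boost

                  for word in features.get("penalty_keywords", []):
                      if word in fragment["text"].lower():
                          score *= 0.5   # penalize flagged keywords

                  return score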
      • slow cross-encoder re-ranking
        • most powerful technique, smarter than the above heuristics, but much slower, using an open source model - http://huggingface.co/cross-encoder/ms-marco-MiniLM-L-12-v2
        • query + selected content (after heuristic ranking) -> score each fragment using the cross-encoder model -> reapply heuristics (helps boost or penalise fragments, and addresses any inconsistencies)
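        • a sketch using the named model via the sentence-transformers library (the re-ranking glue around it is my own):

              from sentence_transformers import CrossEncoder

              # Scoring every (query, fragment) pair with a full model is
              # what makes this stage slow relative to the heuristics above.
              model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")

              def cross_encoder_rerank(query, fragments):
                  pairs = [(query, f["text"]) for f in fragments]
                  scores = model.predict(pairs)   # one relevance score per pair
                  for f, s in zip(fragments, scores):
                      f["score"] = float(s)
                  # The heuristics above are then reapplied on top of these
                  # scores to boost/penalise fragments and fix inconsistencies.
                  return sorted(fragments, key=lambda f: f["score"], reverse=True)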