Notes on "A deep dive into the world's smartest email AI"

February 18, 2025

  • https://www.shortwave.com/blog/deep-dive-into-worlds-smartest-email-ai/
  • overview
    • avoid long LLM call chains
      • this leads to data loss and errors at each stage
    • prefer a single LLM call that includes all context needed to answer the question in one prompt
      • the right bet, as reasoning ability and context limits will increase over time
    • common language
      • introduced an abstraction called a "Tool"
      • each type of data is sourced from a different Tool (each tool decides what data needs to be added to the final prompt)
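      • a minimal Python sketch of what such a Tool interface might look like (the names and shapes here are my guesses, not Shortwave's):

            from abc import ABC, abstractmethod
            from dataclasses import dataclass

            @dataclass
            class ToolOutput:
                prompt_context: str   # text the tool wants in the final prompt
                relevance: float      # used later when allocating prompt tokens

            class Tool(ABC):
                """One heterogeneous data source (search, calendar, contacts, ...)."""
                name: str
                description: str      # shown to the LLM during tool selection

                @abstractmethod
                async def retrieve(self, query: str) -> ToolOutput:
                    """Fetch the data this tool contributes to the final prompt."""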
  • generic response workflow (tool selection, tool data retrieval, question answering, post processing)
    • tool selection
      • given a query, determine what data is needed to answer the question
      • requires a deep understanding of the context of the question
      • return 0, 1, or many tools
      • allows integration with multiple, heterogeneous data sources in a modular and scalable way
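      • a sketch of selection as a single LLM call, continuing the Tool sketch above (the prompt wording and the llm.complete client are stand-ins, not a real API):

            import json

            SELECTION_PROMPT = """\
            Decide which of these data sources are needed to answer the
            user's question (zero, one, or many):

            {tool_descriptions}

            Question: {question}

            Reply with a JSON list of tool names, e.g. ["ai_search", "calendar"].
            """

            async def select_tools(question, tools, llm):
                descriptions = "\n".join(f"- {t.name}: {t.description}" for t in tools)
                reply = await llm.complete(SELECTION_PROMPT.format(
                    tool_descriptions=descriptions, question=question))
                chosen = set(json.loads(reply))   # may legitimately be empty
                return [t for t in tools if t.name in chosen]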
    • tool data retrieval
      • retrieve data in parallel
      • tools can make LLM calls of their own, run vector DB queries, run models on our cloud GPU cluster, access our full text search infrastructure, and so on
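      • a sketch of the parallel fan-out, again building on the Tool sketch above:

            import asyncio

            async def retrieve_all(question, tools):
                # Each tool may call its own LLM, query a vector DB, hit full
                # text search, etc.; running them concurrently keeps latency
                # close to the slowest tool rather than the sum of all tools.
                return await asyncio.gather(*(t.retrieve(question) for t in tools))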
    • question answering
      • we have all the data we need
      • create a prompt containing the original user question and all the context information fetched using various tools
      • make tradeoffs about token allocation given various heuristics
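      • one plausible shape for the token allocation (the greedy strategy and budget below are my illustration, not the post's actual heuristics):

            def build_prompt(question, outputs, token_budget=8000):
                # Greedy: highest-relevance context first, until the budget
                # is spent; outputs are ToolOutput values from the tools.
                sections, used = [], 0
                for out in sorted(outputs, key=lambda o: o.relevance, reverse=True):
                    cost = len(out.prompt_context) // 4   # rough chars-per-token
                    if used + cost > token_budget:
                        continue
                    sections.append(out.prompt_context)
                    used += cost
                context = "\n\n".join(sections)
                return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"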
    • post processing
      • convert to the desired output format, add citations, and suggest actions to the user
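      • e.g. a citation pass might look like this (the [msg_...] marker scheme is invented for illustration):

            import re

            def add_citations(answer, sources):
                # Turn inline markers like [msg_123] emitted by the model into
                # links back to the source email; sources maps ids to URLs.
                return re.sub(
                    r"\[(msg_\w+)\]",
                    lambda m: f"[{m.group(1)}]({sources.get(m.group(1), '#')})",
                    answer,
                )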
  • example of a tool (AI search)
    • uses LLMs, embeddings, vector DBs, full text search, metadata-based search, cross encoding models, and rule-based heuristics
    • workflow
      • query reformulation
        • takes a query that lacks needed context, and rewrites it using an LLM so that it makes sense on its own
        • considers any relevant information
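        • a minimal sketch of the rewrite call (prompt wording and the llm.complete client are stand-ins):

              REFORMULATE_PROMPT = """\
              Rewrite the latest question so it makes sense on its own,
              resolving pronouns and references using the conversation so far.

              Conversation:
              {history}

              Latest question: {question}

              Standalone question:"""

              async def reformulate(question, history, llm):
                  return await llm.complete(REFORMULATE_PROMPT.format(
                      history="\n".join(history), question=question))

        • e.g. "did she reply?" might become "Did Alice reply to the email about the Q3 budget?" given the chat history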
      • feature extraction and traditional search
        • given a query, extract any relevant features that may exist and assign a confidence score (this can be done with a large number of parallel calls to a fast LLM; see the sketch after this list)
          • recency bias extractor
          • keyword extractor
          • named entity extractor
          • date range extractor
          • query embeddings
        • query the full text search engine
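        • a sketch of the parallel extractor fan-out (the extractor prompts and the fast_llm client are invented for illustration):

              import asyncio

              EXTRACTOR_PROMPTS = {
                  "recency_bias":   "Does this query favor recent emails?",
                  "keywords":       "Extract search keywords from this query.",
                  "named_entities": "List any people or companies mentioned.",
                  "date_range":     "Extract any explicit date range.",
              }

              async def extract_features(query, fast_llm):
                  # One cheap LLM call per extractor, all in parallel; each
                  # would also return a confidence score in practice.
                  async def run(name, prompt):
                      return name, await fast_llm.complete(f"{prompt}\n\nQuery: {query}")
                  pairs = await asyncio.gather(
                      *(run(n, p) for n, p in EXTRACTOR_PROMPTS.items()))
                  return dict(pairs)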
      • embedding-based vector search
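        • roughly: embed the reformulated query and pull nearest-neighbor email fragments (embedder and vector_db here are placeholder clients):

              async def vector_search(query, embedder, vector_db, k=50):
                  query_vector = await embedder.embed(query)
                  return await vector_db.query(vector=query_vector, top_k=k)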
      • fast heuristic re-ranking
        • we end up with a bunch of content from both traditional search and vector search, with all relevant metadata
        • we need to cut through a lot of low quality potential results
        • boosting is done using a two-phase approach
        • phase 1 uses the similarity score of the content (however, this is solely representative of the content, not of any metadata, nor does it account for shortcomings in the embedding model)
        • phase 2 applies a series of local heuristics based on the features extracted
          • the date range extractor applies a Gaussian filter
          • boosting done for named entities, labels, recency bias
          • penalties are applied for certain keywords
        • this determines the approximate rankings of all content returned
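        • a sketch of the two phases together (all weights and data shapes below are invented for illustration):

              import math
              from datetime import datetime

              def heuristic_score(fragment, features):
                  # Phase 1: start from the raw embedding similarity score.
                  score = fragment["similarity"]

                  # Phase 2: local heuristics over the extracted features.
                  if "date_range" in features:
                      center, sigma_days = features["date_range"]   # (datetime, width in days)
                      days_off = abs((fragment["date"] - center).days)
                      # Gaussian falloff: fragments near the requested dates
                      # keep their score, far-away ones are damped smoothly.
                      score *= math.exp(-(days_off ** 2) / (2 * sigma_days ** 2))

                  for entity in features.get("named_entities", []):
                      if entity.lower() in fragment["text"].lower():
                          score *= 1.3   # boost named-entity matches

                  if features.get("recency_bias"):
                      age_days = (datetime.now() - fragment["date"]).days
                      score *= 1.0 / (1.0 + age_days / 365)   # mild recency boost

                  for word in features.get("penalty_keywords", []):
                      if word in fragment["text"].lower():
                          score *= 0.5   # penalize flagged keywords

                  return score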
      • slow cross-encoder re-ranking
        • most powerful technique, smarter than the above heuristics, but much slower, using an open source model - http://huggingface.co/cross-encoder/ms-marco-MiniLM-L-12-v2
        • query + selected content (after heuristic ranking) -> score each fragment using the cross-encoder model -> reapply heuristics (helps boost or penalise fragments, and addresses any inconsistencies)
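        • a sketch using the named model via the sentence-transformers library (the re-ranking glue around it is my own):

              from sentence_transformers import CrossEncoder

              # Scoring every (query, fragment) pair with a full model is
              # what makes this stage slow relative to the heuristics above.
              model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")

              def cross_encoder_rerank(query, fragments):
                  pairs = [(query, f["text"]) for f in fragments]
                  scores = model.predict(pairs)   # one relevance score per pair
                  for f, s in zip(fragments, scores):
                      f["score"] = float(s)
                  # The heuristics above are then reapplied on top of these
                  # scores to boost/penalise fragments and fix inconsistencies.
                  return sorted(fragments, key=lambda f: f["score"], reverse=True)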