Notes on "A deep dive into the world's smartest email AI"
February 18, 2025
- https://www.shortwave.com/blog/deep-dive-into-worlds-smartest-email-ai/
- overview
- avoid long LLM call chains
- this leads to data loss and errors at each stage
- prefer a single LLM call that includes all context needed to answer the question in one prompt (sketch below)
- right bet as reasoning and context limits will increase over time
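- sketch (mine, not Shortwave's) of the single-call pattern; `llm_complete` is a hypothetical stand-in for a real LLM API:

```python
# Instead of chaining calls (summarize -> extract -> answer), where each hop
# can drop or distort information, gather all context up front and ask once.
def answer_in_one_call(question: str, context_chunks: list[str]) -> str:
    prompt = "\n\n".join([
        "Answer the question using only the context below.",
        "Context:",
        *context_chunks,
        f"Question: {question}",
    ])
    return llm_complete(prompt)  # one LLM call, all context included

def llm_complete(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM API client call."""
    raise NotImplementedError
```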
- common language
- introduced an abstraction called a "Tool"
- each type of data is sourced from a different Tool (each tool decides what data needs to be added to the final prompt; interface sketch below)
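- interface sketch (my assumption of the shape, based on the description above):

```python
from typing import Protocol

class Tool(Protocol):
    """Each data source implements this: it alone decides what context
    it contributes to the final prompt."""
    name: str
    description: str  # shown to the tool-selection LLM

    async def fetch_context(self, query: str) -> list[str]:
        """Return prompt-ready text fragments relevant to the query."""
        ...
```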
- generic response workflow (tool selection, tool data retrieval, question answering, post processing)
- tool selection
- given a query, determine what data is needed to answer the question
- requires a deep understanding of the context of the question
- return 0, 1, or many tools
- allows integration with multiple, heterogeneous data sources in a modular and scalable way (selection sketch below)
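- selection sketch (the prompt wording and JSON contract are mine; `llm_complete` is the stub from the first sketch):

```python
import json

def select_tools(query: str, tools: dict[str, "Tool"]) -> list["Tool"]:
    """Ask a fast LLM which registered tools this query needs (0, 1, or many)."""
    catalog = "\n".join(f"- {name}: {t.description}" for name, t in tools.items())
    prompt = (
        "Given the user query, reply with a JSON array naming the tools "
        f"needed to answer it (possibly empty).\n\nTools:\n{catalog}\n\n"
        f"Query: {query}"
    )
    chosen = json.loads(llm_complete(prompt))
    return [tools[name] for name in chosen if name in tools]
```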
- tool data retrieval
- retrieve data in parallel
- tools can make LLM calls of their own, run vector DB queries, run models on our cloud GPU cluster, access our full text search infrastructure, and so on (fan-out sketch below)
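- fan-out sketch (assumes the async Tool protocol sketched earlier):

```python
import asyncio

async def retrieve_all(query: str, selected: list["Tool"]) -> dict[str, list[str]]:
    """Run every selected tool concurrently; each may hit an LLM, a vector DB,
    a GPU model, or full text search internally."""
    results = await asyncio.gather(*(t.fetch_context(query) for t in selected))
    return {t.name: frags for t, frags in zip(selected, results)}
```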
- question answering
- we have all the data we need
- create a prompt containing the original user question and all the context information fetched using various tools
- make tradeoffs about how tokens are allocated across tool outputs, guided by various heuristics (budgeting sketch below)
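- budgeting sketch (the budget size, priority order, and chars-to-tokens estimate are all my assumptions):

```python
def build_prompt(question: str, per_tool: dict[str, list[str]],
                 budget_tokens: int = 8000) -> str:
    """Give each tool's output a share of the context window, dropping
    fragments once the budget is exhausted."""
    sections: list[str] = []
    used = 0
    for tool_name, frags in per_tool.items():  # assume dict ordered by priority
        for frag in frags:
            cost = len(frag) // 4              # crude chars -> tokens estimate
            if used + cost > budget_tokens:
                break
            sections.append(f"[{tool_name}] {frag}")
            used += cost
    return "\n".join(sections + [f"\nQuestion: {question}"])
```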
- post processing
- convert to the desired output format, add citations, and suggest actions to the user (citation sketch below)
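- citation sketch (purely illustrative: assumes the answering prompt asked the model to cite fragments as [1], [2], ...):

```python
import re

def add_citations(answer: str, source_urls: list[str]) -> str:
    """Turn [n] markers in the model's answer into markdown links back to
    the n-th source fragment."""
    def link(m: re.Match) -> str:
        idx = int(m.group(1)) - 1
        if 0 <= idx < len(source_urls):
            return f"[[{idx + 1}]]({source_urls[idx]})"
        return m.group(0)
    return re.sub(r"\[(\d+)\]", link, answer)
```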
- example of a tool (AI search)
- uses LLMs, embeddings, vector DBs, full text search, metadata-based search, cross encoding models, and rule-based heuristics
- workflow
- query reformulation
- takes a query that lacks needed context, and rewrites it using an LLM so that it makes sense on its own
- considers any relevant information, e.g. the surrounding conversation (sketch below)
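- reformulation sketch (prompt wording is mine; `llm_complete` is the stub from the first sketch):

```python
def reformulate(query: str, history: list[str]) -> str:
    """Rewrite a context-dependent query so it stands alone."""
    prompt = (
        "Rewrite the final user query so it is fully self-contained, "
        "resolving pronouns and references using the conversation.\n\n"
        "Conversation:\n" + "\n".join(history) +
        f"\n\nFinal query: {query}\n\nRewritten query:"
    )
    return llm_complete(prompt)
```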
- feature extraction and traditional search
- given a query, extract any relevant features that may exist and assign a confidence score (this can be done using a large number of parallel calls to a fast LLM; fan-out sketch below)
- recency bias extractor
- keyword extractor
- named entity extractor
- date range extractor
- query embeddings
- query the full text search engine
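- extraction fan-out sketch (extractor names from the list above; the JSON contract and async helper are my assumptions):

```python
import asyncio
import json

EXTRACTORS = ["recency bias", "keywords", "named entities", "date range"]

async def llm_complete_async(prompt: str) -> str:
    """Hypothetical async stand-in for a fast-LLM API call."""
    raise NotImplementedError

async def extract_features(query: str) -> dict[str, dict]:
    """One small prompt per feature, all run in parallel; each returns the
    feature value plus a confidence score."""
    async def run(kind: str) -> dict:
        prompt = (f"Extract the {kind} from this search query. Reply as JSON: "
                  '{"value": ..., "confidence": 0.0-1.0}.\n'
                  f"Query: {query}")
        return json.loads(await llm_complete_async(prompt))
    results = await asyncio.gather(*(run(k) for k in EXTRACTORS))
    return dict(zip(EXTRACTORS, results))
```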
- embedding-based vector search
- keyword and metadata-based searches are not enough
- use an open source embedding model (usage sketch below) - https://huggingface.co/hkunlp/instructor-large
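- usage sketch for the linked model (instruction strings are my guesses; instructor models pair each text with a task instruction):

```python
import numpy as np
from InstructorEmbedding import INSTRUCTOR  # pip install InstructorEmbedding

model = INSTRUCTOR("hkunlp/instructor-large")
emails = ["Attaching the renewed Acme contract.", "Lunch on Friday?"]
doc_vecs = model.encode([["Represent the email for retrieval:", e]
                         for e in emails])
query_vec = model.encode([["Represent the email search query:",
                           "Acme contract renewal"]])[0]

# cosine similarity against the corpus (a real system would use a vector DB)
scores = doc_vecs @ query_vec / (
    np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec))
```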
- fast heuristic re-ranking
- we end up with a bunch of content from both traditional search and vector search, with all relevant metadata
- we need to cut through a lot of low quality potential results
- boosting is done using a two-phase approach (scoring sketch below)
- phase 1 is the similarity score of the content (however, this is solely representative of the content itself: it reflects no metadata and does not account for shortcomings in the embedding model)
- phase 2 is applying a series of local heuristics based on the features extracted
- date range extractor applies a gaussian filter
- boosting done for named entities, labels, recency bias
- penalties are applied for certain keywords
- this determines the approximate rankings of all content returned
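- scoring sketch (the sigma, boost factors, and recency decay are all assumed values, not Shortwave's):

```python
import math
from datetime import datetime

def heuristic_score(similarity: float, sent_at: datetime,
                    target_date: datetime | None, entity_hit: bool,
                    recency_weight: float) -> float:
    score = similarity                                  # phase 1: content only
    if target_date is not None:                         # phase 2: local heuristics
        days_off = abs((sent_at - target_date).days)
        score *= math.exp(-(days_off ** 2) / (2 * 30 ** 2))  # gaussian, sigma=30d
    if entity_hit:
        score *= 1.5                                    # named-entity boost
    age_days = (datetime.now() - sent_at).days
    score *= 1.0 + recency_weight * math.exp(-age_days / 90)  # recency bias
    return score
```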
- slow cross-encoder re-ranking
- most powerful technique, smarter than the above heuristics, but much slower; uses an open source model (usage sketch below) - http://huggingface.co/cross-encoder/ms-marco-MiniLM-L-12-v2
- query + selected content (after heuristic ranking) -> score each fragment using the cross encoder model -> reapply heuristics (helps boost or penalise fragments, and addresses any inconsistencies)
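- usage sketch for the linked cross encoder (via sentence-transformers; the example data is made up):

```python
from sentence_transformers import CrossEncoder  # pip install sentence-transformers

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")

query = "Acme contract renewal"
fragments = ["Attaching the renewed Acme contract.", "Lunch on Friday?"]
# the cross encoder scores each (query, fragment) pair jointly
scores = model.predict([(query, f) for f in fragments])
reranked = sorted(zip(fragments, scores), key=lambda p: -p[1])
```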