GPTs for financial statement analysis

May 27, 2024

Methodology:

  • questions asked
    • will earnings grow or decline in the following year?
    • is economic performance sustainable?
  • no textual information was used, only numeric data from standardised financial statements (see the data-preparation sketch after this list)
    • Compustat annual financial data from 1968 to 2021
    • anonymised by company name and financial period
  • evaluations were made against human analyst forecasts, a logistic regression model, and a small artificial neural network (see Results)
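
As a rough illustration only (not the paper's actual pipeline), here is a minimal sketch of how the anonymised, standardised inputs and the next-year earnings-direction label could be built from Compustat-style annual data; the file name and column mnemonics (gvkey, fyear, sale, cogs, ni, at, lt) are assumptions.

```python
import pandas as pd

# Hypothetical Compustat-style annual file; the column mnemonics are assumptions.
df = pd.read_csv("compustat_annual.csv")  # gvkey, fyear, sale, cogs, ni, at, lt, ...

# Label for the first question: does net income grow or decline in the following year?
df = df.sort_values(["gvkey", "fyear"])
df["ni_next"] = df.groupby("gvkey")["ni"].shift(-1)
df["earnings_up"] = (df["ni_next"] > df["ni"]).astype(int)

def anonymise(row: pd.Series) -> str:
    """Render a standardised statement with company name and calendar period removed."""
    return (
        "Balance sheet (year t):\n"
        f"  Total assets: {row['at']:.1f}\n"
        f"  Total liabilities: {row['lt']:.1f}\n"
        "Income statement (year t):\n"
        f"  Sales: {row['sale']:.1f}\n"
        f"  Cost of goods sold: {row['cogs']:.1f}\n"
        f"  Net income: {row['ni']:.1f}"
    )

statements = df.dropna(subset=["ni_next"]).apply(anonymise, axis=1)
```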

Considerations:

  • how to prevent look-ahead bias
    • analysis was done outside of the model's training window
    • anonymised by company name and financial period
  • LLM specifics
    • use of chain-of-thought prompting increases accuracy and F1 score
    • self-evaluation
      • confidence scores (0 to 1) were both generated by the model and derived from token-level log-probability values (a sketch of both follows this list)
      • narrative insights were generated by the model based solely on numeric data
        • these evaluations were informative about future performance
    • GPT-4 is the best (followed by Gemini 1.5 and GPT-3.5)
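
A hedged sketch of the self-evaluation point above, assuming the OpenAI Python SDK and a "gpt-4-turbo" model name (the prompt wording is mine, not the paper's): the model is asked for a direction plus a self-stated confidence, and a second confidence estimate is read off the log probability of the direction token.

```python
import math
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

def predict_direction(statement: str, model: str = "gpt-4-turbo"):
    """Ask for next year's earnings direction and derive a confidence score two ways."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        logprobs=True,  # expose token-level log probabilities
        messages=[{
            "role": "user",
            "content": (
                "Based only on the anonymised financial statements below, will earnings "
                "increase or decrease next year? Answer with one word, then give a "
                "confidence score between 0 and 1.\n\n" + statement
            ),
        }],
    )
    choice = response.choices[0]
    answer_text = choice.message.content  # includes the model's self-stated confidence

    # Second estimate: the probability the model assigned to the direction token itself.
    token_confidence = None
    for tok in choice.logprobs.content:
        if tok.token.strip().lower().strip(".,") in ("increase", "decrease"):
            token_confidence = math.exp(tok.logprob)
            break
    return answer_text, token_confidence
```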

Results:

  • it looks like the GPT model used in this study (GPT-4 Turbo) is evaluated against a logistic regression model and a 4-layer neural net (a sketch of both baselines follows at the end of this list)
    • both models use 59 financial predictors from a 1989 study
    • the neural net has 4 layers (1 input / 2 hidden / 1 output), compared to the 100+ layers in GPT-4 or other modern neural nets
    • the performance differences between the models don't seem to be statistically significant
  • where humans are good
    • soft information, with access to broader context not available to LLMs
  • where humans are less good
    • mitigating biases, information inefficiencies, disagreements
  • where LLMs are good
    • general purpose models, with the ability to quickly analyse large amounts of unstructured data
    • surprisingly good at quantitative tasks that require intuition and human-like reasoning
    • less common data patterns (smaller firm sizes)
  • where LLMs are less good
    • lack of deep numerical reasoning
    • negative time trend in prediction accuracy
      • fail to take into account contextual knowledge of the current macroeconomic environment (although I imagine this can be improved outside of this study)
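
For context on the baselines mentioned above, a minimal scikit-learn sketch of a logistic regression and a small ANN with two hidden layers; the random matrix is only a placeholder for the 59 financial predictors, which I haven't reproduced here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder random data stands in for the 59 financial predictors per firm-year;
# y is the earnings-direction label (1 = increase, 0 = decrease).
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 59))
y = rng.integers(0, 2, size=5000)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

logit = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
ann = make_pipeline(
    StandardScaler(),
    # input layer -> 2 hidden layers -> output layer, mirroring the small net in the study
    MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0),
)

for name, model in [("logistic regression", logit), ("small ANN", ann)]:
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print(f"{name}: accuracy={accuracy_score(y_test, pred):.3f}, F1={f1_score(y_test, pred):.3f}")
```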

Things that stand out:

  • the strength of LLMs on quantitative tasks that require intuition and human-like reasoning (given that no textual information was used in this study)
  • the ability to identify trends in less common data patterns (where bias or information inefficiencies may affect decision-making)
  • simple prompt engineering can have a big impact
    • I've attached a screenshot of the prompt process below; it's quite brilliant how this can capture the nuance of an analyst's thought process
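
Since the screenshot isn't reproduced here, this is a paraphrased sketch (my wording, not the paper's) of that kind of chain-of-thought prompt: it walks the model through trend analysis, ratio analysis, synthesis, and a final prediction with a confidence score.

```python
# A paraphrased chain-of-thought prompt template, walking the model through an
# analyst-style workflow before it commits to a prediction.
ANALYST_COT_PROMPT = """\
You are a financial analyst. You are given standardised, anonymised balance sheets
and income statements for a firm for years t-2, t-1 and t.

Step 1 - Trend analysis: describe how key statement items change over the period.
Step 2 - Ratio analysis: compute and interpret relevant financial ratios
         (e.g. operating margin, asset turnover, leverage, current ratio).
Step 3 - Synthesis: explain what the trends and ratios imply about the firm's
         economic performance and whether it is sustainable.
Step 4 - Prediction: state whether earnings will increase or decrease in year t+1,
         the expected magnitude, and a confidence score between 0 and 1.

{financial_statements}
"""

print(ANALYST_COT_PROMPT.format(financial_statements="<anonymised statements go here>"))
```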