GPTs for financial statement analysis
May 27, 2024
Methodology:
- questions asked
- will earnings grow or decline in the following year?
- is economic performance sustainable?
- no textual information was used, only numeric data from standardised financial statements
- Compustat annual financial data from 1968 to 2021
- statements were anonymised (company names and financial periods removed) - see the prompt-building sketch after this list
- evaluations were made against
- historical analyst forecasts (obtained from the IBES dataset - https://www.investopedia.com/terms/i/ibes.asp)
- narrowly specialised ML models (a logistic regression model and an artificial neural network)
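A minimal sketch (my own Python illustration, not the paper's code) of how an anonymised, standardised statement might be rendered into a numeric-only prompt; the line items, relative-year labels and question wording are assumptions:

```python
# Hypothetical helper: turn standardised line items into a numeric-only prompt.
# Company name and calendar dates are deliberately omitted so the model cannot
# recall the firm from its training data (one of the look-ahead safeguards).
def build_statement_prompt(rows: dict[str, tuple[float, float]]) -> str:
    """rows maps a standardised line item to (value in year t-1, value in year t)."""
    lines = [
        "Standardised financial statement (values in $ millions):",
        f"{'line item':<25}{'t-1':>12}{'t':>12}",
    ]
    for item, (prev, curr) in rows.items():
        lines.append(f"{item:<25}{prev:>12,.1f}{curr:>12,.1f}")
    lines.append(
        "Based only on the figures above, will earnings increase or decrease in year t+1?"
    )
    return "\n".join(lines)

print(build_statement_prompt({
    "total assets": (1250.0, 1380.5),
    "total liabilities": (640.2, 701.8),
    "net income": (92.4, 88.1),
}))
```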
Considerations:
- how to prevent look-ahead bias
- analysis was done outside of the training window
- statements were anonymised (company names and financial periods removed)
- LLM specifics
- using chain-of-thought prompting increases accuracy / F1 score
- self-evaluation
- confidence scores (0 to 1) were both generated by the model itself and derived from token-level log-probability values (see the sketch after this list)
- narrative insights were generated by the model based solely on numeric data
- these evaluations were informative about future performance
- GPT-4 is the best (followed by Gemini 1.5 and GPT-3.5)
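A minimal sketch of the log-probability path mentioned above, using the OpenAI chat completions API; the model name, prompt wording, and the assumption that the answer fits in a single token are mine, not the paper's. The self-reported 0-to-1 confidence score would be a separate question to the model; this sketch only covers the token-probability signal.

```python
import math
from openai import OpenAI

client = OpenAI()

def directional_call(statement_prompt: str) -> tuple[str, float]:
    """Ask for a one-word increase/decrease call and read its token-level log probability."""
    resp = client.chat.completions.create(
        model="gpt-4-turbo",  # assumed model name for illustration
        messages=[
            {"role": "system",
             "content": "You are a financial analyst. Answer with exactly one word: increase or decrease."},
            {"role": "user", "content": statement_prompt},
        ],
        max_tokens=1,
        logprobs=True,     # request token-level log probabilities
        top_logprobs=2,
    )
    tok = resp.choices[0].logprobs.content[0]   # the single generated token
    answer = tok.token.strip().lower()
    confidence = math.exp(tok.logprob)          # convert log prob to a 0-1 probability
    return answer, confidence
```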
Results:
- it looks like the GPT model used in this study (GPT-4 Turbo) is evaluated against a logistic regression model and a 4-layer neural net (see the baseline sketch after this list)
- both models use 59 financial predictors from a 1989 study
- the neural net has 4 layers (1 input / 2 hidden / 1 output), compared to the 100+ layers in GPT-4 or other modern neural nets
- the results don't seem to be statistically significant
- where are humans good
- soft information, with access to broader context not available to LLMs
- where humans are less good
- mitigating biases, information inefficiencies, disagreements
- where LLMs are good
- general purpose models, with the ability to quickly analyse large amounts of unstructured data
- surprisingly good at quantitative tasks which require intuition and human-like reasoning
- less common data patterns (smaller firm sizes)
- where LLMs are less good
- lack of deep numerical reasoning
- negative time trend in prediction accuracy
- fail to take into account contextual knowledge of the current macroeconomic environment (although I imagine this can be improved outside of this study)
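For reference, a minimal sketch of baselines of this shape (a logistic regression and an input / 2 hidden / output neural net) over 59 features; the synthetic data, feature construction and scikit-learn hyperparameters here are assumptions, not the paper's setup:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 59))  # 59 financial predictors (placeholder random features)
y = (X[:, :5].sum(axis=1) + rng.normal(size=5000) > 0).astype(int)  # earnings up/down label

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Logistic regression baseline
logit = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Small ANN: input layer, 2 hidden layers, output layer (4 layers in total)
ann = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0).fit(X_tr, y_tr)

print("logistic regression F1:", f1_score(y_te, logit.predict(X_te)))
print("small ANN F1:          ", f1_score(y_te, ann.predict(X_te)))
```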
Things that stand out:
- the strength of LLMs on quantitative tasks which require intuition and human-like reasoning (given that no textual information was used in this study)
- the ability to identify trends in less common data patterns (where bias or information inefficiencies may affect decision-making)
- simple prompt engineering can make a great impact
- I've attached a screenshot of the prompt process below; it's quite brilliant how this can capture the nuance of an analyst's thought process (a rough paraphrase of the idea follows)
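For my own notes, a rough paraphrase of what a chain-of-thought prompt in that style might look like; the wording below is mine, not the paper's actual prompt:

```python
# My paraphrase of an analyst-style chain-of-thought prompt (not the paper's wording).
ANALYST_COT_PROMPT = """\
You are given anonymised, standardised financial statements.
Work through the following steps before answering:
1. Identify notable year-over-year trends in the key line items.
2. Compute the ratios an analyst would check (e.g. margins, liquidity, leverage).
3. Interpret what these trends and ratios imply about operating performance.
4. Conclude: will earnings increase or decrease next year? Give a confidence
   between 0 and 1 and a one-paragraph rationale.
"""
```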