GPTs for financial statement analysis
May 27, 2024
Methodology:
- questions asked
- will earnings grow or decline in the following year?
- is economic performance sustainable?
- no textual information was used, only numeric data from standardised financial statements
- Compustat annual financial data from 1968 to 2021
- statements were anonymised (company names and financial periods removed) - see the prompt-building sketch after this list
- evaluations were made against
- historical analyst forecasts (obtained from the IBES dataset - https://www.investopedia.com/terms/i/ibes.asp)
- narrowly specialised ML models (a logistic regression model and an artificial neural network)
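A minimal sketch (my own Python illustration, not the paper's code) of how an anonymised, standardised statement might be rendered into a numeric-only prompt; the line items, relative-year labels and question wording are assumptions:

```python
# Hypothetical helper: turn standardised line items into a numeric-only prompt.
# Company name and calendar dates are deliberately omitted so the model cannot
# recall the firm from its training data (one of the look-ahead safeguards).
def build_statement_prompt(rows: dict[str, tuple[float, float]]) -> str:
    """rows maps a standardised line item to (value in year t-1, value in year t)."""
    lines = [
        "Standardised financial statement (values in $ millions):",
        f"{'line item':<25}{'t-1':>12}{'t':>12}",
    ]
    for item, (prev, curr) in rows.items():
        lines.append(f"{item:<25}{prev:>12,.1f}{curr:>12,.1f}")
    lines.append(
        "Based only on the figures above, will earnings increase or decrease in year t+1?"
    )
    return "\n".join(lines)

print(build_statement_prompt({
    "total assets": (1250.0, 1380.5),
    "total liabilities": (640.2, 701.8),
    "net income": (92.4, 88.1),
}))
```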
Considerations:
- how to prevent look-ahead bias
- analysis was done outside of the training window
- statements were anonymised (company names and financial periods removed)
- LLM specifics
- using chain-of-thought prompting increases accuracy / F1 score
- self-evaluation
- confidence scores (0 to 1) were both generated by the model itself and derived from token-level log-probability values (see the sketch after this list)
- narrative insights were generated by the model based solely on numeric data
- these evaluations were informative about future performance
- GPT-4 is the best (followed by Gemini 1.5 and GPT-3.5)
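A minimal sketch of the log-probability path mentioned above, using the OpenAI chat completions API; the model name, prompt wording, and the assumption that the answer fits in a single token are mine, not the paper's. The self-reported 0-to-1 confidence score would be a separate question to the model; this sketch only covers the token-probability signal.

```python
import math
from openai import OpenAI

client = OpenAI()

def directional_call(statement_prompt: str) -> tuple[str, float]:
    """Ask for a one-word increase/decrease call and read its token-level log probability."""
    resp = client.chat.completions.create(
        model="gpt-4-turbo",  # assumed model name for illustration
        messages=[
            {"role": "system",
             "content": "You are a financial analyst. Answer with exactly one word: increase or decrease."},
            {"role": "user", "content": statement_prompt},
        ],
        max_tokens=1,
        logprobs=True,     # request token-level log probabilities
        top_logprobs=2,
    )
    tok = resp.choices[0].logprobs.content[0]   # the single generated token
    answer = tok.token.strip().lower()
    confidence = math.exp(tok.logprob)          # convert log prob to a 0-1 probability
    return answer, confidence
```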
Results:
- it looks like the GPT model used in this study (GPT-4 Turbo) is evaluated against a logistic regression model and a 4-layer neural net (see the baseline sketch after this list)
- both models use 59 financial predictors from a 1989 study
- the neural net has 4 layers (1 input / 2 hidden / 1 output), compared to the 100+ layers in GPT-4 or other modern neural nets
- the results don't seem to be statistically significant
- where are humans good
- soft information, with access to broader context not available to LLMs
- where humans are less good
- mitigating biases, information inefficiencies, disagreements
- where LLMs are good
- general purpose models, with the ability to quickly analyse large amounts of unstructured data
- surprisingly good at quantitative tasks which require intuition and human-like reasoning
- less common data patterns (smaller firm sizes)
- where LLMs are less good
- lack of deep numerical reasoning
- negative time trend in prediction accuracy
- fail to take into account contextual knowledge of the current macroeconomic environment (although I imagine this can be improved outside of this study)
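For reference, a minimal sketch of baselines of this shape (a logistic regression and an input / 2 hidden / output neural net) over 59 features; the synthetic data, feature construction and scikit-learn hyperparameters here are assumptions, not the paper's setup:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 59))  # 59 financial predictors (placeholder random features)
y = (X[:, :5].sum(axis=1) + rng.normal(size=5000) > 0).astype(int)  # earnings up/down label

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Logistic regression baseline
logit = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Small ANN: input layer, 2 hidden layers, output layer (4 layers in total)
ann = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0).fit(X_tr, y_tr)

print("logistic regression F1:", f1_score(y_te, logit.predict(X_te)))
print("small ANN F1:          ", f1_score(y_te, ann.predict(X_te)))
```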
Things that stand out:
- the strength of LLMs on quantitative tasks which require intuition and human-like reasoning (given that no textual information was used in this study)
- the ability to identify trends in less common data patterns (where bias or information inefficiencies may affect decision-making)
- simple prompt engineering can make a great impact
- I've attached a screenshot of the prompt process below; it's quite brilliant how this can capture the nuance of an analyst's thought process (a rough paraphrase of the idea follows)
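For my own notes, a rough paraphrase of what a chain-of-thought prompt in that style might look like; the wording below is mine, not the paper's actual prompt:

```python
# My paraphrase of an analyst-style chain-of-thought prompt (not the paper's wording).
ANALYST_COT_PROMPT = """\
You are given anonymised, standardised financial statements.
Work through the following steps before answering:
1. Identify notable year-over-year trends in the key line items.
2. Compute the ratios an analyst would check (e.g. margins, liquidity, leverage).
3. Interpret what these trends and ratios imply about operating performance.
4. Conclude: will earnings increase or decrease next year? Give a confidence
   between 0 and 1 and a one-paragraph rationale.
"""
```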