Autonomous code generation and completion using the SWE-bench benchmark
January 12, 2025
- this post from openai references some of the libraries used to evaluate autonomous code completion / generation tasks (they call them scaffolds, kind of like agents) - https://openai.com/index/introducing-swe-bench-verified/
- this one tops the leaderboard and has been getting some talk across tech twitter - https://github.com/OpenAutoCoder/Agentless / https://huggingface.co/papers/2407.01489 (labels itself as agentless lol, but uses a two-step approach that first localizes the relevant code and then repairs it; rough sketch after this list)
- others include - https://github.com/Aider-AI/aider, https://github.com/aorwall/moatless-tree-search, https://github.com/SWE-agent/SWE-agent, https://github.com/nus-apr/auto-code-rover
- this is the benchmark used for evaluating llms on real-world software issues - https://github.com/swe-bench/SWE-bench / https://www.swebench.com/ (see the dataset-loading snippet below)
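
Rough sketch of the Agentless-style two-step flow (localize, then repair), just to make the idea concrete. `call_llm`, the prompts, and the file-ranking scheme here are placeholders of mine, not the actual Agentless implementation:

```python
# Sketch of a localize-then-repair pipeline in the spirit of Agentless.
# `call_llm` is a placeholder for whatever model client you use; prompts and
# file selection are illustrative, not the real Agentless prompts.
from pathlib import Path


def call_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to an LLM and return its text response."""
    raise NotImplementedError


def localize(issue: str, repo_root: str, top_k: int = 3) -> list[str]:
    """Step 1: ask the model which files are most likely to need edits."""
    files = [str(p.relative_to(repo_root)) for p in Path(repo_root).rglob("*.py")]
    prompt = (
        f"Issue:\n{issue}\n\nRepository files:\n" + "\n".join(files)
        + f"\n\nList the {top_k} files most likely to require changes, one per line."
    )
    return call_llm(prompt).splitlines()[:top_k]


def repair(issue: str, repo_root: str, files: list[str]) -> str:
    """Step 2: ask the model for a unified diff over the localized files."""
    context = "\n\n".join(
        f"### {f}\n{(Path(repo_root) / f).read_text()}" for f in files
    )
    prompt = (
        f"Issue:\n{issue}\n\n{context}\n\n"
        "Produce a unified diff that fixes the issue."
    )
    return call_llm(prompt)


def solve(issue: str, repo_root: str) -> str:
    return repair(issue, repo_root, localize(issue, repo_root))
```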
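
And a quick way to poke at the benchmark data itself, assuming the Hugging Face dataset ID `princeton-nlp/SWE-bench_Verified` and field names like `instance_id` / `problem_statement` (worth double-checking against the dataset card):

```python
# Load SWE-bench Verified from the Hugging Face Hub and inspect one instance.
# Dataset ID, split name, and field names are assumptions based on the public
# dataset card; adjust if they differ.
from datasets import load_dataset

ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
print(len(ds), "instances")

example = ds[0]
print(example["instance_id"])               # unique id for the issue instance
print(example["repo"])                      # GitHub repository the issue comes from
print(example["problem_statement"][:300])   # the real issue text a scaffold is given
```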