$ Eddie Berman / Publications / Shinn et al. (1)

We show that giving large language models an episodic memory buffer yields state-of-the-art performance on a wide variety of benchmarks, including HumanEval (programming), ALFWorld (decision making), and HotpotQA (reasoning). We also release Leetcode Hard Gym, a programming benchmark for LLM agents.