Crafter

Crafter is an open-source 2D survival game designed for research on strong generalization, deep exploration, and long-term reasoning in reinforcement learning. It is a Minecraft-inspired, procedurally generated environment that combines resource gathering, crafting, and combat. The game also includes a comprehensive set of tasks and achievements, enabling researchers to evaluate agent performance across multiple objectives and time scales. To enable interaction with language models, we use the same language wrapper as proposed in @wu2023smartplay.
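To give a feel for what a language wrapper does, the sketch below turns a small symbolic view of the map into a text observation. It is not the wrapper from @wu2023smartplay: the grid format, the function names `direction` and `describe_view`, and the output wording are all assumptions made for illustration.

```python
# Hypothetical sketch of a language wrapper: symbolic grid -> text observation.
# Grid layout, names, and phrasing are illustrative assumptions, not the real wrapper.

def direction(dx, dy):
    """Name the compass direction of an offset (y grows downward, i.e. south)."""
    vertical = "north" if dy < 0 else "south" if dy > 0 else ""
    horizontal = "west" if dx < 0 else "east" if dx > 0 else ""
    return "-".join(part for part in (vertical, horizontal) if part)


def describe_view(grid, player_pos):
    """Describe the nearest instance of each object type around the player."""
    px, py = player_pos
    nearest = {}  # object name -> (Manhattan distance, direction)
    for y, row in enumerate(grid):
        for x, name in enumerate(row):
            if (x, y) == (px, py):
                continue  # skip the player's own cell
            dx, dy = x - px, y - py
            dist = abs(dx) + abs(dy)
            if name not in nearest or dist < nearest[name][0]:
                nearest[name] = (dist, direction(dx, dy))
    # Closest objects first; ties keep the order in which they were found.
    ordered = sorted(nearest.items(), key=lambda item: item[1][0])
    lines = [f"- {name} {dist} steps to your {where}" for name, (dist, where) in ordered]
    return "You see:\n" + "\n".join(lines)


# A made-up 3x3 view with the player in the centre cell.
example_grid = [
    ["tree", "grass", "grass"],
    ["grass", "player", "stone"],
    ["grass", "grass", "water"],
]
print(describe_view(example_grid, (1, 1)))
```

A text observation like this can be placed directly in a language model's prompt, which is the role the wrapper plays for the language-only results.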

Examples of Crafter's unique, procedurally generated maps.

Crafter Results

Standard errors are computed over 10 seeds. GPT-4o leads in language-only mode (33.10%), and Gemini-1.5-Pro leads in vision-language mode (33.50%). Surprisingly, Llama 3.2 90B's performance drops sharply when images are added, from 31.69% to 10.00% average progress, falling below its smaller 11B counterpart (23.63%).
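The "mean ± standard error" entries can be reproduced from per-seed scores along the following lines. This is a minimal sketch that assumes the usual definition (sample standard deviation divided by the square root of the number of seeds); the function name and the example scores are hypothetical and not taken from the benchmark.

```python
import math
import statistics


def mean_and_standard_error(scores):
    """Return (mean, standard error of the mean) for a list of per-seed scores."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two seeds to estimate a standard error")
    mean = statistics.fmean(scores)
    # Sample standard deviation (n - 1 denominator) divided by sqrt(n).
    sem = statistics.stdev(scores) / math.sqrt(n)
    return mean, sem


# Hypothetical progress values (%) for 10 seeds, for illustration only.
example_scores = [27.3, 36.4, 31.8, 22.7, 40.9, 31.8, 27.3, 36.4, 31.8, 45.5]
mean, sem = mean_and_standard_error(example_scores)
print(f"{mean:.2f} ± {sem:.2f}")
```

With only 10 seeds the standard error is itself a noisy estimate, so small gaps between neighbouring models should be read with caution.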

LLM results

Model              Average Progress (%)
-----------------  --------------------
gpt-4o             33.10 ± 2.32
llama-3.2-90B-it   31.69 ± 1.36
llama-3.1-70B-it   31.31 ± 2.68
gemini-1.5-pro     30.21 ± 2.86
llama-3.2-11B-it   26.20 ± 3.30
gemini-1.5-flash   20.00 ± 0.74
gpt-4o-mini        12.72 ± 1.13

VLM results

Model              Average Progress (%)
-----------------  --------------------
gemini-1.5-pro     33.50 ± 2.07
gpt-4o             26.81 ± 3.74
llama-3.2-11B-it   23.63 ± 1.48
gemini-1.5-flash   20.70 ± 4.43
gpt-4o-mini        19.91 ± 3.13
llama-3.2-90B-it   10.00 ± 1.13
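To compare the two modes model by model, the short sketch below subtracts each model's language-only average progress from its vision-language score. The numbers are copied from the LLM and VLM results; the helper name `modality_deltas` is an illustrative assumption, and models that appear in only one set of results (such as llama-3.1-70B-it) are skipped.

```python
# Average progress (%) copied from the LLM and VLM results above.
llm = {
    "gpt-4o": 33.10, "llama-3.2-90B-it": 31.69, "llama-3.1-70B-it": 31.31,
    "gemini-1.5-pro": 30.21, "llama-3.2-11B-it": 26.20,
    "gemini-1.5-flash": 20.00, "gpt-4o-mini": 12.72,
}
vlm = {
    "gemini-1.5-pro": 33.50, "gpt-4o": 26.81, "llama-3.2-11B-it": 23.63,
    "gemini-1.5-flash": 20.70, "gpt-4o-mini": 19.91, "llama-3.2-90B-it": 10.00,
}


def modality_deltas(llm_scores, vlm_scores):
    """Return {model: VLM score - LLM score} for models present in both."""
    shared = llm_scores.keys() & vlm_scores.keys()
    return {m: round(vlm_scores[m] - llm_scores[m], 2) for m in shared}


deltas = modality_deltas(llm, vlm)
# Print from the largest drop to the largest gain.
for model, delta in sorted(deltas.items(), key=lambda item: item[1]):
    print(f"{model:18s} {delta:+.2f}")
```

These differences ignore the reported standard errors, so changes of only a few points may lie within seed-to-seed noise.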

Observations

TODO