Crafter
Crafter is an open-source 2D survival game designed for research on strong generalization, deep exploration, and long-term reasoning in reinforcement learning. It is a Minecraft-inspired, procedurally generated environment that combines resource gathering, crafting, and combat. The game also includes a comprehensive set of tasks and achievements, which lets researchers evaluate agent performance across multiple objectives and time scales. To enable interaction with language models, we use the language wrapper proposed in @wu2023smartplay.

Crafter Results
Standard errors are computed over 10 seeds. GPT-4o leads in language-only mode (33.10%), and Gemini-1.5-Pro leads in vision-language mode (33.50%). Surprisingly, Llama 3.2 90B degrades sharply when images are added: its average progress drops from 31.69% to 10.00%, below that of its smaller 11B counterpart (23.63%).
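As a rough illustration of how a "mean ± standard error" entry can be derived from per-seed scores, here is a minimal sketch. It assumes the standard error of the mean with the sample standard deviation (n - 1 denominator); the source does not state which estimator was used. The function name and the example values are hypothetical and are not benchmark data.

```python
import math
import statistics


def mean_and_standard_error(progress_per_seed):
    """Return (mean, standard error of the mean) for per-seed progress values."""
    n = len(progress_per_seed)
    if n < 2:
        raise ValueError("need at least two seeds to estimate a standard error")
    mean = statistics.fmean(progress_per_seed)
    # Assumption: sample standard deviation (n - 1 denominator) divided by sqrt(n).
    sem = statistics.stdev(progress_per_seed) / math.sqrt(n)
    return mean, sem


# Hypothetical progress percentages for 10 seeds (illustrative only).
example_runs = [30.0, 35.0, 25.0, 40.0, 30.0, 35.0, 25.0, 40.0, 30.0, 35.0]
mean, sem = mean_and_standard_error(example_runs)
print(f"{mean:.2f} ± {sem:.2f}")
```

With only 10 seeds the standard error is itself a noisy estimate, so small gaps between neighboring rows should be read with caution.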
LLM results
| Model | Average Progress (%) |
|---|---|
| gpt-4o | 33.10 ± 2.32 |
| llama-3.2-90B-it | 31.69 ± 1.36 |
| llama-3.1-70B-it | 31.31 ± 2.68 |
| gemini-1.5-pro | 30.21 ± 2.86 |
| llama-3.2-11B-it | 26.20 ± 3.30 |
| gemini-1.5-flash | 20.00 ± 0.74 |
| gpt-4o-mini | 12.72 ± 1.13 |
VLM results
| Model | Average Progress (%) |
|---|---|
| gemini-1.5-pro | 33.50 ± 2.07 |
| gpt-4o | 26.81 ± 3.74 |
| llama-3.2-11B-it | 23.63 ± 1.48 |
| gemini-1.5-flash | 20.70 ± 4.43 |
| gpt-4o-mini | 19.91 ± 3.13 |
| llama-3.2-90B-it | 10.00 ± 1.13 |
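To make the effect of adding images easier to see, the sketch below computes the per-model change in mean average progress between the two modes. The means are copied from the LLM and VLM results; the variable names are illustrative, and standard errors are ignored here, so the differences are not a significance test.

```python
# Mean average progress (%) copied from the language-only (LLM) results.
llm_progress = {
    "gpt-4o": 33.10,
    "llama-3.2-90B-it": 31.69,
    "llama-3.1-70B-it": 31.31,
    "gemini-1.5-pro": 30.21,
    "llama-3.2-11B-it": 26.20,
    "gemini-1.5-flash": 20.00,
    "gpt-4o-mini": 12.72,
}

# Mean average progress (%) copied from the vision-language (VLM) results.
vlm_progress = {
    "gemini-1.5-pro": 33.50,
    "gpt-4o": 26.81,
    "llama-3.2-11B-it": 23.63,
    "gemini-1.5-flash": 20.70,
    "gpt-4o-mini": 19.91,
    "llama-3.2-90B-it": 10.00,
}

# Change when images are added, for models that appear in both sets of results.
deltas = {
    model: round(vlm_progress[model] - llm_progress[model], 2)
    for model in llm_progress
    if model in vlm_progress
}

# Print from the largest drop to the largest gain.
for model, delta in sorted(deltas.items(), key=lambda item: item[1]):
    print(f"{model:<20} {delta:+.2f}")
```

Models listed in only one mode (here llama-3.1-70B-it) are skipped rather than treated as zero.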
Observations
TODO