Evaluation

⚡ Quickstart

Experiments are run using the eval.py script found at the root of the BALROG repo. Simply run:

python eval.py envs.names="babyai,nle" client.base_url=<your_vllm_server_base_url>

Here the evaluation environments are specified as a comma-separated list. By default, experiment results are saved to the ./results directory.
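Each run writes its results to a timestamped subfolder, e.g. results/2024-10-30/16-20-30_custom_gpt-4o-mini-2024-07-18 (see the resume section below for how to pick an unfinished run back up).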

Evaluate using a local vLLM server

We support running LLMs/VLMs out of the box using vLLM. You can spin up a vLLM server and evaluate your agent on BALROG as follows:

vllm serve meta-llama/Llama-3.2-1B-Instruct --port 8080

python eval.py \
  agent.type=custom \
  agent.max_image_history=0 \
  agent.max_history=16 \
  eval.num_workers=16 \
  client.client_name=vllm \
  client.model_id=meta-llama/Llama-3.2-1B-Instruct \
  client.base_url=http://0.0.0.0:8080/v1
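Before launching the evaluation, you can sanity-check that the server is up: vLLM's OpenAI-compatible server exposes a models endpoint that lists what it is serving.

curl http://0.0.0.0:8080/v1/models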

Check out the vLLM docs for more options on how to serve your models quickly and efficiently.
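For example, a rough sketch of serving a model sharded across two GPUs with a capped context length (both are standard vLLM flags; the values are illustrative):

vllm serve meta-llama/Llama-3.2-1B-Instruct \
  --port 8080 \
  --tensor-parallel-size 2 \
  --max-model-len 8192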

Evaluate using an API

We support out-of-the-box clients for the OpenAI, Anthropic, and Google Gemini APIs. To evaluate an agent using one of these APIs, you first have to set up your API key in one of two ways:

You can either export it directly:

export OPENAI_API_KEY=<KEY>
export ANTHROPIC_API_KEY=<KEY>
export GEMINI_API_KEY=<KEY>

Or you can add your API keys to the SECRETS file.
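A minimal sketch of what that might look like, assuming the file holds one key per line in KEY=value form (check the repo's SECRETS template for the exact format):

OPENAI_API_KEY=<KEY>
ANTHROPIC_API_KEY=<KEY>
GEMINI_API_KEY=<KEY>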

You can then run the evaluation with:

python eval.py \
  agent.type=custom \
  agent.max_image_history=0 \
  agent.max_history=16 \
  eval.num_workers=16 \
  client.client_name=openai \
  client.model_id=gpt-4o-mini-2024-07-18
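The same pattern should carry over to the other API clients by swapping client.client_name and client.model_id; the client name and model id below are assumptions, so check the client config for the exact values:

python eval.py \
  agent.type=custom \
  agent.max_image_history=0 \
  agent.max_history=16 \
  eval.num_workers=16 \
  client.client_name=anthropic \
  client.model_id=claude-3-5-sonnet-20240620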

🖼️ VLM mode

You can activate VLM mode by increasing the agent.max_image_history argument, for example:

python eval.py \
  agent.type=custom \
  agent.max_history=16 \
  agent.max_image_history=1 \
  eval.num_workers=16 \
  client.client_name=openai \
  client.model_id=gpt-4o-mini-2024-07-18
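With agent.max_image_history=0, as in the earlier examples, the agent receives text-only observations; a value of n should attach the images of the last n observations to the prompt.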

▶️ Resume an evaluation

To resume an incomplete evaluation, use eval.resume_from. For example, if an evaluation in the folder results/2024-10-30/16-20-30_custom_gpt-4o-mini-2024-07-18 is unfinished, resume it with:

python eval.py \
  agent.type=custom \
  agent.max_image_history=0 \
  agent.max_history=16 \
  eval.num_workers=16 \
  client.client_name=openai \
  client.model_id=gpt-4o-mini-2024-07-18 \
  eval.resume_from=results/2024-10-30/16-20-30_custom_gpt-4o-mini-2024-07-18
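The path passed to eval.resume_from is the output folder Hydra created for the original run; keep the other overrides identical so the resumed run matches the original configuration.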

⚙️ Configuring Eval

eval.py is configured using Hydra. We list some options below. For more details, refer to the eval config.

| Parameter | Description | Default Value |
|---|---|---|
| agent.type | Type of agent used. | icl |
| agent.remember_cot | Whether the agent should remember its chain of thought (CoT) during episodes. | True |
| agent.max_history | Maximum number of dialogue history entries to retain. | 16 |
| eval.num_workers | Number of parallel environment workers for evaluation. | 1 |
| eval.num_episodes | Number of episodes per environment. | {nle: 5, minihack: 5, babyai: 25, ...} |
| eval.save_trajectories | Whether to save agent trajectories during evaluation. | True |
| eval.icl_episodes | Number of in-context learning episodes. | 1 |
| eval.icl_dataset | Dataset used for in-context learning, generally a path to the demonstrations. | demos |
| client.client_name | Name of the client to use (openai, anthropic, gemini, or vllm for a local vLLM server). | openai |
| client.model_id | Identifier of the model to evaluate. | gpt-4o |
| client.base_url | Base URL of the model server for API requests. | http://localhost:8080/v1 |
| client.is_chat_model | Whether the model exposes a chat-based interface. | True |
| client.generate_kwargs.temperature | Sampling temperature for model responses. | 0.0 |
| envs.names | Comma-separated list of environments to evaluate, e.g. "nle,minihack". | nle |
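As a final sketch, several of the options above can be combined in one command line; the per-environment episode override assumes eval.num_episodes is a dict keyed by environment name, as its default value suggests:

python eval.py \
  agent.type=icl \
  agent.remember_cot=False \
  eval.num_episodes.babyai=5 \
  eval.save_trajectories=False \
  client.generate_kwargs.temperature=0.5 \
  envs.names="babyai,minihack"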