iclbench package

Submodules

iclbench.client module

class iclbench.client.ClaudeWrapper(client_config)[source]

Bases: LLMClientWrapper

__init__(client_config)[source]
convert_messages(messages)[source]
generate(messages)[source]
class iclbench.client.GoogleGenerativeAIWrapper(client_config)[source]

Bases: LLMClientWrapper

__init__(client_config)[source]
cache_icl_demo(icl_demo)[source]
convert_messages(messages)[source]
extract_completion(response)[source]

Safely extracts and returns the completion from the response.

generate(messages)[source]
get_completion(converted_messages, max_retries=5, delay=5)[source]
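
The max_retries and delay parameters of get_completion imply a retry loop around the underlying Gemini request. A minimal sketch of that pattern, assuming a hypothetical call_api callable standing in for the wrapped API call (this is not the package's actual implementation):

    import time

    def get_completion_with_retries(call_api, converted_messages, max_retries=5, delay=5):
        # Retry the (hypothetical) API call up to max_retries times,
        # sleeping `delay` seconds between failed attempts.
        last_error = None
        for attempt in range(max_retries):
            try:
                return call_api(converted_messages)
            except Exception as err:  # real code would catch the SDK's specific errors
                last_error = err
                time.sleep(delay)
        raise RuntimeError(f"no completion after {max_retries} attempts") from last_error
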
class iclbench.client.LLMClientWrapper(client_config)[source]

Bases: object

__init__(client_config)[source]
generate(messages)[source]
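
All of the concrete wrappers above specialize this base class with the same shape: __init__ stores the client configuration, convert_messages translates the benchmark's message format into the provider's, and generate performs the call. A toy subclass illustrating the interface; the attribute names on client_config and the message format are assumptions:

    from iclbench.client import LLMClientWrapper

    class EchoWrapper(LLMClientWrapper):
        # Illustrative only: echoes the last user message instead of
        # calling a real provider.
        def __init__(self, client_config):
            super().__init__(client_config)
            self.model_id = getattr(client_config, "model_id", "echo")  # assumed field

        def convert_messages(self, messages):
            # Assumes messages are {"role": ..., "content": ...} dicts.
            return [m["content"] for m in messages if m.get("role") == "user"]

        def generate(self, messages):
            converted = self.convert_messages(messages)
            return converted[-1] if converted else ""
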
class iclbench.client.LLMResponse(model_id, completion, stop_reason, input_tokens, output_tokens, reasoning)

Bases: tuple

completion

Alias for field number 1

input_tokens

Alias for field number 3

model_id

Alias for field number 0

output_tokens

Alias for field number 4

reasoning

Alias for field number 5

stop_reason

Alias for field number 2
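
Since LLMResponse is a namedtuple, fields can be read by name or by position; the field-number aliases above give the positional order. The values below are placeholders:

    from iclbench.client import LLMResponse

    resp = LLMResponse(
        model_id="example-model",  # field 0
        completion="Hello!",       # field 1
        stop_reason="stop",        # field 2
        input_tokens=12,           # field 3
        output_tokens=3,           # field 4
        reasoning=None,            # field 5
    )

    assert resp.completion == resp[1]  # name and position agree
    total_tokens = resp.input_tokens + resp.output_tokens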

class iclbench.client.OpenAIWrapper(client_config)[source]

Bases: LLMClientWrapper

__init__(client_config)[source]
convert_messages(messages)[source]
generate(messages)[source]
class iclbench.client.ReplicateWrapper(client_config)[source]

Bases: LLMClientWrapper

__init__(client_config)[source]
generate(messages)[source]
iclbench.client.create_llm_client(client_config)[source]

Factory function to create the appropriate LLM client based on the client name.
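
A hedged usage sketch; the shape of client_config (a client name plus model settings) is an assumption, not a documented contract:

    from types import SimpleNamespace
    from iclbench.client import create_llm_client

    # Hypothetical config: real code passes the benchmark's Config object.
    client_config = SimpleNamespace(client_name="openai", model_id="gpt-4o")
    client = create_llm_client(client_config)
    response = client.generate([{"role": "user", "content": "Hello"}])
    print(response.completion)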

iclbench.client.process_image_claude(image)[source]
iclbench.client.process_image_openai(image)[source]
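
There are two helpers because the providers expect different image payloads: Anthropic's messages API takes a base64 source block, while OpenAI's chat API takes an image_url entry (commonly a data URL). A sketch of the two encodings, assuming a PIL image; this is not the package's actual implementation:

    import base64
    import io

    def encode_png(image):
        # Base64-encode a PIL.Image as PNG.
        buf = io.BytesIO()
        image.save(buf, format="PNG")
        return base64.b64encode(buf.getvalue()).decode("utf-8")

    def image_block_claude(image):
        # Anthropic-style image content block.
        return {"type": "image",
                "source": {"type": "base64", "media_type": "image/png",
                           "data": encode_png(image)}}

    def image_block_openai(image):
        # OpenAI-style image content block (data URL).
        return {"type": "image_url",
                "image_url": {"url": "data:image/png;base64," + encode_png(image)}}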

iclbench.dataset module

class iclbench.dataset.InContextDataset(config, env_name, original_cwd)[source]

Bases: object

__init__(config, env_name, original_cwd) → None[source]
check_seed(demo_path)[source]
demo_path(i, task, demo_config)[source]
demo_task(task)[source]
icl_episodes(task)[source]
load_incontext_actions(demo_path)[source]
override_incontext_config(demo_config, demo_path)[source]
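
The method names suggest the usual access pattern: construct the dataset for an environment, then iterate the stored in-context demonstrations for a task. A hedged sketch; the argument values are placeholders and the exact return shapes are not documented here:

    from iclbench.dataset import InContextDataset

    # `config` is the benchmark's Config object; "my_env" and "my_task"
    # are hypothetical names.
    dataset = InContextDataset(config, env_name="my_env", original_cwd=".")
    for episode in dataset.icl_episodes(task="my_task"):
        ...  # feed each demonstration episode to the agent as context
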
iclbench.dataset.choice_excluding(lst, excluded_element)[source]
iclbench.dataset.natural_sort_key(s)[source]
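
Both helpers have the conventional behavior their names imply. Minimal reference implementations of the same idea (sketches, not the package's actual code):

    import random
    import re

    def choice_excluding(lst, excluded_element):
        # Uniformly pick an element of lst other than excluded_element.
        return random.choice([x for x in lst if x != excluded_element])

    def natural_sort_key(s):
        # Sort key that orders embedded integers numerically, so that
        # sorted(names, key=natural_sort_key) puts 'demo2' before 'demo10'.
        return [int(tok) if tok.isdigit() else tok.lower()
                for tok in re.split(r"(\d+)", s)]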

iclbench.evaluator module

class iclbench.evaluator.Evaluator(env_name, config, original_cwd='')[source]

Bases: object

Class to evaluate an agent on a set of tasks in a given environment.

The Evaluator class is responsible for orchestrating the evaluation of agents across multiple tasks within a specified environment. It manages the setup of the environment, runs episodes, logs results, and can execute evaluations in parallel or sequentially.

Variables:
  • env_name (str) – Name of the environment in which the agent operates.

  • config (Config) – Configuration object containing evaluation parameters.

  • tasks (list) – List of tasks for the specified environment.

  • num_episodes (int) – Number of episodes to run for each task.

  • num_workers (int) – Number of parallel worker processes to use.

  • max_steps_per_episode (int) – Maximum number of steps per episode.

  • dataset (InContextDataset) – Dataset object for managing in-context learning tasks.

__init__(env_name, config, original_cwd='')[source]

Initializes the Evaluator with environment name and configuration.

Parameters:
  • env_name (str) – Name of the environment.

  • config (Config) – Configuration object with evaluation parameters.

  • original_cwd (str, optional) – Original current working directory. Defaults to "".

load_in_context_learning_episode(i, task, agent, episode_log)[source]

Loads and executes an in-context learning episode for the specified task.

Parameters:
  • i (int) – Index of the in-context learning episode.

  • task (str) – Name of the task to be evaluated.

  • agent (BaseAgent) – The agent being evaluated.

  • episode_log (dict) – Log to record episode results.

run(agent_factory)[source]

Executes the evaluation process either sequentially or in parallel.

Parameters:

agent_factory (AgentFactory) – Factory to create instances of the agent.

Returns:

Summary of the results for all tasks.

Return type:

dict
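
End-to-end, evaluation reduces to constructing an Evaluator and handing it an agent factory. A hedged sketch; `config` and `agent_factory` come from the benchmark's own configuration and agent machinery, and "my_env" is a placeholder:

    from iclbench.evaluator import Evaluator

    evaluator = Evaluator(env_name="my_env", config=config, original_cwd=".")
    results = evaluator.run(agent_factory)  # dict: per-task results summary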

run_episode(task, agent, process_num=None, position=0)[source]

Executes a single evaluation episode for the specified task.

Parameters:
  • task (str) – Name of the task to be evaluated.

  • agent (BaseAgent) – The agent being evaluated.

  • process_num (int, optional) – Process number for logging purposes. Defaults to None.

  • position (int, optional) – Position for progress bar. Defaults to 0.

Returns:

Log of the episode results, including trajectory, action frequency, and performance metrics.

Return type:

dict

iclbench.utils module

iclbench.utils.load_secrets(file_path)[source]
iclbench.utils.setup_environment(openai_tag: str = 'OPENAI_API_KEY', gemini_tag: str = 'GEMINI_API_KEY', anthropic_tag: str = 'ANTHROPIC_API_KEY', replicate_tag: str = 'REPLICATE_API_KEY', organization: str = None, original_cwd: str = '')[source]
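
The tag arguments name the keys under which each provider's API key is stored (presumably in the secrets file read by load_secrets) before being exported as environment variables. A typical call with the defaults:

    from iclbench.utils import setup_environment

    # Uses the default key names; pass original_cwd if the process's
    # working directory has been changed since startup.
    setup_environment(original_cwd=".")
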
iclbench.utils.summarize_env_progressions(results_summaries: defaultdict, config) → float[source]
iclbench.utils.wandb_save_artifact(config)[source]

Module contents