iclbench package

Submodules

iclbench.client module

class iclbench.client.ClaudeWrapper(client_config)[source]

Bases: LLMClientWrapper

__init__(client_config)[source]
convert_messages(messages)[source]
generate(messages)[source]
class iclbench.client.GoogleGenerativeAIWrapper(client_config)[source]

Bases: LLMClientWrapper

__init__(client_config)[source]
cache_icl_demo(icl_demo)[source]
convert_messages(messages)[source]
extract_completion(response)[source]

Safely extracts and returns the completion from the response.

generate(messages)[source]
get_completion(converted_messages, max_retries=5, delay=5)[source]
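
The max_retries and delay parameters of get_completion imply a retry loop around the underlying Gemini request. A minimal sketch of that pattern, assuming a hypothetical call_api callable standing in for the wrapped API call (this is not the package's actual implementation):

    import time

    def get_completion_with_retries(call_api, converted_messages, max_retries=5, delay=5):
        # Retry the (hypothetical) API call up to max_retries times,
        # sleeping `delay` seconds between failed attempts.
        last_error = None
        for attempt in range(max_retries):
            try:
                return call_api(converted_messages)
            except Exception as err:  # real code would catch the SDK's specific errors
                last_error = err
                time.sleep(delay)
        raise RuntimeError(f"no completion after {max_retries} attempts") from last_error
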
class iclbench.client.LLMClientWrapper(client_config)[source]

Bases: object

__init__(client_config)[source]
generate(messages)[source]
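
All of the concrete wrappers above specialize this base class with the same shape: __init__ stores the client configuration, convert_messages translates the benchmark's message format into the provider's, and generate performs the call. A toy subclass illustrating the interface; the attribute names on client_config and the message format are assumptions:

    from iclbench.client import LLMClientWrapper

    class EchoWrapper(LLMClientWrapper):
        # Illustrative only: echoes the last user message instead of
        # calling a real provider.
        def __init__(self, client_config):
            super().__init__(client_config)
            self.model_id = getattr(client_config, "model_id", "echo")  # assumed field

        def convert_messages(self, messages):
            # Assumes messages are {"role": ..., "content": ...} dicts.
            return [m["content"] for m in messages if m.get("role") == "user"]

        def generate(self, messages):
            converted = self.convert_messages(messages)
            return converted[-1] if converted else ""
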
class iclbench.client.LLMResponse(model_id, completion, stop_reason, input_tokens, output_tokens, reasoning)

Bases: tuple

completion

Alias for field number 1

input_tokens

Alias for field number 3

model_id

Alias for field number 0

output_tokens

Alias for field number 4

reasoning

Alias for field number 5

stop_reason

Alias for field number 2
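
Since LLMResponse is a namedtuple, fields can be read by name or by position; the field-number aliases above give the positional order. The values below are placeholders:

    from iclbench.client import LLMResponse

    resp = LLMResponse(
        model_id="example-model",  # field 0
        completion="Hello!",       # field 1
        stop_reason="stop",        # field 2
        input_tokens=12,           # field 3
        output_tokens=3,           # field 4
        reasoning=None,            # field 5
    )

    assert resp.completion == resp[1]  # name and position agree
    total_tokens = resp.input_tokens + resp.output_tokens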

class iclbench.client.OpenAIWrapper(client_config)[source]

Bases: LLMClientWrapper

__init__(client_config)[source]
convert_messages(messages)[source]
generate(messages)[source]
class iclbench.client.ReplicateWrapper(client_config)[source]

Bases: LLMClientWrapper

__init__(client_config)[source]
generate(messages)[source]
iclbench.client.create_llm_client(client_config)[source]

Factory function to create the appropriate LLM client based on the client name.
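
A hedged usage sketch; the shape of client_config (a client name plus model settings) is an assumption, not a documented contract:

    from types import SimpleNamespace
    from iclbench.client import create_llm_client

    # Hypothetical config: real code passes the benchmark's Config object.
    client_config = SimpleNamespace(client_name="openai", model_id="gpt-4o")
    client = create_llm_client(client_config)
    response = client.generate([{"role": "user", "content": "Hello"}])
    print(response.completion)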

iclbench.client.process_image_claude(image)[source]
iclbench.client.process_image_openai(image)[source]
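
There are two helpers because the providers expect different image payloads: Anthropic's messages API takes a base64 source block, while OpenAI's chat API takes an image_url entry (commonly a data URL). A sketch of the two encodings, assuming a PIL image; this is not the package's actual implementation:

    import base64
    import io

    def encode_png(image):
        # Base64-encode a PIL.Image as PNG.
        buf = io.BytesIO()
        image.save(buf, format="PNG")
        return base64.b64encode(buf.getvalue()).decode("utf-8")

    def image_block_claude(image):
        # Anthropic-style image content block.
        return {"type": "image",
                "source": {"type": "base64", "media_type": "image/png",
                           "data": encode_png(image)}}

    def image_block_openai(image):
        # OpenAI-style image content block (data URL).
        return {"type": "image_url",
                "image_url": {"url": "data:image/png;base64," + encode_png(image)}}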

iclbench.dataset module

class iclbench.dataset.InContextDataset(config, env_name, original_cwd)[source]

Bases: object

__init__(config, env_name, original_cwd) → None[source]
check_seed(demo_path)[source]
demo_path(i, task, demo_config)[source]
demo_task(task)[source]
icl_episodes(task)[source]
load_incontext_actions(demo_path)[source]
override_incontext_config(demo_config, demo_path)[source]
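
The method names suggest the usual access pattern: construct the dataset for an environment, then iterate the stored in-context demonstrations for a task. A hedged sketch; the argument values are placeholders and the exact return shapes are not documented here:

    from iclbench.dataset import InContextDataset

    # `config` is the benchmark's Config object; "my_env" and "my_task"
    # are hypothetical names.
    dataset = InContextDataset(config, env_name="my_env", original_cwd=".")
    for episode in dataset.icl_episodes(task="my_task"):
        ...  # feed each demonstration episode to the agent as context
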
iclbench.dataset.choice_excluding(lst, excluded_element)[source]
iclbench.dataset.natural_sort_key(s)[source]
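
Both helpers have the conventional behavior their names imply. Minimal reference implementations of the same idea (sketches, not the package's actual code):

    import random
    import re

    def choice_excluding(lst, excluded_element):
        # Uniformly pick an element of lst other than excluded_element.
        return random.choice([x for x in lst if x != excluded_element])

    def natural_sort_key(s):
        # Sort key that orders embedded integers numerically, so that
        # sorted(names, key=natural_sort_key) puts 'demo2' before 'demo10'.
        return [int(tok) if tok.isdigit() else tok.lower()
                for tok in re.split(r"(\d+)", s)]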

iclbench.evaluator module

class iclbench.evaluator.Evaluator(env_name, config, original_cwd='')[source]

Bases: object

Class to evaluate an agent on a set of tasks in a given environment.

The Evaluator class is responsible for orchestrating the evaluation of agents across multiple tasks within a specified environment. It manages the setup of the environment, runs episodes, logs results, and can execute evaluations in parallel or sequentially.

Variables:
  • env_name (str) – Name of the environment in which the agent operates.

  • config (Config) – Configuration object containing evaluation parameters.

  • tasks (list) – List of tasks for the specified environment.

  • num_episodes (int) – Number of episodes to run for each task.

  • num_workers (int) – Number of parallel worker processes to use.

  • max_steps_per_episode (int) – Maximum number of steps per episode.

  • dataset (InContextDataset) – Dataset object for managing in-context learning tasks.

__init__(env_name, config, original_cwd='')[source]

Initializes the Evaluator with environment name and configuration.

Parameters:
  • env_name (str) – Name of the environment.

  • config (Config) – Configuration object with evaluation parameters.

  • original_cwd (str, optional) – Original current working directory. Defaults to "".

load_in_context_learning_episode(i, task, agent, episode_log)[source]

Loads and executes an in-context learning episode for the specified task.

Parameters:
  • i (int) – Index of the in-context learning episode.

  • task (str) – Name of the task to be evaluated.

  • agent (BaseAgent) – The agent being evaluated.

  • episode_log (dict) – Log to record episode results.

run(agent_factory)[source]

Executes the evaluation process either sequentially or in parallel.

Parameters:

agent_factory (AgentFactory) – Factory to create instances of the agent.

Returns:

Summary of the results for all tasks.

Return type:

dict
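
End-to-end, evaluation reduces to constructing an Evaluator and handing it an agent factory. A hedged sketch; `config` and `agent_factory` come from the benchmark's own configuration and agent machinery, and "my_env" is a placeholder:

    from iclbench.evaluator import Evaluator

    evaluator = Evaluator(env_name="my_env", config=config, original_cwd=".")
    results = evaluator.run(agent_factory)  # dict: per-task results summary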

run_episode(task, agent, process_num=None, position=0)[source]

Executes a single evaluation episode for the specified task.

Parameters:
  • task (str) – Name of the task to be evaluated.

  • agent (BaseAgent) – The agent being evaluated.

  • process_num (int, optional) – Process number for logging purposes. Defaults to None.

  • position (int, optional) – Position for progress bar. Defaults to 0.

Returns:

Log of the episode results, including trajectory, action frequency, and performance metrics.

Return type:

dict

iclbench.utils module

iclbench.utils.load_secrets(file_path)[source]
iclbench.utils.setup_environment(openai_tag: str = 'OPENAI_API_KEY', gemini_tag: str = 'GEMINI_API_KEY', anthropic_tag: str = 'ANTHROPIC_API_KEY', replicate_tag: str = 'REPLICATE_API_KEY', organization: str = None, original_cwd: str = '')[source]
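
The tag arguments name the keys under which each provider's API key is stored (presumably in the secrets file read by load_secrets) before being exported as environment variables. A typical call with the defaults:

    from iclbench.utils import setup_environment

    # Uses the default key names; pass original_cwd if the process's
    # working directory has been changed since startup.
    setup_environment(original_cwd=".")
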
iclbench.utils.summarize_env_progressions(results_summaries: defaultdict, config) → float[source]
iclbench.utils.wandb_save_artifact(config)[source]

Module contents