[arteval-bench] Port Artifact of EMT (in OSDI'25) to the ArtEvalBench of System Intelligence Benchmark #108
EscapistArcadia wants to merge 150 commits into sys-intelligence:main from
Conversation
Distinguish the models used in the executor and evaluator
Signed-off-by: Tarek <tareknaser360@gmail.com>
…s/sysmobench/sysmobench_core'
- Add gpt-4o model configuration to models.yaml
- Fix setup_tools.py to use shutil.move instead of os.rename; this resolves the 'Invalid cross-device link' error when /tmp is on a different filesystem
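The os.rename/shutil.move distinction above is easy to trip over: os.rename is a thin wrapper over the rename(2) syscall and fails with EXDEV when source and destination are on different filesystems (as when /tmp is a tmpfs mount), while shutil.move falls back to copy-then-delete. A minimal sketch of the difference (file names here are illustrative only):

```python
import os
import shutil
import tempfile

def move_file(src: str, dst: str) -> None:
    """Move src to dst, working across filesystem boundaries.

    os.rename() raises OSError with errno EXDEV ('Invalid cross-device
    link') when src and dst live on different filesystems; shutil.move()
    detects this case and falls back to a copy followed by a delete.
    """
    shutil.move(src, dst)

# Usage: move a temp file (often on a separate tmpfs) into the
# current working directory.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as f:
    f.write(b"hello")
    tmp_path = f.name

move_file(tmp_path, "moved.txt")
```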
Adding Acto (SOSP'23) and Anvil (OSDI'24)
Integrate SREGym
… patch in the future.
…eep one-click effect.
@EscapistArcadia Hi Shanbo, thanks a lot for your contribution. Let us review this and feedback soon. |
Pull request overview
This PR adds the EMT artifact evaluation (from OSDI 2025) to the ArtEvalBench of the System Intelligence Benchmark. The implementation provides Oracle scripts to automate the evaluation of EMT's experimental results by comparing reproduced metrics against reference values.
Changes:
- Added EMT artifact entry to the ArtEvalBench schema file
- Implemented Oracle scripts for environment setup, experiment execution, and result validation
- Created utilities for logging and path configuration
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| benchmarks/arteval_bench/data/benchmark/arteval_tasks.jsonl | Adds EMT artifact entry to the benchmark registry |
| benchmarks/arteval_bench/data/benchmark/osdi25_emt/_agent_eval/utils.py | Defines constants, paths, and logging configuration for EMT evaluation |
| benchmarks/arteval_bench/data/benchmark/osdi25_emt/_agent_eval/oracle_experiment_runs.py | Implements experiment execution and result validation logic |
| benchmarks/arteval_bench/data/benchmark/osdi25_emt/_agent_eval/oracle_env_setup.py | Handles dependency checking and environment setup |
| benchmarks/arteval_bench/data/benchmark/osdi25_emt/_agent_eval/oracle_benchmark_prep.py | Placeholder for benchmark preparation |
| benchmarks/arteval_bench/data/benchmark/osdi25_emt/_agent_eval/oracle_artifact_build.py | Placeholder for artifact build |
| benchmarks/arteval_bench/data/benchmark/osdi25_emt/_agent_eval/main.py | Entry point for orchestrating the evaluation process |
```python
HOME = Path.cwd().parent
REPO_DIRS = {"emt": HOME / "emt"}

FIG18_REFERENCE_PATH = HOME / "_agent_eval" / "refs" / "emt-figure16.ref.csv"  # TODO: Fill the paper data to the reference
```
The variable name FIG18_REFERENCE_PATH refers to Figure 18 but the path includes 'emt-figure16.ref.csv'. This inconsistency should be resolved by either renaming the variable to FIG16_REFERENCE_PATH or updating the filename to 'emt-figure18.ref.csv'.
Suggested change:
```diff
- FIG18_REFERENCE_PATH = HOME / "_agent_eval" / "refs" / "emt-figure16.ref.csv"  # TODO: Fill the paper data to the reference
+ FIG18_REFERENCE_PATH = HOME / "_agent_eval" / "refs" / "emt-figure18.ref.csv"  # TODO: Fill the paper data to the reference
```
```python
if not ok:
    return False, why
```
This check on line 158 will always evaluate the result from the already-completed 'never' check on line 153, not the commented-out 'always' check on line 157. Either remove this dead code or uncomment line 157 and assign its result to 'ok, why' before this check.
```python
if not Path(self.go_root).exists():
    problems.append("$HOME/go directory missing (install golang and configure GOPATH)")
if not Path(self.go_bin).exists():
    problems.append("$HOME/go/bin directory missing (ensure Go tools are installed)")

# Check PATH contains Go path
path_env = os.environ.get("PATH", "")
go_root_str = str(self.go_root)
go_bin_str = str(self.go_bin)
if go_root_str not in path_env or go_bin_str not in path_env:
    problems.append("PATH missing $HOME/go or $HOME/go/bin "
                    "(export PATH=$HOME/go:$HOME/go/bin:$PATH)")
```
The Go-related checks appear unnecessary for the EMT artifact evaluation. The code comment on line 80 questions whether this is needed, and no Go dependencies are listed in DEPENDENCIES. These checks should either be removed or the TODO comment on line 80 should be resolved to clarify if Go is required.
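Beyond the question of whether Go is needed at all, the PATH check above uses a substring test, which can misfire: "/home/u/go" matches inside "/home/u/go/bin" even when "/home/u/go" itself is not a PATH entry. A more robust sketch splits on os.pathsep and compares resolved paths (the function name `dirs_missing_from_path` is illustrative, not from the artifact):

```python
import os
from pathlib import Path

def dirs_missing_from_path(*dirs: str) -> list[str]:
    """Return the subset of dirs that are not entries of $PATH.

    Splitting PATH on os.pathsep and comparing resolved paths avoids
    the substring pitfall where '/home/u/go' spuriously "matches"
    inside the '/home/u/go/bin' entry.
    """
    entries = {
        str(Path(e).resolve())
        for e in os.environ.get("PATH", "").split(os.pathsep)
        if e
    }
    return [d for d in dirs if str(Path(d).resolve()) not in entries]

# Usage: report which Go directories (if Go turns out to be required)
# still need to be exported.
missing = dirs_missing_from_path(os.path.expanduser("~/go"),
                                 os.path.expanduser("~/go/bin"))
```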
```python
def run(self) -> bool:
    # DEBUG PERPOSES ONLY, WILL REMOVE LATER
```
Corrected spelling of 'PERPOSES' to 'PURPOSES'.
Suggested change:
```diff
- # DEBUG PERPOSES ONLY, WILL REMOVE LATER
+ # DEBUG PURPOSES ONLY, WILL REMOVE LATER
```
```python
def paths_check(self):
    """
    Check that Python virtual environment is succesfully created
```
Corrected spelling of 'succesfully' to 'successfully'.
Suggested change:
```diff
- Check that Python virtual environment is succesfully created
+ Check that Python virtual environment is successfully created
```
```python
if not Path(self.venv_dir).exists():
    problems.append(".venv virtual environment missing (run 'python3 -m venv .venv')")

# Check Go directories exit
```
Corrected spelling of 'exit' to 'exist'.
Suggested change:
```diff
- # Check Go directories exit
+ # Check Go directories exist
```
```python
def check_dependency(self, dep: Dependency) -> Optional[str]:
    """
    Core method that checks whether a certain dependency of a version
    equal or greather than a reference version is installed.
```
Corrected spelling of 'greather' to 'greater'.
Suggested change:
```diff
- equal or greather than a reference version is installed.
+ equal or greater than a reference version is installed.
```
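The "equal or greater than a reference version" comparison the docstring describes is another spot where naive string comparison goes wrong ("3.10" sorts below "3.9" lexicographically). A minimal sketch, assuming plain dotted numeric versions (the function name `version_at_least` is illustrative, not the artifact's API):

```python
def version_at_least(installed: str, required: str) -> bool:
    """Return True if installed >= required, comparing dotted version
    strings component by component as integers.

    Tuple comparison gets the semantics right: (3, 10, 2) >= (3, 9)
    is True, whereas the string comparison '3.10.2' >= '3.9' is False.
    """
    def parts(v: str) -> tuple[int, ...]:
        return tuple(int(p) for p in v.split("."))
    return parts(installed) >= parts(required)

print(version_at_least("3.10.2", "3.9"))   # True
print(version_at_least("1.21", "1.22.1"))  # False
```

For real-world version strings with suffixes like "1.2.3rc1", a library such as packaging.version would be a sturdier choice than this hand-rolled parser.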
```json
{"artifact_id": "sosp24_wasabi", "artifact_dir": "sosp24_wasabi", "artifact_readme": "sosp24_wasabi/wasabi/README.md", "artifact_url": "https://github.com/bastoica/wasabi/tree/sosp24-ae", "evaluator": "sosp24_wasabi/wasabi/_agent_eval/main.py", "expected_score": 4, "docer_env": "bastoica/ae-agent-ubuntu24.04:latest"}
{"artifact_id": "osdi24_anvil", "artifact_dir": "osdi24_anvil", "artifact_readme": "sosp23_acto/acto/README.md", "artifact_url": "https://github.com/anvil-verifier/anvil", "evaluator": "osdi24_anvil/anvil/_agent_eval/main.py", "expected_score": 4, "docer_env": "bastoica/ae-agent-ubuntu24.04:latest"}
{"artifact_id": "sosp23_acto", "artifact_dir": "sosp23_acto", "artifact_readme": "sosp23_acto/acto/README.md", "artifact_url": "https://github.com/xlab-uiuc/acto", "evaluator": "sosp23_acto/acto/_agent_eval/main.py", "expected_score": 4, "docer_env": "bastoica/ae-agent-ubuntu24.04:latest"}
{"artifact_id": "osdi25_emt", "artifact_dir": "osdi25_emt", "artifact_readme": "osdi25_emt/emt/README.md", "artifact_url": "https://github.com/xlab-uiuc/emt-system-intelligence-benchmark", "evaluator": "osdi25_emt/emt/_agent_eval/main.py", "expected_score": 4, "docer_env": ""}
```
The field name 'docer_env' appears to be a typo. It should likely be 'docker_env' to match the pattern from other entries in this file.
Suggested change:
```diff
- {"artifact_id": "sosp24_wasabi", "artifact_dir": "sosp24_wasabi", "artifact_readme": "sosp24_wasabi/wasabi/README.md", "artifact_url": "https://github.com/bastoica/wasabi/tree/sosp24-ae", "evaluator": "sosp24_wasabi/wasabi/_agent_eval/main.py", "expected_score": 4, "docer_env": "bastoica/ae-agent-ubuntu24.04:latest"}
- {"artifact_id": "osdi24_anvil", "artifact_dir": "osdi24_anvil", "artifact_readme": "sosp23_acto/acto/README.md", "artifact_url": "https://github.com/anvil-verifier/anvil", "evaluator": "osdi24_anvil/anvil/_agent_eval/main.py", "expected_score": 4, "docer_env": "bastoica/ae-agent-ubuntu24.04:latest"}
- {"artifact_id": "sosp23_acto", "artifact_dir": "sosp23_acto", "artifact_readme": "sosp23_acto/acto/README.md", "artifact_url": "https://github.com/xlab-uiuc/acto", "evaluator": "sosp23_acto/acto/_agent_eval/main.py", "expected_score": 4, "docer_env": "bastoica/ae-agent-ubuntu24.04:latest"}
- {"artifact_id": "osdi25_emt", "artifact_dir": "osdi25_emt", "artifact_readme": "osdi25_emt/emt/README.md", "artifact_url": "https://github.com/xlab-uiuc/emt-system-intelligence-benchmark", "evaluator": "osdi25_emt/emt/_agent_eval/main.py", "expected_score": 4, "docer_env": ""}
+ {"artifact_id": "sosp24_wasabi", "artifact_dir": "sosp24_wasabi", "artifact_readme": "sosp24_wasabi/wasabi/README.md", "artifact_url": "https://github.com/bastoica/wasabi/tree/sosp24-ae", "evaluator": "sosp24_wasabi/wasabi/_agent_eval/main.py", "expected_score": 4, "docker_env": "bastoica/ae-agent-ubuntu24.04:latest"}
+ {"artifact_id": "osdi24_anvil", "artifact_dir": "osdi24_anvil", "artifact_readme": "sosp23_acto/acto/README.md", "artifact_url": "https://github.com/anvil-verifier/anvil", "evaluator": "osdi24_anvil/anvil/_agent_eval/main.py", "expected_score": 4, "docker_env": "bastoica/ae-agent-ubuntu24.04:latest"}
+ {"artifact_id": "sosp23_acto", "artifact_dir": "sosp23_acto", "artifact_readme": "sosp23_acto/acto/README.md", "artifact_url": "https://github.com/xlab-uiuc/acto", "evaluator": "sosp23_acto/acto/_agent_eval/main.py", "expected_score": 4, "docker_env": "bastoica/ae-agent-ubuntu24.04:latest"}
+ {"artifact_id": "osdi25_emt", "artifact_dir": "osdi25_emt", "artifact_readme": "osdi25_emt/emt/README.md", "artifact_url": "https://github.com/xlab-uiuc/emt-system-intelligence-benchmark", "evaluator": "osdi25_emt/emt/_agent_eval/main.py", "expected_score": 4, "docker_env": ""}
```
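Field-name typos like 'docer_env' slip through because JSONL has no schema enforcement. A small validation sketch catches them at registry-edit time (the field set and function name are assumptions drawn from the entries above, not part of the benchmark's tooling):

```python
import json

# Expected key set, inferred from the arteval_tasks.jsonl entries.
REQUIRED_FIELDS = {"artifact_id", "artifact_dir", "artifact_readme",
                   "artifact_url", "evaluator", "expected_score", "docker_env"}

def check_jsonl(text: str) -> list[str]:
    """Return one error message per JSONL line whose keys deviate
    from REQUIRED_FIELDS (catches typos such as 'docer_env')."""
    errors = []
    for i, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        keys = set(json.loads(line))  # set() over a dict yields its keys
        missing, extra = REQUIRED_FIELDS - keys, keys - REQUIRED_FIELDS
        if missing or extra:
            errors.append(f"line {i}: missing={sorted(missing)} extra={sorted(extra)}")
    return errors

# Usage: a minimal entry carrying the typo from this PR.
sample = ('{"artifact_id": "x", "artifact_dir": "x", "artifact_readme": "r", '
          '"artifact_url": "u", "evaluator": "e", "expected_score": 4, '
          '"docer_env": ""}')
print(check_jsonl(sample))  # ["line 1: missing=['docker_env'] extra=['docer_env']"]
```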
@bastoica can you please review this PR?
@xuafeng This has multiple conflicts due to an earlier force-push from one of the other benchmarks. I'm not sure why this artifact touches all the other benchmark entries. I've been trying to fix this over the last week but it's too broken. I'll need to reach out to the EMT authors to try to solve this.
Sorry about that! I actually have a backup from before the force push. Happy to share it if that helps. Also, if you tell me what file tree you'd like from the previous version, I can clean up the git history myself to save you some time.
Don't worry about it @tareknaser ! Stuff like this happens and it's not a big deal. I think we can safely close this PR and create another. It's also my fault since I procrastinated a lot before reviewing it... :-( |
Description
This PR adds the artifact of EMT, presented at OSDI'25, to the ArtEvalBench of the System Intelligence Benchmark.
Changes
Testing
I have finished testing the full AE pipeline locally by running `python main.py`, together with `git clone`, `git submodule update`, and a local image build. These commands are kept in the Oracle scripts but commented out.
Checklist