[arteval-bench] Port Artifact of EMT (in OSDI'25) to the ArtEvalBench of System Intelligence Benchmark #108
EscapistArcadia wants to merge 150 commits into sys-intelligence:main from
Conversation
Distinguish the models used in the executor and evaluator
Signed-off-by: Tarek <tareknaser360@gmail.com>
…s/sysmobench/sysmobench_core'
- Add gpt-4o model configuration to models.yaml
- Fix setup_tools.py to use shutil.move instead of os.rename; this resolves the 'Invalid cross-device link' error when /tmp is on a different filesystem
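The os.rename/shutil.move distinction above is easy to trip over: os.rename is a thin wrapper over the rename(2) syscall and fails with EXDEV when source and destination are on different filesystems (as when /tmp is a tmpfs mount), while shutil.move falls back to copy-then-delete. A minimal sketch of the difference (file names here are illustrative only):

```python
import os
import shutil
import tempfile

def move_file(src: str, dst: str) -> None:
    """Move src to dst, working across filesystem boundaries.

    os.rename() raises OSError with errno EXDEV ('Invalid cross-device
    link') when src and dst live on different filesystems; shutil.move()
    detects this case and falls back to a copy followed by a delete.
    """
    shutil.move(src, dst)

# Usage: move a temp file (often on a separate tmpfs) into the
# current working directory.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as f:
    f.write(b"hello")
    tmp_path = f.name

move_file(tmp_path, "moved.txt")
```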
Adding Acto (SOSP'23) and Anvil (OSDI'24)
Integrate SREGym
… patch in the future.
…eep one-click effect.
@EscapistArcadia Hi Shanbo, thanks a lot for your contribution. Let us review this and feedback soon. |
Pull request overview
This PR adds the EMT artifact evaluation (from OSDI 2025) to the ArtEvalBench of the System Intelligence Benchmark. The implementation provides Oracle scripts to automate the evaluation of EMT's experimental results by comparing reproduced metrics against reference values.
Changes:
- Added EMT artifact entry to the ArtEvalBench schema file
- Implemented Oracle scripts for environment setup, experiment execution, and result validation
- Created utilities for logging and path configuration
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| benchmarks/arteval_bench/data/benchmark/arteval_tasks.jsonl | Adds EMT artifact entry to the benchmark registry |
| benchmarks/arteval_bench/data/benchmark/osdi25_emt/_agent_eval/utils.py | Defines constants, paths, and logging configuration for EMT evaluation |
| benchmarks/arteval_bench/data/benchmark/osdi25_emt/_agent_eval/oracle_experiment_runs.py | Implements experiment execution and result validation logic |
| benchmarks/arteval_bench/data/benchmark/osdi25_emt/_agent_eval/oracle_env_setup.py | Handles dependency checking and environment setup |
| benchmarks/arteval_bench/data/benchmark/osdi25_emt/_agent_eval/oracle_benchmark_prep.py | Placeholder for benchmark preparation |
| benchmarks/arteval_bench/data/benchmark/osdi25_emt/_agent_eval/oracle_artifact_build.py | Placeholder for artifact build |
| benchmarks/arteval_bench/data/benchmark/osdi25_emt/_agent_eval/main.py | Entry point for orchestrating the evaluation process |
```python
HOME = Path.cwd().parent
REPO_DIRS = {"emt": HOME / "emt"}

FIG18_REFERENCE_PATH = HOME / "_agent_eval" / "refs" / "emt-figure16.ref.csv"  # TODO: Fill the paper data to the reference
```
The variable name FIG18_REFERENCE_PATH refers to Figure 18 but the path includes 'emt-figure16.ref.csv'. This inconsistency should be resolved by either renaming the variable to FIG16_REFERENCE_PATH or updating the filename to 'emt-figure18.ref.csv'.
Suggested change:
```diff
- FIG18_REFERENCE_PATH = HOME / "_agent_eval" / "refs" / "emt-figure16.ref.csv"  # TODO: Fill the paper data to the reference
+ FIG18_REFERENCE_PATH = HOME / "_agent_eval" / "refs" / "emt-figure18.ref.csv"  # TODO: Fill the paper data to the reference
```
```python
if not ok:
    return False, why
```
This check on line 158 will always evaluate the result from the already-completed 'never' check on line 153, not the commented-out 'always' check on line 157. Either remove this dead code or uncomment line 157 and assign its result to 'ok, why' before this check.
```python
if not Path(self.go_root).exists():
    problems.append("$HOME/go directory missing (install golang and configure GOPATH)")
if not Path(self.go_bin).exists():
    problems.append("$HOME/go/bin directory missing (ensure Go tools are installed)")

# Check PATH contains Go path
path_env = os.environ.get("PATH", "")
go_root_str = str(self.go_root)
go_bin_str = str(self.go_bin)
if go_root_str not in path_env or go_bin_str not in path_env:
    problems.append("PATH missing $HOME/go or $HOME/go/bin "
                    "(export PATH=$HOME/go:$HOME/go/bin:$PATH)")
```
The Go-related checks appear unnecessary for the EMT artifact evaluation. The code comment on line 80 questions whether this is needed, and no Go dependencies are listed in DEPENDENCIES. These checks should either be removed or the TODO comment on line 80 should be resolved to clarify if Go is required.
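Beyond the question of whether Go is needed at all, the PATH check above uses a substring test, which can misfire: "/home/u/go" matches inside "/home/u/go/bin" even when "/home/u/go" itself is not a PATH entry. A more robust sketch splits on os.pathsep and compares resolved paths (the function name `dirs_missing_from_path` is illustrative, not from the artifact):

```python
import os
from pathlib import Path

def dirs_missing_from_path(*dirs: str) -> list[str]:
    """Return the subset of dirs that are not entries of $PATH.

    Splitting PATH on os.pathsep and comparing resolved paths avoids
    the substring pitfall where '/home/u/go' spuriously "matches"
    inside the '/home/u/go/bin' entry.
    """
    entries = {
        str(Path(e).resolve())
        for e in os.environ.get("PATH", "").split(os.pathsep)
        if e
    }
    return [d for d in dirs if str(Path(d).resolve()) not in entries]

# Usage: report which Go directories (if Go turns out to be required)
# still need to be exported.
missing = dirs_missing_from_path(os.path.expanduser("~/go"),
                                 os.path.expanduser("~/go/bin"))
```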
```python
def run(self) -> bool:
    # DEBUG PERPOSES ONLY, WILL REMOVE LATER
```
Corrected spelling of 'PERPOSES' to 'PURPOSES'.
Suggested change:
```diff
- # DEBUG PERPOSES ONLY, WILL REMOVE LATER
+ # DEBUG PURPOSES ONLY, WILL REMOVE LATER
```
```python
def paths_check(self):
    """
    Check that Python virtual environment is succesfully created
```
Corrected spelling of 'succesfully' to 'successfully'.
Suggested change:
```diff
- Check that Python virtual environment is succesfully created
+ Check that Python virtual environment is successfully created
```
```python
if not Path(self.venv_dir).exists():
    problems.append(".venv virtual environment missing (run 'python3 -m venv .venv')")

# Check Go directories exit
```
Corrected spelling of 'exit' to 'exist'.
Suggested change:
```diff
- # Check Go directories exit
+ # Check Go directories exist
```
```python
def check_dependency(self, dep: Dependency) -> Optional[str]:
    """
    Core method that checks whether a certain dependency of a version
    equal or greather than a reference version is installed.
```
Corrected spelling of 'greather' to 'greater'.
Suggested change:
```diff
- equal or greather than a reference version is installed.
+ equal or greater than a reference version is installed.
```
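The "equal or greater than a reference version" comparison the docstring describes is another spot where naive string comparison goes wrong ("3.10" sorts below "3.9" lexicographically). A minimal sketch, assuming plain dotted numeric versions (the function name `version_at_least` is illustrative, not the artifact's API):

```python
def version_at_least(installed: str, required: str) -> bool:
    """Return True if installed >= required, comparing dotted version
    strings component by component as integers.

    Tuple comparison gets the semantics right: (3, 10, 2) >= (3, 9)
    is True, whereas the string comparison '3.10.2' >= '3.9' is False.
    """
    def parts(v: str) -> tuple[int, ...]:
        return tuple(int(p) for p in v.split("."))
    return parts(installed) >= parts(required)

print(version_at_least("3.10.2", "3.9"))   # True
print(version_at_least("1.21", "1.22.1"))  # False
```

For real-world version strings with suffixes like "1.2.3rc1", a library such as packaging.version would be a sturdier choice than this hand-rolled parser.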
```json
{"artifact_id": "sosp24_wasabi", "artifact_dir": "sosp24_wasabi", "artifact_readme": "sosp24_wasabi/wasabi/README.md", "artifact_url": "https://github.com/bastoica/wasabi/tree/sosp24-ae", "evaluator": "sosp24_wasabi/wasabi/_agent_eval/main.py", "expected_score": 4, "docer_env": "bastoica/ae-agent-ubuntu24.04:latest"}
{"artifact_id": "osdi24_anvil", "artifact_dir": "osdi24_anvil", "artifact_readme": "sosp23_acto/acto/README.md", "artifact_url": "https://github.com/anvil-verifier/anvil", "evaluator": "osdi24_anvil/anvil/_agent_eval/main.py", "expected_score": 4, "docer_env": "bastoica/ae-agent-ubuntu24.04:latest"}
{"artifact_id": "sosp23_acto", "artifact_dir": "sosp23_acto", "artifact_readme": "sosp23_acto/acto/README.md", "artifact_url": "https://github.com/xlab-uiuc/acto", "evaluator": "sosp23_acto/acto/_agent_eval/main.py", "expected_score": 4, "docer_env": "bastoica/ae-agent-ubuntu24.04:latest"}
{"artifact_id": "osdi25_emt", "artifact_dir": "osdi25_emt", "artifact_readme": "osdi25_emt/emt/README.md", "artifact_url": "https://github.com/xlab-uiuc/emt-system-intelligence-benchmark", "evaluator": "osdi25_emt/emt/_agent_eval/main.py", "expected_score": 4, "docer_env": ""}
```
The field name 'docer_env' appears to be a typo. It should likely be 'docker_env' to match the pattern from other entries in this file.
Suggested change:
```diff
- {"artifact_id": "sosp24_wasabi", "artifact_dir": "sosp24_wasabi", "artifact_readme": "sosp24_wasabi/wasabi/README.md", "artifact_url": "https://github.com/bastoica/wasabi/tree/sosp24-ae", "evaluator": "sosp24_wasabi/wasabi/_agent_eval/main.py", "expected_score": 4, "docer_env": "bastoica/ae-agent-ubuntu24.04:latest"}
- {"artifact_id": "osdi24_anvil", "artifact_dir": "osdi24_anvil", "artifact_readme": "sosp23_acto/acto/README.md", "artifact_url": "https://github.com/anvil-verifier/anvil", "evaluator": "osdi24_anvil/anvil/_agent_eval/main.py", "expected_score": 4, "docer_env": "bastoica/ae-agent-ubuntu24.04:latest"}
- {"artifact_id": "sosp23_acto", "artifact_dir": "sosp23_acto", "artifact_readme": "sosp23_acto/acto/README.md", "artifact_url": "https://github.com/xlab-uiuc/acto", "evaluator": "sosp23_acto/acto/_agent_eval/main.py", "expected_score": 4, "docer_env": "bastoica/ae-agent-ubuntu24.04:latest"}
- {"artifact_id": "osdi25_emt", "artifact_dir": "osdi25_emt", "artifact_readme": "osdi25_emt/emt/README.md", "artifact_url": "https://github.com/xlab-uiuc/emt-system-intelligence-benchmark", "evaluator": "osdi25_emt/emt/_agent_eval/main.py", "expected_score": 4, "docer_env": ""}
+ {"artifact_id": "sosp24_wasabi", "artifact_dir": "sosp24_wasabi", "artifact_readme": "sosp24_wasabi/wasabi/README.md", "artifact_url": "https://github.com/bastoica/wasabi/tree/sosp24-ae", "evaluator": "sosp24_wasabi/wasabi/_agent_eval/main.py", "expected_score": 4, "docker_env": "bastoica/ae-agent-ubuntu24.04:latest"}
+ {"artifact_id": "osdi24_anvil", "artifact_dir": "osdi24_anvil", "artifact_readme": "sosp23_acto/acto/README.md", "artifact_url": "https://github.com/anvil-verifier/anvil", "evaluator": "osdi24_anvil/anvil/_agent_eval/main.py", "expected_score": 4, "docker_env": "bastoica/ae-agent-ubuntu24.04:latest"}
+ {"artifact_id": "sosp23_acto", "artifact_dir": "sosp23_acto", "artifact_readme": "sosp23_acto/acto/README.md", "artifact_url": "https://github.com/xlab-uiuc/acto", "evaluator": "sosp23_acto/acto/_agent_eval/main.py", "expected_score": 4, "docker_env": "bastoica/ae-agent-ubuntu24.04:latest"}
+ {"artifact_id": "osdi25_emt", "artifact_dir": "osdi25_emt", "artifact_readme": "osdi25_emt/emt/README.md", "artifact_url": "https://github.com/xlab-uiuc/emt-system-intelligence-benchmark", "evaluator": "osdi25_emt/emt/_agent_eval/main.py", "expected_score": 4, "docker_env": ""}
```
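Field-name typos like 'docer_env' slip through because JSONL has no schema enforcement. A small validation sketch catches them at registry-edit time (the field set and function name are assumptions drawn from the entries above, not part of the benchmark's tooling):

```python
import json

# Expected key set, inferred from the arteval_tasks.jsonl entries.
REQUIRED_FIELDS = {"artifact_id", "artifact_dir", "artifact_readme",
                   "artifact_url", "evaluator", "expected_score", "docker_env"}

def check_jsonl(text: str) -> list[str]:
    """Return one error message per JSONL line whose keys deviate
    from REQUIRED_FIELDS (catches typos such as 'docer_env')."""
    errors = []
    for i, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        keys = set(json.loads(line))  # set() over a dict yields its keys
        missing, extra = REQUIRED_FIELDS - keys, keys - REQUIRED_FIELDS
        if missing or extra:
            errors.append(f"line {i}: missing={sorted(missing)} extra={sorted(extra)}")
    return errors

# Usage: a minimal entry carrying the typo from this PR.
sample = ('{"artifact_id": "x", "artifact_dir": "x", "artifact_readme": "r", '
          '"artifact_url": "u", "evaluator": "e", "expected_score": 4, '
          '"docer_env": ""}')
print(check_jsonl(sample))  # ["line 1: missing=['docker_env'] extra=['docer_env']"]
```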
@bastoica can you please review this PR?
@xuafeng This has multiple conflicts due to an earlier force-push from one of the other benchmarks. I'm not sure why this artifact touches all the other benchmark entries. I've been trying to fix this over the last week but it's too broken. I'll need to reach out to the EMT authors to try to solve this.
Sorry about that! I actually have a backup from before the force push. Happy to share it if that helps. Also, if you tell me what file tree you'd like from the previous version, I can clean up the git history myself to save you some time.
Don't worry about it @tareknaser ! Stuff like this happens and it's not a big deal. I think we can safely close this PR and create another. It's also my fault since I procrastinated a lot before reviewing it... :-( |
Description
This PR adds the artifact of EMT, presented at OSDI'25, to the ArtEvalBench of the System Intelligence Benchmark.
Changes
Testing
I have finished testing the full AE pipeline locally by running `python main.py`, together with `git clone`, `git submodule update`, and a local image build. These commands are kept in the Oracle scripts but commented out.
Checklist