SYS-572 Add Blazegraph support for Wikidata SSE update stream #1

Open
devin-ai-integration[bot] wants to merge 12 commits into main from SYS-572-blazegraph-support
Conversation

devin-ai-integration (bot) commented Apr 14, 2026

Summary

Adds Blazegraph support for keeping a self-hosted Blazegraph instance synchronized with Wikidata via the public RDF update stream (SSE). Uses a thin wrapper subclass pattern to minimize changes to the upstream update_wikidata.py, keeping the fork mergeable with future upstream updates.

Commits (in order):

  1. Refactor update_wikidata.py for backend extensibility — extract parse_update_result() as overridable method, make sparql_endpoint/access_token/show safe via getattr()
  2. Add thin Blazegraph wrapper — blazegraph_wikidata_updater.py subclass that overrides only response parsing and adds Blazegraph URL as CLI arg
  3. Fix Blazegraph HTML response parsing — recognize Blazegraph's HTML success responses (COMMIT: totalElapsed=...)
  4. Make User-Agent configurable — --user-agent flag with Geneea-identifying default per Wikimedia policy
  5. Customize pyproject.toml — new package name geneea-wikidata-updater, trimmed deps, console script entry point
  6. Replace curl subprocess with requests library — all three curl call sites replaced with requests.post(), removing dependency on bash/curl
  7. Handle SSE stream connection errors (HTTP 429) — _iter_sse_events() wrapper catches mid-stream connection errors; empty-batch guard reconnects from same offset
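The thin-wrapper pattern from commits 1–2 can be sketched as below. `UpdateWikidataCommand` and `parse_update_result()` are named in this PR; the class name `BlazegraphWikidataUpdater`, the method body, and the sample response are illustrative stand-ins, not the actual code.

```python
# Minimal sketch of the thin-wrapper subclass pattern: the Blazegraph
# updater overrides only the response parsing and inherits everything else.
import re


class UpdateWikidataCommand:
    """Stand-in for the upstream QLever command (heavily simplified)."""

    def parse_update_result(self, body: str) -> dict:
        # Upstream parses QLever's JSON response; stubbed out here.
        return {"error": "unexpected response"}


class BlazegraphWikidataUpdater(UpdateWikidataCommand):
    """Overrides only parse_update_result(); all other behavior is inherited."""

    def parse_update_result(self, body: str) -> dict:
        match = re.search(r'<data modified="(\d+)" milliseconds="(\d+)"', body)
        if match:
            return {"time_total_ms": int(match.group(2))}
        return {"error": "unexpected response"}


updater = BlazegraphWikidataUpdater()
result = updater.parse_update_result('<data modified="5" milliseconds="12"/>')
print(result)  # {'time_total_ms': 12}
```

Because only one method differs, a `git merge upstream/main` touches the wrapper file rarely, which is the point of the pattern.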

Review & Testing Checklist for Human

  • Run with --num-messages 5 against your Blazegraph instance to verify the curl→requests replacement works end-to-end (offset query, batch UPDATE, response parsing)
  • Verify resume-from-offset still works: run, stop (Ctrl+C), restart without --use-cached-sparql-queries
  • Simulate a network interruption during streaming to verify the SSE error handling reconnects gracefully instead of crashing
  • Check that git merge upstream/main still applies cleanly after these changes

Notes

  • The only changes to update_wikidata.py are: extracted parse_update_result(), replaced 3 curl calls with requests.post(), added _iter_sse_events() wrapper, and added getattr() guards. All existing QLever behavior is preserved.
  • requests was already a transitive dependency (via requests-sse); now it's an explicit direct dependency.
  • Version bumped to 0.2.0 with the curl→requests change.

Link to Devin session: https://app.devin.ai/sessions/33060449c31b4f439350bce7d4877fba

devin-ai-integration Bot and others added 2 commits April 14, 2026 13:19
Small surgical changes to allow subclassing for different SPARQL backends:

- Extract response parsing into overridable parse_update_result() method.
  Returns dict with time_total_ms on success, 'retry' or 'error' strings.
  Subclasses (e.g. for Blazegraph) can override just this method.

- Use getattr() for sparql_endpoint so callers can pass a pre-built URL
  instead of host_name + port (falls back to original behavior).

- Use getattr() for access_token and skip the query parameter when absent.

- Use getattr() for show flag (defaults to False when not present).

All existing QLever behavior is preserved — the changes are additive.
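The getattr() guards above can be illustrated as follows. The attribute names (`sparql_endpoint`, `access_token`, `show`, `host_name`, `port`) come from the commit message; the query-parameter name `access-token` and the surrounding code are assumptions for illustration only.

```python
# Sketch of the getattr() fallback pattern: each optional attribute
# degrades to the original QLever behavior when absent.
class Args:
    host_name = "localhost"
    port = 7001
    # Deliberately no sparql_endpoint, access_token, or show attributes.


args = Args()

# A pre-built URL wins if present; otherwise fall back to host + port.
endpoint = (getattr(args, "sparql_endpoint", None)
            or f"http://{args.host_name}:{args.port}")

# Skip the access-token query parameter entirely when absent.
token = getattr(args, "access_token", None)
params = {"access-token": token} if token else {}

# The show flag defaults to False when not present.
show = getattr(args, "show", False)

print(endpoint, params, show)  # http://localhost:7001 {} False
```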

Co-Authored-By: radim.kubacki <radim.kubacki@geneea.com>
Thin wrapper next to update_wikidata.py that subclasses
UpdateWikidataCommand and overrides only parse_update_result() to
handle Blazegraph's XML response format instead of QLever's JSON.

The wrapper:
- Parses the Blazegraph SPARQL endpoint URL as a positional argument
- Delegates all other flags (--batch-size, --since, --offset, etc.)
  to UpdateWikidataCommand.additional_arguments()
- Overrides parse_update_result() to parse Blazegraph XML responses
  (<data modified="N" milliseconds="M"/>)
- Handles error pages (HTML) and empty responses gracefully

Usage: python -m qlever.commands.blazegraph_wikidata_updater \
        http://localhost:9999/bigdata/namespace/wdq/sparql
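A minimal sketch of the XML/error/empty handling described above. The `<data modified="N" milliseconds="M"/>` format is quoted from this commit; the function name and error strings are hypothetical.

```python
# Illustrative parser for the three Blazegraph response shapes:
# clean XML on success, an HTML error page, or an empty body.
import xml.etree.ElementTree as ET


def parse_blazegraph_response(body: str) -> dict:
    if not body.strip():
        return {"error": "empty response"}
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        # Malformed markup (e.g. a typical HTML error page).
        return {"error": "non-XML response"}
    if root.tag == "data":
        return {
            "modified": int(root.attrib.get("modified", 0)),
            "time_total_ms": int(root.attrib.get("milliseconds", 0)),
        }
    # Well-formed markup, but not the success element we expect.
    return {"error": f"unexpected root element: {root.tag}"}


print(parse_blazegraph_response('<data modified="3" milliseconds="7"/>'))
# {'modified': 3, 'time_total_ms': 7}
```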
Co-Authored-By: radim.kubacki <radim.kubacki@geneea.com>
@devin-ai-integration
Author

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring

devin-ai-integration Bot and others added 10 commits April 14, 2026 13:43
Blazegraph wraps successful SPARQL UPDATE results in HTML with
statistics in <p> tags, not clean XML. The parser now recognizes:

1. Clean XML: <data modified="N" milliseconds="M"/>
2. HTML with COMMIT stats: totalElapsed=Nms, mutationCount=M
3. HTML with non-COMMIT stats (fallback)
4. HTML error pages (no stats found)

Previously all HTML responses were treated as errors, causing the
updater to fail even though the update was applied successfully.
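The HTML success-response recognition might look like the sketch below. The `COMMIT`, `totalElapsed=Nms`, and `mutationCount=M` strings are quoted from this commit; the function name and regexes are illustrative.

```python
# Sketch: recognize Blazegraph's HTML-wrapped COMMIT statistics so a
# successful update is not misreported as an error.
import re


def parse_html_commit_stats(html: str) -> dict:
    if "COMMIT" in html:
        elapsed = re.search(r"totalElapsed=(\d+)ms", html)
        mutations = re.search(r"mutationCount=(\d+)", html)
        if elapsed and mutations:
            return {
                "time_total_ms": int(elapsed.group(1)),
                "modified": int(mutations.group(1)),
            }
    # No stats found: treat the page as a genuine error response.
    return {"error": "HTML response without COMMIT stats"}


html = "<html><p>COMMIT: totalElapsed=42ms, mutationCount=3</p></html>"
print(parse_html_commit_stats(html))  # {'time_total_ms': 42, 'modified': 3}
```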

Co-Authored-By: radim.kubacki <radim.kubacki@geneea.com>
Add user_agent parameter to connect_to_sse_stream() so callers can
identify themselves per Wikimedia User-Agent policy.

In update_wikidata.py:
- connect_to_sse_stream() accepts optional user_agent parameter,
  defaults to existing 'qlever update-wikidata' when not provided
- Both call sites in execute() pass getattr(args, 'user_agent', None)
  so existing QLever usage is unchanged

In blazegraph_wikidata_updater.py:
- New --user-agent CLI argument
- Default: 'Geneea-BlazegraphUpdater/1.0 (https://geneea.com; sysadmin@geneea.com)'
  per Wikimedia policy format: BotName/version (URL; email)

Co-Authored-By: radim.kubacki <radim.kubacki@geneea.com>
Co-Authored-By: radim.kubacki <radim.kubacki@geneea.com>
- Rename package to geneea-wikidata-updater
- Add console script entry point: geneea-wikidata-updater
- Trim dependencies: remove argcomplete, pyyaml (only needed by full qlever CLI)
- Scope packaging to qlever + qlever.commands only (no Qleverfiles data)
- Update metadata: description, authors, URLs, keywords

Co-Authored-By: radim.kubacki <radim.kubacki@geneea.com>
- Replace all curl + run_command() calls with requests.post()
- Remove dependency on qlever.util.run_command (and transitively
  on bash/curl being available on the system)
- Add requests as explicit dependency in pyproject.toml
- Bump version to 0.2.0

The three replaced call sites:
1. get_next_offset_from_endpoint() — SPARQL query for stream offset
2. execute() — 'updates complete until' SPARQL query
3. execute() — SPARQL UPDATE POST request

Behavior is preserved: query functions use raise_for_status() for
clean error detection; the UPDATE response body is always passed
through to parse_update_result() unchanged (matching the previous
curl behavior where HTTP errors were not fatal at the transport
level).
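The two call patterns described above might look like this sketch; endpoint URLs, function names, and the media type are assumptions, and only the error-handling split (fail fast for queries, pass-through for UPDATE) is taken from the commit message.

```python
# Sketch of the curl -> requests.post replacement, showing the two
# error-handling styles described in the commit message.
import requests


def run_sparql_query(endpoint: str, query: str) -> str:
    """Query-style calls fail fast on HTTP errors via raise_for_status()."""
    response = requests.post(
        endpoint,
        data={"query": query},
        headers={"Accept": "application/sparql-results+json"},
        timeout=60,
    )
    response.raise_for_status()
    return response.text


def post_sparql_update(endpoint: str, update: str) -> str:
    """The UPDATE response body is passed through unchanged (matching the
    old curl behavior, where HTTP errors were not fatal at the transport
    level) so parse_update_result() can inspect it."""
    response = requests.post(
        endpoint,
        data=update.encode("utf-8"),
        headers={"Content-Type": "application/sparql-update; charset=UTF-8"},
        timeout=60,
    )
    return response.text
```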

Co-Authored-By: radim.kubacki <radim.kubacki@geneea.com>
- Add _iter_sse_events() wrapper that catches ConnectionError,
  HTTPError (429), and Timeout during event iteration, so a
  mid-stream failure stops the batch gracefully instead of crashing
- Add empty-batch guard: if no events were processed (e.g. the
  connection dropped immediately), reset to the same offset and
  reconnect via the existing retry_with_backoff() loop
- Initial SSE connection 429 was already handled by retry_with_backoff();
  this commit covers the mid-stream case
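The mid-stream handling above can be sketched as a generator wrapper. The exception classes follow the commit message (requests' ConnectionError, HTTPError, Timeout); the event source here is a plain iterable for illustration, not the real SSE client.

```python
# Sketch: wrap SSE event iteration so a mid-stream connection failure
# ends the current batch gracefully instead of crashing the updater.
import requests


def _iter_sse_events(events):
    """Yield events until the stream ends or a connection error occurs."""
    try:
        for event in events:
            yield event
    except (requests.ConnectionError, requests.HTTPError,
            requests.Timeout) as exc:
        # Stop the batch cleanly; the caller's empty-batch guard then
        # reconnects from the same offset via retry_with_backoff().
        print(f"SSE stream interrupted: {exc}")


def flaky_stream():
    yield "event-1"
    yield "event-2"
    raise requests.ConnectionError("connection dropped mid-stream")


batch = list(_iter_sse_events(flaky_stream()))
print(batch)  # ['event-1', 'event-2']
```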

Co-Authored-By: radim.kubacki <radim.kubacki@geneea.com>
- Uses uv for venv creation, build, and dependency management
- Single project build (no subproject loop)
- Supports --keep-env, --reuse-env, --upload, --python-version flags
- Upload to pypi.dev.g via twine
- No Docker (not needed for this package)
- Runs tests and linting via tox with tox-uv

Co-Authored-By: radim.kubacki <radim.kubacki@geneea.com>
Replace tox (no tox.ini in repo) with direct pytest -v.
Always install the package and test dependencies (pyyaml,
argcomplete) before running the test suite.

Co-Authored-By: radim.kubacki <radim.kubacki@geneea.com>
Add charset=UTF-8 to the Content-Type header for SPARQL UPDATE
POST requests. Without it, Blazegraph's Jetty server defaults to
ISO-8859-1, causing UTF-8 multi-byte characters (e.g. ü) to be
stored as mojibake (e.g. Ã¼).

This is a known Blazegraph issue (blazegraph/database#206) and the
same fix was applied in rdflib (RDFLib/rdflib#2095).
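The mojibake mechanism is easy to reproduce: bytes encoded as UTF-8 but decoded as ISO-8859-1 turn each multi-byte character into two characters. The media type in the header below is an assumption; the commit itself only specifies adding charset=UTF-8.

```python
# Demonstration of the encoding bug this commit fixes: 'ü' encoded as
# UTF-8 (two bytes) but decoded as ISO-8859-1 becomes two characters.
text = "Zürich"
mojibake = text.encode("utf-8").decode("iso-8859-1")
print(mojibake)  # the ü has become the two-character sequence 'Ã¼'

# Declaring the charset explicitly prevents the server-side misread:
headers = {"Content-Type": "application/sparql-update; charset=UTF-8"}
```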

Co-Authored-By: radim.kubacki <radim.kubacki@geneea.com>