SYS-572 Add Blazegraph support for Wikidata SSE update stream #1

Open
devin-ai-integration[bot] wants to merge 12 commits into main from SYS-572-blazegraph-support
Conversation

devin-ai-integration (bot) commented Apr 14, 2026

Summary

Adds Blazegraph support for keeping a self-hosted Blazegraph instance synchronized with Wikidata via the public RDF update stream (SSE). Uses a thin wrapper subclass pattern to minimize changes to the upstream update_wikidata.py, keeping the fork mergeable with future upstream updates.

Commits (in order):

  1. Refactor update_wikidata.py for backend extensibility — extract parse_update_result() as overridable method, make sparql_endpoint/access_token/show safe via getattr()
  2. Add thin Blazegraph wrapper — blazegraph_wikidata_updater.py subclass that overrides only response parsing and adds Blazegraph URL as CLI arg
  3. Fix Blazegraph HTML response parsing — recognize Blazegraph's HTML success responses (COMMIT: totalElapsed=...)
  4. Make User-Agent configurable — --user-agent flag with Geneea-identifying default per Wikimedia policy
  5. Customize pyproject.toml — new package name geneea-wikidata-updater, trimmed deps, console script entry point
  6. Replace curl subprocess with requests library — all three curl call sites replaced with requests.post(), removing dependency on bash/curl
  7. Handle SSE stream connection errors (HTTP 429) — _iter_sse_events() wrapper catches mid-stream connection errors; empty-batch guard reconnects from same offset
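The thin-wrapper pattern from commits 1–2 can be sketched as below. `UpdateWikidataCommand` and `parse_update_result()` are named in this PR; the class name `BlazegraphWikidataUpdater`, the method body, and the sample response are illustrative stand-ins, not the actual code.

```python
# Minimal sketch of the thin-wrapper subclass pattern: the Blazegraph
# updater overrides only the response parsing and inherits everything else.
import re


class UpdateWikidataCommand:
    """Stand-in for the upstream QLever command (heavily simplified)."""

    def parse_update_result(self, body: str) -> dict:
        # Upstream parses QLever's JSON response; stubbed out here.
        return {"error": "unexpected response"}


class BlazegraphWikidataUpdater(UpdateWikidataCommand):
    """Overrides only parse_update_result(); all other behavior is inherited."""

    def parse_update_result(self, body: str) -> dict:
        match = re.search(r'<data modified="(\d+)" milliseconds="(\d+)"', body)
        if match:
            return {"time_total_ms": int(match.group(2))}
        return {"error": "unexpected response"}


updater = BlazegraphWikidataUpdater()
result = updater.parse_update_result('<data modified="5" milliseconds="12"/>')
print(result)  # {'time_total_ms': 12}
```

Because only one method differs, a `git merge upstream/main` touches the wrapper file rarely, which is the point of the pattern.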

Review & Testing Checklist for Human

  • Run with --num-messages 5 against your Blazegraph instance to verify the curl→requests replacement works end-to-end (offset query, batch UPDATE, response parsing)
  • Verify resume-from-offset still works: run, stop (Ctrl+C), restart without --use-cached-sparql-queries
  • Simulate a network interruption during streaming to verify the SSE error handling reconnects gracefully instead of crashing
  • Check that git merge upstream/main still applies cleanly after these changes

Notes

  • The only changes to update_wikidata.py are: extracted parse_update_result(), replaced 3 curl calls with requests.post(), added _iter_sse_events() wrapper, and added getattr() guards. All existing QLever behavior is preserved.
  • requests was already a transitive dependency (via requests-sse); now it's an explicit direct dependency.
  • Version bumped to 0.2.0 with the curl→requests change.

Link to Devin session: https://app.devin.ai/sessions/33060449c31b4f439350bce7d4877fba

devin-ai-integration Bot and others added 2 commits April 14, 2026 13:19
Small surgical changes to allow subclassing for different SPARQL backends:

- Extract response parsing into overridable parse_update_result() method.
  Returns dict with time_total_ms on success, 'retry' or 'error' strings.
  Subclasses (e.g. for Blazegraph) can override just this method.

- Use getattr() for sparql_endpoint so callers can pass a pre-built URL
  instead of host_name + port (falls back to original behavior).

- Use getattr() for access_token and skip the query parameter when absent.

- Use getattr() for show flag (defaults to False when not present).

All existing QLever behavior is preserved — the changes are additive.
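The getattr() guards above can be illustrated as follows. The attribute names (`sparql_endpoint`, `access_token`, `show`, `host_name`, `port`) come from the commit message; the query-parameter name `access-token` and the surrounding code are assumptions for illustration only.

```python
# Sketch of the getattr() fallback pattern: each optional attribute
# degrades to the original QLever behavior when absent.
class Args:
    host_name = "localhost"
    port = 7001
    # Deliberately no sparql_endpoint, access_token, or show attributes.


args = Args()

# A pre-built URL wins if present; otherwise fall back to host + port.
endpoint = (getattr(args, "sparql_endpoint", None)
            or f"http://{args.host_name}:{args.port}")

# Skip the access-token query parameter entirely when absent.
token = getattr(args, "access_token", None)
params = {"access-token": token} if token else {}

# The show flag defaults to False when not present.
show = getattr(args, "show", False)

print(endpoint, params, show)  # http://localhost:7001 {} False
```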

Co-Authored-By: radim.kubacki <radim.kubacki@geneea.com>
Thin wrapper next to update_wikidata.py that subclasses
UpdateWikidataCommand and overrides only parse_update_result() to
handle Blazegraph's XML response format instead of QLever's JSON.

The wrapper:
- Parses the Blazegraph SPARQL endpoint URL as a positional argument
- Delegates all other flags (--batch-size, --since, --offset, etc.)
  to UpdateWikidataCommand.additional_arguments()
- Overrides parse_update_result() to parse Blazegraph XML responses
  (<data modified="N" milliseconds="M"/>)
- Handles error pages (HTML) and empty responses gracefully

Usage: python -m qlever.commands.blazegraph_wikidata_updater \
        http://localhost:9999/bigdata/namespace/wdq/sparql
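A minimal sketch of the XML/error/empty handling described above. The `<data modified="N" milliseconds="M"/>` format is quoted from this commit; the function name and error strings are hypothetical.

```python
# Illustrative parser for the three Blazegraph response shapes:
# clean XML on success, an HTML error page, or an empty body.
import xml.etree.ElementTree as ET


def parse_blazegraph_response(body: str) -> dict:
    if not body.strip():
        return {"error": "empty response"}
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        # Malformed markup (e.g. a typical HTML error page).
        return {"error": "non-XML response"}
    if root.tag == "data":
        return {
            "modified": int(root.attrib.get("modified", 0)),
            "time_total_ms": int(root.attrib.get("milliseconds", 0)),
        }
    # Well-formed markup, but not the success element we expect.
    return {"error": f"unexpected root element: {root.tag}"}


print(parse_blazegraph_response('<data modified="3" milliseconds="7"/>'))
# {'modified': 3, 'time_total_ms': 7}
```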
Co-Authored-By: radim.kubacki <radim.kubacki@geneea.com>
@devin-ai-integration
Author

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring

devin-ai-integration Bot and others added 10 commits April 14, 2026 13:43
Blazegraph wraps successful SPARQL UPDATE results in HTML with
statistics in <p> tags, not clean XML. The parser now recognizes:

1. Clean XML: <data modified="N" milliseconds="M"/>
2. HTML with COMMIT stats: totalElapsed=Nms, mutationCount=M
3. HTML with non-COMMIT stats (fallback)
4. HTML error pages (no stats found)

Previously all HTML responses were treated as errors, causing the
updater to fail even though the update was applied successfully.
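The HTML success-response recognition might look like the sketch below. The `COMMIT`, `totalElapsed=Nms`, and `mutationCount=M` strings are quoted from this commit; the function name and regexes are illustrative.

```python
# Sketch: recognize Blazegraph's HTML-wrapped COMMIT statistics so a
# successful update is not misreported as an error.
import re


def parse_html_commit_stats(html: str) -> dict:
    if "COMMIT" in html:
        elapsed = re.search(r"totalElapsed=(\d+)ms", html)
        mutations = re.search(r"mutationCount=(\d+)", html)
        if elapsed and mutations:
            return {
                "time_total_ms": int(elapsed.group(1)),
                "modified": int(mutations.group(1)),
            }
    # No stats found: treat the page as a genuine error response.
    return {"error": "HTML response without COMMIT stats"}


html = "<html><p>COMMIT: totalElapsed=42ms, mutationCount=3</p></html>"
print(parse_html_commit_stats(html))  # {'time_total_ms': 42, 'modified': 3}
```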

Co-Authored-By: radim.kubacki <radim.kubacki@geneea.com>
Add user_agent parameter to connect_to_sse_stream() so callers can
identify themselves per Wikimedia User-Agent policy.

In update_wikidata.py:
- connect_to_sse_stream() accepts optional user_agent parameter,
  defaults to existing 'qlever update-wikidata' when not provided
- Both call sites in execute() pass getattr(args, 'user_agent', None)
  so existing QLever usage is unchanged

In blazegraph_wikidata_updater.py:
- New --user-agent CLI argument
- Default: 'Geneea-BlazegraphUpdater/1.0 (https://geneea.com; sysadmin@geneea.com)'
  per Wikimedia policy format: BotName/version (URL; email)

Co-Authored-By: radim.kubacki <radim.kubacki@geneea.com>
Co-Authored-By: radim.kubacki <radim.kubacki@geneea.com>
- Rename package to geneea-wikidata-updater
- Add console script entry point: geneea-wikidata-updater
- Trim dependencies: remove argcomplete, pyyaml (only needed by full qlever CLI)
- Scope packaging to qlever + qlever.commands only (no Qleverfiles data)
- Update metadata: description, authors, URLs, keywords

Co-Authored-By: radim.kubacki <radim.kubacki@geneea.com>
- Replace all curl + run_command() calls with requests.post()
- Remove dependency on qlever.util.run_command (and transitively
  on bash/curl being available on the system)
- Add requests as explicit dependency in pyproject.toml
- Bump version to 0.2.0

The three replaced call sites:
1. get_next_offset_from_endpoint() — SPARQL query for stream offset
2. execute() — 'updates complete until' SPARQL query
3. execute() — SPARQL UPDATE POST request

Behavior is preserved: query functions use raise_for_status() for
clean error detection; the UPDATE response body is always passed
through to parse_update_result() unchanged (matching the previous
curl behavior where HTTP errors were not fatal at the transport
level).
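The two call patterns described above might look like this sketch; endpoint URLs, function names, and the media type are assumptions, and only the error-handling split (fail fast for queries, pass-through for UPDATE) is taken from the commit message.

```python
# Sketch of the curl -> requests.post replacement, showing the two
# error-handling styles described in the commit message.
import requests


def run_sparql_query(endpoint: str, query: str) -> str:
    """Query-style calls fail fast on HTTP errors via raise_for_status()."""
    response = requests.post(
        endpoint,
        data={"query": query},
        headers={"Accept": "application/sparql-results+json"},
        timeout=60,
    )
    response.raise_for_status()
    return response.text


def post_sparql_update(endpoint: str, update: str) -> str:
    """The UPDATE response body is passed through unchanged (matching the
    old curl behavior, where HTTP errors were not fatal at the transport
    level) so parse_update_result() can inspect it."""
    response = requests.post(
        endpoint,
        data=update.encode("utf-8"),
        headers={"Content-Type": "application/sparql-update; charset=UTF-8"},
        timeout=60,
    )
    return response.text
```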

Co-Authored-By: radim.kubacki <radim.kubacki@geneea.com>
- Add _iter_sse_events() wrapper that catches ConnectionError,
  HTTPError (429), and Timeout during event iteration, so a
  mid-stream failure stops the batch gracefully instead of crashing
- Add empty-batch guard: if no events were processed (e.g. the
  connection dropped immediately), reset to the same offset and
  reconnect via the existing retry_with_backoff() loop
- Initial SSE connection 429 was already handled by retry_with_backoff();
  this commit covers the mid-stream case
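The mid-stream handling above can be sketched as a generator wrapper. The exception classes follow the commit message (requests' ConnectionError, HTTPError, Timeout); the event source here is a plain iterable for illustration, not the real SSE client.

```python
# Sketch: wrap SSE event iteration so a mid-stream connection failure
# ends the current batch gracefully instead of crashing the updater.
import requests


def _iter_sse_events(events):
    """Yield events until the stream ends or a connection error occurs."""
    try:
        for event in events:
            yield event
    except (requests.ConnectionError, requests.HTTPError,
            requests.Timeout) as exc:
        # Stop the batch cleanly; the caller's empty-batch guard then
        # reconnects from the same offset via retry_with_backoff().
        print(f"SSE stream interrupted: {exc}")


def flaky_stream():
    yield "event-1"
    yield "event-2"
    raise requests.ConnectionError("connection dropped mid-stream")


batch = list(_iter_sse_events(flaky_stream()))
print(batch)  # ['event-1', 'event-2']
```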

Co-Authored-By: radim.kubacki <radim.kubacki@geneea.com>
- Uses uv for venv creation, build, and dependency management
- Single project build (no subproject loop)
- Supports --keep-env, --reuse-env, --upload, --python-version flags
- Upload to pypi.dev.g via twine
- No Docker (not needed for this package)
- Runs tests and linting via tox with tox-uv

Co-Authored-By: radim.kubacki <radim.kubacki@geneea.com>
Replace tox (no tox.ini in repo) with direct pytest -v.
Always install the package and test dependencies (pyyaml,
argcomplete) before running the test suite.

Co-Authored-By: radim.kubacki <radim.kubacki@geneea.com>
Add charset=UTF-8 to the Content-Type header for SPARQL UPDATE
POST requests. Without it, Blazegraph's Jetty server defaults to
ISO-8859-1, causing UTF-8 multi-byte characters (e.g. ü) to be
stored as mojibake (e.g. Ã¼).

This is a known Blazegraph issue (blazegraph/database#206) and the
same fix was applied in rdflib (RDFLib/rdflib#2095).
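The mojibake mechanism is easy to reproduce: bytes encoded as UTF-8 but decoded as ISO-8859-1 turn each multi-byte character into two characters. The media type in the header below is an assumption; the commit itself only specifies adding charset=UTF-8.

```python
# Demonstration of the encoding bug this commit fixes: 'ü' encoded as
# UTF-8 (two bytes) but decoded as ISO-8859-1 becomes two characters.
text = "Zürich"
mojibake = text.encode("utf-8").decode("iso-8859-1")
print(mojibake)  # the ü has become the two-character sequence 'Ã¼'

# Declaring the charset explicitly prevents the server-side misread:
headers = {"Content-Type": "application/sparql-update; charset=UTF-8"}
```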

Co-Authored-By: radim.kubacki <radim.kubacki@geneea.com>