SYS-572 Add Blazegraph support for Wikidata SSE update stream #1
Open
devin-ai-integration[bot] wants to merge 12 commits into main from
Conversation
Small surgical changes to allow subclassing for different SPARQL backends:

- Extract response parsing into an overridable parse_update_result() method. Returns a dict with time_total_ms on success, or the strings 'retry' / 'error'. Subclasses (e.g. for Blazegraph) can override just this method.
- Use getattr() for sparql_endpoint so callers can pass a pre-built URL instead of host_name + port (falls back to the original behavior).
- Use getattr() for access_token and skip the query parameter when absent.
- Use getattr() for the show flag (defaults to False when not present).

All existing QLever behavior is preserved — the changes are additive.

Co-Authored-By: radim.kubacki <radim.kubacki@geneea.com>
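The subclassing pattern described above can be sketched as follows. This is a minimal illustration, not the actual QLever code: the class and method names follow the commit text, but the JSON fields and the regex in the Blazegraph override are assumptions.

```python
import json
import re


class UpdateWikidataCommand:
    def parse_update_result(self, response_body: str):
        """Parse a QLever JSON update response.

        Returns a dict with 'time_total_ms' on success, or the string
        'retry' / 'error' so the caller can decide what to do next.
        """
        try:
            result = json.loads(response_body)
        except json.JSONDecodeError:
            return "error"
        if "time" in result:
            return {"time_total_ms": result["time"].get("total", 0)}
        return "retry"


class BlazegraphWikidataUpdater(UpdateWikidataCommand):
    def parse_update_result(self, response_body: str):
        # Subclasses override only this method; everything else
        # (SSE handling, batching, retries) is inherited.
        match = re.search(r'milliseconds="(\d+)"', response_body)
        if match:
            return {"time_total_ms": int(match.group(1))}
        return "error"
```

The getattr() guards mentioned in the commit work the same way: `getattr(args, "sparql_endpoint", None)` lets a subclass pass a pre-built URL while existing callers keep the host_name + port path.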
Thin wrapper next to update_wikidata.py that subclasses
UpdateWikidataCommand and overrides only parse_update_result() to
handle Blazegraph's XML response format instead of QLever's JSON.
The wrapper:
- Parses the Blazegraph SPARQL endpoint URL as a positional argument
- Delegates all other flags (--batch-size, --since, --offset, etc.)
to UpdateWikidataCommand.additional_arguments()
- Overrides parse_update_result() to parse Blazegraph XML responses
(<data modified="N" milliseconds="M"/>)
- Handles error pages (HTML) and empty responses gracefully
Usage: python -m qlever.commands.blazegraph_wikidata_updater \
http://localhost:9999/bigdata/namespace/wdq/sparql
Co-Authored-By: radim.kubacki <radim.kubacki@geneea.com>
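The wrapper's argument wiring might look like the sketch below. The positional endpoint argument matches the commit text; the delegated flags and their defaults are illustrative assumptions (the real defaults live in UpdateWikidataCommand.additional_arguments()).

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="blazegraph_wikidata_updater")
    # Blazegraph SPARQL endpoint URL as a positional argument.
    parser.add_argument("sparql_endpoint",
                        help="Blazegraph SPARQL endpoint URL")
    # All other flags (--batch-size, --since, --offset, ...) would be
    # delegated to UpdateWikidataCommand.additional_arguments();
    # two are shown inline here purely for illustration.
    parser.add_argument("--batch-size", type=int, default=1000)
    parser.add_argument("--since", default=None)
    return parser
```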
Blazegraph wraps successful SPARQL UPDATE results in HTML with statistics in <p> tags, not clean XML. The parser now recognizes:

1. Clean XML: <data modified="N" milliseconds="M"/>
2. HTML with COMMIT stats: totalElapsed=Nms, mutationCount=M
3. HTML with non-COMMIT stats (fallback)
4. HTML error pages (no stats found)

Previously all HTML responses were treated as errors, causing the updater to fail even though the update had been applied successfully.

Co-Authored-By: radim.kubacki <radim.kubacki@geneea.com>
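The four cases above could be handled roughly as follows. This is a sketch, not the shipped parser: the regexes are assumptions based on the formats quoted in the commit, and the "non-COMMIT stats" fallback is approximated here as an elapsed time without a mutation count.

```python
import re


def parse_blazegraph_response(body: str):
    # Case 1: clean XML success response.
    m = re.search(r'<data modified="(\d+)" milliseconds="(\d+)"\s*/>', body)
    if m:
        return {"modified": int(m.group(1)), "time_total_ms": int(m.group(2))}
    # Case 2: HTML with COMMIT statistics embedded in <p> tags.
    elapsed = re.search(r"totalElapsed=(\d+)ms", body)
    mutations = re.search(r"mutationCount=(\d+)", body)
    if elapsed and mutations:
        return {
            "modified": int(mutations.group(1)),
            "time_total_ms": int(elapsed.group(1)),
        }
    # Case 3: HTML with non-COMMIT stats (fallback: elapsed time only).
    if elapsed:
        return {"time_total_ms": int(elapsed.group(1))}
    # Case 4: HTML error page, no stats found.
    return "error"
```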
Add a user_agent parameter to connect_to_sse_stream() so callers can identify themselves per the Wikimedia User-Agent policy.

In update_wikidata.py:
- connect_to_sse_stream() accepts an optional user_agent parameter, defaulting to the existing 'qlever update-wikidata' when not provided
- Both call sites in execute() pass getattr(args, 'user_agent', None), so existing QLever usage is unchanged

In blazegraph_wikidata_updater.py:
- New --user-agent CLI argument
- Default: 'Geneea-BlazegraphUpdater/1.0 (https://geneea.com; sysadmin@geneea.com)' per the Wikimedia policy format: BotName/version (URL; email)

Co-Authored-By: radim.kubacki <radim.kubacki@geneea.com>
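The fallback logic can be sketched as a small header builder. The default string comes from the commit text; the helper function itself is hypothetical and only illustrates the "optional parameter, unchanged default" behavior.

```python
from typing import Optional

# Existing QLever identifier, used when no user_agent is provided.
DEFAULT_USER_AGENT = "qlever update-wikidata"


def sse_headers(user_agent: Optional[str] = None) -> dict:
    # Callers that pass nothing (or None, via getattr with a None
    # default) keep the original QLever User-Agent unchanged.
    return {
        "Accept": "text/event-stream",
        "User-Agent": user_agent or DEFAULT_USER_AGENT,
    }
```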
Co-Authored-By: radim.kubacki <radim.kubacki@geneea.com>
- Rename package to geneea-wikidata-updater
- Add console script entry point: geneea-wikidata-updater
- Trim dependencies: remove argcomplete and pyyaml (only needed by the full qlever CLI)
- Scope packaging to qlever + qlever.commands only (no Qleverfiles data)
- Update metadata: description, authors, URLs, keywords

Co-Authored-By: radim.kubacki <radim.kubacki@geneea.com>
- Replace all curl + run_command() calls with requests.post()
- Remove the dependency on qlever.util.run_command (and transitively on bash/curl being available on the system)
- Add requests as an explicit dependency in pyproject.toml
- Bump version to 0.2.0

The three replaced call sites:
1. get_next_offset_from_endpoint() — SPARQL query for the stream offset
2. execute() — 'updates complete until' SPARQL query
3. execute() — SPARQL UPDATE POST request

Behavior is preserved: query functions use raise_for_status() for clean error detection; the UPDATE response body is always passed through to parse_update_result() unchanged (matching the previous curl behavior, where HTTP errors were not fatal at the transport level).

Co-Authored-By: radim.kubacki <radim.kubacki@geneea.com>
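The query/update split in error handling might look like this sketch. Endpoint URLs, timeouts, and function names are placeholders; only the contrast matters: queries fail loudly via raise_for_status(), while the UPDATE body is returned verbatim for parse_update_result().

```python
import requests


def run_sparql_query(endpoint: str, query: str) -> dict:
    response = requests.post(
        endpoint,
        data={"query": query},
        headers={"Accept": "application/sparql-results+json"},
        timeout=60,
    )
    # Queries use raise_for_status() for clean error detection.
    response.raise_for_status()
    return response.json()


def run_sparql_update(endpoint: str, update: str) -> str:
    response = requests.post(
        endpoint,
        data=update.encode("utf-8"),
        headers={"Content-Type": "application/sparql-update"},
        timeout=600,
    )
    # No raise_for_status() here: the body goes to parse_update_result()
    # unchanged, matching the previous curl behavior where HTTP errors
    # were not fatal at the transport level.
    return response.text
```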
- Add an _iter_sse_events() wrapper that catches ConnectionError, HTTPError (429), and Timeout during event iteration, so a mid-stream failure stops the batch gracefully instead of crashing
- Add an empty-batch guard: if no events were processed (e.g. the connection dropped immediately), reset to the same offset and reconnect via the existing retry_with_backoff() loop
- A 429 on the initial SSE connection was already handled by retry_with_backoff(); this commit covers the mid-stream case

Co-Authored-By: radim.kubacki <radim.kubacki@geneea.com>
- Uses uv for venv creation, build, and dependency management
- Single project build (no subproject loop)
- Supports --keep-env, --reuse-env, --upload, --python-version flags
- Upload to pypi.dev.g via twine
- No Docker (not needed for this package)
- Runs tests and linting via tox with tox-uv

Co-Authored-By: radim.kubacki <radim.kubacki@geneea.com>
Replace tox (no tox.ini in repo) with direct pytest -v. Always install the package and test dependencies (pyyaml, argcomplete) before running the test suite. Co-Authored-By: radim.kubacki <radim.kubacki@geneea.com>
Add charset=UTF-8 to the Content-Type header for SPARQL UPDATE POST requests. Without it, Blazegraph's Jetty server defaults to ISO-8859-1, causing UTF-8 multi-byte characters (e.g. ü) to be stored as mojibake (e.g. Ã¼). This is a known Blazegraph issue (blazegraph/database#206) and the same fix was applied in rdflib (RDFLib/rdflib#2095).

Co-Authored-By: radim.kubacki <radim.kubacki@geneea.com>
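A quick demonstration of why the header matters: 'ü' is sent as two UTF-8 bytes, and a server that assumes ISO-8859-1 decodes them as two separate characters. The header value below is the fix described above; the helper function is just for illustration.

```python
# Content-Type with explicit charset, so Jetty decodes the body as UTF-8.
HEADERS = {"Content-Type": "application/sparql-update; charset=UTF-8"}


def mojibake(text: str) -> str:
    """Simulate a server decoding UTF-8 request bytes as ISO-8859-1."""
    return text.encode("utf-8").decode("iso-8859-1")
```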
Summary
Adds Blazegraph support for keeping a self-hosted Blazegraph instance synchronized with Wikidata via the public RDF update stream (SSE). Uses a thin wrapper subclass pattern to minimize changes to the upstream update_wikidata.py, keeping the fork mergeable with future upstream updates.

Commits (in order):

1. Refactor update_wikidata.py for backend extensibility — extract parse_update_result() as an overridable method, make sparql_endpoint / access_token / show safe via getattr()
2. Add a blazegraph_wikidata_updater.py subclass that overrides only response parsing and adds the Blazegraph URL as a CLI arg
3. Handle Blazegraph's HTML-wrapped update statistics (COMMIT: totalElapsed=...)
4. Add a --user-agent flag with a Geneea-identifying default per Wikimedia policy
5. Rework pyproject.toml — new package name geneea-wikidata-updater, trimmed deps, console script entry point
6. Replace the curl subprocess with the requests library — all three curl call sites replaced with requests.post(), removing the dependency on bash/curl
7. Add an _iter_sse_events() wrapper that catches mid-stream connection errors; an empty-batch guard reconnects from the same offset

Review & Testing Checklist for Human
- Run with --num-messages 5 against your Blazegraph instance to verify the curl→requests replacement works end-to-end (offset query, batch UPDATE, response parsing)
- Verify behavior with --use-cached-sparql-queries
- Confirm git merge upstream/main still applies cleanly after these changes

Notes
- The only changes to update_wikidata.py are: extracted parse_update_result(), replaced 3 curl calls with requests.post(), added an _iter_sse_events() wrapper, and added getattr() guards. All existing QLever behavior is preserved.
- requests was already a transitive dependency (via requests-sse); now it's an explicit direct dependency.
- Version bumped to 0.2.0 with the curl→requests change.

Link to Devin session: https://app.devin.ai/sessions/33060449c31b4f439350bce7d4877fba