DRILL-8529: Caching QueryPlan Results#3023
Conversation
85a5b9b to
819c3f1
Compare
|
@vdegans Wow! This is an impressive first contribution to Drill! Before we start review, would you please do a clean rebase on |
5ac8803 to
c49c581
Compare
|
@cgivre Thanks! I fixed the rebase. |
Thanks. Before I start review I had a few questions:
What I'm getting at here is would it make sense to have global settings which you already have, but then also give the user the ability to set custom settings for specific plugins if they wanted to do so. I genuinely don't know if that is worth the effort or not. I could imagine this being more of an issue with data where the schema could change--MongoDB or JSON files for instance--and queries like SELECT * might bring back different data every time you run them. |
|
Thanks, that’s a great point. I agree that giving users control over caching at the per-plugin level makes sense. Some plugins, especially ones where the underlying data might change frequently, like MongoDB or JSON files, could benefit from having caching disabled or customized independently from the global settings. I think supporting per-plugin cache configuration would give users the flexibility to optimize caching behavior for their specific use cases and improve the overall user experience. |
|
@vdegans Hi Vincent, Any update? |
|
Hi @cgivre, I was thinking about the suggestion for a setting to enable/disable caching per plugin and I got stuck on the idea of when to cache and when not to. I didn't get much time to look at the code yet, but I would like to hear your thoughts about the settings per plugin. |
The whole idea of partially cached query plans is extremely tricky. I think but could be wrong but there may have been some work on that from the Calcite team at one point. In any event, my suggestion would be to start simple. Let's get all the unit tests to pass and just start with simple caching. IE: exact query match. Once that's done and merged, we can iterate and find improvements. |
|
@vdegans You should rebase on current master. I think that will solve the size limit issue you're running into. |
|
I think caffeine might cause this, should I add caffeine to the exclude list? |
You can either exclude caffeine or bump up the max size. Either is fine. |
|
Locally this seems to have fixed the issue |
992e6f3 to
35ac0a1
Compare
|
@vdegans I have some time now and can assist with this. Could you please rebase on current master and once you've done that, I'll see how I can help. |
35ac0a1 to
59223a5
Compare
|
Thank you for your time, I have just rebased the branch. |
|
@vdegans Can you take a look at my fork (https://github.com/cgivre/drill/tree/rebased-cache)? I think these changes will help get the CI/CD to pass. |
…ogic - Fix CustomCacheManager static initializer that crashed tests during class load by using lazy initialization with double-checked locking - Consolidate caching logic: remove duplicate ConcurrentHashMap cache from DrillSqlWorker and redundant caching in Foreman, use CustomCacheManager only - Clean up verbose debug logging (remove logger.info calls, printStackTrace) - Add clearCaches() and reset() methods for testing support - Only cache SELECT queries (not DDL/DML statements)
cgivre
left a comment
There was a problem hiding this comment.
Thank you very much for this contribution. I have some minor nits, and some questions:
- Is there ever a situation where the underlying data would change and it would affect the query plan such that Drill isn't returning the correct data?
- If possible, I'd really like to see a flag added to the metadata that is returned which would indicate that the query plan was from cache.
- Do you anticipate any security issues from using cached query plans? For instance, let's say that we have user translation enabled and user 1 executes a query against a MySQL database. User 2 then tries the same query, but does not have the same access. Would user 2 be able to access the data?
| } | ||
|
|
||
| public static void logCacheStats() { | ||
| ensureInitialized(); |
There was a problem hiding this comment.
These stats would be really useful to expose in the UI. However, I'd suggest adding that in another PR. Let's get this merged first!
| if (planCacheEnabled) { | ||
| CustomCacheManager.putTransformedPlan(key, output); | ||
| logger.debug("Cached transform result for phase: {}", phase); | ||
| } |
There was a problem hiding this comment.
Do you think it would be possible to include a flag in the results that would indicate that the query plan was retrieved from cache?
| @SuppressWarnings("deprecation") | ||
| final OptionDefinition[] definitions = new OptionDefinition[]{ | ||
| new OptionDefinition(PlannerSettings.PLAN_CACHE), | ||
| // new OptionDefinition(PlannerSettings.PLAN_CACHE_TTL), |
There was a problem hiding this comment.
Please remove commented out lines.
| * Moves query to RUNNING state. | ||
| */ | ||
| private void startQueryProcessing() { | ||
| logger.info("Starting query processing"); |
There was a problem hiding this comment.
Please make debug. Drill already emits a lot of logs.
| logger.info("Query text for query with id {} issued by {}: {}", queryIdString, | ||
| queryContext.getQueryUserName(), sql); | ||
| runSQL(sql); | ||
| logger.info("RunSQL is executed within {}", new Date().getTime() - start); |
DRILL-8529: Caching QueryPlan Results
Description
Implements a caching mechanism for query plans and transformations, to shorten the prepare phase.
Documentation
The cache behavior can be customized via drill-override.conf
At runtime, caching can also be toggled with:
planner.cache.enable (true = enabled, false = disabled)Testing