Core, AWS: Allow stopping thread pools manually #15312

Open
svalaskevicius wants to merge 33 commits into apache:main from svalaskevicius:control-of-static-thread-pools
Conversation


@svalaskevicius svalaskevicius commented Feb 13, 2026

Control of static thread pools — manual shutdown and lifecycle management

This change introduces a centralized ThreadPoolManager that gives users explicit control over Iceberg's thread pool lifecycle.

Motivation

The previous approach used Guava's MoreExecutors.getExitingExecutorService, which registered unremovable shutdown hooks that accumulated over time when short-lived pools were created. This change replaces it with a central manager, enabling manual shutdown and opt-out of automatic hook registration.

This is needed when the application registers its own shutdown hooks that need to commit the iceberg file upload cleanly. Because the JVM does not guarantee any order in which shutdown hooks are invoked, Iceberg's hook can kill its thread pools before the application's hook runs, leaving it unable to complete the export.
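To make the ordering problem concrete, here is a minimal sketch of the underlying JVM behaviour (not Iceberg code): hooks registered with Runtime.addShutdownHook run concurrently in no specified order, and the only portable remedy is to deregister a hook before shutdown begins and sequence the teardown explicitly.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class HookOrderSketch {
    public static void main(String[] args) {
        // Shutdown hooks are started concurrently in an unspecified order, so
        // an application hook that flushes pending commits may run after a
        // library hook has already killed the library's thread pools.
        AtomicBoolean poolsKilled = new AtomicBoolean(false);
        Thread libraryHook = new Thread(() -> poolsKilled.set(true));
        Runtime.getRuntime().addShutdownHook(libraryHook);

        // Deregistering the hook succeeds only before shutdown has begun;
        // after that, teardown must be sequenced by the application itself.
        boolean removed = Runtime.getRuntime().removeShutdownHook(libraryHook);
        System.out.println("removed=" + removed);
    }
}
```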

Intended Use Cases

  1. Stop thread pools manually to avoid leaks in hot-reload environments
  2. Opt out of the standard JVM shutdown hook mechanism to manage graceful service stops (e.g., committing last pending files before exiting).

Key Changes

  • shutdownThreadPools() — gracefully shuts down all registered pools and removes the JVM shutdown hook
  • removeShutdownHook() — opt out of automatic shutdown hooks so applications can manage their own graceful shutdown sequence (within their shutdown hooks)

Fixes issue #15039

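The proposed API shape can be sketched with a tiny self-contained manager (this is an illustration of the design, not the PR's actual implementation): pools register on creation, shutdownAll() can be invoked manually and is idempotent, and the JVM hook is removable so the application can sequence its own teardown.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PoolManagerSketch {
    private static final List<ExecutorService> POOLS = new CopyOnWriteArrayList<>();
    private static final Thread HOOK = new Thread(PoolManagerSketch::shutdownAll);
    static { Runtime.getRuntime().addShutdownHook(HOOK); }

    // Every pool created through the manager is tracked for later shutdown.
    static ExecutorService register(ExecutorService pool) {
        POOLS.add(pool);
        return pool;
    }

    // Idempotent: ExecutorService.shutdown() is safe to call repeatedly.
    static void shutdownAll() {
        POOLS.forEach(ExecutorService::shutdown);
    }

    // Opt out of the automatic hook so the app controls teardown ordering.
    static boolean removeShutdownHook() {
        return Runtime.getRuntime().removeShutdownHook(HOOK);
    }

    public static void main(String[] args) {
        ExecutorService pool = register(Executors.newFixedThreadPool(2));
        removeShutdownHook();  // the app will shut pools down itself...
        shutdownAll();         // ...e.g. inside its own shutdown hook
        System.out.println("shutdown=" + pool.isShutdown());
    }
}
```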
@svalaskevicius svalaskevicius force-pushed the control-of-static-thread-pools branch from cbb17a3 to db22fba Compare February 13, 2026 10:44
@github-actions github-actions Bot added the AWS label Feb 13, 2026
@svalaskevicius svalaskevicius marked this pull request as ready for review February 16, 2026 09:38
@svalaskevicius svalaskevicius requested a review from mxm February 16, 2026 09:38
Author

svalaskevicius commented Feb 19, 2026

@mxm

It works, but the way the shutdown is executed is quite nondeterministic. With the current code, the shutdown timeout can be reset multiple times if there are multiple calls to shutdownThreadPools().

I've updated the javadoc to word it more strictly: only call this at the end of the intended usage of the library, never before. It is also safe to call multiple times - but only after the client code is sure it is no longer using the library.

@svalaskevicius svalaskevicius requested a review from mxm February 19, 2026 11:57
Contributor

@mxm mxm left a comment


A couple more comments:

  1. Why are we not touching AuthRefreshPoolHolder?
  2. There are no tests yet.

@svalaskevicius
Author

@mxm updated, please check.

I had missed AuthRefreshPoolHolder and indeed the whole scheduled executors side - which is now added as well.

Also, what tests do you have in mind?

This is still mostly static code, and invoking the shutdown in a test affects a lot of shared state. One option would be to extract the more complicated, non-obvious logic into separate functions and test them in isolation - do you have any preferences?

I'm a bit wary of the unsettled state of this PR/task - especially with the new parallel PR out there :) Would it be possible to set out a single list of pending changes before this can be considered complete/good enough?

Thanks!

@svalaskevicius svalaskevicius force-pushed the control-of-static-thread-pools branch 2 times, most recently from 186f578 to ff4d7e0 Compare March 3, 2026 11:22
@github-actions github-actions Bot removed the AWS label Apr 24, 2026
@svalaskevicius svalaskevicius requested a review from nastra April 24, 2026 12:07
Contributor

rdblue commented Apr 27, 2026

@svalaskevicius can you please update the description with exactly what this is changing and why? I think it is a red flag that this is touching the static worker pool, but doesn't specifically call that out in the PR description and also doesn't explain why that is needed.

If this PR is about how to manage threadpools coming from factory methods in ThreadPools, then what this is trying to accomplish should be clearer.

I am also skeptical of this kind of update in general. This class was intended to hold a single threadpool, not to provide threadpools. As it grew over time, people wanted to have threadpools configured the same way. But changing the behavior of the default pools in order to make other uses of the convenience methods more generic is not the right path forward.

Contributor

@rdblue rdblue left a comment


I think that this should not touch the default threadpools and should clearly outline the motivation for these changes in the PR description.

@svalaskevicius
Author

@rdblue

Hi, thanks for the feedback. I've updated the PR description - I hope it's clearer.

Could you explain what you mean regarding the default thread pools? Having them created via MoreExecutors is exactly the problem - or do you mean to simply revert commit 1bc41cc?

Thanks

Contributor

rdblue commented Apr 28, 2026

I've updated the PR description - I hope it's clearer.

If I understand correctly from #15031, it looks like the issue is that Flink uses new classloaders and the shutdown hook is never called because the JVM never exits. So we need to use threadpools tied to Flink's lifecycle. That's reasonable, but we definitely do NOT want to give "users" control over threadpool lifecycles.

I think this implementation goes a bit beyond what we need to do. Assuming that we want to have a way to shut down these pools, that doesn't mean that we should remove the shutdown hook. The contract of newExitingWorkerPool is that it will be shut down by a hook. Why not update the shutdown hook so that it is a noop if the pool is already shut down?
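The suggestion above - keep the exiting-pool contract but make the hook harmless once the pool is already stopped - can be sketched as follows (illustrative only; ExecutorService.shutdown() is itself idempotent, so the explicit check mainly documents intent and lets a real hook skip any timed-await logic):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class NoopHookSketch {
    public static void main(String[] args) {
        ExecutorService pool = Executors.newSingleThreadExecutor();

        // The hook body only acts if nobody has shut the pool down already.
        Runnable hookBody = () -> {
            if (!pool.isShutdown()) {
                pool.shutdown();
            }
        };

        pool.shutdown();  // manual, early shutdown by the application
        hookBody.run();   // the hook later finds nothing left to do
        System.out.println("stillShutdown=" + pool.isShutdown());
    }
}
```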

Also, it's debatable whether we need to track all pools created by the factory methods here. This class is responsible for its own static pools and its methods are reused elsewhere. Having a way to close all threadpools created by this class seems very risky to me. I'm not convinced that we want to expose a method to shut down a single static threadpool directly, let alone one that will shut down pools created and managed by other classes. The reason for static pools is to ensure one is available. Allowing them to be closed allows one class to break functionality for another. Allowing all of them to be closed is even more risk.

I think the solution is not to expose these methods. Instead, I propose two things:

  1. Deprecate and remove newExitingWorkerPool and fix the 3 uses of it. HadoopFileIO can manage the lifecycle of its thread pool.
  2. Lazily create the pools managed by ThreadPools. The APIs that use the static pools support passing in an executor service with a lifecycle managed by the client. This means that a Flink application can manage its own threadpools (which is why we added the API) and shut them down appropriately. If Flink isn't using the static pools, then they will never start and we don't need a shutdown method.

I think if we do these two things then we will fix the problem and be better off. I'd also be okay with an option to prevent these pools from being started if we need more guarantees. I think that's safer and better than using them but shutting them down.
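Lazy creation of the static pools (point 2 above) is commonly done with the initialization-on-demand holder idiom; a minimal sketch, with hypothetical names, of how the pool (and any hook it registers) would only come into existence on first use:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class LazyPoolSketch {
    // The holder class is not loaded (and the pool not created) until
    // getWorkerPool() is first called; callers that pass in their own
    // executor never trigger it.
    private static class Holder {
        static final ExecutorService WORKER_POOL = Executors.newFixedThreadPool(4);
    }

    static ExecutorService getWorkerPool() {
        return Holder.WORKER_POOL;
    }

    public static void main(String[] args) {
        // Repeated calls return the same lazily created instance.
        System.out.println("same=" + (getWorkerPool() == getWorkerPool()));
        getWorkerPool().shutdown();
    }
}
```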

Contributor

pvary commented Apr 29, 2026

@rdblue

Allowing them to be closed allows one class to break functionality for another.

Technically, nothing prevents users from shutting down the static pools already, for example:

    ThreadPools.getWorkerPool().shutdown();

Today we mostly rely on reviews to prevent this, and in some cases it is not necessarily obvious. If the goal is to guarantee that users cannot shut these pools down, that could be addressed more directly (e.g., by wrapping the returned executor), but that safeguard does not exist today.
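The wrapping safeguard mentioned above could look like this (hypothetical; no such wrapper exists in ThreadPools today): the shared pool is handed out behind a delegate that rejects shutdown, so only the owner can stop the real pool.

```java
import java.util.List;
import java.util.concurrent.AbstractExecutorService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class UnstoppableWrapperSketch {
    // Delegates execution but refuses lifecycle mutation through this handle.
    static ExecutorService unstoppable(ExecutorService delegate) {
        return new AbstractExecutorService() {
            @Override public void execute(Runnable task) { delegate.execute(task); }
            @Override public void shutdown() {
                throw new UnsupportedOperationException("shared pool: shutdown not allowed");
            }
            @Override public List<Runnable> shutdownNow() {
                throw new UnsupportedOperationException("shared pool: shutdownNow not allowed");
            }
            @Override public boolean isShutdown() { return delegate.isShutdown(); }
            @Override public boolean isTerminated() { return delegate.isTerminated(); }
            @Override public boolean awaitTermination(long t, TimeUnit u) throws InterruptedException {
                return delegate.awaitTermination(t, u);
            }
        };
    }

    public static void main(String[] args) {
        ExecutorService shared = Executors.newFixedThreadPool(2);
        ExecutorService handle = unstoppable(shared);
        try {
            handle.shutdown();
        } catch (UnsupportedOperationException e) {
            System.out.println("blocked=" + !shared.isShutdown());
        }
        shared.shutdown();  // only the owner stops the real pool
    }
}
```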

Changing all usages of newExitingWorkerPool / newExitingScheduledPool would also be fairly invasive. These thread pools are used across several FileIO implementations (e.g., HadoopFileIO → newExitingWorkerPool, S3FileIO / GCSFileIO → newExitingScheduledPool, ADLSFileIO → getWorkerPool).

Do you think a better long‑term direction would be to change FileIO.initialize(Map<String, String> properties) to something like:

    FileIO.initialize(
        Map<String, String> properties,
        Function<Class<? extends FileIO>, ScheduledExecutorService> executorFactory)

and propagate this through the FileIO implementations?

In practice, FileIO instances are often created by catalogs, which suggests that we change the Catalog.initialize(String name, Map<String, String> properties) to something like:

    Catalog.initialize(
        String name,
        Map<String, String> properties,
        Function<Class<? extends FileIO>, ScheduledExecutorService> executorFactory)

and propagate this through the Catalog implementations and change the CatalogUtil.loadCatalog(String impl, String catalogName, Map<String, String> properties, Object hadoopConf) to allow setting the factory.

The FileIO implementations can't manage the lifecycle of the pools themselves, as they don't know whether they should keep the shared resources open or close them too.
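The factory shape proposed above can be sketched with a per-FileIO-class cache; the owner (catalog or application) controls the lifecycle rather than the FileIO instances. The class names here (FakeS3FileIO, the local FileIO interface) are illustrative stand-ins, not real Iceberg types.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.function.Function;

public class ExecutorFactorySketch {
    interface FileIO {}                          // stand-in for the real interface
    static class FakeS3FileIO implements FileIO {}

    public static void main(String[] args) {
        // One shared scheduled pool per FileIO class, created on first request.
        Map<Class<? extends FileIO>, ScheduledExecutorService> pools = new ConcurrentHashMap<>();
        Function<Class<? extends FileIO>, ScheduledExecutorService> factory =
            cls -> pools.computeIfAbsent(cls, c -> Executors.newScheduledThreadPool(1));

        // Every instance of the same FileIO class sees the same shared pool.
        boolean shared = factory.apply(FakeS3FileIO.class) == factory.apply(FakeS3FileIO.class);
        System.out.println("shared=" + shared);

        // The owner shuts the pools down; FileIO instances never do.
        pools.values().forEach(ExecutorService::shutdown);
    }
}
```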

@svalaskevicius
Author

NOTE: updated the description again to add a section about the non-deterministic nature of the JVM shutdown hooks and how it prevents clean iceberg file commits when the application is exiting

Contributor

rdblue commented Apr 29, 2026

@pvary, let me outline my assumptions so that my rationale is clear. First, I'm assuming that we are only worrying about the exiting pools. The other pools should have their own lifecycle management. The non-exiting pools are what I would expect FileIO instances to use, unless the pools are shared. Next, I see newExitingScheduledPool has 2 outside uses, S3FileIO and GCSFileIO, and newExitingWorkerPool has 3 outside uses in HadoopFileIO, RESTMetricsReporter, and AuthSessionCache. 5 uses is not a huge issue, but this could be a larger problem if the actual issue is that FileIO implementations don't properly manage pool lifecycles.

However, I don't think the FileIO situation has much influence over the decision here. This PR proposes that we add:

  1. A way to remove shutdown hooks from all thread pools
  2. A global shutdown for all thread pools created by ThreadPools

I think those are both poor solutions. For the first one, if you don't want a threadpool that is configured like the static pools, the right answer is to use your own threadpool and manage its lifecycle. Customizing the lifecycle of the static pools opens the possibility of misconfiguration. The reasonable way to allow using your own threadpool (and not worrying about the shutdown hooks) is to lazily start the static pools. But if you can already shut down the static pools, then we don't really need even that.

For the second one, I think it is a bad idea to have a global shutdown for the static pools, let alone all of the pools created by the factory methods. This is a drastic behavior change that can be misused. It may allow us to avoid fixing pools created by FileIO, but we should not be creating dangerous ways to work around the right solution. And if I understand correctly, the problem you're saying we need to solve is that the pools created for FileIO have inconsistently managed lifecycles.

@svalaskevicius
Author

Hi @rdblue,

Just to clarify:

Next, I see newExitingScheduledPool has 2 outside uses, S3FileIO and GCSFileIO, and newExitingWorkerPool has 3 outside uses in HadoopFileIO, RESTMetricsReporter, and AuthSessionCache.

The problem covers all thread pools currently managed by MoreExecutors - including both worker pools here.

  1. A global shutdown for all thread pools created by ThreadPools

A global shutdown exists already via MoreExecutors - it is just deferred to the JVM shutdown hook, and there is no control over when exactly it will be invoked.


I think those are both poor solutions. For the first one, if you don't want a threadpool that is configured like the static pools, the right answer is to use your own threadpool and manage its lifecycle <..>

I agree with this. If iceberg had thread pools injected via constructors (and did not rely on the global JVM shutdown hooks) it would address the issue cleanly. At this point, however, I was looking for a pragmatic solution. We have an application that uses the iceberg library and cannot commit/flush files on exit, because iceberg has already killed its threads and produces a lot of errors about the ThreadPools having been stopped.

Given that, and that there is an open issue related to thread stopping, my reasoning was that these can be combined to a single change to solve both problems - this is the resulting PR.

I suppose the question is - is this an acceptable interim solution until the threadpool (factory?) injection is implemented, or should this PR be closed and converted to an open issue (for the uncontrolled/non-deterministic shutdown behaviour)?

Do you have any thoughts as to when the proper fix could be expected?

Regards,
Sarunas

Contributor

pvary commented Apr 30, 2026

@rdblue: Thanks for the detailed reply. This helps clarify where our views differ.

First, I'm assuming that we are only worrying about the exiting pools.

Agreed.

The other pools should have their own lifecycle management.

Fully agree.

The non-exiting pools are what I would expect FileIO instances to use, unless the pools are shared.

In practice, all FileIO implementations currently rely on static exiting worker pools. The pattern is that these pools are created on first use and then shared across instances. That approach has two advantages:

  • The threads are used only occasionally, so sharing them keeps the overall thread footprint small.
  • The pool size naturally bounds the level of outgoing parallelism.

This PR proposes that we add:

  • A way to remove shutdown hooks from all thread pools
  • A global shutdown for all thread pools created by ThreadPools

The naming may indeed need improvement, but the proposed shutdown only targets the exiting pools created by newExitingWorkerPool and newExitingScheduledPool. These are exactly the pools where we currently lack direct control over shutdown.

the problem you're saying we need to solve is that the pools created for FileIO have inconsistently managed lifecycles.

From my perspective, the pattern across FileIO implementations is actually quite consistent: they require a global shared pool per FileIO type, reused across instances and normally closed only on exit. The issue is that “exit” is not always equivalent to a JVM shutdown. In cases like hot replacement, we need a way to release those resources even while the JVM keeps running.

Based on this discussion, I see a few possible directions:

  1. Adopt the previously mentioned (but somewhat invasive) change to the FileIO API.
  2. Introduce a ThreadPoolManager outside of ThreadPools and migrate FileIO implementations to use managed pools.
  3. Keep the ThreadPoolManager within ThreadPools, but don’t let it manage the standard pools (WORKER_POOL, DELETE_WORKER_POOL, AuthRefreshPoolHolder.INSTANCE).

Independently of which path we take, I think we should ensure that the standard pools are lazily initialized, and that every place that uses them (via getWorkerPool, getDeleteWorkerPool, authRefreshPool) allows injecting an external pool as a substitute.

I would prefer 2 or 3. Do you think it would be reasonable? Do you have another suggestion?
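The lazy-initialization-plus-injection combination suggested above can be sketched like this (hypothetical accessor names; not the ThreadPools API): a client-supplied executor, if set before first use, substitutes for the default pool entirely, so the static pool never starts.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicReference;

public class InjectablePoolSketch {
    private static final AtomicReference<ExecutorService> INJECTED = new AtomicReference<>();
    private static ExecutorService defaultPool;  // created lazily, under the lock

    // Clients call this before first use to substitute their own pool.
    static void inject(ExecutorService pool) { INJECTED.set(pool); }

    static synchronized ExecutorService getWorkerPool() {
        ExecutorService injected = INJECTED.get();
        if (injected != null) {
            return injected;  // the default pool is never started
        }
        if (defaultPool == null) {
            defaultPool = Executors.newFixedThreadPool(4);
        }
        return defaultPool;
    }

    public static void main(String[] args) {
        ExecutorService mine = Executors.newFixedThreadPool(1);
        inject(mine);
        System.out.println("injected=" + (getWorkerPool() == mine));
        mine.shutdown();  // lifecycle stays with the client
    }
}
```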
