Core, AWS: Allow stopping thread pools manually #15312

Open
svalaskevicius wants to merge 33 commits into apache:main from svalaskevicius:control-of-static-thread-pools
Conversation


@svalaskevicius svalaskevicius commented Feb 13, 2026

Control of static thread pools — manual shutdown and lifecycle management

This change introduces a centralized ThreadPoolManager that gives users explicit control over Iceberg's thread pool lifecycle.

Motivation

The previous approach used Guava's MoreExecutors.getExitingExecutorService, which registered unremovable shutdown hooks that accumulated over time when short-lived pools were created. This change replaces it with a central manager, enabling manual shutdown and opt-out of automatic hook registration.

This is needed when the application registers its own shutdown hooks that need to commit the iceberg file upload cleanly. Because the JVM does not guarantee any order in which shutdown hooks are invoked, Iceberg's hook can kill its thread pools before the application's hook runs, leaving it unable to complete the export.
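To make the ordering problem concrete, here is a minimal sketch of the underlying JVM behaviour (not Iceberg code): hooks registered with Runtime.addShutdownHook run concurrently in no specified order, and the only portable remedy is to deregister a hook before shutdown begins and sequence the teardown explicitly.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class HookOrderSketch {
    public static void main(String[] args) {
        // Shutdown hooks are started concurrently in an unspecified order, so
        // an application hook that flushes pending commits may run after a
        // library hook has already killed the library's thread pools.
        AtomicBoolean poolsKilled = new AtomicBoolean(false);
        Thread libraryHook = new Thread(() -> poolsKilled.set(true));
        Runtime.getRuntime().addShutdownHook(libraryHook);

        // Deregistering the hook succeeds only before shutdown has begun;
        // after that, teardown must be sequenced by the application itself.
        boolean removed = Runtime.getRuntime().removeShutdownHook(libraryHook);
        System.out.println("removed=" + removed);
    }
}
```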

Intended Use Cases

  1. Stop thread pools manually to avoid leaks in hot-reload environments
  2. Opt out of the standard JVM shutdown hook mechanism to manage graceful service stops (e.g., committing last pending files before exiting).

Key Changes

  • shutdownThreadPools() — gracefully shuts down all registered pools and removes the JVM shutdown hook
  • removeShutdownHook() — opt out of automatic shutdown hooks so applications can manage their own graceful shutdown sequence (within their shutdown hooks)

Fixes issue #15039

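The proposed API shape can be sketched with a tiny self-contained manager (this is an illustration of the design, not the PR's actual implementation): pools register on creation, shutdownAll() can be invoked manually and is idempotent, and the JVM hook is removable so the application can sequence its own teardown.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PoolManagerSketch {
    private static final List<ExecutorService> POOLS = new CopyOnWriteArrayList<>();
    private static final Thread HOOK = new Thread(PoolManagerSketch::shutdownAll);
    static { Runtime.getRuntime().addShutdownHook(HOOK); }

    // Every pool created through the manager is tracked for later shutdown.
    static ExecutorService register(ExecutorService pool) {
        POOLS.add(pool);
        return pool;
    }

    // Idempotent: ExecutorService.shutdown() is safe to call repeatedly.
    static void shutdownAll() {
        POOLS.forEach(ExecutorService::shutdown);
    }

    // Opt out of the automatic hook so the app controls teardown ordering.
    static boolean removeShutdownHook() {
        return Runtime.getRuntime().removeShutdownHook(HOOK);
    }

    public static void main(String[] args) {
        ExecutorService pool = register(Executors.newFixedThreadPool(2));
        removeShutdownHook();  // the app will shut pools down itself...
        shutdownAll();         // ...e.g. inside its own shutdown hook
        System.out.println("shutdown=" + pool.isShutdown());
    }
}
```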
@svalaskevicius svalaskevicius force-pushed the control-of-static-thread-pools branch from cbb17a3 to db22fba Compare February 13, 2026 10:44
@github-actions github-actions Bot added the AWS label Feb 13, 2026
@svalaskevicius svalaskevicius marked this pull request as ready for review February 16, 2026 09:38
@svalaskevicius svalaskevicius requested a review from mxm February 16, 2026 09:38
Author

svalaskevicius commented Feb 19, 2026

@mxm

It works, but the way the shutdown is executed is quite nondeterministic. With the current code, the shutdown timeout can be reset multiple times if there are multiple calls to shutdownThreadPools().

I've updated the javadoc to word it more strictly: only call this at the end of the intended usage of the library, never before. It is also safe to call multiple times - but only after the client code is sure it is no longer using the library.

@svalaskevicius svalaskevicius requested a review from mxm February 19, 2026 11:57
Contributor

@mxm mxm left a comment


A couple more comments:

  1. Why are we not touching AuthRefreshPoolHolder?
  2. There are no tests yet.

@svalaskevicius
Author

@mxm updated, please check.

I had missed AuthRefreshPoolHolder and indeed the whole scheduled executors side - which is now added as well.

Also, what tests do you have in mind?

This is still mostly static code, and invoking the shutdown in a test affects a lot of shared state. One option would be to extract the more complicated, non-obvious logic into separate functions and test them in isolation - do you have any preferences?

I'm a bit wary of the unsettled state of this PR/task - especially with the new parallel PR out there :) Would it be possible to set out a single list of pending changes before this can be considered complete/good enough?

Thanks!

@svalaskevicius svalaskevicius force-pushed the control-of-static-thread-pools branch 2 times, most recently from 186f578 to ff4d7e0 Compare March 3, 2026 11:22
@github-actions github-actions Bot removed the AWS label Apr 24, 2026
@svalaskevicius svalaskevicius requested a review from nastra April 24, 2026 12:07
Contributor

rdblue commented Apr 27, 2026

@svalaskevicius can you please update the description with exactly what this is changing and why? I think it is a red flag that this is touching the static worker pool, but doesn't specifically call that out in the PR description and also doesn't explain why that is needed.

If this PR is about how to manage threadpools coming from factory methods in ThreadPools, then what this is trying to accomplish should be clearer.

I am also skeptical of this kind of update in general. This class was intended to hold a single threadpool, not to provide threadpools. As it grew over time, people wanted to have threadpools configured the same way. But changing the behavior of the default pools in order to make other uses of the convenience methods more generic is not the right path forward.

Contributor

@rdblue rdblue left a comment


I think that this should not touch the default threadpools and should clearly outline the motivation for these changes in the PR description.

@svalaskevicius
Author

@rdblue

Hi, thanks for the feedback. I've updated the PR description - I hope it's clearer.

Could you explain what you mean regarding the default thread pools? Having them created via MoreExecutors is exactly the problem - or do you mean to simply revert commit 1bc41cc?

Thanks

Contributor

rdblue commented Apr 28, 2026

I've updated the PR description - I hope it's clearer.

If I understand correctly from #15031, it looks like the issue is that Flink uses new classloaders and the shutdown hook is never called because the JVM never exits. So we need to use threadpools tied to Flink's lifecycle. That's reasonable, but we definitely do NOT want to give "users" control over threadpool lifecycles.

I think this implementation goes a bit beyond what we need to do. Assuming that we want to have a way to shut down these pools, that doesn't mean that we should remove the shutdown hook. The contract of newExitingWorkerPool is that it will be shut down by a hook. Why not update the shutdown hook so that it is a noop if the pool is already shut down?
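The suggestion above - keep the exiting-pool contract but make the hook harmless once the pool is already stopped - can be sketched as follows (illustrative only; ExecutorService.shutdown() is itself idempotent, so the explicit check mainly documents intent and lets a real hook skip any timed-await logic):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class NoopHookSketch {
    public static void main(String[] args) {
        ExecutorService pool = Executors.newSingleThreadExecutor();

        // The hook body only acts if nobody has shut the pool down already.
        Runnable hookBody = () -> {
            if (!pool.isShutdown()) {
                pool.shutdown();
            }
        };

        pool.shutdown();  // manual, early shutdown by the application
        hookBody.run();   // the hook later finds nothing left to do
        System.out.println("stillShutdown=" + pool.isShutdown());
    }
}
```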

Also, it's debatable whether we need to track all pools created by the factory methods here. This class is responsible for its own static pools and its methods are reused elsewhere. Having a way to close all threadpools created by this class seems very risky to me. I'm not convinced that we want to expose a method to shut down a single static threadpool directly, let alone one that will shut down pools created and managed by other classes. The reason for static pools is to ensure one is available. Allowing them to be closed allows one class to break functionality for another. Allowing all of them to be closed is even more risk.

I think the solution is not to expose these methods. Instead, I propose two things:

  1. Deprecate and remove newExitingWorkerPool and fix the 3 uses of it. HadoopFileIO can manage the lifecycle of its thread pool.
  2. Lazily create the pools managed by ThreadPools. The APIs that use the static pools support passing in an executor service with a lifecycle managed by the client. This means that a Flink application can manage its own threadpools (which is why we added the API) and shut them down appropriately. If Flink isn't using the static pools, then they will never start and we don't need a shutdown method.

I think if we do these two things then we will fix the problem and be better off. I'd also be okay with an option to prevent these pools from being started if we need more guarantees. I think that's safer and better than using them but shutting them down.
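Lazy creation of the static pools (point 2 above) is commonly done with the initialization-on-demand holder idiom; a minimal sketch, with hypothetical names, of how the pool (and any hook it registers) would only come into existence on first use:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class LazyPoolSketch {
    // The holder class is not loaded (and the pool not created) until
    // getWorkerPool() is first called; callers that pass in their own
    // executor never trigger it.
    private static class Holder {
        static final ExecutorService WORKER_POOL = Executors.newFixedThreadPool(4);
    }

    static ExecutorService getWorkerPool() {
        return Holder.WORKER_POOL;
    }

    public static void main(String[] args) {
        // Repeated calls return the same lazily created instance.
        System.out.println("same=" + (getWorkerPool() == getWorkerPool()));
        getWorkerPool().shutdown();
    }
}
```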

Contributor

pvary commented Apr 29, 2026

@rdblue

Allowing them to be closed allows one class to break functionality for another.

Technically, nothing prevents users from shutting down the static pools already, for example:

    ThreadPools.getWorkerPool().shutdown();

Today we mostly rely on reviews to prevent this, and in some cases it is not necessarily obvious. If the goal is to guarantee that users cannot shut these pools down, that could be addressed more directly (e.g., by wrapping the returned executor), but that safeguard does not exist today.
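The wrapping safeguard mentioned above could look like this (hypothetical; no such wrapper exists in ThreadPools today): the shared pool is handed out behind a delegate that rejects shutdown, so only the owner can stop the real pool.

```java
import java.util.List;
import java.util.concurrent.AbstractExecutorService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class UnstoppableWrapperSketch {
    // Delegates execution but refuses lifecycle mutation through this handle.
    static ExecutorService unstoppable(ExecutorService delegate) {
        return new AbstractExecutorService() {
            @Override public void execute(Runnable task) { delegate.execute(task); }
            @Override public void shutdown() {
                throw new UnsupportedOperationException("shared pool: shutdown not allowed");
            }
            @Override public List<Runnable> shutdownNow() {
                throw new UnsupportedOperationException("shared pool: shutdownNow not allowed");
            }
            @Override public boolean isShutdown() { return delegate.isShutdown(); }
            @Override public boolean isTerminated() { return delegate.isTerminated(); }
            @Override public boolean awaitTermination(long t, TimeUnit u) throws InterruptedException {
                return delegate.awaitTermination(t, u);
            }
        };
    }

    public static void main(String[] args) {
        ExecutorService shared = Executors.newFixedThreadPool(2);
        ExecutorService handle = unstoppable(shared);
        try {
            handle.shutdown();
        } catch (UnsupportedOperationException e) {
            System.out.println("blocked=" + !shared.isShutdown());
        }
        shared.shutdown();  // only the owner stops the real pool
    }
}
```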

Changing all usages of newExitingWorkerPool / newExitingScheduledPool would also be fairly invasive. These thread pools are used across several FileIO implementations (e.g., HadoopFileIO → newExitingWorkerPool, S3FileIO / GCSFileIO → newExitingScheduledPool, ADLSFileIO → getWorkerPool).

Do you think a better long‑term direction would be to change FileIO.initialize(Map<String, String> properties) to something like:

    FileIO.initialize(
        Map<String, String> properties,
        Function<Class<? extends FileIO>, ScheduledExecutorService> executorFactory)

and propagate this through the FileIO implementations?

In practice, FileIO instances are often created by catalogs, which suggests that we change the Catalog.initialize(String name, Map<String, String> properties) to something like:

    Catalog.initialize(
        String name,
        Map<String, String> properties,
        Function<Class<? extends FileIO>, ScheduledExecutorService> executorFactory)

and propagate this through the Catalog implementations and change the CatalogUtil.loadCatalog(String impl, String catalogName, Map<String, String> properties, Object hadoopConf) to allow setting the factory.

The FileIO implementations can't manage the lifecycle of the pools themselves, as they don't know whether they should keep the shared resources open or close them too.
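The factory shape proposed above can be sketched with a per-FileIO-class cache; the owner (catalog or application) controls the lifecycle rather than the FileIO instances. The class names here (FakeS3FileIO, the local FileIO interface) are illustrative stand-ins, not real Iceberg types.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.function.Function;

public class ExecutorFactorySketch {
    interface FileIO {}                          // stand-in for the real interface
    static class FakeS3FileIO implements FileIO {}

    public static void main(String[] args) {
        // One shared scheduled pool per FileIO class, created on first request.
        Map<Class<? extends FileIO>, ScheduledExecutorService> pools = new ConcurrentHashMap<>();
        Function<Class<? extends FileIO>, ScheduledExecutorService> factory =
            cls -> pools.computeIfAbsent(cls, c -> Executors.newScheduledThreadPool(1));

        // Every instance of the same FileIO class sees the same shared pool.
        boolean shared = factory.apply(FakeS3FileIO.class) == factory.apply(FakeS3FileIO.class);
        System.out.println("shared=" + shared);

        // The owner shuts the pools down; FileIO instances never do.
        pools.values().forEach(ExecutorService::shutdown);
    }
}
```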

@svalaskevicius
Author

NOTE: updated the description again to add a section about the non-deterministic nature of the JVM shutdown hooks and how it prevents clean iceberg file commits when the application is exiting

Contributor

rdblue commented Apr 29, 2026

@pvary, let me outline my assumptions so that my rationale is clear. First, I'm assuming that we are only worrying about the exiting pools. The other pools should have their own lifecycle management. The non-exiting pools are what I would expect FileIO instances to use, unless the pools are shared. Next, I see newExitingScheduledPool has 2 outside uses, S3FileIO and GCSFileIO, and newExitingWorkerPool has 3 outside uses in HadoopFileIO, RESTMetricsReporter, and AuthSessionCache. 5 uses is not a huge issue, but this could be a larger problem if the actual issue is that FileIO implementations don't properly manage pool lifecycles.

However, I don't think the FileIO situation has much influence over the decision here. This PR proposes that we add:

  1. A way to remove shutdown hooks from all thread pools
  2. A global shutdown for all thread pools created by ThreadPools

I think those are both poor solutions. For the first one, if you don't want a threadpool that is configured like the static pools, the right answer is to use your own threadpool and manage its lifecycle. Customizing the lifecycle of the static pools opens the possibility of misconfiguration. The reasonable way to allow using your own threadpool (and not worrying about the shutdown hooks) is to lazily start the static pools. But if you can already shut down the static pools, then we don't really need even that.

For the second one, I think it is a bad idea to have a global shutdown for the static pools, let alone all of the pools created by the factory methods. This is a drastic behavior change that can be misused. It may allow us to avoid fixing pools created by FileIO, but we should not be creating dangerous ways to work around the right solution. And if I understand correctly, the problem you're saying we need to solve is that the pools created for FileIO have inconsistently managed lifecycles.

@svalaskevicius
Author

Hi @rdblue,

Just to clarify:

Next, I see newExitingScheduledPool has 2 outside uses, S3FileIO and GCSFileIO, and newExitingWorkerPool has 3 outside uses in HadoopFileIO, RESTMetricsReporter, and AuthSessionCache.

The problem covers all thread pools currently managed by MoreExecutors - including both worker pools here.

  1. A global shutdown for all thread pools created by ThreadPools

A global shutdown exists already via MoreExecutors - it is just deferred to the JVM shutdown hook, and there is no control over when exactly it will be invoked.


I think those are both poor solutions. For the first one, if you don't want a threadpool that is configured like the static pools, the right answer is to use your own threadpool and manage its lifecycle <..>

I agree with this. If iceberg had thread pools injected via constructors (and did not rely on the global JVM shutdown hooks) it would address the issue cleanly. At this point, however, I was looking for a pragmatic solution. We have an application that uses the iceberg library and cannot commit/flush files on exit, because iceberg has already killed its threads and produces a lot of errors about the ThreadPools having been stopped.

Given that, and that there is an open issue related to thread stopping, my reasoning was that these can be combined to a single change to solve both problems - this is the resulting PR.

I suppose the question is - is this an acceptable interim solution until the threadpool (factory?) injection is implemented, or should this PR be closed and converted to an open issue (for the uncontrolled/non-deterministic shutdown behaviour)?

Do you have any thoughts as to when the proper fix could be expected?

Regards,
Sarunas

Contributor

pvary commented Apr 30, 2026

@rdblue: Thanks for the detailed reply. This helps clarify where our views differ.

First, I'm assuming that we are only worrying about the exiting pools.

Agreed.

The other pools should have their own lifecycle management.

Fully agree.

The non-exiting pools are what I would expect FileIO instances to use, unless the pools are shared.

In practice, all FileIO implementations currently rely on static exiting worker pools. The pattern is that these pools are created on first use and then shared across instances. That approach has two advantages:

  • The threads are used only occasionally, so sharing them keeps the overall thread footprint small.
  • The pool size naturally bounds the level of outgoing parallelism.

This PR proposes that we add:

  • A way to remove shutdown hooks from all thread pools
  • A global shutdown for all thread pools created by ThreadPools

The naming may indeed need improvement, but the proposed shutdown only targets the exiting pools created by newExitingWorkerPool and newExitingScheduledPool. These are exactly the pools where we currently lack direct control over shutdown.

the problem you're saying we need to solve is that the pools created for FileIO have inconsistently managed lifecycles.

From my perspective, the pattern across FileIO implementations is actually quite consistent: they require a global shared pool per FileIO type, reused across instances and normally closed only on exit. The issue is that “exit” is not always equivalent to a JVM shutdown. In cases like hot replacement, we need a way to release those resources even while the JVM keeps running.

Based on this discussion, I see a few possible directions:

  1. Adopt the previously mentioned (but somewhat invasive) change to the FileIO API.
  2. Introduce a ThreadPoolManager outside of ThreadPools and migrate FileIO implementations to use managed pools.
  3. Keep the ThreadPoolManager within ThreadPools, but don’t let it manage the standard pools (WORKER_POOL, DELETE_WORKER_POOL, AuthRefreshPoolHolder.INSTANCE).

Independently of which path we take, I think we should ensure that the standard pools are lazily initialized, and that every place that uses them (via getWorkerPool, getDeleteWorkerPool, authRefreshPool) allows injecting an external pool as a substitute.

I would prefer 2 or 3. Do you think it would be reasonable? Do you have another suggestion?
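The lazy-initialization-plus-injection combination suggested above can be sketched like this (hypothetical accessor names; not the ThreadPools API): a client-supplied executor, if set before first use, substitutes for the default pool entirely, so the static pool never starts.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicReference;

public class InjectablePoolSketch {
    private static final AtomicReference<ExecutorService> INJECTED = new AtomicReference<>();
    private static ExecutorService defaultPool;  // created lazily, under the lock

    // Clients call this before first use to substitute their own pool.
    static void inject(ExecutorService pool) { INJECTED.set(pool); }

    static synchronized ExecutorService getWorkerPool() {
        ExecutorService injected = INJECTED.get();
        if (injected != null) {
            return injected;  // the default pool is never started
        }
        if (defaultPool == null) {
            defaultPool = Executors.newFixedThreadPool(4);
        }
        return defaultPool;
    }

    public static void main(String[] args) {
        ExecutorService mine = Executors.newFixedThreadPool(1);
        inject(mine);
        System.out.println("injected=" + (getWorkerPool() == mine));
        mine.shutdown();  // lifecycle stays with the client
    }
}
```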
