I'm observing that after loading a model with tensorizer, CPU memory usage stays elevated, and the amount retained depends on how many readers I use.
It seems there's a memory pool or buffer that isn't freed when exiting the TensorDeserializer context. Is there something else I should be doing to clean up after using a TensorDeserializer?
Sample code, very similar to what's in the README (imports and setup filled in; `model` and `tensorized_model_path` are defined earlier):

```python
import time

import torch

from tensorizer import TensorDeserializer
from tensorizer.utils import convert_bytes, get_mem_usage, no_init_or_tensor

num_readers = 1
materialization_device = torch.device("cuda")

before_mem = get_mem_usage()
start = time.perf_counter()
with TensorDeserializer(
    tensorized_model_path, num_readers=num_readers, device=materialization_device
) as deserializer:
    deserializer.load_into_module(model)
end = time.perf_counter()

total_bytes_str = convert_bytes(deserializer.total_tensor_bytes)
duration = end - start
per_second = convert_bytes(deserializer.total_tensor_bytes / duration)
after_mem = get_mem_usage()
print(f"Deserialized {total_bytes_str} in {duration:0.2f}s, {per_second}/s")
print(f"Memory usage before: {before_mem}")
print(f"Memory usage after: {after_mem}")
print(f"Model device is {model.device}")
```
Operating conditions:
- 13.8 GB model, tensorized with v2.9.0
- Running with v2.10.0 (same behaviour with v2.9.0)
- Model is on disk (EBS volume); we usually read straight from S3, but the behaviour occurs either way
- Running on a g5.xlarge instance
And here's the output for num_readers = 1, 10, and 15 respectively, as a table. Notice that memory usage increases with num_readers, and that it is sustained after exiting the TensorDeserializer context.
| Readers | Deserialization Time | Throughput | CPU Memory Before (maxrss / F) | GPU Memory Before (U / F / T) | TORCH Before (R / A) | CPU Memory After (maxrss / F) | GPU Memory After (U / F / T) | TORCH After (R / A) | Model Device |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 3.90s | 3.5 GB/s | 1,569MiB / 3,242MiB | 256MiB / 22,342MiB / 22,598MiB | 0MiB / 0MiB | 2,598MiB / 2,176MiB | 13,434MiB / 9,164MiB / 22,598MiB | 13,156MiB / 13,152MiB | cuda:0 |
| 10 | 2.77s | 5.0 GB/s | 1,568MiB / 3,235MiB | 256MiB / 22,342MiB / 22,598MiB | 0MiB / 0MiB | 3,751MiB / 995MiB | 13,454MiB / 9,144MiB / 22,598MiB | 13,174MiB / 13,152MiB | cuda:0 |
| 15 | 3.26s | 4.2 GB/s | 1,567MiB / 3,245MiB | 256MiB / 22,342MiB / 22,598MiB | 0MiB / 0MiB | 4,391MiB / 318MiB | 13,462MiB / 9,136MiB / 22,598MiB | 13,182MiB / 13,152MiB | cuda:0 |
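A note on how I read these numbers: I believe the maxrss column is a lifetime high-water mark (along the lines of `resource.getrusage`, sketched below, though I haven't checked how `get_mem_usage` computes it internally), so maxrss alone could never shrink after the buffers were freed. It's the fact that the free-memory column (F) also drops, and drops more with more readers, that makes me think the memory is genuinely retained.

```python
import resource

# ru_maxrss is the peak resident set size over the process lifetime
# (reported in KiB on Linux), so it is monotonically non-decreasing:
# it records that memory was used at some point, not that it is still held.
peak_kib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"peak RSS: {peak_kib / 1024:.0f} MiB")
```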
And the raw output:

```
1 reader:
Deserialized 13.8 GB in 3.90s, 3.5 GB/s
Memory usage before: CPU: (maxrss: 1,569MiB F: 3,242MiB) GPU: (U: 256MiB F: 22,342MiB T: 22,598MiB) TORCH: (R: 0MiB/0MiB, A: 0MiB/0MiB)
Memory usage after: CPU: (maxrss: 2,598MiB F: 2,176MiB) GPU: (U: 13,434MiB F: 9,164MiB T: 22,598MiB) TORCH: (R: 13,156MiB/13,156MiB, A: 13,152MiB/13,152MiB)
Model device is cuda:0

10 readers:
Deserialized 13.8 GB in 2.77s, 5.0 GB/s
Memory usage before: CPU: (maxrss: 1,568MiB F: 3,235MiB) GPU: (U: 256MiB F: 22,342MiB T: 22,598MiB) TORCH: (R: 0MiB/0MiB, A: 0MiB/0MiB)
Memory usage after: CPU: (maxrss: 3,751MiB F: 995MiB) GPU: (U: 13,454MiB F: 9,144MiB T: 22,598MiB) TORCH: (R: 13,174MiB/13,174MiB, A: 13,152MiB/13,152MiB)
Model device is cuda:0

15 readers:
Deserialized 13.8 GB in 3.26s, 4.2 GB/s
Memory usage before: CPU: (maxrss: 1,567MiB F: 3,245MiB) GPU: (U: 256MiB F: 22,342MiB T: 22,598MiB) TORCH: (R: 0MiB/0MiB, A: 0MiB/0MiB)
Memory usage after: CPU: (maxrss: 4,391MiB F: 318MiB) GPU: (U: 13,462MiB F: 9,136MiB T: 22,598MiB) TORCH: (R: 13,182MiB/13,182MiB, A: 13,152MiB/13,152MiB)
Model device is cuda:0
```
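For completeness, here's the kind of explicit cleanup I could add after the `with` block, to rule out ordinary garbage-collection or glibc allocator caching as the explanation. This is just a diagnostic sketch, not something tensorizer documents as required, as far as I can tell:

```python
import ctypes
import gc

# Force a full garbage-collection pass in case the deserializer's buffers
# are only kept alive by uncollected reference cycles.
gc.collect()

# Ask glibc to return freed heap pages to the OS. If resident memory drops
# after this, the "leak" was just allocator caching, not a live reference.
try:
    libc = ctypes.CDLL("libc.so.6")
    libc.malloc_trim(0)
except OSError:
    pass  # non-glibc platform; malloc_trim is unavailable
```

If memory usage stays elevated even after this, that would point at a buffer pool being intentionally retained somewhere.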