I'm observing that after loading a model with tensorizer, CPU memory usage stays elevated, and the amount retained depends on how many readers I use.
It seems there's a memory pool or buffer that isn't freed when exiting the TensorDeserializer context. Is there something else I should be doing to clean up after using a TensorDeserializer?
Sample code, very similar to what's in the README (imports and setup filled in; `model` and `tensorized_model_path` are defined earlier):

```python
import time

import torch

from tensorizer import TensorDeserializer
from tensorizer.utils import convert_bytes, get_mem_usage, no_init_or_tensor

num_readers = 1
materialization_device = torch.device("cuda")

before_mem = get_mem_usage()
start = time.perf_counter()
with TensorDeserializer(
    tensorized_model_path, num_readers=num_readers, device=materialization_device
) as deserializer:
    deserializer.load_into_module(model)
end = time.perf_counter()

total_bytes_str = convert_bytes(deserializer.total_tensor_bytes)
duration = end - start
per_second = convert_bytes(deserializer.total_tensor_bytes / duration)
after_mem = get_mem_usage()
print(f"Deserialized {total_bytes_str} in {duration:0.2f}s, {per_second}/s")
print(f"Memory usage before: {before_mem}")
print(f"Memory usage after: {after_mem}")
print(f"Model device is {model.device}")
```
Operating conditions:
- 13.8 GB model, tensorized with v2.9.0
- Running with v2.10.0 (same behaviour with v2.9.0)
- Model is on disk (EBS volume); we usually read straight from S3, but the behaviour occurs either way
- Running on a g5.xlarge instance
And here's the output for num_readers = 1, 10, and 15 respectively, as a table. Notice that memory usage increases with num_readers, and that it is sustained after exiting the TensorDeserializer context.
| Readers | Deserialization Time | Throughput | CPU Memory Before (maxrss / F) | GPU Memory Before (U / F / T) | TORCH Before (R / A) | CPU Memory After (maxrss / F) | GPU Memory After (U / F / T) | TORCH After (R / A) | Model Device |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 3.90s | 3.5 GB/s | 1,569MiB / 3,242MiB | 256MiB / 22,342MiB / 22,598MiB | 0MiB / 0MiB | 2,598MiB / 2,176MiB | 13,434MiB / 9,164MiB / 22,598MiB | 13,156MiB / 13,152MiB | cuda:0 |
| 10 | 2.77s | 5.0 GB/s | 1,568MiB / 3,235MiB | 256MiB / 22,342MiB / 22,598MiB | 0MiB / 0MiB | 3,751MiB / 995MiB | 13,454MiB / 9,144MiB / 22,598MiB | 13,174MiB / 13,152MiB | cuda:0 |
| 15 | 3.26s | 4.2 GB/s | 1,567MiB / 3,245MiB | 256MiB / 22,342MiB / 22,598MiB | 0MiB / 0MiB | 4,391MiB / 318MiB | 13,462MiB / 9,136MiB / 22,598MiB | 13,182MiB / 13,152MiB | cuda:0 |
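A note on how I read these numbers: I believe the maxrss column is a lifetime high-water mark (along the lines of `resource.getrusage`, sketched below, though I haven't checked how `get_mem_usage` computes it internally), so maxrss alone could never shrink after the buffers were freed. It's the fact that the free-memory column (F) also drops, and drops more with more readers, that makes me think the memory is genuinely retained.

```python
import resource

# ru_maxrss is the peak resident set size over the process lifetime
# (reported in KiB on Linux), so it is monotonically non-decreasing:
# it records that memory was used at some point, not that it is still held.
peak_kib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"peak RSS: {peak_kib / 1024:.0f} MiB")
```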
And the raw output:

```
1 reader:
Deserialized 13.8 GB in 3.90s, 3.5 GB/s
Memory usage before: CPU: (maxrss: 1,569MiB F: 3,242MiB) GPU: (U: 256MiB F: 22,342MiB T: 22,598MiB) TORCH: (R: 0MiB/0MiB, A: 0MiB/0MiB)
Memory usage after: CPU: (maxrss: 2,598MiB F: 2,176MiB) GPU: (U: 13,434MiB F: 9,164MiB T: 22,598MiB) TORCH: (R: 13,156MiB/13,156MiB, A: 13,152MiB/13,152MiB)
Model device is cuda:0

10 readers:
Deserialized 13.8 GB in 2.77s, 5.0 GB/s
Memory usage before: CPU: (maxrss: 1,568MiB F: 3,235MiB) GPU: (U: 256MiB F: 22,342MiB T: 22,598MiB) TORCH: (R: 0MiB/0MiB, A: 0MiB/0MiB)
Memory usage after: CPU: (maxrss: 3,751MiB F: 995MiB) GPU: (U: 13,454MiB F: 9,144MiB T: 22,598MiB) TORCH: (R: 13,174MiB/13,174MiB, A: 13,152MiB/13,152MiB)
Model device is cuda:0

15 readers:
Deserialized 13.8 GB in 3.26s, 4.2 GB/s
Memory usage before: CPU: (maxrss: 1,567MiB F: 3,245MiB) GPU: (U: 256MiB F: 22,342MiB T: 22,598MiB) TORCH: (R: 0MiB/0MiB, A: 0MiB/0MiB)
Memory usage after: CPU: (maxrss: 4,391MiB F: 318MiB) GPU: (U: 13,462MiB F: 9,136MiB T: 22,598MiB) TORCH: (R: 13,182MiB/13,182MiB, A: 13,152MiB/13,152MiB)
Model device is cuda:0
```
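For completeness, here's the kind of explicit cleanup I could add after the `with` block, to rule out ordinary garbage-collection or glibc allocator caching as the explanation. This is just a diagnostic sketch, not something tensorizer documents as required, as far as I can tell:

```python
import ctypes
import gc

# Force a full garbage-collection pass in case the deserializer's buffers
# are only kept alive by uncollected reference cycles.
gc.collect()

# Ask glibc to return freed heap pages to the OS. If resident memory drops
# after this, the "leak" was just allocator caching, not a live reference.
try:
    libc = ctypes.CDLL("libc.so.6")
    libc.malloc_trim(0)
except OSError:
    pass  # non-glibc platform; malloc_trim is unavailable
```

If memory usage stays elevated even after this, that would point at a buffer pool being intentionally retained somewhere.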