Thanks for the great work on this project! The issue is illustrated in the following line in the SAC trainer:
https://github.com/rail-berkeley/rlkit/blob/master/rlkit/torch/sac/sac.py#L161
In the SAC trainer, it seems as though `log_alpha` is being trained as though it were `alpha`, and an exponentiation is then performed when computing the soft policy & critic losses. Initially I thought that `log_alpha` was being used for numerical-precision reasons, but the loss function for `alpha` is not modified to work in log space. So what appears to be happening is that `log_alpha` really is `alpha`, and `exp(alpha)` is being used in place of `alpha` in the soft policy & critic objectives.
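To make the mismatch concrete, here is a minimal sketch with illustrative numbers (the identifiers mirror rlkit's, but this code is not from the repo) comparing the gradient of the alpha loss as written against the gradient of a version rewritten consistently in log space:

```python
import math

# Illustrative stand-ins for a batch of policy log-probabilities and the
# target entropy; values are arbitrary, chosen only to make the point.
log_pi = [-1.2, -0.8, -1.5]
target_entropy = -1.0
log_alpha = 0.5  # the single learned scalar

# rlkit's alpha loss treats log_alpha as the dual variable itself:
#   L(log_alpha) = -mean(log_alpha * (log_pi + target_entropy))
# so dL/d(log_alpha) = -mean(log_pi + target_entropy), a constant.
grad_rlkit = -sum(lp + target_entropy for lp in log_pi) / len(log_pi)

# If the loss were instead written consistently in log space,
#   L(log_alpha) = -mean(exp(log_alpha) * (log_pi + target_entropy)),
# the gradient picks up a factor of exp(log_alpha):
grad_log_space = math.exp(log_alpha) * grad_rlkit

# The two gradients differ only by the positive factor exp(log_alpha),
# so the update direction is the same but the effective step size is not.
```

Since `exp(log_alpha) > 0`, the descent direction matches, which may be why the discrepancy goes unnoticed in practice; but the updates are no longer those prescribed by the dual-descent derivation.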
I have not seen any discussion of this choice in the literature. If my understanding is correct, it technically breaks the soft policy iteration and dual-descent theory in the original SAC papers (specifically the follow-up paper that introduced the automatic entropy-tuning version of SAC).
It would be great if there could be some clarification as to why this choice was made or, at the very least, a comment/note saying that this deviates from the original (auto-entropy tuning) version of SAC.