[sysvabi64] Add chapter on Thread Local Storage #311

Open
smithp35 wants to merge 13 commits into ARM-software:main from smithp35:sysvabitls

Conversation

@smithp35
Contributor

The thread local storage chapter contains:

  • A description of Thread Local Storage based on addenda32.
  • The key design decisions of AArch64 TLS, such as TLS variant, TLS dialect and TCB size.
  • The ABI-required code sequence for TLSDESC, which must be emitted exactly because GNU ld requires it.
  • Sequences for the different code models.
  • Relaxations for GD->IE, GD->LE and IE->LE.
  • Synchronization requirements for lazy TLSDESC, with advice not to support it because of the synchronization overhead.

and ``PT_TLS`` as the program header with type PT_TLS. ``PAD`` must be
the smallest positive integer that satisfies the following congruence:

``TP + TCB + PAD ≡ PT_TLS.p_vaddr (modulo PT_TLS.p_align)``

Contributor

TP+TCB+PAD on the left could be confusing, as TCB is placed before TP. Perhaps mention the requirement of TP first (= 0 (modulo p_align)), then describe PAD and this formula.

Contributor Author

I'll see if I can word it better. I've found it difficult to try and explain the formula intuitively.
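
One possible ordering of the explanation, following the suggestion above to state the thread pointer requirement first. This is only a sketch, not the chapter's wording. The 16-byte TCB size is taken from elsewhere in the chapter, and the `p_vaddr` and `p_align` values in the example are made up for illustration.

```latex
\begin{align*}
TP &\equiv 0 \pmod{p\_align} \\
TP + TCB + PAD &\equiv p\_vaddr \pmod{p\_align} \\
\Rightarrow\quad PAD &\equiv p\_vaddr - TCB \pmod{p\_align}
\end{align*}
% Illustrative values: TCB = 16, p_vaddr = 0x10040 = 65600, p_align = 64
% PAD = (65600 - 16) mod 64 = 65584 mod 64 = 48
% Check: TCB + PAD = 64 \equiv 0 \equiv 65600 \pmod{64}
```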

add xn, tp, :tprel_hi12:var, lsl #12 // R_AARCH64_TLSLE_ADD_TPREL_HI12 var
ldr xn, [xn, #:tprel_lo12_nc:var] // R_AARCH64_TLSLE_LDST64_TPREL_LO12_NC var

Static link time TLS Relaxations

Contributor

Perhaps call this Optimization to be consistent with x86/ppc and "Relocation optimization" (ADRP) and leave the term "relocation relaxation" for RISC-V style section shrinking.

Contributor Author

For TLS specifically I'd prefer to keep relaxation, as that's what it's been called in all the previous literature, such as Drepper's ELF Handling for Thread Local Storage and the TLSDESC paper. It should help people searching in the references.

I take the point that it ought to have been called optimization. I'll add a sentence to say that we're using relaxation as a term from the existing literature.
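
For readers new to the term, the three relaxations listed in the PR description (GD->IE, GD->LE and IE->LE) can be sketched as a small decision function. This is an illustrative model, not text from the chapter: the function name `relax_tls_model` and its two boolean inputs are hypothetical, and it assumes relaxation is only attempted when linking an executable, with Local Exec additionally requiring the variable to be defined in the executable.

```c
#include <stdbool.h>

/* TLS access models, most general first. */
typedef enum { TLS_GD, TLS_LD, TLS_IE, TLS_LE } tls_model;

/*
 * Hypothetical helper: choose the model a static linker could relax to.
 * Assumption: relaxation is only attempted when linking an executable,
 * and Local Exec needs the variable to be defined in the executable.
 */
tls_model relax_tls_model(tls_model from, bool linking_executable,
                          bool defined_in_executable)
{
    if (!linking_executable)
        return from;                 /* shared objects keep the original model */
    if (from == TLS_GD)              /* GD->LE or GD->IE */
        return defined_in_executable ? TLS_LE : TLS_IE;
    if (from == TLS_IE && defined_in_executable)
        return TLS_LE;               /* IE->LE */
    return from;                     /* LD and LE are left unchanged in this sketch */
}
```

Only the three relaxations named in the PR description are modelled; Local Dynamic is deliberately passed through untouched.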

Contributor Author

@smithp35 left a comment

Thanks very much for the review.

I've updated based on this and some comments I received internally.


AArch64 TLS SystemV design choices

* AArch64 uses variant 1 TLS as described in ELFTLS_.

Member

Perhaps mention ELFTLS in the TLS introduction as a resource for more in-depth information.

Contributor Author

ACK. I'll mention that the introduction in the ABI is only sufficient to describe the terms used, such as Thread Control Block. A general introduction can be found in ELFTLS_.

Contributor Author

@smithp35 left a comment

Thanks very much for the comments. I'll hopefully have a new patch tomorrow.


knows that the TLS variable is defined in the same module as the code
that is accessing the variable. In this case the offset of the TLS
variable from the start of the module's TLS block is a static link
time constant. Instead of dynamically calculating the offset of the

Contributor Author

ACK

Contributor Author

@smithp35 left a comment

Thanks for the review comments. I've made the following updates:

  • Simple NFC text changes.
  • Better description of deferred TLS and generation count.
  • Reworded the padding size derivation.
  • Moved paragraphs around to make the flow a bit easier.

These should be visible as 4 separate commits.

* (Most local) Automatic data (stack variables, instanced once per function
activation, per thread).

Rules governing thread local storage on AArch64

Contributor

This section should probably be named "Scope", and the part about thread_local and __thread should probably go elsewhere.

Contributor Author

Will have a think to see where best to split out the source parts.

Contributor Author

I've decided to remove the source parts as they are out of scope, and everyone knows what they are anyway.

return dtv[module_id][offset];
}

The calculation in __tls_get_addr is the most general and it can be

Contributor

nit:

s/__tls_get_addr/``__tls_get_addr``

Contributor Author

ACK

thread's DTV is updated, and the TLS for the ``module_id`` is
allocated if it is not present.

In pseudo code

Contributor

I think this pseudo code listing is not necessary since we already have the description of the operation above. If both are retained, the code and the description may diverge over time.

Contributor Author

OK, I wanted to include the pseudo code in case the description wasn't good enough, but if it is then I can remove it. There are other sources to find pseudo code.
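
For reference while reviewing the description, here is a minimal C sketch of the behaviour being discussed: the thread's DTV is brought up to date, the module's TLS is allocated on first use, and the address is then `dtv[module_id]` plus the offset. Everything here is a toy model under stated assumptions. The names `toy_dtv`, `tls_module`, `modules`, `global_generation` and `toy_tls_get_addr` are hypothetical, the DTV has a fixed size, and a real dynamic linker would also resize the DTV and release stale blocks when the generation changes.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define MAX_MODULES 8

/* Hypothetical description of one module's TLS initialization image. */
typedef struct {
    const void *init;      /* initialized data (may be NULL if init_size is 0) */
    size_t      init_size; /* bytes to copy from init */
    size_t      size;      /* total TLS block size, including zero-filled part */
} tls_module;

/* Toy dynamic thread vector: one per thread. */
typedef struct {
    size_t generation;            /* generation this DTV was last updated for */
    void  *block[MAX_MODULES];    /* per-module TLS block, NULL until first use */
} toy_dtv;

tls_module modules[MAX_MODULES];  /* filled in as modules are loaded */
size_t     global_generation;     /* bumped whenever the module set changes */

void *toy_tls_get_addr(toy_dtv *dtv, size_t module_id, size_t offset)
{
    /* Bring the DTV up to date. A real implementation would resize it here. */
    if (dtv->generation != global_generation)
        dtv->generation = global_generation;

    /* Deferred allocation: create the module's TLS block on first access. */
    if (dtv->block[module_id] == NULL) {
        const tls_module *m = &modules[module_id];
        char *p = calloc(1, m->size);
        if (p == NULL)
            abort();
        if (m->init_size != 0)
            memcpy(p, m->init, m->init_size);
        dtv->block[module_id] = p;
    }
    return (char *)dtv->block[module_id] + offset;
}
```

Each thread passes its own `toy_dtv`, so two threads reading the same `module_id` and `offset` get distinct addresses that both start from the module's initialization image.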

}

The calculation in __tls_get_addr is the most general and it can be
applied to both static and dynamic TLS. There are four defined models

Contributor

There are four defined models

This should probably start a new paragraph, and probably a new section.

Contributor Author

ACK

The calculation in __tls_get_addr is the most general and it can be
applied to both static and dynamic TLS. There are four defined models
of accessing TLS that trade off generality for performance. In order
of descending generality:

Contributor

descending generality

nit: maybe "descending" is better?

Contributor Author

Will have a think. Perhaps "In descending order of generality:".

4. Local Exec, can be used in the executable for TLS variables
defined in the executables static TLS block.

SystemV AArch64 TLS addressing

Contributor

SystemV AArch64 TLS addressing

The title of this section is very similar to the previous one. I think it makes it a bit difficult to navigate this chapter.

Contributor Author

Will have a think to see if I can find a better one.

only the descriptor dialect as this is the default dialect for GCC
and the only dialect supported by clang.

* The thread pointer (TP) is always accessible via the ``TPIDR_EL0``

Contributor

The thread pointer (TP)

nit:

s/TP/``TP``

as TP is used like inline code below.

Contributor Author

ACK

(``PADsize``) between the TCB and the executable's TLS Block. Using
``TCBsize`` as the size of the TCB (16 bytes), the following expression can be used to calculate ``PADsize`` from the ``PT_TLS`` program header.

``PADsize = (PT_TLS.p_vaddr - TCBsize) mod PT_TLS.p_align``.

Contributor

style: add a .. code-block:: instead

Contributor

Same for other formulae in this section. I think it will make reading easier.

Contributor Author

ACK
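
As a quick sanity check of the ``PADsize`` expression quoted in this thread, it can be evaluated directly in C. This is a sketch, not part of the chapter: the function name `tls_pad_size` is hypothetical, and it assumes `p_align` is a power of two (as ELF requires for alignment values greater than 1), which is what makes the unsigned wrap-around for small `p_vaddr` values harmless.

```c
#include <stdint.h>

#define TCB_SIZE 16u  /* TCBsize as stated in the quoted text */

/*
 * PADsize = (PT_TLS.p_vaddr - TCBsize) mod PT_TLS.p_align
 * Assumption: p_align is 0, 1, or a power of two. With a power-of-two
 * alignment, unsigned wrap-around still yields the mathematical modulus.
 */
uint64_t tls_pad_size(uint64_t p_vaddr, uint64_t p_align)
{
    if (p_align <= 1)
        return 0;  /* no alignment constraint, so no padding is needed */
    return (p_vaddr - TCB_SIZE) % p_align;
}
```

For example, with the illustrative values `p_vaddr = 0x10040` and `p_align = 64` the result is 48, so the TLS block starts 64 bytes after the thread pointer and has the same residue modulo 64 as `p_vaddr`.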

resolver function.

The static relocations with a prefix of ``R_AARCH64_TLSDESC_``
targeting TLS symbol ``var``, instruct the static linker to create a

Contributor

nit: superfluous comma after var?

Contributor Author

ACK


.. code-block:: asm

adrp xn, :gottprel: var // R_AARCH64_TLSIE_ADR_GOTTPREL_PAGE21 var

Contributor

nit: superfluous space before var

Contributor Author

ACK

Contributor

@yury-khrustalev left a comment

I've added a few minor comments. Otherwise this LGTM and should be approved and merged after the small fixes requested in the comments. I think this chapter is very useful and provides a good level of detail suitable for this ABI documentation.

Contributor Author

@smithp35 left a comment

Thanks for the comments. I'll prepare a new patch, most likely on Wednesday.

The thread local storage chapter contains:
* A description of Thread Local Storage based on addenda32
* The key design decisions of AArch64 TLS such as tls variant,
  tls dialect, TCB size.
* The ABI required code sequence for TLSDESC that must be emitted
  exactly, as GNU ld requires it to be.
* Sequences for the different code-models.
* Relaxations for GD->IE, GD->LE and IE->LE.
* Synchronization requirements for Lazy TLSDESC. With advice not
  to support it due to overhead of synchronization.
* Edits to split up the bullet points in How to denote TLS
  in source.
* Changed program-own state to process-state as the thread-id
  may not be stored separately from the programs data.
* Removed typically from some of the descriptions as the typically
  will almost always be the case for a sysvabi platform.
* Linked alignment padding to the definition.
* Provided a bit more information about generation counters.
smithp35 added 10 commits March 25, 2026 09:48
* Rearranged formulas and used TCBsize to make it clearer.
* Taken out "significant" from a significant number of dynamic
  linkers.
* Give reason for using relaxation rather than optimization.
* Clarify that there is no requirement to implement any TLSDESC
  resolver given in the sysvabi.
Change the input register in add xn, xn, :tprel_hi12:var, lsl #12
to the thread pointer tp. We want to calculate the offset from the
thread pointer so it needs to be an input of the add.
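
To illustrate why `tp` has to be the input of the `add`, the two-instruction Local Exec sequence can be modelled in C. This is a sketch under assumptions: the names `tprel_parts`, `split_tprel` and `local_exec_address` are hypothetical, and it assumes the thread-pointer-relative offset fits in 24 bits, with the `hi12` relocation taking bits [23:12] and the `lo12_nc` relocation taking bits [11:0].

```c
#include <stdint.h>

/* The two immediates the relocations would fill in (hypothetical type). */
typedef struct {
    uint32_t hi12;  /* bits [23:12] of the TP-relative offset */
    uint32_t lo12;  /* bits [11:0] of the TP-relative offset */
} tprel_parts;

/* Assumption: tprel is below 2^24, so the two 12-bit fields cover it. */
tprel_parts split_tprel(uint32_t tprel)
{
    tprel_parts p = { (tprel >> 12) & 0xfffu, tprel & 0xfffu };
    return p;
}

/* Address that the ldr in the sequence ends up accessing. */
uint64_t local_exec_address(uint64_t tp, uint32_t tprel)
{
    tprel_parts p = split_tprel(tprel);
    /* add xn, tp, :tprel_hi12:var, lsl #12 */
    uint64_t xn = tp + ((uint64_t)p.hi12 << 12);
    /* ldr xn, [xn, #:tprel_lo12_nc:var] reads from xn + lo12 */
    return xn + p.lo12;
}
```

The result equals `tp + tprel` only because the first instruction starts from `tp`. Starting from an arbitrary `xn` would add the offset to whatever value the register happened to hold.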
Document the decision in the GCC mailing list thread
TLSDESC clobber ABI stability/futureproofness?
https://gcc.gnu.org/legacy-ml/gcc/2018-10/msg00112.html

TLSDESC resolver functions assume that any registers added
by an extension are caller saved for a TLSDESC call.

A brief summary:

Dynamic TLS may be lazy allocated upon the first use of a TLSDESC
resolver. This may involve calls to heap allocation functions
provided by the user, which may use registers from extensions
like SVE and SME. As the resolver function can't know what is
saved it would have to save all SVE and SME state. This would
be way more expensive than a caller save, and an older libc
written prior to the introduction of the extension would be
unaware of them so the caller has to do the save.

* The SVE and SME state is already
Include a pseudo code description of __tls_get_addr with deferred
TLS for dynamic modules.

Use integers modulo m to avoid excess use of (modulo m).
Explain the congruence symbol.
Put expression first so derivation is optional.

The TLSDESC resolver functions are not ABI so we can move them
out of the sysvabi64 document. Providing some examples that can
be used by a dynamic linker is still useful so move this to the
design documents section.

Add a comment about DTV surplus TLS that permits a dynamic loader
to dlopen a DSO with initial-exec TLS. There can be a small
number of performance critical shared-libraries that use initial
exec TLS, but are expected to be opened via dlopen, particularly
by scripting languages like python.
* Added `` `` to some variables.
* Added some more section headings.
* Used code-blocks for formula.
* Fixed reference to design document.
@smithp35
Contributor Author

Made changes to address review comments and rebase.

Previous review comments changed name of a section.

4 participants