Skip to content

feat(units): add an expression parser#173

Open
HaoZeke wants to merge 7 commits intometatensor:mainfrom
HaoZeke:feat/unit-expression-parser
Open

feat(units): add an expression parser#173
HaoZeke wants to merge 7 commits intometatensor:mainfrom
HaoZeke:feat/unit-expression-parser

Conversation

@HaoZeke
Copy link
Member

@HaoZeke HaoZeke commented Mar 4, 2026

Closes #154.

Replaced the per-quantity lookup tables with a Shunting-Yard expression parser
that works on arbitrary compound unit strings in the spirit of lumol.

"kJ/mol/A^2"  -->  tokenize  -->  shunting-yard  -->  AST  -->  eval
                   [kJ,/,mol,     [kJ,mol,/,         tree     {factor, dim}
                    /,A,^,2]       A,2,^,/]

Each token resolves to an SI conversion factor and a 5-element dimension vector
[L, T, M, Q, Theta]. The parser composes these through multiplication, division,
and exponentiation. Conversion factor between two expressions = ratio of their
SI factors after verifying dimension equality.

API changes

| Before (3-arg)                                  | After (2-arg)                                    |
|-------------------------------------------------+--------------------------------------------------|
| ~unit_conversion_factor("energy", "eV", "meV")~   | ~unit_conversion_factor("eV", "meV")~              |
| ~unit_conversion_factor("force", "eV/A", "eV/A")~ | ~unit_conversion_factor("eV/A", "eV/A")~           |
| Not possible                                    | ~unit_conversion_factor("(eV*u)^(1/2)", "u*A/fs")~ |

Expression syntax

Operators: * (multiply), / (divide), ^ (power), () (grouping).
Whitespace ignored. Case-insensitive. Numeric literals allowed in exponents.
Fractional exponents via parenthesized division: ^(1/2).

Token table

Single flat unordered_map with 30+ entries covering length (angstrom, bohr, nm,
m, cm, mm, um), energy (eV, meV, hartree, ry, joule, kcal, kJ), time (fs, ps),
mass (u, kg, g, electronmass), charge (e, coulomb), dimensionless (mol), and
derived (hbar).

Notes

kelvin is NOT in the token table because temperature conversions between
offset-based scales (Celsius, Fahrenheit) are non-multiplicative.
DIM_TEMPERATURE exists as dimension [0,0,0,0,1] for potential future use but
no tokens currently carry it. (maybe once we do an API break, can revisit during mini-metatomic)

Contributor (creator of pull-request) checklist

  • Tests updated (for new features and bugfixes)?
  • Documentation updated (for new features)?
  • Issue referenced (for PRs that solve an issue)?

Reviewer checklist

  • CHANGELOG updated with public API or any other important changes?

@HaoZeke HaoZeke requested review from GardevoirX and Luthaf March 4, 2026 11:10
@HaoZeke HaoZeke force-pushed the feat/unit-expression-parser branch 3 times, most recently from 92bf24c to 0fe0ef7 Compare March 4, 2026 12:05
@GardevoirX
Copy link
Contributor

Thanks a lot! I think it would be better if we can use the new functionality to check if the quantity and unit match, when initializing ModelOutput here
https://github.com/HaoZeke/metatomic/blob/0fe0ef73e82b089811278472acba56267578b80c/metatomic-torch/include/metatomic/torch/model.hpp#L48-L61

HaoZeke added a commit to HaoZeke/metatomic that referenced this pull request Mar 4, 2026
Address PR metatensor#173 review feedback from GardevoirX:
- Add s, second, ms, us, ns, ps with full-word aliases to time tokens
- Add tests verifying ModelOutput rejects mismatched quantity/unit dims
- Add tests for standalone micro sign (U+00B5) -> Dalton resolution
- Update docs token table and doxygen with new time unit coverage
- Fix stray dash in RST list-table Dimensionless row
@HaoZeke HaoZeke requested a review from GardevoirX March 4, 2026 15:26
Copy link
Contributor

@GardevoirX GardevoirX left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, love it!

Comment on lines +23 to +24
Dimensional compatibility is verified automatically; no ``quantity`` parameter
is needed. Token lookup is case-insensitive, and whitespace is ignored.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be documented as part of unit_conversion_factor directly?

- Tokens
- Notes
* - **Length**
- ``angstrom`` (``a``), ``bohr``, ``nm`` (``nanometer``), ``meter`` (``m``), ``cm`` (``centimeter``), ``mm`` (``millimeter``), ``um`` (``micrometer``)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- ``angstrom`` (``a``), ``bohr``, ``nm`` (``nanometer``), ``meter`` (``m``), ``cm`` (``centimeter``), ``mm`` (``millimeter``), ``um`` (``micrometer``)
- ``angstrom`` (``A``), ``bohr``, ``nm`` (``nanometer``), ``meter`` (``m``), ``cm`` (``centimeter``), ``mm`` (``millimeter``), ``um`` (``micrometer``)

This should explicitly list µm as well. I'm also not sure why the main unit is sometimes meter sometimes cm. I would use the full name every time, and put the abbreviation in parenthesis.

- ``angstrom`` (``a``), ``bohr``, ``nm`` (``nanometer``), ``meter`` (``m``), ``cm`` (``centimeter``), ``mm`` (``millimeter``), ``um`` (``micrometer``)
-
* - **Energy**
- ``ev``, ``mev``, ``hartree``, ``ry`` (``rydberg``), ``joule`` (``j``), ``kcal``, ``kj``
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- ``ev``, ``mev``, ``hartree``, ``ry`` (``rydberg``), ``joule`` (``j``), ``kcal``, ``kj``
- ``eV``, ``meV``, ``Hartree``, ``Ry`` (``rydberg``), ``Joule`` (``J``), ``kcal``, ``kJ``

let's use the usual upper case version of the units in the docs

- ``ev``, ``mev``, ``hartree``, ``ry`` (``rydberg``), ``joule`` (``j``), ``kcal``, ``kj``
- ``kcal`` and ``kj`` are bare (not per-mol); write ``kcal/mol`` for the per-mole unit
* - **Time**
- ``s`` (``second``), ``ms`` (``millisecond``), ``us`` (``microsecond``), ``ns`` (``nanosecond``), ``ps`` (``picosecond``), ``fs`` (``femtosecond``)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

needs µs as well

Comment on lines +57 to +58
Known quantities
^^^^^^^^^^^^^^^^
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Long term, I think we'll want to remove the user-provided quantity, and just check that an "energy" output has an unit with the dimension of an energy.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now we can keep this!


CHECK_THROWS_WITH(options->set_length_unit("unknown"),
StartsWith("unknown unit 'unknown' for length")
StartsWith("unknown unit token 'unknown'")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think user care about tokens, so why mention it in the error message?


### Changed

- `validate_unit` now accepts arbitrary parseable expressions and validates
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

validate_unit is not public and should not be mentioned here =)

_unit_conversion_factor_v2 = torch.ops.metatomic.unit_conversion_factor_v2
_unit_conversion_factor_v1 = torch.ops.metatomic.unit_conversion_factor

def unit_conversion_factor(*args, **kwargs):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this work with TorchScript? I thought *args, **kwargs is not supported

HaoZeke added 3 commits March 9, 2026 12:47
Replace the old string-matching unit conversion with a Shunting-Yard
expression parser that supports *, /, ^, and parentheses. Unit code
extracted to units.hpp/cpp per review feedback. Micro sign handling
uses explicit base_units entries instead of global normalization.
Remove standalone micro sign test (no longer normalizes globally),
add micro sign microsecond tests instead. Update error message
assertions ("unknown unit" not "unknown unit token"). Add comment
explaining why models.cpp test uses valid quantity/unit strings.
Move full unit expression documentation to the Doxygen comment on
unit_conversion_factor in units.hpp (renders via autofunction).
Remove redundant standalone sections from misc.rst, keeping only
the known-quantities table and deprecation note.
@HaoZeke HaoZeke force-pushed the feat/unit-expression-parser branch from 01198d7 to eb8e28a Compare March 9, 2026 11:48
HaoZeke added 4 commits March 9, 2026 12:50
Remove validate_unit from CHANGELOG (not public API). Move
unit_conversion_factor docstring from documentation.py to the
Python function in __init__.py (documentation.py is only for
C++ ops).
to_lower() now skips non-ASCII bytes, preventing macOS locale from
mangling UTF-8 micro sign (0xB5 -> 'u'). Restore unit_conversion_factor
in documentation.py since __init__.py imports it for Sphinx builds.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Use a proper expression parser for unit conversions

3 participants