gh-137146: Validate IPv6 ZoneID characters against RFC 6874 in urllib.parse #137148
…pliant set
The current parsing logic for IPv6 addresses with Zone Identifiers (ZoneIDs)
uses the `ipaddress` module, which validates ZoneIDs according to RFC 4007,
allowing any non-null string. However, when used in URLs, ZoneIDs must follow
the percent-encoded format defined in RFC 6874.
This patch adds a check to restrict ZoneIDs to the allowed characters:
ALPHA / DIGIT / "-" / "." / "_" / "~" / "%" HEXDIG HEXDIG
RFC 6874 §2.1 specifies the format of an IPv6 address with a ZoneID in a URI as:
`IPv6addrz = IPv6address "%25" ZoneID`
Additionally, RFC 6874 recommends accepting a bare `%` without hex digits as a
liberal extension, but that flexibility still requires ZoneID content to conform
to a safe character set. This patch enforces that ZoneIDs do not include
characters outside the permitted range.
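As a minimal sketch of the check described above (not the code in this patch), the allowed character set can be written as a single compiled pattern. The names `_ZONE_ID_RE`, `zone_id_of` and `has_valid_zone_id` are hypothetical and exist only for illustration; the sketch assumes the host has already been stripped of its brackets and treats everything after the first `%` as the ZoneID.

```python
import re

# Hypothetical names; only the character set comes from the description above.
# unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"; pct-encoded = "%" HEXDIG HEXDIG
_ZONE_ID_RE = re.compile(r"(?:%[0-9A-Fa-f]{2}|[A-Za-z0-9._~-])+")


def zone_id_of(host):
    """Return the text after the first '%' in a bracket-less host, or None."""
    _, sep, zone_id = host.partition("%")
    return zone_id if sep else None


def has_valid_zone_id(host):
    """True if the host has no ZoneID or its ZoneID uses only allowed characters."""
    zone_id = zone_id_of(host)
    return zone_id is None or _ZONE_ID_RE.fullmatch(zone_id) is not None


for host in ("::1", "fe80::1%25eth0", "::1%2|test", "fe80::1%"):
    print(host, has_valid_zone_id(host))
```

Under these assumptions an empty ZoneID (a trailing `%`) and a ZoneID containing `|` are both rejected, while a host without any `%` is left for the address parser to judge.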
### Before the fix:
```py
>>> import urllib.parse
>>> urllib.parse.urlparse("http://[::1%2|test]/path")
ParseResult(scheme='http', netloc='[::1%2|test]', path='/path', ...)
```
Invalid characters such as `|` were incorrectly accepted in ZoneIDs.
### After the fix:
```py
>>> import urllib.parse
>>> urllib.parse.urlparse("http://[::1%2|test]/path")
Traceback (most recent call last):
...
ValueError: IPv6 ZoneID is invalid
```
This patch ensures `urllib.parse` properly rejects ZoneIDs with invalid characters,
improving compliance with the URI standards and helping prevent subtle bugs
or security vulnerabilities.
In the future, please use the title format I have edited your title to, so that our automation can recognise it.
StanFromIreland
left a comment
This needs a blurb entry.
ZeroIntensity
left a comment
Please add a test case.
@@ -0,0 +1 @@
Validate IPv6 ZoneID characters in bracketed hostnames to match RFC 6874. `urllib.parse` now rejects ZoneIDs containing invalid or unsafe characters.
This is reStructuredText, not Markdown, so references look like this:
- Validate IPv6 ZoneID characters in bracketed hostnames to match RFC 6874. `urllib.parse` now rejects ZoneIDs containing invalid or unsafe characters.
+ Validate IPv6 ZoneID characters in bracketed hostnames to match RFC 6874. :mod:`urllib.parse` now rejects ZoneIDs containing invalid or unsafe characters.
`urllib.parse` is a module; if you want to refer to the function, it's `urllib.parse.urlparse`. I've edited Zero's suggestion to change the role, as I didn't check which functions are affected (if it's the entire module, it's fine to reference only the module).
I corrected the blurb and added the tests, thank you for your help.
Lib/urllib/parse.py
Outdated
    ip = ipaddress.ip_address(hostname)  # Throws Value Error if not IPv6 or IPv4
    if isinstance(ip, ipaddress.IPv4Address):
        raise ValueError(f"An IPv4 address cannot be in brackets")
    if "%" in hostname and not re.match(r"\A(%[a-fA-F0-9]{2}|[\w\.~-])+\z", hostname.split("%", 1)[1], flags=re.ASCII):
Why are we using \A and \z instead of fullmatch? In addition, we should instead use a compiled regex.
Thanks for the review, it's done.
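For illustration, here is a rough sketch of what the reviewer's suggestion amounts to: compile the pattern once and call `fullmatch`, so the explicit start and end anchors are no longer needed. The names below are hypothetical and this is not the code that was committed. The anchored variant uses `\Z` rather than the `\z` from the hunk above, on the assumption that not every Python version in use accepts `\z`.

```python
import re

# Same alternation as in the reviewed hunk; with re.ASCII, \w is [a-zA-Z0-9_].
ZONE_ID_BODY = r"(?:%[a-fA-F0-9]{2}|[\w.~-])+"

# Variant 1: explicit anchors with match() (hypothetical name).
_ANCHORED = re.compile(r"\A" + ZONE_ID_BODY + r"\Z", flags=re.ASCII)
# Variant 2: no anchors, fullmatch() does the anchoring (hypothetical name).
_UNANCHORED = re.compile(ZONE_ID_BODY, flags=re.ASCII)


def check_anchored(zone_id):
    return _ANCHORED.match(zone_id) is not None


def check_fullmatch(zone_id):
    return _UNANCHORED.fullmatch(zone_id) is not None


samples = ["eth0", "25eth0", "2|test", "", "eth0\n", "%2", "a%2Fb"]
for s in samples:
    # Both variants should agree on every sample.
    print(repr(s), check_anchored(s), check_fullmatch(s))
```

Compiling at module level also avoids going through the `re` module's pattern cache on every call.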
Lib/urllib/parse.py
Outdated
    if isinstance(ip, ipaddress.IPv4Address):
        raise ValueError(f"An IPv4 address cannot be in brackets")
    if "%" in hostname and not re.match(r"\A(%[a-fA-F0-9]{2}|[\w\.~-])+\z", hostname.split("%", 1)[1], flags=re.ASCII):
        raise ValueError(f"IPv6 ZoneID is invalid")
The f-string is not necessary (you can also remove it from the other raise ValueError).
Thanks for the review, it's done.
Misc/NEWS.d/next/Library/2025-07-27-15-23-32.gh-issue-137146.BE_ylT.rst
Outdated
Lib/urllib/parse.py
Outdated
    ip = ipaddress.ip_address(hostname)  # Throws Value Error if not IPv6 or IPv4
    if isinstance(ip, ipaddress.IPv4Address):
        raise ValueError(f"An IPv4 address cannot be in brackets")
    if "%" in hostname and not re.match(r"\A(%[a-fA-F0-9]{2}|[\w\.~-])+\z", hostname.split("%", 1)[1], flags=re.ASCII):
Can't this actually be delegated to ipaddress.ip_address instead?
Thanks @picnixz for your review. I explain it in the description.
ipaddress.ip_address follows RFC 4007 because it handles the ZoneID as part of the IPv6 string representation. URLs, however, have reserved and disallowed characters. RFC 4007 defines a good string representation, but it is not well suited to URL syntax and parsing, so a separate RFC defines how a ZoneID may be written in a URL: RFC 6874.
ipaddress.ip_address follows RFC 4007 to parse the IPv6 format, and urllib should parse the ZoneID as defined by the URL format. These are two different representations and two different implementations, so we can't rely on ipaddress.ip_address alone if we want to respect the URL format and parsing rules.
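A short sketch of the point being made, assuming Python 3.9 or later (where `IPv6Address.scope_id` is available): `ipaddress.ip_address` accepts the ZoneID from the example in the description, so delegating to it cannot enforce the URL character set. The helper name `parses_as_ip` is hypothetical.

```python
import ipaddress


def parses_as_ip(text):
    """Hypothetical helper: True if ipaddress.ip_address() accepts the text."""
    try:
        ipaddress.ip_address(text)
    except ValueError:
        return False
    return True


# The scope (zone) part is kept as-is, including characters not allowed in URLs.
addr = ipaddress.ip_address("::1%2|test")
print(type(addr).__name__, repr(addr.scope_id))

# An empty scope is rejected, but a '|' inside it is not.
print(parses_as_ip("::1%"), parses_as_ip("::1%2|test"))
```

This matches the "Before the fix" behaviour shown in the description, where `urlparse` relied on this call alone.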
Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com>
This PR tightens the validation of IPv6 Zone Identifiers (ZoneIDs) in bracketed hostnames handled by `urllib.parse` (#137146).

### Problem
Currently, `urllib.parse` accepts any non-null string as a ZoneID, because it delegates IPv6 parsing to the `ipaddress` module, which follows RFC 4007. However, RFC 6874 §2.1 defines a stricter character set for ZoneIDs when used in URLs:

ZoneIDs in URIs must be percent-encoded and may optionally begin with a literal `%` (e.g., `%25`) as described in the RFC.

### Fix
This patch adds an explicit validation step to check that any ZoneID in a URL conforms to the allowed character set.
Before the fix:
After the fix:

### Notes
`%` is present in the hostname (i.e., it's a ZoneID).

This improves RFC compliance, reduces risk of incorrect or insecure behavior, and ensures more predictable URL parsing.
`urllib.parse` accepts invalid characters in IPv6 ZoneIDs and IPvFuture addresses #137146