Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
url: update WHATWG URL parser to latest spec
- Update to spec
  - Add opaque hosts
  - File state did not correctly deal with lack of base URL
  - Cleanup API for file and non-special URLs
  - Allow % and IPv6 addresses in non-special URL hosts
  - Use specific names for percent-encode sets
  - Add empty host concept for file and non-special URLs
  - Clarify IPv6 serializer
- Fix existing mistakes
  - Add missing ':' to forbidden host code point list.
  - Correct IPv4 parser empty label behavior
- Maintain type equivalence in URLContext with spec
  - scheme, username, and password should always be strings
  - host, port, query, and fragment may be strings or null
  - Align scheme state more closely with the spec
  - Make sure the `special` variable is always synced with
    URL_FLAG_SPECIAL.

PR-URL: #12523
Fixes: #10608
Fixes: #10634
Refs: whatwg/url#185
Refs: whatwg/url#225
Refs: whatwg/url#224
Refs: whatwg/url#218
Refs: whatwg/url#243
Refs: whatwg/url#260
Refs: whatwg/url#268
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Daijiro Wachi <daijiro.wachi@gmail.com>
Reviewed-By: Joyee Cheung <joyeec9h3@gmail.com>
  • Loading branch information
TimothyGu committed Apr 23, 2017
commit e0530f1eba3c8fcf0db63edf6ec445299574237d
28 changes: 15 additions & 13 deletions doc/api/url.md
Original file line number Diff line number Diff line change
Expand Up @@ -1053,23 +1053,25 @@ located within the structure of the URL. The WHATWG URL Standard uses a more
selective and fine grained approach to selecting encoded characters than that
used by the older [`url.parse()`][] and [`url.format()`][] methods.

The WHATWG algorithm defines three "encoding sets" that describe ranges of
characters that must be percent-encoded:
The WHATWG algorithm defines three "percent-encode sets" that describe ranges
of characters that must be percent-encoded:

* The *simple encode set* includes code points in range U+0000 to U+001F
(inclusive) and all code points greater than U+007E.
* The *C0 control percent-encode set* includes code points in range U+0000 to
U+001F (inclusive) and all code points greater than U+007E.

* The *default encode set* includes the *simple encode set* and code points
U+0020, U+0022, U+0023, U+003C, U+003E, U+003F, U+0060, U+007B, and U+007D.
* The *path percent-encode set* includes the *C0 control percent-encode set*
and code points U+0020, U+0022, U+0023, U+003C, U+003E, U+003F, U+0060,
U+007B, and U+007D.

* The *userinfo encode set* includes the *default encode set* and code points
U+002F, U+003A, U+003B, U+003D, U+0040, U+005B, U+005C, U+005D, U+005E, and
U+007C.
* The *userinfo encode set* includes the *path percent-encode set* and code
points U+002F, U+003A, U+003B, U+003D, U+0040, U+005B, U+005C, U+005D,
U+005E, and U+007C.

The *simple encode set* is used primary for URL fragments and certain specific
conditions for the path. The *userinfo encode set* is used specifically for
username and passwords encoded within the URL. The *default encode set* is used
for all other cases.
The *userinfo percent-encode set* is used exclusively for username and
passwords encoded within the URL. The *path percent-encode set* is used for the
path of most URLs. The *C0 control percent-encode set* is used for all
other cases, including URL fragments in particular, but also host and path
under certain specific conditions.

When non-ASCII characters appear within a hostname, the hostname is encoded
using the [Punycode][] algorithm. Note, however, that a hostname *may* contain
Expand Down
Loading