URLs: It's Complicated

  • Just to share a little more of the weirdness (discovered while reading a couple of the historical URL & URI RFCs several days ago):

    Per the original spec, in FTP URLs,

    - ftp://example.net/foo/bar will get you bar inside the foo directory inside the default directory of the FTP server at example.net (i.e. CWD foo, RETR bar);

    - ftp://example.net//foo/bar will get you bar inside the foo directory inside the empty string directory inside the default directory of the FTP server at example.net (i.e. CWD, CWD foo, RETR bar; what do FTP servers even do with this?);

    - and it’s ftp://example.net/%2Ffoo/bar that you must use if you want bar inside the foo directory inside the root directory of the FTP server at example.net (i.e. CWD /foo, RETR bar; %2F being the result of percent-encoding a slash character).
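
    A minimal Python sketch of the mapping described in the three cases above. The function name ftp_commands is made up for illustration, and the sketch ignores the optional ";type=" suffix and any login step; it only shows that splitting on "/" has to happen before percent-decoding.

```python
from urllib.parse import unquote, urlsplit


def ftp_commands(url):
    """Translate an ftp:// URL into the CWD/RETR sequence described above.

    Hypothetical helper for illustration only, not a full FTP client.
    """
    path = urlsplit(url).path
    # The first "/" only separates the host from the url-path; it is not
    # part of the path itself.
    segments = path[1:].split("/")
    # Split on "/" first, then percent-decode each piece, so that %2F
    # survives as a literal slash inside a single CWD argument.
    *dirs, name = [unquote(s) for s in segments]
    # An empty segment becomes a bare "CWD" with no argument.
    commands = [("CWD " + d).rstrip() for d in dirs]
    commands.append("RETR " + name)
    return commands


for url in (
    "ftp://example.net/foo/bar",
    "ftp://example.net//foo/bar",
    "ftp://example.net/%2Ffoo/bar",
):
    print(url, "->", ftp_commands(url))
```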

  • It seems like the colon is too ambiguous: it is used as the protocol delimiter, the user/pass delimiter, and the port delimiter.

    Reminds me a little of Java labels, where you can do this:

      public class Labels {
        public static void main(String[] args) {
            // "https:" is a statement label; "//hn.ycombinator.com" is a line comment.
            https://hn.ycombinator.com
            for (int i = 0; i < 10; i++) {
                System.out.println("......." + i);
            }
        }
      }
    
    Here https: is a label named https, and everything from the // onward is a comment, so this is valid code.
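
    For the three roles of the colon in a URL, here is a small Python sketch using the standard urllib.parse module. The URL, user name, password and port are invented for the example; the path deliberately contains a fourth colon that delimits nothing.

```python
from urllib.parse import urlsplit

# Hypothetical URL: every value in it is made up for illustration.
parts = urlsplit("https://user:secret@example.net:8443/a:b")

# Colon 1 ends the scheme, colon 2 splits user from password,
# colon 3 introduces the port, and colon 4 is just path data.
print("scheme:  ", parts.scheme)
print("username:", parts.username)
print("password:", parts.password)
print("hostname:", parts.hostname)
print("port:    ", parts.port)
print("path:    ", parts.path)
```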

  • All extremely useful: the overview, the examples and the comments.

    A few months ago while writing a bot/crawler I searched for hours for something like this, but I found only full specs or just bits and pieces scattered around that used different terminology and/or had different opinions.

    In the end I didn't even clearly understand what the maximum total URL length should be (e.g. mixed opinions here https://stackoverflow.com/questions/417142/what-is-the-maxim... - come on, an xGiB-long URL?). Most of the time 2000 bytes is mentioned, but it's not 100% clear.

    Writing a bot made me understand 1) why browsers are so complicated and 2) that the Internet is a mess (e.g. once I even found a page that used multiple character encodings...).

    My personal opinion is that everything is too lax. Browsers try to be the best ones by implementing workarounds for stuff that either has no spec (yet) or does not comply with one, and this way it can only end up in a mess. A simple example is the HTTP header "Content-Encoding" ( https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Co... ), which I think should only indicate what kind of compression is being used, but I keep seeing values like "utf8"/"image/jpeg"/"base64"/"8bit"/"none"/"binary"/etc. in there, and all those pages/files work perfectly in browsers even though, with those values, they should actually be rejected.
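
    A sketch of the strict check a crawler could apply to that header, in Python. KNOWN_CODINGS is an illustrative subset of the registered content codings and not an authoritative list, and unknown_codings is a made-up helper name.

```python
# Illustrative subset of registered content codings (an assumption of this
# sketch; consult the IANA registry for the full list).
KNOWN_CODINGS = {
    "gzip", "x-gzip", "compress", "x-compress",
    "deflate", "br", "zstd", "identity",
}


def unknown_codings(header_value):
    """Return the codings in a Content-Encoding value that are not known."""
    # The header is a comma-separated list; coding names are case-insensitive.
    tokens = [t.strip().lower() for t in header_value.split(",")]
    return [t for t in tokens if t and t not in KNOWN_CODINGS]


for value in ("gzip", "deflate, gzip", "utf8", "image/jpeg", "none"):
    print(repr(value), "->", unknown_codings(value))
```

    A strict client would refuse a response whenever this list is non-empty; the lenient behaviour described above amounts to ignoring it.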

  • I have come across even more issues caused by IRIs used incorrectly in place of URIs by a popular web framework, causing havoc with OAuth redirects.

    https://en.wikipedia.org/wiki/Internationalized_Resource_Ide...
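
    To make the IRI/URI difference concrete, here is a rough Python sketch of the conversion: non-ASCII characters in the path, query and fragment are percent-encoded as UTF-8, and the host is converted to its ASCII form. The function iri_to_uri and the example address are hypothetical, the sketch assumes the IRI has a host and no userinfo, and Python's built-in "idna" codec implements the older IDNA 2003 rules.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left untouched; "%" is kept so existing escapes are not
# encoded a second time.
SAFE = "/:@!$&'()*+,;=%?"


def iri_to_uri(iri):
    """Map an IRI to an ASCII-only URI (illustrative sketch only)."""
    parts = urlsplit(iri)
    # Convert the host labels to their ASCII ("xn--") form.
    host = parts.hostname.encode("idna").decode("ascii")
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    # quote() encodes non-ASCII text as UTF-8 bytes, then percent-encodes.
    return urlunsplit((
        parts.scheme,
        netloc,
        quote(parts.path, safe=SAFE),
        quote(parts.query, safe=SAFE),
        quote(parts.fragment, safe=SAFE),
    ))


print(iri_to_uri("https://bücher.example/redirect?next=/café"))
```

    An OAuth redirect check that compares strings byte for byte will see the two forms as different addresses, which is one way a mix-up like this can break.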

  • > making this a valid URL: https://!$%:)(*&^@www.netmeister.org/blog/urls.html

    Uh, no. "%:)" is not <"%" HEXDIG HEXDIG>, nor is % allowed outside of that. (Although your browser will likely accept it.)
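
    The userinfo rule can be written as a regular expression. This Python sketch assumes the RFC 3986 production userinfo = *( unreserved / pct-encoded / sub-delims / ":" ); the name valid_userinfo is made up for the example.

```python
import re

# unreserved / sub-delims / ":" as a character class, or a "%" followed
# by exactly two hex digits (pct-encoded).
USERINFO = re.compile(r"(?:[A-Za-z0-9\-._~!$&'()*+,;=:]|%[0-9A-Fa-f]{2})*")


def valid_userinfo(text):
    """True if text matches the userinfo grammar assumed above."""
    return USERINFO.fullmatch(text) is not None


print(valid_userinfo("!$%:)(*&^"))    # the userinfo from the quoted URL
print(valid_userinfo("user:pa%24s"))  # a properly escaped "$"
```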

    > This includes spaces, and the following two URLs lead to the same file located in a directory that's named " ":
    > https://www.netmeister.org/blog/urls/ /f
    > https://www.netmeister.org/blog/urls/%20/f
    > Your client may automatically percent-encode the space, but e.g., curl(1) lets you send the raw space:

    Uh, no. Just because one of your clients is wrong and some servers allow it doesn't mean it's allowed by the spec.

    In fact, the HTTP/1.1 RFC defers to RFC 2396 for the meaning of <abs_path>: a / followed by <path_segments>.

    What is <path_segments>? A bunch of slash-delimited <segment>s.

    What is <segment>? A bunch of <pchar> and maybe a semicolon.

    What is <pchar>? <unreserved>, <escaped>, or some special characters (not including space).

    What is <unreserved>? Letters, digits, and some special characters (not including space).

    What is <escaped>? <"%" hex hex>.

    Most HTTP clients and servers are pretty forgiving about what they accept, because other people do broken stuff, like sending them literal spaces. But that doesn't mean it's "allowed", that doesn't mean every server allows it, and that doesn't mean it's a good idea.
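
    The chain of definitions above is small enough to transcribe into a regular expression. A Python sketch, assuming the RFC 2396 productions as summarized in this comment (the variable and function names are made up):

```python
import re

# pchar: unreserved (letters, digits, marks), a few specials, or an escape.
PCHAR = r"(?:[A-Za-z0-9\-_.!~*'():@&=+$,]|%[0-9A-Fa-f]{2})"
# segment: pchars, optionally followed by ";"-introduced parameters.
SEGMENT = rf"{PCHAR}*(?:;{PCHAR}*)*"
# abs_path: a "/" followed by slash-delimited segments.
ABS_PATH = re.compile(rf"/{SEGMENT}(?:/{SEGMENT})*")


def is_abs_path(path):
    """True if path matches the abs_path grammar transcribed above."""
    return ABS_PATH.fullmatch(path) is not None


print(is_abs_path("/blog/urls/%20/f"))  # escaped space
print(is_abs_path("/blog/urls/ /f"))    # raw space
```

    A raw space fails because it appears in none of the character classes; only the %20 form matches.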

    > That is, if your web server supports (and has enabled) user directories, and you submit a request for "~username": [it does stuff]

    Uh, no. If you're using Apache, that might be true. As you mentioned, this is implementation-defined (as are all pathnames).

    > Now with all of this long discussion, let's go back to that silly URL from above: ... Now this really looks like the Buffalo buffalo equivalent of a URL.

    Not really.

    > Now we start to play silly tricks: "⁄ ⁄www.netmeister.org" uses the fraction slash characters

    You are aware that URLs predate Unicode, right? Not to mention that Unicode lookalike characters are a Unicode (or UI) problem, not a URL problem?

    > The next "https" now is the hostname component of the authority: a partially qualified hostname, that relies on /etc/hosts containing an entry pointing https to the right IP address.

    Or on a search domain (which could be configured locally, or through GPO on Windows, or through DHCP!). Or maybe your resolver has a local zone for it. Or maybe ...

  • Layouts using <table>s are complicated too. For example, this page has a ~7800px-wide <pre> tag in a <table> that's 720px wide.

  • Specifically using a different font for the code tag than for the rest of the blog, to hide the difference between ⁄⁄ and //, seems weird. I get that it wouldn't be interesting otherwise, but doesn't that just show that it's really not as complicated as you make it out to be?

  • URLs are not complicated, unless you complicate them.

    foo|foo -foo 's^foo^foo^'"">foo 2>>foo

    is not a very good example for teaching the structure of the command line.

    Pick a better one.

    It's simple.

  • It doesn’t seem complicated at all. Complicated to me means difficult to understand. This just involves reading the spec and it all seems pretty simple and consistent.

    Complicated doesn’t mean “new to me.” If I haven’t read a man page, that doesn’t mean the command is complicated.