Paul E. Hoffman
With a little thought to their presentation, however, URLs for most Internet services can be made more inviting and useful. This is not to say that the specification for the structure of URLs must be changed: instead, people who create Internet services need to think more about how to name them so that they will appear better in URLs.
This paper describes some of the problems that are common in current URLs seen on the Internet. Solutions to these problems are not always possible, or even desirable in all cases. There are certainly no "right" or "wrong" ways to name URLs. However, if Internet users keep these interface guidelines in mind when they create Internet resources, it is likely that the URLs for those resources will be easier for other people to use.
Documents, whether they are in print or electronic form, often refer to other documents. There is usually a standard way to make references in one document to other documents. In printed books, for example, when you want to refer to another book, you usually refer to it by title. You might also refer to it by its International Standard Book Number (ISBN). At the time the Web was created, there was no standard method to refer to electronic documents. Thus, URLs were created as a reference mechanism.
It is important to understand that a URL describes a document by its location. This is somewhat akin to describing a book in a library by its location on the shelf instead of by its name. If you look at the Internet as a single, huge library, it is reasonable to describe documents by their location because there is currently no standard way to describe them by name.
Two schemes that do not use the common syntax are mailto and news. The scheme-specific part of a mailto URL is simply an Internet mailing address; the scheme-specific part of a news URL is simply the name of the Usenet newsgroup.
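The difference between the common syntax and these two schemes can be seen by splitting a few URLs into their parts. The following is a minimal sketch using Python's standard `urllib.parse` module; the function name `describe` and the example addresses are hypothetical and chosen only for illustration.

```python
from urllib.parse import urlsplit


def describe(url):
    """Return (scheme, host, rest) for a URL.

    host is None when the URL has no "//host" part, as is the case
    for mailto and news URLs.
    """
    parts = urlsplit(url)
    return parts.scheme, parts.hostname, parts.path


# Hypothetical example URLs, not taken from the paper.
examples = [
    "http://www.example.com/docs/index.html",
    "mailto:someone@example.com",
    "news:comp.infosystems.www.misc",
]

for url in examples:
    print(describe(url))
```

For the mailto and news examples, everything after the first colon is reported as a single scheme-specific string and no host is found.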
This description of the basic URL syntax starts to reveal some user interface issues for URLs. For many users, the "://" part of the common syntax reeks of technobabble. Few users understand TCP ports, so an additional ":" and number after the host name is also confusing, because nothing in the URL describes what the number means.
Further, not all characters in the US-ASCII character set can be used in a URL. RFC 1738 lists many punctuation characters that must be encoded as a percent sign followed by two hexadecimal digits before being included in a URL. For example, the tilde character (~) is an "unsafe" character, and must not appear in the URL. Instead, the Web user must encode the tilde as the characters "%7E". Thus, the URL:
is not allowed. Instead, it would be:
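As a rough illustration of the encoding rule described above, the following sketch replaces unsafe characters with a percent sign and two hexadecimal digits. The function name `encode_unsafe`, the example path, and the exact character set are assumptions: the set is only a subset of the characters RFC 1738 calls unsafe, and it deliberately leaves out "%" and "#" so that text that is already encoded is not encoded twice.

```python
# A subset of the characters RFC 1738 describes as "unsafe".
# "%" and "#" are omitted on purpose (an assumption of this sketch).
UNSAFE = set(' <>"{}|\\^~[]`')


def encode_unsafe(text):
    """Replace each unsafe character with "%" plus its two-digit hex code."""
    return "".join(
        "%{:02X}".format(ord(ch)) if ch in UNSAFE else ch
        for ch in text
    )


# Hypothetical path, not taken from the paper.
print(encode_unsafe("/~user/index.html"))
```

The substitution is done by hand because the standard library's `urllib.parse.quote` follows a later specification and leaves the tilde unencoded in current Python versions.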
Because a URL simply describes the location of information on the Internet, you don't really "create" a URL. Instead, you place the information on an Internet server and then derive the URL from where you placed the information. The vast majority of Internet servers base the location on file names, leading to URLs that resemble file names in a directory path. On such systems, the choices made about the names of the directories and files, and about the hierarchy, directly affect the URLs for the information on that server.
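To make the relationship between file placement and URLs concrete, here is a minimal sketch of how a file-based server might derive a URL from a file's location. The host name, the document root, and the function name `url_for_file` are all hypothetical; real servers apply their own mapping rules.

```python
from pathlib import PurePosixPath


def url_for_file(host, document_root, file_path):
    """Derive an http URL from a file's place under the server's document root.

    Raises ValueError if file_path is not inside document_root.
    """
    relative = PurePosixPath(file_path).relative_to(document_root)
    return "http://{}/{}".format(host, relative.as_posix())


# Hypothetical host and directory layout.
print(url_for_file("www.example.edu", "/srv/www", "/srv/www/math/notes/week1.html"))
```

Every directory and file name chosen on the server shows up unchanged in the resulting URL, which is why naming decisions made on disk become part of the user interface.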
Some typical problems with URLs found on the Web today are described in this section. Note that these are not things that must be avoided at all costs, but simply guidelines for what to look out for when creating URLs. At many sites, some of these problems are unavoidable without reconfiguring the server software. However, such changes can often be made with the help of the system administrator, particularly if there is a compelling reason to do so.
Printed URLs can contain characters that a reader may misinterpret when they type the URL into a computer program. The letter "l" looks like the digit "1"; the letter "O" looks like the digit "0"; and semicolons can be mistaken for commas. Depending on the design of the font, a double hyphen (--) can look like a single hyphen, a back quote (`) can become almost invisible, and so on.
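A simple check along these lines can help a proofreader spot URLs that deserve a second look before they go to print. The sketch below is only an illustration: the table of characters comes from the examples in the paragraph above, and the names `CONFUSABLE` and `risky_characters` are hypothetical.

```python
# Characters that a reader may misread in print, with a short reason.
CONFUSABLE = {
    "l": "letter l can look like the digit 1",
    "1": "digit 1 can look like the letter l",
    "O": "letter O can look like the digit 0",
    "0": "digit 0 can look like the letter O",
    ";": "semicolon can be mistaken for a comma",
    "`": "back quote can be almost invisible",
}


def risky_characters(url):
    """Return the confusable characters (and "--") that occur in url."""
    found = [ch for ch in CONFUSABLE if ch in url]
    if "--" in url:
        # A double hyphen can look like a single hyphen in some fonts.
        found.append("--")
    return found


# Hypothetical URL, not taken from the paper.
for item in risky_characters("http://www.example.com/a--b;1.html"):
    print(item, "-", CONFUSABLE.get(item, "double hyphen can look like one hyphen"))
```

Because common letters such as "l" appear in most URLs, a check like this is better suited to highlighting characters for a human reviewer than to rejecting URLs outright.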
In recent years, magazine publishers have discovered that a few characters with special meaning in typesetting systems also appear in URLs, so a URL is likely to be printed incorrectly unless the work is carefully proofread. Many typesetting systems treat the tilde (~) as a special character that must be specially marked in order to appear in print; forgetting to do so causes the character to disappear from the printed text. In other typesetting systems, back quotes (`) and percent signs (%) also cause problems. Proofreaders must check the original text of a URL against what appears on the printed page whenever the URL contains any of these special characters.
A pattern of domain naming has developed in the past few years, partially due to the popularity of the Web. Most hosts that serve particular Internet protocols indicate the protocol names in their domain names. For example, if there is an FTP server at bigstate.edu, the domain name for that host is usually "ftp.bigstate.edu". There have been many arguments about the logic of this arrangement, but the pattern has been established throughout the Internet.
This has led users to expect these domain names, even though the pattern is voluntary and the names are not terribly meaningful. If a host has the "wrong" name, it can confuse users who expect names to follow the pattern. For instance, if the host name part of an http URL were "web.bigcompany.com" or "http.bigcompany.com", the user might not notice that the name was different from the expected "www" and type the URL with "www.bigcompany.com". Fortunately, this problem is easy to remedy: system administrators can give all the expected names to a single host computer.
URLs that include a TCP port number are also likely to cause confusion and typing mistakes. Most Internet users understand a bit about domain names and directory trees, but few understand what TCP ports are. This is a particularly thorny issue for a single host that serves different information on different ports. For example, assume that a user sees the URL:
They may at first type the URL without the port number. If there is also an HTTP server at the standard port, port 80, the user will access the wrong server, and will probably get an error message stating that the document doesn't exist. Similarly, if they type "8000" instead of "8001" and there is a server on that port as well, there may be no indication that they are on the wrong server.
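The reason a dropped or mistyped port leads to a different server can be sketched in a few lines. The host name and paths below are hypothetical, as are the names `DEFAULT_PORTS`, `effective_port`, and `same_server`; the sketch simply treats a host and port pair as identifying one server.

```python
from urllib.parse import urlsplit

# Standard ports assumed when a URL does not name one.
DEFAULT_PORTS = {"http": 80, "ftp": 21}


def effective_port(url):
    """Return the port a client would connect to for this URL."""
    parts = urlsplit(url)
    if parts.port is not None:
        return parts.port
    return DEFAULT_PORTS.get(parts.scheme)


def same_server(url_a, url_b):
    """True when both URLs point at the same host and port."""
    host_a = urlsplit(url_a).hostname
    host_b = urlsplit(url_b).hostname
    return (host_a, effective_port(url_a)) == (host_b, effective_port(url_b))


# Hypothetical URLs: the second one is what a user might type by mistake.
printed = "http://www.example.com:8001/docs/report.html"
typed = "http://www.example.com/docs/report.html"
print(effective_port(printed), effective_port(typed), same_server(printed, typed))
```

Nothing in the two URLs other than the port distinguishes them, so a user who omits it has no visible hint that the request went to a different server.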
Throughout this paper, the URLs given as examples have all used English as a base language. In fact, the vast majority of the Internet is English-centric, even though many people on the Internet speak little or no English. This emphasis on English is pervasive on the Internet, not just in URLs.
Naming URLs in languages other than English brings up the character set issue discussed earlier in this paper. In many languages, diacritic characters are essential for differentiating words, yet these characters don't exist in the US-ASCII character set. In many cases, this lack of international characters prevents people from using "logical" names in their URLs.
The issue of human languages also goes a bit deeper: a word or name used in a URL can have quite different meanings in other languages. Further, the same word in the same language can carry different cultural meanings in different countries. For example, there are words commonly used in American English that would be considered rude or vulgar in British English, and vice versa.
For example, assume that a user sees the following URL:
There are many guesses that a user can make from this URL:
Instead of the URL shown previously for Yee's course notes, the server might make them available with the URL:
The course notes would be the same in both cases, but someone seeing this URL would have no clues about how to find information about anyone else in the Math Department, or in any other department for that matter.
Paul E. Hoffman, President
127 Segre Place
Santa Cruz, CA 95060 USA