Some characters cannot be part of a URL (for example, the space), and others have a special meaning in a URL. In HTML forms, for instance, the character = separates a name from a value. The URI generic syntax deals with this problem through percent-encoding (also called URL encoding), while HTML forms make some additional substitutions rather than applying percent-encoding to all such characters.
For example, spaces in a string are either encoded with %20 or replaced with the plus sign (+). If you use a pipe character (|) as a separator, be sure to encode the pipe as %7C. A comma in a string should be encoded as %2C.
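The substitutions above can be reproduced with Python's standard urllib.parse module. This is a minimal sketch; the sample strings are made up for illustration.

```python
from urllib.parse import quote, quote_plus

# quote() percent-encodes the space; quote_plus() uses the form-style plus sign.
space_percent = quote("New York")       # 'New%20York'
space_plus = quote_plus("New York")     # 'New+York'

# safe="" asks quote() to escape every character except letters,
# digits and the unreserved marks "_.-~".
pipe_encoded = quote("a|b", safe="")    # 'a%7Cb'
comma_encoded = quote("a,b", safe="")   # 'a%2Cb'

print(space_percent, space_plus, pipe_encoded, comma_encoded)
```

Whether %20 or + is the right choice for a space depends on where the string ends up: the plus sign is a form and query-string convention, while %20 is safe in any URL component.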
Use your platform's standard URL-building libraries to encode your URLs automatically; this ensures the URLs are properly escaped for your platform. Additionally, URLs are limited to 2048 characters for each request, so keep this limit in mind when constructing your URLs.
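As one possible illustration of leaning on a standard library, the sketch below builds a query string with urllib.parse.urlencode and checks the result against the 2048-character limit mentioned above. The helper name build_url, the example.com endpoint, and the parameter names are hypothetical.

```python
from urllib.parse import urlencode

MAX_URL_LENGTH = 2048  # per-request limit stated in the text above


def build_url(base_url, params):
    """Return base_url plus an encoded query string, or raise if it is too long."""
    # urlencode() escapes each name and value (spaces become '+').
    url = f"{base_url}?{urlencode(params)}"
    if len(url) > MAX_URL_LENGTH:
        raise ValueError(f"URL is {len(url)} characters; limit is {MAX_URL_LENGTH}")
    return url


# Hypothetical endpoint and parameters, for illustration only.
url = build_url("https://example.com/search", {"q": "coffee shops", "tags": "a|b,c"})
print(url)  # https://example.com/search?q=coffee+shops&tags=a%7Cb%2Cc
```

Checking the length after encoding matters because each escaped character grows from one character to three.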
Building a Valid URL
You may think that a "valid" URL is self-evident, but that's not quite the case. A URL entered within an address bar in a browser, for example, may contain special characters (e.g. "上海+中國"); the browser needs to internally translate those characters into a different encoding before transmission. By the same token, any code that generates or accepts UTF-8 input might treat URLs with UTF-8 characters as "valid", but would also need to translate those characters before sending them out to a web server. This process is called URL-encoding or percent-encoding.
Special characters
We need to translate special characters because all URLs need to conform to the syntax specified by the Uniform Resource Identifier specification (RFC 3986). In effect, this means that URLs must contain only a special subset of ASCII characters: the familiar alphanumeric symbols, a few unreserved punctuation marks, and some reserved characters for use as control characters within URLs. The table below summarizes these characters:
Set | Characters | URL usage |
Alphanumeric | a b c d e f g h i j k l m n o p q r s t u v w x y z A B C D E F G H I J K L M N O P Q R S T U V W X Y Z 0 1 2 3 4 5 6 7 8 9 | Text strings, scheme usage (http), port (8080), etc. |
Unreserved | - _ . ~ | Text strings |
Reserved | ! * ' ( ) ; : @ & = + $ , / ? % # [ ] | Control characters and/or Text Strings |
When building a valid URL, you must ensure that it contains only those characters shown above. Conforming a URL to use this set of characters generally leads to two issues, one of omission and one of substitution:
- Characters that you wish to handle exist outside of the above set. For example, non-ASCII characters such as 上海+中國 need to be encoded using the above characters. By popular convention, spaces (which are not allowed within URLs) are often represented using the plus '+' character.
- Characters exist within the above set as reserved characters, but need to be used literally. For example, ? is used within URLs to indicate the beginning of the query string; if you wish to use the string "? and the Mysterians", you'd need to encode the '?' character.
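The first of these issues can be detected mechanically. The sketch below copies the three character sets from the table and reports any character that falls outside them; the function name disallowed_characters and the sample URL are hypothetical.

```python
import string

# Character sets copied from the table above.
ALPHANUMERIC = set(string.ascii_letters + string.digits)
UNRESERVED = set("-_.~")
RESERVED = set("!*'();:@&=+$,/?%#[]")
ALLOWED = ALPHANUMERIC | UNRESERVED | RESERVED


def disallowed_characters(url):
    """Return the characters in url outside the table, in order of first appearance."""
    seen = []
    for ch in url:
        if ch not in ALLOWED and ch not in seen:
            seen.append(ch)
    return seen


print(disallowed_characters("https://example.com/?q=上海 中國"))
# ['上', '海', ' ', '中', '國']
```

A check like this cannot catch the second issue: a reserved character that is meant literally looks identical to one used as a control character, so only the code that assembles the URL knows which values need escaping.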
Each character to be URL-encoded is first converted to its UTF-8 byte sequence, and each byte is then written as a '%' character followed by its two-digit hexadecimal value. For example, 上海+中國 in UTF-8 would be URL-encoded as %E4%B8%8A%E6%B5%B7%2B%E4%B8%AD%E5%9C%8B. The string ? and the Mysterians would be URL-encoded as %3F+and+the+Mysterians.
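Both examples can be reproduced with Python's urllib.parse, shown here as a sketch rather than as the only way to do it. Using quote_plus for the second string is a choice made to match the plus-for-space convention in the expected output.

```python
from urllib.parse import quote, quote_plus, unquote_plus

city = "上海+中國"
band = "? and the Mysterians"

# Each non-ASCII character becomes several bytes in UTF-8,
# and every byte is written as one %XX escape.
first_char_bytes = "上".encode("utf-8")  # b'\xe4\xb8\x8a'

# safe="" makes sure the literal '+' is escaped as %2B.
encoded_city = quote(city, safe="")
# quote_plus() escapes '?' and turns spaces into '+'.
encoded_band = quote_plus(band)

print(encoded_city)  # %E4%B8%8A%E6%B5%B7%2B%E4%B8%AD%E5%9C%8B
print(encoded_band)  # %3F+and+the+Mysterians

# Decoding restores the original strings.
assert unquote_plus(encoded_city) == city
assert unquote_plus(encoded_band) == band
```

The literal plus sign in the first string has to become %2B; otherwise a decoder that follows the plus-for-space convention would read it back as a space.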