Friday, March 21, 2008

HTML Entities

HyperText Markup Language (commonly known as HTML) is the language used to write documents for the web. It provides the markup needed to define the structure and content of a web document (or page). Each unit of this markup is formally called an HTML element. An element usually consists of an opening tag and a closing tag, which mark the beginning and the end of the element respectively. For example, the paragraph element, which marks up a paragraph of text, has the opening tag <p> and the closing tag </p>.

As you just saw, each HTML tag is enclosed within angle brackets, that is, between the two characters '<' and '>'. Since these two characters have a special meaning in an HTML document, you cannot simply type them where you want them to appear as ordinary text; the browser would try to read them as part of a tag. So what if you want to include these characters in the actual page that is shown in the browser? The answer is HTML entities.

HTML defines a set of codes called HTML entities, which stand in for characters that have a special meaning in the language when you need those characters to appear inside an HTML page. For example, if you want to include an HTML or XML code snippet in your document, then you have to use these entities.

An entity consists of three parts, all written together without any spaces. They are:

  • the ampersand character ( & )
  • an entity name, or a '#' followed by a character number
  • a semicolon

For example, the '<' character is written as '&lt;', where 'lt' is the name of the entity (the numeric form of the same character is '&#60;'). You might ask: if the ampersand character is used in all entities, how do I represent the ampersand character itself? Good question. The answer is that the ampersand has an entity of its own, written '&amp;'. It can get somewhat confusing at times. This type of mechanism is called an escape sequence in programmers' jargon.
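As a small illustration of this escaping, here is a sketch in Python (the language and the sample string are my own choices for the example, not something the rest of this article depends on). It uses the html.escape function from Python's standard library.

```python
import html

# A made-up sample string containing '<', '>' and '&'.
raw = '<p>Fish & chips</p>'

# html.escape replaces '&' first, then '<' and '>', so the result can be
# pasted into an HTML page and will show up as visible text.
# quote=False leaves quotation marks alone.
escaped = html.escape(raw, quote=False)

print(escaped)  # &lt;p&gt;Fish &amp; chips&lt;/p&gt;
```

Notice that the ampersand in the sample text itself became '&amp;', exactly as described above.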

To illustrate the use of entities, let's take the following simple snippet.

<html>
<head>
<title>Test Page</title>
</head>
<body>
<p>
Hello World!
</p>
</body>
</html>

To get <html> to appear like that in the browser, what I actually typed into the HTML code was &lt;html&gt;. Confused? It's not that difficult. This is what happens. When the browser reads an HTML document from beginning to end (yes, the browser reads it much as we do; technically this is called parsing), every &lt; that it comes across is replaced with the single character '<' before the page is displayed to human readers. Similarly, every other HTML entity is replaced with its corresponding display character.
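The replacement step can be imitated outside the browser. The sketch below uses html.unescape from Python's standard library as a stand-in for what the browser does with entities; it is only an approximation of that one step, not of how a browser parses a whole page.

```python
import html

# What the author types into the HTML source.
typed = '&lt;html&gt;'

# Roughly what the reader sees once each entity has been replaced.
shown = html.unescape(typed)
print(shown)  # <html>

# Each entity is replaced once only: '&amp;lt;' is displayed as the
# literal text '&lt;', not as '<'.
shown_twice_escaped = html.unescape('&amp;lt;')
print(shown_twice_escaped)  # &lt;
```

The second case is how you would write about an entity itself in a page: escape its ampersand, and the reader sees the entity text rather than the character it stands for.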

What's the use of HTML entities anyway? Well, as I said before, one use is when we want to display code snippets in our page. Then there are characters that are not available on every keyboard, such as '£' and '€', which we can insert using HTML entities. Here's a full list of HTML entities from the W3Schools site.
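For the two currency symbols just mentioned, the sketch below prints both the named and the numeric form of the entity. It assumes Python's standard html.entities table, which lists the entity names and their character numbers; the variable names are mine.

```python
import html
from html.entities import name2codepoint

forms = {}
for name in ('pound', 'euro'):
    number = name2codepoint[name]      # character number for this name
    named = '&' + name + ';'           # e.g. &pound;
    numeric = '&#' + str(number) + ';' # e.g. &#163;
    # Both forms stand for the same character.
    forms[name] = (named, numeric, html.unescape(named))
    print(named, numeric, html.unescape(numeric))
```

Either form can be typed into the HTML source; the browser displays the same symbol for both.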

A simple technique to replace entities
If you want to include a code snippet in one of your articles, you might have to replace many less-than and greater-than signs with the corresponding entities. This can be pretty cumbersome even for a few lines of code.

You can use a very basic text editor such as Notepad (on Windows) to do it easily. Here's how.
  1. Copy the code snippet into a new Notepad document
  2. Go to Edit -> Replace... (or Ctrl+H) and type '<' in the Find what box. Type '&lt;' in the Replace with box.
  3. Click Replace All
  4. Repeat this for every character that you want to replace (for example, '>' with '&gt;' and '"' with '&quot;'). If the snippet contains ampersands, replace '&' with '&amp;' first, before any of the other replacements; otherwise the ampersands in the entities you have just inserted will be replaced too.

The following figure illustrates this simple technique. You will find it very useful whenever you need to prepare a code snippet that includes special characters for embedding inside an HTML page.
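For readers who would rather use a script than Notepad, the same find-and-replace steps can be sketched in Python. The function name escape_snippet and the sample line are made up for this example. The ampersand is handled first so that the ampersands inside the newly inserted entities are not replaced a second time.

```python
def escape_snippet(code):
    """Replace special characters with entities, one pass per character."""
    # Order matters: '&' must be replaced before anything else.
    replacements = [
        ('&', '&amp;'),
        ('<', '&lt;'),
        ('>', '&gt;'),
        ('"', '&quot;'),
    ]
    for char, entity in replacements:
        # Equivalent to one "Replace All" in the editor.
        code = code.replace(char, entity)
    return code


# A made-up line of HTML to be shown as text in a page.
sample = '<a href="page?a=1&b=2">'
print(escape_snippet(sample))
# &lt;a href=&quot;page?a=1&amp;b=2&quot;&gt;
```

Python's standard library offers html.escape for the same job; the loop above is spelled out only to mirror the manual steps.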