May 09, 2013

HTML and CSS for E-book Self-Publishers, Part 2: HTML, XML, and Both

By Duane

Here in the second installment, as promised, we examine HTML and its relatives, the linguistic realm of content description.  This is what defines what this web page (and all others) says.  How it looks (fonts, colors, etc.) is more properly done with CSS, which we'll discuss next time.  But for now, we are looking at the information you want to present.  In the world of e-book publishing, this will be, of course, what you have written.

The Pieces

There are three languages you need to know about in order to sound like you know what you're talking about, but don't worry.  They're really all the same thing.  But it does help to be able to distinguish them, especially if you're having a problem and need to ask about it online.

HTML

This is how it all started.  The first web pages I ever wrote were in HTML 1.0, which, hopefully, you'll never see again.  To throw you right into the shark-infested waters, here is a sample taken from the original HTML specification from 1993:

<HTML>
<TITLE>
A sample HTML instance
</TITLE>
<H1>
An Example of Structure
</H1>
Here's a typical paragraph.
<P>
<UL>
<LI>
Item one has an
<A NAME="anchor">
anchor
</A>
<LI>
Here's item two.
</LI>
</UL>
</P>
</HTML>

I would like you to note the overall structure.  It consists of a number of elements (such as HTML), each delineated by a start tag (<HTML>) and a stop tag (</HTML>).  Within it are a TITLE element, a header (H1) element, and a paragraph (P) element.  The paragraph element then contains other elements.  There are some things you should observe:

  1. The start and end tags for any element are identical except that the end tag name starts with a slash, or oblique stroke.
  2. Elements can be nested.
  3. The outside element is always HTML, which is the document itself.

This early HTML let the programmer get away with a lot.  You didn't necessarily have to use a stop tag.  For example, the end of one paragraph could be indicated by the start of a new one.  Furthermore, some elements like IMG don't have much use for a stop tag because no text goes in them; everything is handled by tag attributes, which we'll discuss in greater detail under XML.  Frankly, you can't get away with much of that anymore, and having been a software engineer for a long time, I'll be the first to argue that it's a good thing.  There is a bigger learning curve up front, but you end up spending less time tracking down misbehavior later on.

XML

Only three years later, someone took the general idea of HTML and generalized it to XML (Extensible Markup Language).  XML uses the same structure to define any kind of data, even binary data like images, but does so with some restrictions.  It's not nearly as forgiving as the original HTML.  To reduce (I'd like to say 'eliminate' but you can never eliminate it all) programming foolishness, XML enforces additional syntax rules and provides specific document and interpretation elements.

If you were to go out and buy a book on XML, it would probably be three inches thick.  I've seen some of those.  But most of the XML features you would find therein, like XSLT, XPath, XQuery, and XML Schema, are ones that you will probably never have to worry about in an e-book, and this is the last you will hear about them.  I have used XSLT and XPath, but that was in some complicated Web programming.  I don't think you can even use them in an e-book.

XHTML

Now we've reached the present, and this is where the "Both" comes in.  XHTML is HTML that is XML-compliant.  Put another way, it is an XML application where the tags are the HTML tags.  This distinction is important because 99% of the time if someone says, "HTML," what he really means, if he knows anything at all about it, is "XHTML."  Nobody, but nobody, writes in the original HTML anymore, and if you tried to create an e-book with it, every piece of software on the way to publication would complain about it.  The EPUB editor Sigil, which we'll talk about in Part 4, won't let you save or view a file unless it validates correctly.

This is another point worth making about the difference.  The original HTML specification described how to represent both the content and how it was to be presented, though the options in those early years paled in comparison with today's.  The latter was accomplished through the use of element attributes (more about those in a few paragraphs) such as cellspacing to affect the appearance of tables.  For backward compatibility, those features still exist (though are being phased out with HTML 5.0 and eventually won't work anymore), but it is bad form to use them; it is much better to control appearance with CSS.  A simple example should explain one reason why.  Suppose you have 27,000 paragraphs in 12-point type and you decide it needs to be 11-point.  Which would you rather do: change 27,000 paragraphs of HTML or 1 line of CSS?  Your call.

Now, on to business.

What you need to know about XML

As I said above, not much.  Most of the advanced features are useless here, though at some point there will eventually be a brief mention of Document Type Declarations.  Briefly, here are the rules for XML.

1. XML documents consist of elements delineated by a start tag and a stop tag

Exactly like the HTML example above.  Here's something you are quite likely to see:

<rainbow>(Some text goes here.)</rainbow>

The text really has nothing to do with rainbows, but it does demonstrate that in XML, you can name elements anything you want, as long as you follow certain naming rules.  Since this series deals with e-books, you'll be using HTML element names, so you'll never have to make up any of your own.  So there is no point in my describing the rules.  If you're interested in more, there will be a link at the end of this section.  Notice again that the start tag and stop tag have the same name enclosed in angle brackets, but that the stop tag name starts with a slash, just like in old HTML.  So far, so easy.

2. Elements can be nested

Like this:

<sun><rainbow>(Some text goes here.)</rainbow>More text here</sun>

Elements can be nested as deeply as you need to go, and combined arbitrarily with text.  A trivial example might be introducing one boldface in a paragraph:

<p>This is a <b>boldface</b> word.</p>

If you're paying attention, you might have noticed that since the angle brackets identify tags, you will have a problem if there is an angle bracket in your text.  Don't worry; this base is covered.  We'll talk about special characters below.

3. Every element that is started must be stopped

Put another way, every start tag must have a matching stop tag, and every stop tag must have a matching start tag.  I suppose this makes sense.  However, every now and then you might run into a tag that doesn't have anything in it.  In fact, if you hack your own e-books, you will run into them.  A common example in HTML (remember, really XHTML!) is the break element that goes to the next line:

The end of one line...<br></br>...the start of the next

This is fine, but there is a shorter way.  Use one tag, with the name, but put a slash at the end:

The end of one line...<br/>...the start of the next

When you do this, the tag is both the start and stop tag.  Yes, there is still a matching tag, but it matches itself.  Just a little quicker and easier to read.
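
If you'd like to convince yourself that the two spellings really are the same thing, here is a minimal sketch.  It assumes Python's standard xml.etree.ElementTree parser as a stand-in for whatever your e-book tools use, and it wraps each line in a p element so that each snippet has one outer element.

```python
import xml.etree.ElementTree as ET

# Two spellings of the same empty element.
long_form = "<p>The end of one line...<br></br>...the start of the next</p>"
short_form = "<p>The end of one line...<br/>...the start of the next</p>"

long_tree = ET.fromstring(long_form)
short_tree = ET.fromstring(short_form)

# If the parser treats both spellings alike, the two trees
# serialize to exactly the same bytes.
same = ET.tostring(long_tree) == ET.tostring(short_tree)
print(same)  # True
```

Once parsed, nothing remembers which spelling you typed, so use whichever you find easier to read.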

4. Elements must be properly nested

In other words, if you start an element within an existing element, it must also end within that element.  Here is another example from HTML:

<p>The last word in this paragraph is in <i>italics</i>.</p>

That is perfectly correct.  However, the following is wrong because the nesting is improper.

<p>The last word in this paragraph is in <i>italics.</p></i>

Many early browsers were forgiving of this sort of thing, and would work anyway.  I can't speak as to newer browsers or e-books, because I haven't fed them bad XML, at least for a very long time.  As I mentioned above, any competent XML or e-book editor will complain about errors like these.
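
You don't need a browser to find out whether a snippet is properly nested.  The following is a rough sketch, assuming Python's standard xml.etree.ElementTree parser; the helper name is_well_formed is my own invention for illustration, not part of any library.

```python
import xml.etree.ElementTree as ET

def is_well_formed(snippet: str) -> bool:
    """Return True if the parser accepts the snippet as well-formed XML."""
    try:
        ET.fromstring(snippet)
        return True
    except ET.ParseError:
        return False

good = "<p>The last word in this paragraph is in <i>italics</i>.</p>"
bad = "<p>The last word in this paragraph is in <i>italics.</p></i>"

print(is_well_formed(good))  # True
print(is_well_formed(bad))   # False
```

A strict XML parser simply refuses the second snippet; it makes no attempt to guess what you meant.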

5. Start tags and some special tags can have attributes

For example:

<rainbow colors="7" radius="100">Eat Skittles!</rainbow>

Again, in XML the attributes can be anything you want as long as they follow naming conventions, but using the same argument as above, I won't explain what those conventions are.  They're defined for you in HTML, and I'll show you what many of them are when we get to that point.
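
To show what a program actually sees when it reads that tag, here is a small sketch using Python's standard xml.etree.ElementTree module (my choice for illustration; any XML parser would tell the same story).

```python
import xml.etree.ElementTree as ET

rainbow = ET.fromstring('<rainbow colors="7" radius="100">Eat Skittles!</rainbow>')

print(rainbow.tag)            # rainbow
print(rainbow.attrib)         # {'colors': '7', 'radius': '100'}
print(rainbow.get("radius"))  # 100 (a string, not a number)
print(rainbow.text)           # Eat Skittles!
```

Note that attribute values always arrive as text, which is why they are always written in quotes.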

6. XML is case-sensitive

<rainbow>Text</RAINBOW>

is wrong because the tag names don't match.  They're different cases.  The original HTML didn't care.  Browsers that are otherwise XHTML-compliant may not care, but I wouldn't take the chance, and I don't want to experiment to find out if e-book readers care, but they should.  Anyway, it is an informal convention that everything is in lower case, and that is what I will assume from now on.
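
I can at least show what a strict XML parser makes of the mismatched example.  This sketch assumes Python's standard xml.etree.ElementTree; it says nothing about how any particular browser or e-reader behaves.

```python
import xml.etree.ElementTree as ET

snippet = "<rainbow>Text</RAINBOW>"

try:
    ET.fromstring(snippet)
    error_message = None
except ET.ParseError as err:
    # The parser rejects the snippet and reports where it gave up.
    error_message = str(err)

print(error_message)

# With matching lower-case names, the same content parses without complaint.
matching = ET.fromstring("<rainbow>Text</rainbow>")
print(matching.tag)  # rainbow
```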

7. Every document starts with an XML declaration

It'll look something like this:

<?xml version="1.0" encoding="utf-8"?>

Note that this is a special tag, neither a start nor stop tag, and you'll only see it here.  It is distinguished by the question marks, and just says, "This is an XML document."

The attributes here deserve some remark: don't worry too much about either.  Realistically, almost everything you run into will be version 1.0; there's no need to use anything higher.  And as for encoding, I'd stick with UTF-8, which you don't really need to put in because it is the default.  UTF-8 covers every language you need to write in except Klingon and Elfin.  (If you need to write in those, I might include a how-to in Part 5.)
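
As a quick illustration of a declaration doing its job, here is a sketch that assumes Python's standard xml.etree.ElementTree parser.  The note element and its text are made up for the example.

```python
import xml.etree.ElementTree as ET

# A tiny document, stored as UTF-8 bytes the way a file on disk would be.
document = (
    '<?xml version="1.0" encoding="utf-8"?>\n'
    "<note>Café, Москва, 北京</note>"
).encode("utf-8")

note = ET.fromstring(document)
print(note.text)  # Café, Москва, 北京
```

Accented Latin, Cyrillic, and Chinese characters all come through untouched, with no special notation required.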

8. There is special notation for special characters

For characters that have some special meaning in XML, there is still a way to put them in.  They are called "escape sequences" or, more properly, "character entity references."  They start with an ampersand (&), end with a semicolon, and have some number of characters in between.  If you need an ampersand, it's &amp;.  Two I use frequently are the em dash (&mdash;) and the non-breaking space (&nbsp;).  The angle brackets are &lt; and &gt;.  There are two or three hundred of these defined for HTML, so it makes no sense to list them here when I can just point you to the table on Wikipedia.

No, you don't have to memorize them all.  As a matter of fact, you may never need to use most of them, because this list was compiled for browsers before UTF-8 was a player and they needed an allowance for accented characters in languages other than English that use the Roman alphabet.  Now that we have UTF-8, those characters can be coded directly, along with Cyrillic, Chinese, etc., so those special characters aren't really so special anymore.  The ones you'll need are the ones that have special meaning in XML.  I've listed three above; the other two are &quot; (") and &apos; (').
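
If you ever need to convert text by program rather than by hand, the five special characters can be handled in a few lines.  This is only a sketch, assuming Python's standard xml.sax.saxutils module; the QUOTES table is something I added so that the two quote characters are covered too, since escape() handles only the ampersand and the angle brackets on its own.

```python
from xml.sax.saxutils import escape, unescape

# escape() covers &, < and > by default; the quotes are an optional extra.
QUOTES = {'"': "&quot;", "'": "&apos;"}

raw = 'Tom & Jerry said "2 < 3"'
escaped = escape(raw, QUOTES)
print(escaped)  # Tom &amp; Jerry said &quot;2 &lt; 3&quot;

# Going the other way needs the same table, reversed.
restored = unescape(escaped, {entity: char for char, entity in QUOTES.items()})
print(restored == raw)  # True
```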

9. Every XML document has an outer element that encloses everything

In other words the whole document is one big element with everything else properly stuffed inside.  This is correct:

<xml>
Everything you ever wanted in the document
</xml>

This is wrong because there is more than one outer-level element:

<root1>
Some of the stuff
</root1>
<root2>
More of the stuff
</root2>

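It doesn't matter what that outer-level element is named.  (One caveat: the XML specification reserves names that begin with "xml", like the one in my first example, so choose something else in a real document.)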
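
Here is a short sketch of the same rule in action, assuming Python's standard xml.etree.ElementTree parser.  The wrapper name book is arbitrary and chosen only for the example.

```python
import xml.etree.ElementTree as ET

two_roots = "<root1>Some of the stuff</root1><root2>More of the stuff</root2>"

# Two outer-level elements: the parser refuses the document.
try:
    ET.fromstring(two_roots)
    two_roots_ok = True
except ET.ParseError:
    two_roots_ok = False
print(two_roots_ok)  # False

# Wrap both in a single outer element and the problem goes away.
wrapped = ET.fromstring("<book>" + two_roots + "</book>")
child_names = [child.tag for child in wrapped]
print(child_names)  # ['root1', 'root2']
```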

 

There! You might be catching your breath by now, but don't worry.  If you do this for a while, it'll all fit together and it will be so natural that you'll wonder how anyone could think it could be any other way.  I've only presented maybe 0.1% of all there is to know about XML (probably less), but this is 99.9% (probably 100%) of all you'll ever need to know for e-books.  It's just enough so that you can understand XHTML.

But I did promise a reference if you want to know more.  Here it is.

What you need to know about XHTML

Or, colloquially, HTML.  Now that the basics of XML are under your belt, we can move on to elements that actually go into a web page.  Again, I won't show you everything; just what you're likely to need for an e-book.  To that end, I'm going to skip forms, buttons, iframes, embedded objects, tables ... the list goes on.  Actually, tables might be useful, but not for this series; I'm thinking fiction here, and they're rarely found in fiction.

Universal attributes

Before I get into what those elements are, this is a good place to identify some attributes that apply to all of them.  There are others, but once again you won't need to know them until you are an HTML/e-book Grand Master.  These three are common, and all three have direct application to formatting, which is the whole purpose of this series.  Unfortunately, this installment won't give you much information about how to use them, because that comes under CSS in Part 3.

id

This attribute assigns a name that is used internally to the page or document.  There is another attribute actually called name, but that one is used externally and is one that we won't need for e-books.  This attribute is useful because it is possible to assign styles to specific elements by id, though I almost never do it.  It seems clumsy.  If you use it, each identifier must be unique on any given page.
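
The uniqueness rule is easy to break by copy and paste, so here is a rough sketch of how you might check a page for repeated identifiers.  It assumes Python's standard library, and the sample page and its id values are invented for the example.

```python
import xml.etree.ElementTree as ET
from collections import Counter

page = """<body>
<h2 id="chapter-3">Chapter Three</h2>
<p id="opening">First paragraph.</p>
<p id="opening">Second paragraph, reusing an id by mistake.</p>
</body>"""

root = ET.fromstring(page)

# Count how often each id value appears anywhere in the page.
id_counts = Counter(
    elem.get("id") for elem in root.iter() if elem.get("id") is not None
)
duplicates = [name for name, count in id_counts.items() if count > 1]
print(duplicates)  # ['opening']
```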

class

OK, folks, this is the big one.  If you only learn one of these three, this is the one you want.  I'm forced to say a little about CSS at this point so you know what I'm talking about.  CSS defines a large number of properties like color, font-size, text-align, you get the idea.  These are in turn grouped into collections called classes.  So by applying a specific CSS class to an element, you apply all of those properties in one smooth operation.  The class attribute specifies the name of that class.  Once you get all this figured out, you can cleverly apply more than one class by separating them with spaces.

style

It might be that you have a perfect paragraph style defined that you always want to use, but here or there you want to override one of the properties.  Let's say you just love 42-point Comic Sans in bright red (excuse me while I gag!), but for emphasis somewhere you want one paragraph to be blue.  You could write:

<p class="uglyred" style="color: blue;">This is a blue paragraph.</p>


Everything else in uglyred will stay the same, but the color will be changed to blue for that paragraph only.  Although you can do this, and although I have, rarely, it's still not a very good idea except in extreme cases, as it's dragging you away from the reason for using CSS in the first place.  If you want to change blue to hot pink, you have to go through your document and find every place you did it.  More about this next time.

Essential HTML Elements

Well, now that that's done, take a break, have some tea and biscuits, and come back when you're ready.

This will be the intermission.

Back already?  Let's get on with it, then.  The ones you need to know:

<html>

This is simply the outer-level element that contains everything else.  Remember that a compliant XML document has one.

<head>

This element contains information that the browser or e-book reader needs to have in order to display the page correctly, or other information such as clues for search engines.  For practical e-books, there is only one sub-element that I'm sure  you'll need, but upon writing this paragraph, I wonder if search engines go so far as to index e-books.  I'll have to look into that.

<link>

This is the one sub-element of head that you need to know.  It specifies how to get to other resources necessary to process the page correctly, and the one resource you will need is your CSS style sheet.  The link element from my example in Part 1 looked like:

<link href="../Styles/Style0001.css" rel="stylesheet" type="text/css" />

You need all three of the attributes shown here.  rel specifies the relationship between the two documents, and type specifies what the other document is.  Practically speaking, for CSS references, they'll always be the values shown here.  The remaining one, href, specifies where to find it.  The path in this case is one within an EPUB  file, and you'll need to know how to figure this out.  More about that in Part 4.  You might have more than one CSS stylesheet link once you start getting fancy.

<body>

Right after <head>, you'll find <body>.  This is where the actual displayed content resides.  Let's take a moment at this point to review the structure so far so you can start to see the big picture.  What we have so far is:

<html>
<head>
<link href="someplace" rel="stylesheet" type="text/css" />
</head>
<body>
...everything you want to say...
</body>
</html>
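
For the curious, this is roughly how a program sees that skeleton.  It is a sketch only, assuming Python's standard xml.etree.ElementTree parser, with the body text wrapped in a p element for tidiness.

```python
import xml.etree.ElementTree as ET

skeleton = """<html>
<head>
<link href="someplace" rel="stylesheet" type="text/css" />
</head>
<body>
<p>...everything you want to say...</p>
</body>
</html>"""

html = ET.fromstring(skeleton)

# The outer element has exactly two children, in this order.
sections = [child.tag for child in html]
print(sections)  # ['head', 'body']

# Collect the href of every stylesheet link found in the document.
stylesheets = [
    link.get("href") for link in html.iter("link") if link.get("rel") == "stylesheet"
]
print(stylesheets)  # ['someplace']
```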

Now, onward...

 <p>

This may be your most frequently used element: the paragraph.

<hn>

Hah!  I tricked you!  There is no <hn> element.  There are, however, six elements h1 through h6, with h1 being the highest outline level and h6 the lowest.  These are header elements that are used to separate a document into sections.  For example, the title "Essential HTML Elements" above is h3.  My chapter titles in e-books are almost always h2.  Browsers and readers have built-in styles for them that are generally bigger, bolder, but not necessarily better than ordinary paragraph text.

There is nothing particular that you can do with these elements that you can't do with a paragraph element, but they serve the additional purpose of defining document structure.  Editors like Sigil use them to automatically generate a table of contents, and believe me, it saves a lot of time over doing it by hand.  Use them.
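
To give a feel for how a table of contents can fall out of the header elements, here is a minimal sketch.  It is my own illustration using Python's standard library, not Sigil's actual method, and the sample page and the outline function are hypothetical.

```python
import xml.etree.ElementTree as ET

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

page = """<body>
<h1>My Novel</h1>
<h2>Chapter One</h2>
<p>Text...</p>
<h3>A Scene Break</h3>
<h2>Chapter Two</h2>
</body>"""

def outline(body):
    """Return (level, title) pairs for every header, in document order."""
    entries = []
    for elem in body.iter():
        if elem.tag in HEADINGS:
            level = int(elem.tag[1])                  # "h2" -> 2
            title = "".join(elem.itertext()).strip()  # include nested text
            entries.append((level, title))
    return entries

# Print the outline, indenting each entry by its level.
for level, title in outline(ET.fromstring(page)):
    print("  " * (level - 1) + title)
```

Had the chapter titles been ordinary paragraphs with big bold styling, there would be nothing here for a program to find.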

 <img>

You will use the image element any time you want, well, images in your e-book.  Going back to the original version of my drop cap example, I embedded that big B with:

<img class="dropcap" src="../Images/drop-cap-b.jpg" />

The src attribute tells where to find the image, and again, that detail depends on the e-book format, in this case EPUB.

 <a>

The a stands for anchor, which is weird because neither of the two uses for it seems to be an anchor, given my understanding of the word.  Yes, I said two uses.  It does two different but related things.

The first, and much more common, is to create a clickable link.  This happens when you include the href attribute:

<a href="http://www.google.com">Google</a>

That displays the word Google as a clickable link to Google's web site.  In an e-book, of course, the link will more likely point to something internal, such as the start of a chapter.

The other use is to define a place to link to within a page. That happens when you use the name attribute instead of href.  (What happens if you use both, I don't know; I've never tried it.)

<a name="jumphere">Here</a>

In fact, I put such an anchor on the "Essential HTML Elements" header above, so you can go there by clicking here.  Or, if you just hover over the link your browser will probably show you where it's headed.  In Firefox, it's in the lower left corner.  Have a look.  The hash (#) is the important character.  Everything before that tells the browser what page to go to, and everything after that tells where within the page to go.

Now here is where I have to tell you that you might have to forget about the second use of the anchor tag.  It has been made obsolete in HTML5.  "Well," you ask, "why did you even bring it up?"  Because I don't know the current state of all e-readers in the world, and a lot of them won't know about HTML5 yet.  Besides, I'm sure that they will remain backward-compatible for a long time, and that the anchor target will still work for years.

So what has HTML5 replaced it with?  Something better, actually.  Instead of having to put in an explicit anchor to jump to, you can now jump to any element by its id.  Cool, huh?
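
Since a link of this kind only works if the id it names actually exists, a quick check can save some embarrassment.  The following is a sketch assuming Python's standard library; the page, the id values, and the deliberately broken link are all invented for the example.

```python
import xml.etree.ElementTree as ET

page = """<body>
<p><a href="#chapter-3">Go to Chapter Three</a></p>
<p><a href="#missing">A broken link</a></p>
<h2 id="chapter-3">Chapter Three</h2>
</body>"""

root = ET.fromstring(page)

# Every id on the page is a place a link could jump to.
ids = {elem.get("id") for elem in root.iter() if elem.get("id")}

# Links within the same page start with the hash; drop it to get the id.
targets = [
    a.get("href")[1:] for a in root.iter("a") if a.get("href", "").startswith("#")
]

broken = [target for target in targets if target not in ids]
print(broken)  # ['missing']
```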

<b> and <i>

These are simply boldface and italic, used fairly often.  You'll see some editors, if you mark text as bold, put in strong tags, and for italics, em tags.  These mean strong and emphasis, but virtually every browser and e-reader renders them as boldface and italic.  I rarely use them directly because they are longer to type.  However, there is a good argument for using strong and em, so perhaps those editors are doing it right.  The argument is that if you want to change their meaning, such as em meaning red text instead of italic text, it is easy to make that change in CSS without redefining b and i.

<br>

I mentioned this earlier.  It starts a new line.  Usually written <br/> instead of <br></br>.

<hr>

This is a horizontal rule that by default just puts a line across the page.  There are all kinds of cool formatting things you can do with it, though, once we get into CSS.  Usually written <hr/>.

<span>

I saved this one for last, frankly, because I wasn't sure how to describe it.  Really explaining it gets into the difference between block and inline display elements, and hierarchy, and I just didn't want to go there.  But neither did I want to leave it out, because it can be so useful.  So I decided just to explain what to do with it.

You can put a span element around anything in a web page, and by itself it doesn't do a thing to the visual presentation.  However, one use for it that makes sense in e-books is to apply a property to whatever it encloses, like this:

<p>We need one <span style="color: blue;">blue</span> in this paragraph</p>

All span does here is apply the blue property to the word blue.  You have to be careful doing this, because if you have other elements within the span, their properties may override what you set in the span element and give you unexpected results.  Unexpected, but correct.

I also ended up using it for my drop cap:

<span class="dropcap-b">&nbsp;</span>

But exactly how that works is a topic to broach later.

 

Again, there are just short of a zillion other elements out there, some of which just don't apply to e-books, and some of which you'll never need unless you're doing something exotic like embedding right-to-left Hebrew in left-to-right English.  (Hmmm.  I'm not even sure e-readers support that!)  But just in case you want to, here is where to go for the gory details.

I'll leave you with the entire text of my example from Part 1 in case you want to ponder it for a while.  Don't worry about the xmlns attribute; that's something that Sigil put in automatically and has to do with namespaces.  You don't need it.  Don't worry about the !DOCTYPE declaration, either.  We'll cover that in Part 4.

<?xml version="1.0" encoding="utf-8" standalone="no"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title></title>
<link href="../Styles/Style0001.css" rel="stylesheet" type="text/css" />
</head>
<body>
<h2><span class="chapter">Chapter Three</span><br />
The Angel Gate</h2>
<p><span class="dropcap-b">&nbsp;</span>ekka, still barefoot, slid to a stop on the chilly
flagstone just inside the open main gate. The breeze — more than a breeze, really — coerced
her hair and gown into an urgent flutter around her. She stared gaping into the courtyard.
Something had just happened, something mysterious, something dark and dangerous. As if sensing
her hesitation, one final gust extinguished both torches in the castle's entryway, blowing soot
into her eyes. The elders would have to know about this!</p>
<p>"Laurie!" she squealed, trying to rub the sting from her eyes. "Laurie! Come quick!"<br /></p>
</body>
</html>

OK.  Give yourself a few days to digest this, and I'll follow up with Cascading Style Sheets.
