Tuesday, July 10, 2012

Using HTML as the Media Type for your API

There is an ongoing (and interesting) discussion on the API-craft mailing list revolving around designing new media types for enabling hypermedia APIs primarily for programmatic consumption. As some folks may know, I like to use HTML as the media type for my hypermedia APIs. Steven Willmott opined:

I think the problem isn't "why not HTML" it's "why HTML" - if you strip out all the parts of HTML which are to do with rendering things for presentation you're left with almost nothing at all:
  • <a>
  • <h1>, <h2>, <h3> ... (as nesting)
  • <ul><li>
  • <ol><li>
  • <p> maybe (as a kind of separator) - or <div> ...
and even some of these are marginal. There is useful stuff around encodings, meta-data etc. but pretty much everything else is redundant.

I thought this raised such an interesting implicit question, and I get asked about this enough that I thought it warranted a longer response. There are actually a variety of reasons I prefer using HTML:

  • rich semantics
  • hypermedia support
  • already standardized
  • tooling support

Rich Semantics

I've heard many folks say that HTML is primarily for presentation and not for conveying information, and hence it isn't suitable for API use. Hogwash, I say! There are many web experts (like Kimberly Blessing) who would insist that markup is exactly for conveying semantics and that presentation should be a CSS concern. People seem to forget that web sites actually worked before CSS or Javascript was invented! I rely on this heavily for my HTML APIs.

Now don't get me wrong--I'm not advocating a return to 1995; Javascript, CSS, and HTML advances have clearly afforded richer user experiences. But that doesn't mean your HTML API needs to serve up or depend on CSS or Javascript, any more than clients necessarily need to execute them. Just because the media type can express things you don't need or want doesn't make it a bad media type for your use--this is confusing the content for the media type.

So let's get to specifics. From a semantic and informational point of view, there are whole segments of the HTML spec that I've found useful for expressing data structures. We obviously have lists (<ol>), bags (<ul>), and maps (<dl>). Raw XML doesn't have any of these, and JSON can't distinguish lists from bags and is constrained to use strings as map keys. We get encapsulation or grouping via ancestor inclusion or explicitly with <div>. We get 2-dimensional data layouts via <table>, and in fact something even more general than a 2-dimensional array via @colspan and @rowspan.

But more powerfully, with the <a> tag, we have the ability to represent arbitrary data structures, even circular ones (which tree-structured media types like XML or JSON cannot represent). In fact, we can even represent distributed data structures (which, arguably, is what the Web as we know it is--a giant distributed data structure). This is amazingly powerful, and for comparable expressiveness in a different media type, you'll have to define conventions for all these things.
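To make the data-structure claim concrete, here is a minimal sketch of a client reading a list, a bag, a map, and a link out of one fragment. The payload, the class names, and the /people/2 URL are invented for illustration, and the markup is assumed to be well-formed enough for an XML parser.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical payload; class names and the /people/2 URL are made up.
DOC = """
<div class="person">
  <ol class="steps"><li>wash</li><li>rinse</li><li>repeat</li></ol>
  <ul class="tags"><li>red</li><li>blue</li><li>red</li></ul>
  <dl class="limits"><dt>max</dt><dd>10</dd><dt>min</dt><dd>1</dd></dl>
  <a rel="friend" href="/people/2">Homer</a>
</div>
"""

root = ET.fromstring(DOC)

# <ol>: order matters, so keep a list.
steps = [li.text for li in root.find("ol")]

# <ul>: order does not matter, so count occurrences (a bag / multiset).
tags = Counter(li.text for li in root.find("ul"))

# <dl>: pair each <dt> with the <dd> that follows it (assumes strict alternation).
dl = root.find("dl")
limits = dict(zip((dt.text for dt in dl.findall("dt")),
                  (dd.text for dd in dl.findall("dd"))))

# <a>: a reference to another resource, which may itself link back here.
links = {a.get("rel"): a.get("href") for a in root.iter("a")}
```

Because the link is a reference rather than a nested value, the resource at /people/2 is free to link back to this one, which is how cycles become expressible.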

Now, let me just take a run through the HTML5 spec and identify which elements are useful or not useful from an API point of view:

<html>
required, so moot
<head>
useful for overall representation metadata, especially via <link> and <meta>
<title>
if you have a string that could be construed as a name for the whole representation, why not put it here?
<base>
useful for unambiguously supporting relative links
<link>
one of the key hypermedia controls, see the "Hypermedia Support" section below
<meta>
useful for arbitrary data annotations
<style>
Okay, I'll grant that this one is not as important for machine-to-machine (m2m) consumption, but it comes into play under the "Tooling Support" section below.
<script> and <noscript>
Arguably, this is useful for implementing code-on-demand, but I'll grant that my current m2m use cases aren't this advanced yet.
<body>
necessary for separating metadata in <head> from actual data
<section>, <article>, <aside>, <h1>-<h6>, <hgroup>, <header>, <footer>, <blockquote>
These are primarily useful for describing content meant for human consumption, and while I have not had cause to use these myself, they would clearly have an important part to play if data payloads had this structure, e.g., in the API for a content management system (CMS). That said, I'm happy to lump these into "not useful" for the sake of argument.
<nav>
For m2m, I'm not sure there's much benefit to this over <link>s in the <head>, although there's room for more expressiveness here. Let's say, YAGNI here.
<address>
If you have data that's an address, why not mark it as such? Seems like a not-that-unusual circumstance.
<p>, <pre>, <span>
These are fine containers for arbitrary string data with slightly different semantics, particularly around whether whitespace is significant or not, and whether content may reasonably be reflowed when presented in a UI. They also offer the ability to have rich content if desired.
<ol>, <ul>, <dl>, <li>, <dt>, <dd>, <div>
As mentioned above, necessary for representing data structures.
<figure>, <figcaption>
Arguably not needed for m2m interactions.
Text-level semantics like <i>, <b>, etc.
Not useful immediately for m2m interactions, but rather to allow rich payloads. Arguably, a JSON-based media type could carry HTML markup in its strings, but then there is an impact on tooling and visibility, which we'll discuss in tooling support below.
<img>
I've seen many APIs that send around links to thumbnails, for example. Clearly useful.
<iframe>, <embed>, <object>, <canvas>, etc.
Similar to the discussion of <script> above, our m2m interactions are not advanced enough to take advantage of these (yet).
<audio>, <video>
Similar to images, allows for discussion of multimedia as first-class objects.
<form> et al.
Perhaps the single biggest reason to use HTML is its support for parameterized navigation via forms. See "Hypermedia Support" below.

Looking back across this list, sure, there's a lot of things that might not be immediately useful, but there's actually quite a large portion of HTML that offers semantics I'd immediately find useful in a programmatic API. We basically get to reap the benefit of many years of evolution in HTML, where its expressive power has grown and been refined over the years. You'll end up repeating most of the HTML standardization process to get a new media type up to the same level of expressiveness.

On top of that, however, are facilities for describing application-domain specific semantics, namely through the use of microdata and/or RDFa--all the "semantic web" stuff. I don't have to create a new semantic ontology for my application domain; I can leverage and/or enrich my markup with Dublin Core or Schema.org.
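As a rough illustration of reading such annotations, the sketch below collects the text of elements carrying an itemprop attribute. It is a deliberate simplification, not the full microdata extraction algorithm: it ignores nested items and attribute-valued properties, and the sample markup is invented (givenName and familyName are Schema.org Person properties).

```python
from html.parser import HTMLParser

# Elements with no end tag; skipped so the open-element stack stays balanced.
VOID = {"br", "hr", "img", "input", "link", "meta"}

class ItempropCollector(HTMLParser):
    """Collects text found directly inside elements that carry itemprop."""

    def __init__(self):
        super().__init__()
        self.props = {}
        self._stack = []  # itemprop name (or None) for each open element

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self._stack.append(dict(attrs).get("itemprop"))

    def handle_endtag(self, tag):
        if tag not in VOID and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        name = self._stack[-1] if self._stack else None
        if name and data.strip():
            self.props.setdefault(name, []).append(data.strip())

# Hypothetical fragment annotated with Schema.org vocabulary.
SAMPLE = """
<div itemscope itemtype="http://schema.org/Person">
  <span itemprop="givenName">Jon</span>
  <span itemprop="familyName">Moore</span>
</div>
"""

collector = ItempropCollector()
collector.feed(SAMPLE)
collector.close()
props = collector.props
```

The client here keys off the vocabulary terms rather than the position of the spans, so the server can rearrange the markup without breaking it.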

In short, from a data description point of view, HTML and its associated standards give me all the tools I need to describe almost anything I could imagine, and those facilities are all off-the-shelf from my perspective.

Hypermedia Support

HTML offers <a>, <link>, and <form> as obvious examples of hypermedia controls. In fact, the use of <form> to support parameterized navigation (where the client supplies some of the information needed to formulate a request) sets HTML apart from most existing standard media types ("standard" in the sense of being registered in the IANA standards tree for media types). While this construct is currently not as powerful or expressive as it could be (for example, it only supports GET and POST as methods), it's actually enough to get by, and is certainly sufficient for a RESTful system (if you care about qualifying for the label). Furthermore, there are ongoing efforts within the HTML5 standards process to address this.
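Parameterized navigation can be sketched in a few lines: the client finds a form, merges its own values over the form's defaults, and derives the request from the form's method and action. The form class "contact-search", the field names, and the api.example.com host are all hypothetical, and only <input> fields are handled.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode, urljoin

# Hypothetical response containing one search form.
PAGE = """
<html><body>
  <form class="contact-search" method="get" action="/contacts">
    <input type="text" name="lastname"/>
    <input type="hidden" name="limit" value="10"/>
  </form>
</body></html>
"""

def build_request(page, base_url, form_class, values):
    """Return (method, url, body) for submitting the form with the given class."""
    root = ET.fromstring(page)
    form = root.find(f".//form[@class='{form_class}']")
    # Start from the defaults the server supplied, then apply the client's values.
    fields = {i.get("name"): i.get("value", "") for i in form.iter("input")}
    fields.update(values)
    method = form.get("method", "get").upper()
    url = urljoin(base_url, form.get("action", ""))
    encoded = urlencode(fields)
    if method == "GET":
        return method, f"{url}?{encoded}", None
    return method, url, encoded

request = build_request(PAGE, "http://api.example.com/", "contact-search",
                        {"lastname": "Simpson"})
```

Note that the client never hard-codes the /contacts URL or the limit parameter; both come from the server's markup, so the server can change them later.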

(As an aside, it's worth noting that <audio>, <video>, <iframe>, and <img> are also hypermedia controls).

Already Standardized

HTML is shepherded by an existing open standards process and a large community of experts, which means it has all the social machinery for ongoing support and evolution. More than that, however, HTML has had the opportunity to be battle-hardened with real world use for decades, including the documentation that comprises its specification. This is huge, because in documentation I can talk about "following links" and "submitting forms" without getting into details about how to construct those HTTP requests, because someone has already taken the trouble of writing that all down, including all the nasty corner cases. I'm lazy--I don't want to define and write down a bunch of rules that solve the same problems reams of experienced people that came before me have already solved.

Furthermore, due to its ubiquity, EVERYONE AND THEIR BROTHER understands HTML and lots of those people can write valid markup without consulting the HTML5 spec (of course, there are also lots who only think they can write valid markup without looking at the spec!). While developers may not be used to using HTML to power APIs, they can nonetheless look at an API response and understand what's going on. This is a huge advantage.

More importantly, HTML is already all over the Web, and there are both human and machine participants consuming it. If I'm starting from an API, then it's entirely possible that someone from the "human-oriented" Web might link to my API, and presto, they can use it, because:

human + browser = client for my HTML API

Similarly, if I'm writing a client, and it can parse HTML (and especially if it can parse RDFa or microdata), then there's a chance it could be pointed at the human-oriented Web and find it can do something useful. But if that client can't parse HTML, then it has no hope of accessing all the existing HTML content on the web.

The phrase here is "serendipitous reuse". The human stumbling onto my API will likely not find it pretty or well-designed, but they may still be able to use it. The programmatic client trolling through web sites will likely ignore half the stuff it downloads, but it still may find something useful (obviously Google has been able to do this). If I find my API is being visited by humans, too, I can add a link to a stylesheet and perhaps download a javascript client, and present them a more usable interface without bothering my programmatic clients that much. Similarly, if my human-oriented website decides it wants to serve programmatic clients too, it can always add semantic tagging in the meantime, and evolve elsewhere.

Tooling Support

Before we get too far into this, let's talk for a minute about the relationship of HTML to XML. Both are flavors of SGML, although the sets of valid documents each can describe are overlapping but distinct. Specifically, there are valid HTML documents that aren't valid XML documents and vice versa, but there are documents that are both valid HTML and valid XML. Then there's XHTML, which is always valid XML but not always valid HTML (depending on the versions). Thus, the relationship is:

[Figure: Venn diagram showing the relationships of the sets of valid XML, HTML, and XHTML documents]

In particular, I find that I can often use markup for my API that actually sits in the intersection of all three. My programmatic clients can ask for application/xhtml+xml, and I can give it to them, and browsers can ask for text/html, and I can give them the exact same bytes with a different Content-Type. If my client wants to use the ubiquitous and available XML parsing and handling libraries out there, great! If they want to be more robust and parse the full range of HTML, great! And yes, there are full HTML parsing libraries (not XML parsing libraries) in most programming languages, for example: Python, Ruby, Javascript, Perl, PHP, C, and Java.
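Here is a small sketch of what "the exact same bytes" buys a client: one string, read once with an XML parser and once with an HTML parser. It assumes markup that sits in the intersection and carries no XHTML namespace declaration (with one, the XML parser would report namespaced tag names).

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Markup chosen to be both well-formed XML and valid-looking HTML.
MARKUP = ('<html><body><ol class="contacts"><li>'
          '<span class="firstname">Jon</span> '
          '<span class="lastname">Moore</span>'
          '</li></ol></body></html>')

def names_via_xml(markup):
    """Read span text with a strict XML parser."""
    return [span.text for span in ET.fromstring(markup).iter("span")]

class SpanText(HTMLParser):
    """Read span text with a lenient HTML tokenizer."""

    def __init__(self):
        super().__init__()
        self.in_span = False
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self.in_span = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_span = False

    def handle_data(self, data):
        if self.in_span:
            self.texts.append(data)

def names_via_html(markup):
    parser = SpanText()
    parser.feed(markup)
    parser.close()
    return parser.texts
```

Both functions return the same values for this input; the HTML route would keep working on sloppier markup that the XML route rejects.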

Now, I will grant that most of these give you a DOM, and not much support above that, so you are endlessly and tediously traversing descendants and siblings in for loops, examining attributes to find what you're looking for. We do have an example, though, that shows manipulating a DOM need not be hard or tedious, and that is likewise ubiquitous: jQuery. And indeed, you can use jQuery selector syntax in other languages, too, like Java or Python. So most of what you actually need for manipulating HTML programmatically in a client probably already exists.
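The contrast between hand traversal and a declarative query can be sketched without any third-party library, using the limited XPath subset in Python's standard ElementTree as a stand-in for a real selector engine. The document is the contacts example used elsewhere in this post, assumed to be well-formed XML.

```python
import xml.etree.ElementTree as ET

DOC = ET.fromstring("""<html><body>
<ol class="contacts">
  <li><span class="firstname">Jon</span> <span class="lastname">Moore</span></li>
  <li><span class="firstname">Homer</span> <span class="lastname">Simpson</span></li>
</ol>
</body></html>""")

def lastnames_by_hand(root):
    """The tedious way: walk every element and inspect tag and attributes."""
    found = []
    for element in root.iter():
        if element.tag == "span" and element.get("class") == "lastname":
            found.append(element.text)
    return found

def lastnames_by_path(root):
    """The declarative way: state the shape of what you want."""
    path = ".//ol[@class='contacts']/li/span[@class='lastname']"
    return [span.text for span in root.findall(path)]
```

One caveat: the attribute test here is an exact string match, so an element with class="lastname primary" would be missed. A real CSS selector engine treats class as a space-separated set.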

On the server side, we are up to our ears in webserver frameworks that serve up HTML, and IDEs and practices that are set up to optimize developing, testing and debugging them. It's sure nice to load your API up in a browser and play with it. A human plus a browser is a fully-capable client of your HTML API, regardless of what programmatic clients you may be targeting. I can look at the requests and responses over the network and examine the markup in detail in Chrome's developer tools. Many frameworks written for compiled languages like Java can even hotload markup template changes on the fly without recompiling. Plus you can wave a stick and hit thousands (perhaps millions) of developers who are already familiar with all of these technologies.

But what about...?

Domain-specific media types. They're so concise! True; you'd have to work a little harder to represent a blog in HTML than in Atom or RSS, or to represent contact information in HTML rather than in vcard. If there's a domain-specific media type out there for what you're doing, great! Use it--that's what it's for! But I find I work in a world where the application domain is evolving rapidly with new concepts and new features, or where application domains are mixed and mashed up. Many domain-specific media types don't accommodate this well. Imagine trying to write a media type to document Facebook's functionality. You'd end up needing to change the spec daily! That defeats the purpose of having off-the-shelf libraries help you along for the parts that aren't changing much. Or wait--you could build a media type that was so flexible that it could express almost any application...oh.

Bloat. JSON is way more concise, and that really matters for mobile apps. I've heard this so many times that I'm going to have a hard time not being snarky here, so be warned. First off, if representation size or parsing speed is that critical, I'd suggest using a binary format instead, like Protocol Buffers or Avro. What's that? You don't want to use a binary format because it's not human readable? Ah, so you are willing to give up some efficiency to trade off for other things. I see.

But let's get down to some facts here. I often see the following argument presented:

"Here's my sweet JSON representation, only 122 bytes!"
{ "contacts" : [
  { "firstname" : "Jon", "lastname" : "Moore" },
  { "firstname" : "Homer", "lastname" : "Simpson" }
] }
"And here's the bad, old, ugly HTML representation. It's 266 bytes, 118% bigger!"
<html>
  <body>
    <ol class="contacts">
      <li><span class="firstname">Jon</span>
          <span class="lastname">Moore</span></li>
      <li><span class="firstname">Homer</span>
          <span class="lastname">Simpson</span></li>
    </ol>
  </body>
</html>
"Ergo, HTML is more bloated than JSON."

There are a couple of observations to make here. First, both of these would fit quite comfortably in a single TCP packet carried in a single 1500 byte Ethernet MTU frame, unless you've got a LOT of headers, in which case, start looking there for bandwidth savings first! So you're not going to notice the difference in practice.

But we're building an HTTP-powered API, right? And we're using compression, right? If I gzip those two files, the gzipped JSON version is 103 bytes and the HTML version is 150 bytes. Now the HTML is only 45% bigger, not 118% bigger. But still bloated, right? Wait, there's more.

These are really small files. The DEFLATE algorithm behind gzip (LZ77 back-references plus Huffman coding) works by exploiting repeated strings of bytes, so the compression rate depends on how long and how common those repeated strings are. Well, it turns out that what you call "bloat", gzip calls "compressible." The longer the document, the better it compresses, and the closer gzip will get to the information theoretic minimal representation. Let's see this in action, and with a real API, rather than a toy example. Here's a sample JSON response from the Twitter API, and here's an equivalent XML response, also from the Twitter API. Finally, here's turning it into an HTML-style response.

These samples are, respectively, 44265 bytes (JSON), 64493 bytes (XML), and 40252 bytes (HTML). Wait, what? The HTML representation is the smallest? How is that even possible? I did take the liberty of eliding blank properties, using HTML5 data attributes, and putting true boolean properties as @class values (and leaving off false boolean properties), which I assert are all common HTML idioms. But compare the source gists linked above and decide for yourself.
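One possible reading of those three idioms, as a sketch: blank properties are dropped, true booleans become class names, false booleans are omitted, and everything else becomes a data attribute. The render_item helper and the sample record are hypothetical and are not taken from the linked gists.

```python
from html import escape

def render_item(record):
    """Render a flat record as one <li>, using the idioms described above."""
    # True booleans become class names; false booleans are simply left out.
    classes = [key for key, value in record.items() if value is True]
    # Everything else becomes a data attribute, unless it is blank.
    attrs = {key: value for key, value in record.items()
             if not isinstance(value, bool) and value not in (None, "")}
    parts = ["<li"]
    if classes:
        parts.append(f' class="{escape(" ".join(classes))}"')
    for key, value in attrs.items():
        parts.append(f' data-{escape(key)}="{escape(str(value))}"')
    parts.append("></li>")
    return "".join(parts)

# Hypothetical record with one blank field and one false boolean.
record = {"id": 42, "lang": "en", "verified": True,
          "protected": False, "location": ""}
item = render_item(record)
```

For this record the false boolean and the empty location cost zero bytes on the wire, which is where much of the size saving comes from.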

Now let's gzip them: 7366 bytes (gzipped JSON), 7855 (gzipped XML), 7287 (gzipped HTML). This is only a size difference of 7% from smallest to largest, and even if you don't consider my HTML version comparable, you can see that gzip compression is removing a lot of the differences.
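The narrowing effect is easy to reproduce with synthetic data. The sketch below generates the same n contacts in a JSON shape and an HTML shape like the toy example above, then reports raw and gzipped sizes. The generated names are made up, and the exact byte counts will vary with whitespace and gzip settings, so no specific numbers are claimed here.

```python
import gzip
import json

def contacts(n):
    """Synthetic records; the names are placeholders, not real data."""
    return [{"firstname": f"first{i}", "lastname": f"last{i}"} for i in range(n)]

def as_json(items):
    return json.dumps({"contacts": items}).encode("utf-8")

def as_html(items):
    rows = "".join(
        f'<li><span class="firstname">{c["firstname"]}</span>'
        f'<span class="lastname">{c["lastname"]}</span></li>'
        for c in items)
    return f'<html><body><ol class="contacts">{rows}</ol></body></html>'.encode("utf-8")

def sizes(n):
    """Raw and gzipped byte counts for both representations of n contacts."""
    j, h = as_json(contacts(n)), as_html(contacts(n))
    return {"json": len(j), "html": len(h),
            "json_gz": len(gzip.compress(j)), "html_gz": len(gzip.compress(h))}

if __name__ == "__main__":
    for n in (2, 50, 500):
        s = sizes(n)
        print(n, s, "raw ratio %.2f" % (s["html"] / s["json"]),
              "gzip ratio %.2f" % (s["html_gz"] / s["json_gz"]))
```

The repeated tag names are exactly the kind of long, frequent strings that back-reference compression handles well, so the gzipped HTML-to-JSON ratio should sit well below the raw ratio for larger n.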

Now, don't get me wrong, JSON is a fine format, and I use it regularly. There are lots of good reasons to use it, but claiming that it is more economical on the wire, while possibly true, is probably not true by enough to make it a deciding factor (and if that really is a deciding factor, you probably want to go to binary formats anyway).

Summary

So what this all boils down to is that HTML offers me quite a lot of convenience as a hypermedia-aware, domain-agnostic media type. I have lots of off-the-shelf tooling, including getting my first client for free (the browser), and from a documentation point of view, between the HTML and HTTP, there's a whole lot of mechanics I don't have to discuss. In fact, if I'm using microdata, I don't even necessarily need to write much down about the particular application domain, at least from a vocabulary point of view. It might even be sufficient to document an HTML API just by listing out:

  • URL of the entry point(s)
  • link relations used (with pointers to their definitions elsewhere!), plus the important <form> @class values and <input> @names (I think forms need parameterized link relations to do this a little more formally, but we don't quite have those yet)
  • pointers to the microdata definitions of importance (again, elsewhere).
That's not a lot to have to write down.
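A client written against that kind of documentation needs to know only an entry point and a link relation. The sketch below uses an in-memory dictionary as a stand-in for HTTP so it stays self-contained; the two pages, the "contacts" relation, and the URLs are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Stand-in for the network: a hypothetical two-page API held in memory.
PAGES = {
    "/": '<html><head><link rel="contacts" href="/contacts"/></head><body/></html>',
    "/contacts": ('<html><body><ol class="contacts">'
                  '<li>Jon Moore</li><li>Homer Simpson</li>'
                  '</ol></body></html>'),
}

def follow(url, rel):
    """Return the href of the first <a> or <link> at url carrying the relation."""
    root = ET.fromstring(PAGES[url])
    for element in root.iter():
        # rel is a space-separated list of relations, so test membership.
        if element.tag in ("a", "link") and rel in (element.get("rel") or "").split():
            return element.get("href")
    return None

# Start at the documented entry point and navigate by relation, not by URL.
contacts_url = follow("/", "contacts")
names = [li.text for li in ET.fromstring(PAGES[contacts_url]).iter("li")]
```

The only strings the client hard-codes are the entry point and the relation name, which are the two things the documentation list above says to publish.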

19 comments:

Mike Kelly said...

I wrote a comment here that turned into a short blog post:

http://blog.stateless.co/post/26898344742/dont-use-html-as-the-media-type-for-your-api

Jon Moore said...

@Mike: Thanks for reading! I commented on your response post.

Jon Moore said...

It strikes me I should document some assumptions I'm making:

1. I can't use a domain-specific media type because my API either spans domains or the domains are evolving too rapidly. So I need a domain-agnostic media type.

2. I want to build a hypermedia API, because I believe that helps me with evolvability. So I want the media type to support links and forms (or parameterized links) natively.

3. I want there to be serendipitous reuse of my API, so I want to primarily use formally standardized formats/protocols/techniques (also because I want to minimize what I have to document).

4. If possible, I prefer having representations that can serve both human and machine clients.

kpobococ said...

Well, this article just added html to my mental list of available API formats =)

Anonymous said...

For mobile apps, every byte counts.

yes JSON is compact compared to HTML, but then most things are. However JSON is very wordy, as it's schemaless (well, sort of).

what you really want is a binary protocol.



*yes really, its not too hard to design and there are many prebuilt ones that are schema less, if that floats your boat

Andrei Neculau said...

My 4 cents on this:

* http://blog.programmableweb.com/2011/11/18/rest-api-design-putting-the-type-in-content-type/

* no matter if it's XML, JSON, HTML, HAL you are still left with mostly syntax: parse this text into a data-structure of type: integer, string, map, bag, "link class"

* different POVs, but HTML's semantics are mainly presentational: tags like EM, B, P, H1, etc. Whether there is CSS or not, the browser would take those semantics and act upon them from a presentation perspective

* "true" API semantics come not from what exists in the data that is sent, but what it means when it exists. Compare looking at a
- "application/html" document which I can see that it has two ADDRESS tags (if I'm the browser, I know how to render them)
- "application/vnd.shop+html" document which I know that it is defined to have x,y,z, an ADDRESS tag with ID "physical_address", meaning the address where the shop runs the business, and another ADDRESS with ID "return_address", meaning the address where one can do returns (if I'm the API client, and I have to do a return, I know that I will use the second one)

Jon Moore said...

@Anonymous: Re: binary protocols. I agree, which is why I reference ProtoBufs and Avro in the article--those are great options.

Jon Moore said...

@Andrei: Thanks for the link to your article--I enjoyed reading it. I've actually found that "typed" content-types, like XML schema or DTDs have their pros and cons. On the one hand, it makes it easier to write a client, but on the other hand, those clients are less adaptable to change.

Evolvability is all about reducing coupling; you can't break an assumption that isn't made in the first place. Forcing clients to do a little "spelunking" for what they're trying to find makes them far more robust and adaptive (particularly when written against servers operated by someone else).

Rich Czyzewski said...

Great post Jon,

This reminds me of a couple times in the past where using a 3rd party's api was so awful that I found it easier to just scrape the information off their web site instead.

XPath targeting the css class names made this approach very straight forward as each of these data elements had its own specific styling.

The data was easy to see on the page and therefore was (and should be) easy to access and work with.

Draws some interesting parallels to what you mentioned. I'll definitely be exploring your idea more.

Jon Moore said...

@Rich: That's exactly the right idea. A "robust" screen scraper needs to be generous about the markup it gets, same idea here, except that we're at least intentionally leaving hints/handles for the clients to find (like those CSS styles you mentioned).

In addition, sometimes you "XPath" to look for an A tag with a particular @rel, or a FORM tag with a particular @class; besides just looking for data.

Andrei Neculau said...

@Jon: "spelunking" (had to google it) is great, and it doesn't go against a semantic media-type: the API that I'm designing right now follows quite religiously the thoughts of https://secure.designinghypermediaapis.com/nodes/fdivisitjqwp

So from that perspective, I see things differently: the lack of a semantic media-type, although it increases "spelunking", also increases assumptions, and in the end the coupling.

With a semantic media-type, you would do a little "spelunking" to decrease coupling
E.g. does the response have property X? if it does, then I know (from the media-type's definition), rather than assume, that it means Y, and that I can do this and that)

Anonymous said...

Ok, HTML gives you a semantic context, but in the end it does not really help because you dont have proper tool support.

I dont see the disadvantages of using the proper XML Tools for this. As in HTML you get parsing for free, but unlike HTML validation (schema) is free as well.

"Brittle" as you call it is failing fast and reliably, so things can be fixed opposed to behaving unpredictably, even on schema consistent changes.

The HTML approach makes the client programmer guess and assume, proper XML (in most cases) generates client models for free.

Now please sell us PHP as a good System Programming Language.

I mean seriously ?

Jon Moore said...

@Andrei: It sounds like you might be talking about media type profiles like XMDP, which I think is compatible with the general idea here.

I'm trying to separate structure and hypermedia controls into one layer (HTML), and application semantics into another (RDFa/microformats/link relations/microdata), rather than lumping them into one semantic media type, like, say, vcard (which dictates structure and semantics).

I think if you had HTML for structure and hypermedia, and the HTML had a HEAD tag with @profile pointing to a specific collection of semantic conventions (i.e. your semantic "type") you get to approximately the same place.

Jon Moore said...

@Anonymous: Schemas work well and are useful for testing, right up until the point you find you need to change them, at which point you're faced with the choice I'm trying to avoid. This is why I say they are brittle--not to mention that at least one of my goals is perhaps to have clients interacting with the HTML on the web at large, which I guarantee is not going to be subject to schema. So I want to figure out how to write clients that are robust enough to not require the crutch of a schema. I think this is possible, we'll find out, I guess. :)

Please note that there isn't any guessing for HTML; it has perfectly clear rules for validation. "HTMLParser.parse(s)" is no harder than "XMLParser.parse(s)" or "JSONParser.parse(s)". You say that the HTML makes the client guess and assume, but I say a schema leads the client to make a more problematic assumption, which is that the schema will always govern those resources. I've run into enough rapid development situations where schema would not be appropriate, because the application domain is changing too rapidly.

Anonymous said...

Schemas are in fact not that hard :)

"You say that the HTML makes the client guess and assume, but I say a schema leads the client to make a more problematic assumption, which is that the schema will always govern those resources."

HTML only gives you semantics and syntax, not structure - how does the consumer/client know what is where. The Schema defines the structure, which has to be guessed or defined elsewhere otherwise.

The Client will always have to react on Changes in the structure.
Case A (change is valid in schema, but has not been used before):
non-schema client *maybe* still works or needs adaptation.
schema client just works.

Case B (change alters schema):
both clients have to be changed.

"Magic" clients that guess a lot may work in those scenarios, but are highly incompatible and messy. They also may behave wrong without indication, possibly corrupting the data.


Also nobody prevents you to make your schema xml look like a subset of html.

Jon Moore said...

@Anonymous: The processing model is slightly different, for sure. If there's a schema, then I agree it gets very easy for a client to find what it's looking for. With the approach I'm describing, the client behavior is somewhat less direct; it's more like "ok, let me digest this response entirely, and then decide what I want to do".

On the other hand, consider the "resource inlining" refactorings I showed in this talk. Would a pre-existing schema have accommodated this change? Maybe, maybe not (in my experience, most schemas wouldn't). If the schema did accommodate data being inlined or remote, then I'm suggesting the client behavior is actually pretty similar: "if I can find the data here then X else Y".

My experience with schemas has perhaps been in use cases where they aren't as useful. When the application domain changes rapidly and in unanticipated ways (common during user-facing feature innovation), I've always found the schemas get in the way, from either preventing a change to something we now realize would work better, or from just having to be re-versioned and then needing to handle client migrations.

However, I think we basically agree. In this case, a schema makes clients easier to write, at the expense of tighter coupling (clients can't handle a schema-breaking change). A schema-free approach makes it harder to write clients--no argument--but by definition reduces the number of assumptions the client may make about server behavior.

Which tradeoff you want to make depends entirely on your problem setting and your goals.

Jon Moore said...

Also, I would not characterize the clients as "guessing"; "searching" or "exploring" would be more appropriate. The clients are still reacting to specifics provided by the server: microdata, link relations, semantic structure. They may not know ahead of time where these things will be found, but they can be recognized when they are encountered.

Anonymous said...

I don't really understand the argument for HTML. You need a format for transferring data. Most likely, clients will be written to match whatever format you define - there aren't clients already written to your API. Any format for which parsers are readily available will do - XML and JSON included.

Using HTML will make parsing more tedious for web clients written in Javascript, which is why I prefer JSON. OTOH, if using HTML, you're always bound to a pretty strict and limited schema, whereas you don't have a schema restriction for JSON (unless you use JSON-Schema), and you can always extend XML schemas.

I also don't buy it that HTML is more expressive than JSON or XML. <a href="...">...</a> is IMO harder to read than <link uri="..." /> or <link>...</link>, or "link": "...". But in either XML or JSON you can represent things like "link" { "proto": "http", "host": "...", "port": ..., "path": "..." }, while in HTML you force the client to use an additional parser for this.

Putting <form></form> in whatever you transfer doesn't really make your clients support submitting forms, nor does it make this easier for them. In fact, relying on HTML's form mechanism you make it harder for clients, since submitting forms involves using yet another format - form contents aren't submitted as HTML. Whereas for XML-RPC, SOAP or JSON-RPC both the request and the response use the same format.

Jon Moore said...

@Anonymous: I think you raise some good points here. Yes, form-encoded inputs are a specific media type distinct from HTML, yet most HTTP libraries make form submission no harder than constructing a map of name-value pairs. This has not been a problem for me in practice in the clients I've written.

I also want to be a bit of a stickler here about expressiveness. XML and JSON do not have standard ways of representing links and/or forms, and neither does SGML. You can define conventions for links and forms in XML and JSON, but then you're defining a new media type (e.g. HAL+JSON or, more poignantly, application/xhtml+xml). HTML is the only media type in the IANA standards tree with support for links AND forms, at the moment.

There are certainly efforts underway to standardize other media types, and I would be happy to entertain them when they arrive. Much of the argument presented in this article is a result of the point-in-time state of the various standards bodies.