|Title||RSS and Atom: Understanding and Implementing Content Feeds and Syndication|
|Publisher||Packt Publishing Ltd.|
What are Newsfeeds?
RSS and Atom are XML formats for messages and other information that is updated frequently. The documents that are written in these formats are called "newsfeeds" or "feeds".
Scenario 1: Weblogs
M. writes a weblog. She composes new entries several times a week. M. writes for a group of friends, some of whom are webloggers as well. M.'s friend Peter learns about M.'s new postings through his newsreader (see Section 1.1).
M.'s audience reads her newsfeed primarily in newsreaders and aggregators. M. would like her feed to be easy to subscribe to, and to look as good in these programs as it does in a browser. It is also important for M. to be able to notify weblog communities easily when she has written a new entry.
Scenario 2: Publishing of Metadata
N. is in charge of a gallery's website. The gallery regularly offers new drawings to its clients. The website of the gallery is based on a database that continuously incorporates new information. N. wants to inform clients and colleagues through a newsfeed about every information update in his database.
For N.'s newsfeed, it is crucial that the content can be processed. The receivers of the newsfeed are to be alerted automatically as soon as a new work of a certain artist, with a certain subject or from a certain epoch is put up for sale in the gallery.
Scenario 3: Aggregating and Archiving of Newsfeeds
T. is a journalist. Her contract includes the writing of a daily news service for a publisher. This service is based on two types of sources: on pre-existing newsfeeds and on websites that don't make newsfeeds available.
The purpose of T.'s service is not only to be read on a daily basis. The messages are archived in a database. They are supposed to be saved there with information about their original source. Above all, T. is interested in aggregating news from different feeds, that is, to write a new feed from those that already exist. Besides this, T. also depends on the messages being permanently accessible.
Scenario 4: Asynchronous Broadcasting
P. works for a district radio. Part of the broadcast includes interviews with artists and authors. These interviews are available on the Web as podcasts. Interested listeners can download them to their MP3 player and listen to them while traveling.
Like M., P.'s main interest is that his audience can subscribe to his feed. For P.'s feed it is also important that the audios can be downloaded automatically and as easily as possible by the users to the terminal of their choice. They only listen to P.'s online broadcasts regularly if they don't have to endure long download times. For that, audio data has to be downloaded at the time when the listeners' computers are idle, for example, early in the morning.
Content and Metadata
Scenarios 1 and 4 are already everyday experience; 2 and 3 can soon become reality. M., N., T., and P. all share and distribute information. Their feeds consist of the content itself and metadata, that is, information about the data that makes up the content. Newsfeeds give users access to web content in different contexts and on different devices, and allow various services to inform users about updates through the metadata. The range of these services extends from simple headline news to the beginnings of the Semantic Web, which is the automated processing of web content.
When Do We Talk about Syndication?
The technical term for the regular exchange of up-to-date information between websites is "content syndication". The first form of syndication was to regularly integrate news from one website, or newsfeed, into another site. Newsfeeds can also be directly subscribed to and read with special programs called "newsreaders". At the same time, newsreaders serve as "aggregators"; aggregators give an overview of various newsfeeds. They show what information the feeds contain, which feeds have been updated, and which feeds' content the user hasn't read yet. Often, they also allow users of an online community to share newsfeeds.
One of the specifications of newsfeed formats defines syndication as "making data available online for further transmission, aggregation, or online publication". Syndication of web content means that the content is distributed at different locations on the Web. In this context, "location" is to be understood in a figurative sense, like a web address, which also doesn't refer to a place in real space.
Syndication or feed formats were developed in the 1990s to exchange content between websites and to integrate the content into portals. For that purpose, software on the server subscribed to feeds from other websites. The first portal of this kind, Netscape's My Netscape, gave registered users the option to compile feeds from different sources for their own purposes.
Soon after these portals, independent online aggregators became available. UserLand developed the first aggregator in 1997. Initially it was a simple directory of newsfeeds, but it soon developed into a web interface that allowed the user to subscribe to newsfeeds and share his/her own feed with others. Online aggregators spread as a tool for personal publishing. UserLand's aggregator, for example, was integrated with the weblog editor Radio UserLand. With a few clicks, users could transfer messages from another feed to their own feed to cite, comment on, or just spread. Radio UserLand is also prototypical of later developments insofar as members of a community could display the feeds to which they have subscribed. Like a hit parade or bestseller list, the ranking helps the further spread of the most popular feeds. The author of a weblog can find out who has subscribed to his/her feed. The reader finds sources of the authors he or she is specifically interested in.
In many cases, those applications that compile feeds and filter them according to certain criteria are also called aggregators, for example, O'Reilly's Meerkat service. Usually, aggregators of this type automatically generate metafeeds from the compilation of feeds of several individual topics or from different sources.
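The compilation step that such aggregators perform can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two feeds, their URLs, and the function names read_items and build_metafeed are hypothetical, the feeds are RSS 2.0 documents held in strings rather than fetched from the Web, and every item is assumed to carry a pubDate.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Two tiny RSS 2.0 documents standing in for feeds from different sites (hypothetical data).
FEED_A = """<rss version="2.0"><channel>
<title>Feed A</title><link>http://a.example.org/</link><description>First source</description>
<item><title>A1</title><link>http://a.example.org/1</link>
<pubDate>Mon, 02 Jan 2006 10:00:00 GMT</pubDate></item>
</channel></rss>"""

FEED_B = """<rss version="2.0"><channel>
<title>Feed B</title><link>http://b.example.org/</link><description>Second source</description>
<item><title>B1</title><link>http://b.example.org/1</link>
<pubDate>Tue, 03 Jan 2006 09:00:00 GMT</pubDate></item>
<item><title>B2</title><link>http://b.example.org/2</link>
<pubDate>Sun, 01 Jan 2006 08:00:00 GMT</pubDate></item>
</channel></rss>"""


def read_items(rss_text):
    """Yield one dict per item, remembering which channel it came from."""
    channel = ET.fromstring(rss_text).find("channel")
    source = channel.findtext("title")
    for item in channel.findall("item"):
        yield {
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "published": parsedate_to_datetime(item.findtext("pubDate")),
            "source": source,
        }


def build_metafeed(feeds, keyword=None):
    """Merge the items of several feeds, optionally filter by title keyword, newest first."""
    items = [item for text in feeds for item in read_items(text)]
    if keyword is not None:
        items = [i for i in items if keyword.lower() in i["title"].lower()]
    return sorted(items, key=lambda i: i["published"], reverse=True)


metafeed = build_metafeed([FEED_A, FEED_B])
for entry in metafeed:
    print(entry["published"].isoformat(), entry["source"], entry["title"])
```

Keeping the channel title with every item preserves the original source, which matters as soon as the merged items are republished or archived.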
Newsreaders like Feedreader, RSS Bandit, FeedDemon, and NetNewsWire are desktop tools to subscribe to newsfeeds. They frequently offer a more sophisticated interface than online aggregators. In addition, users can read newsfeeds with them while offline and newsfeeds can be saved and searched locally. Newsfeeds can be subscribed to and read with newer browsers and e-mail programs as well.
Meanwhile, some offline newsreaders can synchronize themselves with online aggregators like Bloglines while online, so that users can take advantage of both worlds. Microsoft's next operating system, "Windows Vista", will allow users to subscribe to the results of web searches on their computers or other machines as newsfeeds. It is certain that for the user, the difference between online and offline use, especially in the area of newsfeeds, is growing narrower and narrower.
1.2 Feed-Based Services
Aggregators and newsreaders helped newsfeeds achieve their breakthrough. Recently, numerous services have developed on the Web that process and analyze newsfeeds, or offer specific feeds themselves. Among the first of these services were feed directories like NewsIsFree and syndic8. Special search engines like Feedster and Daypop scan feeds to find up-to-date information. Today, UPS clients can track the status of their packages via RSS feed. Google's Gmail users receive the content of their e-mails via RSS. Players of Microsoft's Halo 2 can keep track of their rank through the posts on the players' ranking list. Very soon the advantages of RSS for companies' intranets became obvious as well. Companies like Moreover.com specialized in creating aggregated newsfeeds for commercial clients. RSS is easy to combine with knowledge management technology in this particular environment. Newsfeeds can also be used as a tool to observe the media; so-called RSS Radars are one example. RSS search engines can indicate new information with great precision, because the newsfeed itself tells them what was updated and when this was done. For this reason they are much more reliable in searching for news than common search engines.
Collaborative Filtering with RSS
The idea of collaborative filtering of newsfeeds already forms the basis of Radio UserLand. In its simplest form, the author of a weblog publishes in a "blogroll" which feeds he or she subscribes to. The more unmanageable the amount of information on the Net becomes, the more interesting are the possibilities of recommendations from people with similar interests. Interesting attempts in this direction are Rojo and Nearest Neighbor News Network.
Publication of Geocoded Information
Newsfeeds also have important applications in connection with localized services. The generation of newsfeeds from geocoded information with tools like worldKit, for example, allows the user to receive regularly updated information concerning certain regions or places. After the tsunami disaster in the Indian Ocean at the end of 2004, services were developed that spread seismographic information via newsfeed.
Feed Combinations as Website Metaphors
There is a lot of evidence to suggest that the success of feed formats will continue. Newsfeeds are not just an important part of the infrastructure of the "Semantic Web" but they might soon change the common concept of a website—and with it the content management systems as well. More and more, websites themselves could become aggregators, in which different feeds with specific common interests or characteristics are produced, combined, and recombined (Jason Kottke: Some "Web as platform" noodling).
1.3 RSS Requirements
Up to now I have only introduced some application scenarios for newsfeeds and referred to certain exemplary programs and services that are based on newsfeeds. Most users don't know that these programs and services are made possible through common document types for newsfeeds, which clearly differ from HTML. These documents have become widely accepted as the first XML formats on the Web.
The abbreviation RSS has established itself as the collective term for these newsfeed formats. The name "RSS" encompasses a number of closely connected technologies that identify and find updated or updatable information on the Web, and show and exchange that information. The term RSS developed from an abbreviation that can be interpreted in different ways: the three letters, depending on your interpretation, stand for "RDF Site Summary", "Rich Site Summary", or "Really Simple Syndication". "Atom" is the name of an attempt to formulate RSS in a new way, more precisely and in close synchronization with other up-to-date web technologies.
A document format is an important precondition for syndicating content. Exchanging these documents on the Web also requires communication protocols, which have to be taken into account when the format is defined. However, these protocols don't have to be RSS-specific. As you will see, RSS usually uses HTTP, the standard communication protocol of the World Wide Web.
Advantages of a Standardized Syndication Format for Users and Providers
A standardized syndication format makes it possible to receive precise information on which of the information objects, accessible through a URI, were changed and when that change occurred. A user can use this information to not only decide which parts of the updated web offering he or she wants to have a look at, but he or she can also get the new information with the feed itself. Software can process the appropriate elements automatically.
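How software can work out which information objects were added or changed may be easier to see in a small sketch. It assumes RSS 2.0 documents in which each item carries a guid (or at least a link) and a pubDate; the two snapshots, the URLs, and the function names index_items and changes are hypothetical.

```python
import xml.etree.ElementTree as ET

# Two snapshots of the same (hypothetical) feed, taken at different times.
OLD = """<rss version="2.0"><channel>
<title>Example News</title><link>http://news.example.org/</link><description>Demo</description>
<item><title>Story one</title><guid>http://news.example.org/1</guid>
<pubDate>Mon, 02 Jan 2006 10:00:00 GMT</pubDate></item>
<item><title>Story two</title><guid>http://news.example.org/2</guid>
<pubDate>Mon, 02 Jan 2006 11:00:00 GMT</pubDate></item>
</channel></rss>"""

NEW = """<rss version="2.0"><channel>
<title>Example News</title><link>http://news.example.org/</link><description>Demo</description>
<item><title>Story one (corrected)</title><guid>http://news.example.org/1</guid>
<pubDate>Tue, 03 Jan 2006 09:00:00 GMT</pubDate></item>
<item><title>Story two</title><guid>http://news.example.org/2</guid>
<pubDate>Mon, 02 Jan 2006 11:00:00 GMT</pubDate></item>
<item><title>Story three</title><guid>http://news.example.org/3</guid>
<pubDate>Tue, 03 Jan 2006 09:30:00 GMT</pubDate></item>
</channel></rss>"""


def index_items(rss_text):
    """Map each item's identifier (guid, else link) to its pubDate string."""
    channel = ET.fromstring(rss_text).find("channel")
    index = {}
    for item in channel.findall("item"):
        key = item.findtext("guid") or item.findtext("link")
        index[key] = item.findtext("pubDate")
    return index


def changes(old_text, new_text):
    """Return (added, updated) identifiers when comparing two snapshots."""
    old, new = index_items(old_text), index_items(new_text)
    added = [key for key in new if key not in old]
    updated = [key for key in new if key in old and new[key] != old[key]]
    return added, updated


added, updated = changes(OLD, NEW)
print("added:", added)
print("updated:", updated)
```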
For both the content providers and the receivers, feed formats have important advantages:
- Bandwidth Advantage
One important advantage of a syndication format can be that the transferred data needs less bandwidth than the original documents. In practice, however, this advantage plays only a secondary role, because today many documents in syndication formats contain the entire content of the original page.
- Clear Semantics
More importantly, there is a second advantage: the simple and clear semantics of the language medium, which can be defined to carry information about the latest changes to a website. An HTML document doesn't indicate which of its h3 elements contains the headings of up-to-date information, and where these messages originate. In a syndication document each of these messages can become an information object, which has a title and further attributes.
- Time Saving
Visiting more than 20 websites a day on a regular basis takes more time than most people can spare. Without a standardized exchange format, I would have to actively search for the information that an aggregator or newsreader provides, or I would be dependent on subproviders. The syndication format gives me easy access to many different news sources. I don't need an entity between the provider of the information and myself as the receiver, be it software, a specific server, or a company.
A standardized syndication format makes the user more independent; he or she can make a much better decision on what news to receive and when to receive it. At the other end, a syndication format increases the range of the news producer. The provider of news is not dependent on interested users checking their website for news; users can be actively informed about all changes on the site.
RSS is an example of the end-to-end principle, and in this it is similar to many other successful Internet technologies.
With RSS, an intermediate or switching level is no longer necessary. However, RSS is a purely technical tool; the task of choosing and assessing the content still remains with the user.
Requirements of a Standard Format
In the first section, we have seen examples of what feed formats are used for. These formats achieve the biggest impact because they have established themselves as standards. As such, they have advantages that were unimaginable with just a syndication format, however good it might have been. A shared format and standardized publication processes make it easier to:
- Find updated information
- Display it
- Exchange and further publish it
The requirements of a standardized feed format can be described on two levels:
- What information does an RSS document have to transmit (functional requirements)?
- How does it work together with other formats and protocols (formal requirements)?
The first level deals with application and use. These functional requirements are manifold: the users want to keep an overview of a large amount of different information; the information providers want to easily distribute information about different topics and in different formats and to provide their audience with up-to-date news. For that purpose, many platforms and many different types of content have to be considered (such as photo and video blogs, and the transfer of data for automatic processing).
Formal requirements have to be met, so that a feed format can be standardized. The chances that a feed format establishes itself are best if it goes back to previously established technology, which it complements and modifies only for its specific purposes. With a format for sharing content, standardization is not only nice to have, but a must: the wider the technical base is spread, the better syndication works.
Only a solution that is effective, abstract, and simple at the same time can be used as a standard: effective, because otherwise it could not manage the job; abstract, so that it can be adapted to different situations; and simple, so that it can be applied by many users. Furthermore, it has to fit into the "ecological" system within which it is used, that is, it has to match the architecture and infrastructure of the World Wide Web.
Functional Requirement: Finding Updated Information
Newspaper sites like the Financial Times, news sites like Slashdot, portals like Yahoo!, and weblogs like Scripting News are updated on a regular basis, often hourly. Other operators update their sites with new information at a lower frequency. A feed format has to make it clearly recognizable when and which components of a website have been updated, so that software can search for these specific elements.
In fact, the HTTP protocol also allows the user to find out if and when a web document was updated, but a server can inform a client via HTTP only of changes to the document as a whole, not of individual components that have been added or modified. The client can find out through the information in the HTTP header that the homepage of a daily newspaper has changed, but can't discern which messages and articles were added or modified.
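The document-level check that HTTP offers can be sketched as a conditional request. The example below only builds the request and does not send it; the URL, the ETag value, and the function name conditional_request are assumptions made for illustration.

```python
from email.utils import formatdate
from urllib.request import Request


def conditional_request(url, last_fetch_epoch=None, etag=None):
    """Build a GET request that asks the server to answer only if the document changed."""
    request = Request(url)
    if last_fetch_epoch is not None:
        # HTTP dates are expressed in GMT, e.g. "Thu, 01 Jan 1970 00:00:00 GMT".
        request.add_header("If-Modified-Since", formatdate(last_fetch_epoch, usegmt=True))
    if etag is not None:
        request.add_header("If-None-Match", etag)
    return request


req = conditional_request("http://www.example.org/index.html", last_fetch_epoch=0, etag='"abc123"')
for name, value in req.header_items():
    print(name + ":", value)

# Sending it with urllib.request.urlopen(req) would raise urllib.error.HTTPError
# with code 304 if the server reports that the document is unchanged.
```

Whatever the answer, it concerns the document as a whole: a 304 response says "nothing changed", and any other successful response delivers the complete page again, without saying which articles are new.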
Functional Requirement: Presentation of Information
Primarily, RSS is processed in order to present the information it carries, that is, to make it readable. The information has to be structured in such a way that it can be easily shown, and that it offers an overview of the content. Without conventions for a standardized presentation of updated web resources, users have to surf the Internet for individual documents and find their way around each site's internal navigation.
In fact, HTML is also a standard to present information in a standardized way. However, HTML doesn't have the semantics for news or news-like information, because it was developed as a language for all kinds of information as a sort of lowest common denominator for laying out web documents.
In contrast, standardized information about what is new on a site makes software possible that searches many sources for news and compiles the updated information. It is not specified, though, how much of the updated information is enclosed in an RSS document and how much in a source to which that document refers.
Functional Requirement: Exchange and Processing
Publishing information about changes on a website doesn't actually become interesting until that information can appear on other websites as well.
In this case, a website can subscribe to other websites and integrate their content, just as genetic material from one cell can be inserted in DNA strings of other cells. Without a standard for web news, such exchange operations can become complex and unstable. Users have to know the exact structure of the content they want to integrate, and then change it into their own publication format. The scripts necessary for this integration have to be rewritten for every change in the source structure. A standard, however, makes it possible to use material of any kind—aside from any legal problems.
Publishing and republishing also includes the commenting on, citing, and changing of information. An intention of the first web developers was to create a medium for users to publish and write, as well as receive and read. This "Semantic Web" needs rules for integrating and republishing if it is supposed to work worldwide, and be accessible for everyone.
Functional Requirement: Publishing and Editing of Information
Feed formats can also be used to publish or edit documents. In this case, the document reaches the web server in a feed format; publication protocols or APIs (Application Programming Interfaces) regulate how the data on the server is to be interpreted. Here, too, the combination of RSS with other XML formats and web protocols plays an important role. On the one hand, HTML fragments often belong to the content of the documents that are to be published. On the other hand, technologies like HTTP, XML-RPC, and SOAP are used for publishing.
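As an illustration of publishing through XML-RPC, the following sketch builds the body of a request in the style of the MetaWeblog API, whose post structure reuses RSS item names such as title and description. The choice of that API is my assumption, not something the text above prescribes, and the blog ID, credentials, and post content are placeholders. The request is only serialized and parsed back, not sent to any server.

```python
import xmlrpc.client

# The post is described with RSS-style member names; the content may contain HTML.
post = {
    "title": "A new entry",
    "description": "<p>Text of the entry with <em>HTML</em> markup.</p>",
}

# Placeholder parameters: blog ID, user name, password, post structure, publish flag.
params = ("blog-1", "user", "secret", post, True)
payload = xmlrpc.client.dumps(params, methodname="metaWeblog.newPost")
print(payload)

# Parsing the payload again shows what a server would receive.
received_params, received_method = xmlrpc.client.loads(payload)
print(received_method, received_params[3]["title"])
```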
Functional Requirement: Extracting and Processing Metadata
Another type of requirement is the extraction of information for automatic processing. Here in particular, the connections between RSS and the Resource Description Framework (RDF) are of relevance. Magazine publishers, for example, can provide the bibliographical data of all articles in machine-readable form within their newsfeeds. A feed with seismographic data can be analyzed for disaster warnings.
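A small sketch may show what such automatic processing looks like. It assumes an RSS 2.0 feed whose items carry a Dublin Core dc:creator element and RSS category elements; the gallery feed, its URLs, and the function name matching_items are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core element set, used here for the creator of a work.
DC = {"dc": "http://purl.org/dc/elements/1.1/"}

FEED = """<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/"><channel>
<title>Gallery</title><link>http://gallery.example.org/</link><description>New works</description>
<item><title>Harbour at Dawn</title><link>http://gallery.example.org/works/17</link>
<dc:creator>A. Painter</dc:creator><category>landscape</category></item>
<item><title>Study of a Hand</title><link>http://gallery.example.org/works/18</link>
<dc:creator>B. Sketcher</dc:creator><category>portrait</category></item>
</channel></rss>"""


def matching_items(rss_text, creator=None, category=None):
    """Yield (title, link) of items whose metadata matches the given criteria."""
    for item in ET.fromstring(rss_text).iter("item"):
        item_creator = item.findtext("dc:creator", namespaces=DC)
        item_categories = [c.text for c in item.findall("category")]
        if creator is not None and item_creator != creator:
            continue
        if category is not None and category not in item_categories:
            continue
        yield item.findtext("title"), item.findtext("link")


alerts = list(matching_items(FEED, creator="A. Painter"))
print(alerts)
```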
Functional Requirement: Extensibility
The history of the development of feed formats, along with the applications that are based on them, suggests that feed formats are likely to face numerous further challenges. Often it is particularly important to combine data in these formats with other forms of data. That is why feed formats need a standardized extension mechanism. Such a mechanism makes sure that new applications can be developed without the need to change existing formats and applications or to make them obsolete.
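XML namespaces are one widely used way to provide such an extension mechanism. The sketch below assumes a made-up extension namespace (http://example.org/ns/gallery) with an epoch element; it is meant to show that a program written only for the core elements keeps working when the extension is added.

```python
import xml.etree.ElementTree as ET

PLAIN = """<rss version="2.0"><channel>
<title>Gallery</title><link>http://gallery.example.org/</link><description>New works</description>
<item><title>Harbour at Dawn</title><link>http://gallery.example.org/works/17</link></item>
</channel></rss>"""

# The same feed with an element from a hypothetical extension vocabulary.
EXTENDED = """<rss version="2.0" xmlns:g="http://example.org/ns/gallery"><channel>
<title>Gallery</title><link>http://gallery.example.org/</link><description>New works</description>
<item><title>Harbour at Dawn</title><link>http://gallery.example.org/works/17</link>
<g:epoch>Impressionism</g:epoch></item>
</channel></rss>"""

G = {"g": "http://example.org/ns/gallery"}


def core_view(rss_text):
    """What a reader sees that knows only the core elements."""
    return [(i.findtext("title"), i.findtext("link"))
            for i in ET.fromstring(rss_text).iter("item")]


def extended_view(rss_text):
    """What a reader sees that also understands the extension."""
    return [(i.findtext("title"), i.findtext("g:epoch", namespaces=G))
            for i in ET.fromstring(rss_text).iter("item")]


print(core_view(EXTENDED))
print(extended_view(EXTENDED))
```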
Formal Requirement: Integration in the Architecture of the Web
In addition to these requirements, which can be derived from the challenges of the format, there are further requirements that arise from the environment in which the format will mainly be used: newsfeeds are documents on the World Wide Web and have to work in this specific environment. This means:
- Feed formats have to work in a similar fashion to other universal web technologies; they have to be simple and stable. This requirement concerns all aspects of feed formats: the syntax, semantics, and their application.
- Content is published in newsfeeds. Their format has to work with other web content formats. That is why the connections to these formats have to be well defined. This requirement concerns not only the syntax of feed documents, but also that of documents that use feed formats together with other vocabularies. HTML markup, for example, occurs in many newsfeeds. One demand for the specification of a feed format is to determine the relationship between these two vocabularies: whether an HTML passage in the content of a feed document is also a logical part of the document (belonging to the same document tree), or whether it is just cited.
- Newsfeeds contain information about other information or what is known as metadata. In many cases, feed formats are even considered metadata formats. That is why the connections to metadata formats have to be clarified. It also has to be clarified whether data in feed formats can coexist with other metadata. This requirement not only affects the syntax, but also (more importantly) the semantics of the documents.
- Feed formats belong among the publication technologies of the World Wide Web. Therefore, they have to consider the common procedures of the Web to transfer and publish messages, either by referring back to them or by specifying how and why they differ from them. This requirement concerns more the use of feed formats than the document structure. Without it, however, the syntax and semantics of the documents can't be determined.
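The difference raised in the second point above, between HTML that is merely cited and HTML that is a logical part of the document tree, can be made visible with an XML parser. The sketch assumes two common conventions: escaped markup inside an RSS 2.0 description element, and an Atom content element with type="xhtml". Both fragments are made up for the example.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"
XHTML_NS = "{http://www.w3.org/1999/xhtml}"

# Escaped markup: for the XML parser this is just a string of characters.
RSS_ITEM = ("<item><title>Greeting</title>"
            "<description>&lt;p&gt;Hello &lt;b&gt;world&lt;/b&gt;&lt;/p&gt;</description></item>")

# Inline XHTML: the markup consists of real elements in the document tree.
ATOM_ENTRY = ('<entry xmlns="http://www.w3.org/2005/Atom"><title>Greeting</title>'
              '<content type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml">'
              '<p>Hello <b>world</b></p></div></content></entry>')

description = ET.fromstring(RSS_ITEM).find("description")
content = ET.fromstring(ATOM_ENTRY).find(ATOM_NS + "content")

# The description has no child elements; its text is the cited markup.
print(len(description), repr(description.text))

# The content element has one child element, the XHTML div.
print(len(content), content[0].tag)
```

In the first case a consumer has to parse the cited string a second time as HTML; in the second case the markup is already available as elements.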
1.4 Semantics: The RSS Model
The common basic functions of the syndication formats can be divided into four categories:
- Architecture: structure of information
- Content: description and reproduction of information
- Identification and linking: referring to other information on the Web
- Metadata: description of important characteristics of the information
These requirements are so general that they could be listed just as well for other text formats on the Web, possibly for almost all of them. What is specific to syndication formats are the restrictions within each requirement group.
Even if the different RSS versions clearly differ from each other, the semantics of the most important features of the language are similar. The model of a collection of updated information objects belonging to a resource that is identifiable on the Web forms the basis of all syndication vocabularies. The feed document is a snapshot of the resource.
The term "resource" is used here in the sense of the World Wide Web Consortium and the URI standard: "every object that can be identified through a URI (Uniform Resource Identifier)" is a resource. Roy Fielding has made the concepts behind this usage transparent in his dissertation "Architectural Styles and the Design of Network-based Software Architectures".
Independence of Topics and Original Formats
Most importantly, a feed document contains information about which information objects are to be found under a URI and when they were updated. In addition, it can include a description of the resource and the individual information objects, the specification of a unique identifier for the objects, information about the editor-in-charge and the webmaster, and other information. It is also possible that the information object described may be completely embedded in the feed document.
All feed formats have a basic model in common. This basic model, however, is serialized—that is, translated into strings of characters—differently in the syntax of the feed formats. You can consider the formats that are described in this book as modifications, specifications, and extensions of this basic model.
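The idea of one model with several serializations can be sketched as follows. The example reads a simplified RSS 2.0 document and a simplified Atom document into the same two Python classes. The class and function names are hypothetical, and mapping the Atom elements subtitle and summary to the description is an assumption made for the sake of the comparison.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field

ATOM = "{http://www.w3.org/2005/Atom}"


@dataclass
class Entry:
    title: str
    link: str
    description: str


@dataclass
class Feed:
    title: str
    link: str
    description: str
    entries: list = field(default_factory=list)


def from_rss2(text):
    channel = ET.fromstring(text).find("channel")
    feed = Feed(channel.findtext("title"), channel.findtext("link"),
                channel.findtext("description"))
    for item in channel.findall("item"):
        feed.entries.append(Entry(item.findtext("title"), item.findtext("link"),
                                  item.findtext("description")))
    return feed


def _atom_link(element):
    # In Atom the address is an attribute of the link element, not its content.
    link = element.find(ATOM + "link")
    return link.get("href") if link is not None else None


def from_atom(text):
    root = ET.fromstring(text)
    feed = Feed(root.findtext(ATOM + "title"), _atom_link(root),
                root.findtext(ATOM + "subtitle"))
    for entry in root.findall(ATOM + "entry"):
        feed.entries.append(Entry(entry.findtext(ATOM + "title"), _atom_link(entry),
                                  entry.findtext(ATOM + "summary")))
    return feed


RSS_DOC = """<rss version="2.0"><channel>
<title>Example Weblog</title><link>http://www.example.org/</link>
<description>Notes and news</description>
<item><title>First entry</title><link>http://www.example.org/1</link>
<description>Short text</description></item>
</channel></rss>"""

ATOM_DOC = """<feed xmlns="http://www.w3.org/2005/Atom">
<title>Example Weblog</title><subtitle>Notes and news</subtitle>
<link href="http://www.example.org/"/><id>http://www.example.org/</id>
<updated>2006-01-02T10:00:00Z</updated>
<entry><title>First entry</title><link href="http://www.example.org/1"/>
<id>http://www.example.org/1</id><updated>2006-01-02T10:00:00Z</updated>
<summary>Short text</summary></entry>
</feed>"""

print(from_rss2(RSS_DOC) == from_atom(ATOM_DOC))
```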
The RSS model generalizes all the specifics of the updated information; it works independently of the internal structure of the information, and the topics it concerns. It is so universal that RSS feeds of all kinds of content are possible. Newsfeeds can refer to a wiki as well as to a weblog, an information portal, a compilation of software updates, or new multimedia data. Any collection of information that is updated at any point in time can be the object of a feed document.
At this point, I would like to introduce the basic model of the various feed formats. For this purpose, I will use the names of the XML elements in the existing feed formats, such as channel or title, as the names for the components of the feed documents.
1.4.1 Minimal Information
Structure: channel and item or feed and entry
There are two kinds of information objects in all RSS formats: collections of new information items, and the individual items themselves. A collection is called a channel (RSS 1.0, RSS 2.0) or a feed (Atom); an object within a collection is called an item (RSS) or an entry (Atom). On both levels, that of the channel or feed and that of the item or entry, there is content information, metadata, and information about the identification and linking of information objects.
Apart from the two levels of the information channel and the individual information object, that is, the channel and the item respectively, all feed formats are characterized by three pieces of information. The RSS elements that hold this information are called title, link, and description. They can be found on both the channel and the item level.
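To make the two levels and the three pieces of information concrete, here is a minimal sketch that assembles such a document in the RSS 2.0 serialization. The titles, the URLs, and the helper name describe are placeholders chosen for illustration.

```python
import xml.etree.ElementTree as ET


def describe(parent, tag, title, link, description):
    """Append a channel or item that carries title, link, and description."""
    node = ET.SubElement(parent, tag)
    ET.SubElement(node, "title").text = title
    ET.SubElement(node, "link").text = link
    ET.SubElement(node, "description").text = description
    return node


rss = ET.Element("rss", version="2.0")
channel = describe(rss, "channel", "Example Weblog",
                   "http://www.example.org/", "Notes and news")
describe(channel, "item", "First entry",
         "http://www.example.org/1", "Short text of the first entry")

document = ET.tostring(rss, encoding="unicode")
print(document)
```

The same three elements appear once for the channel and once for each item, which is all the structure the example contains.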
Usually, a feed document describes another web resource, namely, the resource that is identified by the content of the link element. Because the feed document is not only the representation but also the description of a web resource, feed formats can be called metadata formats, even if the difference between data and metadata is difficult to grasp precisely.
The obligatory presence of an element called link, and with it the ability to identify the document it refers to, distinguishes feed documents from other web formats like HTML. An HTML document and a feed document, like all other data that can be reached on the Web through HTTP, each represent a resource that is identified by the URI through which it can be reached (see footnote 1).
The link element only states what the RSS document describes; it is not itself the description. RSS also defines the description as generally as possible: simply as a description. All syndication vocabularies have an element that stands for the description as such; in RSS 1.0 and 2.0, it is called description. The only additional requirement is a title that identifies to people what the URI in link identifies for machines. These three elements then repeat themselves for the individual information objects that are described in the newsfeed as components of the resource. These objects can, but don't have to, refer to the information they describe through a link element of their own.
All syndication vocabularies repeat the minimal description of the entire feed at the level of the item, that is, for each component part of a feed. All additional elements are extensions; they build on the foundation of a model that could hardly be reduced any further. These additional elements make it possible to describe resources with "rich metadata" in a feed document and to transfer content within it.
Footnote 1: This resource is not identical to the data that the server delivers to the client; it is abstract in nature. This is most obvious with URIs such as www.yahoo.com, which clearly identify something but never refer directly to particular data or a specific server. The URI of an individual image likewise identifies the image independently of a particular location in the file system on a server; in all cases, a mechanism has to be defined to resolve the URI and to send the data to the user.
Presentation of Newsfeeds in Feed Readers and Aggregators
Documents with this simple basic structure—channel and item for the organization and title, link, and description for the descriptive content of a feed document—contain the minimum information a feed reader or aggregator needs.
The following screenshot shows how a feed document is presented by a common newsreader (the document source can be found in section 2.2.1).
Figure 1.1 Simple RSS 2.0 Document in a Newsreader (three-pane view)
On the left side you see a list of different newsfeeds, from which a sample document was chosen for display. On the right, in the upper field, the header (the content of the title element) and other features of individual messages are shown. The lower field displays the message that was chosen. Above are the news items, which are displayed one below the other including the headline of the message (again, the content of the title element); the content of the description element follows. Below the description the feed's title is shown; the date that follows was generated by the newsreader.
This so-called "three-pane view" is not the only possible way to reproduce RSS documents. The news items can also be displayed one below the other:
Figure 1.2 Simple RSS 2.0 document in the list view of MyYahoo!
Several other features of the entire channel are shown if the user opens the presentation of the feed's features in a context menu as the following screenshot demonstrates:
Figure 1.3 Display of RSS 2.0 channel features in FeedDemon
The pop-up window on the right shows the contents of the link and description elements of the channel. The window on the left displays the titles of several RSS feeds, which are preset in the newsreader that we use (FeedDemon). (The newsreader also works as an aggregator at the same time. With this program, it is also possible to share one's own subscriptions with others.)
You can see that the basic functions of a newsreader and a news aggregator can be realized, even if only a few elements of the feed vocabulary are used.
1.4.2 Other Content and Metadata
Content: Quotations and Pointers
Syndication formats are not content formats; they use existing formats for content: simple text, HTML, XHTML, other XML vocabularies, and also other text and binary media formats. These formats are used for titles, summaries, and the partial or complete reproduction of the content.
One of the characteristics of newsfeed models is that the description itself is defined in as generic a nature as possible. For this reason, it is possible to include any type of content in that description. In a syndication feed, any kind of web content can be sampled and further distributed. That is why RSS and its relatives are also suitable as a universal publication format on the Web.
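Binary media such as audio files are usually not embedded in the feed but pointed to. In RSS 2.0 this is done with the enclosure element and its url, length, and type attributes. The following sketch lists the enclosures of a made-up podcast feed; the feed content, the URLs, and the function name enclosures are hypothetical.

```python
import xml.etree.ElementTree as ET

FEED = """<rss version="2.0"><channel>
<title>District Radio Interviews</title><link>http://radio.example.org/</link>
<description>Interviews with artists and authors</description>
<item><title>Interview 1</title><description>Talk with an author</description>
<enclosure url="http://radio.example.org/audio/interview1.mp3"
           length="12216320" type="audio/mpeg"/></item>
<item><title>Programme note</title><description>No audio this week</description></item>
</channel></rss>"""


def enclosures(rss_text):
    """Yield (item title, file URL, size in bytes, media type) for items with an enclosure."""
    for item in ET.fromstring(rss_text).iter("item"):
        enclosure = item.find("enclosure")
        if enclosure is not None:
            yield (item.findtext("title"), enclosure.get("url"),
                   int(enclosure.get("length")), enclosure.get("type"))


downloads = list(enclosures(FEED))
for title, url, size, media_type in downloads:
    print(title, url, size, media_type)
```

A program that knows the size and type in advance can decide when to fetch the file, for example while the computer is idle.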
Metadata in Syndication Formats
Syndication formats serve to exchange information and make it available in different forms. For this reason, they describe the information they contain in a way that allows others to use it; at the same time, they inform users of the legal and other limits connected with using that information. Typical metadata therefore includes publication and update dates, the categorization of content, and the identification of writers, authors, and copyright holders.
RSS as a Publication and Syndication Format
Even though all existing feed formats require an element called link, it is possible that the information in a news stream isn't to be found outside the RSS feed, meaning that the RSS feed not only refers to another resource, but also contains the original information. The description model of an addressable collection of updatable information objects on the Web, on which RSS is based, works no matter whether these objects exist only in the RSS document, or refer to other resources on the Web. In principle, every resource on the Web that can be modeled as a collection of updated information objects can be the subject of an RSS feed.
1.5 Syntax: RSS as an XML Format
Many websites identify their newsfeeds through an orange-colored button labeled "XML". For many users and also for many developers "XML" and "RSS" are synonymous. In fact, all versions of the RSS feed format and Atom are XML applications. Since XML itself is a metalanguage to define languages for the exchange of information on the Web, the feed formats are also often called "XML dialects" or "XML vocabularies". To date, RSS is the most successful XML vocabulary—except for maybe XHTML, the XML version of HTML.
Standardization and Openness of XML
The biggest advantage of XML in the field of syndication is that XML is a simple, open, and standardized format to exchange information on the Web.
RSS has spread so successfully in recent years not only because it is a particularly effective format, but also because it has established itself as a standard. It acts like a lowest common denominator for updatable information of all kinds, and from the beginning it was accepted as such. Because millions of Internet users use RSS to spread and receive information, applications are possible that profit from network effects and become more useful the more users use them.
This success would not have been possible without the fundamental features of the underlying technology, XML. XML is a text-based format: people can read XML documents without any great difficulty. The content of XML documents can easily be extracted. In addition, XML is not a proprietary technology that is controlled by any software provider. RSS has inherited these advantages from XML; without them, it would not have been able to spread explosively on the Web. The use of a binary format or a proprietary text format would have complicated the development of software that produces or processes RSS, and limited the market for RSS applications. XML makes it easy to define a format for specific needs. All RSS formats consist of a very small group of XML elements and attributes defined for this purpose, and of rules for the hierarchical connections between these elements. With such a set of rules (expressed as a RELAX NG or XML schema), constraints on the permitted content of RSS elements can be specified, for example on the format of calendar dates.
Separation of Content and Presentation in XML
XML allows for the content and the presentation of documents to be separated. Many XML formats are content formats; they contain no information about how the documents are supposed to be reproduced visually or acoustically. The DocBook vocabulary for technical documentation, for example, uses an emphasis element for important passages and terms. DocBook doesn't specify, however, how such sections are to be emphasized in print. Other XML languages are description or presentation vocabularies. SVG (Scalable Vector Graphics) describes graphics, SMIL (Synchronized Multimedia Integration Language) describes time-structured presentations, and XSL-FO (eXtensible Stylesheet Language-Formatting Objects) describes the layout of printed pages in detail.
RSS is a pure content format. An RSS document doesn't contain information about how a document should be presented to the user. RSS uses XML to semantically distinguish information. Additionally, it uses the possibility provided by XML to separate content and presentation.
All RSS formats are pure source-text-based content formats. This means that it is necessary to provide them with additional presentation instructions that can be adapted to the respective presentation medium. The presentation instructions make it easy to present RSS documents in different media or in different contexts.
The simplest method to present RSS is to convert it into HTML and then use an HTML browser or a toolkit to display the HTML. This method has two points in its favor: XSLT (XSL Transformations) can be used to transform the XML data into HTML, and HTML fragments are frequently included as a part of the content of RSS documents, so an HTML rendering engine is necessary anyway to display them. Like all XML documents, RSS documents can also be formatted directly with Cascading Style Sheets. Moreover, there are many other presentation methods; Flash can be used, for example. One example of an RSS application using the latter is Gush.
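The conversion into HTML described above can be sketched in a few lines. The following is only an illustration, not the method of any particular newsreader: it uses Python's standard xml.etree.ElementTree module instead of XSLT, and the sample feed, its example.org addresses, and the function name rss_to_html are assumptions made up for this sketch.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical minimal RSS 2.0 document (all names and URLs are invented).
RSS_SAMPLE = """<rss version="2.0">
  <channel>
    <title>Example Gallery News</title>
    <link>http://example.org/</link>
    <description>New drawings on offer</description>
    <item>
      <title>First drawing</title>
      <link>http://example.org/drawings/1</link>
      <description>A charcoal sketch.</description>
    </item>
  </channel>
</rss>"""


def rss_to_html(rss_text):
    """Render the channel title and the item titles as linked HTML."""
    channel = ET.fromstring(rss_text).find("channel")
    parts = [
        '<h1><a href="%s">%s</a></h1>' % (
            escape(channel.findtext("link", ""), quote=True),
            escape(channel.findtext("title", "")),
        ),
        "<ul>",
    ]
    for item in channel.findall("item"):
        parts.append('<li><a href="%s">%s</a></li>' % (
            escape(item.findtext("link", ""), quote=True),
            escape(item.findtext("title", "")),
        ))
    parts.append("</ul>")
    return "\n".join(parts)


html_output = rss_to_html(RSS_SAMPLE)
print(html_output)
```

The sketch deliberately ignores the description elements; because these may themselves contain HTML fragments, a real application would hand them to an HTML rendering engine rather than escape them.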
Ability to be Validated
As XML documents, RSS feeds can be checked with standard procedures to determine whether they comply with the rules of the respective format. A document type definition or a schema contains the formal description of the rules that should be checked for compliance.
A document format that is defined as an XML format can use the methods typical of XML to solve problems of internationalization. XML consistently specifies Unicode as its character set. The Unicode standard assigns a number to every character from all known alphabets and is thereby able to represent texts in any language. 2
If it is important for the process to specify the language in which a document is created, the xml:lang attribute can be used XML-wide. The newer feeds make use of this option.
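As a hedged illustration of how an application might read xml:lang, the following sketch parses a small Atom-style document with Python's standard library. The sample feed and the function name entry_languages are assumptions for this example; the sketch also simplifies the inheritance of xml:lang to a single level (entry, then feed).

```python
import xml.etree.ElementTree as ET

# The xml prefix is always bound to this namespace, so ElementTree
# exposes xml:lang under the expanded name below.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
ATOM = "{http://www.w3.org/2005/Atom}"

# Hypothetical feed: the feed-level language is English, one entry is German.
ATOM_SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
  <title>Example feed</title>
  <entry xml:lang="de">
    <title>Ein Beispiel</title>
  </entry>
  <entry>
    <title>An example</title>
  </entry>
</feed>"""


def entry_languages(feed_text):
    """Return (title, language) pairs; entries without their own
    xml:lang fall back to the language declared on the feed element."""
    feed = ET.fromstring(feed_text)
    default_lang = feed.get(XML_LANG)
    return [
        (entry.findtext(ATOM + "title"), entry.get(XML_LANG, default_lang))
        for entry in feed.findall(ATOM + "entry")
    ]


languages = entry_languages(ATOM_SAMPLE)
print(languages)
```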
Extensibility and Namespaces
Extensibility is one of the key aims of XML; the acronym XML doesn't stand for "Extensible Markup Language" without a good reason. First of all, XML is extensible in that every user can define new element types and attributes, whereas a format like HTML determines the scope of the language.
The developers of all the RSS versions used this feature of XML to define element types like rss (the document or root element of an RSS document).
However, elements and attributes can no longer be defined freely once vocabularies like RSS 1.0, RSS 2.0, and Atom are fixed and standardized for certain tasks. The rules stipulated for such vocabularies, in the form of a DTD (Document Type Definition) or a RELAX NG or XML schema, allow only certain elements and attributes with fixed names in a fixed hierarchical order. The rules for the content that is permitted in the elements (content models) can nevertheless allow elements of other vocabularies to be embedded at certain locations in a document. This is fundamental for feed formats, because it allows sections that are formulated in XHTML to be included in a document.
In order to extend documents created in such a vocabulary by adding elements from other vocabularies, a method called the namespace mechanism was developed. All the feed formats described in this book use this mechanism. You need to understand it in order to be able to work productively with these vocabularies. The appendix contains a short introduction to the namespace mechanism (see appendix, section A.3).
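A small sketch may make the effect of the namespace mechanism more concrete. It assumes a hypothetical RSS 2.0 item that has been extended with the creator element of the Dublin Core vocabulary; the feed content is invented, and Python's standard xml.etree.ElementTree module is used only as one possible way to read it.

```python
import xml.etree.ElementTree as ET

# Prefix-to-namespace mapping used for lookups. The prefix chosen here is
# local to this script; only the namespace URI has to match the document.
NAMESPACES = {"dc": "http://purl.org/dc/elements/1.1/"}

# Hypothetical RSS 2.0 feed with one element from another vocabulary.
RSS_WITH_DC = """<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Example channel</title>
    <link>http://example.org/</link>
    <description>Feed with a Dublin Core extension element</description>
    <item>
      <title>First entry</title>
      <link>http://example.org/1</link>
      <dc:creator>M.</dc:creator>
    </item>
  </channel>
</rss>"""

item = ET.fromstring(RSS_WITH_DC).find("channel/item")

# Elements of RSS 2.0 itself are in no namespace and are found by plain name.
plain_title = item.findtext("title")

# The extension element is found only through its namespace.
creator = item.findtext("dc:creator", namespaces=NAMESPACES)

# Internally, the parser stores the element under its expanded name.
expanded_tag = item.find("dc:creator", NAMESPACES).tag

print(plain_title, creator, expanded_tag)
```

An application that knows nothing about the Dublin Core namespace can still process the title and link of the item and simply skip the element it does not understand.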
2 In order to represent Unicode texts, the characters have to be encoded; that is, the numbers that the Unicode standard assigns to them are mapped to specific sequences of bits. All XML applications have to support the UTF-8 encoding. UTF-8 assigns one byte to the first 128 characters, and two or more bytes to the following characters. In the encoding of the basic Latin letters, UTF-8 doesn't differ from the more popular ASCII encoding. XML applications assume that an XML document is encoded in UTF-8 if the XML declaration at the beginning of the document doesn't state a different encoding.
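The byte counts mentioned in this note can be checked with a short experiment. The sample characters are chosen arbitrarily for this sketch (a Latin letter, an accented letter, the euro sign, and a musical symbol), and the function name utf8_lengths is hypothetical.

```python
# Sample characters, written as escapes so the script itself stays ASCII:
# "A", e with acute accent, the euro sign, and the musical G clef.
SAMPLES = ["A", "\u00e9", "\u20ac", "\U0001d11e"]


def utf8_lengths(chars):
    """Map each character's Unicode number to its length in UTF-8 bytes."""
    return {f"U+{ord(ch):04X}": len(ch.encode("utf-8")) for ch in chars}


lengths = utf8_lengths(SAMPLES)
for code_point, size in lengths.items():
    print(code_point, "->", size, "byte(s)")

# For text that uses only the first 128 characters,
# the UTF-8 bytes are identical to the ASCII bytes.
same_as_ascii = "RSS".encode("utf-8") == "RSS".encode("ascii")
print(same_as_ascii)
```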
1.6 Feed Formats and other XML Formats
Syndication Formats are not News Formats
A comparison with the news-specific formats used by news agencies and commercial publishing houses shows that RSS simply can't be called a news format. The combination NITF/NewsML is increasingly establishing itself there. NITF stands for "News Industry Text Format". NITF is an XML dialect to identify the components of news content, such as headlines, introductory texts, and names of people and organizations. NewsML, which stands for News Markup Language, is a format for the "wrapper" of news, with information about release dates, the legal situation, etc. NewsML and NITF are based on the model of news in a journalistic sense. For feed formats, these semantics don't play an important role; their semantics are considerably more abstract.
NewsML and NITF are neither formats for information about the state of a—modifiable—web resource, nor formats for feeds, that is, for documents that summarize different information objects. RSS differs from NewsML and NITF in that all RSS messages refer to resources on the Web, which are identifiable through a URI. It is characteristic for an RSS document to be linked to a complete resource and that the individual information objects may or may not contain links as well.
Essentially, an RSS document is nothing more than a simple, two-level hierarchy of links that are provided with a title and a description. This pattern is so general that it refers to every resource on the Web that is identifiable, that is: which has a URI, which has components that can be labeled, and which changes with time.
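The two-level hierarchy described here can be written down as a tiny data model. This is only a sketch of the idea, not a complete representation of any RSS version; the class names Channel and Item and the sample gallery data are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Item:
    """Second level: one information object within the collection."""
    title: str
    link: str
    description: str = ""


@dataclass
class Channel:
    """First level: the resource as a whole, identified by its link."""
    title: str
    link: str
    description: str
    items: List[Item] = field(default_factory=list)


# Hypothetical channel for a gallery website.
gallery = Channel(
    title="Example Gallery",
    link="http://example.org/",
    description="New drawings offered by the gallery",
)
gallery.items.append(
    Item(title="First drawing", link="http://example.org/drawings/1")
)

# The resource "changes with time": new items are simply added.
gallery.items.append(
    Item(title="Second drawing", link="http://example.org/drawings/2",
         description="Ink on paper.")
)

item_titles = [item.title for item in gallery.items]
print(gallery.title, item_titles)
```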
Distinction of Message Formats
RSS can also be distinguished from the message formats that have recently been developed for machine-readable data. Well-known formats of this kind are XML-RPC and SOAP. These formats mainly serve to exchange data on the Web that is normally seen by no one. XML-RPC is used to call functions of programs running on remote computers. SOAP is a format for enveloping any complex message, for example, documents that are exchanged in e-business processes; it serves, for instance, as the envelope format for ebXML messages. (See Electronic Business XML Initiative (ebXML) and ebXML Ropes in SOAP.)
Surely it is no coincidence that the American developer Dave Winer significantly influenced RSS as well as XML-RPC and SOAP. These three XML vocabularies are formats for messaging on the Web. They don't need any exchange technology other than the HTTP protocol; SOAP and XML-RPC, as well, can be called end-to-end technologies. For Winer, especially, XML-RPC and SOAP are complementary to RSS in creating complete publication solutions.
RSS is a format for documents that are accessed by people, whereas SOAP is a format for data that is to be processed by machines. Due to their extensibility, all new RSS versions can in fact be used as envelopes for data. At the same time, the semantics of RSS remain: the messages inform about the state of a web resource that can be modeled as a collection of similarly structured information objects.
1.7 The Versions of RSS and Atom: Their Evolution and the Future
If I use the term "RSS" in this book without the version number, it acts as a collective term for "the different RSS versions and Atom" as a group, that is, as a synonym for "feed format". If I only talk about one of these formats, I use "RSS" with a version number, or the name "Atom".
In an ideal world, this book would just be an essay that describes a format for the syndication of content, which is easy to use and explain. In fact—apart from the various predecessors—we are dealing with at least three and a half newer formats, which were developed as alternatives for each other, namely, RSS 1.0 and RSS 1.1 (an RSS 1.0 update), RSS 2.0, and Atom.
Many websites still offer feeds in the predecessor formats of RSS 2.0; these feeds have version numbers 0.91, 0.92, and 0.93. In this book, I describe them along with RSS 2.0. The development and discussion of these formats isn't over; it is frequently discussed in a passionate and fierce manner. After all, because it concerns a key area of the Web's future development, it also involves influence and money.
Almost all RSS applications can process every, or at least every relevant, form of RSS feed. The most important reason for this is the fact that the semantic models, which are the basis for the different syndication formats, overlap for the most part. In addition, documents in the syndication formats have a flat structure; they don't involve any deep and complex hierarchies. (Where deeper hierarchies do occur, for example with quoted HTML markup, applications can usually leave the processing to an HTML rendering engine.)
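Because the formats have such a flat structure, an application can often tell them apart by looking at the document element alone. The following sketch shows one possible way to do this; it is an assumption about how such a check might look, not the approach of any specific newsreader. It recognizes only the rss element, the rdf:RDF element of the RDF-based versions, and the feed element in the namespace http://www.w3.org/2005/Atom; other variants would need additional branches.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
ATOM_NS = "http://www.w3.org/2005/Atom"  # assumed Atom namespace


def detect_feed_format(feed_text):
    """Guess the feed family from the document (root) element."""
    root = ET.fromstring(feed_text)
    if root.tag == "rss":
        # RSS 0.91, 0.92, 0.93 and 2.0 share the root element
        # and differ in the version attribute.
        return "RSS " + root.get("version", "(unknown version)")
    if root.tag == "{%s}RDF" % RDF_NS:
        return "RDF-based RSS"
    if root.tag == "{%s}feed" % ATOM_NS:
        return "Atom"
    return "unknown"


# Hypothetical, heavily abbreviated sample documents.
samples = [
    '<rss version="0.91"><channel/></rss>',
    '<rss version="2.0"><channel/></rss>',
    '<rdf:RDF xmlns:rdf="%s" xmlns="http://purl.org/rss/1.0/"/>' % RDF_NS,
    '<feed xmlns="%s"/>' % ATOM_NS,
    '<html/>',
]
detected = [detect_feed_format(text) for text in samples]
print(detected)
```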
The following table includes data in respect of the most important feed and news formats. With this, I follow:
Notes for the table:
- ICE is an industry standard for the automatic exchange of content. You can find more information on the ICE website, and on the Cover Pages.
- David Megginson defined XMLNews as a format for news content and metadata. The content format is a subset of NITF; the metadata format uses RDF. You can find more information on the XMLNews homepage, and on the Cover Pages.
- NITF is used in the news business as the format for news items content on a large scale. You can find more information on the NITF website and on the Cover Pages.
- NewsML is a format used for exchanging news in text and multimedia formats; it can be used together with NITF. You can find information on the NewsML website and on the Cover Pages.
- PRISM is an industry standard for the exchange of metadata between commercial content providers. You can find information on the PRISM website and on the Cover Pages. There is also an RSS 1.0 extension module available for the PRISM metadata vocabulary.
- RSS 3.0 is a text format for newsfeeds with no serious intention behind it. You can find information on the website.
In this book I discuss only the following three families of formats:
- RSS 2.0 and its predecessors (RSS 0.91, RSS 0.92, and RSS 0.93)
- RSS 1.0 and RSS 1.1
- Atom
The news industry formats in the strictest sense (NITF, NewsML, ICE, and PRISM) serve different purposes from the feed formats of the RSS and Atom family. They serve to exchange content and trade data between commercial partners. All remaining formats either didn't establish themselves or are irrelevant. This doesn't mean that they are not interesting. The appendix contains an overview of the Outline Processor Markup Language, OPML, which is used by many aggregators and newsreaders as an addition to RSS (see section A.2, Outline Processor Markup Language).
1.7.1 The Beginnings: MCF, Scripting News, and CDF
The disparate influences that subsequently led to the development of different RSS versions are pretty obvious in the history of the formats. A metadata format—the "Meta Content Framework" MCF—and news channel formats like the Scripting News format and Microsoft's Channel Definition Format (CDF) were the predecessors of RSS. For the description of RSS's case history, I follow primarily Ben Hammersley, Content Syndication with RSS, O'Reilly, 2003. In 2005, the second edition of the book was published (see bibliography).
The World Wide Web was developed as a net of texts, linked to each other. The protocols and standards to which the Web owes its astronomical rise, namely HTML and HTTP, describe how web documents are structured and how they are published, modified, and accessed. HTML doesn't take into account that many of these documents are often, and in many cases regularly, changed and updated. In the Web's infrastructure, which established itself in the first half of the 1990s, software developers and their clients were concerned with the demands posed by constant changes and updates in resources on the Web. In this manner, the first content management systems and browser add-ons, like the Netscape Sidebar and Java Applets with stock ticker messages, emerged. In the process, it became clear that common formats and protocols that support the constant updating of web resources, would simplify publishers' and users' lives and work on the Net. Such formats were developed in the mid 1990s.
Meta Content Framework and Channel Definition Format
The origins of RSS reach back to at least 1995. At the time, Ramanathan V. Guha designed the Meta Content Framework or MCF. Apple used the Meta Content Framework in an experimental project called ProjectX, and later HotSauce. MCF makes it possible to describe sites with metadata that is kept in a separate MCF file. HotSauce presents this metadata in a format that allows three-dimensional navigation. In 1997, Guha switched over to Netscape and met Tim Bray, one of the most important developers behind the XML standard. Together they transformed MCF into an XML-based format. From this collaboration, the Resource Description Framework (RDF) was developed, the basic technology of the Semantic Web.
Simultaneously, Microsoft, together with Pointcast and other companies, also developed an XML-based format to describe websites, which was called Channel Definition Format (CDF). CDF allowed the description of content, publication plans (scheduling), logos, and metadata of a site. It was incorporated in Internet Explorer 4 and acted as the technology basis for Microsoft's so-called Active Desktop.
UserLand's Scripting News Format
Perhaps the oldest syndication format in today's sense is the Scripting News format from UserLand.com. Dave Winer described it in December 1997 and implemented it publicly. A number of sites still offer newsfeeds in this format, in which every entry is a section with links. Winer tried to capture the basic characteristics of writing on the Web, instead of offering only headlines, as the early RSS versions did. In 1999, Winer included important elements of RSS 0.9 in version 2 of the Scripting News format.
In 1999, Netscape introduced RSS 0.9 as a format to describe information channels and aggregate content. RSS made it possible to publish snapshots of content in the portal "My Netscape". RSS soon proved to be an effective, simple XML format for the syndication of content beyond this application.
Initially, RSS channels contained only news, but soon new types of content were added. For example, RSS feeds started describing articles in discussion forums, wikis, and new software versions.
RSS was initially an abbreviation for "RDF Site Summary". (For information about RSS as "RDF Site Summary" see Chapter 3. For a detailed explanation of the term, see section 3.1 RDF Basics.) With RSS, it was possible to integrate headlines from other sites, with links to these sites, in the portal. The user could personalize the portal and subscribe to a number of sites that offered RSS data. In this manner, My Netscape had at its disposal a great amount of additional content, which kept users on the site longer; the providers of RSS data received additional traffic, the most important goal of many websites in the times of the dot-com boom. Since it is easy to convert RSS to HTML, other sites soon started using the same technology. Slashdot soon used RSS instead of its own headline format, and tools were developed to create and process RSS in the common scripting languages.
The first desktop headline viewers were released in 1999 (Carmen's Headline Viewer; compare this with Ben Hammersley's article in the Guardian). These applications made it possible to download RSS information and then read it without being connected to the Internet. Likewise, RSS directories like syndic8 and other aggregators were developed at about the same time.
Dan Libby developed the first version of RSS as a pure RDF application. At Netscape, however, that format was soon considered too complicated, and it was replaced by a simpler vocabulary, which was not usable RDF, but wasn't a really simple format either. Soon after, Netscape completely abandoned RDF in RSS 0.91. This decision provoked the first split in the development of the syndication formats, a split that lasts until today. One group of developers considers RSS an XML format to exchange news and other content that is updated often. The other group regards it as a metadata format, that is, an instrument to represent knowledge. The debate over whether newsfeed documents should be RDF documents at the same time isn't over yet.
In the first year of their existence alone, there were 4,000 different RSS feeds to be found on the Web. In 2002, the RSS directory syndic8 broke through the symbolic 10,000 feeds barrier.
1.7.2 RSS 0.91
Soon after, Netscape published RSS 0.91 under the name of Rich Site Summary. RSS 0.91 wasn't an RDF format anymore; it took on some elements from UserLand's Scripting News format, most importantly the description element. This allowed RSS to evolve into a format for spreading content, for which it was developed in the first place. Netscape wasn't involved in further development of the format for very long. UserLand and especially its founder, Dave Winer, successfully propagated RSS as an element of the syndication framework and soon after published version 0.91 under their own copyright. Winer is among the founders of Weblogging and also belongs among the pioneers of the "Semantic Web".
RSS 0.91 and all its subsequent versions, as well as XML-RPC and the MetaWeblog API, owe their origins to UserLand and Winer. UserLand products like the content management system Manila and the service EditThisPage.com "brought together the world of content syndication and weblogs", to quote the introduction of the RSS 1.0 specification.
An important novelty of the Netscape RSS 0.91 version compared to RSS 0.90 is the possibility of validating documents of this format against a DTD. Abandoning the RDF characteristics, which couldn't be used any more at that point, simplified the language compared to its predecessor. The abbreviation RSS now stood for Rich Site Summary or Really Simple Syndication (for more information on the XML elements, see also section 2.5.1).
1.7.3 RSS 1.0
In the following years, the split came to a real head in the RSS developer community. Dave Winer's company, UserLand, controlled RSS 0.91. UserLand was above all interested in keeping the format simple and using it for personal publishing, particularly for the new publishing form of Weblogging.
Other important developers, however, among them Rael Dornfest, who was working as a chief technology officer at O'Reilly, wanted to expand the scope of RSS to use it for other purposes and connect it with additional formats. Therefore, they reintroduced RDF and also introduced a new mechanism, XML namespaces. A related specification was published in December of 2000; the developers called the format that it described RSS 1.0.
RSS 1.0, which is in no way just an additional RSS version, but an alternative language on its own, is more formally specified than RSS 0.91 and its successors. RSS 1.0 is defined not only as a syntax, but also as a data format. Due to its compatibility with RDF, the metadata framework of the W3C, RSS 1.0 makes the exact description of the relationship between RSS data and metadata of other RDF formats possible.
However, RSS 1.0 and RSS 2.0 don't differ much with respect to the embedding of content in other formats and the description or non-description, respectively, of the relationship between document formats and publication environments. (Chapter 3 gives a detailed description of RSS 1.0. You will find a reference of its XML elements in section A.4 in the appendix.)
1.7.4 RSS 0.92
Winer answered the publication of RSS 1.0 with RSS 0.92, within two weeks. RSS 1.0 was a modular and extensible syndication vocabulary that could be easily combined with other XML vocabularies and RDF formats. RSS 0.92, on the other hand, was an easy-to-use vocabulary whose limited features were sufficient for the needs of most users of syndication technologies.
From the users' perspective, RSS 0.92 and RSS 1.0 were compatible. Most RSS parsers could and can process documents in both formats. Parsers for the 0.9x formats, however, can't understand the RSS 1.0 extension modules, let alone extract RDF data from RSS documents.
All attempts to develop another RSS format, acceptable to representatives of both versions, failed. Several RSS 1.0 fans held Dave Winer responsible for this. Not only did Winer refuse to define RSS as an RDF format or design it to be RDF compatible, but he also didn't accept the common practice of discussing a format on a mailing list in order to reach the widest possible consensus with other developers.
Instead, Winer wanted to turn weblogs into discussion forums for the further development of RSS. This procedure allowed him and UserLand to filter the articles. (For more information on the XML elements used by RSS 0.92, see section 2.5.2.)
1.7.5 RSS 0.93
RSS version 0.93, which was published by Winer a year later, already contained most of the elements that belong to the current RSS 2.0. But RSS 0.93 doesn't have an extension mechanism. This format remains popular even today. (For more information on the XML elements used by RSS 0.93, see section 2.5.3.)
1.7.6 RSS 2.0
In September of 2002, Winer published the specification for RSS 2.0, again without making an effort to reach a consensus with those who participated in the rss-dev mailing list and helped develop RSS 1.0. (Just prior to this, he had published the same RSS 2.0 format as RSS 0.94.) At the same time, Winer declared RSS 2.0 a frozen standard; successor formats weren't supposed to be published under the name RSS any more. A little later, Winer assigned the rights of RSS to Harvard University—RSS was to be exempted from the suspicion of serving personal or business interests.
Today, RSS 2.0 is the most widely used feed format. It is characteristic of this format that it does not specify, or leaves it to the application developers to specify, the connections between RSS data on the one hand, and other content formats, data/metadata formats, and publication environments on the other hand. Essentially, RSS 2.0 defines syntax, whereas meaning and use were determined through the use of examples. The supporters of RSS 2.0 consider this low level of specification one of the format's biggest advantages, whereas the supporters of alternate RSS versions see it as its prime weakness.
Other formats owe their existence to the fact that RSS 2.0 ignores a lot of problems. The enormous problems encountered during the formal definition of these formats are an argument for, as well as against, this strategy. They are an argument for it, because RSS 2.0 works in many different applications and is, together with its predecessor formats, by far the most popular version. The argument against it is the fact that, in practice, problems arise wherever the RSS 2.0 specification is unclear, for example, in the case of document validation. (Chapter 2 gives a detailed description of RSS 2.0. You will find a reference list of the XML elements of RSS 2.0 in the appendix in section A.3.)
1.7.7 From a Syndication to a Publication Format: Atom, the New Alternative
In June of 2003, the Atom roadmap was published. (Initially, the format was called "Echo" and "Pie".) The goals of this format were to be "100% vendor neutral, implemented by everybody, freely extensible by anybody, and cleanly and thoroughly specified". Previously, there had been intense debates about RSS 2.0 and the political implications of the fact that Dave Winer had control over the format.
At that point, it was clear that "weblogging would become an industry of its own", as Mark Pilgrim put it: in the future, interoperation would require more than "calling a friend or sending an e-mail". Mark Pilgrim and Sam Ruby developed the Feed Validator, which checks newsfeeds in almost all known feed formats for compliance with the respective specification. In the process, they came across deficits of the RSS 2.0 specification and its predecessors. The specification is unclear on several important points, so in some cases it can't be decided whether a document complies with it or not. Winer's attempts to stay in control seemed to be "FUD" to the group of future Atom developers. (Fear, Uncertainty, Doubt: open-source supporters use this acronym for a strategy, applied deliberately but often in vain, of making someone insecure.) At that time, Mark Pilgrim considered RSS 1.0 more or less a failure, or even dead, and some of the people who had backed RSS 1.0 up to that point supported Atom from then on as a new format.
In March of 2004, Dave Winer suggested, unsuccessfully in the end, combining RSS 2.0 and Atom into one format and naming the document element rssAtom. The new format would "differ from RSS as little as possible" and would be developed by an open IETF work group. The specification, which the Atom developers were promising, and the validation service could be used together. Winer's suggestion differed from the goals of the Atom developers only in the fact that he placed value on maximum backward compatibility with older RSS versions. At that point, however, the discussion had advanced too far already, and Winer didn't participate. In fact, the Atom developers chose the IETF as the standards body. As the only feed format so far to be backed by an organization that is in part responsible for the development of the Internet, Atom has a good chance of becoming a standard.
The Atom work group followed the path of an exact syntactical specification that clearly defines the connections of Atom-specific information to other information included in the document. Atom is explicitly defined as both a syndication format and a publication format. The "Atom Publishing Protocol" will belong to the Atom standard as well, once it is completed. On the other hand, the connection with metadata formats is not the center of the Atom developers' attention. The Atom standard as such is independent of the specifications of the Resource Description Framework; however, for some developers it is especially important that Atom and RDF stay compatible. (Chapter 4 gives a detailed description of Atom. You can find a reference list of the XML elements for Atom in section A.7 of the appendix.)
1.7.8 Which Format for Which Purpose?
All three—or four—up-to-date RSS versions offer the same basic functions for the user. The differences with respect to these tasks are easy to balance with modifications and extensions. The formats, however, vary notably in the amount of detail in the specifications, the processing of documents in these formats, and the additional functions they offer:
- RSS 2.0 and its predecessors were defined by referring to the latest technological implementations. The specification doesn't depend on the way RSS is treated, but—explicitly or implicitly—it refers regularly to the current practice. This is supposed to make the specification simple and easy to implement, and restricts the creativity of software developers as little as possible. (It is for this reason that it is so easy to accuse Dave Winer, one of the format's founders, of using the format definitions for personal interest or the interests of his company UserLand. It is a design principle of RSS 2.0 to abide primarily by the current practice; as a pioneer of this practice, Winer can't do anything other than to refer to his own developments.)
- RSS 1.0 and its successor RSS 1.1, on the other hand, are specified in such a way that the content of documents can be automatically processed. An RSS 1.0 or 1.1 document is nothing but a serialization of statements which follow the rules of the Resource Description Framework (RDF). The format uses a semantic model that makes the formal description of the document's meaning possible. Information that is available in an RSS 1.0 or RSS 1.1 document can be easily connected with other RDF information and used together.
- Atom was defined with the technological requirements of newsreaders and weblog authoring systems in mind. (See also the Use Cases page of the Atom Wiki.) In the specification, however, the format is described abstractly and independently of how such systems are implemented. The goal of the Atom specification is to describe the format and its rules completely and clearly for users. Software developers are supposed to be able to decide with certainty what is allowed in an Atom document and how documents are exchanged between the client and the server. (This doesn't mean that the significance of the language elements for a human user, that is, their social function, is clearly determined. Nor does it mean that Atom meets its own expectations one hundred per cent. If it cannot be decided in Atom or RSS 1.0 whether a certain construct is allowed in a document, there is a bug in the specification.) Another important difference between Atom on the one hand and RSS 2.0 and 1.0 on the other is that Atom was also developed as a format for authoring documents. To that end, the format is used in the context of the architecture of the web as described in the current specifications of the W3C.
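The statement in the list above that an RSS 1.0 document is a serialization of RDF statements can be made concrete with a small sketch. It reads one item and lists the (subject, predicate, object) statements it contains. The sample feed, the example.org URLs, and the function name item_triples are assumptions made for illustration; the code handles only simple text properties and is not a complete RDF/XML parser.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RSS_NS = "http://purl.org/rss/1.0/"

# Hypothetical, heavily shortened RSS 1.0 document (the channel is omitted).
SAMPLE_RSS10 = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://purl.org/rss/1.0/"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <item rdf:about="http://example.org/entry/1">
    <title>First entry</title>
    <link>http://example.org/entry/1</link>
    <dc:creator>M.</dc:creator>
  </item>
</rdf:RDF>"""


def item_triples(xml_text):
    """Return the statements of each item as (subject, predicate, object)."""
    root = ET.fromstring(xml_text)
    triples = []
    for item in root.findall(f"{{{RSS_NS}}}item"):
        # The rdf:about attribute names the resource the statements are about.
        subject = item.get(f"{{{RDF_NS}}}about")
        # The element name itself states the type of the resource.
        triples.append((subject, RDF_NS + "type", RSS_NS + "item"))
        for prop in item:
            # ElementTree writes tags as "{namespace}local"; the predicate
            # URI is the namespace followed directly by the local name.
            namespace, local = prop.tag[1:].split("}")
            triples.append((subject, namespace + local, (prop.text or "").strip()))
    return triples


for triple in item_triples(SAMPLE_RSS10):
    print(triple)
```

Because every property becomes a statement with a full URI as its predicate, the dc:creator statement from the Dublin Core vocabulary sits next to the RSS statements without any special handling. This is what makes it easy to combine such a feed with other RDF data.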
If you are reading this book, you are probably using RSS yourself, or at least want to use it in the future. Given the different RSS versions in use on the Web, sooner or later you will ask yourself which one is right for you.
There is a long and a short answer to this question. The long answer is the book itself. As you will see, the advantages and disadvantages of the different syndication formats can't be summarized in just a few sentences. As soon as more is involved than producing a simple newsfeed, several aspects have to be considered, such as the existing software, the need to combine RSS with other vocabularies, the way data is validated, future extensibility, and the requirements that result from the use of web services.
The short answer is: users who want to use RSS only as a syndication format have to analyze what data they want to offer. The most important content elements are found in all RSS versions. Those who restrict themselves to these core elements can use any of the formats and automatically convert it into one of the others, either with software on their own system or with a service offered on the Web, such as FeedBurner.
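As a rough illustration of why such a conversion is possible, the following sketch reduces items from an RSS 2.0 feed, an RSS 1.0 feed, or an Atom feed to the same three core fields. The sample feeds, the field names, and the function core_items are assumptions made for this example. Both samples are abbreviated (a valid Atom feed also needs id and updated elements), and the sketch simply takes the first link of an Atom entry.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"
RSS1 = "{http://purl.org/rss/1.0/}"

# Hypothetical, abbreviated sample feeds carrying the same message.
RSS20_SAMPLE = """<rss version="2.0"><channel>
  <title>Gallery news</title>
  <link>http://example.org/</link>
  <description>Updates</description>
  <item>
    <title>New drawing</title>
    <link>http://example.org/works/1</link>
    <description>A new work is for sale.</description>
  </item>
</channel></rss>"""

ATOM_SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Gallery news</title>
  <entry>
    <title>New drawing</title>
    <link href="http://example.org/works/1"/>
    <summary>A new work is for sale.</summary>
  </entry>
</feed>"""


def _text(parent, tag):
    """Return the stripped text of a child element, or "" if it is missing."""
    node = parent.find(tag)
    return (node.text or "").strip() if node is not None else ""


def core_items(xml_text):
    """Map each item or entry to a dict with title, link, and description."""
    root = ET.fromstring(xml_text)
    items = []
    if root.tag == ATOM + "feed":
        for entry in root.iter(ATOM + "entry"):
            link = entry.find(ATOM + "link")  # first link only
            items.append({
                "title": _text(entry, ATOM + "title"),
                "link": link.get("href", "") if link is not None else "",
                "description": _text(entry, ATOM + "summary"),
            })
    else:
        # RSS 2.0 elements have no namespace; RSS 1.0 elements use RSS1.
        ns = RSS1 if root.tag.endswith("}RDF") else ""
        for item in root.iter(ns + "item"):
            items.append({
                "title": _text(item, ns + "title"),
                "link": _text(item, ns + "link"),
                "description": _text(item, ns + "description"),
            })
    return items


print(core_items(RSS20_SAMPLE) == core_items(ATOM_SAMPLE))
```

Once a feed has been reduced to such a neutral structure, writing it out again in another format is mostly a matter of renaming elements. Anything beyond the core fields needs format-specific treatment.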
Those who are looking for more ways to express themselves have to evaluate which of the versions offers the features they need and is at the same time supported by the software that is supposed to process the data. In terms of expressiveness, the modules of RSS 1.0 are still unmatched at the present time. Anyone who wants to offer multimedia data, for example as a podcast, depends mostly on RSS 2.0 and its extension modules. It is to be expected that the corresponding modules of both formats will soon be integrated into Atom as well.
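For the podcast case, RSS 2.0 provides the enclosure element, which attaches a media file to an item through its url, length, and type attributes. The following is a minimal sketch of such an item. The function name podcast_item, the example.org URL, and the file size are hypothetical values chosen for illustration.

```python
import xml.etree.ElementTree as ET


def podcast_item(title, audio_url, size_bytes, mime_type="audio/mpeg"):
    """Build an RSS 2.0 item that points to a media file via an enclosure."""
    item = ET.Element("item")
    ET.SubElement(item, "title").text = title
    # The enclosure carries the file's address, its size in bytes,
    # and its MIME type as attributes.
    ET.SubElement(
        item,
        "enclosure",
        url=audio_url,
        length=str(size_bytes),
        type=mime_type,
    )
    return item


episode = podcast_item("Episode 1", "http://example.org/audio/ep1.mp3", 1234567)
print(ET.tostring(episode, encoding="unicode"))
```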