Episode 5 - Semantics Part 1

Here we are, on, well, I guess my fifth episode. I had to give in to the gods of podcasting and call my introductory episode my first one as the numbering is beyond me. So no, episode one isn’t lost to the ether, I am just still learning how podcast RSS feeds work!

So today I wanted to start a dive into one of my favorite aspects of metadata – semantic capabilities. It is, in my opinion, a markedly good example of metadata put to good use. It represents the mutual agreement of participants to comply with a standard, or set of standards, put forth by a governing body, which in itself represents the mutual cooperation of various groups. Semantic capabilities rely on linked data, specifically linked open data, and the adherence of content creators to the use of that linked open data. I will dive deeper into this in a moment, but first, some history! Oh, and it’s going to be all about the web. Today is part one, where I will talk a bit about the history of the web and the evolution of HTML and web markup languages, frameworks and protocols in general. That history and context might be all that we get to today, and then a future episode will dive into some of the newer semantic wrappers, consortia, schemas and frameworks that are used to create the vision for the semantic web.


So, semantics is used here in the context of the semantic web. The semantic web is a web of linked data that allows machines to easily process information with context, without the need for human intervention. It is essentially a way of making computers “smart” by giving them context around information that they – being machines capable only of processing sets of 1s and 0s – would otherwise not be able to understand the way that a person would.

Now, I’m not here to try to bring back the early 2000s and make the semantic web sound like this cool thing of the future that is bound to happen someday. The truth is that people have been working toward the semantic web for many years now, and the inherent complexities of creating semantic context around a global, ever-increasing amount of information make the semantic web an inherently challenging thing to launch. But it is happening, and use of the technologies that would underlie it is gaining traction, even if only among a few players and even if still in a very behind-the-scenes, only-known-by-the-in-crowd kind of way.

How the Web Works

OK, so I mentioned before and should probably explain myself, that this episode is really an introduction to the history of the world wide web, and specifically the evolution of web markup and frameworks. And I really do think that an entire episode is required to introduce this information, just because knowing about how the web, and web documents and the hypertext transfer protocol (HTTP) work is foundational to having a conversation about semantic wrappers, linked data, shared schemas and all the things that make the semantic web what it is.

So the World Wide Web was invented in 1989 by Tim Berners-Lee, a physicist working at CERN (the European Particle Physics Lab). He was working in the computing services section of CERN when he came up with the idea for the two aspects of the web that would change everything, and remain in place for decades. One of my sources said that “This may seem odd” when citing that Berners-Lee worked as a physicist when he came up with this technology. But it isn’t surprising to me at all. Academic and research communities, outside of libraries of course, are the primary groups pushing a lot of good work forward for the standardization of information for mass sharing.

So although early computers couldn’t talk to each other, they could store or process information using ASCII (American Standard Code for Information Interchange), also known as "plain text." In ASCII, the numbers 0–127 are used to represent letters, numbers, and keyboard characters like A, B, C, 1, 2, 3, %, &, and @. Berners-Lee built on plain text to come up with HTTP (HyperText Transfer Protocol) and HTML (HyperText Markup Language), both of which are written as ordinary text.
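To make the idea of "plain text" a little more concrete, here is a minimal sketch in Python. The sample string is just an illustration I picked from the characters mentioned above; it shows each character being turned into its ASCII number and back again.

```python
# A few plain-text characters (chosen only for illustration).
text = "A1%&@"

# ord() gives the number that stands for each character.
codes = [ord(ch) for ch in text]
print(codes)  # [65, 49, 37, 38, 64]

# chr() goes the other way: numbers back to characters.
decoded = "".join(chr(n) for n in codes)
print(decoded)  # A1%&@
```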

HTTP is a way for two computers to exchange information through a "conversation." One computer running a program called a web browser asks the other computer called a server for the information it needs. The web browser sends requests for information and the server sends the information that it has. This is a conversation that happens through the HyperText Transfer Protocol – HTTP.
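As a rough illustration of that conversation, here is a minimal sketch in Python of what the two sides actually say to each other. Nothing here touches the network: the host name `www.example.com`, the page path, and the server's reply are all made up for the example, and `parse_response` is a hypothetical helper, not part of any library.

```python
# What the browser sends: a request written in plain text.
# (Host and path are invented for this example.)
request = (
    "GET /index.html HTTP/1.1\r\n"
    "Host: www.example.com\r\n"
    "\r\n"
)

# What a server might send back (a made-up response).
response = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html\r\n"
    "\r\n"
    "<h1>Hello</h1>"
)


def parse_response(raw):
    """Split a raw HTTP response into status code, headers, and body."""
    # A blank line separates the status line and headers from the body.
    head, _, body = raw.partition("\r\n\r\n")
    status_line, *header_lines = head.split("\r\n")
    _version, code, _reason = status_line.split(" ", 2)
    headers = dict(line.split(": ", 1) for line in header_lines)
    return int(code), headers, body


status, headers, body = parse_response(response)
print(status, headers["Content-Type"], body)  # 200 text/html <h1>Hello</h1>
```

A real browser and server exchange a lot more headers than this, but the shape of the exchange is the same: a text request goes out, and a text response with a status line, headers, and a body comes back.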

The computer also needs to be able to understand any files it receives that have been sent via HTTP. This is where HTML comes in – it provides a consistent, standard structure for documents being passed back and forth using specific tags. A Web browser can read these tags and use them to display things like bold font, italics, headings, tables, or images. The tag-pair structure of HTML was based on the existing standard for marking up text in structural units such as headings and paragraphs. That standard was called “SGML” or Standard Generalized Markup Language. HTML, of course, added the “webifying” component of the href attribute – the hypertext link to other pages of information.
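Here is a small sketch, in Python, of what "reading the tags" can look like. It uses the standard library's `html.parser` module; the sample page and the `TagLister` class are my own inventions for illustration, and a real browser does far more than this.

```python
from html.parser import HTMLParser

# A tiny, made-up HTML document.
page = """
<h1>A Brief History of the Web</h1>
<p>This part is <b>bold</b>. Read <a href="/part-two.html">part two</a>.</p>
"""


class TagLister(HTMLParser):
    """Hypothetical helper: records each opening tag and each href link."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        if tag == "a":
            # attrs is a list of (name, value) pairs for this tag.
            self.links.extend(value for name, value in attrs if name == "href")


lister = TagLister()
lister.feed(page)
print(lister.tags)   # ['h1', 'p', 'b', 'a']
print(lister.links)  # ['/part-two.html']
```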

HTTP and HTML are “how the Web works” and they remain in place to this day.

History and evolution of HTML

In its nascency, HTML was all things – it was the structural layer AND the style layer AND, to the extent it existed, the functional behavior layer. But with time, and the introduction and formal adoption of CSS and JavaScript, HTML was able to evolve. That evolution included a detour through XHTML, which was a blending of HTML and XML (a data packaging and transfer markup), and then ultimately a move back away from XML into plain HTML, which is what is in use now. So Berners-Lee began presenting his creation to the world across the internet and it began to gain traction. Browsers were developed to support it, HTML+ was defined, and conferences were held to discuss and formalize it.

As is unsurprising to me as a librarian, the primary attendees of these early conferences were members of the scientific community who had a vested interest in being able to share information. I think I have noted in the past that this remains true, with museums, cultural institutions, research and sciences being at the forefront of today’s Linked Data despite its pervasive value. Anyhow, I digress. So with the formalization of HTML and the growth of interest in the web and internet as they worked together, the web began to take on the character, the essence that we know today with the use of domains and URLs.

As time passed and different browsers added their own bits of HTML, the language began to lose its integrity – that is, it was becoming less standard. The World Wide Web Consortium was formed to take control of the language and deliver on the promise of standard markup, and over the years numerous versions of HTML were drafted and published, including HTML 2, 3, and 4, and then XHTML. Different browsers would update to support aspects of the new versions of the language, but there ended up being a lot of backward compatibility issues after so many different versions. Eventually HTML 5 was introduced, which helped with many of those backward compatibility issues, and it is the version in use as of this episode.


The Limitations of HTML

The World Wide Web as we know it today is an incredible outcome to have been invented one day by one guy, but then of course built upon and developed over decades by consortia of thinkers and innovators. But the web as it is today isn’t even as grandiose as the vision gets. Berners-Lee has a vision for a semantic web – a Web in which data in web pages is structured and tagged in such a way that it can be read directly by computers. But you may be thinking, “but Mindy – the web is already read and presented by computers! That’s what happens every time I use Google!” But it isn’t!

HTML has worked well for search engines because clever search engine creators have made use of all of those structured tags. The h1 tag indicates that the contents within that tag are the single most important declarative bits of information on the page, so if the h1 says “A Brief History of The Semantic Web” then the search engine is going to assume that THAT is what the content is about and index it accordingly. Great. Pretty smart on its own, I will agree. But that is not what the semantic web wants to do.

HTML (with its accompanying other scripts, markups and languages) allows you to present documents on the web with as much and whatever information you wish. Let’s say, for example, a roast chicken recipe. The HTML of this recipe page can make simple, document-level assertions such as "this document's title is ‘My Mom’s Roast Chicken’", but there is no capability within the HTML itself to assert unambiguously that, for example, one section of the web page is for some history or context around the recipe, why it is offered here and how the author relates to it, or which section of the page is for the ingredients list, or the directions, the nutritional information, or how many it serves. HTML can say “Make the word ‘Ingredients’ an h3 header and list the ingredients in an unordered list”, but it cannot help the machine actually know which of the words on the page is an ingredient, or differentiate the ingredient from the units, measures or pre-preparation indication of the ingredient. That has to be done with additions such as microformats, which extend the HTML syntax to create machine-readable semantic markup about objects on the webpage. These could be a person, organization, event, product or more. Initiatives include RDFa, Microdata and Schema.org.
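To give a feel for the difference, here is a minimal sketch in Python. The recipe page is invented, and it is marked up with Microdata attributes using property names from the Schema.org Recipe type (`name`, `recipeIngredient`, `recipeYield`). The `MicrodataReader` class is a hypothetical, deliberately simplified reader of my own, assuming no nested tagged elements; it is not how any particular search engine works.

```python
from html.parser import HTMLParser

# A made-up recipe page. The itemprop attributes tell a machine what
# each piece of text IS, not just how it should look.
page = """
<div itemscope itemtype="https://schema.org/Recipe">
  <h1 itemprop="name">My Mom's Roast Chicken</h1>
  <p>My mom made this every Sunday.</p>
  <h3>Ingredients</h3>
  <ul>
    <li itemprop="recipeIngredient">1 whole chicken</li>
    <li itemprop="recipeIngredient">2 tablespoons butter, softened</li>
  </ul>
  <p>Serves <span itemprop="recipeYield">4</span></p>
</div>
"""


class MicrodataReader(HTMLParser):
    """Hypothetical reader: collects the text of elements with itemprop."""

    def __init__(self):
        super().__init__()
        self.current = None  # itemprop of the element being read, if any
        self.found = {}      # property name -> list of text values

    def handle_starttag(self, tag, attrs):
        self.current = dict(attrs).get("itemprop")

    def handle_data(self, data):
        if self.current and data.strip():
            self.found.setdefault(self.current, []).append(data.strip())

    def handle_endtag(self, tag):
        self.current = None


reader = MicrodataReader()
reader.feed(page)
print(reader.found["recipeIngredient"])
# ['1 whole chicken', '2 tablespoons butter, softened']
```

Notice that the story about Sunday dinners and the word "Ingredients" are ignored, because nothing marks them as data. Only the pieces the author explicitly labeled come out the other side.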

And I am going to tell you all about those cool initiatives, next time.

So again, I wanted to give some history on HTML because I think it is important to understand that by itself, it is not meant to be, and should never be extended to try to be, a semantic markup language. But groups are coming together to work on this, not least of which is the World Wide Web Consortium. So next episode I will talk about Linked Open Data, open schemas, RDF, OWL and all the interrelated languages that go into making up the complex world of the semantic web, and in that vein, why it hasn’t caught on.

Reading Recommendation

This week I am reading and recommending The Web Was Done by Amateurs by Marco Aiello. For one thing I love the angle it comes from in looking at the history of the web, and for another it is just one of so few that are recently published. It feels like 2016 was when everyone stopped talking about the web! But it’s very cool. It talks about the technologies that the ideas for the web came from and were built upon, and how it had some serious growing pains, got lots of patches, and has come into its own as THE technology for economic and social interaction.


The World Wide Web Consortium, content derived from "Raggett on HTML 4", by Dave Raggett, Jenny Lam, Ian Alexander and Michael Kmiec: https://www.w3.org/People/Raggett/book4/ch02.html

Explain That Stuff: https://www.explainthatstuff.com/howthewebworks.html

The Wikipedia pages on:

Semantic Web: https://en.wikipedia.org/wiki/Semantic_Web

HTML: https://en.wikipedia.org/wiki/HTML