©2019 by metashop

Episode 6 - Semantics Part 2

Alright, so this is part two of my series on semantics, and specifically, the semantic web. Today I want to focus on linked data, microdata, open schemas and the like, which are making the semantic web something of a reality available to us today. Not a lot of people are taking advantage of the semantic web yet, so we can't call it fully realized, and that is because, compared to the web itself and the approachability of HTML, getting semantic... gets complicated. I think you'll agree with me once your head starts spinning as I get into some of the details.

I gave a little history of the web in the first episode of this series, before heading down the road of the history of semantics, although the two histories are arguably inextricable. Tim Berners-Lee, the inventor of the World Wide Web, has had a vision of a semantic web since the web was born back in eighty-nine. HTTP and HTML were the standard protocol and markup language (respectively) that made it possible, and both are still very much in use today. HTML in particular has come A LONG way from the standard it was in the 90s, but even now it remains a language for structuring information.

The real bread and butter of the semantic web lies in linked open data and modeling knowledge. The semantic web is a web of linked data that allows machines to easily process information with context, without the need for human intervention.

So in order to create this semantic ideal, where computers can process information without human intervention, HTML works with some new wrappers, or microformats, that help computers build context around words. They extend the HTML syntax to create machine-readable semantic markup about objects such as people, organizations, events or products. They give context around the minute details that HTML was never meant to handle. HTML has always been able to tell search engines which is the most important, second most important, third, fourth, fifth and even sixth most important header on a page. But after that, it was all paragraphs and italics or bolds. And HTML was always able to present information in a way that people could decipher. For instance, if someone was presenting a chicken recipe on their website, they could add headers, text, and lists, either ordered (1, 2, 3) or unordered (with bullets). All of that was perfectly decipherable by people, and if they used their header markup correctly, then search engines could at least figure out what the webpage was about (a roast chicken recipe, for example)... but that was about where the computer's understanding of the context of the page stopped with HTML alone. But now, with linked open data and these microformats, all that has changed.
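To make that contrast concrete, here is a minimal sketch of what plain HTML gives you for that hypothetical chicken recipe. All of the content is invented for illustration; the point is that the markup conveys structure, not meaning:

```html
<!-- Plain HTML: a person can tell this is a recipe, but a machine
     only sees a heading, a paragraph, and two lists. -->
<h1>Roast Chicken</h1>
<p>A simple weeknight roast.</p>
<h2>Ingredients</h2>
<ul>
  <li>1 whole chicken</li>
  <li>2 tbsp olive oil</li>
</ul>
<h2>Directions</h2>
<ol>
  <li>Preheat the oven to 425°F.</li>
  <li>Rub the chicken with oil and roast for about an hour.</li>
</ol>
```

Nothing here tells a computer "this is a recipe," "these are ingredients," or "these are the steps"; it only says "heading, list, list."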

 

There are four semantic annotation formats that can be used in HTML documents: Microformats, RDFa (an extension to HTML5), Microdata, and JSON-LD.

So before I go on, I want to talk about RDF, the Resource Description Framework, so that RDFa makes any sense.

  • RDF, the Resource Description Framework, is the standard model for data interchange on the web. Essentially, it links things across the web by naming the relationships between them using Uniform Resource Identifiers (URIs). You're already lost? Well, it's complex! What RDF lets you do, for the computer that is trying to understand what is on the web page, is say something to the effect of: Name: Susan <relationship type> has profession </> Profession: Chef. Your important details are Susan, the person your content is about, and the concept of a chef, the important detail you want the computer to know about Susan. Each of these is named using URIs that point back to standardized information sources detailing who Susan is and what a chef is. The computer can then use those context details to tell YOU that THIS Susan is a thing called a Chef.

    So these links, via these URIs, as modeled by RDF, form a directed, labeled graph, where the edges represent the named links between resources, and the resources themselves are the graph's nodes. If you're confounded, do a quick Google search for "graph database" or "knowledge graph" and look at the images to see a visual representation of the graph view I am talking about. That graph view is the easiest mental model for RDF, which is why it shows up in so many visual explanations.
     

  • And URIs, which, as I mentioned briefly, stands for Uniform Resource Identifiers. These are used by groups like schema.org (who build and promote "schemas for structured data on the Internet, on web pages, in email messages, and beyond" - taken straight from their website), and they provide the actual link between the wrapped information on a webpage, the schema to which the semantic/contextual information adheres, and all the information about the thing being named. That is how we can link the concepts of "Susan" and "chef" to something useful and meaningful. Otherwise a word is just a word on a page, and it doesn't mean anything to a computer. But if we tell the computer to link the word "chef" to a schema that defines a chef and gives that word meaning, then we have created meaning and context for a machine, and that machine can do something very cool with it.
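The Susan-is-a-chef statement above can be written out as actual RDF triples. Here is a sketch in Turtle, one of RDF's text serializations, using the real schema.org vocabulary but a made-up example URI for Susan herself:

```turtle
@prefix schema: <https://schema.org/> .

# Each statement is a triple: subject -> predicate -> object.
<https://example.com/people/susan>
    a schema:Person ;          # Susan is a thing of type Person
    schema:name "Susan" ;      # her name, as a plain string
    schema:jobTitle "Chef" .   # the relationship to the concept "chef"
```

The `schema:` prefix is the URI link back to schema.org, which is what gives each property a standardized, machine-readable definition.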
     

So it's kind of wild. And that's a bit of the reason why the semantic web hasn't really taken off, and certainly didn't back in the 90s, when flaming tiger gifs were considered a way to add pizzazz to a web page. Like, schema.org provides the vocabulary, but then you have to wrap it in a microformat like RDFa or JSON-LD. You might also make use of Facebook's Open Graph markup in order to be recognized by them.
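For instance, here's a rough sketch of what wrapping that Susan example in RDFa with the schema.org vocabulary might look like. The visible content is invented; the attributes are the real RDFa Lite ones:

```html
<!-- RDFa Lite attributes (vocab, typeof, property) layered onto
     ordinary HTML; the visible text is unchanged for human readers. -->
<p vocab="https://schema.org/" typeof="Person">
  <span property="name">Susan</span> is a
  <span property="jobTitle">Chef</span>.
</p>
```

A person reading the page just sees "Susan is a Chef," while a machine can extract a typed Person with a name and a job title.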

I don't want to relegate JSON-LD to a sidenote, but I am also not going to spend a ton of time on it. JSON-LD stands for JavaScript Object Notation - Linked Data. So it is another way to wrap information, link it to a definition or meaning, and create semantic context. It uses JSON syntax, with special keywords, to represent both simple and complex content. But suffice it to say, it creates a lot of good context and is an excellent choice for wrapping your data.
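To give you a feel for it, here's a minimal JSON-LD sketch of the same Susan example, the kind of script block you would drop into a page's HTML. The names are invented; `@context` and `@type` are real JSON-LD keywords:

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Person",
  "name": "Susan",
  "jobTitle": "Chef"
}
</script>
```

Note the design difference from RDFa: the JSON-LD block lives apart from the visible content instead of being woven into it, which is part of why it is so easy to add to an existing page.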

So, if you want to see an example and you aren't terribly afraid of looking at some HTML, do a quick Google search for "Roast Chicken Recipe." Now, if you're on Google and your search history isn't something really crazy, then you should get back a semantic search result from AllRecipes.com that highlights the first three steps in the directions for a "Juicy Roasted Chicken Recipe." This is because AllRecipes has described their content using schema.org URIs. Don't believe me? Right-click on the page and select "View page source" (this doesn't always work for me in Safari, so you may wanna try it in Chrome), then do a "control+F" search for "schema.org" and scroll through the 34 instances. You'll see that they capture nutrition information, ratings and reviews, the recipe information and some other objects on the page.
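To get a sense of the kind of markup that powers those rich results, here is a stripped-down, hypothetical sketch of recipe structured data in JSON-LD using real schema.org types; this is not AllRecipes' actual markup, and all the values are invented:

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Recipe",
  "name": "Juicy Roasted Chicken",
  "recipeIngredient": ["1 whole chicken", "2 tbsp olive oil"],
  "recipeInstructions": [
    { "@type": "HowToStep", "text": "Preheat the oven to 425°F." },
    { "@type": "HowToStep", "text": "Rub the chicken with oil and roast." }
  ],
  "nutrition": {
    "@type": "NutritionInformation",
    "calories": "420 calories"
  },
  "aggregateRating": {
    "@type": "AggregateRating",
    "ratingValue": "4.7",
    "ratingCount": "3000"
  }
}
</script>
```

This is exactly the shape of thing a search engine can read to pull out those highlighted steps, star ratings, and nutrition facts without any human intervention.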

And this also partially explains why you always get back AllRecipes results when you search for recipes online. Of course their well-known and trusted brand accounts for quite a lot, but this semantic wrapping of their information gives Google one more layer of trust and insight into the content of AllRecipes pages, content that they know their algorithms can work with. To be honest, I am always a bit blown away when a client tells me they want to improve their metadata and their SEO, but then, when I mention JSON-LD, they say it is not a priority. Don't get me wrong, I understand that getting to a baseline is a requirement before jumping to semantics, but I also struggle to see how the formats that are most likely to actually impact the bottom line are still being ignored. But again, it's that level-of-effort versus promise-of-reward thing, which is why semantics are still slow on the uptake.

So my next installment will dive a little deeper into semantic schemas, where to find them, how to use them, and the ways that you can see semantic data represented on the web, and maybe even talk about some tools that can help you get to semantics without having a deep technical development background.

Reading Recommendation

I am recommending a book called The Semantic Web Explained by Péter Szeredi, Gergely Lukácsy and Tamás Benkő. I like this book because it is meant for the non-developer while also serving as a textbook introduction to web semantics technologies. It introduces the web in its current state as of the time of writing (three years ago now, agonizingly long in tech terms, but again, people aren't talking about semantics as much anymore for fear of sounding passé), and then dives quite intricately into the mathematics around Description Logic and web ontologies. It isn't a very gentle book, I'll be honest. But nothing about the semantic web really is; again, that is sort of why it hasn't taken off. It's just not as approachable as the web itself was, and people tend to get daunted when they try to discover it. It's also not very easy to find recent books on the subject. People are, sadly, giving up, just as we're getting the gumption to really get going with this stuff.

Sources

It’s interesting how old all of the hubbub about the semantic web is. Folks have given up on it despite the fact that we’re really entering the age in which I think we’re becoming ready to commit to the efforts required for it. We’re finally at a place with enough giant shoulders to stand upon and enough amassed smart data to be able to make something with it. I dunno, that’s my take.

Head on over to inevermetadata.com to find a transcript for this episode, the reading recommendation link where you can support the show by purchasing through my link, and you'll also find links to my sources on the episode page at the bottom of the transcript. You can email me at inevermetadata@gmail.com with questions, concerns or corrections. I know my audience is still pretty small, but I feel like I have to have made SOME errors by now. Correct me!