Semantic Lawyering: How the Semantic Web Will Transform the Practice of Law (Part 2)

(If you missed part 1 of the series, check it out here.)

What is the Semantic Web?

The Semantic Web is a way of making data smart. The idea is that, rather than building smart applications to analyze “dumb” data, you make the data smart in the first place. The problem with dumb data is that applications have only a limited ability to make sense of human language. Currently, the information in most web pages and text documents is human language, encoded in data formats that tell computers nothing about its meaning. The standards at the core of the Semantic Web provide data formats that make the meaning of information explicit.

Dumb data vs. smart data

So how is this done? What differentiates smart data from dumb data? If you view the source code of this web page (try it – it’s in View > Source in Explorer, View > Page Source in Firefox, or View > View Source in Safari), you will see some text and a lot of “tags” between angle brackets, such as “<p>” and “<div id=‘header’>.” This is HTML, the mark-up language in which most information currently on the World Wide Web is encoded. It tells your browser how to display the text and images, and where to go when you click on a link – but not much else. Information encoded in plain HTML is dumb data. Let’s consider an example. In HTML, you might have the following text:

<p>Sun is a subsidiary of Oracle.</p>

The HTML tells your browser that text enclosed between the opening tag “<p>” and the closing tag “</p>” should be displayed as a single paragraph, and nothing more. A simple search engine might hit on this sentence even if I intended to search for the “sun,” as in the sun in the sky, or an “oracle,” as in the Oracle of Delphi. An application with advanced language-processing abilities might be able to deduce from the absence of an article (“a” or “the”) that “Sun” and “Oracle” are names. It might also deduce from the mention of “subsidiary” that the sentence in fact refers to names of corporations. In the current state of technology, this is likely to be a hit-and-miss process.
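
To see why this is hit-and-miss, here is a minimal sketch of the kind of simple keyword search described above. The three-sentence corpus and the function name keyword_search are made up for illustration; real search engines are far more sophisticated, but the underlying ambiguity is the same.

```python
import re

# Hypothetical mini-corpus: unrelated sentences that share surface words.
documents = [
    "Sun is a subsidiary of Oracle.",
    "The sun rises in the east.",
    "The Oracle of Delphi spoke in riddles.",
]

def keyword_search(query, docs):
    """Return every document containing the query as a whole word, ignoring case."""
    pattern = re.compile(r"\b" + re.escape(query) + r"\b", re.IGNORECASE)
    return [doc for doc in docs if pattern.search(doc)]

# A search for the sun in the sky also returns the sentence about the company.
hits = keyword_search("sun", documents)
for doc in hits:
    print(doc)
```

The search has no way to tell the company from the star, because nothing in the data says which is which.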

Making data smart

The idea behind the Semantic Web is to attach to information machine-readable metadata (data about data) that any Semantic Web application can interpret. To better understand what this involves, imagine a mark-up language that lets you specify what the things being referred to are. Imagine that this mark-up language enabled you to add tags to your data to specify things like:

<item this is a corporation> Sun </item>

<item this is a legal relationship between two corporations> is a subsidiary of </item>

<item this is a corporation> Oracle </item>
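
The imaginary tags above can be sketched in code as a statement with three labeled parts. This is only an illustration of the idea, not an actual Semantic Web format; the class names Entity and Statement and the function refers_to are my own inventions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Entity:
    name: str
    kind: str  # what sort of thing the name refers to

@dataclass(frozen=True)
class Statement:
    subject: Entity
    predicate: str
    obj: Entity

sun = Entity("Sun", "corporation")
oracle = Entity("Oracle", "corporation")
statement = Statement(sun, "is a subsidiary of", oracle)

def refers_to(stmt, name, kind):
    """True only if the statement mentions something with this name AND this kind."""
    return any(
        entity.name.lower() == name.lower() and entity.kind == kind
        for entity in (stmt.subject, stmt.obj)
    )

print(refers_to(statement, "sun", "corporation"))  # the company: a match
print(refers_to(statement, "sun", "star"))         # the sun in the sky: no match
```

Because each name carries a label saying what it is, a search for the star no longer hits on the sentence about the company.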

Even better, imagine that, rather than just labeling things, you could refer to a source of information on the web that tells you more about each of these things, e.g.:

<item see> Oracle </item>

The link referred to is a “resource” – a bundle of data available online that describes something. This resource contains data, encoded in a machine-readable format, which might state that Oracle is a Delaware corporation, that it is headquartered in Redwood City, California, that the current CEO is Larry Ellison, etc.

Now let’s take this one step further, and imagine that, when that “Oracle” resource states that Oracle is a “Delaware corporation,” it in turn refers to an online resource that defines the term “Delaware corporation.” That definition might specify that a Delaware corporation is a kind of legal person, that it should have a certificate of incorporation, bylaws, a board of directors, etc. Of course, these statements would also be machine-readable, and could in turn refer to other resources (defining “legal person,” “certificate of incorporation,” “board of directors,” etc.).
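
A rough sketch of this chain of references, using a plain lookup table in place of the web. The example.org addresses and the field names are placeholders I have assumed for illustration; they are not real published resources.

```python
# Hypothetical resources keyed by made-up example.org addresses.
RESOURCES = {
    "http://example.org/company/Oracle": {
        "name": "Oracle",
        "type": "http://example.org/terms/DelawareCorporation",
        "headquarters": "Redwood City, California",
    },
    "http://example.org/terms/DelawareCorporation": {
        "name": "Delaware corporation",
        "type": "http://example.org/terms/LegalPerson",
        "requires": ["certificate of incorporation", "bylaws", "board of directors"],
    },
    "http://example.org/terms/LegalPerson": {
        "name": "legal person",
    },
}

def requirements_for(uri, resources):
    """Follow the resource's 'type' link and return what that type requires."""
    type_uri = resources[uri].get("type")
    type_resource = resources.get(type_uri, {})
    return type_resource.get("requires", [])

oracle_requirements = requirements_for("http://example.org/company/Oracle", RESOURCES)
print(oracle_requirements)
```

Note that the “Oracle” resource itself says nothing about bylaws or a board of directors. The application learns that by following the reference to the definition of “Delaware corporation.”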

Classifications and rules

Where does it all end? It ends with “thing.” That is, a “corporation” is a “legal person,” which is a kind of “person,” which is a kind of “thing.” A “certificate of incorporation” is a “legal document,” which is a kind of “document,” which is a kind of “thing.” Everything is a thing, and so every “resource” is a kind of thing, which fits into a classification of things (a taxonomy). One of the most important aspects of the Semantic Web is defining taxonomies of different kinds of things using machine-readable formats. There is no need for a single, all-encompassing taxonomy that defines every possible thing: partial taxonomies can define a few terms by referring to other taxonomies, and all of these interlinked taxonomies ultimately refer to the most general standards (remember, this can be done because they are all online).
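
The two chains just described can be written down as a small table in which each term points to the more general term above it. This is a toy sketch under my own naming (PARENT, ancestors, is_a), not one of the actual Semantic Web formats.

```python
# Each term points to the more general term directly above it.
PARENT = {
    "corporation": "legal person",
    "legal person": "person",
    "person": "thing",
    "certificate of incorporation": "legal document",
    "legal document": "document",
    "document": "thing",
}

def ancestors(term):
    """Walk upward from a term until no more general term is defined."""
    path = [term]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def is_a(term, category):
    """True if the category appears anywhere on the path from the term upward."""
    return category in ancestors(term)

print(ancestors("corporation"))
print(is_a("certificate of incorporation", "document"))
print(is_a("corporation", "document"))
```

Both chains end at “thing,” yet an application can still tell that a corporation is a kind of person and not a kind of document.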

The Semantic Web also goes beyond mere classifications, allowing you to specify rules for each kind of thing. For example, you could specify that a “director” of a “Delaware corporation” can be a natural person, but cannot be a legal person. You could specify that the property (predicate) of “having a subsidiary” must have a corporation as its subject and another, different corporation as its object.
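
The two example rules above might be checked along the following lines. This is a simplified sketch: the table of facts, the name “Jane Doe,” and the function names are hypothetical, and real Semantic Web rules are written in dedicated standard formats rather than in Python.

```python
# Hypothetical facts: the kind of thing each name refers to.
KIND = {
    "Oracle": "corporation",
    "Sun": "corporation",
    "Jane Doe": "natural person",
}

def check_has_subsidiary(subject, obj):
    """Rule: subject and object must both be corporations, and must differ."""
    errors = []
    if KIND.get(subject) != "corporation":
        errors.append(f"{subject} is not a corporation")
    if KIND.get(obj) != "corporation":
        errors.append(f"{obj} is not a corporation")
    if subject == obj:
        errors.append("a corporation cannot be its own subsidiary")
    return errors

def check_director(director):
    """Rule: a director must be a natural person, not a legal person."""
    if KIND.get(director) != "natural person":
        return [f"{director} is not a natural person"]
    return []

print(check_has_subsidiary("Oracle", "Sun"))     # no violations
print(check_has_subsidiary("Oracle", "Oracle"))  # violates the "different corporation" rule
print(check_director("Sun"))                     # a corporation cannot be a director
```

An empty list means the statement is consistent with the rules; anything else is a description of what went wrong.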

The foregoing does not purport to be a technical exposition of the Semantic Web, but I hope you get the idea. The core of the Semantic Web is a set of precisely defined standards that can be used to make data smarter by making explicit the underlying structure of the information.[1] Online classifications and rules enable applications to identify and analyze the data in much greater depth and with much greater precision than existing alternative technologies.

The state of the technology

Not all of the pieces of the system outlined above are in place. The basic standards of the Semantic Web, including the Resource Description Framework (RDF) and the Web Ontology Language (OWL), are by now reasonably mature and stable. However, there is still a good deal of work to be done and problems to be ironed out before the vision of the Semantic Web becomes a reality (see here and here). Nevertheless, an increasing number of big names have been adopting Semantic Web standards to structure their data (New York Times, BBC, Thomson Reuters). Identifying the real-world implications of the Semantic Web is no longer science fiction, even for the legal industry.

(Next up: Part 3 – A Machine Readable Version of the Law?)

[1] Siegel, Pull: The Power of the Semantic Web to Transform Your Business, p. 13.

About the Author

Brian Harley

Brian is an LLM student at Columbia Law School.
  • Luke P

    As a web developer I find it fascinating to view the practical outworkings of semantic web from the other side of the fence. Good read, thank you.

