(2022-02-11) Taylor Linked Data Presentation
1 preamble (0:00-0:45)
not just documents and files, but representations of people, organizations, places, things
some of these entities are virtual, like a company, or the concept of the colour red…
1.1 what is data, anyway? (00:45-2:09)
a datum (a word ordinary people don't often hear) can be understood as a specific fact, measurement, assertion, or claim about a thing, that is to say, about an entity.
so much information we are concerned with nowadays takes the form of symbolic representations of entities of all sorts
1.2 so what's the problem? (2:09-4:34)
we process information at different places and times
correctness is typically maintained by nominating a single authoritative source for a given piece of information,
different assertions about the same entities could themselves be spread across disparate information systems
in order to reason, computationally, about certain entities, it may be necessary to integrate information from two or more sources into a single working set.
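a minimal sketch of what "integrating into a single working set" might look like. the two systems, their field names, and the `integrate` helper are all hypothetical, and the sketch quietly assumes both systems already agree on the key "bob", which is exactly the assumption the rest of this talk puts under pressure.

```python
# Two hypothetical information systems, each holding different
# assertions about the same entity.
hr_system = {"bob": {"name": "Bob", "department": "Sales"}}
crm_system = {"bob": {"favourite_snack": "cookies"}}


def integrate(*sources):
    """Merge per-entity facts from several sources into one working set."""
    working_set = {}
    for source in sources:
        for entity_id, facts in source.items():
            # Later sources overwrite earlier ones on conflicting keys.
            working_set.setdefault(entity_id, {}).update(facts)
    return working_set


working_set = integrate(hr_system, crm_system)
print(working_set["bob"])
```

the merge only works because the entity identifiers happen to line up; nothing in the data itself guarantees that.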
2 bob likes cookies (or: the basic problem of representation; 4:34-)
take the following statement: "Bob likes cookies"
In order to try to reason over this string of letters, we need to turn it into a formal representation:
likes: Bob -> Cookies (draw this as a graph)
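the graph above can be sketched in code as a labelled edge: a (subject, predicate, object) triple. this is only one possible encoding, and the plain strings and the `objects` helper are placeholders invented for illustration.

```python
# One statement as a (subject, predicate, object) triple.
# The strings are bare placeholders: nothing here says which Bob,
# which kind of cookies, or what "likes" means.
statement = ("Bob", "likes", "Cookies")

# A graph is then just a collection of such labelled edges.
graph = {statement}


def objects(graph, subject, predicate):
    """Return everything `subject` points to via `predicate`."""
    return {o for (s, p, o) in graph if s == subject and p == predicate}


print(objects(graph, "Bob", "likes"))
```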
2.1 some questions (5:12-6:22)
which Bob are we talking about?
what are cookies?
are we talking about the particular form of baked good?
or do we mean the pieces of data that get sent to your web browser?
what does it mean for Bob to 'like' something?
3 a more likely scenario (6:22-)
"Bob likes cookies." is a single statement: a single datum, a claim about the entity Bob.
a familiar representation of such a bundling might be something that looks like a spreadsheet:
this two-dimensional representation is actually an illusion anyway: it's a one-dimensional sequence of one-dimensional sequences.
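to make the "sequence of sequences" point concrete, here is a small sketch. the rows and column names are invented for illustration.

```python
# A hypothetical spreadsheet: a list of rows, each row a list of cells.
table = [
    ["name", "likes"],
    ["Bob", "cookies"],
    ["Alice", "tea"],
]

header, *rows = table

# "Two-dimensional" access is really two one-dimensional lookups in a row.
cell = table[1][1]

# Flattened, the same data is a single run of values plus a row width.
flat = [value for row in table for value in row]
width = len(header)
same_cell = flat[1 * width + 1]

print(cell, same_cell)
```

the grid is a convenience for the eye; the machine only ever sees positions within sequences.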
3.1 let's go shopping (for math)
it's the math people who always have the best structures.
there are a number of ways to represent these structures mathematically, but a particularly appropriate one is a thing called a tuple:
think single, double, triple, quadruple, quintuple, et cetera (triple-store)
many programming languages make this kind of structure more convenient by providing a mapping type,
"dictionary"
"hash"
inside a single program running on a single computer, these structures can take any form that's most convenient for the program's execution.
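a short sketch of the difference between a tuple and a mapping type, using Python's `tuple` and `dict`. the field names are hypothetical.

```python
# A 2-tuple: what each value means depends entirely on its position.
row = ("Bob", "cookies")

# Hypothetical labels for those positions.
fields = ("name", "likes")

# A mapping (Python calls it a dict) attaches each value to a label instead.
record = dict(zip(fields, row))

# Same information, two in-memory shapes.
by_position = row[1]
by_label = record["likes"]

print(record)
```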
if you need to move these things from one computer to another (or to a different program on the same computer, or to the same program that needs to exit and start again!) they have to be piled together into what amounts to a set of instructions for recreating the structure and its semantics internally.
3.2 fuckin json how does it work
serialization
3.2.1 baby's first json
this brings us back to the generalized Bob-likes-cookies problem: you can ingest this run of text into a program and it will generate the corresponding structure, but the structure will be useless to the program unless the program "knows" what it's looking at.
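a minimal sketch of that round trip, using Python's standard `json` module. the record itself is made up for the example.

```python
import json

# A hypothetical in-memory structure.
record = {"name": "Bob", "likes": ["cookies"]}

# Serialize: turn the structure into a run of text that can leave the program.
text = json.dumps(record)

# Deserialize: another program (or this one, later) rebuilds the structure.
restored = json.loads(text)

print(text)
print(restored == record)
```

the structure survives the trip intact, but nothing in the text says what "name" or "likes" is supposed to mean, or which Bob this is.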
there are a zillion problems with this, a number of which can be yoked together by contemplating what happens when we can't rely on where it came from to identify it.
3.2.2 the symbol management problem
let's focus our attention for a moment on these left-hand-side labels
these may be suitable for labels, at least in English, but are altogether unsuitable as identifiers:
the requirement on the part of the computer is that they are unique, which in practice also entails that they are exact.
the way to deal with identifiers is to put them into what is called a controlled vocabulary, which is exactly what it sounds like: a very strict dictionary of terms and their very specific meanings.
there are a number of conventional styles for dealing with multi-word identifiers given the aforementioned constraints:
this is called camel case
but there is also pothole or snake case which uses underscores
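as a rough sketch, here is how a human-readable label might be turned into an identifier in either style. the label and both helper functions are invented for illustration, and they assume the label is a non-empty run of space-separated words.

```python
def to_camel_case(label):
    """'Favourite snack food' -> 'favouriteSnackFood'"""
    head, *tail = label.lower().split()
    return head + "".join(word.capitalize() for word in tail)


def to_snake_case(label):
    """'Favourite snack food' -> 'favourite_snack_food'"""
    return "_".join(label.lower().split())


label = "Favourite snack food"  # hypothetical label
camel = to_camel_case(label)
snake = to_snake_case(label)

print(camel, snake)
```

either style yields something exact enough for a computer to compare, which the original label, with its spaces and capitalization, is not.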
3.3 remember what i was saying about uniqueness being important
so, you pile together all your terminology into a controlled vocabulary, you put that online in some ad-hoc form, and you repeat that process for every information system you create.
and everybody on the consuming side has to do this for your system and every other system they consume data from, and their own system.
3.4 who owns a word, anyway?
this is the kind of issue standards bodies, or at least industry consortia, exist to settle.
3.5 finally, fiiiiiiinally…
what if there was a way to create these controlled vocabularies unilaterally and put them online for whoever wants to use them?
it turns out that standardizing this capability has been in the works since 1996
this family of standards is called RDF, or more informally, linked data.
3.6 what does it consist of?
you take these identifiers and turn them into Web addresses
if the vocabulary authors follow best practices, the terms are de facto links to their own documentation and formal, machine-readable specification
I should also remark that we can give the same URL treatment to the identity of the resource itself along with the type of entity it's supposed to represent.
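a minimal sketch of the idea, not of any particular RDF library. the vocabulary namespace, the subject URL, the `Person` type, and the `expand` helper are all hypothetical; the only real identifier is the standard `rdf:type` URL.

```python
# The standard RDF term for "this resource is an instance of that type".
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical vocabulary namespace and resource identity.
VOCAB = "https://vocab.example/ns#"
BOB = "https://data.example/people/bob"

record = {"name": "Bob", "likes": "cookies"}


def expand(record, subject, vocab, type_term):
    """Turn short keys into full Web addresses, one statement per key."""
    triples = [(subject, RDF_TYPE, vocab + type_term)]
    triples += [(subject, vocab + key, value) for key, value in record.items()]
    return triples


triples = expand(record, BOB, VOCAB, "Person")
for triple in triples:
    print(triple)
```

once the keys are full Web addresses, two systems that use the same address mean the same thing by it, and two that merely share the word "likes" do not collide.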
3.7 so why aren't these standards being used more broadly?
3.8 values lol
the principal business model of silicon valley (and its would-be mimics elsewhere in the world) is precisely to build silos in order to hoard and arbitrage data
3.9 what if you didn't care about hoarding data?
4 wrap it up, nice and neat
just like the free and open-source software movements of previous decades, the radical decoupling of data from data silos is a political agenda.