Today I’d like to tackle the first “fundamental” question I raised in my November 23rd post: who will label all this data?
A big driver for “who” will label the data is “what” is to be labeled.
There is not just one kind of data out there, and for the purpose of
metadata creation I’d distinguish between at least 4 types of data:
basic building blocks (e.g. sentences in a text document), structured
fragments of documents (e.g. a paragraph), self-standing documents
(e.g. a speech), and groups of documents (e.g. set of conference
speeches). I have synthesized this in the slide below.
Each one of these data types calls for a different type of metadata.
Metadata for documents and groups of documents is mostly going to be
used as is to organize these documents and return search results to a
human user. This metadata needs to be provided in a synthesized format,
usually in the form of a few keywords or expressions. Standardization
of this metadata can remain relatively limited, as machines only need to
match these text strings in a mostly straightforward manner.
On the other hand, metadata for what I dubbed ‘building blocks’, the
most basic structured unit in a document, will be highly standardized
in order to be processed by algorithms, which will weave blocks together by relying on metadata and, if all goes well, turn all this into
‘intelligent’ answers. This metadata therefore is purely designed for
machine use.
Metadata for ‘structured fragments’ lies in between that for documents and for 'building blocks', as it can be
leveraged for direct human use or for machine processing, depending on
the need. Generally, however, I’d see it more aimed at human
use, due to the lack of standardization of the underlying data (the
computer will likely need to go down one level and still process the metadata for
building blocks to make sense of it all.)
So are machines better equipped than humans to create that lower-level
metadata for machine use? Just looking at the sheer volume of metadata
to be created, one would hope so. Indeed, the volume of metadata to be
generated is inversely proportional to the level of the data it relates to. This is evident: labeling each sentence in a document will
generate much larger volumes of metadata than tagging the overall
document. See the slide below for illustration.
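To put rough numbers on that volume argument, here is a minimal sketch that counts how many items would need a label at each of the four levels for a toy set of speeches. The function name `count_metadata_units`, the sample texts, and the naive paragraph and sentence splitting rules are all my own assumptions for illustration, not a real segmentation method.

```python
import re

def count_metadata_units(documents):
    """Count how many items would need a label at each level.

    Assumptions (illustrative only): paragraphs are separated by a blank
    line, and sentences end with '.', '!' or '?' followed by whitespace.
    """
    counts = {"group": 1, "document": 0, "fragment": 0, "building_block": 0}
    for doc in documents:
        counts["document"] += 1
        paragraphs = [p for p in doc.split("\n\n") if p.strip()]
        counts["fragment"] += len(paragraphs)
        for paragraph in paragraphs:
            # Split after sentence-ending punctuation.
            sentences = [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]
            counts["building_block"] += len(sentences)
    return counts

# Two made-up "conference speeches" forming one group of documents.
speeches = [
    "Welcome everyone. Thank you for coming.\n\nLet us begin. Metadata matters. It is everywhere.",
    "Search is hard. Labels help.\n\nHumans label well. Machines label fast.",
]

print(count_metadata_units(speeches))
```

Even in this tiny example the number of labels grows at every step down: one group, two documents, four paragraphs, nine sentences.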
Unfortunately, one problem remains with algorithms: accuracy. How accurate are the
metadata-weaving algorithms today? Overall, not very. To be accurate,
algorithms need to focus on a very small part of the problem: for
instance, recognizing addresses, or people, or events in a document,
and generating RDF metadata for them.
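As an illustration of how narrow such a task has to be, here is a rough sketch that only recognizes e-mail addresses (my assumption for what "addresses" could mean here) and emits one RDF statement per address in N-Triples syntax. The predicate URI under `example.org` and the function name `emails_to_ntriples` are hypothetical placeholders, not part of any real vocabulary or library, and the regular expression is deliberately simplistic.

```python
import re

# Deliberately simple pattern; real-world address matching is much messier.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")

# Hypothetical predicate, invented for this example only.
MENTIONS_ADDRESS = "http://example.org/vocab#mentionsAddress"

def emails_to_ntriples(text, document_uri):
    """Return one N-Triples line per distinct e-mail address found in text."""
    triples = []
    for address in sorted(set(EMAIL_PATTERN.findall(text))):
        triples.append(f"<{document_uri}> <{MENTIONS_ADDRESS}> <mailto:{address}> .")
    return triples

sample = "Contact jane@example.org or bob@mail.example.com. Again: jane@example.org."
for line in emails_to_ntriples(sample, "http://example.org/doc/1"):
    print(line)
```

The recognizer is reliable only because the target is so tightly defined; it says nothing about people, events, or what the document is actually about.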
But algorithms are fast improving. So I expect machines to
progressively climb up the metadata food chain. It is possible that
they will not even do this in the anticipated order. Algorithms may
emerge that tag documents accurately before they can overlay
metadata on things like sentences accurately.
How fast will all of that metadata automation happen?
Here is where I part ways with many out there…
A lot of folks in the space seem to ask themselves, optimistically,
how best to automate the task of building metadata, and not really how much
the task can be automated within their relevant timeframe. They work on
replacing user input as much as possible with mathematical models,
and expect those models to be ready in six months or a year, when most
likely they will require another 5 or 10 years of effort to get
anywhere practical - if they do get there. By focusing instead on
building systems that best stimulate, aggregate and synthesize user
inputs (ironically, meta-systems!), they could within a year or so deliver a working solution, and then build on that potential success to
gradually increase the level of automation in their application.
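To make "aggregate and synthesize user inputs" a bit more concrete, here is a minimal sketch of one possible approach: collect free-form tags from several users for the same document and keep only those that a large enough share of users agree on. The function name `aggregate_tags`, the `min_share` threshold, and the sample users are assumptions of mine for illustration; a real system would need far more than this (spam handling, synonyms, weighting by user expertise, and so on).

```python
from collections import Counter

def aggregate_tags(user_tags, min_share=0.5):
    """Keep tags proposed by at least `min_share` of the contributing users.

    `user_tags` maps a user name to the list of tags that user proposed.
    Tags are normalized (trimmed, lower-cased) and counted once per user.
    """
    n_users = len(user_tags)
    if n_users == 0:
        return []
    votes = Counter()
    for tags in user_tags.values():
        # A set ensures each user votes at most once per tag.
        votes.update({tag.strip().lower() for tag in tags if tag.strip()})
    return sorted(tag for tag, n in votes.items() if n / n_users >= min_share)

# Made-up input from three hypothetical users tagging the same document.
inputs = {
    "alice": ["Metadata", "search"],
    "bob": ["metadata", "semantic web"],
    "carol": ["metadata ", "Search", "rdf"],
}

print(aggregate_tags(inputs))
```

With the default threshold, only "metadata" and "search" survive, since the other tags were each proposed by a single user.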
In sum, I suggest here that solutions that (intelligently) incorporate human input
further will perform better over time. We need a
healthier balance between human input and automatic metadata
production. Given the poor performance of current metadata
applications, focusing on algorithms that enhance the collection of
user input and learn from it, rather than extracting metadata
from the data itself in isolation, is a better investment of one’s time.
Will the
differential between human performance and machine performance likely
remain wide enough to justify the investments in collecting human input
for years to come? A multibillion-dollar question, but I’d bet that it will.
That is because metadata will likely become increasingly user-driven,
dynamic and volatile, in line with ever faster-changing user
needs and mental frameworks. As long as the ultimate consumer of all
that metadata remains human, algorithms will need our inputs. So
building capabilities in the “wisdom of crowds” area today can only
help position you better in the space tomorrow.
Of course, it can be said almost with certainty that at some point,
user input collection will be fully automated and transparent, and
machines will create metadata with higher accuracy and speed levels
than would be possible through human processing. As of today though, no
algorithm I have come across has proved capable of overlaying metadata
accurately and comprehensively without extensive human input. We seem to be years away from intelligent systems that will "get it". And guess what? Getting to those systems will require the same thing as trying to do without them: focus on better ways to
stimulate, aggregate and synthesize user inputs!
In a future post I’ll attempt to look into (1) which "users" will provide those inputs (programmers, experts, mainstream users?) and (2) how user input can
be collected and integrated into metadata-generating solutions.


