The Truth about Content
August 17, 2009
Well now that I have thrown down the gauntlet in my previous post about the Trials and Tribulations of Content Management, I feel that I should take the next step and posit some core definitions for terms such as content, information, data, publishing and even knowledge. Hopefully, others will contribute superior alternatives or hack these contributions to pieces so as to expose something better.
And why not start with the term content? There is an entire industry dedicated to its management, so we would hope that it is a term for which we have a ready definition. In reality this does not seem to be the case. So let’s toss one out.
Content is potential information.
Content is the raw material from which information is fashioned. It encompasses all of the constituent pieces that, when assembled together, constitute a specific information transaction. So it is that when we undertake a content analysis we are in effect rewinding the communication process that produces the information product. Working backwards from the envisioned audience and the intended outcome, we itemize the many pieces that will need to come together to forge an effective information event. Many of these pieces are recognizable to practitioners in the content management and publishing field – text components, media assets, assembly maps, governing models, processing rules, audience profiles, personalization filters, formatting stylesheets, metadata structures, relationship links, security controls and probably others. All must come together to form an information output which, in being delivered, becomes an information transaction.
This leads us to a point where a second definition is needed – one for information. In many environments, the terms content and information are used almost interchangeably but not here because to do so would be to conceal an all-important difference.
Information is the meaningful organization of data, communicated in a specific context and with the purpose of informing others and thereby influencing their actions.
In other words, information is transactional. It is an action. A given information product will therefore have specific goals, take on a specific physical form, occur in a specific context, and be exchanged between specific participants. Information is also authoritative, meaning that someone is responsible for it just as someone is responsible for an action. This fact explains why information management has been receiving so much attention in this time of heightened accountability. Information, it is important to note, is formatted and physical – it has been fashioned into a form that is fit for its purpose whether that be as a book, a text message, a voice prompt or a web page. The transactional unit of information, the artifact that is ultimately encountered, is known as a document.
Once enacted, information, as we well know, persists and accumulates - in some ways taking on a life of its own quite separate from the context in which it originally occurred. Past information transactions, as an illustration of this process, become inputs to, and reference points for, content creators and information actors. As information events are actions they will give rise to consequences and this experience can become part of what is known about each information transaction (its outcome or results). In this way, past information events become a source of content that can inform, and potentially improve the effectiveness of, future information transactions.
So how does content become information? How does it change from being merely the potential for action to being a real event? By what mechanism do all the constituent pieces assemble together to form an information transaction that will suit a specific circumstance? This is the role of publishing.
Publishing is the process of transforming content into information, of converting the latent assets into concrete actions that will inform people, impact performance and influence outcomes.
As a sweeping generalization, I would be inclined to say that publishing technology stands as the most challenging aspect of the entire content management domain. The variety and sophistication of the transformations necessary to support the increasingly wide range of information transactions that organizations seem to need today represent a very demanding requirement, especially when everyone seems to want access to relevant information at the touch of a button.
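To make the definition of publishing a little more tangible, here is a deliberately tiny sketch of content being transformed into a document for a specific context. Everything in it is an assumption made for illustration: the names `Content`, `Context` and `publish`, the three component types chosen, and the one-line stand-in for a stylesheet are hypothetical and do not describe any particular product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Content:
    """Potential information: components waiting to be assembled (hypothetical model)."""
    text_components: dict[str, str]        # component id -> text
    assembly_map: list[str]                # order in which components are assembled
    audience_filters: dict[str, set[str]]  # component id -> audiences allowed to see it


@dataclass(frozen=True)
class Context:
    """The specific circumstance of one information transaction (hypothetical model)."""
    audience: str
    medium: str  # e.g. "web" or "sms"


def publish(content: Content, context: Context) -> str:
    """Transform content into a document suited to one context."""
    selected = [
        content.text_components[cid]
        for cid in content.assembly_map
        # A component with no filter entry is treated as visible to everyone.
        if context.audience in content.audience_filters.get(cid, {context.audience})
    ]
    # A stand-in for a formatting stylesheet: one rendition rule per medium.
    separator = "\n\n" if context.medium == "web" else " "
    return separator.join(selected)


content = Content(
    text_components={"intro": "Check the valve.", "detail": "Torque to spec."},
    assembly_map=["intro", "detail"],
    audience_filters={"detail": {"technician"}},
)

# The same content yields two different documents in two different contexts.
web_doc = publish(content, Context(audience="technician", medium="web"))
sms_doc = publish(content, Context(audience="operator", medium="sms"))
```

Even in a toy like this, the same content produces different documents depending on who is being addressed and through what medium, which hints at why real publishing transformations become so demanding.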
In our posited definition for information, we see another term deserving of attention – data.
Data is the meaningful representation of experience.
More specifically, data is the method of encoding that we use to represent the basic building blocks of communication, with these providing the substance for what we call content. It would be a distraction to dig too deeply into the subject of data now but suffice it to say that the various components that make up content will each be associated with a form of representation for which rules have been specified. One such component would be a formatting stylesheet which will, in addition to a basic encoding of its own characters, possess a specific syntax that governs how formatting instructions are to be framed as well as how they are to be applied and to what. So not only should the details arrayed in a table in a final document be considered data, so too should the formatting stylesheet that governs their rendition.
It is at this point that it is worth pausing to consider the subject of openness and portability. These considerations are effectively determined at the level of data representation and it is here that the adoption of, and adherence to, open standards will establish whether content components will themselves be open and portable or not. It is also here that the degree of intelligence that can be exhibited in the content components will be decided. How effectively and efficiently information transactions can traverse boundaries associated with different media and different devices is, in turn, governed by the openness, portability and intelligence of the content that stands behind them - and, looking deeper still, the data representations that make up the content itself.
Most recently, I have come to think of content as a transformation layer (or perhaps mediation layer) that separates data representations from information transactions – the layer in which we, as the content owners and information actors, plan the many different ways in which we may want to engage audiences and to lay out our resources and processes accordingly.
Finally, the relationships between these concepts are important because they illuminate the mechanisms whereby organizations and societies express, share and evolve their knowledge. So as a final step, I offer a somewhat controversial definition of knowledge itself, with it being controversial because it emphasizes the physicality of knowledge (in emerging from the meaningful interplay and organization of information transactions) and its standing as an evolving understanding that is publicly communicated and therefore subject to testing, refutation and even validation. This definition explicitly links knowledge to our definition of information, and thereby to our definitions of data, content and publishing.
Knowledge is the meaningful organization of information, expressing an evolving understanding of a subject and establishing a justified basis for judgment and the potential for effective action.
As we might expect, content is what is contained in an information transaction, within a document. When we look at the lifecycle that leads up to the formation of information actions, we see that it is content, in its rich physicality, that is being fashioned, evaluated, assembled, endorsed and eventually incorporated into the transactions. We also see that the content of past transactions can be referenced, reused or revised in the framing of new information events. Data resources, of varying types, can be seen in evidence in each and every content component and, likewise, the imprint of background knowledge, whether acknowledged or not, can be seen in the representational data schemes and in the structural patterns underlying each information product. And knowledge itself achieves a degree of persistence and portability when it is instantiated in networks of inter-related information transactions that trace out, over time, the emergence and evolution of a shared understanding of a subject.
This overall system, or content ecology, becomes, in no time, almost unfathomably complex, and this is one of the reasons why content management and publishing solutions have, historically, struggled, to put it politely. When we really think about it, it should come as no surprise that the management of content, together with its publication and continuous evolution, is phenomenally difficult, and the one mistake that must be avoided is assuming, as all too many technology implementations have, that content and its associated processes are simple.
This is an awful lot to pack into a relatively short declaration but I believe we have managed to float definitions for content, information, documents, publishing, data and knowledge. And without sound working definitions for these concepts, content management, as an industry and as a technology domain, cannot hope to be successful. As this is something that we, as a community of content management practitioners, have a stake in, I am hopeful that others will weigh in to challenge, correct, change or confirm these definitions.
If this is a topic that the reader finds interesting, then the following references might be worthwhile:
- Content Integration
- Fear of Content
- The Road to Intelligent Content
- Intelligent Content in the Green Desert
- The Business of Intelligent Content
- Architecting Information and Engineering Content
- Seven Steps to Intelligent Content
- The Challenge of Managing Intelligent Content
- The Emergence of Intelligent Content
- Content in the Wild
- Connecting with Content
- Structured Information Systems
- Managing Information
- On the Management of Content
- The Trials and Tribulations of Content Management
- The KM Uncertainty Principle
- The Great KM Divide
Now to make the context for this specific post, this information transaction, explicit, I will say that the notes for it were prepared in the above pictured quadrangle at Pembroke College, Oxford. It so happened that at the time organ music was pouring out from the small chapel that would be off to the right of the above photo. The blog entry was finalized and posted from my well-appointed room and desk as shown below. It is noteworthy that perhaps Pembroke's most famous alumnus is none other than Samuel Johnson, who among many other things produced a highly influential dictionary of the English language. Perhaps it is his spirit that spurred me into a fit of definitions and inclined me to sweeping generalizations and serpentine sentences. To what extent this context might colour the meaning of this post will be left open to debate. Of course, it might also have been the wine.
Have you seen Davenport & Prusak's definitions of data, information, and knowledge (in 'Working Knowledge')? They don't specifically account for content; I would put it after data, information, and knowledge and say that efforts to *contain* any of these three give rise to *content*. That definition, however, may not meet the needs of methodological rigour you have in mind.
Posted by: Milan Davidovic | August 17, 2009 at 06:30 PM
Hi Milan and thank you for this timely comment. I knew that this post, despite the grandeur of its title, would not be the final word on the topic of "what is content". Indeed, it was immediately obvious that further thoughts on content were necessary. (In fact, I have since updated the post as a result of this exchange.)
Now that I am back in my library, I am able to consult "Working Knowledge" to refresh my memory on the specific definitions of data, information and knowledge that Davenport & Prusak put forward. Now that I have done so, I am reminded that this is one of the books about "knowledge management" that I have found to be the most level-headed.
Wisely, Davenport & Prusak declare that their intention, in framing some working definitions, was to be practical and useful in the context of discussing how organizations and people create and use knowledge. I say wisely because, as I perhaps illustrate all too graphically, venturing into deeper philosophical waters can be more trouble than it's worth. That said, I find the definitions in "Working Knowledge" to be exceptionally good. While it may be difficult to discern at first, the definitions offered by Davenport & Prusak are not incompatible with those I have put forward, although mine perhaps wade more deeply into potential philosophical complications. Some examples might serve us well here.
Davenport & Prusak define data as follows:
"Data is a set of discrete, objective facts about events."
This can be one of the derivations that would be possible from my admittedly more abstract definition. I would probably pause over the use of the term "fact" and seek to bring a little more formality to how it is used, but this is a small quibble and one that would not, practically speaking, provide much help.
Information, in "Working Knowledge", becomes "data that makes a difference"; "it is a message" and "it must inform". This, too, I would see as a quite practical application of my more generalized definition.
Finally, with knowledge we hit what appear to be some differences, but these in fact turn out to be somewhat less important than the similarities. In "Working Knowledge" we find knowledge defined as follows:
"Knowledge is a fluid mix of framed experience, values, contextual information, and expert insight that provides a framework for evaluating and incorporating new experiences and information."
Excavating the similarities between this definition and my more austere generalization would take a little time. My whitepaper, ominously titled "The Anatomy of Knowledge", hopefully provides a background that, taken as a whole, illustrates that Davenport and Prusak's definition of knowledge is not incompatible with mine. The differences that do crop up turn on the perspective taken - whether you choose to view knowledge from the perspective of the person knowing or from the perspective of the thing known. When we consider the question from the perspective of the person knowing, the knowledge that this person already has must be acknowledged as playing a massive role in how new information will be interpreted or framed as utterances. In my whitepaper, I referred to this "form of knowledge" as "accepted knowledge", although perhaps "active knowledge" might be another way to put it.
And this all brings us to the question of "content" and how it really fits with these three concepts of data, information and knowledge. These three concepts represent "conceptual artifacts", with semantic import that can pass from one person to another sometimes intact, sometimes intact but only to undergo change afterwards, and sometimes in a manner that sees it change in the exchange itself. Seeing this, I do believe that as soon as we start using the word "management" we are obligated to select definitions for the items being managed that make some sense. And so it is that content is usefully understood as the physical instantiation of data, information and knowledge: how it is packaged and transacted, how it is contained, and the form in which it is meaningful to talk of its being managed.
Of course, when we talk about the content of a message, we are generally referring to its semantic import, so I am not 100% comfortable that we are out of the forest yet. At different times, I want to talk about the content as what is inside, while at others I want to talk about content strictly as physical artifacts, including their inter-relationships. There are even times when I am inclined to see the physical representation as what is "inside", and thus available to interpretation by recipients, and thereby to reject the notion that there is anything more "inside" that is being carried along for the ride. But this now shows what can happen if you start down the slippery slope of philosophic investigation.
So I think, or perhaps hope, that we are zeroing in on an understanding of content that will be useful and, as with the example provided by "Working Knowledge", usefulness in definitions should count for something.
Posted by: Joe Gollner | August 22, 2009 at 08:37 AM
I believe you have managed somewhat comprehensively to define those key terms. I found it a bit limited how “enterprise information management people” define, or should I say categorize, Content versus Structured Data, suggesting that Content is unstructured information – plain text? – and data is structured, mainly fields in the databases of a number of back-end systems.
Ref. Simplifying Information Architecture, Creating An IA Program That Works by Alex Cullen
Which otherwise, to me, is a well-thought-out paper. The layers in the framework picture make sense, at least.
Your Finnish collaborator.
Posted by: Heimo | August 28, 2009 at 04:36 AM
I think that you have hit the proverbial nail on the head.
To many observers, content is simply what we have not yet taken the time and effort to properly structure. I encounter this viewpoint regularly - almost on a daily basis. To these people, content is further classified as either material that simply does not merit the investment associated with applying structural discipline or material that secretly wants to be structured data and that has not been so elevated simply due to a lack of time or resources. The understanding of what constitutes structured data is invariably defined, for these people, by what can be managed using mainstream database technology.
Now for the content that has been classified unimportant, at least from a classical Information Technology (IT) perspective, the technology allocations that are made tend to be "broadbrush" measures such as more disk space for storage, perhaps a search tool, and maybe even a repository where these holdings can be dispatched (and perhaps hopefully forgotten). The attitude towards these resources often seems odd when you actually look at the materials being relegated to the infrastructure periphery as they certainly look important - policies, procedures, proposals, plans, and so on - very often with the signature or endorsement of someone who is relatively senior.
The second classification of content is in fact my personal favourite. This viewpoint is often associated with projects that set out to elevate the content, which has been designated as worthwhile, to the level of data that can be stored and managed along with other structured assets. These are my favourite because they are so frequently associated with projects that can only be described, even charitably, as disasters. In these projects, sometimes massive investments are made to construct database environments and business applications that can handle these newly reclaimed data resources. These investments come unstuck when the intrinsic complexity of content refuses to obey the often laughable restrictions that are associated with data-centric systems. Most entertaining of all is the fact that the proponents of these projects do not see that the problem lies in a fundamental category error - assuming that the content in these cases simply needs to be properly structured so as to become data. These proponents refuse to see the nature of their error and often make second and third assaults upon the problem only to meet with renewed frustration. Because the fundamental error is effectively invisible to these people, the source of these failures is always elsewhere and someone else, usually the user community, is at fault. The truth is that given the fundamental nature of the error being made, in attempting to see content as aspiring data, these people will never succeed no matter how much analysis they direct at the material, or how much money they spend building applications, or how many new product features they leverage in their database technologies.
Content, I am contending, straddles and encompasses the full range of communication levels - data, information and knowledge - and as a consequence exhibits complexity, and unpredictability, for which relational database technology is hopelessly ill-suited or, more correctly, to which relational database technology must be applied in highly selective ways and with suitable limits placed on the implementers' ambitions for control and precision.
In a recent project, we pursued the question of where does the product data really come from? Where does, indeed, the product itself come from? It turned out that a significant proportion of the data held within the product lifecycle management system was in fact references to documentary sources, such as engineering standards, from which specific data items were drawn and from which these items took their authority. In this case, the products in question existed in a highly regulated industry and the data behind the product and its manufacture was subject to extensive control. The data, it turned out, derived its authority within this regulated industry by virtue of the fact that it first, and primarily, existed within document content. So the initial impulse of some to rescue the data from these ancient artifacts ran straight into a fundamental brick wall - the data had to exist in the context of a document before it could be legitimately used in a database and product modeling environment. The document came first and took precedence and my data-oriented colleagues on this project have been uneasy, even distraught, ever since.
Posted by: Joe Gollner | August 29, 2009 at 09:57 AM
Interesting discussion, Joe - thank you.
I have some reservations however, about making distinctions between data, information, knowledge and content; just as I have about so-called 'unstructured' and 'structured' digital assets.
The reasons for my hesitance can be rendered down to these three:
1. One application's data is another's information, and there are many other examples where the differences between all forms seem to be arbitrary.
2. The 'structure' said to be associated with relational databases or hierarchies, to name two prevalent examples, is more arbitrary than most people think. Extended to a logical (or illogical, depending on your point of view) extreme, it means that almost any structure can be applied to any collection of data or information, making the primary reason for having them (meaningful communication) less than optimal.
3. The lack of structure said to be associated with other digital assets is not as problematic as the experts would have us think. Proof is in this communication, which by most standards seems 'unstructured', but by the contracts of effective communication (agreed-upon grammar, semantics and context) is highly evolved.
Data, information and knowledge are all phases or manifestations of the same thing - like ice and steam are all water. It doesn't seem effective to say that one is Hydrogen and the other is Oxygen and they may or may not evolve into water.
Posted by: John O'Gorman | October 19, 2009 at 02:35 PM
Thanks for your note. It comes at an opportune time as I have been thinking more about how my rubric hangs together and specifically how content fits into the mix. I think that I will be returning to this topic shortly and, owing to a number of sources including some of my own past presentations, I suspect I will be looking at content from the perspective of how it might be associated with "narrative" communication. But this is for another post.
On the points that you raise, I think we might actually be in closer agreement than it might appear - although that fact may have been buried under my sea of words.
On the first point, and this will return in looking at your third point, I agree that separating data from information is not an especially practical or possible task. If information is a meaningful organization of data, as I posit, then the organizational structures embodying that meaning would invariably be data. It's a bit like what Yeats once asked - "How can we know the dancer from the dance?"
On the second point, I do see the arbitrariness of structuring schemes but also the utilitarian nature of that arbitrariness. Or perhaps I hope that there is a utilitarian force guiding the formation and application of structuring schemes, knowing of course that they are frequently carried, willy nilly, from one domain to another and applied (or forced) in ways that don't necessarily make sense. By using the loaded term "meaningful" in my definition of data (meaningful representation of experience), I am invoking this "intentionality" although I am not one to immediately, or unequivocally, assume that it is a conscious intentionality.
Although I feel compelled to take a slightly different tack on the meaning of data than he does, I am inclined to reference Max Boisot's Information Space and its treatment of data as that which any given agent can, and does, perceive as meaningful within the stream of experience it participates in.
On your third point, I must say that this is the point that comes closest to what I have been thinking about most recently. So thanks for that. In my working day, I am routinely in the position of saying something like "You have designed an entire system predicated on the availability of neatly-packaged, uniformly-structured, and universally-accessible data nuggets. I am sorry to be the one who has to break this to you but this is not what you have to work with. You have a serious case of content, a more fluid mix of data, text patterns, media assets all flowing in a structure of sorts that we can only call narrative." In each of the layers, in my model, I stress the term "meaningful" and each definition both invokes and bears upon communication. In fact, I center all discussions of data, information and knowledge on communication as opposed to biological or cognitive processes (which I find to be the more popular approach and one I am repeatedly bumping into). Along the lines of my response to your first point, I don't see how we could untangle data, information and knowledge in any one "transaction", or instance, and I am not sure that such an effort would yield us much.
Posted by: Joe Gollner | October 19, 2009 at 06:31 PM