Data Silos are Killing Data Flow!

Situation Analysis

Ironically, I am re-posting the content of an earlier post here, verbatim. Why? Because I've just realized that my chosen blog platform (https://blogger.com) is also a data silo! It basically publishes mangled content as "text/html" and doesn't offer an RSS or Atom feed option either :(

Data Silos are bad. Even worse, they are growing exponentially!

It's no secret that Big Data is a conventional stool that consists of three legs: data volume, velocity, and variety. Unfortunately, data volume and velocity increasingly receive too much time, attention, and energy at the expense of variety.

Today, we have a NoSQL ("Not Only SQL Relational Tables" Relational DBMS World View) vs SQL ("SQL Relational Tables Only" World View) battle raging across DBMS vendors in either camp. Both camps have a simple value proposition: I can process more data at lower cost, with regard to data volume and velocity challenges. Thus, if you want to get some dead silence in this noisy realm, simply introduce the issue of heterogeneously shaped and disparately located data. Basically, how does one reference data across database management systems (SQL or NoSQL)?

What is a Data Silo?

Technology that's inherently constructed with an internalized view of data, in relation to how it's represented, accessed, and manipulated.

Data Silo vectors include:

Myopic views of structured data representation e.g., the notion that "unstructured data" exists when that's a subjective view held by those who don't understand or care about loosely coupling data, data representation notations, and data serialization formats
Query languages tightly coupled to a specific overreaching view of data representation (e.g., SQL imposition in a world where nobody sees entity relationships in Tabular form, when thinking)
Use of Literals as opposed to References (e.g., Hyperlinks) for identifying (naming) entities.

What's the problem with Data Silos?

They impede the pursuit of data-driven agility. Ironically, we tout (with fervor) the imminence of a data-driven Internet of Things, a Web of Things, Big Data, and the like, while completely overlooking the inevitable impact of heterogeneously shaped data on this fine-grained mesh of machines, data, and people. Data Silos are also extremely expensive, in every sense of the word.

How do we address the Data Silo problem?

Simply step back and look at the World Wide Web (Web) abstraction over the Internet. Basically, if we were to go back 26 years (prior to its emergence), using today's dominant thinking about data matters, the meme of the day would be "Big Documents" and a race amongst vendors to provide the fastest "Big Documents" processing system. The fact that document content format varies wouldn't matter since vendors would simply pursue the misguided notion that the fastest document management system wins and all alternatives die -- en route to a single document content format that serves all purposes.

Luckily for all of us, the Web emerged instead. It provided fundamental infrastructure, via sound architecture, for document content creation, sharing, and integration. It delivered this virtue without confining the world to a specific content format -- courtesy of "content type negotiation" which is baked into its core.
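
A minimal sketch of the idea behind content type negotiation, assuming a server that holds several representations of one resource and a client that states its preferences in an HTTP Accept header. The function names (parse_accept, negotiate) and the simplified matching rules are assumptions for illustration; real servers also handle partial wildcards such as text/* and specificity ordering.

```python
from typing import List, Optional, Tuple


def parse_accept(header: str) -> List[Tuple[str, float]]:
    """Parse an Accept header into (media_type, q) pairs, highest q first."""
    prefs = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0  # default quality when no q parameter is given
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed q value as "not acceptable"
        prefs.append((media_type, q))
    # sorted() is stable, so equal q values keep their header order
    return sorted(prefs, key=lambda pref: pref[1], reverse=True)


def negotiate(accept_header: str, available: List[str]) -> Optional[str]:
    """Return the best available media type for the client, or None."""
    for media_type, q in parse_accept(accept_header):
        if q <= 0:
            continue
        if media_type == "*/*":
            return available[0] if available else None
        if media_type in available:
            return media_type
    return None


# One resource, several formats: the same URI can serve people and programs.
formats = ["text/html", "text/turtle", "application/ld+json"]
print(negotiate("text/turtle, text/html;q=0.8", formats))
print(negotiate("application/pdf", formats))
```

The point of the sketch is that the identifier (the URI) stays the same while the content format varies per request, so no single format has to win.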

What worked for what would have been "Big Documents" will also work for "Big Data" using the very same infrastructure of the Web -- thanks to the underlying dexterity of its core architectural components (URIs and HTTP).

Core Web Architecture: Hyperlinks (URIs + HTTP)

If we can identify documents using hyperlinks, we can do the same for other entity types (people, places, music, and other things that make up our experiential existence). Likewise, if we can use hyperlinks to signal the fact that one document is related to another, we can apply the very same approach to identifying how a variety of entities are related, and even describe the very nature of different entity relationship types.

The Web fundamentally demonstrates the power of Data as the new Electricity conducted via hyperlinks. Thus, in this noisy world of DBMS technology (SQL or NoSQL) and its "Big Data" meme, we must pay attention to the role hyperlinks should be playing with regard to data representation. For instance, to what degree (if any) are hyperlinks used to identify entity relationship components (i.e., an entity, its attribute names, and associated attribute values) or the subject, predicate, and object aspects of a sentence (i.e., parts of speech)? Ignoring this fundamental step is a recipe for data silo explosion, and that's exactly what's happening today.
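
To make the "to what degree" question concrete, here is a rough sketch that models a sentence as a subject, predicate, and object, then measures how many of those parts are identified by hyperlinks rather than literals. The Statement type, the helper names, and the example.org URI are hypothetical and chosen only for illustration.

```python
from typing import NamedTuple
from urllib.parse import urlparse


class Statement(NamedTuple):
    """One sentence: subject, predicate, object (obj avoids the builtin name)."""
    subject: str
    predicate: str
    obj: str


def is_hyperlink(term: str) -> bool:
    """True when the term is an absolute HTTP(S) URI, i.e. a reference."""
    parts = urlparse(term)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def link_coverage(statement: Statement) -> float:
    """Fraction of the three sentence parts identified by hyperlinks."""
    return sum(is_hyperlink(term) for term in statement) / 3


# Literal-only sentence: meaningful inside one system, opaque outside it.
silo_style = Statement("Customer 42", "lives in", "Boston")

# Hyperlink-enhanced sentence: every part can be looked up over HTTP.
linked_style = Statement(
    "https://example.org/customer/42#this",   # hypothetical entity URI
    "https://schema.org/homeLocation",
    "http://dbpedia.org/resource/Boston",
)

print(link_coverage(silo_style))
print(link_coverage(linked_style))
```

A sentence built from literals alone can only be interpreted by the system that wrote it, which is the silo vector listed earlier as "Use of Literals as opposed to References".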

Data De-Silo-Fication Example

Be it within the confines of an enterprise intranet or the public echelons of the World Wide Web, the data-silo-induced data-flow-inertia problems remain the same, i.e., we need to increase data flow across data silos, using methods that go beyond Tables, Forms, and Graphics (pretty silos). Basically, we need to add hyperlink enhanced sentences to the mix, using an approach I call nanotation.

Nanotation is simply about the ability to create controlled natural language sentences in any medium that accepts plain text. That's it. In my specific case, I prefer to use Turtle Notation due to its closeness to controlled English and the visibility it brings to relationship type semantics. It also doesn't hurt that one of its creators also invented what we know as the Web.

Here's a simple Nanotation example that basically creates a webby structured data island right within this post.

{
<> a schema:BlogPosting .
<> rdfs:label "Data Silos are Killing Data Flow!" .
<> rdfs:comment """Simple sentences that systematically encode
                   information [data in some context] in reusable 
                   form.
                 """ .
<> schema:author <https://www.linkedin.com/in/kidehen#this> .
<> schema:about <https://twitter.com/hashtag/DataSilo#this>,
                <https://twitter.com/hashtag/Web#this>,
                <https://twitter.com/hashtag/RDBMS#this>,
                <https://twitter.com/hashtag/BigData#this>, 
                <https://twitter.com/hashtag/NoSQL#this> .
<> schema:mentions <https://twitter.com/hashtag/Nanotation#this> .
<> skos:related <http://www.slideshare.net/kidehen/understanding-29894555> .
}

In the example above, <> identifies this post using a relative HTTP URI, which works around the fact that an actual document location on the Web doesn't exist for my content until I save and publish this post. Anyway, once published, I will use the comments section associated with this post to showcase the effects of hyperlink enhanced data representation.
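
The way a relative reference such as <> turns into a full URI can be sketched with Python's standard library. The base URL below is hypothetical, standing in for wherever the post ends up being published; the resolve helper is likewise just an illustrative wrapper around urljoin.

```python
from urllib.parse import urldefrag, urljoin


def resolve(reference: str, base: str) -> str:
    """Resolve a relative URI reference against the document's base URI."""
    return urljoin(base, reference)


# Hypothetical location of the post once it has been saved and published.
base = "https://example.org/blog/data-silos-are-killing-data-flow"

# The empty reference <> resolves to the document's own URI.
post_uri = resolve("", base)
print(post_uri)

# A fragment such as #this names an entity described by a document.
# The fragment is not sent to the server, so looking the entity up
# retrieves the document that describes it.
author = "https://www.linkedin.com/in/kidehen#this"
document, fragment = urldefrag(author)
print(document, fragment)
```

Because <> is resolved only when the text is read in place, the same sentences stay valid wherever the plain text is published.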

Actual sentence visualization, courtesy of our Structured Data Sniffer Browser Extension:

Here are some additional links that showcase the effect of nanotation-style digital sentences or statements as an effective vehicle for alleviating current and future challenges posed by data silos:

Subjects of RDF Sentences embedded in Twitter posts (tweets) as a mechanism for Linked Open Data generation
Objects of RDF Sentences embedded in Twitter posts (tweets) as a mechanism for Linked Open Data generation
Predicates of RDF Sentences embedded in Twitter posts (tweets) as a mechanism for Linked Open Data generation

Conclusion

Database management system performance and scalability are not the most important aspects of the Big Data meme. They are simply one aspect of said meme. Data variety, privacy, and security are also extremely important issues that cannot be ignored during the process of product design, development, acquisition, and deployment.

The issue of data de-silo-fication shouldn't be the topic used to invoke silence in a noisy space. It should actually be the issue around which the most noise swirls :)