Data Access & Integration Challenges — 2019 and onwards
{Placeholder for our friendly retort to Ruben's post, i.e., we have this stuff in Virtuoso etc.}
From a post by Ruben Verborgh
Querying the Web
This one is actually my Big Question for the years to come, but it doesn’t hurt to state it explicitly:
- How can we query data on the Web?
In particular, I’m interested in small data rather than Big Data:
- How can we query a large number of small data sources instead of a small number of large ones?
By incorporating dereference of URI variables and URI constants in the body of SPARQL queries; essentially, a processing pipeline that incorporates intelligent Web crawling into the query-solution production pipeline.
Virtuoso has offered this since inception, courtesy of integration with its built-in Sponger Middleware.
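As a rough illustration of the idea (outside Virtuoso and its Sponger), here is a minimal "dereference, then query" pipeline in Python with rdflib; the seed URI and query are placeholders, and real engines interleave fetching with query evaluation rather than fetching everything up front.

```python
# Minimal "dereference then query" sketch using rdflib (not Virtuoso's Sponger).
# The seed URI and the query are illustrative placeholders.
from rdflib import Graph

def dereference_and_query(seed_uris, sparql_query):
    """Fetch each URI as RDF, merge the results into one graph, then query it."""
    g = Graph()
    for uri in seed_uris:
        try:
            g.parse(uri)  # content negotiation picks up an RDF serialization
        except Exception as err:
            print(f"Skipping {uri}: {err}")
    return g.query(sparql_query)

if __name__ == "__main__":
    seeds = ["http://dbpedia.org/resource/Tim_Berners-Lee"]
    q = """
    SELECT ?p ?o WHERE {
        <http://dbpedia.org/resource/Tim_Berners-Lee> ?p ?o .
    } LIMIT 10
    """
    for row in dereference_and_query(seeds, q):
        print(row.p, row.o)
```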
Examples have been shared via tweets.
Querying data in decentralized networks
The following questions are inspired by the Solid project, where people store data in their personal data pods instead of inside applications:
- How can we query across personal data stores, taking into account privacy?
- How can nodes in a decentralized network help each other with caching and querying?
Since querying decentralized networks takes time, the following also become important (a small sketch follows these questions):
- How can we conceal latency in applications that need data from decentralized sources?
- How can we approximate query results and incrementally improve them?
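As a rough sketch of the incremental-results idea, the following Python snippet (rdflib plus a thread pool) streams partial answers as each source responds instead of blocking on the slowest one; the pod URLs and the query are hypothetical.

```python
# Sketch of incremental result delivery: stream partial answers as each
# decentralized source responds. Source URLs and the query are assumptions.
from concurrent.futures import ThreadPoolExecutor, as_completed
from rdflib import Graph

SOURCES = [
    "https://alice.example/profile/card",   # hypothetical data pods
    "https://bob.example/profile/card",
]
QUERY = "SELECT ?name WHERE { ?s <http://xmlns.com/foaf/0.1/name> ?name }"

def fetch(url):
    g = Graph()
    g.parse(url)
    return g

def incremental_results(sources, query):
    """Yield (partial) result batches as soon as each source arrives."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = {pool.submit(fetch, s): s for s in sources}
        for future in as_completed(futures):
            try:
                yield list(future.result().query(query))
            except Exception:
                yield []  # a failed or slow-to-fail source contributes nothing

if __name__ == "__main__":
    for batch in incremental_results(SOURCES, QUERY):
        print("new partial results:", batch)
```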
In particular, I am interested in reviving link-traversal-based querying. Instead of blind traversal, we should leverage knowledge about the data's shape and structure (a traversal sketch follows these questions).
- How to perform informed link-traversal-based querying, given machine-interpretable knowledge of data shapes and linking structures?
- How can additional knowledge improve the efficiency of link traversal?
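A minimal sketch of what "informed" traversal could look like, assuming the shape knowledge is reduced to a hand-written set of predicates worth following; the FOLLOW set, seed, and query are illustrative, and real engines would derive this from machine-interpretable shapes.

```python
# Sketch of "informed" link traversal: follow only links whose predicate is
# declared relevant by a (hand-written) shape hint, rather than every URI.
from rdflib import Graph, URIRef

FOLLOW = {  # stand-in for machine-interpretable shape / link-structure knowledge
    URIRef("http://xmlns.com/foaf/0.1/knows"),
}

def informed_traversal(seed_uri, sparql_query, max_hops=2):
    """Crawl outward from seed_uri, but only along predicates listed in FOLLOW."""
    g = Graph()
    frontier, seen = {seed_uri}, set()
    for _ in range(max_hops):
        next_frontier = set()
        for uri in frontier - seen:
            seen.add(uri)
            try:
                g.parse(uri)
            except Exception:
                continue  # unreachable or non-RDF documents are skipped
            # follow only links that the shape knowledge marks as relevant
            for _, pred, obj in g.triples((URIRef(uri), None, None)):
                if pred in FOLLOW and isinstance(obj, URIRef):
                    next_frontier.add(str(obj))
        frontier = next_frontier
        if not frontier:
            break
    return g.query(sparql_query)
```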
Facilitating Linked Data application development
In my 2018 blog post, I argue that the developer experience is crucial to accelerating the creation of user-facing apps, which remain a major pain point of the Semantic Web. Rather than hand-wavingly dismissing this as a trivial engineering matter, we should look into new abstractions that hide complexity. Think beyond JSON-LD. (A small facade sketch follows the questions below.)
- How can we expose Linked Data to developers, hiding the complexities of RDF but keeping flexibility and unboundedness?
- How to leverage composable data shapes that apps can bind to, as opposed to custom data models?
- How to write Linked Data according to a specific shape?
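One possible shape of such an abstraction, sketched in Python over rdflib: a facade that maps plain attribute names to predicates through a JSON-LD-style context, so application code never touches triples. The context, subject IRI, and source document are assumptions, not a proposal for a concrete library.

```python
# Sketch of a developer-facing facade over RDF: attribute access is mapped to
# predicates through a JSON-LD-style context, hiding triples from app code.
from rdflib import Graph, URIRef

CONTEXT = {  # a "composable shape": plain names -> predicate IRIs
    "name": "http://xmlns.com/foaf/0.1/name",
    "knows": "http://xmlns.com/foaf/0.1/knows",
}

class LinkedThing:
    def __init__(self, graph, subject, context=CONTEXT):
        self._graph = graph
        self._subject = URIRef(subject)
        self._context = context

    def __getattr__(self, name):
        predicate = self._context.get(name)
        if predicate is None:
            raise AttributeError(name)
        return list(self._graph.objects(self._subject, URIRef(predicate)))

# Usage: app code reads person.name / person.knows like ordinary objects.
# Any FOAF document works; this URL and fragment are illustrative.
g = Graph()
g.parse("https://www.w3.org/People/Berners-Lee/card")
person = LinkedThing(g, "https://www.w3.org/People/Berners-Lee/card#i")
print(person.name)
```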
Making personal data linked
I see the GDPR legislation as a godsend for innovation, and as a way to break the Semantic Web's chicken-and-egg problem for personal data. Under GDPR, we can contact any company or organization and retrieve our data in a structured format, giving us plenty of eggs to work with. (A minimal linking sketch follows the questions below.)
- How can we extract personal data from third parties?
- How can we link personal data from different third parties together?
- What kind of vocabularies and shapes should we be using?
- Can we easily move from one vocabulary or shape to another?
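A toy linking sketch under heavy assumptions: two hypothetical GDPR exports are lifted into RDF and connected with owl:sameAs when they share a mailbox. Real exports are messier, and the field names and IRIs here are invented.

```python
# Sketch of linking GDPR-style data exports: map two hypothetical JSON exports
# into RDF, then link the person resources when they share an e-mail address.
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import FOAF, OWL

def export_to_graph(provider, export):
    """Turn a flat {'email': ..., 'name': ...} export into a handful of triples."""
    g = Graph()
    person = URIRef(f"https://example.org/{provider}/me")  # hypothetical IRI scheme
    g.add((person, FOAF.mbox, URIRef(f"mailto:{export['email']}")))
    g.add((person, FOAF.name, Literal(export["name"])))
    return person, g

social_export = {"email": "a@example.org", "name": "Alice"}
shop_export = {"email": "a@example.org", "name": "A. Example"}

p1, g1 = export_to_graph("socialnet", social_export)
p2, g2 = export_to_graph("webshop", shop_export)

merged = Graph()
for source in (g1, g2):
    for triple in source:
        merged.add(triple)

# Simplistic identity resolution: same mailbox, therefore same person.
if next(g1.objects(p1, FOAF.mbox)) == next(g2.objects(p2, FOAF.mbox)):
    merged.add((p1, OWL.sameAs, p2))

print(merged.serialize(format="turtle"))
```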
Read–write public and private Linked Data
The main Linked Data success stories are about reading public/open Linked Data. These are important stories, but the opportunities for Linked Data extend far beyond them. Tim Berners-Lee has always called for a Read–Write Web, and such a Web contains public data, private data, and everything in between. (A small read-write sketch follows the questions below.)
- How can we build read–write websites based on Linked Data, accessible to humans and machines?
- How can we meaningfully combine public and private Linked Data?
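A minimal read-write sketch, assuming an LDP/Solid-style server that accepts Turtle via HTTP PUT; the resource URL is hypothetical, and authentication (e.g., Solid-OIDC) is deliberately left out.

```python
# Sketch of read-write Linked Data over plain HTTP against an LDP-style server.
# The target URL is hypothetical; access control is not handled here.
import requests

RESOURCE = "https://alice.example/notes/today.ttl"  # hypothetical pod resource

turtle = """
@prefix schema: <http://schema.org/> .
<> a schema:NoteDigitalDocument ; schema:text "Remember to water the plants." .
"""

# Write: create or replace the resource with a Turtle body.
resp = requests.put(RESOURCE, data=turtle.encode("utf-8"),
                    headers={"Content-Type": "text/turtle"})
print("write status:", resp.status_code)

# Read it back, asking for Turtle via content negotiation.
resp = requests.get(RESOURCE, headers={"Accept": "text/turtle"})
print(resp.text)
```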
Personalized Linked Data experiences
Finally, we need to rethink the interaction between people and data. Currently, we follow a strong question–answer paradigm where, if we are lucky, we get what we ask for. I am interested in personal agent/assistant interactions, where people are given the information they need, when they need it. This is also an alternative approach to latency concealment: by predicting needs earlier, we have more time to find answers. (A prefetching sketch follows the questions below.)
- How can we continuously assist people with data needs, as opposed to answering demand-driven questions?
- How can we prepare for upcoming data needs to provide answers faster?
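A naive prefetching sketch of that idea: while the user is occupied with one view, an agent warms a cache with the resources a toy predictor expects to be needed next. The predictor, views, and URLs are invented for illustration.

```python
# Sketch of concealing latency by anticipating data needs: prefetch the
# resources a (very naive) predictor expects to be asked for next.
from concurrent.futures import ThreadPoolExecutor
import requests

cache = {}

def predict_next(current_view):
    """Toy predictor: a fixed map from the current view to likely next needs."""
    return {
        "inbox": ["https://alice.example/contacts.ttl"],
        "contacts": ["https://alice.example/calendar.ttl"],
    }.get(current_view, [])

def prefetch(urls):
    def fetch(url):
        cache[url] = requests.get(url, headers={"Accept": "text/turtle"}).text
    with ThreadPoolExecutor(max_workers=2) as pool:
        pool.map(fetch, urls)

# While the user reads their inbox, the agent already warms the cache.
prefetch(predict_next("inbox"))
```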