AI Needs Data Permission & Provenance

Machine Learning, now popularly called AI, offers great promise for a wide variety of applications; however, it also has a number of serious problems, many of them less severe than the total extinction of humanity.

One of the most fundamental problems with many implementations of AI is the lack of any reliable chain of attribution for the origin of knowledge. All of Western academic science and scholarship rests on the requirement for rigorous attribution and citation of sources through footnotes. Without this, the entire edifice of knowledge could not be trusted.

Thus far, it appears that Large Language Models are fundamentally incapable of delivering that basic requirement for verifiable knowledge, at least by themselves. The disturbingly frequent factual errors they produce, which have been generously called "hallucinations", might be more accurately described as "confabulations", a word drawn from Latin roots meaning to put together through storytelling.

Confabulation – Psychiatry, Psychology. The replacement of a gap in a person's memory by a falsification that they believe to be true, as in: "The report concluded that while the information elicited under hypnosis may be accurate, it may also include confabulations and pseudomemories."

A human academic author guilty of such behavior would be rapidly discredited as an unreliable source.

Another word for the general property that citations confer on information is provenance. The term data provenance means a digitally verifiable chain of custody or origin for information. In a world where Large Language Models feed on data output from other models and confabulation goes unchecked, the only defense we will be left with is data provenance: the ability to trace and verify the origin, and therefore the quality, of information.

As a pragmatic solution, data provenance is achieved today by using cryptographic keys to sign data; digital fingerprints of data events, called hashes, can then be recorded on immutable ledgers. These technologies have become associated with blockchain, but they are in reality core building blocks of data infrastructure, entirely independent of any use with blockchains or tokens.
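
As a rough illustration of those building blocks (not JLINC's protocol, and using illustrative names only), the sketch below signs a piece of data with an Ed25519 key, computes a SHA-256 fingerprint of the signed event, and appends it to a simulated append-only ledger. It assumes the Python `cryptography` package.

```python
# Minimal sketch: sign data, fingerprint the signed event, append to a ledger.
# Assumes 'pip install cryptography'; all names here are illustrative.
import hashlib
import json
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

# The data originator holds a private signing key.
signing_key = Ed25519PrivateKey.generate()
verify_key = signing_key.public_key()

data = b"Patient consent form, version 3"
signature = signing_key.sign(data)  # provenance: who vouches for this data

# A digital fingerprint (hash) of the signed event...
event = {
    "data_hash": hashlib.sha256(data).hexdigest(),
    "signature": signature.hex(),
}
event_hash = hashlib.sha256(json.dumps(event, sort_keys=True).encode()).hexdigest()

# ...recorded on an immutable ledger (simulated here as an append-only list).
ledger = []
ledger.append(event_hash)

# Anyone holding the public key and the original data can later verify its origin.
verify_key.verify(signature, data)  # raises InvalidSignature if either was tampered with
```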

Another challenge with machine learning in general, separate from unconstrained LLMs, is its relationship to intellectual property and creative work. Not only is provenance needed, but also some mechanism for agreement, and for permission to be granted under that agreement, before data that may represent the legal, moral, or personal property of one party is simply gobbled up by another party's LLM.

Where data may include health or medical records, there are even more nuanced issues of permission and attribution that must be resolved for AI to realize its full potential and avoid societal backlash. Here, and in many other personal-data contexts, what is needed is not only a mechanism for delivering signed provenance, but also a way to associate signed agreements with data, so that those agreements can be exchanged and used to govern the ingestion of that data by machine learning systems.
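
A minimal, hypothetical sketch of that idea follows: a data record carries the human-readable agreement and the data subject's signature over its hash, and an ingestion step refuses any data that does not arrive with a verifiable agreement. The record layout and function names are assumptions for illustration, not JLINC's API.

```python
# Hypothetical sketch: gate ML ingestion on a verifiable signed agreement.
import hashlib
from dataclasses import dataclass
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import (
    Ed25519PrivateKey,
    Ed25519PublicKey,
)

@dataclass
class AgreementRecord:
    data: bytes                  # the personal or medical data being shared
    agreement_text: str          # the human-readable terms the parties agreed to
    agreement_signature: bytes   # data subject's signature over the agreement hash
    subject_public_key: Ed25519PublicKey

def ingest(record: AgreementRecord, training_set: list) -> bool:
    """Add data to the training set only if a valid signed agreement accompanies it."""
    agreement_hash = hashlib.sha256(record.agreement_text.encode()).digest()
    try:
        record.subject_public_key.verify(record.agreement_signature, agreement_hash)
    except InvalidSignature:
        return False             # no verifiable permission: do not ingest
    training_set.append(record.data)
    return True

# Example: the data subject signs the agreement, then shares the data.
subject_key = Ed25519PrivateKey.generate()
terms = "This data may be used for model training for research purposes only."
record = AgreementRecord(
    data=b"blood pressure readings ...",
    agreement_text=terms,
    agreement_signature=subject_key.sign(hashlib.sha256(terms.encode()).digest()),
    subject_public_key=subject_key.public_key(),
)
training_set: list = []
assert ingest(record, training_set) is True
```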

Jim Fournier

Founder & CEO, JLINC Labs


About JLINC

JLINC is automated data sharing technology for delivering both signed provenance and data governance under human-readable contracts. It uses the same core technology components as web3 (cryptographic keys, hashes, and ledgers), but in a totally new way: to exchange human-readable contracts between agents representing any type of entity, from a natural person to an organization to an AI.
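
To make that exchange concrete, here is an illustrative sketch (again, not the JLINC protocol itself) in which two agents each sign the SHA-256 hash of the same human-readable contract, so that either party, or an auditor, can later verify exactly what was agreed.

```python
# Illustrative sketch: two agents countersign the same human-readable contract.
import hashlib
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

contract_text = "Party B may use Party A's data solely to provide service X."
contract_hash = hashlib.sha256(contract_text.encode()).digest()

party_a_key = Ed25519PrivateKey.generate()   # e.g. a natural person's agent
party_b_key = Ed25519PrivateKey.generate()   # e.g. an organization's or an AI's agent

# Each agent signs the identical contract hash, producing a two-way agreement.
signature_a = party_a_key.sign(contract_hash)
signature_b = party_b_key.sign(contract_hash)

# Either party (or an auditor) can verify both signatures against the contract text.
party_a_key.public_key().verify(signature_a, contract_hash)
party_b_key.public_key().verify(signature_b, contract_hash)
```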

The JLINC technology is validated by an exceedingly broad patent, issued in January, which covers the method required to provide not only auditable signed provenance, but also cryptographically signed contractual agreements, readable by both humans and machines, that govern the control and use of data after it is exchanged.

JLINC Blog: JLINC Patent Enables 2-Way Permissioned Internet Data Exchange