Some Requirements for a Universal Asset Graph :: Theory of Predictable Software

The problem

We are blind.

Take a piece of software and try to learn its true and complete provenance. It doesn't exist, it never has. So far the profession of software and the society which depends on it have gotten by on goodwill and a fair amount of good luck.

But as more and more software relies on more and more upstream components, the emphasis of malicious actors will inevitably turn more and more to attacks on the supply chain.

The volume, velocity and variety of vectors continues to grow. More dependencies are taken, software is developed more rapidly, systems can be attacked at multiple levels to inject malicious code. This problem is already bad; as more and more things become programmable, it will grow to truly catastrophic proportions.

Questions that grasp at the underlying issues are plentiful: How do we know what systems are affected by an attack? How do we know to trust a given software asset? How do we update the state of our knowledge as new information comes in? How do we focus further investigations onto the narrowest possible scope of at-risk systems?

We cannot do these things without a system of accounting. In the interests of capturing imagination, I call this system of accounting the Universal Asset Graph.

Requirements

Having discussed the problem, I want to sketch features and characteristics that I think are essential to any meaningful solution. I am sketching these with a sweeping, even breathtaking, disregard for most existing prior art, which has grown rapidly in recent times. My intention here is partly to sort out my own ill-formed thoughts, and partly to motivate more efforts working towards a solution.

It is a database

Put another way: there needs to be some system where querying is the primary function of the system. Without such primacy, each agent needing to understand the Graph will need to assemble various sources of evidence independently, at its own expense. These efforts will be disjointed, incompatible and relatively incomplete.

A more concrete way of putting the requirement is that file formats are not sufficient. We need an actual, real system which separates storage representation from query expression.

I do not decide here whether the Universal Asset Graph needs to be stored in a graph database or a triple store. I myself am partial to relational models for expressing storage: they are far more widely understood and supported than graph and triple store alternatives.

Rooted in some formal conceptual model

We must, first of all, agree on what is. What is software? What is an asset, and what is a process? How are these related? What relationships are exclusive, which inclusive? Which are singular and which plural? What rules can be used to induce, deduce and abduce conclusions?

Ontology is the general field which addresses the problem of describing logically and formally what we are talking about¹.

As engineers many of us first think of APIs, with the underlying data model a distant second. Ontologies are a more distant country again. Unfortunately this means we wind up forever lashing together shambling monstrosities out of duct tape and livid curses. Under normal circumstances, inconsistencies and impedance mismatches between systems are a costly inconvenience. But security is not a business of normal circumstances: it is the business of optimising agents who will zero in sharply on any point at which "meh, good enough" was the prevailing culture.

Consider, for example, the question of containment, of part-whole relations, a subfield called mereology². For something like a filesystem, it is easy to talk about a given file as being, or not being, part of the filesystem. It is present or it isn't. It's contained or it's not. But then we might ask another question: are two files the same file? At one level, yes, they have identical bytes. At another, they are not, as they are at different locations. And if the bytes are not identical, but nearly identical, how do we capture that? Do ranges of bytes have identity and if so, what is it? Containment as the file in a filesystem is like an apple in a bag. Ranges of bytes are more like water in a bottle. Both things are containment, but they are different.

Perhaps it sounds like I am asking for too much, that our field is too complex for such an endeavour. I am not convinced by this argument. Biology is a field in which the simplified models look like this³, but for which multiple large ontologies exist. Software componentry isn't a simple field, by any means, but it is nowhere near as complex as molecular genetics.

The form of this formal model might not be a strict ontology. It's more likely to be a semi-formal model, expressed mostly in structured natural language. But we should strive for the most complete and reliable structure we can, because once the boulder begins to roll downhill it will be hard to stop.

Records both asset data and process data

Any acceptable system of accounting needs asset data, to know what is contained in or connected to what, and in what fashion. We also need process data, to know how, where, by what and by whom our assets are called into being.

Asset data must enable us to answer questions such as:

Does dependency X contain file Y?
Is vulnerability 123 present in any running process we control?
What dependencies contain file X?
What files contain range-of-bytes Y?

Process data must enable us to answer questions such as:

Who⁴ deployed version X?
When was vulnerability Y introduced into our supply chain? By what event?
How long does it take to patch 100% of our systems?
How long does it take for an upstream patch to be picked up by downstream software? If I have a transitive dependency, how long before my direct dependency is safe?
When was vulnerability Y first reported? Has it changed? What are the changes? Did a file that was previously OK become not-OK (or vice versa)?

Asset data and process data will be needed together to let us answer questions such as:

What source code change for dependency X introduced vulnerability Y?
What files derive from source code that I no longer trust?
How did a given range-of-bytes wind up in my systems?

What kinds of asset data should be recorded? I believe that at least:

Assets are ultimately formed out of ranges of bytes, and that these ranges of bytes are identifiable via content addressing.
Ranges of bytes can be rolled up into larger and larger assemblies: files, directories, archives, filesystems etc.
Each kind of an asset remains content addressable, because ultimately it can be traced back to a range or ranges of bytes, or constructed from a Merkle tree of assemblies and ranges of bytes.
Each kind of asset can have multiple forms of address, or identity, or versioning. But these are not the canonical address, identity or version. The content address is the universal identifier, to which other names and handles are related by assertions.

As for process data, I think we can think in terms of 4 basic operations on assets⁵: Movement, Transformation, Assembly/Disassembly and Inspection.

Assets can be moved. A copy is taken from one inventory of assets and delivered to another inventory of assets (eg, via download, on a USB stick, as a torrent). This can be viewed as a shipment or delivery from an original inventory with infinite units in stock. Movement is necessary to determine how a given range of bytes came to be in a particular location (in memory, on disk, etc).
Assets can be transformed while retaining some constant identities across the transformation. Transformation takes some range or ranges of bytes A and outputs a different range or ranges B. But A and B retain some shared identity or relationship across the transformation. Transformation is necessary to be able to relate an asset to its earlier forms (eg, to go from a binary to its sourcecode via a compilation).
Assets can be assembled or disassembled. Without irreversibly transforming any range of bytes, they can be aggregated into a single unit, or disaggregated from a single unit into the original ranges of bytes. For example, forming a compressed archive file.

Importantly, this is not the same as transformation. Each creates a mereological or containment relationship, but they are different relationships. Transformation is the relationship of source materials which participate in a process to create a new material; assembly is a much more prosaic relationship of physical containment, of one thing being inside another thing.
Assets can be inspected. Some agent can examine an asset and render some opinion on them; perhaps multiple opinions (eg. a test suite can render many opinions on a single file). The inspection could render nominal opinions (eg. X passed / failed / errored / was skipped; Y contains / doesn't contain CVE-2020-2942), ordinal opinions (X is faster than Y), interval opinions (X passed 198 out of 200 tests) or ratio opinions (X is 20% faster than Y). Inspections will mostly be by tools, but some will be made by humans.

Figure 1 shows how these operations can be organised into a simple taxonomy.

A taxonomy is represented as a top-down tree. At the top is 'Operation', forking into 'Changes containment' and 'Does not change containment'. Under 'Changes containment' it forks into 'Retains same range(s) of bytes', which then leads to 'Movement'. Under 'Creates new range(s) of bytes' it leads to 'Assembly/Disassembly'. Then from 'Does not change containment' it forks into 'Creates new identities', which leads to 'Inspection', and 'Retains identity across operation', which leads to 'Transformation'. — Fig 1. The taxonomy of operations.

Federatable

A truly universal asset graph needs some common pool to which all public data will drain. But while all will want to access and join queries over public data, not all data will be public. Each participant in the scheme is likely to have private data (such as internal product builds) that they do not wish to share with others.

This means that the Universal Asset Graph must allow federation. Each participant must be able to manage some subset of the Graph, with that subset determined by local needs. Ideally there would be at least one participant dedicated to creating the shared public graph to which others are federated.

A public good

That last point requires expansion.

In economics, goods can have the properties of rivalry and excludability⁶. A good is rivalrous if my use of the good diminishes your use of the good. For example, if I am using a hammer, you do not get to use the hammer. If I am transmitting at 100.1Mhz, then you cannot transmit at 100.1Mhz without interference. A good is excludable if consumption can be innately limited by a seller as part of a trade for the good. For example, if I own the tool and wish to sell it to you, I can ensure that you pay me by withholding the tool until payment has been made. But for a non-excludable good I cannot enforce that exchange. Similarly, if I transmit a plain FM signal on 100.1Mhz, I cannot prevent anyone from tuning into the signal⁷.

Public goods are those goods which are non-rivalrous and non-excludable. That is, the use of the good by any person does not diminish another's value, and there is no easy mechanism to exclude anyone from enjoying the good. Public goods are quite common in practice. Some examples include fire fighters, policing, public fireworks displays and so on.

Public goods have a critical problem, however. Without excludability, it is possible for anyone to "free ride" on the good -- to receive value without paying anything for it. For providers of goods, the free rider problem makes it rational to provide less of the good than would otherwise be provided. Perhaps none at all. This is called underprovisioning.

Take public fireworks as a classic example. Suppose you live in a small town which has decided to hold a public fireworks display on a popular holiday. If everyone contributes $10 towards the display, then a large and impressive display can be seen and enjoyed by all at a low cost per contributor. But the fireworks are non-rivalrous, since each person can enjoy them from different vantage points; they are non-excludable because they are visible throughout the township. So for any particular individual there is the temptation to be a free rider. You're left $10 better off with a minor, perhaps undetectable, degradation in the value of the good you experienced. But if everyone comes to the same conclusion, there will be no fireworks for anyone.

The public part of the Universal Asset Graph is a classic public good. It is non-rivalrous: my query over the data does not greatly diminish your query's value (up to limits on computing resources⁸). It is non-excludable: the data is public and the provider cannot charge for access⁹ in a way that prevents others from accessing the data. Therefore there is no agent for whom it makes strict rational sense to provide this service, since for all other agents the rational decision is to free ride.

There are two common solutions to the underprovisioning problem. The first is through direct command-and-control: some hierarchy which can tell agents what to do forces them to act in the best joint interest of the group. The second resolution is subsidy: some agent or agents pay for the good and accept that they will not be able to capture all the value generated by their subsidy, allowing free riders.

Since a Universal Asset Graph cannot be constructed through command-and-control, it must rely on subsidy. This means that the expense of assembling information, storing information and serving queries over the information must be absorbed by one or more well-resourced agents. Such a structure would probably need to be a consortium with mechanisms of mutual enforcement of subsidy agreements.

Can represent imprecise and incomplete information

Databases can be divided into two rough classes: OLTP (online transaction processing) and OLAP (online analytical processing).

OLTP systems are the systems of work and typically the sources of truth. By their nature OLTP systems must be consistent after each change to the data. The happy side effect is that data is consistent. The unhappy side effect is that data which doesn't fit the model gets rejected or degraded.

OLAP systems are dedicated to answering questions. Their content, organisation and origins are typically dictated by what can be extracted from OLTP systems, but in general any source of data will do and any kind of massive data tooling (MPP databases, Map-Reduce systems, you name it) can be used. The key differentiation is that data is typically accepted as it comes and converted to a useful form either when being loaded or at query time. The full cost of consistency is not imposed at the time the original data is stored.

There's no realistic world in which a Universal Asset Graph is maintained on an OLTP basis. That would imply that all software supply chain activity would be mediated through and gated by a single consistent data store. Such an enterprise would be untractably vast in scope, compliance would be an onerous expense on all concerned, and no agent can ensure compliance in any case. It won't happen that way.

That leaves us with assembling data after the fact, from heterogenous upstream sources. Whether from build systems, source control systems, issue trackers, download sites, package managers and so on, a Universal Asset Graph will always be incomplete and at a lower fidelity than reality. It's an OLAP system, writ large.

The key is that such a system is progressively improveable, but must from the beginning acknowledge and record its own imperfections. This additional effort to record known limits, an epistemic confessional, allows us to reconstruct acceptable answers to our queries at later times.

The alternative is to pretend that we have the easier world of OLTP. We could reject anything that's too hard to fit, reduce fidelity, adopt coarser grains. But that will leave us with undetected divergences from reality, which attackers will be happy to exploit. The effort to attain perfection would guarantee imperfection.

Ambiguity

Data is ambiguous when we cannot, from the data, identify the specific entity or relation under discussion. This is common to software, where floating strings of text (versions, tags, releases) are used to identify particular assets (archives, files, packages).

For example:

How should we account for all binaries cut from a git tag with the value "2.3.4"? Are the Linux, Windows and MacOS binaries all the "same" version? Or are they three different versions which share a tag in common?
How should we account for moving tags or releases? If a binary is referred to by a URL, but the binary retrievable from that URL changes, is it the same version? Two versions with the same string?
A software project uses a complex CI/CD pipeline to convert source into shipped binaries, but we do not yet have a way to consume the information coming from that system. We cannot say where in the process an asset was transformed with precision, we can only say it occurred somewhere between the source code and the asset.

Ambiguity needs mostly to be dealt with on a case-by-case basis in the data model. For example, the concept of a version cannot be one-to-one with assets without eventually rejecting truthful statements about versioning. We need to be able to express facts like "this file has these two versions" and "these archive files are distinct, despite having the same name".

Uncertainty

Uncertainty represents that we have a stated fact but lack evidence as to the truthfulness of the fact. For example, I don't know if it's raining in Darwin, Australia. I could say "it is probably raining" or "probably not raining". Either way, I am expressing a fact with some degree of uncertainty. Until I gather more observational data, the fact retains uncertain.

Some facts are almost certainly true. These can be stored directly in a normal fashion. For example, it is highly likely that computing the cryptographic digest of a file will not produce the same result as the digest of a different file. Rather than storing information about the likelihood of collision, we treat collision as impossible.

Some facts are less certain. For these the Graph needs to represent the uncertainty in some form; the most likely will be statements of probability.

Vagueness

Vagueness represents that we have a fact which doesn't allow resolution into crisp values. For example, "it is raining heavily in Darwin, Australia" is a matter of degree, ranging from a tropical storm through a tropical depression up to a tropical cyclone dumping a hundred millimetres of water per hour. Another classic example, from the "Sorites paradox", is to decide when a collection of sand grains constitutes a pile of sand. When one grain is removed, it is still a pile of sand. When a second is removed, it's still a pile of sand. At what point, then, does the pile of sand cease to be a pile of sand?

One software example of vaguness is vulnerability severity. Two assets have the same CVSSv3 score, so which problem is "worse"? The nature of worseness of vulnerabilities is itself fairly vague. We impose crisp boundaries, but these do not truly encode the vagueness of the underlying problem.

Two ways to encode vagueness are fuzzy logic (where entities belong to sets with a degree of membership) or belief functions (where facts are stated to be true or false to a certain level of belief). Both of these have well worked out mathematics of how they can be represented, combined and calculated. Where we encounter vagueness, it should be explicitly represented as fuzzy sets or belief functions.

Subjectivity

Information about assets and processes has to be reported by some agents, and queries over the information will be made by other agents. The querying agents need to be able to assign different levels of trust to reporting agents. This trust may need to be transitive, so that assertions by agent X relied on by assertions from agent Y can be weighed according to the trustworthiness of both X and Y.

This is subjectivity: that the value of data is a function of who reported it and who's querying it. Put another way, value depends on the provenance of the information, as distinct from the provenance of assets the information describes.

For example, I may prefer one container scanning vendor, FooScanner, to others. When a FooScanner inspection is made in my private graph, I prefer that to private or public inspections made by other tools and agents. But when no FooScanner record is available, I will still want to fall back on other available inspection records. By assigning subjective weights according to subjects, I'm able to do this automatically.

Stores data in bitemporal form

Data in the Graph will frequently be incomplete or inaccurate, but incompletion and inaccuracy will vary over time. Last week we possibly believed that a vulnerability has affected foolib since 2.7.3. This week we discover that it has actually been present since 2.6.9. We want to record both the update to our current knowledge without destroying the record of our previous state of knowledge.

Bitemporalism is the design concept¹⁰ that makes this kind of flexibility possible. A bitemporal database stores time on two axes¹¹. The first is "valid time", the span of time during which a fact is held to be true (eg. "true from 4th March until 6th March", "true from version 2.6.9 until version 2.7.11"). The second is "transaction time", which represents changes in belief about valid time (eg. "we believed this valid time range until 11th March, then changed it on 12th March").

Critically, bitemporal data lets us make strong assertions about when a fact was not known. This isn't true of standard updated-at fields, which can only imply when a fact ended and can only signal when a fact changed, without saying from what. Traditional audit table designs intermingle valid time and transaction time in a way that makes certain queries difficult or impossible ("did we ever believe that foolib was vulnerable, and if so, during what periods?").

Storing data bitemporally is important for reconstructing events, but also for ensuring efficient updates to security investigations. When the scope of a vulnerability is widened or narrowed, that information needs to be conveyed to interested parties efficiently and quickly. Bitemporal queries and modern database technologies make this potentially highly performant vs designs based on scraping durable logs to derive state.

Questions and answers

Do you mean a Software Bill of Materials? No. SBoMs are fine as far as they go, but they are necessarily a best-effort snapshot taken at one point in time. SBoMs will be very useful feedstock / originating documentation for an asset graph, however, and it should be possible to project SBoMs out from the Graph upon demand.
Do you mean In-toto? I see In-toto in the same role as SBoMs. They would provide critical feedstock where available, and it would ideally be possible to project In-toto layout and link documents from a Universal Asset Graph. However, they would not be the only acceptable form of source evidence and they would not be the format of storage.
Do you mean a transparency log? Not entirely. I see transparency logs -- Merkle trees, essentially -- as necessary for establishing an ordering of events. But I don't think they should be the primary storage system. Instead such logs should authenticate that every record inserted into the trusted primary store was known about at insertion time, that no records have been deleted, and that no records have been inserted out of sequence.
Where does code signing fit into the picture? Signatures are attached to the assets they verify. Their existence and validity can form part of the Graph.

¹ A good introduction is C. Maria Kent's Ontology Engineering, available online.. The Wikipedia article on ontology is useful with many pointers to further reading.

² The Stanford Encyclopedia of Philosophy entry for mereology is a little dense, but certainly gives a broad overview of how surprisingly complex part-whole relations can be.

³ From the Roche Biochemical Pathway Maps.

⁴ The idea of "who" is ambiguous and controversial. Some people only participate via pseudonyms for reasons of physical or psychological safety. Any "real name" policy would exclude them unfairly, so any concept of identity will need to allow for pseudonymity.

⁵ There is some overlap here with terminology used by the In-toto project. In-toto describes "artifact rules" which can describe (but do not distinguish between) transformation and assembly/disassembly. It also describes an "inspections" concept, but only allows a single nominal inspection scale with two values: pass/fail. So far as I can tell, it does not consider movements as a standalone concept.

⁶ Try not to think of these as binary classifications. They are like a continuous dimension, varying in degree.

⁷ If I am granted a legal monopoly on transmitting at that frequency, I can create excludability for units of time and sell those. Namely: advertising slots.

⁸ There's a case to be made that a public graph would be rivalrous, since computing resources are constrained and overuse by many agents degrades performance and perceived value. To the degree that this is true -- remember, they are not binary classifications, but fuzzy sets -- it transforms the Graph from being a public good into a "commons" good. A classic example of a commons good is a shared fishery. It's not excludable, but consumption by any one agent limits consumption by other agents. Commons can be governed by systems of mutual agreement and voluntary governance; Elinor Ostrom won the Nobel Memorial Prize in Economics for investigating how a variety of commons are governed. For our purposes the distinction is not critical, as the same core requirement will persist: some agent or agents will have to pay the bills and that agent or those agents will not capture all the value generated by doing so.

⁸ Note that excludability is key to the business model of some vendors. They curate private databases to which non-subscribers have no access. They are still non-rivalrous. Together this makes them "club goods", which are a distinct category from public goods.

⁹ Martin Fowler gives a passable brief introduction to bitemporalism, but the best resource is still Richard Snodgrass' Developing Time-Oriented Database Applications in SQL, which is freely available online. A more recent book is Tom Johnston & Randall Weiss' Managing Time in Relational Databases.

¹⁰ There's a further extension possible. Many dimensions of interest are ordered dimensions, with before-after relations for every pair of points. An example are semver numbers, on which there is a total order from 0.0.0 to ∞.∞.∞. Such ordered dimensions can be treated as behaving like valid time, where truth about a statement (eg. the presence of a vulnerability) can be expressed as a range (from-version, to-version).