SWHID and pURL

This is the fourth article in a series about SWHID, an open standard for identifying software artifacts. The previous three articles covered the following topics:

Article 1: introduces SWHID and explains why precise identification of software is becoming essential in the context of software supply chains and emerging regulations.
Article 2: describes the syntax of SWHID and shows how it enables reliable comparison between software artifacts.
Article 3: explains how SWHID is governed as an open standard and introduces swhid-rs, its reference implementation.

This article focuses on a question I receive often: what is the difference between SWHID and pURL? Many people believe they serve the same purpose. Some think they are competing identifiers. I will explain what they have in common and where they differ, from the SWHID perspective.

Disclaimer: I do not intend to provide a feature-by-feature comparison between SWHID and pURL. Feel free to contact me, or to use the comments section of this post, to point out aspects that I have omitted or that you consider unfair.

Extrinsic identifiers

The previous articles introduced SWHID as an intrinsic identifier. To understand pURL, we need to look at the other side: extrinsic identifiers.

An extrinsic identifier is a unique label assigned to an entity by an external authority or system. It is not derived from the natural properties of the object itself. It is an artificial tag used to track, manage, or categorize an entity within a specific context or database.

Extrinsic identifiers are everywhere. Some well-known examples, in the software industry and beyond, are IP addresses, IBANs, DOIs, pURLs, SWIDs (Software Identification tags, not to be confused with SWHID, the Software Hash Identifier), and CVEs.

What is pURL?

pURL stands for Package URL. It is an extrinsic identifier for software packages created late in 2017 by Philippe Ombredanne, CTO of nexB Inc. and the mastermind behind ScanCode Toolkit, among other relevant contributions.

pURL was designed to address several challenges:

“Ambiguity in Package Identification: With diverse naming conventions across ecosystems, identifying software packages reliably has historically been a challenge. PURL eliminates this ambiguity by creating a universal identifier with a predictable structure.
Cross-Ecosystem Interoperability: Developers, organizations, and tools often work across multiple ecosystems, each with its own package management systems. PURL harmonizes these differences, enabling seamless interoperability.
Enhanced Traceability and Risk Management: In an era where supply chain security is critical, PURL provides the foundation for identifying and tracing packages to their origins, dependencies, and potential vulnerabilities.
Tooling and Automation”

A pURL is assigned to a package by the community or ecosystem that hosts it, such as npm, PyPI, or Maven. The identity of the package depends on the registry, not on the content of the code. If the same code is published in two different registries, it will have two different pURLs. If a package is removed from a registry and re-published, the original pURL may no longer be valid. This is the nature of extrinsic identifiers. Their validity depends on an external authority.

Where is pURL used?

pURL is used to identify software packages in many tools and processes. It is widely adopted in supply chain and compliance workflows.

Its most important use case is in Software Bills of Materials, or SBOMs. Both major SBOM formats, SPDX and CycloneDX, support pURL, just as they support SWHID. SBOMs use pURL to record which version of a package is present in a system. Open source detection and license compliance tools, for instance, use pURL to match packages.

Regulations such as the Cyber Resilience Act are increasing the demand for this kind of traceability. The OpenChain SBOM Telco Guide, mentioned in the first article of this series, is a concrete example of how pURL is used in the telecommunications industry.

pURL as an open standard

pURL is an ECMA standard, published as ECMA-427. Its specification is publicly available, as is the case for SWHID.

pURL is also part of the Open Source Security Foundation, or OpenSSF. OpenSSF is a cross-industry initiative under the Linux Foundation. Its goal is to improve the security of open source software. The presence of pURL in OpenSSF reflects how central it has become for software supply chain security tooling.

The adoption of pURL is growing. This growth is driven by the increasing use of SBOMs and by compliance requirements from new regulations, among other factors.

pURL syntax

According to the Scope section of the specification…

“A PURL is a valid URL and URI composed of seven components to identify a software package. The PURL type component defines the ecosystem-specific structure and meaning for the other PURL components.”

A pURL has the following structure:

pkg:type/namespace/name@version?qualifiers#subpath

It is made up of several components:

scheme: always pkg. It indicates that this is a Package URL.
type: the package ecosystem. Examples are npm, pypi, maven, and github.
namespace: optional. A type-specific prefix, such as an organisation or group name.
name: the name of the package within the ecosystem.
version: optional. The specific version of the package.
qualifiers: optional. Key-value pairs that provide additional context.
subpath: optional. A path to a specific component inside the package.

The scheme, type, and name components are always required. All other components are optional. The components form a hierarchy, from the most significant on the left to the least significant on the right. A pURL must not contain a URL authority, i.e. there is no support for username, password, host, or port components.

Here is an example. This pURL identifies version 12.3.1 of the Angular CLI package in the npm registry:

pkg:npm/%40angular/cli@12.3.1
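
To make the structure above more concrete, here is a minimal Python sketch that splits a pURL into its components. It is an illustration only, not the reference parser: the names PackageURL and parse_purl are my own, and the sketch skips most of the type-specific normalisation rules defined in the specification.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional
from urllib.parse import unquote


@dataclass
class PackageURL:
    """Hypothetical container for the components of a pURL."""
    type: str
    name: str
    namespace: Optional[str] = None
    version: Optional[str] = None
    qualifiers: Dict[str, str] = field(default_factory=dict)
    subpath: Optional[str] = None


def parse_purl(purl: str) -> PackageURL:
    """Split a pURL into components (simplified, no per-type rules)."""
    scheme, sep, rest = purl.partition(":")
    if not sep or scheme.lower() != "pkg":
        raise ValueError("a pURL must start with the 'pkg:' scheme")
    rest = rest.lstrip("/")

    # Work from the right: subpath first, then qualifiers.
    subpath = None
    if "#" in rest:
        rest, subpath = rest.rsplit("#", 1)
        subpath = unquote(subpath.strip("/")) or None

    qualifiers: Dict[str, str] = {}
    if "?" in rest:
        rest, query = rest.rsplit("?", 1)
        for pair in query.split("&"):
            key, _, value = pair.partition("=")
            if key and value:
                qualifiers[key.lower()] = unquote(value)

    # The type is the first path segment.
    type_, sep, rest = rest.partition("/")
    if not sep or not type_ or not rest:
        raise ValueError("a pURL needs at least a type and a name")

    # The version follows the last '@' of what remains.
    version = None
    if "@" in rest:
        rest, version = rest.rsplit("@", 1)
        version = unquote(version) or None

    # The last segment is the name, anything before it is the namespace.
    segments = [unquote(s) for s in rest.split("/") if s]
    if not segments:
        raise ValueError("a pURL needs a name")
    namespace = "/".join(segments[:-1]) or None

    return PackageURL(type_.lower(), segments[-1], namespace,
                      version, qualifiers, subpath)


if __name__ == "__main__":
    print(parse_purl("pkg:npm/%40angular/cli@12.3.1"))
```

Applied to the Angular CLI example, the sketch yields npm as the type, @angular as the namespace (the %40 is a percent-encoded @), cli as the name, and 12.3.1 as the version. For real workloads, a maintained pURL library is a better choice than this sketch.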

Comparing SWHID and pURL

SWHID and pURL share some characteristics. Both are software identifiers designed to improve traceability. Both are supported by SPDX and CycloneDX and used in SBOM workflows, although pURL is more widely used there. Both are open standards with freely available specifications.

The differences, however, are fundamental:

The most important one is the type of identifier: SWHID is intrinsic, so anyone can create, parse, and verify it, while pURL is extrinsic, assigned by a package registry.
SWHID is derived from the content of the code itself, therefore:
- SWHID works for both open source and proprietary software. pURL is designed for package ecosystems, which are mostly open source.
- SWHID also offers much finer granularity. It can identify a snippet, a file, a directory, a commit, or a full snapshot. pURL identifies a package as a whole or a specific part of it, through “subpath”.
- SWHID provides single-bit tamper detection: any single-bit change, whether in a source code file or a compiled binary, will alter the SWHID of that artifact. pURL is not designed for integrity.
SWHID is verifiable. You can recompute it from the artifact and confirm that it matches. pURL cannot be verified this way.
SWHID is resilient to link rot when combined with the Software Heritage archive. pURL depends on the continued availability of the registry and the stability of the package URLs.
SWHID has a huge success story in the Software Heritage archive. pURL has more users and is applied to more use cases.

There are other differences but the above are the most relevant for most people.
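
The points about verifiability and tamper detection can be illustrated with a few lines of Python. This is a minimal sketch that covers content (cnt) objects only, relying on the fact that a content SWHID is computed the same way Git hashes a blob. Directories, commits, releases, and snapshots use other serialisations that are not shown here, and the function names content_swhid and verify_content are my own.

```python
import hashlib


def content_swhid(data: bytes) -> str:
    """Compute the core SWHID of a content object from its raw bytes.

    The digest is the SHA-1 of a Git-style blob header followed by
    the bytes themselves.
    """
    header = b"blob %d\0" % len(data)
    return "swh:1:cnt:" + hashlib.sha1(header + data).hexdigest()


def verify_content(swhid: str, data: bytes) -> bool:
    """Recompute the identifier and check that it matches the claimed one."""
    core = swhid.split(";", 1)[0]  # ignore qualifiers, if any
    return core == content_swhid(data)


if __name__ == "__main__":
    original = b"print('hello')\n"
    claimed = content_swhid(original)

    # Flip the lowest bit of the first byte to simulate tampering.
    tampered = bytes([original[0] ^ 0x01]) + original[1:]

    print(claimed)
    print(verify_content(claimed, original))   # same bytes
    print(verify_content(claimed, tampered))   # one bit changed
```

No registry or network access is involved: whoever holds the bytes can recompute the identifier. A pURL offers no equivalent check, because nothing in the identifier is derived from the content.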

Creating pURL equivalents with SWHID

One thing that many people do not realise is that SWHID can carry similar information to pURL. Using SWHID qualifiers, you can point to the origin repository, the specific snapshot, the path, and the version of a package.

pURL

pkg:pypi/requests@2.31.0

scheme: pkg
type: pypi
name: requests
version: 2.31.0

SWHID

swh:1:rel:0106aced5faa299e6ede89d1230bd6784f2c3660;origin=https://github.com/psf/requests;visit=swh:1:snp:005b3bb9f7915e8a1f5525c350665ba41035b993

swh:1:dir:7be4f5113896a87e2d1ed58a00c237881ae79520;origin=https://github.com/psf/requests;visit=swh:1:snp:005b3bb9f7915e8a1f5525c350665ba41035b993;anchor=swh:1:rel:0106aced5faa299e6ede89d1230bd6784f2c3660

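In this example I have added, as SWHID qualifiers, the origin repository and a link to the corresponding snapshot in the Software Heritage archive, itself expressed as a SWHID. The second identifier also uses the anchor qualifier to tie the directory to the release. This does not mean that SWHID replaces pURL in practice. It means that the overlap between the two might be larger than most people think.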
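
To show how this contextual information can be read back, here is a minimal Python sketch that separates the core identifier from its qualifiers. It assumes the layout used in the examples above: a core of the form swh:1:<type>:<40 hex digits>, followed by key=value qualifiers separated by semicolons. The names ParsedSwhid and parse_swhid are my own, and the sketch neither decodes percent-encoded values nor checks qualifier names against the specification.

```python
import re
from dataclasses import dataclass, field
from typing import Dict

# Core identifier: scheme, schema version 1, object type, hex digest.
CORE_RE = re.compile(r"swh:1:(cnt|dir|rev|rel|snp):([0-9a-f]{40})")


@dataclass
class ParsedSwhid:
    """Hypothetical container for a SWHID and its qualifiers."""
    object_type: str
    object_id: str
    qualifiers: Dict[str, str] = field(default_factory=dict)


def parse_swhid(text: str) -> ParsedSwhid:
    """Split a SWHID into its core part and its qualifiers (simplified)."""
    core, *raw_qualifiers = text.strip().split(";")
    match = CORE_RE.fullmatch(core)
    if match is None:
        raise ValueError(f"not a valid core SWHID: {core!r}")

    qualifiers: Dict[str, str] = {}
    for item in raw_qualifiers:
        key, sep, value = item.partition("=")
        if not sep or not key:
            raise ValueError(f"malformed qualifier: {item!r}")
        qualifiers[key] = value

    return ParsedSwhid(match.group(1), match.group(2), qualifiers)


if __name__ == "__main__":
    parsed = parse_swhid(
        "swh:1:dir:7be4f5113896a87e2d1ed58a00c237881ae79520"
        ";origin=https://github.com/psf/requests"
        ";visit=swh:1:snp:005b3bb9f7915e8a1f5525c350665ba41035b993"
        ";anchor=swh:1:rel:0106aced5faa299e6ede89d1230bd6784f2c3660"
    )
    print(parsed.object_type)           # the kind of object identified
    print(parsed.qualifiers["origin"])  # where it was found
    print(parsed.qualifiers["anchor"])  # the release it belongs to
```

Only the core part is intrinsic and verifiable. The qualifiers add context, much as the type, name, and version do in a pURL, and they do not change which object the core identifier designates.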

Conclusion

pURL and SWHID are fundamentally different types of identifiers, although they share some characteristics.

pURL is widely adopted and, in my view, it will not be replaced by SWHID, at least not in the near future. SWHID will complement pURL for those who need intrinsic, verifiable identification.

SWHID is content-based. This makes it license-agnostic and therefore suitable for both open source and proprietary software. It is verifiable, as explained in the second article of this series. Combined with the Software Heritage archive, it is resilient to link rot for open source. It offers a wider range of granularity. It also prevents the ambiguity that comes from relying on package names alone, which causes problems in security and compliance contexts.

My recommendations are simple. Use SWHID if you need to identify both open source and proprietary software. If you already use pURL for open source, use SWHID to complement it.

The main goal is to use identifiers to manage software activities at scale. It is not about choosing one over the other. I even expect more standardised identifiers to appear to cover additional use cases.