As code gets cheaper and cheaper to write, so do re-implementations. I mentioned recently that I had an AI port one of my libraries to another language, and it ended up choosing a different design for that implementation. In many ways the functionality was the same, but the path it took to get there was different. The way that port worked was by going via the test suite.
Something related, but different, happened with chardet. The current maintainer reimplemented it from scratch by pointing an AI only at the API and the test suite. The motivation: enabling relicensing from LGPL to MIT.
The Ship of Theseus Paradox
This raises interesting questions about software identity. If you rewrite a piece of software piece by piece, at what point does it become a different program? This is the Ship of Theseus paradox applied to code.
The Ship of Theseus is a thought experiment that goes back to ancient Greece. If you replace every plank in a ship one by one, is it still the same ship? At what point does it become a different vessel? And if you take all the old planks and build another ship, which one is the "real" ship?
When an AI rewrites your code, even if it preserves the same functionality, is it still your code? The implementation details, the particular choices made along the way, the subtle bugs and their fixes — all of these contribute to the unique fingerprint of a codebase.
// Original implementation
function detectEncoding(buffer) {
  const detector = new CharsetDetector();
  return detector.detect(buffer);
}

// AI-generated port with a different design
const detectEncoding = async (buffer) => {
  const result = await analyzeBuffer(buffer);
  return result.encoding;
};
The Role of Tests in Code Identity
What's fascinating about the AI-driven porting process is that it uses tests as the specification. The tests become the identity of the code — they define what the code is, not how it does it. This is a profound shift in how we think about software.
When I write code, I'm making thousands of small decisions. Should this be a class or a function? Should this error be handled here or passed up the stack? Should this data structure be an array or a map? Each of these decisions is informed by my experience, my mental model of the problem, and my aesthetic preferences about code.
An AI making these decisions based on a test suite is free from all of that baggage. It might make different choices, some of which might be better. But are these choices still "mine"? Does the code still reflect my intentions as a programmer?
Implications for Open Source
This has far-reaching implications for open source licensing. If an AI can be pointed at a codebase and generate a clean-room implementation that passes the same tests, what happens to the original author's rights and attribution?
"The question is not whether AI can reproduce your code, but whether we consider the output to be a derivative work."
Current open source licenses assume that code is copied, modified, or extended. They don't account for the possibility that code could be "read" and then "reimplemented" by an AI. The legal framework we have wasn't designed for this scenario.
Consider the implications:
- Copyleft licenses like GPL require derivative works to be licensed under the same terms. But is AI-generated code based on GPL-licensed code a derivative work?
- Permissive licenses like MIT require attribution. But if an AI "learns" from MIT-licensed code and produces similar code, does that attribution requirement transfer?
- Trade secrets are protected by not sharing code. But if an AI can reverse-engineer functionality from tests or public documentation, what happens to that protection?
The Future of Code Ownership
As we navigate this new landscape, we'll need to develop new norms and possibly new legal frameworks around code ownership and attribution in the age of AI-generated software.
Perhaps the solution lies in being more explicit about what we consider the "essence" of our code. Is it:
- The specific implementation details?
- The algorithmic approach?
- The API surface?
- The test suite?
- Or something else entirely?
My suspicion is that we'll need to think about code on multiple levels, with different rules for each. The API surface might be one thing (with its own copyright considerations), while the implementation might be another. Tests might occupy a middle ground — functional specifications that are neither purely API nor purely implementation.
What's clear is that the old ways of thinking about code ownership are being challenged. The Ship of Theseus isn't just a philosophical puzzle anymore — it's a practical question that we, as an industry, need to answer.
And unlike the ancient Greeks, we can't afford to debate it indefinitely. The AI systems are already here, and they're already rewriting our code, one function at a time.