Getting Claude to complete the spec
Mark Elvers

Categories

  • ocaml

Tags

  • tunbury.org

With Claude Code, perhaps we are now at the point where the test suite is actually more valuable than the code itself.

I’ve been experimenting with Claude quite successfully and have evolved a working IMAP and SMTP server implementation in OCaml. I say “evolved” because, although Claude generated the code from the RFCs in a single pass, what followed was an extensive period of debugging. I added the account to Apple Mail, and it didn’t work at all! Claude dutifully debugged the code, with much back-and-forth, until we had a working version. Or, at least, a version that worked with Apple Mail. What about Thunderbird? I didn’t try. My point, though, is that the more agentic coding we do, the more testing we inevitably need.

In the case of IMAP, I could have asked Claude to use a third-party, command-line IMAP client, and then the testing and debugging could have been automated. What about in cases where there is no client?

I decided to reimplement the IMAP daemon, breaking it down from a single prompt into an actual software project. Claude did the legwork, reviewing the RFCs and creating an architecture of the libraries/modules needed.

 ┌─────────────────────────────────────────────────────────────────┐
 │                        imap-server                              │
 │  (Connection handling, state machine, command dispatch)         │
 └─────────────────────────────────────────────────────────────────┘
          │           │            │            │           │
          ▼           ▼            ▼            ▼           ▼
 ┌─────────────┐ ┌─────────┐ ┌──────────┐ ┌─────────┐ ┌──────────┐
 │  imap-auth  │ │ mailbox │ │  search  │ │condstore│ │ tls-layer│
 │  (SASL,     │ │ (UID,   │ │ (SEARCH, │ │(QRESYNC,│ │ (ocaml-  │
 │   LOGIN)    │ │  flags) │ │  SORT)   │ │ modseq) │ │   tls)   │
 └─────────────┘ └─────────┘ └──────────┘ └─────────┘ └──────────┘
          │           │            │
          ▼           ▼            ▼
 ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
 │   maildir   │ │ mime-parser │ │ imap-parser │
 │  (storage)  │ │ (RFC 5322)  │ │  (ABNF)     │
 └─────────────┘ └─────────────┘ └─────────────┘
                       │               │
                       ▼               ▼
               ┌─────────────────────────────┐
               │         imap-types          │
               │  (Shared type definitions)  │
               └─────────────────────────────┘

I asked for a design document per module, with the intention that these specifications could be completed in parallel by N Claude instances. Each Claude would write an extensive test suite for its own code. In the end, though, I opted for a serial approach.

Starting with imap-types, I noticed that a message UID was defined as an int32, which I knew was wrong because it should be an unsigned int32. Anyway, an easy fix.
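To illustrate why the signed type is a problem, here is a minimal sketch of how a UID could be modelled instead. The Uid module and its function names are my own illustration, not the code Claude produced, and it assumes a 64-bit OCaml where a native int comfortably holds the full unsigned 32-bit range. The IMAP grammar defines a UID as a non-zero number, so zero is rejected as well.

```ocaml
(* Hypothetical Uid module: names are illustrative, not from the project. *)
module Uid : sig
  type t
  val of_int : int -> t option
  val to_int : t -> int
  val compare : t -> t -> int
end = struct
  (* Assumes 64-bit OCaml, where int is wide enough for 0 .. 2^32 - 1. *)
  type t = int
  let max_uid = 0xFFFF_FFFF
  (* UIDs are non-zero unsigned 32-bit values. *)
  let of_int n = if n >= 1 && n <= max_uid then Some n else None
  let to_int uid = uid
  let compare = Int.compare
end

(* Why a signed int32 is wrong: values above 2^31 - 1 wrap to negatives. *)
let wrapped = Int32.of_int 3_000_000_000
let wraps_negative = Int32.compare wrapped 0l < 0
```

Keeping the type abstract means the range check in of_int is the only way to construct a UID, so the rest of the code cannot accidentally hold an out-of-range value.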

IMAP commands start with a tag so that responses can be matched to commands when multiple commands are sent without waiting for a response. Apple Mail uses a tag format of 1.1, which was the first thing that needed to be fixed in the original server implementation.

The imap-parser passed all the tests. I asked for a specific test to be added, which covered a tag with a dot. It failed. The new parser wasn’t any better than the original one.
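For reference, the tag rule is small enough to sketch by hand. The following is not the generated parser; is_tag_char and parse_tag are hypothetical names, and the character set is my reading of the RFC 3501 grammar, where a tag is one or more ASTRING-CHARs other than "+". A dot is an ordinary character under that rule.

```ocaml
(* Sketch only: function names are hypothetical, not from imap-parser. *)

(* Printable ASCII, excluding the atom-specials ( ) { % * " \ and "+".
   "]" is allowed because ASTRING-CHAR includes the resp-specials. *)
let is_tag_char c =
  let code = Char.code c in
  code > 0x20 && code < 0x7f && not (String.contains "(){%*\"\\+" c)

(* Split "TAG SP rest" into (tag, rest); None if the tag is malformed. *)
let parse_tag line =
  let n = String.length line in
  let rec scan i =
    if i < n && is_tag_char line.[i] then scan (i + 1) else i
  in
  let stop = scan 0 in
  if stop > 0 && stop < n && line.[stop] = ' ' then
    Some (String.sub line 0 stop, String.sub line (stop + 1) (n - stop - 1))
  else None
```

With this sketch, parse_tag "1.1 NOOP" yields the tag 1.1 and the remainder NOOP, while an empty tag or a tag containing "+" is rejected.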

Prompts to Claude can be submitted on the command line, e.g. echo "hello" | claude --print, so instances can be driven from a script.

How about creating two Claude instances, with one implementing a server and the other a client? Each instance could declare what features it supports, and an orchestration script could run the tests. Or, taking this further, have a third Claude instance act as the moderator and generate the test suite based on the features implemented by the server and the client.
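The feature-declaration idea can be sketched as a set intersection: a test is only worth running if both sides claim to support everything it needs. This is an illustration of the idea rather than the orchestration script itself; the test_case record, the runnable function and the feature names in the example data are all hypothetical.

```ocaml
(* Sketch of feature-based test selection; all names are hypothetical. *)
module Features = Set.Make (String)

type test_case = { name : string; requires : Features.t }

(* Partition tests into those both sides can support and those to skip. *)
let runnable ~server ~client tests =
  let shared = Features.inter server client in
  List.partition (fun t -> Features.subset t.requires shared) tests

(* Made-up declarations, purely for illustration. *)
let server = Features.of_list [ "LOGIN"; "IDLE"; "MOVE" ]
let client = Features.of_list [ "LOGIN"; "IDLE"; "STARTTLS" ]

let tests =
  [ { name = "idle-basic"; requires = Features.of_list [ "LOGIN"; "IDLE" ] };
    { name = "starttls-upgrade"; requires = Features.of_list [ "STARTTLS" ] } ]

let to_run, skipped = runnable ~server ~client tests
```

Reporting the skipped list is as useful as the results themselves, since it shows which features neither side, or only one side, chose to implement.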

This worked reasonably well.

  • 86 tests generated by moderator
  • 85 passed, 1 failed

The tests covered:

  • Basic protocol (greeting, capability, noop, logout)
  • Authentication (valid/invalid logins, multiple users, quoted passwords)
  • Mailbox operations (select, examine, create, delete, rename)
  • LIST, LSUB, SUBSCRIBE, STATUS
  • FETCH (flags, uid, body, envelope, headers, bodystructure, etc.)
  • STORE (add/remove/replace flags, silent mode)
  • COPY, MOVE, EXPUNGE, CLOSE, UNSELECT
  • SEARCH (all, unseen, flagged, subject)
  • UID variants (fetch, store, search, copy)
  • APPEND (simple, with flags, literal+)
  • Extensions: IDLE, ENABLE, NAMESPACE
  • Adversarial tests: malformed commands, missing tags, garbage input, null bytes, rapid commands
  • Concurrency: multiple simultaneous connections

And this found a real bug: the server did not reject a FETCH sequence starting at zero, which the RFC does not allow.
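The check itself is simple, which is perhaps why it was missed. As a sketch of the kind of validation involved (not the server's actual code; the seq type and both function names are hypothetical), a sequence number is either "*" or a non-zero number that does not begin with the digit 0, and a range is two of those joined by a colon. The upper bound assumes a 64-bit OCaml.

```ocaml
(* Sketch only: type and function names are hypothetical. *)
type seq = Num of int | Star

(* True when s is a non-empty string of ASCII digits. *)
let all_digits s =
  let ok = ref (s <> "") in
  String.iter (fun c -> if c < '0' || c > '9' then ok := false) s;
  !ok

(* "*" or an nz-number: digits only, first digit non-zero, 32-bit range. *)
let parse_seq_number s =
  if s = "*" then Some Star
  else if all_digits s && s.[0] <> '0' then
    match int_of_string_opt s with
    | Some n when n <= 0xFFFF_FFFF -> Some (Num n)
    | _ -> None
  else None

(* A single number "n" or a range "a:b"; anything else is rejected. *)
let parse_seq_range s =
  match String.split_on_char ':' s with
  | [ a ] -> Option.map (fun x -> (x, x)) (parse_seq_number a)
  | [ a; b ] -> (
      match (parse_seq_number a, parse_seq_number b) with
      | Some x, Some y -> Some (x, y)
      | _ -> None)
  | _ -> None
```

Here parse_seq_range "0:5" returns None, so a server built on it would reject the command rather than process it.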

Interestingly, neither the client nor server supported STARTTLS. Both had a copy of RFC 8314 but chose not to implement the feature. I put this down to a poor choice of wording in the prompt. I’d said “production-ready”, which to me implies TLS, but “feature-complete” removes the wiggle room. I specifically didn’t want to say “implement TLS” as this is specific to IMAP and wouldn’t apply in other projects.

The next generation of the script provided the three Claude instances with a copy of the RFCs. Client Claude, Server Claude and Moderator Claude were tasked with implementing the client, the server, and the testing and moderation, respectively, entirely from the RFCs. The script ran iteratively, with more tests added at each pass, along with client and server fixes.

Did I get TLS? No. The dune file called for it, but the library wasn’t opened in the code. The test checked for STARTTLS, and the server replied with “OK Begin TLS negotiation now”, but that’s as far as it got in five cycles.