Syntax-agnostic documentation & AST extraction for C/C++

About MetaDocs


TL;DR

Existing documentation systems are either underpowered or awkward to read and write, tying you to a particular syntax. MetaDocs will use clang's -ast-dump to extract documentation written in any syntax along with the actual structures of your code.

Existing Tools

Current code documentation systems suck.

They're either annoying to read and write in the file, with a syntax that makes Perl look readable, or they're entirely ignorant of their surrounding context, incapable of referencing the code around each chunk of documentation.

Doxygen

Let's look at Doxygen as the prototypical example of C/C++ documentation systems. First of all, if you're happy with it then I'm glad for you, and I don't really want to convince you to be sad.

With that out of the way, let's look at an approximation of the pipeline Doxygen uses and my issues with each stage.

It takes source code with inline source docs, parses the code structure and the docs with either a custom parser or with clang, optionally outputs XML describing the internal format, and generates an HTML website.

[Image: MetaDocs Pipeline Doxygen.png]

Source: Syntax

Doxygen has a custom syntax that has to be learned separately and makes inline reading and writing difficult. To become more user-friendly, they added support for a custom variant of Markdown. It doesn't have all the extended features of Markdown that I'd like: there's no way to write data definition lists or non-headed tables/multiline cells in a way that's readable inline. Don't tell me that inline HTML is readable. This is not a minor thing: data definition is a large component of documentation!

This does raise an interesting question though: Why can we have 2 different syntaxes but not others?

Parsing

  • Doxygen by default uses a custom parser. This is not ideal in that it will - infrequently, but often enough to be irritating - disagree with your compiler about what types are what.
  • There is an option of using the clang parser with a compilation database, but this is still not ideal. It's slow.
  • There is no incremental doc generation.

XML

  • XML. Need I say more?
  • There seems to be some data excluded from this that would be useful. This means that tools like m.css, which otherwise improve on Doxygen's HTML output, are limited in what they can show.

HTML

  • Doxygen's stock output isn't great. It prioritizes the auto-generated docs over human-authored ones. This seems backwards.

clang

  • I think in order to do things like compare actual parameter lists to documented parameter lists, clang has picked up the ability to parse Doxygen-style comments. Importantly, this means that it will add all doc-like comments that Doxygen would pick up into its AST, attaching them to the appropriate symbol. We can make use of this...
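
To make "attached to the appropriate symbol" concrete, here is a minimal sketch that walks a JSON AST and collects the comment text found under each named declaration. The fragment is hand-written and heavily abbreviated, in the shape of what `clang -Xclang -ast-dump=json -fsyntax-only` emits; the function `clamp` and the helper names are made up for illustration, and this is not MetaDocs' actual code. Comments that are not Doxygen-style are only attached if clang is also given `-fparse-all-comments`.

```python
import json

# Hand-written, abbreviated fragment in the shape of clang's JSON AST dump.
# Real dumps carry many more fields (ids, source locations, ranges, ...).
SAMPLE_AST = json.loads("""
{
  "kind": "TranslationUnitDecl",
  "inner": [
    {
      "kind": "FunctionDecl",
      "name": "clamp",
      "type": {"qualType": "int (int, int, int)"},
      "inner": [
        {"kind": "ParmVarDecl", "name": "value", "type": {"qualType": "int"}},
        {"kind": "ParmVarDecl", "name": "lo", "type": {"qualType": "int"}},
        {"kind": "ParmVarDecl", "name": "hi", "type": {"qualType": "int"}},
        {
          "kind": "FullComment",
          "inner": [
            {
              "kind": "ParagraphComment",
              "inner": [
                {"kind": "TextComment", "text": " Restrict value to the range [lo, hi]."}
              ]
            }
          ]
        }
      ]
    },
    {"kind": "FunctionDecl", "name": "undocumented", "type": {"qualType": "void (void)"}}
  ]
}
""")


def comment_text(node):
    """Concatenate every TextComment found beneath a node."""
    if node.get("kind") == "TextComment":
        return node.get("text", "")
    return "".join(comment_text(child) for child in node.get("inner", []))


def documented_decls(node, found=None):
    """Collect (kind, name, doc) for each named node that carries a FullComment."""
    if found is None:
        found = []
    children = node.get("inner", [])
    docs = [comment_text(c).strip() for c in children if c.get("kind") == "FullComment"]
    if docs and "name" in node:
        found.append((node["kind"], node["name"], " ".join(docs)))
    for child in children:
        if child.get("kind") != "FullComment":
            documented_decls(child, found)
    return found


print(documented_decls(SAMPLE_AST))
```

The comment text comes back verbatim, so whatever markup the author used inside the comment survives untouched for a later stage to interpret.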

NaturalDocs

NaturalDocs is an interesting experiment in making the source docs readable.

It has its own custom syntax with some useful grouping features. It's nice enough to read, but again takes a bit of time to get used to writing & doesn't have all the features you might want (although it does include definition lists!). We already have a lot of general-purpose human-readable markups... Do we really need more that are specialized for code docs? (I'm not entirely excluding the possibility that we actually do, but let's explore that space...)

NaturalDocs has different feature-sets for different languages. For C/C++ it exclusively parses comments and can't reference symbols that aren't explicitly documented. As a result it requires quite a lot of repetition of information already there in the code.

An alternative pipeline

[Image: MetaDocs Pipeline New.png]

Goals

  • Extract docs without requiring they be written in a specific format
  • Generate an easily-readable format as an intermediate step for turning into documentation
  • Have an example of this being generated for at least 1 syntax

Bonus Goals

  • Be very fast. Orders of magnitude speed improvements change the ways in which you can interact with tools.
  • Minimise duplication between code & documentation - use code where possible as it can't go stale.
  • Incremental doc-gen

Anti-goals

  • Create a new inline documentation syntax

Proposed solution

  • use clang's -ast-dump=json to generate a JSON-format AST with docs attached (this can be incremental, and could possibly be done as part of the normal build process)
  • read this AST, extract the docs, and turn the AST into something more usable: a docs-focused AST ("DST"/"DAST"?). This could be any format, but we'll probably stick with JSON initially. I'm keeping a binary format in mind when making decisions though.
  • convert it into the article document format, possibly using the AST to structure it and generate extra details in that format.
  • publish to HTML using the standard tools for that format
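
As a rough illustration of the third step, the sketch below renders a docs-focused node into Markdown-like article text. Everything about the node layout here (`kind`, `name`, `type`, `doc`, `children`) is a hypothetical stand-in for the "DAST", and Markdown is only assumed as the article format; the point is that the signature and parameter list come from the AST while the prose comes from the comment.

```python
def render_symbol(node, depth=2):
    """Render one docs-focused node (and its children) as Markdown-like lines."""
    lines = ["#" * depth + " `" + node["name"] + "`", ""]
    if node.get("type"):
        # Taken from the code itself, so it cannot go stale.
        lines += ["Type: `" + node["type"] + "`", ""]
    if node.get("doc"):
        # Human-written text, passed through without interpreting its syntax.
        lines += [node["doc"], ""]
    children = node.get("children", [])
    params = [c for c in children if c["kind"] == "param"]
    if params:
        lines.append("Parameters:")
        for p in params:
            lines.append("- `" + p["name"] + "`: `" + p["type"] + "`")
        lines.append("")
    for child in children:
        if child["kind"] != "param":
            lines += render_symbol(child, depth + 1)
    return lines


# Hypothetical docs-focused node for a documented function.
CLAMP = {
    "kind": "function",
    "name": "clamp",
    "type": "int (int, int, int)",
    "doc": "Restrict value to the range [lo, hi].",
    "children": [
        {"kind": "param", "name": "value", "type": "int"},
        {"kind": "param", "name": "lo", "type": "int"},
        {"kind": "param", "name": "hi", "type": "int"},
    ],
}

print("\n".join(render_symbol(CLAMP)))
```

The output can then be handed to whatever standard tool publishes that article format to HTML.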

Closing thoughts

There are 3 sources of documentation I think need capturing:

  • standalone articles (tutorials, overviews, etc)
  • notes that are tied to a specific code feature/scope, directly next to the associated code
  • AST structure (autogenerated), particularly type dependencies & function parameters

These should be able to refer to each other, ideally using the same syntax for both inline docs and articles. The inline docs should be fairly readable, and the generated HTML should be very readable, with pleasant typography, additional images, diagrams etc.

Whereas Doxygen enforces the inline-docs syntax on articles, I think it'll make more sense to allow article-doc syntax into inline docs.

So I said before that current code documentation systems suck. Will MetaDocs not suck? To be honest, it probably still will. But it'll be a different flavour of suck, and one that I hopefully prefer!

N.B. I've put about 25 hours into this already, almost all streamed on Twitch & YouTube.


Recent Activity

I had a few minutes spare this morning so I rushed out a summary video of how MetaDocs is looking after the jam (apologies for lack of polish, I didn't have time to plan it out fully): https://youtu.be/kKppU0zcXjM

Highlights:

  • reading clang's AST from JSON into C structures, deduplicating repeated nodes
  • converting that into a more convenient structure for navigating symbols and their attached documentation comments
  • starting to autogenerate a markdown-like syntax for symbols and attaching the documentation text
  • evaluating constant expressions from the AST for enum values
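
For the last highlight, here is one way enum values could be evaluated from the AST; this is a sketch under assumptions, not the code from the stream. The node kinds and the `opcode`, `value`, and `referencedDecl` fields follow clang's JSON dump, but the sample enum is hand-written and the helper names are made up. References to earlier enumerators are resolved through a table keyed by node `id`, and enumerators without an initializer take the previous value plus one.

```python
import operator

# Division and modulo are left out: C truncates toward zero, Python floors.
# Python integers also never overflow, unlike C's fixed-width types.
BINARY_OPS = {
    "+": operator.add, "-": operator.sub, "*": operator.mul,
    "<<": operator.lshift, ">>": operator.rshift,
    "|": operator.or_, "&": operator.and_, "^": operator.xor,
}
UNARY_OPS = {"-": operator.neg, "+": operator.pos, "~": operator.invert}
# Wrapper nodes that only forward the value of their single child.
PASS_THROUGH = {"ConstantExpr", "ImplicitCastExpr", "ParenExpr"}


def evaluate(expr, known):
    """Evaluate an integer constant expression; `known` maps node id -> value."""
    kind = expr["kind"]
    if kind == "IntegerLiteral":
        return int(expr["value"])
    if kind in PASS_THROUGH:
        return evaluate(expr["inner"][0], known)
    if kind == "UnaryOperator":
        return UNARY_OPS[expr["opcode"]](evaluate(expr["inner"][0], known))
    if kind == "BinaryOperator":
        lhs, rhs = (evaluate(e, known) for e in expr["inner"])
        return BINARY_OPS[expr["opcode"]](lhs, rhs)
    if kind == "DeclRefExpr":
        return known[expr["referencedDecl"]["id"]]
    raise ValueError("unsupported expression kind: " + kind)


def enum_values(enum_decl):
    """Return {enumerator name: value} for one EnumDecl node."""
    known, values, next_value = {}, {}, 0
    for constant in enum_decl.get("inner", []):
        if constant.get("kind") != "EnumConstantDecl":
            continue
        exprs = [c for c in constant.get("inner", []) if c.get("kind") != "FullComment"]
        value = evaluate(exprs[0], known) if exprs else next_value
        known[constant["id"]] = value
        values[constant["name"]] = value
        next_value = value + 1
    return values


# Hand-written stand-in for: enum Flags { A = 1 << 4, B, C = A | 3 };
SAMPLE_ENUM = {
    "kind": "EnumDecl", "name": "Flags",
    "inner": [
        {"id": "0x1", "kind": "EnumConstantDecl", "name": "A",
         "inner": [{"kind": "BinaryOperator", "opcode": "<<", "inner": [
             {"kind": "IntegerLiteral", "value": "1"},
             {"kind": "IntegerLiteral", "value": "4"}]}]},
        {"id": "0x2", "kind": "EnumConstantDecl", "name": "B"},
        {"id": "0x3", "kind": "EnumConstantDecl", "name": "C",
         "inner": [{"kind": "BinaryOperator", "opcode": "|", "inner": [
             {"kind": "DeclRefExpr", "referencedDecl": {"id": "0x1", "name": "A"}},
             {"kind": "IntegerLiteral", "value": "3"}]}]},
    ],
}

print(enum_values(SAMPLE_ENUM))  # {'A': 16, 'B': 17, 'C': 19}
```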

Still to finish:

  • representing primitive types in the same way as custom types
  • finalising binary and JSON outputs
  • a complete example translating to final HTML
  • library methods to assist with references between documentation and symbols
  • proper scoping of types