Notes on Building a Live Coding Tool for C99

Posted on September 28, 2015 by Richard Goulter
Tags: , , , ,

I’ve been invited to talk at NUSHackers Friday Hacks on October 9th, about my FYP of building a ‘worksheet’/‘live coding’ environment for C.
This post is to note down anything pertinent / thinking-out-loud as to what to say.

This is a draft. Here is a link to the presentation draft.
(Obviously, the colours need to change.).

What is my FYP?

Let’s have a GIF explain that:

demo

(Here is the same gif, slower speed).

i.e. show the results of the computation directly beside the C program itself, inside the editor. (This saves having to quit the editor, type gcc <whatever>.c, ./a.out, then try and figure out what’s going on).
Another way to describe this as a ‘REPL on steroids’.
You might be aware of other such tools around:

Loosely related is also, e.g. GDB/LLDB. A more sober way to describe my plugin is “auto-generated printf debugging”. – GDB/debuggers provide ‘inspect the value at some point in execution’ and ‘can easily evaluate expressions’ functionality. (GDB is known to be easy-to-use, widely used by the CS1010 kids, etc.).

My project on GitHub: rgoulter/c-worksheet-instrumentor (the main logic), and rgoulter/c-worksheet.vim (a plugin for Vim).
(It’s got a handful of stars on GitHub. You just know I’m amazing at publicity.).

Goals for Presentation (Meta)

Hopefully, (esp. if you play around with it yourself), you’ll think that’s kinda cool.
I’ll know if I did a good job with my talk if you walk out thinking “that’s it, huh? I was expecting something complicated.” / “I could do that, chump.”. (I’ll’ve done an even better job if some of you go out and try and make this kindof stuff).

It’s kinda lame, for me, to say “hey, what I did for my FYP is quite simple”. (“If you can’t explain it simply, you don’t understand” / “it’s only simple if we simplify out the details” aside). The fortunate thing for this is that C has relatively few features. (e.g. No closures / functions within functions). But it’s not a sit-down-and-finish-it-in-an-hour kindof project. It is a lot of fun. (How many kids continue working on their FYP after they’ve presented in their free time?). You’ll also get very good at the language; I got somewhat familiar with the C99 language standard, and certainly stumbled upon the more esoteric aspects of (simple) C99, my favourite being:

int isWhitespace[256] = { [' '] = 1, ['\t'] = 1, ['\n'] = 1 };
isWhitespace['a'];  //> 0
isWhitespace[' '];  //> 1
isWhitespace['\t']; //> 1

“What’s In It For You” / Motivation For Caring About the Presentation

Something like ‘Example of a cool Programming-Language Application project you can make, once you have knowledge of how to make a parser & roughly you could implement it’, esp. if the inspiration is ‘I want that for my favourite language’.

On the other hand, PL applications needn’t be like this, and can be more modest, like, ‘organise imports’ or ‘indent properly’ or so.

Parsers and PL Applications

– It can become clear by the terminology, the way someone talks, how much they know or don’t know about a subject. “Parsers” is one of these areas I don’t know all that much about. (I never took the compilers/parser course.).

But in terms of “writing a worksheetify program”, roughly formula for this is:

  1. Parse the source code.
  2. ???
  3. Profit.

(Which applies to writing any PL application in general.).
There are different ways to do it. (And of course, there are trade-offs for different approaches.).

But a key point I learned is, if you want to show results within functions / nested scopes / control structures, it’s not enough to just pass it off to a REPL. You must be able to parse the source code.

Making Use of the AST

In situations where you’re writing programs whose data is other programs, you’re going to want to make use of techniques a level abstracted from directly traversing the AST.

AST

ANTLR (see below) provides two ways to handle this: Tree Listeners, Tree Visitors.
With Tree Listeners, you implement listener methods which get called as the AST is walked.
With Tree Visitors, you implement methods which then control which parts of the tree are ‘visited’; the intent is to evaluate the AST to some value. (e.g. Numeric expression to a number, like in the simple math expression grammar example).

– The Tree Visitor is kindof more FP-ish, except having to explicitly call ‘visit’ on the children nodes of the AST is also a bit tedious. - This can be abstracted another level, where you have e.g. AST -> Option[T], and some combine :: [T] -> T, then it’s possible to write a general fold_ast :: AST -> (AST -> Option[T]) -> ([T] -> T) -> T function for the AST. (The AST -> Option[T] means “maybe extract value from this AST node; if this node returns Some thing, use that, otherwise keep looking.”.). - Manipulation of AST could be done in a similar way.

Anyway. The dynamic (in ANTLR) is usually that the Tree Listeners can be used to ‘grab what you want’ out of the AST, Tree Visitor is better at ‘evaluating’ some part of the AST to a value.

My Choice of Parser / Technique

For the ‘parser’, I made use of ANTLR parser-generator (like flex/yacc; just give it a BNF-esque grammar, and it generates the lexer/parser for you). Mostly because this can generate a parser in Java, and I was using Scala for the project (as specified by FYP description – Though I think since FP/‘strongly-typed’ languages are kinda nicer to write parsers in, I think the hope was I’d hand-write the parser..)

(See my adapted C.g4).

Making use of gccxml, or an LLVM/Clang pass might’ve also had it’s advantages, ‘but’, then I’d have to write in C++ (for the latter), it makes the executable size for my program much larger (which is fine if your plugin is like YouCompleteMe), and I more/less wanted to keep the approach ‘general’, rather than tied to a specific language.
[Note: a disadvantage of not using gcc/clang is that there may be divergence between what they parse and what I parse; I mitigate this in my project by first compiling the given input C file. i.e. anything I can parse, they can parse (better?)].

In terms of “in my case”:
One approach might be to just pass the code of to some C interpreter anyway. (Cling is a C++ interpreter developed by the folk at CERN, a successor of an older CInt (C Interpreter) project of theirs. ‘Cling’ because it’s built upon LLVM/Clang.). This’d require potentially renaming variables so that everything is unique, and ‘unfolding’ control structures (like loops, etc.). – Quite a messy way of doing things.

Another approach (which I never investigated) would be to generate a sequence of commands to give to GDB/LLDB so as to get the information (i.e. values of variables at some point in execution). (I did investigate using GDB/LLDB insofar as ‘will this let me avoid having to learn about the PL stuff?’; it’s not suffice because you still need to know for a statement e.g. x = y = 4 what information you’d like to get from GDB/LLDB). – This isn’t necessarily a bad approach.

Instrumenting

The approach I did take was to ‘instrument’ (or ‘augment’) the code.
i.e. insert statements into the C program so that I can run the program to get the output.

Roughly like:

int main() {
  int x;
  x = 3;
}

becomes

int main() {
  int x;
  x = 3;
  OUTPUT(x, STRING_OF(3))
}

and as-if compiled with a pre-processor which handles OUTPUT, STRING_OF. OUTPUT outputs in a format my Scala program intercepts. (I just use STDOUT; it could also be done using, say, sockets, or using something like Apache Thrift / or some other IPC/RPC framework – With C, this is a bit annoying, e.g. becomes more difficult to compile on Windows. [My tool, & the Vim plugin, work on Windows, btw]).

The pseudocode for the instrumentor is more/less:

def instrument(source):
  ast <- parse source
  types <- ctypes from declarations in AST

  for each declaration:
    insert(output declaration.toEnglish)

  for each expression statement:
    // n.b. assignments are expressions
    insert(output declaration.toString)

Aside from ANTLR-specific stuff (e.g. ‘inserting’ into a token stream), the key challenge here is getting the ctype from a declaration. Furthermore, you need to be able to get the ctype of an expression, and from the ctype be able to generate ‘toString’ output for it.

Considerations Made When Doing This For C

Specifically for C, my program basically outputs for declarations (like cdecl), and outputs values for expression statements.

Representing C Kinds

A C variable/expression will be of some type: * a primitive type (e.g. int, float, char), * a pointer to some C type, * an array of some C type, * a struct/union (consisting of a sequence of identifiers, each of some C type) type, * an enum type, * a function (consisting of a return value of some C type, and a sequence of parameters each of some C type).

If you can see how the above might be represented as an ‘Algebraic Data Type’, congratulations. (If not, you probably haven’t heard of ADTs.).

e.g. in ML languages like OCaml/Haskell, you might write it something like:

type CType = Primitive of String * String
           | Pointer   of String * CType
           | Array     of String * CType
           | Struct    of String * String * ((String, CType) List)
           | Function  of String * CType * ((String, CType) List)

In Scala, the code (simplified) is something like:

sealed trait CType;
case class PrimitiveType(id: String, type: String) extends CType;
case class PointerType(id: String, to: CType) extends CType;
case class ArrayType(id: String, of: CType) extends CType;
case class StructType(id: String, tag: String, members: List[(String, CType)]) extends CType;
case class FunctionType(id: String, rtnType: CType, params: List[(String, CType)]) extends CType;

(See CType.scala).

In Java … ha ha ha, just kidding. (Just because you can, doesn’t mean you should.).

Making use of this ADT

Again, often in FP programming, once you have the types, the rest is fairly straight-forward. In my case: * I want this ‘CType’ from the AST. (kinda because C99 doesn’t have typeof(x)). * I want to output e.g. string-of some ‘CType’. (because C99 doesn’t have stringof(x)).

Declarations

The devil is in the detail, but (in terms of ANTLR): * You can use a TreeVisitor to evaluate a declaration statement to a list of CType(s). * You can then use a TreeListener to traverse the AST and pick up the declarations.

The main complication here is that the listener has to care about scope.
There’s also things like forward declarations, types declared in header files, & typedefs.

It can be a bit tedious to figure this out, since you have to care about all occurrences of whatever AST node you’re working with. e.g. declaration can appear in the global scope, outside of functions; the identifiers of a struct’s members will look the same as a ‘normal’ declaration, etc.
– What helps here is comparing the ASTs of declarations which are quite similar. int *(*a)[5];, int **b[5];, int (**c)[5] have similar-ish trees, but very different semantics.

The ADTs make it easier

In case the above diagram doesn’t make that obvious, then it’s roughly “collect the type specifiers and any point-to-(type-specifier) info; this is passed to the declarator as the AST is traversed. Something something recursion, and that’s how simple it is.
Again, roughly, since each alternative for the declarator corresponds to some alternative of the above ADT, it does help write the program logic.

CDecl

The actual cdecl (e.g. at cdecl.org, or as a programming exercise in the “Expert C Programming” book) is more/less a heuristic, doesn’t actually take care of all C declarations.
But going from (String ->) AST -> CType, it’s not so hard to go CType -> String to ‘explain’ what a C declaration is.
Figuring it out takes the same kindof logic as figuring out how to do AST -> CType stuff.

Expressions

This is more/less straightforward, following the CType ADT.

In particular, I use StringTemplate as a template rendering engine. (Like Python’s Jinja). So you have a string_of_primitive, string_of_pointer, string_of_array, etc. etc. and pass values to StringTemplate such that it can recursively generate the C code for you.

The ADTs make it easier

Since what’s instrumented is C code, the goal is to come up with C code which constructs a String for some ctype. Again, since it’s recursive, then the code to generate this C code is going to involve recursion. So it’s just a matter of writing code to construct a string for a primitive type, and array of ctype, etc.

See constructs.stg See

Instrumenting: Putting it All Together

I don’t reconstruct the AST directly, but just inject the instrumenting C code into a TokenStreamRewriter (one of ANTLR’s things, for a stream of the lexed tokens).

Cool stuff: Segfault Handling

That’s all ‘nice’, but something I’m quite proud of (& in terms of ‘useful for a C beginner’) is it catches SEGFAULTs, and marks the output there:

  int *px;     //> px is pointer to int
  px = 0;      //> px = (nil)
  *px = 5;     //> SEGFAULT

‘How’ involves a bit of knowledge about C99:

So, to catch a segfault, you instrument each expression-statement (includes assignments & function calls, excludes declaration statements) with the setjmp (like try { }); if a SEGFAULT signal is thrown, catch that in signal, and then use longjmp (like throw ..); in the place you jump to (like catch { }), output SEGFAULT to STDOUT, then exit.
– Of course, the instrumented C code then becomes something no sane human would ever write.

Cool stuff: Type Inference on Arbitrary C Expressions

Being able to infer the type of arbitrary expressions e.g. 4*12 or p[q] enhances the usability of the tool/plugin; the use case becomes “want to show the value of an expression at some point? Simply write that expression.”.

The Type Inference takes a bit of thought, e.g. in p[q], it can either be an array lookup, or a pointer offset.
But, again, while the logic is somewhat tedious, the code to do this inference straightforwardly follows the grammar/AST for expressions; and I’m not sure how you’d do that without an ADT like the one above.

Limitations

As discussed on the repository:

‘Contributors Wanted’

If you find any bugs, I’d be delighted for you to file an issue.
– Keep in mind, though, that the Scala Worksheets tool doesn’t work when the program uses semicolons to delimit statements. - So I think I’ve not done a bad job.

– For selfish reasons, of course I’d love to see plugins for other editors like Emacs, Sublime Text, Atom, LightTable, etc. (and it’s a bit brutal to try and learn the APIs if you don’t use the editors).
The vim plugin is fairly lightweight, since all it needs to do is communicate over a socket to the Scala server.
– Certainly it’d be interesting to see, in terms of UX, how the tool / plugin could be improved.

– Originally it was supposed to be an Eclipse plugin.
I scrapped that since dealing with dependency-jars within OSGi is not straight-forward to the uninitiated. - Particularly frustrating is that the plugin would work in the dev environment, but wouldn’t work when exported (due to e.g. classpath issues.).
– Either dealing with OSGi stuff, or just imitating the ‘lightweight’ approach of tool-as-external-dependency would fix that.

But of course, I’d be much happier to see people try and do some PL project with the language of their choice. (Doesn’t have to be a worksheet.).

Now That I Have Your Attention - Project Note Taking

I think it’s fair to say no particular bit there is all that ‘hard’, but I can see that altogether it’s not exactly trivial.
FWIW, I had like half a year to make it. (Half, because who does most of their work in the first half of their FYP?, and because I didn’t know squat about PL in the first half..).

I really like tools. (This doesn’t really sound like a ‘cool’ FYP to do unless you’ve at least some love for tools.). – e.g. Not ‘vim vs emacs’ so much as ‘your editor should be able to do these things’.

For note-taking, I kinda-sorta made use of OneNote.
Briefly, OneNote is nice for collecting details/links on a topic (it has a nice hierarchy of sections -> pages -> bullet points). [These days, if I’m thinking hierarchically with a keyboard, I kinda-sorta make use of Emacs’ Org-Mode. – In the hands of a power user, Org-Mode is something to behold. I am not such a user.].

Physical pen & paper is king, though. (or queen, whatever).
Pens, use colours: e.g. black for thoughts, green for questions, red for “wtf, shit’s not working”, blue for facts (& purple for ‘todo’, if you’ve a purple pen). YMMV, but I find this useful. (esp. if you’re using one of those Bic 4-in-one); compared to monocolour, I reckon 1. helps thinking (like de Bono’s thinking hats?) 2. quicker to ‘parse’ the page 3. can write much more densely on the paper.
Overall, I keep a ‘programming journal’. I’ve heard “if you’re not writing, you’re not thinking”. – “how many pages I filled” makes for quite a nice metric for “how much work you’ve done”. (e.g. -if- when you encounter bugs, or other problems, you’re going to write down that you had this trouble, and alternatives as to how to overcome it.). – As an example, the “CType from arbitrary expression” system was 10 double-sided refill pages worth of work; the loop visualisation was 4 double-sided refill pages.

For this project, I kept a folder full of notes. It’d have things like:

– Point is, printed out, you can write on these, highlight things and in the margins ask questions. (“marginal notes” or “marginal questions”, as such).


Newer post Older post