Notes on Building a Live Coding Tool for C99
Tags: fyp, programming.scala, programming.c, programming.vim, presentation
I’ve been invited to talk at NUSHackers Friday Hacks on October 9th, about my
FYP of building a ‘worksheet’/‘live coding’ environment for C.
This post is to note down anything pertinent / thinking-out-loud as to what to
say.
This is a draft. Here is a link to the presentation
draft.
(Obviously, the colours need to change.).
What is my FYP?
Let’s have a GIF explain that:
(Here is the same gif, slower speed).
i.e. show the results of the computation directly beside the C program itself,
inside the editor. (This saves having to quit the editor, type gcc <whatever>.c
, ./a.out
, then try and figure out what’s going on).
Another way to describe this as a ‘REPL on steroids’.
You might be aware of other such tools around:
- Scala IDE’s “worksheets” were the ‘inspiration’ for this. (EclipseFP copies this idea with what it calls a ‘worksheet’).
- Apple’s Playground does a similar kindof thing for Objective-C and Swift.
- LightTable (whose demo was very impressive, but the $300k KickStarter hasn’t given us the complete thing yet) is big on doing this for Clojure, and to a limited extent Python.
- IPython notebooks are kinda-sorta similar.
- ‘Alive’ is bringing the same thing for C# in Visual Studio.
- jcartledge/sublime-worksheet is an example of an editor plugin which provides a ‘worksheet’ interface for any REPL language. (Easy to see ‘REPL on steroids’ here).
- Apparently TextMate can show results for Ruby.
- TMux + Vim, with something like slimux (‘send output to TMux shell’) kinda-sorta helps out with this, too. (‘slimux’ being inspired by Emacs’ SLIME).
Loosely related is also, e.g. GDB/LLDB. A more sober way to describe my plugin is “auto-generated printf debugging”. – GDB/debuggers provide ‘inspect the value at some point in execution’ and ‘can easily evaluate expressions’ functionality. (GDB is known to be easy-to-use, widely used by the CS1010 kids, etc.).
My project on GitHub:
rgoulter/c-worksheet-instrumentor
(the main logic), and
rgoulter/c-worksheet.vim (a
plugin for Vim).
(It’s got a handful of stars on GitHub. You just know I’m amazing at
publicity.).
Goals for Presentation (Meta)
Hopefully, (esp. if you play around with it yourself), you’ll think that’s
kinda cool.
I’ll know if I did a good job with my talk if you walk out thinking “that’s it,
huh? I was expecting something complicated.” / “I could do that, chump.”.
(I’ll’ve done an even better job if some of you go out and try and make this
kindof stuff).
It’s kinda lame, for me, to say “hey, what I did for my FYP is quite simple”. (“If you can’t explain it simply, you don’t understand” / “it’s only simple if we simplify out the details” aside). The fortunate thing for this is that C has relatively few features. (e.g. No closures / functions within functions). But it’s not a sit-down-and-finish-it-in-an-hour kindof project. It is a lot of fun. (How many kids continue working on their FYP after they’ve presented in their free time?). You’ll also get very good at the language; I got somewhat familiar with the C99 language standard, and certainly stumbled upon the more esoteric aspects of (simple) C99, my favourite being:
int isWhitespace[256] = { [' '] = 1, ['\t'] = 1, ['\n'] = 1 };
['a']; //> 0
isWhitespace[' ']; //> 1
isWhitespace['\t']; //> 1 isWhitespace
“What’s In It For You” / Motivation For Caring About the Presentation
Something like ‘Example of a cool Programming-Language Application project you can make, once you have knowledge of how to make a parser & roughly you could implement it’, esp. if the inspiration is ‘I want that for my favourite language’.
On the other hand, PL applications needn’t be like this, and can be more modest, like, ‘organise imports’ or ‘indent properly’ or so.
Parsers and PL Applications
– It can become clear by the terminology, the way someone talks, how much they know or don’t know about a subject. “Parsers” is one of these areas I don’t know all that much about. (I never took the compilers/parser course.).
But in terms of “writing a worksheetify program”, roughly formula for this is:
- Parse the source code.
- ???
- Profit.
(Which applies to writing any PL application in general.).
There are different ways to do it. (And of course, there are trade-offs for
different approaches.).
But a key point I learned is, if you want to show results within functions / nested scopes / control structures, it’s not enough to just pass it off to a REPL. You must be able to parse the source code.
Making Use of the AST
In situations where you’re writing programs whose data is other programs, you’re going to want to make use of techniques a level abstracted from directly traversing the AST.
ANTLR (see below) provides two ways to handle this: Tree Listeners, Tree
Visitors.
With Tree Listeners, you implement listener methods which get called as the AST
is walked.
With Tree Visitors, you implement methods which then control which parts of the
tree are ‘visited’; the intent is to evaluate the AST to some value. (e.g.
Numeric expression to a number, like in the simple math expression grammar
example).
– The Tree Visitor is kindof more FP-ish, except having to explicitly call
‘visit’ on the children nodes of the AST is also a bit tedious. - This can be
abstracted another level, where you have e.g. AST -> Option[T]
, and some
combine :: [T] -> T
, then it’s possible to write a general fold_ast :: AST -> (AST -> Option[T]) -> ([T] -> T) -> T
function for the AST. (The AST -> Option[T]
means “maybe extract value from this AST node; if this node returns
Some thing, use that, otherwise keep looking.”.). - Manipulation of AST could
be done in a similar way.
Anyway. The dynamic (in ANTLR) is usually that the Tree Listeners can be used to ‘grab what you want’ out of the AST, Tree Visitor is better at ‘evaluating’ some part of the AST to a value.
My Choice of Parser / Technique
For the ‘parser’, I made use of ANTLR parser-generator (like flex/yacc; just give it a BNF-esque grammar, and it generates the lexer/parser for you). Mostly because this can generate a parser in Java, and I was using Scala for the project (as specified by FYP description – Though I think since FP/‘strongly-typed’ languages are kinda nicer to write parsers in, I think the hope was I’d hand-write the parser..)
(See my adapted C.g4).
Making use of gccxml, or an LLVM/Clang pass might’ve also had it’s advantages,
‘but’, then I’d have to write in C++ (for the latter), it makes the executable
size for my program much larger (which is fine if your plugin is like
YouCompleteMe), and I more/less wanted to keep the approach ‘general’, rather
than tied to a specific language.
[Note: a disadvantage of not using gcc/clang is that there may be divergence
between what they parse and what I parse; I mitigate this in my project by
first compiling the given input C file. i.e. anything I can parse, they can
parse (better?)].
In terms of “in my case”:
One approach might be to just pass the code of to some C interpreter anyway.
(Cling is a C++ interpreter developed by the folk at CERN, a successor of an
older CInt (C Interpreter) project of theirs. ‘Cling’ because it’s built upon
LLVM/Clang.). This’d require potentially renaming variables so that everything
is unique, and ‘unfolding’ control structures (like loops, etc.). – Quite a
messy way of doing things.
Another approach (which I never investigated) would be to generate a sequence
of commands to give to GDB/LLDB so as to get the information (i.e. values of
variables at some point in execution). (I did investigate using GDB/LLDB
insofar as ‘will this let me avoid having to learn about the PL stuff?’; it’s
not suffice because you still need to know for a statement e.g. x = y = 4
what information you’d like to get from GDB/LLDB). – This isn’t necessarily
a bad approach.
Instrumenting
The approach I did take was to ‘instrument’ (or ‘augment’) the code.
i.e. insert statements into the C program so that I can run the program to get
the output.
Roughly like:
int main() {
int x;
= 3;
x }
becomes
int main() {
int x;
= 3;
x (x, STRING_OF(3))
OUTPUT}
and as-if compiled with a pre-processor which handles OUTPUT
, STRING_OF
.
OUTPUT
outputs in a format my Scala program intercepts. (I just use STDOUT
;
it could also be done using, say, sockets, or using something like Apache
Thrift / or some other IPC/RPC framework – With C, this is a bit annoying,
e.g. becomes more difficult to compile on Windows. [My tool, & the Vim plugin,
work on Windows, btw]).
The pseudocode for the instrumentor is more/less:
def instrument(source):
ast <- parse source
types <- ctypes from declarations in AST
for each declaration:
insert(output declaration.toEnglish)
for each expression statement:
// n.b. assignments are expressions
insert(output declaration.toString)
Aside from ANTLR-specific stuff (e.g. ‘inserting’ into a token stream), the key challenge here is getting the ctype from a declaration. Furthermore, you need to be able to get the ctype of an expression, and from the ctype be able to generate ‘toString’ output for it.
Considerations Made When Doing This For C
Specifically for C, my program basically outputs for declarations (like cdecl), and outputs values for expression statements.
Representing C Kinds
A C variable/expression will be of some type: * a primitive type (e.g. int, float, char), * a pointer to some C type, * an array of some C type, * a struct/union (consisting of a sequence of identifiers, each of some C type) type, * an enum type, * a function (consisting of a return value of some C type, and a sequence of parameters each of some C type).
If you can see how the above might be represented as an ‘Algebraic Data Type’, congratulations. (If not, you probably haven’t heard of ADTs.).
e.g. in ML languages like OCaml/Haskell, you might write it something like:
type CType = Primitive of String * String
| Pointer of String * CType
| Array of String * CType
| Struct of String * String * ((String, CType) List)
| Function of String * CType * ((String, CType) List)
In Scala, the code (simplified) is something like:
sealed trait CType;
case class PrimitiveType(id: String, type: String) extends CType;
case class PointerType(id: String, to: CType) extends CType;
case class ArrayType(id: String, of: CType) extends CType;
case class StructType(id: String, tag: String, members: List[(String, CType)]) extends CType;
case class FunctionType(id: String, rtnType: CType, params: List[(String, CType)]) extends CType;
(See CType.scala).
In Java … ha ha ha, just kidding. (Just because you can, doesn’t mean you should.).
Making use of this ADT
Again, often in FP programming, once you have the types, the rest is fairly
straight-forward. In my case:
* I want this ‘CType’ from the AST. (kinda because C99 doesn’t have
typeof(x)
).
* I want to output e.g. string-of some ‘CType’. (because C99 doesn’t have
stringof(x)
).
Declarations
The devil is in the detail, but (in terms of ANTLR): * You can use a TreeVisitor to evaluate a declaration statement to a list of CType(s). * You can then use a TreeListener to traverse the AST and pick up the declarations.
The main complication here is that the listener has to care about scope.
There’s also things like forward declarations, types declared in header files,
& typedefs.
It can be a bit tedious to figure this out, since you have to care about all
occurrences of whatever AST node you’re working with. e.g. declaration
can
appear in the global scope, outside of functions; the identifiers of a struct’s
members will look the same as a ‘normal’ declaration, etc.
– What helps here is comparing the ASTs of declarations which are quite
similar. int *(*a)[5];
, int **b[5];
, int (**c)[5]
have similar-ish trees,
but very different semantics.
The ADTs make it easier
In case the above diagram doesn’t make that obvious, then it’s roughly “collect
the type specifiers and any point-to-(type-specifier) info; this is passed to
the declarator as the AST is traversed. Something something recursion, and
that’s how simple it is.
Again, roughly, since each alternative for the declarator corresponds to some
alternative of the above ADT, it does help write the program logic.
CDecl
The actual cdecl (e.g. at cdecl.org, or as a programming exercise in the
“Expert C Programming” book) is more/less a heuristic, doesn’t actually take
care of all C declarations.
But going from (String ->) AST -> CType, it’s not so hard to go CType -> String
to ‘explain’ what a C declaration is.
Figuring it out takes the same kindof logic as figuring out how to do AST ->
CType stuff.
Expressions
This is more/less straightforward, following the CType ADT.
In particular, I use StringTemplate as a template rendering engine. (Like
Python’s Jinja). So you have a string_of_primitive
, string_of_pointer
,
string_of_array
, etc. etc. and pass values to StringTemplate such that it can
recursively generate the C code for you.
The ADTs make it easier
Since what’s instrumented is C code, the goal is to come up with C code which constructs a String for some ctype. Again, since it’s recursive, then the code to generate this C code is going to involve recursion. So it’s just a matter of writing code to construct a string for a primitive type, and array of ctype, etc.
See constructs.stg See
Instrumenting: Putting it All Together
I don’t reconstruct the AST directly, but just inject the instrumenting C code into a TokenStreamRewriter (one of ANTLR’s things, for a stream of the lexed tokens).
Cool stuff: Segfault Handling
That’s all ‘nice’, but something I’m quite proud of (& in terms of ‘useful for
a C beginner’) is it catches SEGFAULT
s, and marks the output there:
int *px; //> px is pointer to int
= 0; //> px = (nil)
px *px = 5; //> SEGFAULT
‘How’ involves a bit of knowledge about C99:
setjmp
/longjmp
are kinda like aGOTO
statement, can be used to simulate exception handling in C.signal
let’s you ‘catch’ thrown signals.
So, to catch a segfault, you instrument each expression-statement (includes
assignments & function calls, excludes declaration statements) with the
setjmp
(like try { }
); if a SEGFAULT
signal is thrown, catch that in
signal
, and then use longjmp
(like throw ..
); in the place you jump to
(like catch { }
), output SEGFAULT
to STDOUT, then exit.
– Of course, the instrumented C code then becomes something no sane human
would ever write.
Cool stuff: Type Inference on Arbitrary C Expressions
Being able to infer the type of arbitrary expressions e.g. 4*12
or p[q]
enhances the usability of the tool/plugin; the use case becomes “want to show
the value of an expression at some point? Simply write that expression.”.
The Type Inference takes a bit of thought, e.g. in p[q]
, it can either be an
array lookup, or a pointer offset.
But, again, while the logic is somewhat tedious, the code to do this inference
straightforwardly follows the grammar/AST for expressions; and I’m not sure how
you’d do that without an ADT like the one above.
Limitations
As discussed on the repository:
No I don’t do squat with regards to
malloc
, etc. (Particularly since there’re other ways to allocate memory, e.g.calloc
,alloca
, not to mention custom ones likexalloc
.).Obviously I don’t do squat with regards to concurrency, threads, etc.
The tool doesn’t necessarily play nicely with C PreProcessor stuff.
It’s easy to come up with examples which show how tricky it could be to deal with that.
‘Contributors Wanted’
If you find any bugs, I’d be delighted for you to file an issue.
– Keep in mind, though, that the Scala Worksheets tool doesn’t work when the
program uses semicolons to delimit statements. - So I think I’ve not done a bad
job.
– For selfish reasons, of course I’d love to see plugins for other editors
like Emacs, Sublime Text, Atom, LightTable, etc. (and it’s a bit brutal to try
and learn the APIs if you don’t use the editors).
The vim plugin is fairly lightweight, since all it needs to do is communicate
over a socket to the Scala server.
– Certainly it’d be interesting to see, in terms of UX, how the tool / plugin
could be improved.
– Originally it was supposed to be an Eclipse plugin.
I scrapped that since dealing with dependency-jars within OSGi is not
straight-forward to the uninitiated. - Particularly frustrating is that the
plugin would work in the dev environment, but wouldn’t work when exported (due
to e.g. classpath issues.).
– Either dealing with OSGi stuff, or just imitating the ‘lightweight’ approach
of tool-as-external-dependency would fix that.
But of course, I’d be much happier to see people try and do some PL project with the language of their choice. (Doesn’t have to be a worksheet.).
Now That I Have Your Attention - Project Note Taking
I think it’s fair to say no particular bit there is all that ‘hard’, but I can
see that altogether it’s not exactly trivial.
FWIW, I had like half a year to make it. (Half, because who does most of their
work in the first half of their FYP?, and because I didn’t know squat about PL
in the first half..).
I really like tools. (This doesn’t really sound like a ‘cool’ FYP to do unless you’ve at least some love for tools.). – e.g. Not ‘vim vs emacs’ so much as ‘your editor should be able to do these things’.
For note-taking, I kinda-sorta made use of OneNote.
Briefly, OneNote is nice for collecting details/links on a topic (it has a nice
hierarchy of sections -> pages -> bullet points). [These days, if I’m thinking
hierarchically with a keyboard, I kinda-sorta make use of Emacs’ Org-Mode. –
In the hands of a power user, Org-Mode is something to behold. I am not such a
user.].
Physical pen & paper is king, though. (or queen, whatever).
Pens, use colours: e.g. black for thoughts, green for questions, red for “wtf,
shit’s not working”, blue for facts (& purple for ‘todo’, if you’ve a purple
pen). YMMV, but I find this useful. (esp. if you’re using one of those Bic
4-in-one); compared to monocolour, I reckon 1. helps thinking (like de Bono’s
thinking hats?) 2. quicker to ‘parse’ the page 3. can write much more densely
on the paper.
Overall, I keep a ‘programming journal’. I’ve heard “if you’re not writing,
you’re not thinking”. – “how many pages I filled” makes for quite a nice
metric for “how much work you’ve done”. (e.g. -if- when you encounter bugs, or
other problems, you’re going to write down that you had this trouble, and
alternatives as to how to overcome it.). – As an example, the “CType from
arbitrary expression” system was 10 double-sided refill pages worth of work;
the loop visualisation was 4 double-sided refill pages.
For this project, I kept a folder full of notes. It’d have things like:
- Example C program printouts (so, e.g. figuring out what kindof instrumenting would be needed),
- Printout of the C BNF, (both ANTLR’s file, and from the C99 standard).
- Reference sheets for GDB/LLDB, Thrift, etc.
– Point is, printed out, you can write on these, highlight things and in the margins ask questions. (“marginal notes” or “marginal questions”, as such).