When considering a minimal implementation, one often has to choose between the easiest thing to parse (i.e. something like b8) and the easiest thing for human comprehension (i.e. copy-to-EAX). Why is b8/ in there? Are you using b8 + stop-at-whitespace for easy parsing while leaving ignored characters in between only for readability? And were there any drawbacks when you considered just dropping the b8/'s and such?
"(SubX doesn't support floating-point registers yet. Intel processors support an 8-bit mode, 16-bit mode and 64-bit mode. SubX will never support them. There are other flags. SubX will never support them. There are also many more instructions that SubX will never support.)"
You also might want to say this near the beginning, since it's an introductory statement. You also don't really say why. I know minimal implementation is a goal because you're bootstrapping, but that isn't mentioned in the intro. This might give newcomers the wrong idea, though the intro still mostly gives them the right idea.
"The Intel processor manual is the final source of truth on the x86 instruction set, but it can be forbidding to make sense of, so here's a quick orientation. "
"The good news is that internalizing the next 500 words will give you a significantly deeper understanding of your computer."
The ultimate value of SubX might be eliminating the need to learn x86 the hard way. I had one article describing a few instructions that I was thinking of cheating with. I was also thinking of going straight to ARM on a Pi or MIPS on a router, since their subsets are simpler and more flexible (e.g. stack vs register vs whatever). SubX brings x86 complexity down to their level.
The only drawback is that all my stuff is x86_64. A SubX for it with the same instruction coverage might be useful.
"Here's a sample of what a trace looks like"
That looks neat and very helpful. Mainly for debugging, but potentially for high-assurance tools too: they have to prove the object code does what the source says, and traces with equivalence proofs and tests are one method.
Your choice of primitives is well thought out, too. For bootstrapping, I identified scalar variables, arrays, trees (or just linked lists), a while statement, and the ability to read/write a file. An interpreter for just those, hand-done in x86, should be able to bootstrap all the low-level tools one step up. Modifying the interpreter to output rather than execute the assembly creates a primitive compiler. The one thing I'm unsure on, which I didn't see in yours (maybe not applicable), was structs. They were the one change in C that let UNIX be ported from assembly.
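To make that primitive set concrete, here's a hypothetical sketch (in Python for brevity, not anyone's actual bootstrap code) of a tree-walking interpreter with just scalar variables, arrays, a while statement, and file writes. Programs are nested tuples, so no real parser is needed; trees/linked lists are left out to keep it short.

```python
# Minimal interpreter sketch: scalars, arrays, while, file output.
# Statements are tuples; a hand-written x86 version of this loop
# would be the bootstrap interpreter the comment describes.

def run(prog, env):
    for stmt in prog:
        op = stmt[0]
        if op == "set":                       # ("set", name, expr)
            env[stmt[1]] = ev(stmt[2], env)
        elif op == "array":                   # ("array", name, size)
            env[stmt[1]] = [0] * ev(stmt[2], env)
        elif op == "store":                   # ("store", name, idx, expr)
            env[stmt[1]][ev(stmt[2], env)] = ev(stmt[3], env)
        elif op == "while":                   # ("while", cond, body)
            while ev(stmt[1], env):
                run(stmt[2], env)
        elif op == "write":                   # ("write", filename, expr)
            with open(stmt[1], "a") as f:
                f.write(str(ev(stmt[2], env)) + "\n")

def ev(e, env):
    if isinstance(e, int):
        return e                              # literal
    if isinstance(e, str):
        return env[e]                         # variable reference
    op, a, b = e                              # ("+", a, b) etc.
    x, y = ev(a, env), ev(b, env)
    return {"+": x + y, "-": x - y, "<": int(x < y)}[op]

# sum 0..4 using only these primitives
env = {}
run([("set", "i", 0), ("set", "total", 0),
     ("while", ("<", "i", 5),
      [("set", "total", ("+", "total", "i")),
       ("set", "i", ("+", "i", 1))])], env)
print(env["total"])  # → 10
```

Swapping the statement handlers for code that emits assembly instead of executing it is the "primitive compiler" step mentioned above.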
Another thing from a year ago that I don't know if I brought up: people are starting with TCC partly because it's so small. That makes the amount of C we need to handle small, or, if we rewrite it into a primitive language (e.g. for the interpreter), there's less labor. That assumes we do everything by hand. The amount of code in TCC or the C-based GCC, especially legacy code not made for us, might argue that a tool-assisted approach isn't really cheating so much as basic economics.
The idea I had was to use an untrusted tool for easy rewrites of C code. I've seen lots of tools used for compilers or static analyzers, but an easy one is an unknown for now. Quicker and more interesting than tedious recoding, eh? The developer creates a mapping from whatever C statements and structuring are in GCC to the primitive language. The tool transforms GCC into that, with either line-by-line diffs or before-and-after shots with explanations of each change. Verifiers review however much they want to make the generated code trusted. They run the primitive interpreter on the trusted code to bootstrap GCC.
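A toy version of that rewrite tool might look like this (hypothetical rules, regex-level rather than a real C parser): a small table of pattern-to-replacement mappings, with each change printed as a before/after diff for a verifier to review.

```python
# Untrusted rewrite tool sketch: map C idioms to a primitive
# while-only language, emitting before/after lines for review.
import re

RULES = [
    # for (init; cond; step) -> init; while (cond), step noted for the body
    (re.compile(r"for \((.*?); (.*?); (.*?)\)"),
     r"\1; while (\2)  /* step: \3 */"),
    # compound assignment -> plain assignment
    (re.compile(r"(\w+) \+= (\w+)"), r"\1 = \1 + \2"),
]

def rewrite(line):
    out = line
    for pat, repl in RULES:
        out = pat.sub(repl, out)
    return out

for line in ["for (i = 0; i < n; i++)", "total += i;"]:
    after = rewrite(line)
    if after != line:
        print("- " + line)   # before
        print("+ " + after)  # after, for the verifier to approve
```

A real version would work over parse trees (the term-rewriting idea in the next comment), but the review workflow, diff out, human approves, is the same.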
It might save a lot of time on top of eliminating the need for TCC. Additionally, the tech might be reused for other projects. Maybe use a term-rewriting language combined with existing tech for producing the parse trees for C.
What do you think of generating simpler versions of large compilers for hand-coded interpreters? Also, people not verifying the code might just verify the parser and translation rules. An ancient GCC backdoor probably wouldn't anticipate those components' code.
> When considering minimal implementation, one will often try to choose between the easiest thing to parse ('b8') or easiest thing for human comprehension ('copy-to-EAX'). Why is 'b8/' in there? Are you using 'b8' + stop at whitespace for easy parsing while leaving ignored characters in between only for readability? And were there any drawbacks when you considered just dropping the b8/'s and such?
You kinda deduced the reason. I'm trying to walk a fine line between keeping the implementation small and keeping the interface at least in some realm of ergonomic. This isn't a notation to bootstrap as quickly as possible and then treat like a red-headed stepchild. I want it to be habitable all by itself. A uniform but strict syntax with lots of room for free-form comments is the best approach I've found so far. The nice thing about restricting readability concerns to comments: error handling stays simple. And error handling is often where a compiler spends a lot of LoC (and still ends up with a highly sub-optimal result).
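The tokenizing scheme this implies can be sketched in a few lines (a toy illustration, not SubX's actual loader): tokens end at whitespace, and everything after the first '/' in a token is human-readable metadata that gets ignored.

```python
# Toy version of the "b8 + stop at whitespace" parsing rule:
# keep the hex payload before '/', ignore the readable suffix.

def parse(line):
    out = []
    for token in line.split():
        payload = token.split("/", 1)[0]   # drop metadata after '/'
        out.append(int(payload, 16))
    return out

# the same instruction, with and without comments, loads identically
assert parse("b8/copy-to-EAX 2a/imm32") == parse("b8 2a")
print(parse("b8/copy-to-EAX 2a/imm32"))  # → [184, 42]
```

Error handling stays a single `int(payload, 16)` failure path, which is the point made above: the readability lives entirely in ignored characters.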
> The one thing I'm unsure on, which I didn't see in yours (maybe not applicable) was structs. They were the one change in C that let UNIX be ported from assembly.
I'm already starting to think about type definitions once I finish bootstrapping the SubX implementation. Definitely the #1 high-level feature I want. The plan is to grow a small notation that occupies the same niche as C, but with heavy emphasis on keeping the implementation small and transparent, and zero emphasis on portability. That should eliminate C's problems of undefined behavior.
It may look a lot like C, but that's still open. I think having it look different may help reinforce the right expectations: it's not going to be C, there's never going to be any sort of bug compatibility. I'm also planning to avoid complex nested syntax for arithmetic or pointers. I'd like to preserve a 1-1 correspondence with machine code as long as possible. Makes error messages much easier to design.
Your idea of showing before/after is awesome. I'm never going to do it in the context of C :) but I am going to totally steal it at some point.
Brings me to another point. We focus a lot on how much code something takes and how readable it is. The alternative is generating boilerplate-like code which isn't readable, but whose declaration and generator are. Think grammars fed through parser generators. I was thinking the extra lines of code might be worth it for automating away the parsing and error-handling parts with DSLs. I mean, does it really matter if those parts aren't readable, if we know they're just tedious bullshit that readable code generated, in a structure you understand the reasons for?
Maybe it's just personal preference here. I'm taking a generative approach, like Kay/STEPS did, for anything tedious wherever possible. That lets me dodge the overhead of error handling and...
The main advantage of them is that they reduce the cognitive overhead of juggling nearly meaningless details. You even mention understanding as a goal of SubX. Seems like it's still worth more exploration. Maybe it's a declaration that has the offsets etc., your interpreter saves it, and you have create/set/get/delete syntax that gets translated into simpler stuff by the interpreter. Basic rewrite rules. Back when I did it, I just added fields to the struct name with dot syntax, as strings, in translation for the output code or traces. So your syntax might let you do a one-liner. The executable code and traces will show the full thing it's doing, with a comment on top containing the mapping. Maybe it shows you that on the first run, as you write the program, so you can check it while everything is still fresh in mind. Like REPLs.
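As a concrete (and entirely hypothetical) sketch of that rewrite-rule idea: a struct declaration records field offsets, and a translation pass turns `name.field` accesses into base+offset accesses, leaving the mapping as a comment in the generated output, just as described above.

```python
# Struct-as-rewrite-rule sketch: declarations carry offsets, and
# dot syntax is rewritten into address arithmetic with the mapping
# preserved as a trailing comment for traces/review.

STRUCTS = {"point": {"x": 0, "y": 4}}    # field -> byte offset
VARS = {"p": ("point", 0x1000)}          # var -> (type, base address)

def rewrite_access(expr):
    var, field = expr.split(".")
    typ, base = VARS[var]
    off = STRUCTS[typ][field]
    return "*(0x%x + %d)  # %s.%s -> %s+%d" % (base, off, var, field, typ, off)

print(rewrite_access("p.y"))  # → *(0x1000 + 4)  # p.y -> point+4
```

The names and layout here are invented for illustration; the point is just that the "one-liner" declaration and the fully expanded output can coexist, with the comment carrying the mapping.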
You seem mostly to be doing ASM here. I get why. Maybe consider allowing one small layer of indirection with a still-clear mapping. Hyde's HLA is inspirational here: if you want simple asm, you can get it; if you want understandable code (!= asm), you can selectively use the "high-level" stuff, which still isn't high-level like C++ or Java. Obviously, yours would be simpler than Hyde's.
That's different: it will stop most undefined behavior. It also helps that there's just one implementation.
re type definitions
Definitely look at Cyclone and Typed Assembly for ideas on attaching different semantic meanings to low-level things like pointers. Ada's combo of type and bit representation is worth trying. Maybe look at whatever the Zig guy is doing, too, since he's aiming for safety and simplicity. If you want to get wild, you can throw a whole Prolog interpreter in there like the Shen guy. I don't suggest that.
Although clueless about them, I do notice similarities to basic pattern matching and rewrite rules when I look at type systems in papers. That's also what powers macros and HLL-to-LLL conversions. There could be a set of primitives you could adapt to handle them all by just attaching the context somehow. Maybe also with templating, where the primitive expresses the core idea but is adapted to setters/getters, bit size, and ranges, pulled right out of the type/bit definitions.
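One way that templating idea might look (a hypothetical sketch, loosely in the spirit of Ada representation clauses): a single masked read/modify/write primitive, specialized into getters/setters, with the range check falling straight out of the declared bit width.

```python
# Templated accessors from a type/bit definition: one primitive
# (shift + mask), adapted per field; ranges come from the widths.

FIELDS = {"flags": (0, 8), "kind": (8, 4), "size": (12, 12)}  # (bit offset, width)

def make_accessors(name):
    shift, width = FIELDS[name]
    mask = (1 << width) - 1

    def get(word):
        return (word >> shift) & mask

    def set_(word, value):
        if not 0 <= value <= mask:        # range check derived from width
            raise ValueError("%s out of range" % name)
        return (word & ~(mask << shift)) | (value << shift)

    return get, set_

get_kind, set_kind = make_accessors("kind")
w = set_kind(0, 0b1010)
print(get_kind(w))  # → 10
```

The field names and layout are invented for illustration; the point is that getters, setters, and range checks are all instances of one primitive plus the declaration.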
re side by side
Glad you love it. Look forward to seeing what use you come up with.
> does it really matter if those parts aren't readable, if we know they're just tedious bullshit that readable code generated, in a structure you understand the reasons for?
Corollary to "all abstractions are leaky": No boilerplate is ever entirely bullshit.
Don't get me wrong, DSLs can be useful (provided somebody's watching the big picture to ensure there aren't multiple notations for the same niche). But I'm still operating at too low a level for them. I'm digging myself back out to the sun as fast as I can :)
> You seem mostly to be doing ASM here. I get why. Maybe consider allowing one, small layer of indirection with still clear mapping.
Like I said, I will definitely have a notation for structs. But the implementation of the notation for structs is not going to itself use the notation for structs. That kind of circular dependency is hell on global comprehension of the big picture. The local convenience of having offset names is far too small a benefit to outweigh the global issues.
I won't be doing ASM forever. But as I add layers of notation, each implementation will only be allowed to use earlier notations. Notations will have a strict dependency ordering.