David, I'm glad you're self-aware about your uncertainties up top. I think your inclination to elevate formal methods over other approaches is throwing the baby out with the bathwater.
"Agile sacrifices long term vision for short term gains and TDD optimizes for writing more code to achieve correctness.. Bugs are correlated with lines of code and TDD forces writing more code so how can it reduce bug counts?"
Tests help gain control over programs (I avoid using words like "correctness") by creating redundancy. Yes, it's more code, but it's in a distinct program. Conceptually every test case is a unique program. It takes some skill to ensure that you aren't just repeating your production logic in your test, but not very much. By computing the same thing in two different ways, you increase the odds that the program does what you want (in some scenario). Because for it to fail, you'd now have to have bugs in both code and test -- and the bugs would have to have the same effect.
Every new test you write slashes this probability further. In this way a series of tests cumulatively help pin down your program's precise 'shape' so that it is exactly where it needs to be in 'program space'.
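To make the redundancy point concrete, here's a toy sketch (my own example, with names I made up): the test recomputes the expected answer by an unrelated route instead of mirroring the production loop.

```python
# Production code: computes a sum with an accumulator loop.
def total(xs):
    acc = 0
    for x in xs:
        acc += x
    return acc

# The test recomputes the expected answer via Gauss's closed form,
# not by repeating the loop. For a bug to slip through, both the
# loop and the formula would have to be wrong in the same way.
def test_total_of_first_n():
    n = 100
    assert total(range(1, n + 1)) == n * (n + 1) // 2

test_total_of_first_n()
```

Two genuinely different computations of the same value; agreement between them is the redundancy doing the work.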
Interestingly, Agile works the same way. Pairing is a way for two independent people to work towards agreement that a program does what they want. Sprints and demos help the programmers and end users agree on what they want. In each case, redundancy helps manage degrees of freedom and pin down details precisely.
Formal methods work the same way. An independent line of reasoning helps gain confidence that a program does what you want. In Cleanroom the independent line of reasoning is in the comments that accompany each line. In design by contract the redundancy comes from pre- and post-conditions, and the assertions sprinkled at intermediate points. In Dijkstra's discipline of programming the redundancy comes from clearly articulating invariants for loops and deriving weakest preconditions backwards from the desired goal to the input. And so on and so forth.
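A minimal design-by-contract-flavored sketch (mine, not drawn from any of the systems named above): precondition, loop invariant, and postcondition all written as executable assertions alongside the code they check.

```python
def isqrt(n):
    """Integer square root, with its contract spelled out as assertions."""
    assert isinstance(n, int) and n >= 0           # precondition
    r = 0
    while (r + 1) * (r + 1) <= n:
        r += 1
        assert r * r <= n                          # loop invariant: r stays a valid lower bound
    assert r * r <= n < (r + 1) * (r + 1)          # postcondition
    return r
```

The assertions are the second, independent line of reasoning: the loop computes the answer, the contract states what the answer must satisfy.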
None of these techniques is foolproof. All of them rely on redundancy and the increased confidence that independent techniques are unlikely to have holes in exactly the same places.
Math works the same way. Proofs aren't magically verified. Mathematicians come up with proofs and share them with other mathematicians to gain confidence in them. Redundancy. Yes, they're often much more rigorous than the things we programmers do, but they are also more narrowly applicable. Each new proof technique is precious and helps prove a few more theorems. Then they have to go back to the drawing board and think up something new.
Incidentally, the reason I avoid the word "correctness" is precisely this fact, that everything we humans can conceive of can mislead us, down to our very senses. It seems more rigorous to talk about your narrow desire in the context of a single scenario. Is this behavior in this situation what you want? If it isn't, it may be what somebody else wants. I prefer to use the adjectives "correct" and "incorrect" in narrow situations rather than the generalization of "correctness".
I'm not opposed to formal methods. See the thread started at https://mastodon.social/@akkartik/101476051170905653 for some of my attempts to grapple with what they are good for. I'm sure you understand more about them than I do. But the alternatives also have value.
Thanks for the thoughtful response. The redundancy angle is something I will add to my list of arguments for why we want different/divergent ways of tackling a problem. Along with redundancy I think we also want decoupling and orthogonality when pinning down the shape of the problem. In that sense I think tests and agile methods are inadequate, because they're non-orthogonal ways of addressing correctness and, more generally, maintenance and evolution. Using your probability analogy, we don't get multiplicative effects because our methods are not independent, and I think in practice there is too much overlap between tests to pin down the space of correct behavior in most large codebases.
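The independence point in arithmetic (illustrative numbers only, not measurements):

```python
# Suppose each check misses a given bug with probability p = 0.1.
p = 0.1

# Three genuinely independent checks all miss it only if every one does:
independent_miss = p ** 3    # misses compound multiplicatively

# Three near-copies of the same check are no better than one:
overlapping_miss = p         # correlated checks share the same blind spot
```

The multiplicative payoff only materializes when the checks can fail independently; overlapping tests leave the miss rate roughly where it started.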
It's unclear to me how much value tests really provide after having built a CI system to run 20k+ of them 100+ times a day just so the engineering organization could keep shipping code. The tests at that point were akin to ritual sacrifice, more about maintaining the illusion of agility than actually being agile. I really want the TDD advocates to specify how many tests are too many, and at what point tests should be removed to maintain the sanity of the feedback loop.
I've been in situations that seem similar to the ones you describe, and "how many tests are too many" feels like the wrong question to ask there. The number of tests isn't usually the real bottleneck in these situations. It's one or more of:
a) Programmers writing software for somebody else, and not having much intrinsic motivation for what they work on.
b) Unreasonable external demands from their schedules.
c) Unreasonable external demands for reuse. Not noticing that requirements have changed enough to warrant starting from scratch rather than trying to reuse the ball of mud created so far.
d) Unreasonably rapid churn in personnel. Constantly shifting and changing the people responsible for a program, moving headcount first here, then there.
In such circumstances I suspect your formal methods wouldn't work either. They seem rosy because so far they're only used in situations of intrinsic motivation. Bring them to the gutter, and they'll get soiled as well.
Reliability requires optimizing for a sense of ownership. Many environments create selection pressure for other things besides sense of ownership. Best not to over-train on them when thinking about reliability.
Hmm, there may be value in an anti-coverage tool. Instead of telling me lines of code that have no tests touching them, tell me lines of code that have too many tests touching them. Requires a counter at each branch rather than a simple boolean.
On the other hand, this may give false positives. What you really want is to find redundant paths through the whole program. Two tests that hit one line may be fine if they diverged earlier or later. That's a harder problem. There are some good tricks for managing counters for whole-program paths at https://dl.acm.org/citation.cfm?id=301678.
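A rough sketch of the per-line counter idea in Python (hypothetical tool, all naming mine; a real version would count whole-program paths, as the linked paper does, rather than single lines):

```python
import sys
from collections import Counter

hits = Counter()  # (filename, line) -> execution count, not a yes/no boolean

def tracer(frame, event, arg):
    if event == "line":
        hits[(frame.f_code.co_filename, frame.f_lineno)] += 1
    return tracer  # keep receiving 'line' events for this frame

def run_counted(test_fn):
    """Run one test under the counting tracer."""
    sys.settrace(tracer)
    try:
        test_fn()
    finally:
        sys.settrace(None)

def over_covered(threshold):
    """Lines hit at least `threshold` times: candidates for redundant coverage."""
    return [loc for loc, n in hits.items() if n >= threshold]

# Two "tests" that exercise exactly the same loop:
def test_a():
    total = 0
    for i in range(5):
        total += i

run_counted(test_a)
run_counted(test_a)
```

After the two runs, `over_covered` surfaces the loop body as a hot spot, which is the anti-coverage signal: lines that many tests pile onto, rather than lines no test reaches.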
But again, dipping into my personal experience: when the tests seem poorly designed there usually isn't much actual redundancy between tests. The real problem is that people who are not intrinsically motivated are too willing to blow up the space of possibilities. Let's add another field to this protocol buffer, and let's make it yet another extensible message that others can add to. Parsimony takes time after you get to a solution. If you aren't intrinsically motivated you just ship the first shit that passes acceptance tests without thinking about how to make it more elegant. The codebase becomes an externality.
I agree that not being vested in the outcome is part of the problem, but that is exactly why I want a formal spec instead of more code. A formal spec will not lie and cheat me in order to ship the software yesterday, but an engineer who just wants to collect a paycheck will do it day in and day out.
Really it's mostly a selfish desire to take as many perverse incentives out of programming as possible and I know that a formal spec does that because I can't cheat a formal spec the same way I can cheat a test framework.
Maybe you're right and there are ways to cheat the formal spec system as well, but folks in the field seem to think that it scales and that people enjoy the process enough not to cheat like they do with tests.
Yeah, I think that's my fundamental disagreement here. You're just never going to be able to combat a real, live, lying human with some dead sequence of bits. You have to start with intrinsic motivation, not try to make do without it.