
What even is testing?

Everybody knows testing is important, but the software industry is overrun by terrible testing practices. Because of this, there has often been a negative sentiment against testing in the Handmade community. This fishbowl explores the kinds of testing the community has found most effective, the costs of testing, and the actual purpose behind testing techniques.

This is a fishbowl: a panel conversation held on the Handmade Network Discord where a select few participants discuss a topic in depth. We host them on a regular basis, so if you want to catch the next one, join the Discord!
Avatar
Hey everybody, welcome to another fishbowl!
fishbowl 8
10:06
The topic for today is testing. Anyone who has ever worked a programming job knows that testing is Important, but there's a lot of disagreement in industry about how to test. Within this community, I think there is often a sentiment against testing, or at least that the types of testing most people do are ineffective.
10:08
But despite this I think we all know that some kind of testing is in fact important. So the goal of this fishbowl is to examine our experiences with testing, and to try to learn how to think about testing and how to write effective tests.
10:09
Our participants today are:
- Kartik Agaram (@Kartik Agaram)
- Phil Homan (@Phil H)
- Andrew Reece (@Andrew (azmr))
- Demetri Spanos (@demetrispanos)
- Phillip Trudeau (@Phillip Trudeau)
- and myself, as moderator

Side conversation is in #fishbowl-audience starting here: https://discord.com/channels/239737791225790464/708458209131757598/1112064639564517457 (edited)
10:10
To start, I'd love if everyone could introduce themselves and talk briefly about their experiences with testing in the past!
10:10
I'll start...
10:11
I'm Ben, community lead, and most of my experience with testing was at a web development company for the past several years. We had automated tests for both our frontend and backend, but mostly for our frontend for some reason. We also had manual QA and large-scale automated end-to-end tests we would run before shipping. EDIT: As of recently I'm now at Mozilla on the WebAssembly team, and am coming up to speed on our test practices for Firefox and SpiderMonkey in particular. (edited)
10:11
The frontend testing in particular was really bad in my opinion, and we might have the opportunity to talk about that later...
Avatar
Good morning, I'm Phil. I will mainly be talking about testing from the perspective of a live service engineer. I worked at Microsoft for ~3.5 yrs on some of the core web services for O365. Most of the testing was focused on live monitoring of the health of the deployed services, verifying the functionality of new features, and constantly improving our processes with weekly reviews. I currently work at a small indie game studio (Pocketwatch Games) in the earlier stages of development where writing tests is not part of my day to day work.
Avatar
demetrispanos May 27, 2023 10:17 AM
I'm a computational mathematician working mainly in AI/ML, but also sometimes drones, robots, and sensing systems. I've also worked alongside web/mobile programmers for many years, so I have some osmotic knowledge. Broadly, I consider testing to be part of a spectrum of "static analysis" tools. I had a brief phase where I took the popular unit-test/TDD style for a trial and decided it definitely was not for me (my work often has problems without clear correct answers). Nowadays I evaluate testing in terms of likely programmer time saved. Usually that means integration/soak tests, and rarely fine-grain testing like unit tests.
Avatar
Hey all, I'm Andrew. I'm the lead dev on WhiteBox (a tool for showing how your code behaves as you write it / a live debugger / a runtime data timeline). Our codebase makes liberal use of statically defined assumptions & (custom) runtime asserts. Some of the more "infrastructure" code has unit tests, but we primarily rely on "integration tests" for testing user-facing behaviour. The WhiteBox team spent some time looking into fuzz-testing & property testing, which both look promising, but they dropped in priority and we never fully implemented either. We also have a couple of ideas about how the tool itself could improve testing, primarily around where testing fits in the workflow, and how to save/generate complex data contexts to test against as an alternative to mocking. (edited)
Avatar
Hi, I'm Phillip, I'm a game dev. I co-wrote a multiplayer physics game with Miles ( @notnullnotvoid ) where part of the problem statement was to implement low-latency rollback netcode for Box2D's chaotic rigidbody dynamics (!!). We employed a scattershot of different Ways To Find Bugs, which from my perspective was what the testing I did was mainly about for that project. Last year, I joined Andrew working on WhiteBox, where I've actually been contributing to the testing system recently - which as you can see is much more systematic! (edited)
Avatar
I've been all over the place in my life:
* pre-test-enlightenment: in academia working with microprocessor simulators, starting to build little CGI web pages
* test-enlightenment: moving to industry and building web apps, strong TDD
* gravitating towards the backend, offline jobs (decent number of tests, but they hindered as much as helped)
* going extremely low level (to design for testability from the ground up rather than wrap layers for things like browser testing)
* more graphical projects in the past year (certain aspects extremely hard to test)

I tend to be pretty pro tests. In my experience tests work really really well at small scales. The issues with tests tend to occur in large teams, and my lesson from that observation is that we just don't know how to grow teams without introducing entropy and dysfunction. It doesn't make sense to blame testing for that.
Avatar
Let's expand on that a bit - how effective have you all found different kinds of tests to be?
10:27
For me, for example, I often find benefit in writing tests for small, frequently-reused code like utility functions and data structures, since it helps flesh out edge cases. And I often find larger "integration" tests useful as sanity checks on the system. But a lot of the testing I did at my last job was neither, and I swear we had like a 95%+ false positive rate.
Avatar
I think it'd be worth carving out the categories real quick to start!
👆 1
Avatar
It is probably worth addressing a couple broad categories of testing, although I think people obsess over the boundaries more than they should
☝ïļ 2
Avatar
demetrispanos May 27, 2023 10:28 AM
I look at it as the following costs vs benefits

costs
- writing the test
- writing the code so that it is testable
- creating paraphernalia (mocks etc.)
- running the test (every build?)

benefits
- verifying alleged invariants so you can detect breaking changes at build instead of run time
10:28
if I can write a test easily, and don't need to change my code to do so, and there is no paraphernalia, and the test runs instantly, and it prevents me from wasting hours in the future, then it's worthwhile
Avatar
I think it's possibly worth splitting effectiveness by when the tests are written/triggered:
- before/while/immediately after writing some section of code: helping to hit all key cases while it's the primary focus
- when changing something else and unintentionally breaking old features (avoiding regressions)
👍 1
Avatar
For the cases where I've written unit tests, I've found them far more useful for the former (as an aid to writing code correctly). Integration tests seem more useful for the latter, as there are more unexpected interactions that break behaviour.
👍 1
👆 1
Avatar
also relevant are the times that tests aren't worthwhile...
Avatar
for automated testing, I tend to think of "unit" and "integration" as the only two broad categories, where "unit" means isolated testing of a small piece of code (e.g. a function or data structure), and "integration" means testing the actual system end-to-end (edited)
👍 1
Avatar
- Unit testing: making sure a function/class/tiny subsystem/etc. doesn't have edge cases or doesn't crash. For my work on Happenlance, I did this rarely, in special cases where it was easy to perform.
- Integration testing: stressing major subsystems, either to proactively ensure no regressions or just to reactively catch bugs. I did this in a major way for Happenlance, and I should have done it more!
- System testing: "Really Running" the product and reactively catching bugs. I started to lean more heavily in my preferences toward this end over the course of development and post-release.
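A unit test in the first sense can be as small as a handful of assertions on a leaf function. A minimal sketch in Python, using a hypothetical `clamp` utility (not from any project discussed here):

```python
def clamp(x, lo, hi):
    """Clamp x into the inclusive range [lo, hi]."""
    return lo if x < lo else hi if x > hi else x

def test_clamp():
    assert clamp(5, 0, 10) == 5    # in range: unchanged
    assert clamp(-1, 0, 10) == 0   # below range: pinned to lo
    assert clamp(11, 0, 10) == 10  # above range: pinned to hi
    assert clamp(0, 0, 10) == 0    # boundaries pass through
    assert clamp(10, 0, 10) == 10

test_clamp()
```

The value here is mostly in being forced to enumerate the edge cases while the function is still the primary focus.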
Avatar
Avatar
bvisness
It is probably worth addressing a couple broad categories of testing, although I think people obsess over the boundaries more than they should
The categories of tests for me are really indicative of code in general. If your project is small, things that would be integration tests for others become unit tests for you. Life is good.
Avatar
Yeah it's different levels of granularity. I've generally found the less granular tests more useful (edited)
10:33
as a failure upstream will cause a large effect downstream in most cases (edited)
Avatar
Echoing Andrew, the unit tests I wrote for Happenlance were: "Let me make sure i've implemented this memory-XOR function correctly in all cases". "Let me make sure I've written this packet compressor correctly without any dumb bugs". Those tests got #ifdef'd out and I think later deleted over the course of development! Their primary utility was aiding writing. (edited)
🤔 2
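The memory-XOR case lends itself to quick property-style checks, since XOR has algebraic invariants you can assert directly. A sketch (the real Happenlance function is not shown in this conversation; `xor_buffers` here is a hypothetical stand-in):

```python
def xor_buffers(a: bytes, b: bytes) -> bytes:
    # XOR two equal-length byte buffers (e.g. for delta-encoding packets).
    assert len(a) == len(b)
    return bytes(x ^ y for x, y in zip(a, b))

base = bytes(range(16))
zeros = bytes(16)

# Properties worth checking in a throwaway unit test:
assert xor_buffers(base, zeros) == base            # XOR with zeros is identity
assert xor_buffers(base, base) == zeros            # XOR with itself is zeros
delta = xor_buffers(base, bytes(reversed(base)))
assert xor_buffers(xor_buffers(base, delta), delta) == base  # involution
```

Exactly the kind of test that earns its keep while writing the code, then gets deleted.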
Avatar
Another few factors in whether to write tests/what type to write are:
- is this an "artsy" or a "mathsy" problem I'm solving?
- am I writing to fit a pre-existing specification? (e.g. something that has to match with Linux's function call ABI)
- what does the dependency tree of this code look like?
  - small: this is close to a leaf of the call tree/1-off code
  - broad: lots of things directly depend on this (e.g. infrastructure code - allocators, error handling...)
  - deep: there's a long chain of dependencies on this thing (e.g. complex app-specific pipelines)
- do I have an "Oracle" that can automatically tell me whether an answer is correct here? (edited)
👆 2
Avatar
root vs. leaf is definitely something I think about when testing
Avatar
Avatar
demetrispanos
I look at it as the following costs vs benefits costs - writing the test - writing the code so that it is testable - creating paraphernalia (mocks etc.) - running the test (every build?) benefits - verifying alleged invariants so you can detect breaking changes at build instead of run time
The benefits I find from testing:
* More reliable software. It's not clear that testing reduces bugs. But it certainly reduces regressions if you write a test whenever you find a bug.
* Better design. Any time spent making code easier to test results in moving towards the ideal of decoupled code.
* Documentation. It's hard to write documentation and keep it up to date. When I'm curious about something that I can't find docs for, tests (along with version control history) are a great source of breadcrumbs on the intent behind a module or subsystem. There are other reasons to write tests, so they often get done even when docs don't. And once they're done they're more likely to stay up to date than documentation.

The big cost of testing for me is not the time spent writing the test; that's negligible. Time spent making the code testable is a benefit, not a cost. The big cost is writing the wrong tests, ones that are dead weight or flaky.

To back up a bit, programming in my experience is not an amorphous activity where you're moving a knob from 88% to 90%. It feels like a black and white activity. Either you end up somewhere interesting or you don't. Time matters for this, but it's a satisficing rather than optimizing criterion. You get no points for getting somewhere uninteresting fast. Tests for me increase the odds of getting somewhere interesting. My path becomes more intentional where it used to be more like Brownian motion.
😮 1
Avatar
What about during more exploratory phases of development? Do tests still benefit you there? Seems like writing tests that may quickly become irrelevant as the program requirements change would just be wasted effort.
10:39
I suppose that is what you mean by "writing the wrong tests" (edited)
Avatar
The test I found the most valuable for our game was an input replay system - since our gameplay code already needed to be deterministic and isolated for networking to function, it was easy to save & load input histories and resimulate the game at high speed to make sure nothing crashed, or to repro a glitch. Input replay is actually similar to how WhiteBox's integration testing works!
☝ïļ 1
10:40
But it's worth pointing out that I wouldn't have been able to write that incredibly valuable integration test had our software architecture not already mandated a clean slice / hermetic seal of the gameplay tick code! (edited)
💯 1
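The shape of that input-replay test is simple once the tick is deterministic and hermetic: record inputs, resimulate, compare against a saved checkpoint. A toy sketch (all names hypothetical; the real system resimulated full Box2D gameplay, not an integer):

```python
def tick(state, inp):
    # Toy deterministic "gameplay" step; a stand-in for a real
    # isolated, deterministic gameplay tick.
    return (state * 31 + inp) & 0xFFFFFFFF

def simulate(inputs, initial=0):
    state = initial
    for inp in inputs:
        state = tick(state, inp)
    return state

recorded_inputs = [3, 1, 4, 1, 5, 9, 2, 6]  # captured during play
checkpoint = simulate(recorded_inputs)       # saved at record time

# Later (or on another machine): replay at full speed and verify
# nothing diverged or crashed along the way.
assert simulate(recorded_inputs) == checkpoint
```

The whole test is cheap precisely because the architecture already forbids the tick from touching anything non-deterministic.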
Avatar
demetrispanos May 27, 2023 10:41 AM
You get no points for getting somewhere uninteresting fast. Tests for me increase the odds of getting somewhere interesting.
purely as a personal experience question, this is the opposite of my experience
10:41
I find testing to be, always, inertia
10:41
sometimes worthwhile inertia, but always inertia
10:42
none of the most creative projects I've ever worked on used testing seriously until very late in the lifecycle
👆 1
💯 1
Avatar
Agreed - I only wrote the tests I wrote because I didn't have to sledgehammer my own intuitive goals for the software's architecture just to avail the opportunity to test. I don't think I'll ever want to do that. (edited)
Avatar
I find the value of testing increases significantly once you are actually shipping and pushing updates. The cost of bugs is much higher once it is affecting actual customers
Avatar
☝ïļ . It was only post-launch that I wrote a crash reporter for Happenlance, and not only was it incredibly useful, but it was fun, and I'm still proud of it :)
10:43
I should've done it prelaunch!
Avatar
Avatar
demetrispanos
none of the most creative projects I've ever worked on used testing seriously until very late in the lifecycle
is "creative" similar to my use of "artsy" here (as juxtaposed against "mathsy")? Or do you mean more as "innovative"?
Avatar
demetrispanos May 27, 2023 10:44 AM
yes, when you don't know the solution in advance
👍 1
Avatar
demetrispanos May 27, 2023 10:45 AM
"make me a GUI that makes it intuitive and fast to book flights"
10:45
99% of the value of that activity is trying many different ideas
Avatar
Kartik Agaram May 27, 2023 12:30 PM
@Andrew (azmr) can you elaborate on how "artsy vs mathsy" might affect your approach to dev or testing?
Avatar
Andrew (azmr) May 27, 2023 12:42 PM
There are a couple of components:
- requirements: precisely defined for "mathsy" - there is an objectively right and wrong answer; squishy for "artsy" - there may be a broad range of acceptable answers
- testability: mathsy stuff is often working with more primitive data, so is easier to generate & evaluate tests for; artsy stuff may have to be visualized and subjectively evaluated in context

("artsy" and "mathsy" are really 2 vague points in at least a spectrum, probably a multi-D space; they don't describe all possible categories) (edited)
âĪïļ 2
👆 1
Avatar
Crash reporting can be thought of as testing, but you're testing on the user's computer 🫣
💥 1
Avatar
Avatar
Phillip Trudeau
Crash reporting can be thought of as testing, but you're testing on the user's computer 🫣
this is a very important part of the testing story at Mozilla actually
‾ïļ 1
Avatar
Avatar
Phil H
What about during more exploratory phases of development? Do tests still benefit you there? Seems like writing tests that may quickly become irrelevant as the program requirements change would just be wasted effort.
Yeah there's definitely some art here. I'm not pro TDD. TDD is a good set of training wheels. As a concrete example, when I published a text editor last year the 1.0 release had 0 tests. Eventually I started having enough bugs that I hate-wrote a bunch of tests. You can see the list here: https://lobste.rs/s/vsb6ue/case_for_models#c_d2mmvi
Avatar
In one sense, 1.0 is exactly the perfect time to write tests, because you know exactly what the product actually is and what's worthwhile, but it is also exactly the wrong time to write tests, because it's after a bunch of people hit a bunch of bugs 😅
💯 3
Avatar
demetrispanos May 27, 2023 10:47 AM
well "wrong" relative to what? in my framework it's just a question of how much time/effort you wasted
10:47
if you got to 1.0 faster, with more exploration, nothing was wrong or wasted
☝ïļ 1
Avatar
Right - but for something like a game, it's not ideal to have a bunch of customers who get a crash on startup 😳
Avatar
demetrispanos May 27, 2023 10:48 AM
winning a boxing match isn't a question of never taking a hit: it's not taking hits that you don't need to take
🥊 1
10:48
programming isn't about never having a bug
📠 1
10:48
etc.
Avatar
yeah, bugs are inevitable. Minimizing the potential damage and the effort spent finding and fixing bugs are my goals when testing (edited)
Avatar
yeah, we certainly didn't write any tests at all to my knowledge during the initial prototyping phases of development - that seems to me like a huge mistake. To explore is to travel light!
Avatar
The only times when I feel like testing during active development is if I have to write something particularly tricky, e.g. a data structure
👍 1
🌲 1
10:51
I've been burned many times by poor implementations that break all the interesting stuff I want to do
10:52
I guess broadly a chunk of my testing is just "I want to run a small piece of this program right here and now"
👆 2
Avatar
Also for refactoring imo. If I have a system that works as expected and I need to re-write it or swap out a dependency, having a suite of tests to verify that the new implementation works the same saves a ton of time
👆 1
âĪïļ 1
Avatar
strager (audience) May 27, 2023 10:10 AM
My take: If I don't test my code, how do I know it works? (I don't.) If I don't know if it works, why did I bother writing it? To me, testing isn't just automated testing. Manual testing is testing too. I either manually test, automatically test, or both. No testing is not an option.
💯 2
Avatar
Avatar
Phillip Trudeau
Right - but for something like a game, it's not ideal to have a bunch of customers who get a crash on startup 😳
Manual testing helps there, as @strager mentioned in the audience. Even though I had 0 tests, I had a long list of manual tests I was running through on a daily basis. (I focused on figuring out an automated test framework when the list exploded 3x.)
Avatar
good point - by far my most highly valued testing harness is of course Building And Running The Thing In The Debugger Every 2 Minutes
10:53
It's part of why having fast build times is super valuable - iteration, exploration, poking, prodding, stressing
💯 1
Avatar
Avatar
Phillip Trudeau
It's part of why having fast build times is super valuable - iteration, exploration, poking, prodding, stressing
demetrispanos May 27, 2023 10:54 AM
yes and fast build is sometimes in conflict with testing
Avatar
demetrispanos May 27, 2023 10:55 AM
here's a thought experiment: how many bugs would your compiler have to magically catch automatically for you to accept it being 10x slower? 2x slower? etc.
Avatar
Just don't ask that question to Rust programmers 😏
Avatar
Avatar
Phillip Trudeau
good point - by far my most highly valued testing harness is of course Building And Running The Thing In The Debugger Every 2 Minutes
ahem or having it running live alongside your code 😇
🔲 2
Avatar
The speed of testing will be a good topic to explore I think, because there are some significant workflow and test-quality improvements that come with faster tests
Avatar
Avatar
demetrispanos
here's a thought experiment: how many bugs would your compiler have to magically catch automatically for you to accept it being 10x slower? 2x slower? etc.
I think this is also an important question because you need to be actually catching bugs to make the time worthwhile
Avatar
Avatar
demetrispanos
yes and fast build is sometimes in conflict with testing
When I was a li'l baby programmer writing my own math.h library, I had a function that unit-tested each function on all 4 billion floats, and it caught a lot of problem inputs that were far away from zero, but it was annoying to wait for! I got a more intuitive sense of my code, and more immediate value, out of writing a graphing program that let me view a graph of each function and sample the value of the function at the mouse cursor. Both were useful, just on different timescales! I'd run the 4-billion-float test, trigger a bug, then seek to that coordinate in the graph to see what's going on
💗 3
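The exhaustive-float idea works in any language that can reinterpret 32-bit patterns. A Python sketch that strides through the bit space to stay fast (the full version loops over all 2^32 patterns; `my_rsqrt` is a hypothetical stand-in for a hand-written routine under test):

```python
import math
import struct

def bits_to_float(u):
    # Reinterpret a 32-bit pattern as an IEEE-754 single-precision float.
    return struct.unpack("<f", struct.pack("<I", u))[0]

def my_rsqrt(x):
    # Stand-in for the hand-written routine being validated.
    return 1.0 / math.sqrt(x)

# The exhaustive version uses range(2**32); stride through it here so
# the example runs in a blink.
for u in range(0, 2**32, 2**20):
    x = bits_to_float(u)
    if not math.isfinite(x) or x <= 0.0:
        continue  # skip NaN, inf, zero, and negative inputs
    got, want = my_rsqrt(x), x ** -0.5
    assert math.isclose(got, want, rel_tol=1e-6), (x, got, want)
```

The nice property of this test is that it needs no hand-picked cases at all: the input space is small enough to simply enumerate.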
Avatar
Avatar
bvisness
The speed of testing will be a good topic to explore I think, because there are some significant workflow and test-quality improvements that come with faster tests
This is a good point. The tradeoff to Building and Running The Thing is that it's fast and easy, but you also tend to get stuck in a rut of testing things the exact same way every time, missing problems that automated testing would have caught -- as WhiteBox's integration testing system has proven to me time and again 😅
Avatar
As has already been mentioned, adding tests adds friction to changing the code under test, particularly if changing APIs. For WhiteBox code this seems to manifest in us unit testing the leaves of the call tree/infrastructure code (which operates on simple data types), and the root of the tree (fake user input), but basically nothing in the middle levels. Those are left to either asserts or the full-program tests to surface errors. I find that the mid-levels are where most of the code/API volatility is, as new requirements call for refactoring/restructuring. This sandwich structure basically means that writing tests is O(1) rather than O(n_code_edits) (edited)
Avatar
That's funny; in my game dev experience it seems like a lot of my testing was smack in the middle of the sandwich! (edited)
🤔 1
🥬 1
Avatar
Let's talk more about the costs and benefits of testing - I think that will be helpful because a lot of people underestimate the costs and overestimate the benefits
11:06
For costs, Demetri already mentioned inertia, but I think there are several facets to that
11:06
There's dev time, obviously, although it seems some developers (perhaps Kartik for example?) have workflows that lend themselves to lots of testing on the fly
11:07
compile time and run time too, obviously
11:07
but are there others?
Avatar
Avatar
bvisness
but are there others?
demetrispanos May 27, 2023 11:07 AM
procrustean costs (having to make your code testable) and paraphernalia costs (mocks or whatever) (edited)
Avatar
Unit testing seems like the highest inertia - you are writing multiple nontrivial functions, that (hopefully!) cover a huge input range, for each little quantum of your software
Avatar
Avatar
demetrispanos
procrustean costs (having to make your code testable) and paraphernalia costs (mocks or whatever) (edited)
you should go more into what it takes to "make your code testable"
11:08
many people regard it as an unconditional good to do so
11:08
you seemingly do not 🙂
Avatar
demetrispanos May 27, 2023 11:08 AM
being testable is unconditionally good
☝ïļ 2
11:08
but you don't get it for free
👍 1
11:09
testability requires "points of legibility" where you can extract state snapshots to be compared against reference cases
☝ïļ 2
11:09
but general coding does not require that
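One way to picture a "point of legibility": the code grows an extra seam purely so a test can extract a comparable state snapshot. A hypothetical sketch (none of this is from any project discussed here):

```python
class Simulation:
    def __init__(self):
        self.entities = []

    def spawn(self, x, y):
        self.entities.append({"x": x, "y": y})

    def step(self):
        for e in self.entities:
            e["x"] += 1

    def snapshot(self):
        # The extra code written purely so tests can "see in": a stable,
        # comparable view of internal state. Not needed by the app itself.
        return tuple(sorted((e["x"], e["y"]) for e in self.entities))

sim = Simulation()
sim.spawn(0, 0)
sim.spawn(5, 2)
sim.step()
assert sim.snapshot() == ((1, 0), (6, 2))  # compare against a reference case
```

`snapshot` is exactly the kind of thing general coding does not require but testability does.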
Avatar
Avatar
Phillip Trudeau
That's funny; in my game dev experience it seems like a lot of my testing was smack in the middle of the sandwich! (edited)
oh interesting! Did you have the same volatility layout there?
Avatar
the code inside the gameplay tick was volatile, and the code outside the gameplay tick was volatile, but having the gasket between those two layers was a constant - so maybe it's the slice of lettuce near the top of the sandwich instead of the bread per se 🤪
Avatar
so yeah, in Demetri's terms, we had that point of legibility
Avatar
I've said some of this already, but to lay it out I see the following costs of testing:
* False-positive tests that fail flakily when nothing is actually wrong.
* Brittle tests where the API keeps changing, as @Andrew (azmr) mentioned. In large teams it's easy to lose one's bearings and write tests that aren't useful, or keep tests around past their usefulness. Those tend to be the hardest decisions: what tests to write and when to throw away a test.
* This isn't a drawback of tests, exactly, but the biggest way tests fail to help is by being difficult to write in many situations. Graphics, non-determinism and simulation are some domains where I've been forced to muddle along without tests, and where others who force themselves to write tests often end up with less-than-useful tests.
Avatar
Avatar
demetrispanos
testability requires "points of legibility" where you can extract state snapshots to be compared against reference cases
where do you see this causing problems?
Avatar
Avatar
Phillip Trudeau
so yeah, in Demetri's terms, we had that point of legibility
demetrispanos May 27, 2023 11:12 AM
I think having tests concentrated at "narrow interfaces between larger components" is good, and it lines up with my preference for log-file debugging
Avatar
Re: Overestimating the benefits of tests

Some things are inherently hard to test. Most of the nasty bugs I dealt with for live services are ones that did not manifest until deployed to a production environment. For example, during a deployment servers would have a mix of build V1 and the new build V2. A user would make a request to V2 and write new binary data to storage. The subsequent request might happen to try to retrieve that data but using a server with build V1, as build V2 was still rolling out. V1 has no knowledge of the new format and would fail to retrieve the data. Without writing a complex testing system to verify a user using a mix of builds, this is a case where a developer just needs to be aware of the consequences of using a new memory layout. Other bugs also followed this pattern of unanticipated scenarios, such as operation under high load, multithreading bugs, and unexpected user data. (edited)
Avatar
Ohhhhh yes, I have found graphics to be hard to test. That could easily send me on a tangent about using external tooling for testing, which can be a dream or a nightmare of its own.
Avatar
Avatar
demetrispanos
procrustean costs (having to make your code testable) and paraphernalia costs (mocks or whatever) (edited)
I disagree with this comment but love the metaphor 🙂 Code having to stretch or chop because of some external constraint is :chefkiss:
Avatar
Avatar
bvisness
where do you see this causing problems?
demetrispanos May 27, 2023 11:13 AM
well if the invariant you want to test is not manifested anywhere in the code, i.e. is an emergent property, now you have to manifest it somehow
11:14
if every test were of the form "verify x >= 0" then we wouldn't have this conversation
Avatar
Avatar
Kartik Agaram
I've said some of this already, but to lay it out I see the following costs of testing: * False-positive tests that fail flakily when nothing is actually wrong. * Brittle tests where the API keeps changing, as @Andrew (azmr) mentioned. In large teams it's easy to lose one's bearings and write tests that aren't useful, or keep tests around past their usefulness. Those tend to be the hardest decisions, what tests to write and when to throw away a test. * This isn't a drawback of tests, exactly, but the biggest way tests fail to help is by being difficult to write in many situations. Graphics, non-determinism and simulation are some domains where I've been forced to muddle along without tests, and where others who force themselves to write tests often end up with less-than-useful tests.
False-positive tests that fail flakily when nothing is actually wrong.
This reminds me that I haven't discussed asserts!! 😁
11:15
speaking of verify(x >= 0), haha
Avatar
we definitely have some overzealous asserts 😅
Avatar
Yeah, don't let my lack of interest in unit tests imply that I'm not a huge assert zealot
Avatar
I don't really think of most of my runtime/compile-time asserts as "tests". Rather just verifications of the assumptions I am making in code (edited)
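Those checked assumptions can be as small as a `verify` helper that, unlike C's `assert`, stays enabled in release builds so bad states fail loudly instead of corrupting data downstream. A hypothetical sketch (not WhiteBox's actual assert machinery):

```python
class AssumptionViolated(Exception):
    pass

def verify(cond, msg="assumption violated"):
    # A runtime assert that is never compiled out: it documents and
    # enforces an assumption the surrounding code relies on.
    if not cond:
        raise AssumptionViolated(msg)

def advance_cursor(pos, length, buffer_len):
    verify(0 <= pos <= buffer_len, "cursor out of range")
    verify(length >= 0, "negative length")
    return min(pos + length, buffer_len)

assert advance_cursor(3, 4, 10) == 7
try:
    advance_cursor(-1, 4, 10)
    raise RuntimeError("should have raised")
except AssumptionViolated:
    pass  # the bad state was caught at the boundary, as intended
```

In this framing the assert is documentation of an assumption first, and a "test" only incidentally.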
Avatar
Avatar
bvisness
you should go more into what it takes to "make your code testable"
demetrispanos May 27, 2023 11:17 AM
to elaborate on this, what test do you propose for verifying that your hash function mixes well on your data set? you can do it, of course, but it's not something you can just say x>=0 from code you already have
☝ïļ 1
Avatar
Debugging and testing have been brought up as distinct, but a large benefit I've found from our integration tests is to repeatedly get the program to a state where something interesting is happening that needs inspection in a debugger without having to manipulate the GUI by hand each time. This really helps to keep all of the problem details and hypotheses in my working memory at the right LOD without evicting them from cache with "and then I have to click this over here...". (edited)
👆 2
âĪïļ 1
Avatar
Avatar
demetrispanos
testability requires "points of legibility" where you can extract state snapshots to be compared against reference cases
I certainly don't think it's worth making every single line of code testable. Rather the ideal is to have a large project criss-crossed with places where we're cross-checking answers from two sources (tests are really just a second source, along with types and probably other approaches). When I write large projects it's easy to forget to keep this mesh network of cross-checks developing alongside features. Tests remind me to maintain this mesh network.
👍 1
Avatar
demetrispanos May 27, 2023 11:18 AM
as for whether procrustean costs exist, I'm not sure I understand the potential disagreement ... surely we all agree that there is code that is not testable, and it can be (maybe) made testable?
11:18
(and that takes effort and design, which is a procrustean cost)
Avatar
The input space might be so large that it's infeasible to test every possible case
☝ïļ 1
Avatar
yeah, with agile-ish / UX-heavy / "goopy" programming, it's very easy to make a lot of code that is very not-systematically-testable, but you're doing that because you want to get to interesting places
11:21
at that point, you either bust out the sledgehammer, or you fall back to wide wide system testing (crash reporter, QA teams, etc) (edited)
11:21
you just shoot the rocket at the moon 10 times and see if you get there on the 11th
11:23
There's no formal proof that the Saturn V can make it to orbit, etc
Avatar
Re procrustean costs: one way that the WhiteBox codebase lucked out with integration testing is that we had to make our user actions/inputs data-driven so that it could be controlled by an external editor. This meant that our test "harness"(?) basically just acts as another editor sending it commands (really just pushing them into a queue). If we didn't already have this constraint and instead had input handling done inline with other GUI code, I'm not sure how I would have felt about refactoring to force this structure just for testing...
👍 1
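The "harness as just another editor" structure might look like this in miniature: the test pushes commands onto the same queue the real editor feeds, so the tested code path is identical to the live one (all names here are hypothetical, modeled on the description above):

```python
from collections import deque

class App:
    def __init__(self):
        self.commands = deque()
        self.document = []

    def push(self, cmd, *args):
        # Both the real editor and the test harness call this.
        self.commands.append((cmd, args))

    def pump(self):
        # One code path handles live input and test input alike.
        while self.commands:
            cmd, args = self.commands.popleft()
            if cmd == "insert":
                self.document.append(args[0])
            elif cmd == "undo" and self.document:
                self.document.pop()

app = App()
for step in [("insert", "a"), ("insert", "b"), ("undo",)]:
    app.push(*step)
app.pump()
assert app.document == ["a"]
```

Because input was already data-driven, the harness costs almost nothing; the procrustean cost was paid up front for an unrelated reason.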
Avatar
Hey, that's another example of tests organically emerging out of an architectural boundary that was already being imposed!
11:25
Tests are like oysters.
11:26
They grow where the water meets the shore
Avatar
Avatar
demetrispanos
to elaborate on this, what test do you propose for verifying that your hash function mixes well on your data set? you can do it, of course, but it's not something you can just say x>=0 from code you already have
what do you do in these situations to verify that your hash function is doing its job well? (edited)
Avatar
demetrispanos May 27, 2023 11:26 AM
I have to build an entire separate testing system to generate inputs and log outputs and perform statistical checks, i.e. code that was not at all necessary for the functioning of the application
Avatar
demetrispanos May 27, 2023 11:27 AM
this is also why "just use murmur3" became a refrain for me :)
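The separate statistical harness demetri describes might be sketched like this: hash a corpus of keys, bucket the outputs, and compute a chi-squared-style uniformity statistic. Everything here is illustrative — the toy hash (a djb2-style mixer), the corpus, and the bucket count are not from any real codebase.

```python
def toy_hash(s):
    # deliberately simple djb2-style hash, purely for illustration
    h = 5381
    for ch in s:
        h = (h * 33 + ord(ch)) & 0xFFFFFFFF
    return h

def chi_squared_uniformity(keys, buckets=64):
    # bucket every hash and measure how far the counts are from uniform
    counts = [0] * buckets
    for k in keys:
        counts[toy_hash(k) % buckets] += 1
    expected = len(keys) / buckets
    return sum((c - expected) ** 2 / expected for c in counts)

keys = [f"user-{i}" for i in range(10_000)]
stat = chi_squared_uniformity(keys)
print(f"chi2 over {len(keys)} keys in 64 buckets: {stat:.1f}")
# with 63 degrees of freedom, a value wildly above ~63 flags poor mixing
```

Note that none of this code is needed by the application itself — which is exactly the cost being described.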
Avatar
Avatar
demetrispanos
I have to build an entire separate testing system to generate inputs and log outputs and perform statistical checks, i.e. code that was not at all necessary for the functioning of the application
This is one reason I love the domain of games: there's a huge culture of building out huge frameworks of tools. Other industry verticals could learn a thing or two.
Avatar
That's reminiscent of what happens in profiling, btw - which I suppose is a narrow special-case form of testing
Avatar
well, it's a form of measurement at least, although I did once write a test called it('should take less than 1000 years to run this code')
😂 3
âĪïļ 1
Avatar
Avatar
bvisness
well, it's a form of measurement at least, although I did once write a test called it('should take less than 1000 years to run this code')
Some of my tests don't have any assertions, just a comment # shouldn't crash.
👍 1
💥 1
Avatar
right, you always hear about unit tests and integration tests, but why not unit benchmarks and integration benchmarks?! I know some companies will autofail commits with a certain % slowdown for given inputs
😮 1
👍 3
gottagofast 1
📉 2
11:30
speed is a form of behaviour...
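A "unit benchmark" in that sense might look like this sketch. The workload, best-of-N timing scheme, and budget are all invented for illustration; a real budget would be tuned per machine and per CI environment.

```python
import time

def sum_squares(n):
    # hypothetical unit under benchmark
    return sum(i * i for i in range(n))

def bench(fn, *args, repeats=5):
    # best-of-N wall-clock timing; crude, but steady enough for a budget check
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best

BUDGET_SECONDS = 0.05  # invented budget, would be tuned in practice
elapsed = bench(sum_squares, 10_000)
assert elapsed < BUDGET_SECONDS, f"perf regression: {elapsed:.4f}s"
```

A CI job running this per commit and failing on budget overrun is the "% slowdown" autofail gate in miniature.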
Avatar
We had that kind of live monitoring for our services. On-call engineers would be woken up if latency was too high
📈 2
🔥 1
11:31
And we wouldn't deploy new builds if the test servers exhibited different latency behavior
Avatar
Yeah, once you get into distributed system testing (multi-server/multi-thread/concurrency) then testing becomes a LOT harder!
Avatar
So, when I think back on the testing at my web dev job, I think the tests had really high costs for next to zero benefit. We had some kind of code coverage requirement, which we achieved, but the tests were almost all using this "snapshot" feature of the Jest JS testing framework, where the framework just saves the output of your program to a file and later asserts that the output is exactly the same.
11:32
This led to the insane false positive rate I mentioned before, because the data being snapshotted was, like, the full DOM produced by a React component
11:33
adding a single CSS class would start failing tests
Avatar
to combat this, Jest had a convenient feature which would automatically update all your snapshot files to stop failing
😎 6
✅ 1
Avatar
these tests took almost an hour to run and had essentially zero benefit
Avatar
but to me it clearly demonstrates a problem with mere "coverage"
☝ïļ 2
11:35
I guess the best you could say is that we had it-doesn't-crash coverage of 70% or so of our code?
11:35
but that's being generous
Avatar
Avatar
bvisness
but to me it clearly demonstrates a problem with mere "coverage"
Ugh, "coverage"
Avatar
Avatar
Andrew (azmr)
speed is a form of behaviour...
Absolutely. But autofailing on an x% slowdown is a blunt instrument: Chrome has done this since launch, and the result was that the app grew slower with a slope of x%. One approach I've found in the past is to write performance tests in a white-box (ha!) manner: count some sort of metric while running a test, and verify afterwards that it stays within some bound. For example, a sort function may count swaps. Sometimes counting allocations is useful too. Still blunt, but less blunt.
😆 1
👍 1
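The swap-counting idea can be sketched in a few lines. The sort, the data, and the budget are all invented for illustration; the point is that the metric is deterministic, unlike wall-clock time.

```python
def bubble_sort_counting(xs):
    # instrumented sort: returns the sorted list AND the swap count
    xs = list(xs)
    swaps = 0
    for i in range(len(xs)):
        for j in range(len(xs) - 1 - i):
            if xs[j] > xs[j + 1]:
                xs[j], xs[j + 1] = xs[j + 1], xs[j]
                swaps += 1
    return xs, swaps

def test_sort_swap_budget():
    data = [5, 1, 4, 2, 8]
    sorted_xs, swaps = bubble_sort_counting(data)
    assert sorted_xs == sorted(data)
    # n*(n-1)/2 is the worst case; a regression that adds extra
    # swapping would trip this budget even if the output is correct
    n = len(data)
    assert swaps <= n * (n - 1) // 2

test_sort_swap_budget()
```

Because the count is exact and repeatable, this kind of test doesn't flake the way timing-based gates do.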
Avatar
I was talking to @Shaw about FastVM a while back and I suggested autofailing commits on less-than-x% speedup. It's not a crazy idea! (edited)
Avatar
Avatar
Phillip Trudeau
Yeah, once you get into distributed system testing (multi-server/multi-thread/concurrency) then testing becomes a LOT harder!
demetrispanos May 27, 2023 11:34 AM
yes, again, why I have settled on log files and scripted log analysis as my main tool
Avatar
demetrispanos May 27, 2023 11:37 AM
elaborating on this, you can write a script that says "there's never an event of type B that happens outside a window of 50ms of an event of type A"
11:37
very annoying to do in conventional testing, trivial to do with logs
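That scripted log check can be sketched in a few lines. The log-line format (`"<timestamp_ms> <event_type>"`) and the event names are invented for illustration:

```python
def check_b_near_a(log_lines, window_ms=50):
    # property: every event of type B occurs within window_ms of some A
    events = []
    for line in log_lines:
        ts, kind = line.split()
        events.append((int(ts), kind))

    a_times = [ts for ts, kind in events if kind == "A"]
    violations = []
    for ts, kind in events:
        if kind == "B":
            # B is fine if any A falls within the window
            if not any(abs(ts - a) <= window_ms for a in a_times):
                violations.append(ts)
    return violations

log = ["100 A", "120 B", "400 B", "900 A"]
print(check_b_near_a(log))  # the B at t=400 is >50ms from every A
```

The system under test never knows it's being tested — it just logs, and the property lives entirely in the analysis script.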
Avatar
Avatar
demetrispanos
yes, again, why I have settled on log files and scripted log analysis as my main tool
I originally described my white-box tests as making assertions on the log a test emitted. http://akkartik.name/post/tracing-tests
Avatar
demetrispanos May 27, 2023 11:36 AM
yes, lines up very well with how I think about logging/testing
Avatar
Avatar
Phillip Trudeau
Yeah, once you get into distributed system testing (multi-server/multi-thread/concurrency) then testing becomes a LOT harder!
That's why I found the most valuable testing to be fake user accounts making requests from other servers to verify that the actual deployed code worked with its real dependencies. Never found much use in mocks. We also had a lot of process around rolling out new features slowly to small subsets of users, and always having monitoring and logging in place (edited)
Avatar
and a really solid way to turn new features on and off if things went south
Avatar
Avatar
bvisness
to combat this, Jest had a convenient feature which would automatically update all your snapshot files to stop failing
This makes me think that the central failing of "test-orientation" is the impulse to write tests-as-rote. If you're not building testing systems intelligently and deliberately, I don't see any value in them!
🤔 1
Avatar
at the same time, @Phil H was talking in coffee on Thursday about devs who would ship stuff without ever running it at all
Avatar
everybody knows that mere lines-of-code coverage is a flawed metric, but more broadly, I'm curious how people here think about how well their tests "cover" the system
11:38
for any system I test, I have some intuition for how much of the space is covered, even if 100% of the lines are covered
11:38
but is that feeling accurate and can it scale?
11:38
it seems to devolve into the worst kind of "what gets measured gets managed" in practice
☝ïļ 1
Avatar
Avatar
bvisness
everybody knows that mere lines-of-code coverage is a flawed metric, but more broadly, I'm curious how people here think about how well their tests "cover" the system
I've got nothin' here. I just go with my gut: http://akkartik.name/post/2009-01-24-19-46-26_002-soc. I started writing tests for my text editor last year when I started swearing at myself for my own incompetence.
Avatar
@Phil H , you mentioned on Thursday that you would have weekly reviews of test and system failures - that might be relevant here?
11:41
what were those conversations like and what were the outcomes?
Avatar
demetrispanos May 27, 2023 11:43 AM
I would again emphasize the significance of narrow interfaces here, because testing there yields much more benefit than elsewhere
11:43
the purpose of "coverage" is to be able to localize problems to a "covered thing"
11:44
if you just have one test for your whole program, it tells you "there's a problem somewhere in the program, glhf"
Avatar
demetrispanos May 27, 2023 11:44 AM
so now you want your "covered things" to be useful attentional units
11:44
"there's a problem in the DB driver" ok cool I know where to start poking
Avatar
the purpose of "coverage"
now we're getting somewhere
Avatar
Avatar
bvisness
for any system I test, I have some intuition for how much of the space is covered, even if 100% of the lines are covered
it seems like a more apt (but harder to pin-down) metric is "coverage of potential state-space", which can be approximated as "coverage of potential inputs". This, I think, captures why I lean on root-level end-to-end integration tests: "are we looking right for the common and known-awkward user inputs?". These will occasionally miss some important edge-cases (e.g. ring structures wrapping with a particular offset) that unit tests can then precisely target. (edited)
☝ïļ 2
Avatar
Yeah, and testing at the bottleneck - like the health inspector at the conveyor belt - lets you vet the entire subsystem
Avatar
demetrispanos May 27, 2023 11:45 AM
right exactly, if one test can eliminate 90% of the code as a possible source of problems, that is good
Avatar
Avatar
bvisness
@Phil H , you mentioned on Thursday that you would have weekly reviews of test and system failures - that might be relevant here?
Yeah, we had a weekly review of the health of the services. The focus would be around post-mortems for any severe incidents that impacted users. We would also look at peak server load to see if we needed to expand, and at other "key performance metrics" such as latency to see if there were any regressions. For any post-mortem there was usually a chain of things that went wrong for it to get to that point, and we would assign urgent tasks to cover the holes in our process and testing. It could be anything from writing tests, to adding live monitoring for something that went unnoticed, to adding a new manual verification step to our build verifications, etc.
Avatar
one of my robotics team parents was just telling me about changes to quality inspection in the early days of the automotive industry and now I wish I remembered who the people involved were (edited)
Avatar
Avatar
demetrispanos
the purpose of "coverage" is to be able to localize problems to a "covered thing"
this is certainly not the vibe I've gotten from the web dev world btw
Avatar
Avatar
bvisness
this is certainly not the vibe I've gotten from the web dev world btw
demetrispanos May 27, 2023 11:48 AM
indeed no, but just study it as a "user experience"
11:48
test fails, now what? well, you go look at what is covered by the test
11:48
it is useful as a way to point attention at problems
Avatar
demetrispanos May 27, 2023 11:49 AM
the actual acted-out behavior is that tests are ways to point programmers at problems
handmadeThumbsUp 2
11:49
and they are useful to the extent they point programmers reliably at small loci of problems
Avatar
Avatar
demetrispanos
so now you want your "covered things" to be useful attentional units
I tend to find that many of the bugs that come up are either caught by asserts (which direct attention directly), or present themselves as user-visible behaviour that I probably wouldn't have foreseen for a module-level test (but I may just be trying to justify not doing the work to add these 😄 ) (edited)
Avatar
this is a much more pragmatic view imo than any philosophy of "correctness"
Avatar
Avatar
Andrew (azmr)
it seems like a more apt (but harder to pin-down) metric is "coverage of potential state-space", which can be approximated as "coverage of potential inputs". This, I think, captures why I lean on root-level end-to-end integration tests: "are we looking right for the common and known-awkward user inputs?". These will occasionally miss some important edge-cases (e.g. ring structures wrapping with a particular offset) that unit tests can then precisely target. (edited)
One big value I get from testing is to flip my mindset from code-centered to input-centered. It's a little like translating from time domain to frequency domain in Fourier analysis.
👍 2
Avatar
What would be an example of testing with an "input-centered" mindset vs. a "code-centered" mindset?
Avatar
Avatar
Phil H
Yeah we had weekly review of the health of the services. The focus would be around post-mortems for any severe incidents that impacted users. We would also look at peak server load to see if we needed to expand and other "key performance metrics" such as latency to see if there were any regressions. For any post-mortems there was usually a chain of things that went wrong for it to get to that point and we assign urgent tasks to cover the holes in our process and testing. It could be anywhere from writing tests, adding live monitoring for something that went un-noticed, adding a new manual verification step to our build verifications etc
I think it's worth quickly mentioning that there's a difference between building out and testing a metric as a programmer trying to engineer well-designed software, and as an employee at the company using the metric as a Key Performance Indicator with a particular threshold to be met. The latter case won't always line up with the former!
Avatar
Well the KPIs were more about, "is the user's experience degrading" not "is this up to the company's standard". Having systems in place to measure that is invaluable.
👍 1
🔥 2
Avatar
right, you have to be careful about how indirect your proxies are from the things you actually care about!
Avatar
Avatar
bvisness
What would be an example of testing with an "input-centered" mindset vs. a "code-centered" mindset?
In a code-centered mindset your answer to "when am I done writing tests" might be "when all lines are covered by tests". And if someone points out that there can still be bugs in different paths through the code, you tend to throw up your hands and say "larger space than atoms in the universe." But with an input mindset you might notice that these few functions taken together are operating over a state space which has these 3 regimes separated by two borders. And you'll tend to naturally focus on testing points near the borders.
handmadeThumbsUp 5
💯 4
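The "regimes separated by borders" view can be made concrete with a toy example (the function and its range are invented): an integer clamp has three regimes — below, inside, and above the range — separated by two borders, so the natural tests cluster at those borders rather than tracking lines of code.

```python
def clamp(x, lo, hi):
    # three regimes of the input space: x < lo, lo <= x <= hi, x > hi
    return max(lo, min(x, hi))

# test points chosen around the two borders, not around code paths
assert clamp(-1, 0, 10) == 0    # just below the lower border
assert clamp(0, 0, 10) == 0     # on the lower border
assert clamp(10, 0, 10) == 10   # on the upper border
assert clamp(11, 0, 10) == 10   # just above the upper border
```

Every line of `clamp` is covered by a single call; the border-focused tests are motivated by the input space, not the coverage report.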
Avatar
That's an interesting contrast. I think gamedev demands a pretty input-centered mindset. The whole point of a game is to generate this incredibly huge multidimensional state space of interesting things, and the player is continually moving around the space and prodding at the edges. You really can't often take a view of the software as lego blocks of code - you have to be almost hand-in-hand with the player in exploring every nook and facet of the state space. (Well, you're actually one step ahead, exploring it ahead of time). You're rarely concerned with whether or not the software will crash! You're concerned with whether or not there's a compelling experience to be had in that corner of the fractal. If there isn't one right now, you make a choice as a designer to either gate off that corner of it, or to freshen it up with some potted plants. (edited)
🗚ïļ 1
Avatar
@strager https://discord.com/channels/239737791225790464/708458209131757598/1112091206646759526 This is why I like TDD. It makes me think about the feature..., not the production code. I naturally want to think about the code ('cus it's fun), so I need help thinking about the feature ('cus it's boring).
There is something to this as a part of the creation process. I have definitely been guilty of having a vague idea of both the problem/input and solution/output and just immediately writing some code to try and bridge the gap without thinking through specifics. Taking the time to simply specify what you actually want to happen - in prose, or diagrams, or tests (when they lend themselves to the problem domain) - often makes the implementation process much more straightforward. (But the tests here are just a formalised way of thinking through what you want to happen; running them & confirming is mostly a secondary bonus)
(edited)
Avatar
Avatar
Phillip Trudeau
That's an interesting contrast. I think gamedev demands a pretty input-centered mindset. The whole point of a game is to generate this incredibly huge multidimensional state space of interesting things, and the player is continually moving around the space and prodding at the edges. You really can't often take a view of the software as lego blocks of code - you have to be almost hand-in-hand with the player in exploring every nook and facet of the state space. (Well, you're actually one step ahead, exploring it ahead of time). You're rarely concerned with whether or not the software will crash! You're concerned with whether or not there's a compelling experience to be had in that corner of the fractal. If there isn't one right now, you make a choice as a designer to either gate off that corner of it, or to freshen it up with some potted plants. (edited)
Certainly a goal to aim for. I wonder how a long-lived game platform like say Eve Online or Minecraft deals with maintaining global coherence over a humongous state space over feature development spanning years.
Avatar
Minecraft
slowly and painfully and often badly
😂 1
12:01
as far as I can tell
Avatar
My hazy impression as well, but I don't play games much.
Avatar
Avatar
Kartik Agaram
In a code-centered mindset your answer to "when am I done writing tests" might be "when all lines are covered by tests". And if someone points out that there can still be bugs in different paths through the code, you tend to throw up your hands and say "larger space than atoms in the universe." But with an input mindset you might notice that these few functions taken together are operating over a state space which has these 3 regimes separated by two borders. And you'll tend to naturally focus on testing points near the borders.
I agree with this sentiment. I'd be curious to hear how you think about or handle testing for a wide array of inputs. It has more often been the unexpected inputs that I don't account for in testing that come back to bite me. Sounds like @Andrew (azmr) has started to think about this with WhiteBox fuzzing and generating complex inputs too?
Avatar
I like the example above (I think it was @Phillip Trudeau's) of just testing a function on all 32-bit ints. That sort of thing is possible if you can find an encoding that can traverse a space. It's not always possible, but it's good to be thinking in this direction in case a solution occurs to you. That wouldn't happen if you always think in code-space. @demetrispanos also mentioned statistical tests above.
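The exhaustive-enumeration idea, shrunk to a 16-bit space so it runs in moments (popcount is chosen purely as an illustrative function; the same shape works for any small, encodable input space). It's also an instance of the "two sources cross-checking each other" idea from earlier: a clever implementation against a naive reference.

```python
def popcount_clever(x):
    # Kernighan's trick: clear the lowest set bit each iteration
    n = 0
    while x:
        x &= x - 1
        n += 1
    return n

def popcount_reference(x):
    # the obvious, slow second source of the same answer
    return bin(x).count("1")

# cross-check the two implementations over the entire 16-bit space
for x in range(1 << 16):
    assert popcount_clever(x) == popcount_reference(x)
```

For a true 32-bit sweep you'd want a compiled language or a vectorized formulation, but the structure is identical.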
Avatar
I'm not sure I have any solutions for unexpected complex inputs before seeing them in the wild, beyond regular fuzzing. It's much easier to capture an existing rats'-nest of pointers than to generate one, particularly one with semantics that the programmer knows but hasn't encoded in e.g. the type system... Property testing probably has a place here, both in its normal form as a way of asserting outputs of fuzzed inputs, and as a non-location-specific generalization of asserts that accounts for time, which we'll be looking into with WhiteBox https://www.tedinski.com/2018/12/11/fuzzing-and-property-testing.html (edited)
🔥 3
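A hand-rolled property test in that spirit, using only the standard library rather than a framework like Hypothesis. The function under test and its properties are invented for illustration:

```python
import random

def dedup_keep_order(xs):
    # function under test: remove duplicates, keep first occurrences
    seen = set()
    out = []
    for x in xs:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out

def check_properties(xs):
    ys = dedup_keep_order(xs)
    assert len(set(ys)) == len(ys)      # output has no duplicates
    assert set(ys) == set(xs)           # no elements lost or invented
    assert dedup_keep_order(ys) == ys   # idempotent

rng = random.Random(0)  # seeded, so any failure reproduces exactly
for _ in range(1000):
    xs = [rng.randrange(10) for _ in range(rng.randrange(20))]
    check_properties(xs)
```

Instead of enumerating expected outputs, you state what must hold for *any* input and let randomized generation explore the space.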
Avatar
I can certainly say that at Mozilla we lean very heavily on fuzzing because of the enormous size of the space
12:03
and we do multiple kinds of fuzzing, both generative and mutation-based
12:03
there is very, very little "unit testing" in the firefox codebase
12:04
but lots and lots and lots of integration tests, which are continuously mutated by fuzzers to try and explore more of the space from those starting points
âĪïļ 2
Avatar
Fuzzing! During development of the netcode for Sir Happenlance I came across a wonderful utility called clumsy, which can intercept any localhost loopback packets and lets you delay them, jitter them, reorder them, drop 10% of them, etc. Using this while two instances of the game were running and connected to each other immediately exposed behaviour in the netcode that I wouldn't otherwise have been easily able to set up (I'd need two people around the world to run the dev build and connect). (edited)
💯 2
Avatar
demetrispanos May 27, 2023 12:06 PM
yes re: statistical tests there is an entire separate problem of testing things that don't have deterministic answers (for example, ML classifications of inputs) (edited)
12:07
and also yes, fuzzing (broadly interpreted) is great
12:07
generally randomized input testing
Avatar
The sense I'm getting is that we're all in broad agreement about practices. We just have different boundaries for the word "testing".
12:08
I certainly wish I knew more about fuzzing.
Avatar
demetrispanos May 27, 2023 12:08 PM
I often combine fuzzing and soaking: subjecting some system to 24 hours' worth of randomized input to see if it breaks
Avatar
where there's agreement on practices, though, I hope to distill some of that for readers
12:09
especially as we are now past the two-hour mark already (?!)
Avatar
@Vegard has done a lot of great work on fuzzing, AFAIK - e.g. http://www.vegardno.net/2018/06/compiler-fuzzing.html
âĪïļ 1
Avatar
So to try to wrap this up, what takeaways do you all have from this discussion?
12:13
The things that have really stuck out to me are Demetri's point about "coverage" and the ultimate purpose of tests being to point programmers to problems, and Kartik's framing of "code-centered testing" and the problems it causes
Avatar
demetrispanos May 27, 2023 12:14 PM
I'd add that there is likely a judgement-free personality difference in that some people value designing around testing in its own right whereas others (like myself) see it purely transactionally
👍 2
âĪïļ 1
12:14
and this is fine, people should do what fits their minds
Avatar
One of my takeaways is that I've got to figure out a mode of engaging with testing where I can be more systematic about it without reducing my agility
12:15
My brain is input-centered and it could do to add a sprinkling of code-centered
🤯 1
Avatar
I have some further ideas inspired by this conversation about test granularity too, although we didn't really discuss it much
🤔 1
Avatar
I think I'll continue to be just a casual partaker in testing. When friends come over, etc. I'm wary of becoming a high functioning day-tester
😆 5
kekw 1
Avatar
One thing I'd like to add as we wrap up is that bugs are inevitable. Having a good strategy and system in place to minimize the effect of them when shipping code is just as important as testing, whether that's being able to rollback or hotfix it.
👍 1
📠 1
Avatar
demetrispanos May 27, 2023 12:19 PM
one thing we didn't discuss is that I also think about testing in terms of consequences/responsibility in the real world, so for example totally separate from software concerns I want testing for things that (for example) push data to customers
☝ïļ 4
Avatar
That's a good point. All software has bugs; it's just a question of which stage of the pipeline you've decided is best for catching them - and that might be after they come out of the pipe!
Avatar
The ideals to strive for are:
* small codebase
* lots of different ways to slice and dice parts of it to execute
* some amount of cross-checking somehow: execute something 2 ways, make sure they agree. This is a good lens for debates about types vs tests, in my experience. That's a rabbithole for another day.
Tests nudge me toward all 3 tendencies. (edited)
Avatar
Oh, and bump fuzzing in my todo list.
💯 2
Avatar
Avatar
Phillip Trudeau
That's a good point. All software has bugs; it's just a question of which stage of the pipeline you've decided is best for catching them - and that might be after they come out of the pipe!
That's related to the granularity stuff I was pondering. Might throw some ideas out there in #fishbowl-audience afterward.
🔥 1
Avatar
Perhaps obvious, but it seems like you need to make sure that you're not adding tests mindlessly. The purpose(s) you have for them may influence the type of tests you (don't) write:
- helping you think through the problem spec
- checking correspondence to a pre-existing spec
- acting as example usage code
- preventing regressions for previously seen issues
- recreating a particular scenario to inspect repeatedly
- adding friction intentionally(!) to ensure focus from other devs/future you
- making sure that potential inputs don't put your program in an invalid state (i.e. sending example runs through a minefield of assertions you've laid for them)
- ensuring that the thing you're sending to customers does what you say it will
☝ïļ 1
ðŸ’Ŋ 1
👍 4
ðŸĪ” 1
12:28
This then has to be considered in the context of the type of code you're writing https://discord.com/channels/239737791225790464/1112062893517717514/1112071588343459900
smart 1
Avatar
Kartik Agaram May 27, 2023 12:48 PM
I didn't really address asserts at all. They feel complementary to tests:
* Tests: run only in dev, and let you assess portions of a program over time
* Assertions: can run in production (nobody ever enables NDEBUG, right?), but can only check what's in scope at a specific point in time
Before I understood tests I often tried to periodically go into a mode where the program would perform sanity checks on some internal data structure.
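That "sanity checks on some internal data structure" mode can be sketched as an invariant checker — here over a doubly linked list, purely as an example. It's the kind of check that could run periodically, or in production behind a flag.

```python
class Node:
    def __init__(self, value):
        self.value = value
        self.prev = None
        self.next = None

def check_list_invariants(head):
    # walk the list and assert that every forward link has a
    # matching back-link; a corrupted structure trips the assert
    node = head
    while node is not None:
        if node.next is not None:
            assert node.next.prev is node, "broken back-link"
        node = node.next

# build a small well-formed list and verify it
a, b, c = Node(1), Node(2), Node(3)
a.next, b.prev = b, a
b.next, c.prev = c, b
check_list_invariants(a)  # passes silently when the structure is sound
```

Unlike a point assert, this checks a whole structure at once, wherever in the program you choose to call it.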
Avatar
I wish we could keep going but 2.5 hours is a long time to keep everyone! I feel like we've barely scratched the surface. I think we'll definitely have to have another fishbowl that digs deeper into some of the topics you all brought up today.
👍 1
12:22
Thanks all for being here! If anyone would like to keep going, you can continue in #fishbowl-audience - otherwise, enjoy the rest of your Saturday (and your Memorial Day weekend for those in the US!)
âĪïļ 2
Avatar
Thanks so much, everyone! ❤ïļ
âĪïļ 1
Avatar
I said in my last fishbowl that I no longer understand why people attend conferences. This time I am glad to report I have attended 0 conferences in the interim 😛 (edited)
😆 3
Avatar
we'll drag you to Handmade Seattle one of these years...
âĪïļ 1
Avatar
Realtime fishbowling sounds like it'd be super productive!
Avatar
ok now we're done for real!
🔥 4