How do you think about types?

Maybe I'm overthinking this, but I often feel quite conflicted about the way I approach types. Specifically, I often feel that type systems are trying to do several things at once. They variously try to communicate something about the underlying representation of the data, the operations that can be performed on that data, and the invariants/meaning of that data. I don't think these always sit comfortably side by side.

What I was wondering was how other people like to think about types when creating their own user-defined types, and also how they like the programming languages they use to present types.

I ask because I find that questions about types crop up a lot in API design, but I've found far fewer resources about designing good types than about designing good functions. To give some examples: how do people feel about using the type system to enforce things like units, guarding against mixing incompatible ones? It's quite common in date/time libraries to have lots of types that are essentially integers but represent days, months, years and so on. Another common example I often wrestle with is containers. Do you prefer container types to be a description of how you can interact with them, or of how they are implemented? I'm often frustrated by container libraries that couple semantics and implementation (for example a set implemented using hashes when, for my use case, a dynamic array might be faster).

I'd be particularly interested in hearing how people approach these kinds of things with regard to API design. For example, when you want to expose multiple interfaces (e.g. a map, a set, a stack) with multiple implementations optimised for different use cases, how do you like to present that?
Part of what makes type systems hard to design is drawing the line between convenience and safety.

Implicit conversion is very convenient, but it leads to surprising results like "float a = 1/2;" setting a to 0 without warning, instead of the 0.5 that everyone starting programming expects.
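
To spell out the trap in compilable form (a minimal C++ sketch):

int main()
{
    float a = 1 / 2;    // integer division happens first: 1/2 == 0, then 0 converts to 0.0f
    float b = 1.0f / 2; // making one operand a float forces float division: b == 0.5f
}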

In Java the suggestion is to program to the interface, like java.util.List, and only worry about the implementation at initialization time.

That said, it is impossible for any compiler to divine what the optimal implementation of a data structure would be for your use case. It depends on the types of lookups, the kinds of modifications, and how well the data can be expected to stay in cache between operations. People make their living doing that for databases. Unifying the interface does make it easier to switch between implementations.
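
A rough C++ analogue of the same idea - code against a single alias and make the implementation decision in one place (the names Dict and lookupOrZero are just illustrative):

#include <map>
#include <string>
#include <unordered_map>

// Choose the implementation once; callers only ever see Dict.
// Swapping to std::map (or a flat map) is a one-line change here.
using Dict = std::unordered_map<std::string, int>;

int lookupOrZero(const Dict &d, const std::string &key)
{
    auto it = d.find(key);
    return it == d.end() ? 0 : it->second;
}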

This is a tricky question. In my personal opinion, C does this wrong, and other languages even more wrong.
Personally, I love low-level programming, so I just want a language that treats all variables like arrays of bits.
Like v for variable, then a number for the number of bits it contains. So v32 gives me 32 bits, v7 gives me at least 7 bits, v87 at least 87, and so on. Then -1 is just the same as 511 in a v9.
I never use floats though, but I guess there could be a sign for whether a variable should be treated as a float, for those that need them, like f128 instead of v128.
If you work from the bottom up, probably the first obvious use of a type system you will encounter is for calling conventions (whether the argument should end up in a register or on the stack, and if a register, whether it is an integer type, a floating-point type, or a vector type). For a lot of things, sign and even width do not matter when you are operating on registers, but width becomes something you start to care about when you are storing to and loading from memory, and especially when you decide you want structs in your language. Sign is something you could maybe do without, but there are operations where it matters, and if you went without signed types you would need to distinguish that in the syntax of the operations that have signed/unsigned variants.

IMHO, for a systems language the correct approach to designing the type system is to start at the bottom like this and build your way up to something that comfortably automates everything you want automated. Beyond this, a type system can be used in a more mathematical sense to provide safety guarantees and fancy features, but I consider that high-level territory.
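
As a small illustration of where sign leaks into the operations themselves, here is a C++ sketch (the function names are made up) of two pairs of operations that, on typical compilers, lower to different instructions purely because of signedness:

#include <cstdint>

int32_t  sdiv2(int32_t a)  { return a / 2; }  // signed divide: shift plus a fixup for negatives
uint32_t udiv2(uint32_t a) { return a / 2; }  // unsigned divide: a plain logical shift
int32_t  sshr1(int32_t a)  { return a >> 1; } // arithmetic shift, preserves the sign bit
uint32_t ushr1(uint32_t a) { return a >> 1; } // logical shift, fills with zeros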

Edit: I forgot to mention that you would also likely want pointers in the type system, because pointer types allow for pointer arithmetic.

Variables are basically a convenient abstraction over registers and the stack, which is absolutely fundamental to programs being cross-platform - types provide automation on top of that, and potentially safety guarantees beyond that.

Edit2: With regard to API design, in my nwr_mem.h library I chose to stick with bare-bones C99 types. A lot of the region allocator functions operate on a ptr+len pair, but I do not provide any kind of array abstraction - you just pass a pointer and a length directly. The user is likely to have their own preferred way of representing something like this - they don't want the library author to make a type for it. It is simple to wrap a library with friendlier signatures. If you provide the lowest common denominator, you appeal to the largest target audience.

Edited by Neo Ar on
Thanks for all the interesting thoughts on this.

ratchetfreak
Part of what make type systems hard to design is putting the line between convenience and safety.

Yes, this is definitely an interesting problem. To go further on this, one thing I often find odd is the way type systems are sometimes used to express properties of the values they contain, but in ways that feel kind of coincidental. For example, imagine you want to store a percentage as an integer from 0-100. Would you use an unsigned type? You could, and by doing so indicate through the type system the lower bound of the valid values of that variable, but the type system doesn't give you a way of specifying the upper bound in the same fashion.

ratchetfreak
That said it is impossible for any compiler to divine what the optimal implementation of a data structure would be for your use case.

Exactly. So my follow-up question would be: given that the user of the data structure has to make the decision themselves, how is that best presented? To take the world of "dictionary"-like data structures as an example, in C++ you have to pick differently named containers for different implementations. If you want a red-black tree, you choose std::map; if you want a hash table, you choose std::unordered_map; if you want a sorted array, maybe you choose boost::flat_map, and so on. This feels like a slightly uncomfortable middle ground. The difference between these types is in the implementation, but the names don't exactly make this explicit, and instead try to emphasise differences in their semantics/behaviour.
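
For reference, the call sites really are interchangeable; only the type name hints at what's underneath (boost::flat_map is left out of this sketch to keep it self-contained):

#include <map>
#include <unordered_map>

int main()
{
    std::map<int, int> tree;           // red-black tree: ordered iteration, O(log n) lookup
    std::unordered_map<int, int> hash; // hash table: unordered, amortised O(1) lookup

    tree[42] = 1; // identical usage...
    hash[42] = 1; // ...different performance profile
}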

Telash
Personally, I love low level programming, so I just want a language that treats all variables like arrays of bits.
Like v for variable, and then number for the number of bits it contains. So v32 gives me 32 bits. v7 gives me atleast 7 bits. v87 atleast 87 and so on. Then -1 just is the same as 511 in a v9.

I'm really interested in this perspective, and I'd like to ask some follow-ups. First, how would you see this interacting with the other aspects of types, like their data format and the operations that can be performed on them? If all the compiler knows is the size of the data, then presumably something like
v32 = -1024
wouldn't work, because the compiler wouldn't even know what integer encoding you were using, or what endianness you wanted, right? So how would you communicate things like data format and operations to the compiler? Also how would you expect variables of different sizes to interact, say for example adding a v32 to a v9?

I'm also interested that you say "at least 7 bits", presumably suggesting you'd be happy for the compiler to pad that as necessary for more optimal alignment? Given that level of slop in the actual sizes of the variables, what is the advantage of such fine-grained control? For example, what would be the advantage of being able to ask for a v7, which on most modern computers would presumably be padded to a byte, over just specifying a Rust-style i8/u8, which you would still know was big enough?

miotatsu
Variables are basically a convenient abstraction over registers and the stack which is absolutely fundamental to programs being able to be cross-platform - types provide automation on top of that, and potentially safety guarantees beyond that

I think this is an excellent point, and it also raises a really interesting question. One of the funny things I find about variables is that they don't actually exist, right? In some ways, variables are like a way of pretending you have infinite registers, but in fact a variable may be many things, and may change during the lifetime of the program. It may be some space reserved on the heap, or on the stack, or a register, or, if it can be deduced at compile time, it may even disappear entirely into an instruction. While I think this abstraction works well for many types, like integers and floats, I think vector types have highlighted a rather interesting problem with it, in that the registers you want to use may no longer correspond to the way you want to think about your data in memory. For example, if I want to multiply each float in one array with the corresponding float in another, I probably want to load them up four at a time into xmm0/1 and multiply them as a vector operation, but equally I probably don't want to think about those arrays of floats as arrays of vec4s if each float is a separate thing in the array.
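
For illustration, a minimal sketch of that pattern with SSE intrinsics (mul4 is a made-up name; the loop only handles multiples of four for brevity):

#include <immintrin.h>

// Multiply a[] by b[] elementwise, four floats at a time.
// The data stays a plain array of floats; the vec4 grouping
// exists only in the registers, never in the arrays' type.
void mul4(float *a, const float *b, int n)
{
    for (int i = 0; i + 4 <= n; i += 4)
    {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(a + i, _mm_mul_ps(va, vb));
    }
}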

miotatsu
If you provide the lowest common denominator you appeal to the largest target audience.

This is definitely a good philosophy to have, but depending on what your library is doing, there is only so far you can take this, right? As in, if what you're writing is a container library, you kind of have to make some of these decisions. I definitely agree with the idea of making sure your library is as flexible as possible with regards to the things it is not implementing though (e.g. memory allocation).
About v32 = -1024

-1024 is the same as 4294966272 in a 32-bit int, so it just stores it as 4294966272. Only when/if you print it out for a human do you need to care about whether it's signed or not, so it makes more sense to tell the printf function how to print it.
For endianness I need at least two variables, so that is at a higher level.
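
A quick C++ illustration of that first point - the stored bits are identical, and only the interpretation at print time differs:

#include <cstdint>
#include <cstdio>

int main()
{
    uint32_t v = (uint32_t)-1024;    // bits: 0xFFFFFC00
    std::printf("%u\n", v);          // read as unsigned: 4294966272
    std::printf("%d\n", (int32_t)v); // read as signed:   -1024
}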

About the precision: a variable is a bit array no matter how you slice it, so an array is therefore a 2D array of bits. v3:2000000 would need 6 million bits; v8:2000000 would need 16 million bits. And maybe I just need to store 3 flags for 2 million elements? That may be a rare case, but I know I would use it. So why shouldn't it be possible?
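
As a hypothetical sketch of what such a v3 array could compile down to (all names made up; this packs 21 three-bit values per 64-bit word, so 2 million elements take roughly the 6 million bits described above):

#include <cstddef>
#include <cstdint>
#include <vector>

struct V3Array
{
    std::vector<uint64_t> words;
    explicit V3Array(std::size_t n) : words((n + 20) / 21) {}

    // Each 64-bit word holds 21 elements of 3 bits (one bit per word unused,
    // which conveniently keeps elements from straddling word boundaries).
    uint8_t get(std::size_t i) const
    {
        return (words[i / 21] >> (3 * (i % 21))) & 0x7;
    }
    void set(std::size_t i, uint8_t v)
    {
        int shift = 3 * int(i % 21);
        uint64_t &w = words[i / 21];
        w = (w & ~(uint64_t(0x7) << shift)) | (uint64_t(v & 0x7) << shift);
    }
};
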
This is an interesting and tricky question, and I felt like dropping my 2c because I quite like types.
Types for me are the fundamental way of talking with the compiler. Obviously there are other things to take into account, from convenience to performance, but in general I try to put as much as I can in the type system and leverage the compiler as much as I can. When you have a language with rich tooling around types (Swift in my case) it is really nice and helpful to code this way. Any refactoring changes or mistakes are most likely caught at compile time, which removes a lot of the cognitive overhead of understanding what is happening, and a lot of bugs that would otherwise need debugging at runtime simply can't happen if the types are done well. (Things like Rust take this the extra mile.)
In terms of API design I also find types really useful: if something can be expressed with a well-understood type, the code is much easier to read and work with than if the interface is some meaningless catch-all (integers without meaning, or typeless Object kind of stuff a la Java).
When you introduce generics (parametrised types) into the mix it gets even more powerful, allowing things like phantom types and enums that define exactly what is possible in the API. All of this can be taken too far with things like "Domain-Driven Design", but that's usual with any programming topic :P
These kinds of generic APIs are also interesting performance-wise: you can present rich APIs that help you understand what is going on and are safer, while the compiler can still optimise them by specialising them for each specific usage.
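
A minimal phantom-type sketch, written here in C++ for concreteness (the Swift version is analogous; all names are illustrative). The Tag parameter never appears in the stored data - it exists purely so the compiler keeps incompatible units apart:

template <typename Tag>
struct Quantity
{
    double value;
};

template <typename Tag>
Quantity<Tag> operator+(Quantity<Tag> a, Quantity<Tag> b)
{
    return {a.value + b.value};
}

struct Metres {};
struct Seconds {};

int main()
{
    Quantity<Metres>  d{10.0};
    Quantity<Seconds> t{2.0};
    auto d2 = d + d;     // fine: both operands are metres
    // auto bad = d + t; // compile error: mixed units never deduce a common Tag
    (void)d2; (void)t;
}
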
I think what Telash is proposing essentially does specify the upper bound. A "v7" on modern platforms isn't just padded for optimal alignment; it literally has to be a byte under the hood, because a byte is the smallest addressable unit. The advantage of being able to say i7 over i8 is that something like this:
(i7)0x80 == 0

would lower to something like this
(0x80 & 0x7F) == 0


Personally, I never find myself wishing I could think about types as arrays of bits, especially because the machine itself works in terms of bytes, and there are other approaches to the problem of specifying valid bounds within a high-level type. In D we have the invariant keyword, which in combination with templates makes it easy to enforce bounds on a type:

struct BoundedInt(int min, int max)
{
    // Default-initialise to 0 when 0 lies in [min, max),
    // otherwise start the payload at the lower bound.
    static if (0 >= min && 0 < max)
        private int payload;
    else
        private int payload = min;
    this(int n) { payload = n; }
    int get() { return payload; }
    alias get this; // allows implicit conversion to int
    void opAssign(int rhs) { payload = rhs; }
    void opOpAssign(string op)(int rhs) { mixin("payload "~op~"= rhs;"); }
    invariant // checked in debug builds, see below
    {
        assert(payload >= min);
        assert(payload < max);
    }
}


Invariants are used to specify characteristics of a class or struct that always must be true (except while executing a member function).

The invariant is a contract saying that the asserts must hold true. The invariant is checked when a class or struct constructor completes, and at the start of the class or struct destructor. For public or exported functions, the order of execution is:

preconditions
invariant
body
invariant
postconditions

The invariant is not checked if the class or struct is implicitly constructed using the default .init value.

The code in the invariant may not call any public non-static members of the class or struct, either directly or indirectly. Doing so will result in a stack overflow, as the invariant will wind up being called in an infinitely recursive manner.

Invariants contain assert expressions, and so when they fail, they throw AssertErrors. Class invariants are inherited, that is, any class invariant is implicitly anded with the invariants of its base classes.

There can be more than one invariant per class or struct.

When compiling for release, the invariant code is not generated, and the compiled program runs at maximum speed. The compiler is free to assume the invariant holds true, regardless of whether code is generated for it or not, and may optimize code accordingly.

Now you construct a bounded int by saying BoundedInt!(-42, 43) n; and use n just like an int, but if you try to assign a value outside of [-42, 43) the invariant will trip in a debug build.

John_Uskglass

miotatsu
If you provide the lowest common denominator you appeal to the largest target audience.

This is definitely a good philosophy to have, but depending on what your library is doing, there is only so far you can take this, right? As in, if what you're writing is a container library, you kind of have to make some of these decisions. I definitely agree with the idea of making sure your library is as flexible as possible with regards to the things it is not implementing though (e.g. memory allocation).

Yes, if you are providing abstractions you of course have to actually provide them. The higher level you are working at, the less relevant what I said becomes, but it is still important to be aware of, so that you don't bloat your library with unnecessary abstractions or limit things by implementing pieces that would be better provided by the user (such as the memory allocation strategy, as you said). My nwr_mem.h library provides an nwr_region struct which holds the bookkeeping necessary for the region allocator, because that is the abstraction the library is offering the user, but it doesn't provide the things it is not for (such as an array abstraction).

Edited by Neo Ar on