Data coherence at large
A week ago, while coming back from Scalar, I was thinking about coherent data. In particular, I was wondering if it's possible to perform certain simple validations and encode their results in types. Here's what I found.
What is coherent data?
The concept of coherent data was introduced to me when I watched Daniel Spiewak's talk about coherence. Data coherence is achieved when we have a single source of truth about our data. Let's look at an example:
val name: Option[String] = Some("Rachel")
val result: String =
  if (name.isEmpty) "default"
  else name.get
In this code we don't have any sort of coherence in the data. One condition that we check - the emptiness of name - gives us information that is immediately lost in the rest of the code. Even if we've checked for the emptiness and are sure that the Option isn't empty, there's nothing in the type system or any other feature of the language that would tell us whether we can call Option#get on it safely. We only know that because we keep in mind that we've already checked for the emptiness ourselves.
Another example involves lists:
val names: List[String] = List("Phoebe", "Joey", "Ross")
val firstNameLength: Option[Int] =
  if (names.isEmpty) None
  else Some(names.head.length)
Even though we've checked for the list's emptiness, we still have no guarantee that head, which is in general not a safe method to call on a list, won't throw an exception. There's no connection between isEmpty and head/get enforced by the compiler. It's just incapable of helping us avoid mistakes like this:
if (elems.isEmpty) println(elems.head) //boom!
Is this the way it's meant to be? Is it possible to make the compiler work with us to ensure some guarantees about our data?
In Kotlin, another language that works on the JVM (mostly), there is a feature that solves this particular problem we had with Option: smart casts. But the feature is limited to checking types or nullity, while we're looking for something that'll work in the general case.
Thankfully, there are features in Scala that allow us to reason about our data as coherent: pattern matching and higher-order functions.
Data coherence with pattern matching
Let's rewrite the examples from the previous section using pattern matching:
val name: Option[String] = Some("Rachel")
val result = name match {
  case Some(value) => value
  case None        => "default"
}
Much better - now we managed to get both the emptiness check and the extraction in one go (in the pattern match). We're not calling any unsafe methods, and we get additional help from the compiler in the form of exhaustivity checking. What about lists?
val names: List[String] = List("Phoebe", "Joey", "Ross")
val firstNameLength: Option[Int] = names match {
  case firstName :: _ => Some(firstName.length)
  case Nil            => None
}
Again, we're not calling head
or any other unsafe method. And the check is again combined with the extraction in a single pattern match.
I mentioned higher-order functions, so what about them? Turns out that pattern matches (and functions implemented using them) can often be rewritten using a single call to fold
for the given data type. It's more obvious in the case of Option:
val name: Option[String] = Some("Rachel")
val result = name.fold("default")(identity)
Or Either:
val name: Either[NameNotFound, String] = Right("Rachel")
val result = name.fold(extractError, identity)
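For the snippet above to stand alone, NameNotFound and extractError need definitions - here's a purely hypothetical sketch:

// hypothetical definitions, just enough for the snippet to compile
case class NameNotFound(userId: String)

def extractError(error: NameNotFound): String =
  s"no name found for user ${error.userId}"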
However, fold
doesn't appear to be the right choice if we only care about part of the data (like in the list example, where we only needed the head of the list). In that particular case, a good old headOption
would work just fine.
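For completeness, a hedged one-liner along those lines, using the same names list as before:

// headOption does the emptiness check and the extraction for us - no unsafe calls
val firstNameLength: Option[Int] = names.headOption.map(_.length)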
Data coherence at scale
This is all nice and pretty - the promise of having data that doesn't require us to watch our backs at every step sounds encouraging. But when the data is part of other data, things start to break down very quickly.
Suppose we're working with a User
class:
case class User(name: String, lastName: Option[String])
Now imagine we want to run a code path only if the user has a lastName
set. The caveat: we still want to pass the User
instance:
def withValidatedUser(lastName: String, user: User): IO[Int] = ...
val user = User("Katie", Some("Bouman"))
user.lastName match {
  case Some(lastName) => withValidatedUser(lastName, user)
  case None           => IO.pure(0) // hypothetical fallback for users without a last name
}
What's the problem? Well, even though we did the validation in a coherent way using a pattern match, we lose the coherence inside withValidatedUser: lastName is now completely separated from the User object it came from. And now we have two lastNames: one optional, one required.
def withValidatedUser(lastName: String, user: User): IO[Int] = IO {
  // which last name is the source of truth here: lastName, or user.lastName (still an Option)?
  ???
}
This is terrible news. It looks like we can't maintain data coherence when the data is part of something else. Or can we?
Surely there are ways to get what we want - one of them is adding a new variant of the User class, but with a required lastName:
// hypothetical name for the variant
case class UserWithLastName(name: String, lastName: String)
...but you can probably already imagine how much boilerplate it'd bring to your codebase if you needed a new class for every combination of optional fields - say, if the User type had more than one:
// hypothetical names for the variants
case class User(
  name: String,
  lastName: Option[String],
  email: Option[String])

case class UserWithLastNameAndEmail(
  name: String,
  lastName: String,
  email: String)

case class UserWithLastName(name: String, lastName: String)
case class UserWithEmail(name: String, email: String)
...
It doesn't seem like a viable solution to the problem. In fact, it'd create more problems than it'd solve.
I entertained the idea that we could parameterize our original User
with type parameters a bit:
case class User[LastName](name: String, lastName: LastName)
//just so that we have some distinction below
type LastName = String
def withValidatedUser(user: User[LastName]): IO[Int] = ...
val user: User[Option[LastName]] = User("Katie", Some("Bouman"))
// try it at home: this could be a fold!
user.lastName match {
  case Some(lastName) => withValidatedUser(user.copy(lastName = lastName))
  case None           => IO.pure(0) // hypothetical fallback, as before
}
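As the comment suggests, the same thing can be written as a single fold on the Option (a sketch, with the same hypothetical fallback):

user.lastName.fold(IO.pure(0))(lastName => withValidatedUser(user.copy(lastName = lastName)))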
Now a few things happened:
- We're not passing the lastName value separately now; lastName being required is now a type-level prerequisite in withValidatedUser.
- We're copying the user value with lastName substituted with the value extracted from the Option using a pattern match.
- We only have one data type that supports all combinations of emptiness/non-emptiness using type parameters.
What does this give us?
We gained type safety in withValidatedUser - the function now can't be called with a User whose lastName hasn't been checked for non-emptiness. It just won't compile if we pass an Option in that field. One less test case to worry about.
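To make that concrete, this is roughly how it plays out (the commented-out call is rejected at compile time; the exact error wording will vary):

withValidatedUser(User("Katie", "Bouman"))          // compiles: lastName is a plain String here
// withValidatedUser(User("Katie", Some("Bouman"))) // doesn't compile: User[Some[String]] isn't a User[LastName]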
It's also pretty interesting that we can now write functions that require the user to not have a second name:
def withInvalidUser(user: User[Unit]): String = user.name
For me, the most surprising part here was that I couldn't use Nothing as the type of lastName - which I wanted to do to guarantee that lastName just isn't there. However, we can't create values of type Nothing, and we can't pass them as constructor parameters of a class. I used Unit instead, which is a type with only one value - a value that is obviously not the user's last name. Creating a user with LastName = Unit is also very easy: User("Joe", ()).
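Putting those two together (a hypothetical REPL session):

> withInvalidUser(User("Joe", ()))
res1: String = "Joe"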
What's the problem with the latest solution?
- We made it more difficult to work with the User type - now everyone who uses that type needs to be aware that the lastName field is parameterized. And it's viral - pretty soon the whole codebase will be littered with type parameters irrelevant in those regions of code.
- We can insert any type we want as LastName. It could even be IO[Unit]. And it's very easy to do so.
Looks like we aren't quite there yet. What can we do to make our type easier to work with?
A different kind of coherence
Our original goal in the exercise was to encode validations and invariants of our data in the data's type. Let's get back to our User
example. This time we'll encode it using higher-kinded types (but with two "variable-effect" fields):
(If you're not familiar with higher-kinded types, I suggest you check out some blog posts on the subject. For now, it should be enough to know that a higher-kinded type is a type-level function - a type that needs to be applied to another type to construct a fully concrete type that can be assigned to a value. For example, Option needs an A to become Option[A], a type that has values.)
If we parameterize User
with higher-kinded types, we get this:
case class User[LastName[_], Email[_]](
  name: String,
  lastName: LastName[String],
  email: Email[String]
)
Cool. How do we create a User
with all fields optional now?
val jon: User[Option, Option] =
User("Jon", Some("Snow"), None)
How do we create one with some fields required?
// old trick from scalaz/cats/shapeless
type Id[A] = A
val jon: User[Id, Option] = User("Jon", "Snow", None)
Also cool. We can't assign None as the value of lastName if LastName is Id. How would we encode the requirement that there's no email now?
type Void[A] = Unit
val donald: User[Id, Void] = User("Donald", "Duck", ())
As you can see, we cheated a bit - the goal of parameterizing our type with higher-kinded types was to ensure that we always have a String (or whatever used to be in an Option) inside an effect (like Option or Id), but if the effect is the type-level function A => Unit, we have no String in the result whatsoever. Is it bad?
Maybe, maybe not.
If we have type parameters for our data, we can't make it obligatory to have String anywhere in the type of Email (well, maybe we could, using some more type-level machinery and implicits, but I don't want to go that deep into it).

However, we're making it harder to use a type that doesn't have the String in it (Option is a more obvious choice for a type parameter than, say, λ[A => Int], which would mean that lastName is of type Int). And we still get an escape hatch that allows us to omit some fields in a User value (by saying that lastName is of type Unit).
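For illustration, this is the kind of (unlikely) abuse that remains possible - a sketch using the kind-projector plugin's λ syntax:

// requires the kind-projector compiler plugin for the λ[...] syntax
val weird: User[λ[A => Int], Option] = User("Mike", 42, None)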
Instead of a custom Void type, we could've used Const:
import cats.data.Const // a wrapper from cats that ignores its second type parameter

val john = User("John", "De Goes", Const(()))
john.email.getConst // equals ()
We had at least two problems with the previous solution:
- the type parameters would spread out to every place in the codebase where User appears.
- we didn't have type-level hints as to what type we should use.
I believe the second problem is not an issue anymore (see the above argument about Unit), but the first one remains: everywhere we had def foo: User => A, we'll now have def foo[F[_], G[_]]: User[F, G] => A. What can we do to make this a little more pleasant, and avoid spreading every single type parameter into pieces of code that don't care about the contents of our parameterized fields?
Variance and higher-kinded types
Thankfully, Scala has quite powerful support for variance annotations. We can use them to our advantage to make our type easier to work with.
Suppose we are writing a function that takes a user but only uses their first name - which isn't type-parameterized. Let's recall the definition of User:
case class User[LastName[_], Email[_]](
  name: String,
  lastName: LastName[String],
  email: Email[String]
)
Let's add some variance annotations...
case class User[+LastName[_], +Email[_]](
  name: String,
  lastName: LastName[String],
  email: Email[String]
)

object User {
  // an alias for "a user with arbitrary field wrappers"
  type Arb = User[Any, Any]
}
Making our type covariant in both type parameters should allow us to pass a User[F, G] for any F and G where a User[Any, Any] (User.Arb in the companion above) is required. Let's see if it works:
def nameTwice(user: User.Arb): String = user.name ++ user.name
> nameTwice(User("Mike", None, "mike@evilmail.com"))
res0: String = "MikeMike"
Looks like it does. However, maybe hardcoding Any, Any
isn't the best solution there - what if at some point we actually want to perform some validation and use the parameterized fields? Maybe a better encoding would be this:
//class User defined as above

object User {
  type Arb       = User[Any, Any]
  // an assumption here: the "canonical" user is the one with all optional fields
  type Canonical = User[Option, Option]
}
Now we could write functions taking User.Canonical
as the base case, and only customize the functions that we want to work with certain types of User
specifically.
def worksWithAllUsers(user: User.Canonical): (String, String) =
  user.lastName match {
    case Some(lastName) => (user.name, lastName)
    case None           => (user.name, "")
  }

def withFullUser(user: User[Id, Id]): (String, String) =
  (user.name, user.lastName) // lastName is guaranteed to be a String here

def withPartialUser(user: User[Id, Any]): (String, String) =
  (user.name, user.lastName) // lastName is there; we don't care what the email is
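And a hedged usage sketch, assuming the definitions above:

val sansa: User.Canonical = User("Sansa", Some("Stark"), None)

worksWithAllUsers(sansa)                             // handles both present and missing last names
withFullUser(User("Jon", "Snow", "jon@example.com")) // both fields are known to be present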
Other possible use cases
What other invariants can we encode?
- Option[A] can become A or Unit.
- Either[A, B] can become B or A.
- List[A] can become Unit (empty list) or (A, List[A]) (NonEmptyList[A]) - sketched after the note below.
(note how we're deconstructing the data types, which happen to be ADTs, into coproducts)
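As a small illustration of the last bullet above, here's a hedged sketch using cats' NonEmptyList and a hypothetical Team type (not part of the original example):

import cats.data.NonEmptyList

// hypothetical Team type: the members field's type records that the emptiness check already happened
case class Team[Members](name: String, members: Members)

def captain(team: Team[NonEmptyList[String]]): String =
  team.members.head // safe: a NonEmptyList always has a head

List("Phoebe", "Joey", "Ross") match {
  case head :: tail => captain(Team("Friends", NonEmptyList(head, tail)))
  case Nil          => "no members, no captain"
}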
If we're really trying to experiment, we can go the extra mile:
- List[A] can be checked for length and encoded as Sized[List[A], 5].
- User[IO[A]] can become IO[User[A]] (sounds like Traverse, doesn't it?) - we can run the IO outside and keep the result to avoid unnecessary recalculation.
- User[Stream[IO, A]] can become IO[User[List[A]]] - we can run the stream and work with it as a list, once it's all consumed (that is, if it fits into memory).
There's a lot we can do, really:
- String -> NonEmptyString, regex"(a-Z)+", IPv4, INetAddress refined types (a small refined sketch follows below)
- Int -> PosInt/EvenInt refined types, smaller primitives (Byte, etc.)
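A hedged sketch of the refined flavour (assuming the refined library is on the classpath; the literals are checked at compile time):

import eu.timepit.refined.api.Refined
import eu.timepit.refined.auto._
import eu.timepit.refined.collection.NonEmpty
import eu.timepit.refined.numeric.Positive

// an empty string or a non-positive number here would be a compile-time error
val name: String Refined NonEmpty = "Rachel"
val age: Int Refined Positive     = 27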
Summary
There are a few good reasons for trying to make our data coherent, including but not limited to using the techniques mentioned in this post:
- additional type safety by enforcing constraints locally on the type level
- increased ease of putting code under test - if your function doesn't depend on lastName, you can pass () in that field safely (see the sketch after this list)
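A tiny sketch of that second point, reusing the Void alias from earlier (the testUser value is hypothetical):

// the code under test doesn't care about lastName, so () is a perfectly fine stand-in
val testUser: User[Void, Option] = User("Test", (), None)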
However, there are difficulties associated with all that:
- type parameters creeping into all your functions (possibly can be avoided by introducing a canonical type and only being specific when it's needed)
- it's questionable whether parameterizing data with types scales (e.g. if we have case class Users[Collection[_], LastName[_]](users: Collection[User[LastName]]), is it going to be easy to change?)
- the benefit might not be that significant after all
- it was mentioned in the replies that type-parameterizing e.g. JSON models can break tooling like circe, magnolia, etc. - most likely because they don't support GADTs
- possibly more problems that I haven't figured out yet.
I haven't seen this approach to abstracting over data used anywhere, so I don't know what the best practice is, or whether the idea is feasible for use in real projects, but it certainly seems worth investigating.
Maybe type-parameterized data would work better with lenses (e.g. from Monocle) to make working with copies more pleasant and less boilerplatey?
As some of our validations involved e.g. breaking up Option
into a tagged coproduct of Unit
and A
(which it is), maybe the techniques mentioned in this post could be used together with optics like Prisms (which are meant for working with coproducts) to form more powerful abstractions?
Maybe we could have a Monad/Traverse/whatever instance for a type like case class User[LastName[_], Email[_]](...) for each of its fields, to get additional functionality for free using the functions defined on the appropriate typeclass?
As you can see, more insight into the possibilities is needed to determine if the concept can be used more widely in our code.
Parting words
Thank you for reading.
These ideas are very fresh for me, and I haven't spent a lot of time researching them yet. In the future, I hope to spend more time in this area and to develop a more formal or constrained description of the ideas mentioned here.
Most importantly, I hope to find out whether these ideas actually help achieve more type safety without sacrificing maintainability in real world programming.
Let me know what you think about the ideas presented in this post, and whether you enjoyed reading it!
Thanks to Igal Tabachnik (@hmemcpy) and Valérian (@etaty) who found and reported a few bugs in this post. I should get around to using mdoc already...