January 19, 2021

Database-Linked Dataclasses

The most common use of a class in Python is acting as a data container. Throughout the development of our backend libraries at Narrative Science, we’ve seen several generations of code that attempts to make modeling this data easy.

Finding a simple to use abstraction for these type of classes allows us to focus less on writing mundane classes and more on our business logic.

In this post we will outline the evolution of this code and the pros and drawbacks of each approach.

V1: Basic classes

We have all been here; you write a class, and implement all the dunder methods needed by that class. Simple and straight forward.

Example:

class User():
    def __init__(self, fname, lname, groups):
        if not isinstance(groups, list):
            raise ValueError("groups is not a list!")
        if not isinstance(fname, str):
            raise ValueError("fname is not a str!")
        if not isinstance(lname, str):
            raise ValueError("lname is not a str!")

        self.fname = fname
        self.lname = lname
        self.groups = groups
    
    def __repr__(self):
        return f"User({self.fname},{self.lname})"

Pros:

This is very explicit, you can see everything this class does
Basic, most likely taught in beginner Python courses

Cons:

Type Checking
Notice all the defensive code needed to handle the cases when we receive arguments with incorrect types. Now imagine this example accepted 20 properties. More than likely the author of this code just wouldn’t even bother to write this kind of type checking since it’s tedious and time consuming. However not doing so can create complex error conditions in downstream code.
Database
Very commonly, these type of objects directly represent data from the database.
As it is currently written, you would need another translation layer outside the object to do your queries and create the objects as needed.
Testing
You will need to write tests to make sure this code correctly rejects invalid inputs.
Instantiating
In order to create an instance of this object, you need to pass every value into the constructor:
```
new_user = User("frank", "smith", ["group1", "group2"])
```
It’s worth nothing that it is possible to do some fancy Python magic to create this object out of a dictionary:
```
user_data = get_user_data()
new_user = User(**user_data)
```

But what if I want to dump the properties of this object back to JSON so we could send it over HTTP? You would have to write a method that takes every property and populates a dictionary.

V2: Database linked classes

The idea here is: what if we created objects that were aware of the database?

Example:

from sqlalchemy import Table, Column, VARCHAR, UUID, JSONB
from ns_database import DatabaseObject

article_table = Table(
    "article",
    Column("id", UUID, primary_key=True),
    Column("name", VARCHAR(45), nullable=False),
    Column("slug", VARCHAR(45), nullable=False),
)

class Article(DatabaseObject):
    table = article_table
    primary_key = app_table.id

new_article = Article.create(id=UUID4(), name="Foo Bar", slug="foo-bar")
new_article.save()

This essentially creates a class that is now able to push and pull its data from a database and functions like a normal object inside of Python.

The way this DatabaseObject class works is that the properties for the class are determined by the SQL Alchemy table.

DatabaseObject also implements a handful of convenient methods to help getting data in and out of the database such as: create, get, save, update, and delete. Each will perform the appropriate SQL query.

Pros:

We have reduced the amount of tedium from boilerplate code.
Users have a database aware object without needing to write any SQL of their own.

Cons:

We still have the issue of having to make sure our types are correct when accepting them as parameters.
We are unable to set default values for properties, so we must always provide all properties, which is tedious. This is because we’re using SQL alchemy as the source of the property names and don’t have a mechanism for passing other values in.
What if someone inputs a string longer than 45 characters? We need to write code to check for that.
There is an assumption that we must know the type in Python that corresponds to the SQLAlchemy type.
Alongside the tests needed for the DatabaseObject class to ensure the database accessor methods work correctly, we will also need to still have tests that ensure that these classes reject invalid inputs and don’t just throw an exception from SQL alchemy.

V3: Along came dataclasses

Basically, a dataclass is a simple abstraction for classes that store data. It performs automatic type checking, serialization and deserialization, and implements things like __eq__, and gives a standard __repr__. This is basically what we wanted from Generation 1.

There have been many libraries before this to implement the data-class such as attrs, marshmallow, and schematics. It is worth noting that Python 3.7 introduced an official the Dataclass to the core library as well.

The library we chose to use was Pydantic because it worked best with our implementation of FastAPI.

Example:

from enum import Enum
from typing import Optional
from pydantic import BaseModel, Field

class User(BaseModel):
    name: str = Field(max_length=45)
    department: Department = Field(default=Department.engineering)
    nickname: Optional[str]

class Department(str, Enum):
    engineering = "engineering"
    finance = "finanace"

Pros:

Automatic type checking
Allow us to use other data structures like Enums or other models as types
Ability to set contraints such as string length
Ability to set default values
Greatly reduced boilerplate vs Generation 1
Easily readable
Integration with code editors intellisense
No need to write additional tests

You can leverage other Python built-ins to do even more powerful things:

class User(BaseModel):
    first_name: str
    last_name: str

    @property
    def email(self):
        return f"{self.firstname[:1]}{self.last_name}@example.com"

    def send_message(self, msg):
        some_email_function(self.email, msg)

...

user = User("John", "Smith")
user.send_message("hey dude!")

Cons:

These models are not database aware like Generation 2.

V4: Adding a database to Pydantic

Building on our idea from before about turning an ordinary class into a database aware object; we combined the benefits of Pydantic's input validation and serialization with the SQL Alchemy based database class from Generation 2.

The result is that now we have functional ORM that works with FastAPI!

Example:

from enum import Enum
from typing import Optional
from pydantic import BaseModel, Field

from ns_database import database_model, Databases

class UserIn(BaseModel):
    name: str = Field(max_length=45)
    department: Department = Field(default=Department.engineering)
    nickname: Optional[str]

class Department(str, Enum):
    engineering = "engineering"
    finance = "finanace"

@database_model(table_name="users", database_name=Databases.main)
class User(UserIn):
    id: Optional[UUID4] = Field(primary_key=True, fetch_on_create=True)
    created_at: Optional[datetime] = Field(fetch_on_create=True)
    updated_at: Optional[datetime] = Field(fetch_on_create=True

We add a class decorator that will use the properties of the model as fields in the database. This means that database models will be 1:1 representations of the database itself.

Using Pydantic, we are able to add additional keyword arguments to the Field function.

Fetch on create / update

We want our database to be the ultimate source of authority, so we rely on it to do the computation for certain fields.

This is done through setting fetch_on_create and fetch_on_update which will defer to the database’s generated value when a record is created or updated respectively.

All that needs to happen is your database is setup to have column defaults such as static values, subroutines, or sequences.

Primary key fields

Since all tables will have a primary key, each database_model object must have a primary key.

Currently we are not able to have compound keys meaning your key must be a single type. This may be a situation where we can create a special type for it to allow it to be more semantic.

This primary key field is also used by our data accessor method to know how to lookup a record.

Pros:

These database models get all the benefits from Generation 2 and 3!

Cons:

Lack of foreign key relationships may be misleading to developers. To many developers, this code is their interface to the table schema. Since we use Alembic to manage our database schema and updates; there is actually no need to declare fields inside our database_model to be foreign keys. These are only used by SQL Alchemy in order to create your table schema. This does create a situation where our code and schema may not be in-sync. It is also worth noting that those foreign key constraints are still enforced by the database.
Database managed fields have to be “optional”. One limitation of Pydantic is that we have to make any field that the database will manage (such as the primary_key) as Optional type. The Pydantic validation code runs before we ever make our fetch from the database. This is syntactically a bit confusing.

Closing thoughts

These new database model classes have greatly improved the readability of our code and reduced the amount of boilerplate we need to write for new objects which ultimately means that we spend more time worrying about our own business logic vs writing all the low level stuff classes usually need.

As with most projects in software engineering, we kept making marginal improvements to the ideas until we got something awesome! I’m excited to see how this idea continues to grow to further help us achieve these same goals.

You can find the full code for this here: https://github.com/NarrativeScience/pynocular