Python: Using Dataclasses to Model Your Data | Official Pythian® Blog

Written by Evan Seabrook | Oct 7, 2021 4:00:00 AM

Here at Pythian, we love our data. Our code is no exception (pun sort of intended), so I’ll be covering dataclasses in Python today.

The problem

As a Python developer, you’ve almost certainly run into code that looks like the following:

def add_user(user: dict):
  name = user['name']
  birthday = user['birthday']
  gender = "Undisclosed"

  try:
    gender = user['gender']
  except KeyError:
    pass

If you’re really lucky, there will be a docstring for this function that outlines the structure of the parameter user, saving you from having to dig through the function and identify the possible keys that exist in parameter user.

The problem here is twofold:

1. Dictionaries in Python are mutable and can have arbitrary schemas. This in itself isn’t a problem and can be a good thing, depending on your needs. How usable such a dictionary is, however, depends almost entirely on the second point.
2. You must rely on the documentation to know the structure, and the documentation must stay updated as the structure evolves.
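
To make the second point concrete, here is a small sketch of how a misspelled key only surfaces when the code runs. The helper `describe_user` and the sample values are hypothetical, made up for illustration.

```python
def describe_user(user: dict) -> str:
    # Nothing in the signature says which keys are required.
    return f"{user['name']} (born {user['birthday']})"


# "birthdate" instead of "birthday" is still a perfectly valid dictionary.
user = {"name": "Ada", "birthdate": "1815-12-10"}

try:
    describe_user(user)
    error = None
except KeyError as exc:
    # The mistake is only discovered once the function actually runs.
    error = exc

print(repr(error))  # KeyError('birthday')
```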

The solution – using dataclasses

Now that we’re up to speed on how relying on dictionaries to represent our data causes a problem, let’s look at a less ambiguous solution:

from dataclasses import dataclass

@dataclass
class User:
  name: str
  birthday: str
  gender: str = 'Undisclosed'

def add_user(user: User):
  name = user.name
  birthday = user.birthday
  gender = user.gender

The first piece is defining the user class: We’ve created our properties, assigned a default value to one of them, and slapped a @dataclass decorator up top. By using this decorator, we:

1. Give our user class the following constructor (this isn’t perfect; more on this later):

def __init__(self, name, birthday, gender='Undisclosed'):
  self.name = name
  self.birthday = birthday
  self.gender = gender

2. Give our user class a __repr__ method, which automatically makes our object’s properties discoverable when printed/cast to a string.

3. Document the structure of our user using Python, rather than just docstrings.

That last point is the biggest advantage, and it covers the primary problem we’re trying to solve. By making a class/type for our user object, we’ve unambiguously defined all of the keys (now properties) of a user and what type that property is. From a readability standpoint, type hinting is great to have. Type hinting has the side effect of letting your IDE know what to expect, too.
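
As a quick illustration of the points above, the sketch below builds a `User`, prints it, and shows what happens when a required field is left out. The sample values are made up.

```python
from dataclasses import dataclass


@dataclass
class User:
    name: str
    birthday: str
    gender: str = "Undisclosed"


# gender falls back to its default when omitted.
user = User(name="Ada", birthday="1815-12-10")

# The generated __repr__ lists every field.
print(user)  # User(name='Ada', birthday='1815-12-10', gender='Undisclosed')

# A missing required field fails at construction time, not deep inside a function.
try:
    User(name="Ada")
    missing = None
except TypeError as exc:
    missing = exc
```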

Note: While dataclasses are great at making data more discoverable and consistent, they are not a substitute for documentation.

I mentioned in point 1 that the constructor the @dataclass decorator gives us isn’t perfect: the types are not enforced. Annotations are hints only, so nothing checks them when the object is created.
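
In practice, this means the constructor accepts any value for any field. A minimal sketch, with deliberately wrong sample values:

```python
from dataclasses import dataclass


@dataclass
class User:
    name: str
    birthday: str
    gender: str = "Undisclosed"


# Neither value is a str, yet no error is raised.
user = User(name=123, birthday=None)

print(type(user.name).__name__)  # int
```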

Thankfully, dataclasses give us a hook system that lets us validate after the object has been initialized by defining a __post_init__ method:

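def __post_init__(self):
  # Read the annotations from the class and check each field against its declared type.
  # isinstance only works here when every annotation is a plain class such as str or int.
  for name, field_type in type(self).__annotations__.items():
    value = getattr(self, name)
    if not isinstance(value, field_type):
      raise TypeError(f"The field `{name}` must be `{field_type}` (found `{type(value)}`).")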

The constructor that @dataclass generates checks whether __post_init__ has been defined and, if it has, calls it as its final step, after all fields have been assigned.
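
Putting the pieces together, a sketch of the validated class might look like the following. It uses `dataclasses.fields` to walk the fields, which is one possible variation on the hook described above, and it assumes every annotation is a plain class.

```python
from dataclasses import dataclass, fields


@dataclass
class User:
    name: str
    birthday: str
    gender: str = "Undisclosed"

    def __post_init__(self):
        # fields() lists the dataclass fields; field.type is the annotation as written.
        # Assumption: every annotation is a plain class (str, int, ...), as isinstance requires.
        for field in fields(self):
            value = getattr(self, field.name)
            if not isinstance(value, field.type):
                raise TypeError(
                    f"The field `{field.name}` must be `{field.type}` "
                    f"(found `{type(value)}`)."
                )


# Valid input passes through unchanged.
valid = User(name="Ada", birthday="1815-12-10")

# An int for name is now rejected as soon as the object is built.
try:
    User(name=123, birthday="1815-12-10")
    message = None
except TypeError as exc:
    message = str(exc)

print(message)
```

Annotations such as `list[str]` or `Optional[str]` cannot be passed to `isinstance` this way, so a check like this would need extending before it could handle them.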

You can also use @dataclass(init=False) if you want to define your own, more strongly typed constructor.
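
A rough sketch of that approach is below. The hand-written checks are only one way to do it, and the error message wording is made up for illustration.

```python
from dataclasses import dataclass


@dataclass(init=False)
class User:
    name: str
    birthday: str
    gender: str = "Undisclosed"

    def __init__(self, name: str, birthday: str, gender: str = "Undisclosed"):
        # Hand-written checks take the place of the generated constructor.
        for label, value in (("name", name), ("birthday", birthday), ("gender", gender)):
            if not isinstance(value, str):
                raise TypeError(f"`{label}` must be a str (found `{type(value).__name__}`).")
        self.name = name
        self.birthday = birthday
        self.gender = gender


user = User("Ada", "1815-12-10")

# __repr__ and the other generated methods are still provided.
print(user)  # User(name='Ada', birthday='1815-12-10', gender='Undisclosed')

try:
    User(1, "1815-12-10")
    rejected = False
except TypeError:
    rejected = True
```

Keep in mind that __post_init__ is only called by the generated constructor, so with init=False you would have to call it yourself if you still want it to run.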

Dataclasses also have a few other niceties, such as a generated __eq__ method, so two instances of the same class compare equal when all of their fields are equal. For comprehensive documentation on dataclasses, check out Python’s official documentation.
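
For example, a short sketch of the generated comparison, again with made-up values:

```python
from dataclasses import dataclass


@dataclass
class User:
    name: str
    birthday: str
    gender: str = "Undisclosed"


a = User("Ada", "1815-12-10")
b = User("Ada", "1815-12-10")
c = User("Grace", "1906-12-09")

print(a == b)  # True: same field values
print(a is b)  # False: still two separate objects
print(a == c)  # False: the fields differ
```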

Hopefully you found this helpful! Please leave any thoughts or questions in the comments below.