Here at Pythian, we love our data. Our code is no exception (pun sort of intended), so I’ll be covering dataclasses in Python today.
As a Python developer, you’ve almost certainly run into code that looks like the following:
def add_user(user: dict):
    name = user['name']
    birthday = user['birthday']
    gender = "Undisclosed"
    try:
        gender = user['gender']
    except KeyError:
        pass
If you’re really lucky, there will be a docstring for this function that outlines the structure of the user parameter, saving you from digging through the function body to work out which keys it might contain.
The problem here is twofold:
Now that we’re up to speed on how relying on dictionaries to represent our data causes a problem, let’s look at a less ambiguous solution:
from dataclasses import dataclass


@dataclass
class User:
    name: str
    birthday: str
    gender: str = 'Undisclosed'


def add_user(user: User):
    name = user.name
    birthday = user.birthday
    gender = user.gender
The first piece is defining the user class: We’ve created our properties, assigned a default value to one of them, and slapped a @dataclass decorator up top. By using this decorator, we:
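1. Get a constructor without writing one ourselves. The generated __init__ behaves roughly like this hand-written version:

def __init__(self, name, birthday, gender='Undisclosed'):
    self.name = name
    self.birthday = birthday
    self.gender = gender
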
2. Give our user class a __repr__ method, which automatically makes our object’s properties discoverable when printed/cast to a string.
3. Document the structure of our user using Python, rather than just docstrings.
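To make points 1 and 2 concrete, here is a small sketch of the generated constructor and __repr__ at work. The sample name and birthday are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class User:
    name: str
    birthday: str
    gender: str = 'Undisclosed'


# Hypothetical sample values; gender falls back to its default.
ada = User(name='Ada', birthday='1815-12-10')

print(ada)
# User(name='Ada', birthday='1815-12-10', gender='Undisclosed')
```

Compare that with the default repr of a plain class, which only shows the class name and a memory address.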
That last point is the biggest advantage, and it covers the primary problem we’re trying to solve. By making a class/type for our user object, we’ve unambiguously defined all of the keys (now properties) of a user and the type of each one. From a readability standpoint, type hinting is great to have. Type hinting has the side effect of letting your IDE know what to expect, too.
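Because the structure now lives in code, it can also be inspected at runtime. As a rough sketch, the fields() helper from the dataclasses module lists each property and its declared type (the schema variable is just a name chosen for this example):

```python
from dataclasses import dataclass, fields


@dataclass
class User:
    name: str
    birthday: str
    gender: str = 'Undisclosed'


# Build a name -> declared type mapping from the dataclass definition.
schema = {field.name: field.type for field in fields(User)}

for field_name, field_type in schema.items():
    print(field_name, field_type)
```

With a dictionary, the only way to get this information was to read the function that consumed it.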
Note: While dataclasses are great at making data more discoverable and consistent, they are not a substitute for documentation.
I mentioned in point 1 that the constructor the dataclass decorator gives us isn’t perfect: the type hints we wrote are not enforced at runtime, so a value of the wrong type is accepted without complaint.
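A quick way to see the gap, using deliberately wrong values (odd_user is just an illustrative name):

```python
from dataclasses import dataclass


@dataclass
class User:
    name: str
    birthday: str
    gender: str = 'Undisclosed'


# Neither argument is a str, yet no error is raised.
odd_user = User(name=42, birthday=None)

print(odd_user)
# User(name=42, birthday=None, gender='Undisclosed')
```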
Thankfully, dataclasses give us a hook system that lets us validate after the object has been initialized by defining a __post_init__ method:
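def __post_init__(self):
    # Compare each field's value against its annotated type.
    # This only works when the annotations are plain classes such as str.
    for (name, field_type) in self.__annotations__.items():
        if not isinstance(self.__dict__[name], field_type):
            given_type = type(self.__dict__[name])
            raise TypeError(f"The field `{name}` must be `{field_type}` (found `{given_type}`).")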
The constructor that @dataclass generates checks whether __post_init__ has been defined and, if it has, calls it automatically once the fields have been assigned.
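Put together, a validating version of the class might look like the sketch below. It is a variation on the snippet above that reads the declared types through dataclasses.fields() instead of __annotations__, and it assumes every annotation is a plain class; typing constructs such as Optional[str] would need extra handling.

```python
from dataclasses import dataclass, fields


@dataclass
class User:
    name: str
    birthday: str
    gender: str = 'Undisclosed'

    def __post_init__(self):
        # Assumes each annotation is a plain class usable with isinstance().
        for field in fields(self):
            value = getattr(self, field.name)
            if not isinstance(value, field.type):
                raise TypeError(
                    f"The field `{field.name}` must be `{field.type}` "
                    f"(found `{type(value)}`)."
                )


# Valid input passes straight through.
valid_user = User(name='Ada', birthday='1815-12-10')

# Invalid input is rejected as soon as the object is created.
try:
    User(name=42, birthday='1815-12-10')
except TypeError as error:
    message = str(error)
    print(message)
```

The check runs on every instantiation, so bad data fails at the point where the object is built rather than somewhere downstream.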
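You can also use @dataclass(init=False) if you want to define your own, more strongly typed constructor. Keep in mind that __post_init__ is only called by the generated constructor, so a hand-written __init__ has to do its own validation.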
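One possible shape for such a constructor is sketched here. The validation loop is an illustration of the idea, not the only way to write it.

```python
from dataclasses import dataclass


@dataclass(init=False)
class User:
    name: str
    birthday: str
    gender: str = 'Undisclosed'

    def __init__(self, name: str, birthday: str, gender: str = 'Undisclosed'):
        # Hand-written constructor: check each argument before storing it.
        for label, value in (('name', name), ('birthday', birthday), ('gender', gender)):
            if not isinstance(value, str):
                raise TypeError(f"The field `{label}` must be a str (found `{type(value)}`).")
        self.name = name
        self.birthday = birthday
        self.gender = gender


user = User('Ada', '1815-12-10')
print(user)  # __repr__ is still generated by the decorator
```

Only __init__ is switched off; the decorator still generates the other methods, such as __repr__.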
Dataclasses also have a few other niceties, such as a generated __eq__ method, so you can compare two models field by field with ==. For comprehensive documentation on dataclasses, check out Python’s official documentation.
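As a small illustration of the generated equality check (the sample users are made up):

```python
from dataclasses import dataclass


@dataclass
class User:
    name: str
    birthday: str
    gender: str = 'Undisclosed'


first = User(name='Ada', birthday='1815-12-10')
second = User(name='Ada', birthday='1815-12-10')
third = User(name='Grace', birthday='1906-12-09')

print(first == second)  # True: same class, same field values
print(first == third)   # False: field values differ
print(first is second)  # False: they are still two separate objects
```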
Hopefully you found this helpful!