Normalizing Data¶
Basic Use¶
Let’s say you have this very simple table:
Name | Birthday |
---|---|
John Doe | 1995 |
123 | 2003OCt |
Jane Doe | 1964 |
River Song | 2019 |
You want to ensure the first field (Name) returns only strings and the Birthday column returns integers.
from excelerator import TableReader
from excelerator import normalize as n
tr = TableReader(
path='path/to/excel.xlsx',
sheetname='names and birthdays',
fields='Name Birthday'.split(),
normalize=[n.STRING(), n.INTEGER()],
)
fields = tr.get_fields()
fields['Name']
returns ['John Doe', '123', 'Jane Doe', 'River Song']
fields['Birthday']
returns [1995, 2003, 1964, 2019]
Note
Here’s a common “gotcha”: Make sure to instantiate the normalization classes.
That is, normalize=[n.STRING(), n.INTEGER()]
instead of normalize=[n.STRING, n.INTEGER]
Create Custom Normalizing Classes¶
But let’s say we don’t want the full string from Names
, but just the first name.
We could subclass either NormalizeBase
or one of its subclasses. Let’s subclass STRING
.
# Continuing our code from above...
class FirstString(n.STRING)
norm_func = n.STRING().normalize
# Note the lack of parentheses after normalize
# We do this here instead of in the normalize method
# so that n.STRING gets instantiated only once.
def normalize(self, value):
# This is the function that gets called to norm your data.
strings = norm_func(value).split()
return strings[0]
tr.normalize = [n.FirstString(), n.INTEGER()]
fields = tr.get_fields()
fields['Name']
returns ['John', '123', 'Jane', 'River']
Pretty easy, right?