Normalizing Data

Basic Use

Let’s say you have this very simple table:

Name	Birthday
John Doe	1995
123	2003OCt
Jane Doe	1964
River Song	2019

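You want every value in the first field (Name) returned as a string and every value in the Birthday column returned as an integer. In the code below, the normalize list has one entry per field, in the same order as fields.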

from excelerator import TableReader
from excelerator import normalize as n

tr = TableReader(
    path='path/to/excel.xlsx',
    sheetname='names and birthdays',
    fields='Name Birthday'.split(),
    normalize=[n.STRING(), n.INTEGER()],
)
fields = tr.get_fields()

fields['Name'] returns ['John Doe', '123', 'Jane Doe', 'River Song']

fields['Birthday'] returns [1995, 2003, 1964, 2019]
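To make the idea concrete, here is a minimal sketch of one way an integer normalizer could turn a messy cell like 2003OCt into 2003. The to_integer helper is hypothetical and is not how excelerator's n.INTEGER is implemented; it only illustrates the kind of cleanup a normalizer performs.

```python
import re


def to_integer(value):
    """Hypothetical helper: pull the first run of digits out of a cell value."""
    match = re.search(r"-?\d+", str(value))
    if match is None:
        raise ValueError(f"no integer found in {value!r}")
    return int(match.group())


# The Birthday column from the table above, including the messy cell.
birthdays = [1995, "2003OCt", 1964, 2019]
normalized = [to_integer(value) for value in birthdays]
print(normalized)  # [1995, 2003, 1964, 2019]
```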

Note

Here’s a common “gotcha”: Make sure to instantiate the normalization classes. That is, write normalize=[n.STRING(), n.INTEGER()], not normalize=[n.STRING, n.INTEGER].
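Why does instantiating matter? The sketch below uses a hypothetical stand-in class (Integer) and a hypothetical apply_normalizer function, on the assumption that a reader calls .normalize(value) on whatever it was handed. How excelerator itself reacts to an uninstantiated class may differ; this only shows the plain Python behavior.

```python
class Integer:
    """Hypothetical stand-in for a normalizer class; not excelerator's code."""

    def normalize(self, value):
        return int(value)


def apply_normalizer(normalizer, value):
    # Assumption: the reader simply calls .normalize(value) on each normalizer.
    return normalizer.normalize(value)


# With an instance, self is filled in automatically and value gets the cell.
instance_result = apply_normalizer(Integer(), "1995")

# With the bare class, the cell is taken as self and value is missing.
try:
    apply_normalizer(Integer, "1995")
    class_error = None
except TypeError as exc:
    class_error = exc

print(instance_result)
print(type(class_error).__name__)
```

Passing the class instead of an instance fails in this sketch only once a value is normalized, which is why the mistake can be easy to miss.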

Create Custom Normalizing Classes

But let’s say we don’t want the full string from the Name column, just the first name.

We could subclass either NormalizeBase or one of its subclasses. Let’s subclass STRING.

# Continuing our code from above...

class FirstString(n.STRING):

    norm_func = n.STRING().normalize
    # Note the lack of parentheses after normalize
    # We do this here instead of in the normalize method
    # so that n.STRING gets instantiated only once.

    def normalize(self, value):
        # This is the method that gets called to normalize your data.
        # norm_func is a class attribute, so reach it through self.
        strings = self.norm_func(value).split()
        return strings[0]

tr.normalize = [FirstString(), n.INTEGER()]
fields = tr.get_fields()

fields['Name'] returns ['John', '123', 'Jane', 'River']
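If you want to try the subclassing pattern without an Excel file, here is a self-contained sketch. The String class is a hypothetical stand-in for n.STRING (its behavior is assumed, not taken from excelerator), and this variant calls super().normalize instead of storing norm_func. It also returns an empty string for blank cells, which is a design choice of this sketch rather than documented library behavior.

```python
class String:
    """Hypothetical stand-in for n.STRING; not excelerator's implementation."""

    def normalize(self, value):
        # Assumed behavior: coerce any cell value to stripped text.
        return str(value).strip()


class FirstString(String):
    def normalize(self, value):
        # Reuse the parent's normalization, then keep only the first word.
        words = super().normalize(value).split()
        # Blank cells yield an empty string instead of raising IndexError.
        return words[0] if words else ""


names = ["John Doe", 123, "Jane Doe", "River Song", "   "]
first_names = [FirstString().normalize(name) for name in names]
print(first_names)  # ['John', '123', 'Jane', 'River', '']
```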

Pretty easy, right?