Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Safely saving pandas DataFrame to SQL table via Django #125

Open
pandichef opened this issue Aug 13, 2020 · 5 comments
Open

Safely saving pandas DataFrame to SQL table via Django #125

pandichef opened this issue Aug 13, 2020 · 5 comments

Comments

@pandichef
Copy link

pandichef commented Aug 13, 2020

I recently wrote the following for a project, but django-pandas actually seems like a better home for it. The idea is basically to implement to_sql via the Django ORM to get all the validation goodness of the ORM. (A related feature would be another set of tools to validate that a DataFrame is consistent with the Django model schema. That is the 2nd example below)

Is anyone here interested in a PR for this?

def to_django(df, DjangoModel, if_exists="fail"):
    """Uses bulk_create to insert data to Django table
    if_exists: see pd.DataFrame.to_sql API

    Ref: https://www.webforefront.com/django/multiplemodelrecords.html
    """
    import numpy as np

    if if_exists not in ["fail", "replace", "append"]:
        raise Exception("if_exists must be fail, replace or append")

    if if_exists == "replace":
        DjangoModel.objects.all().delete()
    elif if_exists == "fail":
        if DjangoModel.objects.all().count() > 0:
            raise ValueError("Data already exists in this table")
    else:
        pass

    dct = df.replace({np.nan: None}).to_dict(
        "records"
    )  # replace NaN with None since Django doesn't understand NaN

    bulk_list = []
    for x in dct:
        bulk_list.append(DjangoModel(**x))
    DjangoModel.objects.bulk_create(bulk_list)
    print("Successfully saved DataFrame to Django table.")
def check_dataframe_columns(
    df: pd.DataFrame, DjangoModel: Type[models.Model], strict: bool = False
):
    """Raises KeyError if DataFrame doesn't match a Django model

    Parameters
    ----------
    df: A pandas DataFrame object
    DjangoModel: A Django model
    strict: If strict is True, then the DataFrame must contain the exact set of columns
        as the Django mode
    """
    # todo: add column type checking
    error_list = []
    dataframe_columns = set(df.columns)
    django_columns = set(map(lambda x: x.name, DjangoModel._meta.fields))
    if not dataframe_columns.issubset(django_columns):
        unknown_columns = list(dataframe_columns - django_columns)
        unknown_columns.sort()  # to be repeatable
        error_list.append(
            f"DataFrame contains unknown column(s): {', '.join(unknown_columns)}"
        )
    if strict and not dataframe_columns == django_columns:
        missing_columns = list(django_columns - dataframe_columns)
        missing_columns.sort()  # to be repeatable
        error_list.append(
            f"With strict=True, DataFrame is missing column(s): {', '.join(missing_columns)}"
        )

    if error_list:
        raise KeyError(" | ".join(error_list))
@chrisdev
Copy link
Owner

@pandichef this is cool! But how do we cover the case when we create the the pandas dataframe based on a join (related field)

@pandichef
Copy link
Author

In my use case, I have a single foreign key called pool. So I set the value manually i.e.,

df['pool_id'] = 9

If you want to handle choices fields or foreign key fields more formally, I think it can be done. For model choices, you just need to check that the value is valid or throw an exception. For foreign keys, I'd add a boolean parameter to to_django named something like autogen_fk_object. If True, to_django will create a new foreign key object if the key doesn't already exist; if False, it throws an error. Something like Factory Boy's SubFactory class comes to mind.

@selfcontrol7
Copy link

This is an amazing code. Thanks a lot as it helped me with a project.
Great job mate!

@venqics
Copy link

venqics commented Jan 31, 2022

How can we add some logic to avoid duplicate entries for the incoming data. For example I am able to save the excel file in my database using to_sql. I have to validate that no two duplicate entries can be created for a column in this case I have title column and validate if two products have same title or name, the entry does not get added to the database.

@PJaramilloV
Copy link

This works wonders, thank you for posting!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants