Commit

Merge pull request #2 from RGOODSFR/FactoryBoy-optim
FactoryBoy optim
SebCorbin authored Nov 15, 2024
2 parents 5f8bc3c + fc39e1a commit 6624094
Showing 2 changed files with 154 additions and 0 deletions.
75 changes: 75 additions & 0 deletions content/0005_en_optimized_bulk_creation_factory_boy.md
@@ -0,0 +1,75 @@
Title: Factory-Boy : Optimize bulk creation
Date: 2024-09-12 16:42
Id: 0005
Slug: optimize-bulk-creation-factory_boy
Lang: en
Category: development
Tags: django, tests
Summary: Speed up large dataset creation in factory boy

# The problem

When creating a large dataset with [factory_boy](https://pypi.org/project/factory-boy/), you may find yourself using [`MyFactory.create_batch()`](https://factoryboy.readthedocs.io/en/stable/reference.html#factory.create_batch), which is convenient for specifying a size but performs poorly when the factories are based on Django models.

Indeed, here's the related source code:
```python
@classmethod
def create_batch(cls, size, **kwargs):
    """Create a batch of instances of the given class, with overridden attrs.

    Args:
        size (int): the number of instances to create

    Returns:
        object list: the created instances
    """
    return [cls.create(**kwargs) for _ in range(size)]
```

This means that an instance is generated and created for each iteration, resulting in numerous SQL queries, especially if your factory uses `SubFactory` (related model factories).
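
To make that cost concrete, here is a minimal sketch that uses only the standard-library `sqlite3` module instead of Django or factory_boy. The `contact` table, the helper names and the batch of 100 rows are illustrative assumptions. It counts the `INSERT` statements issued when rows are saved one at a time, as `create_batch()` does, and when they are sent as a single multi-row statement, which is the idea behind `bulk_create`.

```python
import sqlite3


def count_inserts(run):
    """Run `run(conn)` on a fresh in-memory database.

    Returns (number of INSERT statements executed, number of rows stored).
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE contact (id INTEGER PRIMARY KEY, name TEXT)")
    statements = []
    # The trace callback receives every SQL statement the connection executes.
    conn.set_trace_callback(statements.append)
    run(conn)
    conn.set_trace_callback(None)
    rows = conn.execute("SELECT COUNT(*) FROM contact").fetchone()[0]
    conn.close()
    # Ignore implicit transaction statements such as BEGIN.
    inserts = sum(1 for s in statements if s.lstrip().upper().startswith("INSERT"))
    return inserts, rows


names = [f"contact-{i}" for i in range(100)]


def one_by_one(conn):
    # Same shape as create_batch(): one save, hence one INSERT, per instance.
    for name in names:
        conn.execute("INSERT INTO contact (name) VALUES (?)", (name,))


def in_bulk(conn):
    # Same idea as bulk_create(): all rows travel in one multi-row INSERT.
    placeholders = ", ".join(["(?)"] * len(names))
    conn.execute(f"INSERT INTO contact (name) VALUES {placeholders}", names)


per_row = count_inserts(one_by_one)
bulk = count_inserts(in_bulk)
print("one by one:", per_row)
print("in bulk:", bulk)
```

Both runs store the same rows; only the number of round trips to the database differs. With a real Django model, each `SubFactory` adds its own inserts on top of the per-row ones.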

# The solution

To avoid issuing too many SQL queries, it is better to use `bulk_create` from the Django manager.

A simple solution is to build the instances without saving them, then save them all at once. You can also pass parameters (`notifications_enabled`, for example):
```python
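from factory.django import DjangoModelFactory

# Contact is assumed to be an existing Django model.
class ContactFactory(DjangoModelFactory):
    class Meta:
        model = Contact

# Build a thousand contacts in memory; create=False selects the build
# strategy, so nothing is saved yet.
contact_list = ContactFactory.simple_generate_batch(
    create=False, size=1000, notifications_enabled=True
)
# Save them all through the manager's bulk_create.
contact_list = Contact.objects.bulk_create(contact_list)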
```

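But what if our factory has a `SubFactory`? You would certainly hit an N+1 problem. To overcome it, you can bulk create the models sequentially, parents first, and reuse the parents' primary keys for the children. Note that this relies on `bulk_create` setting the primary key on the objects it returns, which depends on the database backend: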

```python
import factory
from factory.django import DjangoModelFactory

# Contact and Notification are assumed to be existing Django models.
class ContactFactory(DjangoModelFactory):
    class Meta:
        model = Contact

class NotificationFactory(DjangoModelFactory):
    contact = factory.SubFactory(ContactFactory)

    class Meta:
        model = Notification

size = 1000
# Create contacts
contact_list = ContactFactory.simple_generate_batch(
    create=False, size=size, notifications_enabled=True
)
contact_list = Contact.objects.bulk_create(contact_list)

# Build a notification for each contact; contact=None overrides the
# SubFactory so no extra Contact is built per notification.
obj_list = NotificationFactory.simple_generate_batch(
    create=False, size=size, contact=None
)
for obj, contact in zip(obj_list, contact_list):
    obj.contact_id = contact.pk
Notification.objects.bulk_create(obj_list)
```
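
The pairing step is where this approach can silently go wrong: if the two lists differ in length, or if the parents came back without primary keys, children end up pointing at the wrong row or at nothing. The following is a minimal sketch of a guard for that step. It uses plain dataclasses as hypothetical stand-ins for the Django models, and `fake_bulk_create` and `attach_parents` are illustrative names, not part of Django or factory_boy.

```python
from dataclasses import dataclass
from itertools import count
from typing import Optional


@dataclass
class Contact:  # hypothetical stand-in for the Django model
    name: str
    pk: Optional[int] = None


@dataclass
class Notification:  # hypothetical stand-in for the Django model
    message: str
    contact_id: Optional[int] = None


_pk_sequence = count(1)


def fake_bulk_create(objs):
    """Stand-in for a bulk insert: assigns primary keys in list order."""
    for obj in objs:
        obj.pk = next(_pk_sequence)
    return objs


def attach_parents(children, parents):
    """Point each child at the parent sitting at the same position."""
    if len(children) != len(parents):
        raise ValueError("children and parents must have the same length")
    for child, parent in zip(children, parents):
        if parent.pk is None:
            raise ValueError("parent has no primary key; was it saved?")
        child.contact_id = parent.pk
    return children


contacts = fake_bulk_create([Contact(f"contact-{i}") for i in range(3)])
notifications = attach_parents(
    [Notification(f"hello {i}") for i in range(3)], contacts
)
```

Failing loudly here is a design choice: a mismatch raised in the test setup is easier to diagnose than a foreign key error, or a wrong association, surfacing later in the test itself.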
79 changes: 79 additions & 0 deletions content/0005_fr_optimized_bulk_creation_factory_boy.md
@@ -0,0 +1,79 @@
Title: Factory-Boy : Optimize bulk creation
Date: 2024-09-12 16:42
Id: 0005
Slug: optimiser-creation-en-masse-factory_boy
Lang: fr
Category: development
Tags: django, tests
Summary: Speed up the creation of large datasets with factory_boy


# The problem

When you create large datasets with [factory_boy](https://pypi.org/project/factory-boy/), you may find yourself using [`MyFactory.create_batch()`](https://factoryboy.readthedocs.io/en/stable/reference.html#factory.create_batch), which is convenient for specifying the size of the list but is limited in terms of performance when the factories are based on Django models.

Indeed, here is the relevant source code:

```python
@classmethod
def create_batch(cls, size, **kwargs):
    """Create a batch of instances of the given class, with overridden attrs.

    Args:
        size (int): the number of instances to create

    Returns:
        object list: the created instances
    """
    return [cls.create(**kwargs) for _ in range(size)]
```

This means that an instance is generated and created on each iteration, resulting in numerous SQL queries, especially if your factory uses `SubFactory` (when a model is related through a `ForeignKey`).

# The solution

To avoid issuing too many SQL queries, it is better to use `bulk_create` from the Django manager.

A simple solution is to build the instances without saving them, then save them all at once. You can also pass parameters (`notifications_enabled`, for example):

```python
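from factory.django import DjangoModelFactory

# Contact is assumed to be an existing Django model.
class ContactFactory(DjangoModelFactory):
    class Meta:
        model = Contact

# Build a thousand contacts in memory; create=False selects the build
# strategy, so nothing is saved yet.
contact_list = ContactFactory.simple_generate_batch(
    create=False, size=1000, notifications_enabled=True
)
# Save them all through the manager's bulk_create.
contact_list = Contact.objects.bulk_create(contact_list)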
```

But what if our factory has a `SubFactory`? You will certainly run into an N+1 problem. To work around it, you can bulk create the models sequentially, parents first, and reuse the parents' primary keys for the children:


```python
import factory
from factory.django import DjangoModelFactory

# Contact and Notification are assumed to be existing Django models.
class ContactFactory(DjangoModelFactory):
    class Meta:
        model = Contact

class NotificationFactory(DjangoModelFactory):
    contact = factory.SubFactory(ContactFactory)

    class Meta:
        model = Notification

size = 1000
# Create contacts
contact_list = ContactFactory.simple_generate_batch(
    create=False, size=size, notifications_enabled=True
)
contact_list = Contact.objects.bulk_create(contact_list)

# Build a notification for each contact; contact=None overrides the
# SubFactory so no extra Contact is built per notification.
obj_list = NotificationFactory.simple_generate_batch(
    create=False, size=size, contact=None
)
for obj, contact in zip(obj_list, contact_list):
    obj.contact_id = contact.pk
Notification.objects.bulk_create(obj_list)
```
