This repository has been archived by the owner on May 1, 2019. It is now read-only.

👪 Adding the flux2stock script
This script converts a stock + flux into a new updated stock.
davidbgk committed Dec 28, 2016
1 parent 3842b2c commit ac55750
Showing 3 changed files with 111 additions and 1 deletion.
5 changes: 5 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,10 @@
# Changelog

## 2.3.0 — 2016-12-29 — 👪 Flux2stock script

* This script converts a stock + flux into a new updated stock.


## 2.2.0 — 2016-11-11 — 🛍 Display diffs

* Serve the differences between two states for a given SIRET
24 changes: 23 additions & 1 deletion README.md
@@ -78,7 +78,7 @@ $ redis-server

We assume that you have access to the source files for data:

* `sirc-266_266_13705_201606_L_P_20161010_121909418.csv.bz2` is the unique stock file with 12 million lines
* `MisesajourQuotidiennes/sirc-266_266_13706_2016183_E_Q_20161020_131153997.csv` is one of the 42 daily update files with about 10,000 lines each
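If you want to inspect the stock file without decompressing it to disk, it can be streamed line by line. A minimal sketch (the `cp1252` encoding and `;` delimiter are assumptions, matching the INSEE files handled by `flux2stock.py` below):

```python
import bz2
import csv

def count_stock_lines(path):
    """Stream a bz2-compressed CSV and count its data rows.

    The file is decompressed on the fly, so memory usage stays
    constant even for the 12-million-line stock.
    """
    with bz2.open(path, mode='rt', encoding='cp1252') as handle:
        reader = csv.DictReader(handle, delimiter=';')
        return sum(1 for _ in reader)
```

The same pattern (iterate, never materialize the whole file) is what keeps the script's RAM consumption low.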

During the hackathon, you will also have access to 2 databases pre-loaded with
@@ -718,6 +718,28 @@ Your needs will drive our future developments on the subject so your
feedback is incredibly valuable to us! 👍


## Tools

### Flux2Stock

The aim of this script (available at the root of the repository) is to
create a new stock file from a previous stock file and the incremental
daily flux files published since then.

Run it like this:

```shell
$ python flux2stock.py stock-t.zip stock-t+2.csv flux-t+1.zip flux-t+2.zip
```

Here `stock-t.zip` is the initial stock, `stock-t+2.csv` is the name of
the newly generated stock, and `flux-t+1.zip flux-t+2.zip [...]` are the
daily updates published since the initial stock was created.

Generating a new stock takes approximately 15 minutes on a recent
computer; RAM consumption should stay low.
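Conceptually, the merge works like this: index each flux row by its SIRET (SIREN + NIC), drop entries flagged `VMAJ == 'E'` (deleted), substitute modified entries, then append `VMAJ == 'C'` (created) ones. A minimal in-memory sketch with plain dicts (not the actual streaming implementation in `flux2stock.py`):

```python
def merge(stock_rows, flux_rows):
    """Merge a stock with flux updates, keyed by SIRET (SIREN + NIC)."""
    updates = {row['SIREN'] + row['NIC']: row for row in flux_rows}
    # Replace modified entries and drop deleted ones ('E').
    merged = [
        updates.get(row['SIREN'] + row['NIC'], row)
        for row in stock_rows
        if updates.get(row['SIREN'] + row['NIC'], row).get('VMAJ') != 'E'
    ]
    # Creations ('C') exist only in the flux, so append them at the end.
    merged += [row for row in updates.values() if row.get('VMAJ') == 'C']
    return merged
```

The real script applies the same logic but streams the stock through generators so it never holds 12 million rows in memory.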


## Contributing

We’re really happy to accept contributions from the community, that’s the main reason why we open-sourced it! There are many ways to contribute, even if you’re not a technical person.
83 changes: 83 additions & 0 deletions flux2stock.py
@@ -0,0 +1,83 @@
"""
This script converts a stock + flux into a new updated stock.
Before hacking, please benchmark the current script with the real stock.
You should keep the duration and the RAM consumption as low as possible.
"""

import csv
import io
import os
import sys
from zipfile import ZipFile


def _parse_zip_csv_file(filename):
    """Yield each row from a zipped CSV file coming from INSEE."""
    base_name, ext = os.path.splitext(os.path.basename(filename))
    with ZipFile(filename) as zip_file:
        with zip_file.open(base_name + '.csv') as csv_file:
            csvio = io.TextIOWrapper(csv_file, encoding='cp1252')
            reader = csv.DictReader(csvio, delimiter=';')
            for i, row in enumerate(reader):
                # Not proud to pass fieldnames to each iteration.
                # Better than a global var?
                yield i, row, reader.fieldnames


def parse_fluxs(sources):
    """For each line from sources, create a dict with SIRET as key."""
    return {
        # SIREN + NIC = SIRET.
        row['SIREN'] + row['NIC']: row
        for source in sources
        for i, row, _ in _parse_zip_csv_file(source)
    }


def filter_stock(stock_in, modifications):
    """Yield modified entries and skip deleted ones."""
    for i, row, fieldnames in _parse_zip_csv_file(stock_in):
        entry = modifications.get(row['SIREN'] + row['NIC'], row)
        # A `VMAJ` value of 'E' flags a deleted entry.
        if entry.get('VMAJ') != 'E':
            yield i, entry, fieldnames


def write_stock(stock_out, filtered_stock, modifications):
    """
    Generate the new stock file with modified and created entries.
    We mimic the initial stock with encoding, quotes and delimiters.
    """
    with open(stock_out, 'w', encoding='cp1252') as csv_file:
        _, first_row, fieldnames = next(filtered_stock)
        # `extrasaction` is set to `ignore` to be able to pass more keys
        # to the `writerow` method coming from the flux.
        writer = csv.DictWriter(
            csv_file, fieldnames=fieldnames, delimiter=';',
            quoting=csv.QUOTE_ALL, extrasaction='ignore')
        writer.writeheader()
        # Because we already iterated once to retrieve fieldnames.
        writer.writerow(first_row)
        for i, row, _ in filtered_stock:
            writer.writerow(row)
        # Finally, append creations (flagged with a `VMAJ` of 'C').
        for row in modifications.values():
            if row.get('VMAJ') == 'C':
                writer.writerow(row)


if __name__ == '__main__':
    if len(sys.argv) < 4:
        BASE_USAGE = 'python flux2stock.py stock-t.zip '
        print('Usages:')
        print(BASE_USAGE + 'stock-t+1.csv flux-t+1.zip')
        print(BASE_USAGE + 'stock-t+2.csv flux-t+1.zip flux-t+2.zip')
        sys.exit(1)
    stock_in = sys.argv[1]
    stock_out = sys.argv[2]
    fluxs_zip = sys.argv[3:]
    modifications = parse_fluxs(fluxs_zip)
    filtered_stock = filter_stock(stock_in, modifications)
    write_stock(stock_out, filtered_stock, modifications)
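To sanity-check the pipeline without the real 12-million-line stock, you can build tiny zip fixtures. The helper below is a hypothetical addition (not part of the script); the only hard constraint is that the inner CSV must share the zip's base name, as `_parse_zip_csv_file` expects:

```python
import csv
import io
import os
from zipfile import ZipFile

def make_fixture(directory, name, fieldnames, rows):
    """Build a tiny INSEE-style fixture: `name.zip` containing `name.csv`,
    encoded in cp1252 with ';' delimiters and quoted fields."""
    path = os.path.join(directory, name + '.zip')
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, delimiter=';',
                            quoting=csv.QUOTE_ALL)
    writer.writeheader()
    writer.writerows(rows)
    with ZipFile(path, 'w') as zip_file:
        zip_file.writestr(name + '.csv', buffer.getvalue().encode('cp1252'))
    return path
```

You can then feed such fixtures to `parse_fluxs`, `filter_stock` and `write_stock` and diff the generated stock against what you expect.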
